Build a working Retrieval-Augmented Generation pipeline over a real document set (50 markdown documents on a domain of your choosing — see the kit, or pick your own corpus and link it in the README).
Required
- Pipeline: chunking → embeddings → retrieval → answer generation. Pick your stack freely (any LLM, any vector store); a minimal end-to-end sketch follows this list.
- Eval set: ≥30 question/answer pairs, sourced from real users where possible. Report Recall@10 *and* MRR (or an equivalent ranking metric); see the metric sketch after this list.
- One iteration: change one design knob (chunk size, embedding model, retriever, reranker, prompt) and report before/after on the same eval set. Honest negatives count: if the change didn't pan out, document the ablation anyway.
- Faithfulness check: any approach is fine, e.g. LLM-as-judge with anchors, a deterministic check against gold spans, or a human spot-check of ≥10 outputs. Report the metric you chose and why; one deterministic option is sketched after this list.
- Latency budget: name a target p95 latency (e.g. ≤2s) and report whether you hit it. If you didn't, name the bottleneck. A measurement sketch follows this list.
- Cost discipline: show one production-grade cost choice. Examples: prompt caching with `cache_control` plus the reported `cache_read_input_tokens`, batch processing, or a smaller model for retrieval scoring. Tell us what you chose and why; the caching option is sketched after this list.
- At least one native API capability used appropriately for the task (citations, structured outputs, extended thinking, batch, or tool use). Tell us which and why; one structured-output option is sketched after this list.
- README with instructions that get the system running end-to-end in <5 minutes from a clean clone.
- 60-second walkthrough video.
- ≤600-word narrative explaining the choices, the failure modes you found, and what you would change next.
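
The sketches below illustrate several of the requirements above. They are non-normative: every path, model id, and helper name is a placeholder you should swap for your own stack.

The pipeline shape (chunking → embeddings → retrieval → answer generation) fits in a few dozen lines. This sketch assumes a local `corpus/*.md` folder, a sentence-transformers embedder, and the Anthropic SDK for generation.

```python
"""Minimal RAG pipeline sketch: chunk -> embed -> retrieve -> generate. Illustrative only."""
import glob
import numpy as np
from sentence_transformers import SentenceTransformer
from anthropic import Anthropic

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Fixed-size character chunks with overlap; swap in a structure-aware splitter if you prefer.
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

docs = [open(p, encoding="utf-8").read() for p in glob.glob("corpus/*.md")]
chunks = [c for d in docs for c in chunk(d)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")       # any embedding model works
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 10) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q                               # cosine similarity (vectors are normalized)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

def answer(query: str) -> str:
    context = "\n\n---\n\n".join(retrieve(query))
    resp = Anthropic().messages.create(
        model="claude-sonnet-4-5",                        # placeholder model id
        max_tokens=512,
        messages=[{"role": "user",
                   "content": f"Answer from the context only.\n\nContext:\n{context}\n\nQuestion: {query}"}],
    )
    return resp.content[0].text
```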
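
For the eval set, Recall@10 is the fraction of questions whose gold chunk appears in the top 10 retrieved results, and MRR is the mean reciprocal rank of the first relevant result. A sketch, assuming each question is labeled with a single gold chunk id and `ranked_ids` holds the retriever's ranked output per question:

```python
"""Recall@k and MRR over the eval set (single gold chunk per question)."""

def recall_at_k(ranked_ids: list[list[str]], gold_ids: list[str], k: int = 10) -> float:
    # Fraction of questions whose gold chunk shows up in the top k.
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)

def mrr(ranked_ids: list[list[str]], gold_ids: list[str]) -> float:
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)       # reciprocal rank; 0 if the gold chunk is missing
    return total / len(gold_ids)
```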
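
One deterministic reading of the gold-span faithfulness option: check whether the labeled gold span survives, verbatim after normalization, in the generated answer, and report the pass rate. It is crude by design; an anchored LLM judge or a human spot-check catches what it misses.

```python
"""Deterministic gold-span proxy for faithfulness."""
import re

def normalize(s: str) -> str:
    return re.sub(r"\s+", " ", s.lower()).strip()

def gold_span_rate(answers: list[str], gold_spans: list[str]) -> float:
    # Fraction of answers that contain their gold span after whitespace/case normalization.
    hits = sum(normalize(g) in normalize(a) for a, g in zip(answers, gold_spans))
    return hits / len(answers)
```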
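
Measure p95 latency over the whole eval set end to end (retrieval plus generation), not from a single warm request. A minimal sketch:

```python
"""End-to-end p95 latency over the eval questions."""
import time
import statistics

def p95_latency(questions: list[str], answer_fn) -> float:
    latencies = []
    for q in questions:
        start = time.perf_counter()
        answer_fn(q)                                      # full pipeline: retrieve + generate
        latencies.append(time.perf_counter() - start)
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    return statistics.quantiles(latencies, n=20)[18]
```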
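
If you take the prompt-caching route, the shape with the Anthropic SDK looks roughly like this: mark the long, reused part of the prompt with `cache_control` and report `cache_read_input_tokens` from the response usage. Model id and prompt contents are placeholders; check the current docs for pricing and the minimum cacheable prompt size.

```python
"""Prompt-caching sketch: the static system prompt is cached, per-question content is not."""
from anthropic import Anthropic

LONG_STATIC_INSTRUCTIONS = "..."      # instructions/context reused across every question
question_with_context = "..."         # per-question user turn (query + retrieved chunks)

client = Anthropic()
resp = client.messages.create(
    model="claude-sonnet-4-5",                            # placeholder model id
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_STATIC_INSTRUCTIONS,
        "cache_control": {"type": "ephemeral"},           # cache the reusable prefix
    }],
    messages=[{"role": "user", "content": question_with_context}],
)
# Report these alongside your cost numbers:
print(resp.usage.cache_creation_input_tokens, resp.usage.cache_read_input_tokens)
```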
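
For the native API capability, structured output via forced tool use is one low-effort option: describe the answer format as a tool `input_schema` and force the call, so downstream eval parsing is deterministic. All names below are placeholders.

```python
"""Structured output via a forced tool call."""
from anthropic import Anthropic

client = Anthropic()
resp = client.messages.create(
    model="claude-sonnet-4-5",                            # placeholder model id
    max_tokens=512,
    tools=[{
        "name": "record_answer",
        "description": "Record the grounded answer and its supporting chunk ids.",
        "input_schema": {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "supporting_chunk_ids": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["answer", "supporting_chunk_ids"],
        },
    }],
    tool_choice={"type": "tool", "name": "record_answer"},   # force the structured call
    messages=[{"role": "user", "content": "..."}],            # your retrieval-augmented prompt
)
structured = next(b for b in resp.content if b.type == "tool_use").input   # dict matching the schema
```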
Out of scope
- UI polish — a CLI or minimal web UI is fine.
- Auth, multi-user, or production deploy.
- Fine-tuning. Foundations is about prompting + retrieval discipline.
How it's graded
One rubric — Engineering Foundations — applied at full weight. Four criteria: build quality (30%), end-to-end evals (25%), code clarity (20%), build narrative (25%). Each criterion is scored 0–5 with a written rationale by the grader.