Build a working AI product solo, end to end, for a real user. Pick one of the kit's three feature briefs (Documents Q&A with grounded answers, Personal-data assistant, or Domain agent) — or bring your own one-paragraph brief, approved at intake. The same project is graded against three rubrics: Engineering, Product, and Design, each at 1/3 weight.
Required deliverables
- Working pipeline: chunking → retrieval → generation (or agent loop, if your brief calls for it) → user surface. Public GitHub repo that runs end-to-end from a clean clone in < 5 minutes.
- Eval set: ≥ 30 question–answer pairs sourced from real or realistic user queries, plus ≥ 5 adversarial cases. Report Recall@10 *and* MRR (or an equivalent ranking metric) for retrieval, plus a faithfulness metric (LLM-as-judge with anchors, a deterministic check, or a human spot-check of ≥ 10 outputs).
- One iteration with numbers: change one design knob (chunk size, model, retriever, prompt) and report before/after on the same eval set. Document one ablation that didn't pan out.
- Latency budget: name a target p95 latency. Report whether you hit it. If not, name the bottleneck.
- Cost discipline: show one production-grade cost choice — prompt caching with reported `cache_read_input_tokens`, batch processing for an async path, model-tier routing, etc. Tell us what you chose and why.
- At least one native Anthropic API capability used appropriately: citations, structured outputs, extended thinking, batch processing, tool use, or MCP. Tell us which and why.
- Trust UX: inline citations or a paired uncertainty signal. One refusal-as-redirect for an out-of-scope action (don't ship a flat "I can't help with that"). For long-running tasks, a visible agentic-UX surface: named steps and a cancel control.
- Pre-mortem + kill criteria: top 5 failure modes (hallucination, refusal, latency, cost, adversarial use) with severity / likelihood / detection / mitigation. Pre-committed kill thresholds with a named owner.
- 60-second walkthrough video showing the user flow + one rejected design call.
- ≤ 600-word narrative that reads to all three audiences — engineer, PM, designer — without dilution. Name the trust principle the feature rests on, two real tradeoffs (with the five-part structure: considered / picked / data / failure / fallback), and the cross-functional ownership of the riskiest call.
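The two ranking metrics named in the eval-set deliverable can be computed directly from your retriever's ranked output. A minimal sketch (function and variable names here are illustrative, not part of the kit):

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_reciprocal_rank(all_ranked, all_relevant):
    """Average of 1/rank of the first relevant hit, over all eval queries.

    Queries with no relevant hit anywhere contribute 0.
    """
    total = 0.0
    for ranked_ids, relevant_ids in zip(all_ranked, all_relevant):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(all_ranked)
```

Running both over the same ≥ 30-pair eval set before and after your one-knob iteration gives the before/after numbers the brief asks for.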
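For the latency-budget deliverable, p95 can be reported from recorded per-request timings with a nearest-rank percentile; a small sketch under the assumption that you log one wall-clock sample per request:

```python
import math

def p95_latency_ms(samples_ms):
    """p95 over recorded request latencies, using the nearest-rank method."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based index of the 95th percentile
    return ordered[rank - 1]
```

Compare the result against the target you named; if it misses, the per-stage timings (retrieval vs. generation vs. surface) usually point at the bottleneck.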
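One way to satisfy the cost-discipline deliverable is prompt caching: mark your large, stable system prompt with a `cache_control` block so repeat requests read it from cache. The sketch below only builds the request payload (so it runs offline); the model id is illustrative, and the actual call plus the reported `cache_read_input_tokens` come from the Anthropic Messages API at request time:

```python
def cached_request_kwargs(system_prompt: str, user_msg: str) -> dict:
    """Build messages.create kwargs with the stable system prompt
    marked for prompt caching via a cache_control block."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Everything up to and including this block is cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

After a real call, `response.usage.cache_read_input_tokens` is the number you report: tokens served from cache rather than re-billed at the full input rate.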
Out of scope
- UI polish — a CLI plus a minimal web UI is fine.
- Auth, multi-user support, and deployment.
- Fine-tuning. Foundations is about prompting + retrieval + evals + product judgment + design surfacing.
How it's graded
Three rubrics — Engineering, Product, and Design — each applied independently at 1/3 weight. Your verified profile shows all three breakdowns with the grader's per-criterion rationale. Twelve criteria total, each scored 0–5.