Build a working AI product solo, end to end, for a real user. Pick one of the kit's three feature briefs (Documents Q&A with grounded answers, Personal-data assistant, or Domain agent) — or bring your own one-paragraph brief, approved at intake. The same project is graded against three rubrics: Engineering, Product, and Design, each at 1/3 weight.
Required deliverables
- Working pipeline: chunking → retrieval → generation (or agent loop, if your brief calls for it) → user surface. Public GitHub repo that runs end-to-end from a clean clone in < 5 minutes.
- Eval set: ≥ 30 question–answer pairs sourced from real or realistic user queries, plus ≥ 5 adversarial cases. Report Recall@10 *and* MRR (or an equivalent ranking metric) for retrieval, plus a faithfulness metric (LLM-as-judge with anchors, a deterministic check, or a human spot-check of ≥ 10 outputs).
- One iteration with numbers: change one design knob (chunk size, model, retriever, prompt) and report before/after on the same eval set. Document one ablation that didn't pan out.
- Latency budget: name a target p95 latency. Report whether you hit it. If not, name the bottleneck.
- Cost discipline: show one production-grade cost choice — prompt caching with reported `cache_read_input_tokens`, batch processing for an async path, model-tier routing, etc. Tell us what you chose and why.
- At least one native Anthropic API capability used appropriately: citations, structured outputs, extended thinking, batch processing, tool use, or MCP. Tell us which and why.
- Trust UX: inline citations or a paired uncertainty signal. One refusal-as-redirect for an out-of-scope action (don't ship a flat "I can't help with that"). For long-running tasks, a visible agentic-UX surface: named steps and a cancel control.
- Pre-mortem + kill criteria: top 5 failure modes (hallucination, refusal, latency, cost, adversarial use) with severity / likelihood / detection / mitigation. Pre-committed kill thresholds with a named owner.
- 60-second walkthrough video showing the user flow + one rejected design call.
- ≤ 600-word narrative that reads to all three audiences — engineer, PM, designer — without dilution. Name the trust principle the feature rests on, two real tradeoffs (with the five-part structure: considered / picked / data / failure / fallback), and the cross-functional ownership of the riskiest call.
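The two ranking metrics named in the eval-set deliverable can be computed directly from your retriever's ranked output. A minimal sketch (function and variable names here are illustrative, not part of the kit):

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_reciprocal_rank(all_ranked, all_relevant):
    """Average of 1/rank of the first relevant hit, over all eval queries.

    Queries with no relevant hit anywhere contribute 0.
    """
    total = 0.0
    for ranked_ids, relevant_ids in zip(all_ranked, all_relevant):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(all_ranked)
```

Running both over the same ≥ 30-pair eval set before and after your one-knob iteration gives the before/after numbers the brief asks for.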
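For the latency-budget deliverable, p95 can be reported from recorded per-request timings with a nearest-rank percentile; a small sketch under the assumption that you log one wall-clock sample per request:

```python
import math

def p95_latency_ms(samples_ms):
    """p95 over recorded request latencies, using the nearest-rank method."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based index of the 95th percentile
    return ordered[rank - 1]
```

Compare the result against the target you named; if it misses, the per-stage timings (retrieval vs. generation vs. surface) usually point at the bottleneck.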
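One way to satisfy the cost-discipline deliverable is prompt caching: mark your large, stable system prompt with a `cache_control` block so repeat requests read it from cache. The sketch below only builds the request payload (so it runs offline); the model id is illustrative, and the actual call plus the reported `cache_read_input_tokens` come from the Anthropic Messages API at request time:

```python
def cached_request_kwargs(system_prompt: str, user_msg: str) -> dict:
    """Build messages.create kwargs with the stable system prompt
    marked for prompt caching via a cache_control block."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Everything up to and including this block is cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

After a real call, `response.usage.cache_read_input_tokens` is the number you report: tokens served from cache rather than re-billed at the full input rate.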
Out of scope
- UI polish — a CLI plus a minimal web UI is fine.
- Auth, multi-user support, and deployment.
- Fine-tuning. Foundations is about prompting + retrieval + evals + product judgment + design surfacing.
How it's graded
Three rubrics — Engineering, Product, and Design — each applied independently at 1/3 weight. Your verified profile shows all three breakdowns with the grader's per-criterion rationale. Twelve criteria total, each scored 0–5.