AI Engineering · Foundations

A two-part assessment for engineers shipping AI into production. First a short adaptive quiz across LLMs, RAG, agents, evals, tool use, prompt caching, and MCP. Then a real RAG pipeline you build, graded against the Engineering Foundations rubric.

How it works

Two parts. About 12 hours of focused work, spread over a few days.

Part 1 · Quiz

18 adaptive questions

~30 min · in one sitting

Multiple-choice and short numeric questions across AI engineering microskills: LLM fundamentals, prompt patterns, RAG basics, RAG evals, agent loops, end-to-end evals, tool use, prompt caching, and MCP. Difficulty adapts to your answers.

Part 2 · Project

Ship the RAG pipeline

~12h · over a few days

Build a working RAG pipeline against a fixed corpus, with a real eval harness. Submit your repo, a 600-word narrative, and a 60-second walkthrough.

Score

Composite, owned by you

0 to 100 · valid 18 months

Quiz contributes 40%, project contributes 60%: a quiz score of 80 and a project score of 70 composes to 0.4 × 80 + 0.6 × 70 = 74. The project score comes from the Engineering Foundations rubric across all four dimensions. Published to your verified profile.

The project

Build a working RAG pipeline

Build a working Retrieval-Augmented Generation pipeline over a real document set (50 markdown documents on a domain of your choosing — see the kit, or pick your own corpus and link it in the README).

Required

  • Pipeline: chunking → embeddings → retrieval → answer generation. Pick your stack freely (any LLM, any vector store); a minimal pipeline sketch follows this list.
  • Eval set: ≥30 question/answer pairs, sourced from real users where possible. Report Recall@10 *and* MRR (or an equivalent ranking metric); see the metrics sketch after this list.
  • One iteration: change one design knob (chunk size, embedding model, retriever, reranker, prompt) and report before/after on the same eval set. Honest negatives count — document one ablation that didn't pan out.
  • Faithfulness check: any approach is fine. LLM-as-judge with anchors, a deterministic check against gold spans, or a human spot-check of ≥10 outputs all qualify; a judge sketch appears after this list. Report the metric you chose and why.
  • Latency budget: name a target p95 latency (e.g. ≤2s) and report whether you hit it. If you didn't, name the bottleneck. A small timing helper appears below.
  • Cost discipline: show one production-grade cost choice. Examples: prompt caching with `cache_control` plus reported `cache_read_input_tokens` (sketched below), batch processing, or a smaller model for retrieval scoring. Tell us what you chose and why.
  • At least one native API capability used appropriately for the task (citations, structured outputs, extended thinking, batch, or tool use). Tell us which and why.
  • README that runs the system end-to-end in <5 minutes from a clean clone.
  • 60-second walkthrough video.
  • ≤600-word narrative explaining the choices, the failure modes you found, and what you would change next.
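
The sketches below are illustrative, not prescriptive. First, the pipeline shape: a minimal end-to-end skeleton where scikit-learn's TF-IDF stands in for a real embedding model, and `generate_answer` is a hypothetical stub for whatever LLM you pick.

```python
# Minimal RAG skeleton: chunk -> embed -> retrieve -> generate.
# TF-IDF is a stand-in embedding; generate_answer() is a hypothetical stub.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    # Fixed-size character chunks with overlap; chunk size is one knob
    # you could ablate for the "one iteration" requirement.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(docs: list[str]):
    chunks = [c for doc in docs for c in chunk(doc)]
    vectorizer = TfidfVectorizer().fit(chunks)
    return chunks, vectorizer, vectorizer.transform(chunks)

def retrieve(query: str, chunks, vectorizer, matrix, k: int = 10) -> list[str]:
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    return [chunks[i] for i in scores.argsort()[::-1][:k]]

def generate_answer(query: str, context: list[str]) -> str:
    # Hypothetical stub: replace with your LLM call, grounded in `context`.
    return f"[answer to {query!r} from {len(context)} chunks]"
```

Swapping the vectorizer for a real embedding model, or adding a reranker between retrieve and generate, are natural single-knob iterations.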
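The two required ranking metrics are a few lines each. This assumes each eval item pairs the retriever's ranked chunk IDs with the set of IDs known to contain the answer.

```python
# Recall@K and MRR over an eval set of (ranked_ids, relevant_ids) pairs.

def recall_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    # Fraction of relevant chunks that show up in the top k.
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    # 1/rank of the first relevant hit; 0 if nothing relevant was retrieved.
    for rank, cid in enumerate(ranked, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(results: list[tuple[list[str], set[str]]]) -> dict[str, float]:
    n = len(results)
    return {
        "recall@10": sum(recall_at_k(r, rel) for r, rel in results) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in results) / n,
    }
```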
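If you take the LLM-as-judge route for faithfulness, the core is an anchored prompt and a pass rate. `call_judge` here is a hypothetical hook around whatever judge model you choose.

```python
# Anchored LLM-as-judge faithfulness check. call_judge is a hypothetical
# str -> str hook around your judge model.
JUDGE_PROMPT = """Decide whether the ANSWER is fully supported by the CONTEXT.
Anchors:
  1 = every claim in the answer appears in the context
  0 = any claim is unsupported or contradicted
Reply with a single digit, 1 or 0.

CONTEXT:
{context}

ANSWER:
{answer}"""

def faithfulness_rate(eval_items, call_judge) -> float:
    # eval_items: iterable of (retrieved context, generated answer) pairs.
    votes = [
        call_judge(JUDGE_PROMPT.format(context=c, answer=a)).strip() == "1"
        for c, a in eval_items
    ]
    return sum(votes) / len(votes)
```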
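Measuring p95 against your stated budget takes a handful of standard-library lines; `answer_fn` is your end-to-end pipeline entry point.

```python
# p95 end-to-end latency against a named budget (e.g. <= 2s).
import time
import statistics

def p95_latency(queries, answer_fn, budget_s: float = 2.0):
    # answer_fn: query -> answer, the full retrieve + generate path.
    samples = []
    for q in queries:
        start = time.perf_counter()
        answer_fn(q)
        samples.append(time.perf_counter() - start)
    p95 = statistics.quantiles(samples, n=100)[94]  # 95th percentile
    return p95, p95 <= budget_s
```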
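And one way to hit the cost item, sketched against the Anthropic Messages API via its Python SDK: mark the stable system prefix with `cache_control`, then report `cache_read_input_tokens` from the response usage. The model name and prompt contents below are placeholders.

```python
# Prompt caching sketch (Anthropic SDK): cache the stable system prefix,
# then report cache hits from response usage. Model name is a placeholder.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

static_prefix = "You answer questions over the corpus below...\n<corpus rules>"
question = "What does the retention policy say about backups?"

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": static_prefix,  # note: prefixes shorter than the model's
        # minimum cacheable length won't actually be cached
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": question}],
)
# The first call writes the cache; subsequent calls should show cache reads.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```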

Out of scope

  • UI polish — a CLI or minimal web UI is fine.
  • Auth, multi-user, or production deploy.
  • Fine-tuning. Foundations is about prompting + retrieval discipline.

How it's graded

One rubric — Engineering Foundations — applied at full weight. Four criteria: build quality (30%), end-to-end evals (25%), code clarity (20%), build narrative (25%). Each criterion is scored 0–5 with a written rationale by the grader.

How we grade

One rubric. Four dimensions.

Your project is graded against the Engineering Foundations rubric. Each criterion is scored 0–5 with a written rationale, then weighted to a 0–100 project score.
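
One plausible reading of that weighting, as a sketch; the normalization from 0–5 to 0–100 is our assumption, not the graders' published formula.

```python
# Hypothetical weighting: each 0-5 criterion score is normalized to 0-1,
# multiplied by its rubric weight, and summed to a 0-100 project score.
WEIGHTS = {
    "build_quality": 0.30,
    "end_to_end_evals": 0.25,
    "code_clarity": 0.20,
    "build_narrative": 0.25,
}

def project_score(criterion_scores: dict[str, float]) -> float:
    # criterion_scores: criterion name -> 0-5 grader score.
    return 100 * sum(WEIGHTS[k] * criterion_scores[k] / 5 for k in WEIGHTS)

print(project_score({k: 4 for k in WEIGHTS}))  # all 4s -> 80.0
```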

Engineering · Foundations

Criteria & weights

  • Build quality · 30%
  • End-to-end evals · 25%
  • Code clarity · 20%
  • Build narrative · 25%

Anti-gaming

We measure thinking, not speed.

Every quiz answer has a 4-second minimum review time; anything faster is recorded but doesn't affect your rating. Each question caps at 90 seconds, and the whole quiz session has a 30-minute wall clock; once it expires, you're scored on what you've answered. Retakes for any track open after 14 days.