AI Product · Foundations

A two-part assessment for product managers shipping AI features. First, a short adaptive quiz across problem framing, evals design, agentic AI patterns, AI discovery, governance, metric tradeoffs, human-AI UX, and AI launch. Then, a real evals-scorecard artifact you author, graded against the Product Foundations rubric.

How it works

Two parts. About 10 hours of focused work, spread over a few days.

Part 1 · Quiz

18 adaptive questions

~30 min · in one sitting

Multiple-choice questions across AI PM microskills: problem framing, evals design, agentic AI patterns, AI discovery, governance, metric tradeoffs, AI product patterns, human-AI UX, and AI launch.

Part 2 · Project

Author the scorecard

~8h · over a few days

Author the evals scorecard a launch review would actually use for a real AI feature. Submit a doc, a 60-second walkthrough, and an 800-word narrative.

Score

Composite, owned by you

0 to 100 · valid 18 months

Quiz contributes 40%, project contributes 60%. The Product Foundations rubric scores all four dimensions. Published to your verified profile.
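As a worked example (illustrative numbers only, not from a real submission): a quiz score of 80 and a project score of 90 would combine to 0.4 × 80 + 0.6 × 90 = 86.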

The project

Author an evals scorecard for an AI feature

Author a complete evals scorecard for one of three AI feature briefs we provide (see the kit), or pick a real AI feature shipping in a public product. The scorecard is the artifact a hiring PM would attach to a launch review — it should answer "is this safe and useful enough to ship?"

Required deliverables

  • Problem framing (1 page): user, JTBD, success criterion, an explicit non-goal, and a one-sentence trust principle the feature rests on.
  • Metrics design: ≥ 2 leading metrics, ≥ 1 lagging metric, ≥ 2 guardrails with thresholds. For each metric, name how it could be gamed and the paired constraint that prevents the gaming.
  • Eval set design: ≥ 30 representative inputs sourced from real users where possible, plus ≥ 5 adversarial cases. Name the grading approach (programmatic / LLM-as-judge with anchors / human) and the calibration plan (e.g. periodic human-vs-judge audit).
  • Tradeoff analysis: ≥ 2 real tradeoffs (e.g. faithfulness vs latency, autonomy vs oversight, model cost vs quality). Resolve each with concrete numbers and pick a fallback for the chosen failure mode.
  • Risk + governance: a pre-mortem with the top 5 failure modes (hallucination, refusal, latency, cost, adversarial use). Severity / likelihood / detection / mitigation per row.
  • Kill criteria: pre-committed thresholds (e.g. faithfulness < X%, refusal > Y%, p95 latency > Zs, cost > $A/user/day) and a named owner empowered to halt the rollout.
  • Measurement plan: cadence, named owner, dashboard sketch.
  • 60-second walkthrough video.
  • ≤ 800-word narrative tying the scorecard back to the user — what would change about the user's day if this ships.

Out of scope

  • Code. Product Foundations is artifact-graded.
  • A real working dashboard. A sketch is enough.
  • Brand or visual polish. The grader does not score brand.

What we look for

  • Metrics that cannot be gamed without tripping a guardrail.
  • An eval set that catches regressions on real-user input distributions, not just author intuition.
  • A named kill switch with a concrete owner.
  • Honest articulation of one limitation the launch will ship with and what compensates.

How it's graded

One rubric — Product Foundations — applied at full weight. Four criteria: problem framing (25%), metrics design (30%), tradeoff analysis (25%), narrative quality (20%). Each criterion is scored 0–5 with a written rationale by the grader.

How we grade

One rubric. Four dimensions.

Your project is graded against the Product Foundations rubric. Each criterion is scored 0–5 with a written rationale, then weighted to a 0–100 project score.
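For illustration, assuming each 0–5 criterion score is scaled linearly before weighting: scores of 4, 5, 3, and 4 on problem framing, metrics design, tradeoff analysis, and narrative quality would weight to (0.25 × 4 + 0.30 × 5 + 0.25 × 3 + 0.20 × 4) ÷ 5 × 100 = 81. The numbers are hypothetical; the grader's written rationale accompanies each criterion score.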

Product · Foundations

Criteria & weights

  • Problem framing · 25%
  • Metrics design · 30%
  • Tradeoff analysis · 25%
  • Narrative quality · 20%

Anti-gaming

We measure thinking, not speed.

Every quiz answer has a 4-second minimum review time. Anything faster is recorded but doesn't affect your rating. Each question caps at 90 seconds. The whole quiz session has a 30-minute wall clock; once it expires, you submit what you have. Retakes for any track open after 14 days.