Author a complete evals scorecard for one of three AI feature briefs we provide (see the kit), or pick a real AI feature shipping in a public product. The scorecard is the artifact a hiring PM would attach to a launch review — it should answer "is this safe and useful enough to ship?"
Required deliverables
- Problem framing (1 page): user, JTBD, success criterion, an explicit non-goal, and a one-sentence trust principle the feature rests on.
- Metrics design: ≥ 2 leading metrics, ≥ 1 lagging metric, ≥ 2 guardrails with thresholds. For each metric, name how it could be gamed and the paired constraint that prevents it (e.g. a thumbs-up-rate metric can be gamed by over-agreeable answers; a faithfulness guardrail blocks that).
- Eval set design: ≥ 30 representative inputs sourced from real users where possible, plus ≥ 5 adversarial cases. Name the grading approach (programmatic / LLM-as-judge with anchors / human) and the calibration plan, e.g. a periodic human-vs-judge audit; an illustrative audit sketch follows this list.
- Tradeoff analysis: ≥ 2 real tradeoffs (e.g. faithfulness vs latency, autonomy vs oversight, model cost vs quality). Resolve each with concrete numbers and name a fallback for the failure mode your choice accepts.
- Risk + governance: a pre-mortem covering the top 5 failure modes (hallucination, refusal, latency, cost, adversarial use), with severity, likelihood, detection, and mitigation per row.
- Kill criteria: pre-committed thresholds (e.g. faithfulness < X%, refusal > Y%, p95 latency > Zs, cost > $A/user/day) and a named owner empowered to halt the rollout; an illustrative config sketch follows this list.
- Measurement plan: cadence, named owner, dashboard sketch.
- 60-second walkthrough video.
- ≤ 800-word narrative tying the scorecard back to the user: what would change about the user's day if this ships.
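Code is not a deliverable here (see Out of scope below), but to make the calibration plan concrete, here is a minimal, purely illustrative sketch of what a periodic human-vs-judge audit could compute: raw agreement between human labels and LLM-judge labels on the same sample, plus the disagreement pairs that signal the judge's anchors need tightening. All function names and labels are hypothetical.

```python
# Illustrative only; code is not a deliverable for this brief.
# Hypothetical shape of a periodic human-vs-judge calibration audit:
# both graders label the same eval inputs, and the audit reports agreement.
from collections import Counter

def agreement_report(human_labels: list[str], judge_labels: list[str]) -> dict:
    """Compare paired human and LLM-judge labels on the same inputs."""
    assert len(human_labels) == len(judge_labels), "labels must be paired"
    n = len(human_labels)
    agree = sum(h == j for h, j in zip(human_labels, judge_labels))
    # Disagreement pairs show *where* the judge drifts from the human,
    # which is what you re-anchor against.
    disagreements = Counter(
        (h, j) for h, j in zip(human_labels, judge_labels) if h != j
    )
    return {"n": n, "agreement": agree / n, "disagreements": disagreements}

# Hypothetical audit sample: "faithful" vs "unfaithful" verdicts.
humans = ["faithful", "faithful", "unfaithful", "faithful", "unfaithful"]
judge  = ["faithful", "unfaithful", "unfaithful", "faithful", "unfaithful"]
print(agreement_report(humans, judge))
# {'n': 5, 'agreement': 0.8, 'disagreements': Counter({('faithful', 'unfaithful'): 1})}
```

Pairing the audit with a pre-committed floor, for example re-anchoring the judge whenever agreement drops below 90%, is what turns "calibration plan" from a phrase into a procedure.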
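In the same spirit, kill criteria are easiest to honor when they live in one machine-checkable record rather than in prose. A hypothetical sketch follows; the values are placeholders for the X / Y / Z / A thresholds you pre-commit, not recommendations.

```python
# Illustrative only. The X / Y / Z / A thresholds are yours to pre-commit;
# the values below are hypothetical placeholders, not recommendations.
KILL_CRITERIA = {
    "faithfulness_min_pct": 95.0,         # halt if faithfulness < X%
    "refusal_max_pct": 5.0,               # halt if refusal rate > Y%
    "p95_latency_max_s": 3.0,             # halt if p95 latency > Z seconds
    "cost_max_usd_per_user_day": 0.10,    # halt if cost > $A/user/day
    "owner": "name.of.owner@example.com", # the named owner empowered to halt
}

def should_halt(observed: dict) -> bool:
    """Return True if any pre-committed kill threshold is breached."""
    return (
        observed["faithfulness_pct"] < KILL_CRITERIA["faithfulness_min_pct"]
        or observed["refusal_pct"] > KILL_CRITERIA["refusal_max_pct"]
        or observed["p95_latency_s"] > KILL_CRITERIA["p95_latency_max_s"]
        or observed["cost_usd_per_user_day"] > KILL_CRITERIA["cost_max_usd_per_user_day"]
    )
```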
Out of scope
- Code. PM Foundations is artifact-graded.
- A real working dashboard. A sketch is enough.
- Brand or visual polish. The grader does not score brand.
What we look for
- Metrics that cannot be gamed without tripping a guardrail.
- An eval set that catches regressions on real-user input distributions, not just author intuition.
- A named kill switch with a concrete owner.
- Honest articulation of one limitation the launch will ship with, and what compensates for it.
How it's graded
One rubric, Product Foundations, applied at full weight. Four criteria: problem framing (25%), metrics design (30%), tradeoff analysis (25%), and narrative quality (20%). Each criterion is scored 0–5 with a written rationale from the grader.
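To make the weighting concrete (illustrative scores, not a real grade): criterion scores of 4, 3, 5, and 4 would combine as 0.25×4 + 0.30×3 + 0.25×5 + 0.20×4 = 1.00 + 0.90 + 1.25 + 0.80 = 3.95 out of 5.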