fix(quiz): tighten AdaptiveDifficultyEvaluator + clarify eval semantics

Jose-Gael-Cruz-Lopez · claude · Jose-Gael-Cruz-Lopez · commit 6493988bea00 · 2026-05-04T22:27:59.000-04:00
Closes the two issues from PR #77's second review. - AdaptiveDifficultyEvaluator now scores per-question (fraction compliant) instead of average rank. The prompt rule is per-question ("Never override the user-requested difficulty by more than one step"); the previous avg-based check let a 2-step outlier slip through if the rest of the mix balanced it out — e.g. requested hard with [easy, hard, hard, hard] used to score 1.0 (avg=1.5, within 1 of target=2) and now correctly scores 0.75. Switched to subscript _DIFF_RANK[q.difficulty] (Literal-constrained, fallback was unreachable). - The two new ADR-0014 cases now carry NOTE comments explaining that recency/staleness state is baked into the user message for replay determinism, while production sources it via read_recent_quiz_attempts. The cases pin the prompt's rule application; live-mode evals are the right place to catch tool-wiring regressions. Smoke-tested: hard + [easy, hard, hard, hard] -> 0.75; hard + all medium -> 1.0 (allowed shift); hard + all easy -> 0.0 (overshoot). 46 quiz unit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/backend/tests/evals/quiz_generation.py b/backend/tests/evals/quiz_generation.py
@@ -98,16 +98,15 @@ def evaluate(self, ctx: EvaluatorContext[str, Quiz]) -> float:
 
 @dataclass
 class AdaptiveDifficultyEvaluator(Evaluator[str, Quiz]):
-    """Pin the prompt's adaptive-difficulty bound: the average produced
-    difficulty must be within ±1 step of the user-requested difficulty
-    (in the metadata's `requested_difficulty`). Cases without
-    `requested_difficulty` skip this check.
+    """Fraction of questions whose difficulty is within ±1 step of the
+    user-requested difficulty (in the metadata's `requested_difficulty`).
+    Cases without `requested_difficulty` skip this check.
 
     The agent is *allowed* to step down (struggling student) or step
-    up (consistent high accuracy) by one rank. This evaluator catches
-    the regression where it overshoots — e.g. requested medium and
-    produced all easy AND all hard. The point is the bound, not the
-    direction.
+    up (consistent high accuracy) by one rank per question. This
+    evaluator catches the regression where individual questions
+    overshoot — e.g. requested hard and produced an easy question
+    (a two-step jump). The bound is per-question, not averaged.
     """
 
     def evaluate(self, ctx: EvaluatorContext[str, Quiz]) -> float:
@@ -117,13 +116,12 @@ def evaluate(self, ctx: EvaluatorContext[str, Quiz]) -> float:
         target_rank = _DIFF_RANK.get(requested)
         if target_rank is None:
             return 1.0
-        # Compute average rank across produced questions; the bound
-        # is "within 1 step of requested." This permits the prompt's
-        # one-step adaptive shift in either direction but rejects
-        # anything beyond that.
-        ranks = [_DIFF_RANK.get(q.difficulty, target_rank) for q in ctx.output.questions]
-        avg = sum(ranks) / len(ranks)
-        return 1.0 if abs(avg - target_rank) <= 1.0 else 0.0
+        ok = sum(
+            1
+            for q in ctx.output.questions
+            if abs(_DIFF_RANK[q.difficulty] - target_rank) <= 1
+        )
+        return ok / len(ctx.output.questions)
 
 
 @dataclass
@@ -294,6 +292,11 @@ def evaluate(self, ctx: EvaluatorContext[str, Quiz]) -> float:
     # produce all easy questions for a hard request would slip through.
     Case(
         name="adaptive_downshift_struggling_student",
+        # NOTE: this case bakes the recent-accuracy signal into the user
+        # message for eval determinism. In production, the agent reads it
+        # via a `read_recent_quiz_attempts` tool call (see agents/quiz.py).
+        # This pins the prompt's adaptive-difficulty rule, not the tool-
+        # call data path — live-mode evals catch tool wiring regressions.
         inputs=(
             "Course: CS 201. Generate 3 hard multiple-choice questions "
             "covering Recursion, Dynamic Programming, and Graph Traversal. "
@@ -321,6 +324,12 @@ def evaluate(self, ctx: EvaluatorContext[str, Quiz]) -> float:
     # Evaluator asserts at least one question targets it.
     Case(
         name="spaced_repetition_revives_stale_concept",
+        # NOTE: the staleness signal (`last_reviewed_at` ages, mastery
+        # scores) is inlined into the prompt here for deterministic
+        # replay. Production sources the same data from a
+        # `read_recent_quiz_attempts` tool call (see agents/quiz.py), so
+        # this case validates the spaced-repetition rule's application,
+        # not the tool-call wiring. Use live-mode evals for that path.
         inputs=(
             "Course: BIO 100. Generate 3 medium multiple-choice questions. "
             "The student has been working on Photosynthesis (mastery 0.4) "