
Commit 6493988

fix(quiz): tighten AdaptiveDifficultyEvaluator + clarify eval semantics
Closes the two issues from PR #77's second review.

- AdaptiveDifficultyEvaluator now scores per-question (fraction compliant) instead of average rank. The prompt rule is per-question ("Never override the user-requested difficulty by more than one step"); the previous avg-based check let a 2-step outlier slip through if the rest of the mix balanced it out, e.g. requested hard with [easy, hard, hard, hard] used to score 1.0 (avg=1.5, within 1 of target=2) and now correctly scores 0.75. Switched to subscript _DIFF_RANK[q.difficulty] (Literal-constrained, fallback was unreachable).
- The two new ADR-0014 cases now carry NOTE comments explaining that recency/staleness state is baked into the user message for replay determinism, while production sources it via read_recent_quiz_attempts. The cases pin the prompt's rule application; live-mode evals are the right place to catch tool-wiring regressions.

Smoke-tested: hard + [easy, hard, hard, hard] -> 0.75; hard + all medium -> 1.0 (allowed shift); hard + all easy -> 0.0 (overshoot). 46 quiz unit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 801980d commit 6493988
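
The scoring change described in the commit message can be reproduced with a minimal standalone sketch. It does not use the repository's `Evaluator` classes. The `DIFF_RANK` mapping and both function names are assumptions for illustration: the real `_DIFF_RANK` is not shown in this diff, and the values below are inferred from the message's `avg=1.5` and `target=2` figures.

```python
from statistics import mean

# Assumed rank mapping, inferred from the commit message (hard has target=2).
# The real _DIFF_RANK lives in quiz_generation.py and is not shown in the diff.
DIFF_RANK = {"easy": 0, "medium": 1, "hard": 2}


def score_by_average(requested: str, produced: list[str]) -> float:
    """Old behaviour: pass/fail on the mean rank of the whole quiz."""
    target = DIFF_RANK[requested]
    avg = mean(DIFF_RANK[d] for d in produced)
    return 1.0 if abs(avg - target) <= 1.0 else 0.0


def score_per_question(requested: str, produced: list[str]) -> float:
    """New behaviour: fraction of questions within one step of the request."""
    target = DIFF_RANK[requested]
    ok = sum(1 for d in produced if abs(DIFF_RANK[d] - target) <= 1)
    return ok / len(produced)


mixed = ["easy", "hard", "hard", "hard"]
print(score_by_average("hard", mixed))             # 1.0 (outlier hidden by the mean)
print(score_per_question("hard", mixed))           # 0.75
print(score_per_question("hard", ["medium"] * 3))  # 1.0 (allowed one-step shift)
print(score_per_question("hard", ["easy"] * 3))    # 0.0 (two-step overshoot)
```

Like the committed code, this sketch divides by the number of questions, so an empty quiz would raise `ZeroDivisionError`; whether that case can occur depends on the `Quiz` schema, which the diff does not show.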

1 file changed

Lines changed: 24 additions & 15 deletions

backend/tests/evals/quiz_generation.py

@@ -98,16 +98,15 @@ def evaluate(self, ctx: EvaluatorContext[str, Quiz]) -> float:
 
 @dataclass
 class AdaptiveDifficultyEvaluator(Evaluator[str, Quiz]):
-    """Pin the prompt's adaptive-difficulty bound: the average produced
-    difficulty must be within ±1 step of the user-requested difficulty
-    (in the metadata's `requested_difficulty`). Cases without
-    `requested_difficulty` skip this check.
+    """Fraction of questions whose difficulty is within ±1 step of the
+    user-requested difficulty (in the metadata's `requested_difficulty`).
+    Cases without `requested_difficulty` skip this check.
 
     The agent is *allowed* to step down (struggling student) or step
-    up (consistent high accuracy) by one rank. This evaluator catches
-    the regression where it overshoots — e.g. requested medium and
-    produced all easy AND all hard. The point is the bound, not the
-    direction.
+    up (consistent high accuracy) by one rank per question. This
+    evaluator catches the regression where individual questions
+    overshoot — e.g. requested hard and produced an easy question
+    (a two-step jump). The bound is per-question, not averaged.
     """
 
     def evaluate(self, ctx: EvaluatorContext[str, Quiz]) -> float:
@@ -117,13 +116,12 @@ def evaluate(self, ctx: EvaluatorContext[str, Quiz]) -> float:
         target_rank = _DIFF_RANK.get(requested)
         if target_rank is None:
             return 1.0
-        # Compute average rank across produced questions; the bound
-        # is "within 1 step of requested." This permits the prompt's
-        # one-step adaptive shift in either direction but rejects
-        # anything beyond that.
-        ranks = [_DIFF_RANK.get(q.difficulty, target_rank) for q in ctx.output.questions]
-        avg = sum(ranks) / len(ranks)
-        return 1.0 if abs(avg - target_rank) <= 1.0 else 0.0
+        ok = sum(
+            1
+            for q in ctx.output.questions
+            if abs(_DIFF_RANK[q.difficulty] - target_rank) <= 1
+        )
+        return ok / len(ctx.output.questions)
 
 
 @dataclass
@@ -294,6 +292,11 @@ def evaluate(self, ctx: EvaluatorContext[str, Quiz]) -> float:
         # produce all easy questions for a hard request would slip through.
         Case(
             name="adaptive_downshift_struggling_student",
+            # NOTE: this case bakes the recent-accuracy signal into the user
+            # message for eval determinism. In production, the agent reads it
+            # via a `read_recent_quiz_attempts` tool call (see agents/quiz.py).
+            # This pins the prompt's adaptive-difficulty rule, not the tool-
+            # call data path — live-mode evals catch tool wiring regressions.
             inputs=(
                 "Course: CS 201. Generate 3 hard multiple-choice questions "
                 "covering Recursion, Dynamic Programming, and Graph Traversal. "
@@ -321,6 +324,12 @@ def evaluate(self, ctx: EvaluatorContext[str, Quiz]) -> float:
         # Evaluator asserts at least one question targets it.
         Case(
             name="spaced_repetition_revives_stale_concept",
+            # NOTE: the staleness signal (`last_reviewed_at` ages, mastery
+            # scores) is inlined into the prompt here for deterministic
+            # replay. Production sources the same data from a
+            # `read_recent_quiz_attempts` tool call (see agents/quiz.py), so
+            # this case validates the spaced-repetition rule's application,
+            # not the tool-call wiring. Use live-mode evals for that path.
             inputs=(
                 "Course: BIO 100. Generate 3 medium multiple-choice questions. "
                 "The student has been working on Photosynthesis (mastery 0.4) "

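
The NOTE comments added to the two cases describe inlining recency state into the user message so that replayed evals see fixed input. A rough sketch of that idea follows. `RecentAttempt` and `inline_recent_attempts` are hypothetical names introduced here; the return shape of the real `read_recent_quiz_attempts` tool and the wording the repository's cases use are not shown in this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecentAttempt:
    """Hypothetical record; the real tool's return shape is not shown here."""
    concept: str
    accuracy: float  # fraction correct, 0.0 to 1.0


def inline_recent_attempts(request: str, attempts: list[RecentAttempt]) -> str:
    """Append recency state to the user message so replays see fixed input."""
    # Sort by concept so the rendered message does not depend on input order.
    lines = [
        f"- {a.concept}: recent accuracy {a.accuracy:.0%}"
        for a in sorted(attempts, key=lambda a: a.concept)
    ]
    return request + "\nRecent quiz attempts:\n" + "\n".join(lines)


message = inline_recent_attempts(
    "Course: CS 201. Generate 3 hard multiple-choice questions.",
    [RecentAttempt("Recursion", 0.4), RecentAttempt("Graph Traversal", 0.25)],
)
print(message)
```

Because the state is part of the input string, a case built this way exercises only how the prompt applies its rule. As the NOTE comments say, the tool-call path still needs live-mode evals.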