refactor(quiz): convert generate_quiz to quiz_agent (refactor #2)#71
Conversation
…ADR 0005)
Replaces routes/quiz.py:82's manual prompt-string augmentation +
call_gemini_json with a typed Pydantic AI agent. The agent has tools
to pull weak-area + class-misconception data on demand instead of
pre-stuffing them into the user message — those tool calls show up
in Logfire traces, replacing the previously-opaque prompt building.
Wire shape (frontend submit/grade flows) unchanged. Legacy path
preserved per ADR 0001's migration contract. Per-task model routing
via SAPLING_MODEL_QUIZ env var.
What landed
- backend/agents/quiz.py: quiz_agent + Quiz/QuizQuestion outputs
(kept flat per ADR 0003 conv. 4). Difficulty + question type are
Literal types so Gemini's enum constraint applies. Prompt hash
17ab80b30316 baked into the run metadata for span-side versioning.
- backend/agents/tools/graph_read.py: read_concepts_for_user (sorted
ASC by mastery so weakest concepts appear first) and
read_misconceptions_for_course (reads course_concept_stats —
there's no separate misconceptions table; ADR 0004 spec assumed
one). Pure-async + thin Pydantic AI tool wrappers per the
agents/tools/graph.py pattern.
- backend/agents/_providers.py: 'quiz' added to AgentTask + _DEFAULTS;
default gemini-2.5-flash-lite, override SAPLING_MODEL_QUIZ.
- backend/routes/quiz.py: route is now async. _quiz_via_agent +
_legacy_generate_quiz wrap the agent-vs-fallback decision.
_agent_question_to_wire converts the flat agent output to the
legacy {options:[{label,text,correct}]} persisted shape so
submit_quiz's grading loop and the frontend are completely
unaffected. HTTPException re-raised inside the bare-Exception catch
so legitimate 404s during fallback don't recurse.
- backend/tests/evals/quiz_generation.py: 8 cases × 6 evaluators per
ADR 0005. Full 3-difficulty × 2-type grid coverage. Wired through
the existing run_with_cassette replay layer; cassettes recorded
separately via SAPLING_EVAL_MODE=record.
Tests
- tests/test_quiz_routes.py: 19/19 (was 14; +3 agent-success +2
agent-fallback).
- tests/test_quiz_agent_imports.py: 2/2 (new).
- tests/test_graph_read_tools.py: 5/5 (new).
- tests/test_shared_course_context.py: existing tests adapted to drive
the route via TestClient and force the legacy fallback by patching
quiz_agent.run.
- tests/test_documents_routes.py: stale 15 MB size-cap test renamed +
fixed (cap was bumped to 100 MB in commit 9912a25 — the test never
caught up, this restores it).
- Full backend suite: 439 passed, 3 pre-existing live-Supabase
failures unchanged. Was 438/4 before; net +1 pass.
ADR 0013 captures what surprised us during the refactor (no quiz
cache existed; misconceptions table doesn't exist; HTTPException
re-raise gotcha; short-answer grading still TODO) and what to carry
into refactor #3 (chat tutor).
No frontend changes. No env-var changes other than the new optional
SAPLING_MODEL_QUIZ override.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdded a typed quiz-generation agent with Pydantic output models, two async graph-read tools, an async /api/quiz/generate path that prefers the agent and falls back to the legacy generator, plus fingerprinting utilities and comprehensive tests and evals. (50 words) ChangesQuiz Agent Feature
Sequence DiagramsequenceDiagram
participant Client as Client / TestClient
participant Route as POST /api/quiz/generate
participant Agent as quiz_agent
participant ConceptTool as read_concepts_for_user_tool
participant MisconceptionTool as read_misconceptions_for_course_tool
participant DB as Supabase Tables
participant Legacy as _legacy_generate_quiz
participant Response as HTTP Response
Client->>Route: POST /api/quiz/generate (user_id, concept_node_id, difficulty, num_questions)
Route->>Route: validate/extract node, derive request_id
Route->>Agent: run(prompt, SaplingDeps(request_id))
Agent->>ConceptTool: read_concepts_for_user(user_id, course_id)
ConceptTool->>DB: SELECT graph_nodes (filter, order mastery ASC)
DB-->>ConceptTool: [ConceptMastery...]
ConceptTool-->>Agent: [ConceptMastery...]
Agent->>MisconceptionTool: read_misconceptions_for_course(course_id)
MisconceptionTool->>DB: SELECT course_concept_stats (common_misconceptions)
DB-->>MisconceptionTool: [Misconception...]
MisconceptionTool-->>Agent: [Misconception...]
Agent->>Agent: generate Quiz (QuizQuestion[])
Agent-->>Route: Quiz
Route->>Route: convert via _agent_question_to_wire (validate correct_answer in options)
alt valid converted questions exist
Route->>DB: INSERT quiz_attempts (questions_json, metadata)
DB-->>Route: insert OK
Route-->>Response: 200 OK (legacy wire format)
else agent error or all questions dropped
Route->>Legacy: _legacy_generate_quiz(...)
Legacy-->>Response: 200 OK (legacy wire format)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| from pydantic_evals import Case, Dataset | ||
| from pydantic_evals.evaluators import Evaluator, EvaluatorContext | ||
|
|
||
| from agents.quiz import quiz_agent, Quiz, QuizQuestion |
Deploying with
|
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs |
frontend | a2fd5cd | Commit Preview URL Branch Preview URL |
May 05 2026, 01:31 AM |
1. Restrict QuizQuestionType to multiple_choice only
- Schema-level rejection: Pydantic ValidationError on
type="short_answer", pinned by
test_short_answer_type_is_rejected_at_schema_layer.
- Old behavior synthesized a single-option grading shim for
short_answer that didn't actually work — submit_quiz grades by
option-label and the frontend has no free-text UI. Real
short-answer support is a future ADR when the frontend's there.
- Prompt updated to reflect MCQ-only.
2. Drop silent first-option-correct fallback in
_agent_question_to_wire. New behavior: if the agent's
correct_answer doesn't match any option verbatim, log a warning
and return None. Caller filters None and re-numbers survivors. If
ALL questions drift, raise and degrade to legacy fallback rather
than serve an empty or mismarked quiz. Pinned by
TestQuizWireFormatContract (well-formed passthrough, drift drops,
whitespace tolerance — 3 tests).
3. Eval set: rewrite the 3 short_answer cases as MCQ at the same
difficulty levels (PHYS 101 definitions, ECON 201, theory of
computation). ShortAnswerShapeEvaluator removed (no cases use it).
Module-level assertion verifies all 8 cases are MCQ.
Bonus: tests/test_documents_routes.py::test_rejects_file_over_max_size
now monkeypatches MAX_FILE_SIZE to 1KB instead of allocating 100MB,
which was flaky in the full-suite run under memory pressure. Same
contract pinned (rejection message says "100 MB").
ADR 0013 addendum captures all three.
Tests
- tests/test_quiz_routes.py: 24/24 (was 19; +3 contract +1 short-answer
rejection +1 net other = 5 net new).
- tests/test_quiz_agent_imports.py: 2/2.
- Full backend suite: 441 passed, 3-4 pre-existing live-Gemini and
live-Supabase failures unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. All-drift cascade test (TestQuizAgentFallback) New `test_falls_back_to_legacy_when_all_questions_drift` pins the path the 3 contract tests don't cover directly: agent returns a schema-valid Quiz where every question's correct_answer doesn't appear in its options → _quiz_via_agent's wire-format filter drops all of them → raises RuntimeError → bare-Exception catch in generate_quiz routes to _legacy_generate_quiz. Asserts the legacy gemini path actually runs and the legacy fallback question is what reaches the client. 2. Drift warning no longer leaks student content to local logs _agent_question_to_wire's drift warning was using %r to dump the raw correct_answer, options, and concept text. Logfire's egress scrubber (PR #67) handled remote ingestion, but Railway's local stdout still saw the unredacted strings. Now we log: n_options=4, canonical_len=18, fp=<sha256[:12]> The fingerprint is stable across recurrences of the same drift, so we still get correlation; the actual content stays out of stdout. Hashlib import hoisted to module scope. Pre-existing transient: tests/test_ocr_pipeline.py::test_gemini_parse that flickered red in the previous review run cleared on re-run (skipped in isolation, passing in full suite). Confirmed transient live-Gemini hiccup, not caused by this branch. Tests - tests/test_quiz_routes.py: 23/23 (the previous "24" was a miscount; net +1 from the new cascade test). - Full backend suite: 443 passed, 3 pre-existing live-Supabase failures unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/routes/quiz.py (1)
165-237:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
async /generatestill performs blocking I/O on the event loop.After making the handler async, synchronous DB/Gemini operations are still called directly (e.g., Line 173, Line 233, Line 243, Line 288). Under load, one slow fallback/DB call can stall unrelated requests.
Please offload blocking calls with
asyncio.to_thread(or switch these dependencies to async clients), especially around node lookup, legacy generation, and quiz insert.🔧 Example direction
+import asyncio ... - node_rows = table("graph_nodes").select( - "*", filters={"id": f"eq.{body.concept_node_id}"} - ) + node_rows = await asyncio.to_thread( + lambda: table("graph_nodes").select( + "*", filters={"id": f"eq.{body.concept_node_id}"} + ) + ) ... - questions = await _legacy_generate_quiz(body, request) + questions = await asyncio.to_thread(_legacy_generate_quiz_sync, body, request) ... - table("quiz_attempts").insert({ - "id": quiz_id, - "user_id": body.user_id, - "concept_node_id": body.concept_node_id, - "difficulty": body.difficulty, - "questions_json": questions, - }) + await asyncio.to_thread( + lambda: table("quiz_attempts").insert({ + "id": quiz_id, + "user_id": body.user_id, + "concept_node_id": body.concept_node_id, + "difficulty": body.difficulty, + "questions_json": questions, + }) + )Also applies to: 240-295
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routes/quiz.py` around lines 165 - 237, The _legacy_generate_quiz function is performing blocking I/O on the event loop (sync DB and Gemini calls). Change the synchronous calls to run off the loop (use asyncio.to_thread) or swap to async clients: run table("graph_nodes").select(...) and any node_rows access via asyncio.to_thread, run get_graph(...) and get_quiz_context(...) via asyncio.to_thread, call get_course_context(...) via asyncio.to_thread, and call call_gemini_json(...) via asyncio.to_thread (or replace with an async Gemini client). Ensure you await each asyncio.to_thread call and keep the rest of the prompt-building logic unchanged so the function remains async and non-blocking.
🧹 Nitpick comments (1)
backend/tests/evals/quiz_generation.py (1)
15-17: 🏗️ Heavy liftKeep this eval under the pytest test runner.
This module lives under
backend/tests, but it is executed as a standalone script withsys.pathsurgery and a custom__main__entrypoint. That bypasses the repo’s pytest collection/shared-fixture conventions and makes the eval harder to run consistently in CI.As per coding guidelines, Backend tests must live in backend/tests/ and run via pytest. Shared fixtures (mock Supabase, mock Gemini) must be centralized in tests/conftest.py.
Also applies to: 275-276
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/tests/evals/quiz_generation.py` around lines 15 - 17, The file quiz_generation.py currently mutates sys.path (sys.path.insert calls) and provides a standalone __main__ entrypoint so it runs as a script; convert it into a proper pytest test module by removing the sys.path.insert lines and the custom __main__ runner, adding a pytest test function (e.g., def test_quiz_generation(...):) that uses the centralized fixtures from tests/conftest.py (refer to fixture names like mock_supabase and mock_gemini or shared_client fixtures) to set up mocks instead of manual path surgery, and ensure the test name starts with test_ so pytest will discover it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/agents/quiz.py`:
- Around line 69-97: The system prompt (_SYSTEM_PROMPT) tells the agent to call
read_concepts_for_user and read_misconceptions_for_course, but the actual
registered tool functions are named read_concepts_for_user_tool and
read_misconceptions_for_course_tool, so the agent will not find the tools; fix
by making the names consistent: either rename the tool functions
(read_concepts_for_user_tool -> read_concepts_for_user and
read_misconceptions_for_course_tool -> read_misconceptions_for_course) or update
_SYSTEM_PROMPT to reference the actual registered names
(read_concepts_for_user_tool and read_misconceptions_for_course_tool) so the
prompt matches the function symbols used at registration.
In `@backend/agents/tools/graph_read.py`:
- Around line 54-56: Remove the unsupported course_id filter when querying
graph_nodes: delete the line that sets filters["course_id"] = f"eq.{course_id}"
inside the graph_nodes read code so you don't send a non-existent column to the
DB; also stop silently swallowing the resulting exception in the surrounding
try/except (the block that currently catches errors and returns []), instead let
the exception propagate or log and re-raise so callers know the read failed and
course-scoped context isn’t silently dropped. Reference the filters dict used
for graph_nodes reads and the try/except that returns an empty list to locate
the changes.
In `@backend/tests/test_documents_routes.py`:
- Around line 234-248: The test only exercises the sync endpoint; add coverage
for the streaming /upload route or unify the route copy so both endpoints use
the same MAX_FILE_SIZE text. Specifically, update or add a test (e.g., alongside
test_rejects_file_over_max_size) that uses _make_upload to POST to the streaming
endpoint URL ("/api/documents/upload") while monkeypatching
routes.documents.MAX_FILE_SIZE to 1024 and
routes.documents.extract_text_from_file as done now, then assert status_code ==
400 and that the response detail contains "100 MB"; alternatively, refactor the
upload route handler code (the function handling "/upload") to reference the
shared MAX_FILE_SIZE/formatting logic used by the sync handler so the error
message is consistent.
In `@docs/decisions/0013-refactor-2-quiz-shipped.md`:
- Around line 69-73: Update the ADR text so it consistently reflects that
short-answer questions are rejected by the shipped contract rather than being
shimmed; replace the “short-answer shim keeps grading working” wording in the
“What surprised us” section (references: submit_quiz and short_answer) to state
that short_answer is now rejected and that the prior shim behavior was removed,
ensuring the earlier section matches the addendum that describes rejection of
short_answer.
---
Outside diff comments:
In `@backend/routes/quiz.py`:
- Around line 165-237: The _legacy_generate_quiz function is performing blocking
I/O on the event loop (sync DB and Gemini calls). Change the synchronous calls
to run off the loop (use asyncio.to_thread) or swap to async clients: run
table("graph_nodes").select(...) and any node_rows access via asyncio.to_thread,
run get_graph(...) and get_quiz_context(...) via asyncio.to_thread, call
get_course_context(...) via asyncio.to_thread, and call call_gemini_json(...)
via asyncio.to_thread (or replace with an async Gemini client). Ensure you await
each asyncio.to_thread call and keep the rest of the prompt-building logic
unchanged so the function remains async and non-blocking.
---
Nitpick comments:
In `@backend/tests/evals/quiz_generation.py`:
- Around line 15-17: The file quiz_generation.py currently mutates sys.path
(sys.path.insert calls) and provides a standalone __main__ entrypoint so it runs
as a script; convert it into a proper pytest test module by removing the
sys.path.insert lines and the custom __main__ runner, adding a pytest test
function (e.g., def test_quiz_generation(...):) that uses the centralized
fixtures from tests/conftest.py (refer to fixture names like mock_supabase and
mock_gemini or shared_client fixtures) to set up mocks instead of manual path
surgery, and ensure the test name starts with test_ so pytest will discover it.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 838314be-9fa3-4a60-8a22-874b8591d9de
📒 Files selected for processing (11)
backend/agents/_providers.pybackend/agents/quiz.pybackend/agents/tools/graph_read.pybackend/routes/quiz.pybackend/tests/evals/quiz_generation.pybackend/tests/test_documents_routes.pybackend/tests/test_graph_read_tools.pybackend/tests/test_quiz_agent_imports.pybackend/tests/test_quiz_routes.pybackend/tests/test_shared_course_context.pydocs/decisions/0013-refactor-2-quiz-shipped.md
| _SYSTEM_PROMPT = ( | ||
| "You generate adaptive multiple-choice quizzes for a student. Each " | ||
| "question must target a specific concept the student has weak " | ||
| "mastery on, OR address a class-level misconception you've seen.\n\n" | ||
| "Workflow:\n" | ||
| "1. Call `read_concepts_for_user` to see the student's mastery per " | ||
| " concept for this course (returned sorted by mastery ASC — " | ||
| " weakest first).\n" | ||
| "2. Call `read_misconceptions_for_course` to see anonymized class " | ||
| " misconceptions. Use these to phrase distractors and to write " | ||
| " a question that probes the misconception.\n" | ||
| "3. Compose `Quiz.questions` so the WEAKEST concepts get the most " | ||
| " questions, AND each item's `concept` field exactly matches a " | ||
| " concept_name returned by tool 1.\n\n" | ||
| "Per-question rules (multiple-choice only — the type field is " | ||
| "constrained to 'multiple_choice'):\n" | ||
| "- 4 options, exactly one correct. The text in `correct_answer` " | ||
| " MUST appear verbatim in `options` — character-for-character. " | ||
| " Questions that violate this are dropped at the route layer.\n" | ||
| "- Distractors should reflect plausible misconceptions, not random " | ||
| " noise. Use the read_misconceptions_for_course return value.\n" | ||
| "- explanation: 1-3 sentences explaining WHY the correct answer " | ||
| " is correct — used in the post-quiz review screen.\n" | ||
| "- difficulty: align with the student's mastery on the concept; " | ||
| " weakest concepts get easy/medium, strongest get hard.\n\n" | ||
| "Honor the requested num_questions and difficulty distribution " | ||
| "in the user message. Don't invent concepts the student doesn't " | ||
| "have." | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Confirm the exact registered tool names and whether any aliasing is applied.
rg -n --type=py 'def read_concepts_for_user_tool|def read_misconceptions_for_course_tool|read_concepts_for_user\b|read_misconceptions_for_course\b|tools=\[' backend/agentsRepository: SaplingLearn/Sapling
Length of output: 1385
🏁 Script executed:
# Check the tools registration in quiz.py
sed -n '100,130p' backend/agents/quiz.pyRepository: SaplingLearn/Sapling
Length of output: 383
🏁 Script executed:
# Also check if pydantic-ai is imported and how it handles tool names
rg -n "from pydantic_ai|import.*pydantic_ai|agent.*tools" backend/agents/quiz.py -A 2 -B 2Repository: SaplingLearn/Sapling
Length of output: 568
🏁 Script executed:
# Check pydantic-ai documentation or examples in the codebase for tool naming
rg -n "def.*_tool\(" backend/agents --type=py -B 5 | head -40Repository: SaplingLearn/Sapling
Length of output: 1084
🏁 Script executed:
# Check if the _tool functions have any decorators or name parameters
rg -n "@|name\s*=" backend/agents/tools/graph_read.py -B 2 -A 8 | grep -A 10 "read_concepts_for_user_tool\|read_misconceptions_for_course_tool"Repository: SaplingLearn/Sapling
Length of output: 46
🏁 Script executed:
# Search for pydantic_ai documentation or configuration in the codebase
rg -n "pydantic_ai.*tool|Tool.*name|tool.*alias" backend --type=py | head -20Repository: SaplingLearn/Sapling
Length of output: 46
🌐 Web query:
pydantic-ai tool naming convention function name suffix
💡 Result:
Pydantic AI does not enforce a specific function name suffix or strict naming convention for tools beyond requiring unique tool names within a toolset or agent. The tool name defaults to the Python function name (snake_case convention in examples like 'roll_dice', 'get_player_name', 'customer_balance'), but can be explicitly overridden via the 'name' parameter in decorators (@agent.tool(name='custom_name'), Tool constructor, Tool.from_schema) or toolset methods. ToolDefinition names must be unique to avoid conflicts (raises UserError on duplicates), but no suffix like '_tool' is required or conventional. Use descriptive, unique names that clearly communicate purpose, following Python snake_case for functions.
Citations:
- 1: https://pydantic.dev/docs/ai/tools-toolsets/tools/
- 2: https://ai.pydantic.dev/tools-advanced
- 3: https://ai.pydantic.dev/toolsets
- 4: https://github.com/pydantic/pydantic-ai/blob/7f5214c6/pydantic_ai_slim/pydantic_ai/toolsets/function.py
- 5: https://ai.pydantic.dev/api/tools/
Fix tool name mismatch between prompt and registration.
The prompt instructs the agent to call read_concepts_for_user and read_misconceptions_for_course, but the registered tools are read_concepts_for_user_tool and read_misconceptions_for_course_tool. Pydantic-ai uses the function name directly as the tool name without auto-aliasing suffixes, so the agent will fail to locate these tools when following the prompt instructions.
Either rename the functions to remove the _tool suffix or update the prompt to reference the correct tool names.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/agents/quiz.py` around lines 69 - 97, The system prompt
(_SYSTEM_PROMPT) tells the agent to call read_concepts_for_user and
read_misconceptions_for_course, but the actual registered tool functions are
named read_concepts_for_user_tool and read_misconceptions_for_course_tool, so
the agent will not find the tools; fix by making the names consistent: either
rename the tool functions (read_concepts_for_user_tool -> read_concepts_for_user
and read_misconceptions_for_course_tool -> read_misconceptions_for_course) or
update _SYSTEM_PROMPT to reference the actual registered names
(read_concepts_for_user_tool and read_misconceptions_for_course_tool) so the
prompt matches the function symbols used at registration.
| if course_id: | ||
| filters["course_id"] = f"eq.{course_id}" | ||
| try: |
There was a problem hiding this comment.
Remove the unsupported course_id filter from graph_nodes reads.
On Line 55, filters["course_id"] is added, but the provided schema for graph_nodes (in backend/db/archive/schema.sql:11-22) has no course_id column. That causes the select to fail, then Lines 65-71 swallow it and return [], so weak-concept context is silently lost for course-scoped requests.
🔧 Proposed fix
def _fetch() -> list[dict[str, Any]]:
filters = {"user_id": f"eq.{user_id}"}
- if course_id:
- filters["course_id"] = f"eq.{course_id}"
try:
return (
table("graph_nodes").select(
"concept_name,mastery_score,last_studied_at",
filters=filters,
order="mastery_score.asc",
)
or []
)Also applies to: 65-71
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/agents/tools/graph_read.py` around lines 54 - 56, Remove the
unsupported course_id filter when querying graph_nodes: delete the line that
sets filters["course_id"] = f"eq.{course_id}" inside the graph_nodes read code
so you don't send a non-existent column to the DB; also stop silently swallowing
the resulting exception in the surrounding try/except (the block that currently
catches errors and returns []), instead let the exception propagate or log and
re-raise so callers know the read failed and course-scoped context isn’t
silently dropped. Reference the filters dict used for graph_nodes reads and the
try/except that returns an empty list to locate the changes.
| def test_rejects_file_over_max_size(self): | ||
| # Cap was raised from 15 MB → 100 MB in commit 9912a25. Don't | ||
| # allocate 100MB here — it makes the test fragile under full-suite | ||
| # memory pressure. Instead, monkeypatch MAX_FILE_SIZE to 1 KB and | ||
| # pin the rejection-message contract. | ||
| with ( | ||
| _mock_validate_user(), | ||
| patch("routes.documents.MAX_FILE_SIZE", 1024), | ||
| patch("routes.documents.extract_text_from_file", return_value=""), | ||
| ): | ||
| r = _make_upload(content=b"x" * 2048) | ||
| assert r.status_code == 400 | ||
| assert "15 MB" in r.json()["detail"] | ||
| # The error message names the actual cap (100 MB) — verifies the | ||
| # route's hardcoded copy in the detail string didn't drift. | ||
| assert "100 MB" in r.json()["detail"] |
There was a problem hiding this comment.
This only covers /upload/sync, so the stale /upload copy still slips through.
The new test verifies the sync endpoint’s limit text, but _make_upload() posts to /api/documents/upload/sync; the streaming /api/documents/upload path still has the old "15 MB" message in the route snippet, and this change doesn’t protect it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/tests/test_documents_routes.py` around lines 234 - 248, The test only
exercises the sync endpoint; add coverage for the streaming /upload route or
unify the route copy so both endpoints use the same MAX_FILE_SIZE text.
Specifically, update or add a test (e.g., alongside
test_rejects_file_over_max_size) that uses _make_upload to POST to the streaming
endpoint URL ("/api/documents/upload") while monkeypatching
routes.documents.MAX_FILE_SIZE to 1024 and
routes.documents.extract_text_from_file as done now, then assert status_code ==
400 and that the response detail contains "100 MB"; alternatively, refactor the
upload route handler code (the function handling "/upload") to reference the
shared MAX_FILE_SIZE/formatting logic used by the sync handler so the error
message is consistent.
| 5. **Short-answer grading doesn't exist yet.** `submit_quiz` grades by | ||
| `q["options"][i].correct` lookup. For short-answer questions, the wrapper | ||
| synthesizes a single-option `{label:"A", correct:True}` so existing | ||
| grading code keeps working. Real short-answer grading (e.g. fuzzy-match, | ||
| LLM-judged) is its own future scope. |
There was a problem hiding this comment.
ADR contains conflicting statements about short-answer support.
The “What surprised us” text (Line 69-73) still says the short-answer shim keeps grading working, but the addendum (Line 132-140) says that behavior was removed and short_answer is now rejected. Please align the earlier section with the shipped contract.
Also applies to: 132-140
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/decisions/0013-refactor-2-quiz-shipped.md` around lines 69 - 73, Update
the ADR text so it consistently reflects that short-answer questions are
rejected by the shipped contract rather than being shimmed; replace the
“short-answer shim keeps grading working” wording in the “What surprised us”
section (references: submit_quiz and short_answer) to state that short_answer is
now rejected and that the prior shim behavior was removed, ensuring the earlier
section matches the addendum that describes rejection of short_answer.
Three small follow-ups from the previous review pass. 1. Extract shared fingerprint helper (services/fingerprint.py) The sha256[:N] hashing pattern was duplicated in services/logfire_scrubber.py::_fingerprint and routes/quiz.py's new drift warning. Both now route through fingerprint() / fingerprint_text() in a single module so the convention is consistent and a third callsite is one line. Adds 9 unit tests covering deterministic hashing, custom length, list/None/int handling, and the unit-separator collision-resistance contract. 2. Use ASCII unit-separator (\x1f) instead of `|` in fingerprints `'|'.join(options)` would let `["a|b", "c"]` and `["a", "b|c"]` collapse to the same hash body. Real text never contains \x1f so the join is unambiguous. Pinned by test_unit_separator_avoids_pipe_ambiguity. 3. Tighten the all-drift cascade test `test_falls_back_to_legacy_when_all_questions_drift` now asserts BOTH halves of the contract: agent_run_mock.assert_called_once() confirms the agent path was tried first, gemini_mock.assert_called_once() confirms the legacy path then fired. Previously only the second was pinned — a regression that skipped the agent entirely could have slipped through. Tests - tests/test_fingerprint.py: 9/9 (new). - tests/test_logfire_scrubber.py: 9/9 (still pass after switching the underlying _fingerprint to fingerprint_text). - tests/test_quiz_routes.py: 23/23 (cascade test tightened). - Full backend suite: 452 passed (was 443; +9 from the new fingerprint tests). 3 pre-existing live-Supabase failures unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/services/fingerprint.py (1)
24-55:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winValidate fingerprint length bounds explicitly.
lengthis used directly for digest slicing. On Line 44 and Line 54, values like0or negatives silently produce unusable/ambiguous fingerprints, which weakens correlation reliability.Proposed patch
+def _validate_length(length: int) -> int: + if not 1 <= length <= 64: + raise ValueError("length must be between 1 and 64") + return length + + def fingerprint(*parts: str | int | float | bool | None, length: int = 12) -> str: @@ + length = _validate_length(length) flat: list[str] = [] @@ def fingerprint_text(value: str, length: int = 16) -> str: @@ + length = _validate_length(length) return hashlib.sha256( value.encode("utf-8", errors="replace"), ).hexdigest()[:length]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/services/fingerprint.py` around lines 24 - 55, The functions fingerprint and fingerprint_text currently slice the SHA256 hex digest with the provided length which can be 0 or negative; add explicit validation at the start of fingerprint and fingerprint_text to ensure the length is an int between 1 and 64 (since SHA256.hexdigest() is 64 hex chars) and raise a ValueError with a clear message if it isn't; update both functions (fingerprint(...) and fingerprint_text(...)) to perform this check before computing the digest so callers get an immediate, descriptive error for invalid lengths.
🧹 Nitpick comments (1)
backend/services/fingerprint.py (1)
24-39: ⚡ Quick winAlign
partstype hints with actual accepted inputs.Line 24 typing excludes
list/tuple, but Line 38 handles them intentionally. This creates false positives in static checking for valid calls (including your tests).Proposed patch
-def fingerprint(*parts: str | int | float | bool | None, length: int = 12) -> str: +def fingerprint( + *parts: str | int | float | bool | None | list[object] | tuple[object, ...], + length: int = 12, +) -> str:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/services/fingerprint.py` around lines 24 - 39, The function signature for fingerprint does not include list/tuple types even though the implementation flattens them; update the type hint for the varargs parameter `parts` in the `fingerprint` function to also accept list and tuple of the same element types (e.g. add list[str|int|float|bool|None] and tuple[str|int|float|bool|None] to the Union), so static checkers know nested sequences are valid; keep references to SEPARATOR/flat logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@backend/services/fingerprint.py`:
- Around line 24-55: The functions fingerprint and fingerprint_text currently
slice the SHA256 hex digest with the provided length which can be 0 or negative;
add explicit validation at the start of fingerprint and fingerprint_text to
ensure the length is an int between 1 and 64 (since SHA256.hexdigest() is 64 hex
chars) and raise a ValueError with a clear message if it isn't; update both
functions (fingerprint(...) and fingerprint_text(...)) to perform this check
before computing the digest so callers get an immediate, descriptive error for
invalid lengths.
---
Nitpick comments:
In `@backend/services/fingerprint.py`:
- Around line 24-39: The function signature for fingerprint does not include
list/tuple types even though the implementation flattens them; update the type
hint for the varargs parameter `parts` in the `fingerprint` function to also
accept list and tuple of the same element types (e.g. add
list[str|int|float|bool|None] and tuple[str|int|float|bool|None] to the Union),
so static checkers know nested sequences are valid; keep references to
SEPARATOR/flat logic unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 0010ed98-af75-42ac-98f5-64444e5540bb
📒 Files selected for processing (5)
backend/routes/quiz.pybackend/services/fingerprint.pybackend/services/logfire_scrubber.pybackend/tests/test_fingerprint.pybackend/tests/test_quiz_routes.py
🚧 Files skipped from review as they are similar to previous changes (2)
- backend/tests/test_quiz_routes.py
- backend/routes/quiz.py
Closes the asymmetry I flagged in the post-merge review: chat tutor got `model_pref: Literal["fast", "smart"]` per-request via PR #73, quiz had only env-var-driven model selection. Now both routes accept the same body field and route to the same model strings. Wiring - backend/models/__init__.py: GenerateQuizBody.model_pref accepts "fast" | "smart" | None. Default None falls through to the agent's task-default (model_for("quiz") = gemini-2.5-flash-lite per ADR 0008). - backend/routes/quiz.py: * New _resolve_model_pref(pref) builds an optional GoogleModel override from the body's preference, returning None for None/empty/unknown so the agent's default wins on degraded input. * _quiz_via_agent threads `model_pref` through and passes `model=...` to quiz_agent.run only when an override is present (no model kwarg = agent default). * _legacy_generate_quiz now picks MODEL_SMART when pref="smart", falls back to MODEL_LITE otherwise. So the fast/smart toggle works on BOTH paths, including when the agent path trips and we degrade. - _PREF_MODEL_NAMES table at module scope so the {fast→flash, smart→pro} mapping is one place. Mirrors the chat tutor's mapping in services/gemini_service.py (MODEL_DEFAULT/MODEL_SMART). Tests (TestQuizModelPref, 5 new) - model_pref="smart" → quiz_agent.run gets model=GoogleModel('gemini-2.5-pro') - model_pref="fast" → quiz_agent.run gets model=GoogleModel('gemini-2.5-flash') - model_pref=None → quiz_agent.run gets NO model kwarg (agent default wins) - model_pref="auto" → _resolve_model_pref returns None (graceful unknown) - legacy fallback with smart → call_gemini_json gets model=MODEL_SMART Documentation: ADR 0013 addendum captures the design decision (env var for ops defaults + body field for per-request overrides; they compose). Tests - tests/test_quiz_routes.py: 28/28 (was 23, +5). - Full backend suite: 531 passed, 3 pre-existing live-Supabase failures unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the asymmetry I flagged in the most recent review of PR #71. Was: agent path "fast" → gemini-2.5-flash (upgrade), legacy path "fast" → gemini-2.5-flash-lite (no-op). Same body field, different models depending on which path ran. The kind of inconsistency that surfaces six months later as "why does Fast sometimes feel slower than Default?" Now: agent path AND legacy fallback both use the same mapping ("fast" → flash, "smart" → pro, else → lite). _PREF_MODEL_NAMES is the single source of truth on the agent side; the legacy elif-chain in _legacy_generate_quiz mirrors it exactly. Also: - backend/routes/quiz.py: re-imported MODEL_DEFAULT alongside MODEL_LITE/MODEL_SMART. Tightened the lazy-import comment in _resolve_model_pref to describe the actual benefit (deferring GoogleProvider construction to the first override request, not import-path isolation — agents.quiz is already imported at module top). - docs/decisions/0013-refactor-2-quiz-shipped.md: addendum reworded to "two independent layers, not multiplicative" — the env var sets the agent's startup baseline; the body field, when present, fully replaces it for the current call. Adds a paragraph documenting the agent/legacy symmetry and pointing at _PREF_MODEL_NAMES as the single source of truth. Tests (TestQuizModelPref grew to 7) - test_legacy_fallback_uses_default_when_pref_fast (NEW): pins the symmetry — legacy "fast" must call gemini with MODEL_DEFAULT, not MODEL_LITE. Asserts a clear failure message if anyone undoes the fix. - test_legacy_fallback_uses_lite_when_no_pref (NEW): pins the default — no model_pref still uses MODEL_LITE (the cheap baseline that's been there since the route shipped). - Existing 5 tests unchanged. Tests - tests/test_quiz_routes.py: 30/30 (was 28, +2). - Full backend suite: 533 passed, 3 pre-existing live-Supabase failures unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ssion_id Two correctness fixes from the self-review on PR #78. 1. Symmetric model defaults across agent + legacy paths. When body.model_pref is None (the default), the legacy fallback used to return MODEL_DEFAULT (gemini-2.5-flash) while the agent path returned the agent's task-default (gemini-2.5-pro). A user with no explicit pref silently downgraded from Pro -> Flash on fallback. This is the same bug PR #71's commit a2fd5cd fixed for quiz; chat now matches. - routes/learn.py: _resolve_tutor_model -> _resolve_legacy_model. Default fallback flipped from MODEL_DEFAULT to MODEL_SMART. Comment block + three call sites (start_session, _legacy_chat, action) updated. - tests/test_learn_routes.py: TestResolveTutorModel -> TestResolveLegacyModel. Four tests flipped their expected return from MODEL_DEFAULT to MODEL_SMART to pin the new symmetric contract. One new test (test_default_matches_agent_default_for_no_pref) pins agent/legacy parity explicitly. 2. session_id is a declared field on SaplingDeps, not an attribute attach. Before: routes/learn.py constructed SaplingDeps without session_id and then did `deps.session_id = session_id # type: ignore`. The tool wrapper read it via `getattr(..., None)`. Worked at runtime (unfrozen dataclass), but type checkers couldn't validate it, frozen=True would silently break it, and future SaplingDeps consumers had no way to discover the seam. - agents/deps.py: added `session_id: str | None = None` with a docstring entry covering the eval/batch case (legitimately None). - routes/learn.py: pass session_id via the constructor; removed the imperative attach + the obsolete comment + the # type: ignore. - agents/tools/chat_context.py: read_session_history_tool now does a direct `ctx.deps.session_id` read; getattr ceremony gone. - tests/test_chat_context_tools.py: missing-session test sets session_id=None explicitly to mirror the declared default. 55 tests pass across the three touched test files. No regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Refactor #2 from ADR 0005's migration plan. Replaces
routes/quiz.py:82's manual prompt-string augmentation +call_gemini_jsonwith a typed Pydantic AI agent. The agent uses tools (read_concepts_for_user,read_misconceptions_for_courseper ADR 0004) to pull weak-area + class-misconception data on demand instead of pre-stuffing the prompt — Logfire traces now show the actual tool calls + arguments instead of opaque string concatenation.Frontend wire format and grading semantics are unchanged. The agent's flat
Quiz.questions[QuizQuestion]output is converted back to the legacy{options:[{label,text,correct}]}persisted shape via_agent_question_to_wire, sosubmit_quiz's grading loop, thequiz_attemptsrow shape, and frontendsubmitQuiz/scoreQuizflows all keep working without changes.What's in this PR
backend/agents/quiz.py—quiz_agent+Quiz/QuizQuestionoutputs. Flat schema per ADR 0003 conv. 4. Prompt hash17ab80b30316baked into run metadata.backend/agents/tools/graph_read.py—read_concepts_for_user(sorted ASC by mastery) andread_misconceptions_for_course(readscourse_concept_stats— there's no separate misconceptions table; ADR 0004 assumed one). Pure-async + Pydantic AI tool wrappers.backend/agents/_providers.py— addsquiztoAgentTask+_DEFAULTS(defaultgemini-2.5-flash-lite); env overrideSAPLING_MODEL_QUIZ.backend/routes/quiz.py— now async._quiz_via_agent+_legacy_generate_quizwrap the agent-vs-fallback decision.HTTPExceptionre-raised inside the bare-Exception catch so legitimate 404s during fallback don't recurse into legacy infinitely.backend/tests/evals/quiz_generation.py— 8 cases × 6 evaluators per ADR 0005. Full 3-difficulty × 2-type grid coverage. Wired throughtests.evals._replay.docs/decisions/0013-refactor-2-quiz-shipped.md— captures what surprised us (no quiz cache exists; misconceptions table doesn't exist; HTTPException re-raise gotcha; short-answer grading still TODO) and what to carry into refactor Fix the learning loop for the context #3 (chat tutor).Tests
tests/test_quiz_routes.py: 19/19 (was 14; +3 agent-success +2 agent-fallback)tests/test_quiz_agent_imports.py: 2/2 (new)tests/test_graph_read_tools.py: 5/5 (new)tests/test_shared_course_context.py: existing tests adapted (force legacy fallback viaquiz_agent.runpatch)tests/test_documents_routes.py: bonus fix — staletest_rejects_file_over_15mbupdated for the 100 MB cap (was failing for everyone since commit 9912a25)Compatibility
SAPLING_MODEL_QUIZis an optional override.services/gemini_service.py::call_gemini_jsonpath preserved per ADR 0001 — gets exercised when the agent tripsUsageLimitExceeded/UnexpectedModelBehavior/ any other exception.Test plan
submitQuiz/scoreQuiz)read_concepts_for_user_toolandread_misconceptions_for_course_toolcalls appear with the rightrequest_idcorrelationSAPLING_EVAL_MODE=record python tests/evals/quiz_generation.py) when next refreshing the eval suiteOut of scope
submit_quiz's grading loop).🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests