feat: inline language inference in outline generation #390
…guageNote types

Remove the `language` field from UserRequirements and all zh-CN/en-US hardcoded conditionals in prompt construction. Language will be inferred by the LLM during outline generation (handled in subsequent tasks).

- Remove language toggle from GenerationToolbar and homepage form
- Remove normalizeLanguage helper and language-based prompt branching
- Standardize formatImageDescription/formatImagePlaceholder to English only
- Add `languageDirective` to Stage, `languageNote` to SceneOutline
- Fix generation-preview references to the removed requirements.language

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change LLM output format from flat JSON array to wrapper object
{ languageDirective, outlines }. Add language inference instructions
to system prompt with signal priority and examples. Replace hardcoded
Course Language section in user prompt with Language Context for
inference.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update generateSceneOutlinesFromRequirements to return a wrapper
object { languageDirective, outlines } instead of a flat SceneOutline[].
Parse the new LLM response format with backward compatibility for old
flat-array responses. Add pdfLanguageSample template variable for
language inference in the prompt.
Note: downstream callers (classroom-generation.ts, pipeline-runner.ts)
have expected type errors that will be fixed in subsequent tasks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
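The wrapper parse with legacy fallback can be sketched as follows. The type fields beyond `languageDirective`/`outlines`/`languageNote` and the fallback directive text are illustrative assumptions, not the actual codebase definitions:

```typescript
// Illustrative shapes — only languageDirective/outlines/languageNote appear
// in this PR; other fields and the fallback text are assumptions.
interface SceneOutline {
  title: string;
  languageNote?: string;
}

interface OutlineResult {
  languageDirective: string;
  outlines: SceneOutline[];
}

// Hypothetical fallback used when an old-format (flat array) response
// carries no directive.
const FALLBACK_DIRECTIVE = "Teach in English.";

function parseOutlineResponse(raw: string): OutlineResult {
  const parsed = JSON.parse(raw);
  if (Array.isArray(parsed)) {
    // Legacy flat SceneOutline[] response: no directive was emitted.
    return { languageDirective: FALLBACK_DIRECTIVE, outlines: parsed };
  }
  return {
    languageDirective: parsed.languageDirective ?? FALLBACK_DIRECTIVE,
    outlines: parsed.outlines ?? [],
  };
}
```

Accepting both shapes in one parser is what lets the downstream callers be migrated task by task instead of in a single breaking change.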
Update the scene-outlines-stream API to support the new wrapper
response format { languageDirective, outlines: [...] }:
- Add pdfLanguageSample template variable to the prompt
- Add extractLanguageDirective() to parse directive from partial JSON
- Update extractNewOutlines() to handle nested "outlines" array key
- Emit languageDirective SSE event as soon as it's parsed
- Include languageDirective in the done event payload
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
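A minimal sketch of what extractLanguageDirective might look like — the regex and the limited escape handling are assumptions, not the route's actual implementation:

```typescript
// Pull languageDirective out of a partial (still-streaming) JSON buffer so it
// can be emitted as an SSE event before the outlines finish. The regex only
// matches once the closing quote of the value has arrived.
function extractLanguageDirective(partialJson: string): string | null {
  const match = partialJson.match(/"languageDirective"\s*:\s*"((?:[^"\\]|\\.)*)"/);
  if (!match) return null;
  // Unescape the sequences the directive is likely to contain; a full JSON
  // string unescape would be more robust.
  return match[1]
    .replace(/\\n/g, "\n")
    .replace(/\\t/g, "\t")
    .replace(/\\"/g, '"');
}
```

Returning `null` until the value is complete means the route can poll this on every chunk and emit the SSE event exactly once, as soon as the directive closes.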
…ents

Reorders the server-side generation pipeline so outline generation (which now infers languageDirective) happens before agent profile generation. This lets agent names/personas follow the inferred language.

Pipeline order: web search → outlines → agents → scenes

Also threads languageDirective through to generateSceneContent and generateSceneActions (those functions don't accept the param yet — that's Tasks 8/9).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
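The dependency order can be sketched as sequential steps — the function names and stub bodies below are placeholders standing in for the real generators, just to show why outlines must run first:

```typescript
// Stubs standing in for the real generation steps.
const webSearch = async (req: string) => `context for ${req}`;
const generateOutlines = async (_req: string, _ctx: string) => ({
  languageDirective: "Teach in English.",
  outlines: [{ title: "Scene 1" }],
});
const generateAgentProfiles = async (
  outlines: { title: string }[],
  directive: string,
) => outlines.map((o) => ({ persona: `Tutor for ${o.title}`, directive }));
const generateScenes = async (
  outlines: { title: string }[],
  _agents: unknown[],
  directive: string,
) => outlines.map((o) => ({ title: o.title, directive }));

// Reordered pipeline: outlines run before agents so the inferred
// languageDirective can shape agent personas and scene content.
async function runPipeline(requirements: string) {
  const searchContext = await webSearch(requirements);
  const { languageDirective, outlines } = await generateOutlines(requirements, searchContext);
  const agents = await generateAgentProfiles(outlines, languageDirective);
  const scenes = await generateScenes(outlines, agents, languageDirective);
  return { languageDirective, outlines, agents, scenes };
}
```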
… and agent profiles API
- Agent profiles API accepts languageDirective instead of language
- Scene content generation accepts and passes languageDirective to prompts
- Scene actions generation accepts and passes languageDirective to prompts
- All prompt templates updated with {{languageDirective}} variable
- Fix pipeline-runner.ts for new outline return type
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reorder the generation-preview page so outlines are generated before agent profiles, enabling the languageDirective inferred from outlines to flow into agent generation, scene content, and scene actions.

- Swap outline and agent-generation step order in ALL_STEPS
- Add languageDirective to GenerationSessionState
- Capture languageDirective from outline SSE stream events
- Pass languageDirective + outlines to agent-profiles API
- Pass languageDirective to scene-content and scene-actions APIs
- Store languageDirective in sessionStorage for classroom page
- Update use-scene-generator GenerationParams with languageDirective
- Update classroom page to pass languageDirective to generateRemaining

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 20 curated test cases from production covering:
  - Pure Chinese/English requirements
  - Chinese with English tech terms
  - Foreign language learning (EN→CN, EN→DE, ZH→EN, AR→EN)
  - Cross-language locale mismatch
  - Non-Chinese/English languages (Spanish, German, Arabic)
  - Short/ambiguous requirements
- Uses actual outline system prompt for inference
- LLM-as-judge evaluates against human-verified ground truth
- Results written to outline-language.eval.result.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allows running eval tests with different models for inference vs judging:

EVAL_INFERENCE_MODEL=google/gemini-3-flash-preview
EVAL_JUDGE_MODEL=openai/gpt-4o-mini

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
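One way such a split could be resolved in the test setup, sketched with hypothetical defaults (the fallback model ids are assumptions):

```typescript
// Resolve separate models for inference vs judging from environment
// variables, falling back to defaults. The default ids are assumptions,
// not the repo's actual configuration.
function resolveEvalModels(env: Record<string, string | undefined>) {
  return {
    inference: env.EVAL_INFERENCE_MODEL ?? "google/gemini-3-flash-preview",
    judge: env.EVAL_JUDGE_MODEL ?? "openai/gpt-4o-mini",
  };
}
```

Passing `process.env` into a pure function like this keeps the resolution testable without mutating global state.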
…guage inference

- Add special case rules for advanced learners (TEM-8, DALF C1, JLPT N1, etc.) who should be taught in the target language, not their native language
- Add example for advanced English learner case
- Remove ambiguous LLC test case (no context to disambiguate)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Call generateSceneOutlinesFromRequirements directly instead of using a shortened prompt, so the test exercises the exact same code path as production. Each case now generates full outlines + languageDirective.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… prompt examples

- Add 3 PDF test cases from production (EN paper + ZH requirement, ESL teacher + EN article, ZH C++ syllabus)
- Run tests concurrently with maxConcurrency: 10 (3.5x faster)
- Balance system prompt examples: 3 Chinese + 3 English + 1 Spanish (was 3 Chinese + 2 English, causing Chinese bias)
- Add "I want to learn German A1" example to clarify that foreign language learning by an English-speaking user should use English instruction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the 7-row examples table to:

- Reduce token consumption (~500 tokens saved per call)
- Eliminate language distribution bias in the examples
- Eval results: 21/21 (100%) without examples, same as or better than with them

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
*.eval.test.ts files require real LLM API keys and should only be run locally via explicit file path, not in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
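A dedicated vitest config along these lines would keep the eval suite out of the default run and out of CI. The glob and timeout are assumptions, not the repo's actual config:

```typescript
// vitest.eval.config.ts — only picks up *.eval.test.ts files, so the default
// `pnpm vitest` run (and CI) never touches them.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Assumed location of the eval tests.
    include: ["tests/**/*.eval.test.ts"],
    // Eval cases hit real LLM APIs; keep the per-test timeout generous.
    testTimeout: 120_000,
  },
});
```

Run it explicitly with `pnpm vitest run -c vitest.eval.config.ts` (or by passing the test file path directly, as the commit above describes).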
Force-pushed from 689b862 to e6d637e
The scene-content route was defaulting outline.language to 'zh-CN', which contradicted the new languageDirective for English courses. Remove the legacy language parameter from generateInteractiveContent and use languageDirective as the single source of truth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pass userProfile template variable in the SSE streaming route so the LLM has the student profile signal for language inference (matching the non-streaming outline-generator.ts behavior)
- Fix extractLanguageDirective to handle \n and \t escape sequences

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e-runner

- Client SSE handler uses the same fallback message as the server when languageDirective is missing from the stream
- pipeline-runner.ts extracts and passes languageDirective to generateFullScenes → generateSingleScene → content/actions generators

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace long positional parameter lists with named options objects (SceneContentOptions, SceneActionsOptions) to eliminate cascading undefined arguments at call sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
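The shape of that refactor can be sketched as follows — the fields shown are illustrative, not the actual SceneContentOptions definition:

```typescript
// Before: generateSceneContent(outline, model, undefined, undefined, directive)
// After: one named options object — call sites only mention what they set,
// and new optional fields never shift positional arguments.
// Fields are illustrative assumptions, not the real interface.
interface SceneContentOptions {
  outline: { title: string };
  languageDirective: string;
  webSearchContext?: string;
  pdfContent?: string;
}

function generateSceneContent(opts: SceneContentOptions): string {
  const context = opts.webSearchContext ?? "";
  return `[${opts.languageDirective}] ${opts.outline.title} ${context}`.trim();
}
```

With this shape, a caller that only has an outline and a directive passes exactly those two fields and nothing else.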
- Expand test cases from 22 to 50 covering language learning, immersive, explicit instruction, code-switching, minimal input, user profiles (teacher/parent/tutor/heritage/professional), and cross-language PDF
- Rewrite language inference prompt with clear decision rules for foreign language learning, cross-language PDF, proxy requests, and terminology handling
- Remove redundant pdfLanguageSample (duplicated first 200 chars of pdfContent already in Reference Materials section)
- Add vitest.eval.config.ts for running eval tests separately
- 50/50 pass rate with gemini-3-flash-preview + gpt-4o judge

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eval test cases expanded: 22 → 50

New test dimensions added (28 cases):

- Language learning (5) — Japanese/Korean learners, non-zh/en axis (ja→zh), multi-target language learning
- Immersive learning (2) — Advanced learners requesting full target-language immersion (Japanese→English, Chinese→French)
- Explicit language instruction (2) — User explicitly overrides default language ("请用英文教我" ["teach me in English"], "explain in Chinese please")
- Code-switching & bilingual (2) — Mixed zh/en input, explicit bilingual teaching request
- Minimal / ambiguous input (2) — Single-word requirement ("微积分" ["calculus"]), pinyin romanized input
- User profiles (8) — Teacher designing foreign language lesson, parent proxy for IB student, bilingual heritage speaker, professional business English, immigrant integration, tutor with bilingual student, teacher of Chinese-as-foreign-language
- Cross-language PDF (7) — English req + Chinese PDF, Chinese req + Japanese/French PDF, Japanese req + English PDF, bilingual PDF, teacher using foreign-language material

Also simplified the language inference prompt (system.md ~80→30 lines, removed redundant |
- Remove duplicate early agent resolution block (keep post-outline version that uses languageDirective instead of lang)
- Adapt eval test to async resolveModel signature from main

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Embed language inference directly into outline generation (replacing the standalone LLM call approach), move agent profile generation after outlines, and remove the manual language selector along with all hardcoded zh-CN/en-US conditionals.

Related Issues
Supersedes feat/language-inference branch (#381)
Changes
- Change the outline generation return type from SceneOutline[] to a { languageDirective, outlines } wrapper object
- languageDirective propagates through outline → agent profiles → scene content → scene actions → chat
- Remove the UserRequirements.language field, the toolbar language toggle, and the normalizeLanguage() function
- Remove zh-CN/en-US hardcodes: prompt construction code uses English only; the LLM infers the teaching language from user requirement text
- Eval: gemini-3-flash-preview inference + gpt-4o-mini judge, 21/21 pass

Type of Change
Verification
Steps to reproduce / test
EVAL_INFERENCE_MODEL=google:gemini-3-flash-preview EVAL_JUDGE_MODEL=openai:gpt-4o-mini pnpm vitest run tests/generation/outline-language.eval.test.ts

What you personally verified
Evidence
- Eval result: 21/21 (100%) with gemini-3-flash-preview + gpt-4o-mini judge
- TypeScript: npx tsc --noEmit, zero errors
- CI passes (pnpm check && pnpm lint && npx tsc --noEmit)
- Manually tested locally
Screenshots / recordings attached (if UI changes)
Checklist