fix(desktop): repair inline math rendering for LLM output #3666
Open
lightfront wants to merge 6 commits into
Four targeted fixes to the math-pipeline pre-pass that resolve cases
where the rendered chat output showed LaTeX source as raw text:
1. mathNormalize.ts (Step 2.5): when the model writes block math with
the opening $$ glued to prose on the same line ('…decomposes
as$$\n\mathbf{6}…'), CommonMark requires a blank line before
the $$. remark-math otherwise creates an empty math node and the
formula leaks out as literal text. Insert \n\n before any $$
preceded by a letter or end-of-sentence punctuation. The
freshly-rewritten \] → $$ from step 2 is not affected.
2. mathClassify.ts: classify single digits ($1$, $2$) as math —
commonly used as set / sequence indices. Multi-digit numbers,
decimals, and percentages stay literal (still currency / percentage).
This is a deliberate behavior change documented in the comment.
3. mathClassify.ts: allow comma-separated tokens ('A, B', '1, 2, 3',
'\\alpha, \\beta', '(A, B)') as math. These are typical of
ordered-pair / tuple / enumeration notation. Currency and env-var
usage never looks like this.
4. mathClassify.ts: allow single uppercase letters as math. In
non-English math prose (Chinese / Japanese / Korean textbooks)
single capital letters are extremely common as set / algebra /
group / vector-space names, and the closing-dollar form $X$ is
essentially never written for English words like I/A/V by hand.
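
The three classifier rules above (items 2 to 4) can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`isSingleDigit`, `isSingleUppercase`, `isCommaSeparatedTokens`, `looksLikeMinimalMath`) are hypothetical and are not taken from mathClassify.ts, and the token grammar is a guess based on the examples in the commit message.

```typescript
// Hypothetical sketch; the real mathClassify.ts may be organized differently.

// One token of a comma-separated list: a letter, an integer, or a LaTeX command.
const LIST_TOKEN = /^(?:[A-Za-z]|\d+|\\[A-Za-z]+)$/;

function isSingleDigit(body: string): boolean {
  return /^\d$/.test(body);
}

function isSingleUppercase(body: string): boolean {
  return /^[A-Z]$/.test(body);
}

function isCommaSeparatedTokens(body: string): boolean {
  // Strip one optional pair of enclosing parentheses: "(A, B)" becomes "A, B".
  const inner = /^\(.*\)$/.test(body) ? body.slice(1, -1) : body;
  const parts = inner.split(",").map((p) => p.trim());
  return parts.length >= 2 && parts.every((p) => LIST_TOKEN.test(p));
}

function looksLikeMinimalMath(body: string): boolean {
  const s = body.trim();
  return isSingleDigit(s) || isSingleUppercase(s) || isCommaSeparatedTokens(s);
}
```

Multi-digit numbers, decimals, and percentages fall through all three checks and stay literal. Note that this sketch would also accept `1,000` as a token list, so a real classifier has to decide how to treat thousands separators.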
Test changes: 4 existing currency/acronym assertions updated to
reflect the new behavior, 13 new regression tests covering all four
fixes including the user's specific cases ('$1$ 和 $2$' and
'$S$ 非空 / $S$ 有上界'). 98 math-golden tests pass, 112/112 across
all suites, typecheck clean.
Orphan $$ (model wrote display math but forgot the closing $$) is
documented as not-fixed-from-the-renderer: every attempt to rescue
the orphan from the renderer side made the output worse, so the fix
for that case is on the LLM side (post-generation lint or stricter
system prompt).
The classifier rules are language-agnostic, not specific to CJK text. Updated test section name and descriptions to reflect that patterns like single digits, comma-separated tokens, and one-sided operators apply universally across languages. Chinese text in test cases remains as real user examples, but the rules themselves are not CJK-specific.
Add defensive escaping for code blocks containing $ characters. When protecting code (inline `...` or fenced ```...```), replace $ with &#36; (HTML entity). On restoration, unescape back to $. This prevents KaTeX from attempting to parse math delimiters that appear in code examples, regex patterns, or template literals.
Fixes: Pasted documentation about the math pipeline itself no longer shows red KaTeX error text.
Tests: 3 new cases added, 106/106 passing.
Remove the requirement that ``` must appear after a newline. This handles cases where documentation is pasted on a single line with embedded code blocks containing $ symbols.
Previously: ``` markers were only recognized after \n
Now: ``` markers are recognized anywhere
This prevents KaTeX errors (red text) when processing malformed code blocks that contain $ in regex patterns, template literals, or other code examples.
All 120 tests pass.
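
Recognizing the fence marker anywhere in the text amounts to treating each marker as a toggle between prose and code. A minimal sketch, assuming a hypothetical helper `splitOnFences` (this is not the PR's `fencedCodeEnd`, whose actual logic is not shown here):

```typescript
// Hypothetical sketch of toggle-style fence detection.
// The marker is built with repeat() so the example contains no literal fence.
const FENCE = "`".repeat(3);

interface FenceSegment {
  text: string;
  isCode: boolean;
}

function splitOnFences(src: string): FenceSegment[] {
  // Splitting on the marker alternates prose, code, prose, code, ...
  // so odd-indexed pieces are the code segments.
  return src
    .split(FENCE)
    .map((text, i) => ({ text, isCode: i % 2 === 1 }))
    .filter((seg) => seg.text.length > 0);
}
```

With an odd number of markers the trailing piece is treated as code, which is one possible choice for an unclosed fence; the PR does not say how that case is handled.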
Enhancements to inline math detection:
- Reject pure numbers (1, 2.5, 10) as currency/percentages
- Accept numbers with variables (2.5x, 3y^2) as math
- Accept numbers with LaTeX escapes (10\%) as math
- Fix single-line code block detection to protect $ in malformed markdown
This better matches real-world usage where 'costs $5' is currency but '$2.5x + 3$' is clearly a mathematical expression.
All 122 tests pass (108 math-golden + 8 text-size + 6 provider-model-refresh).
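
The number rules listed in this commit message could look roughly like the sketch below. The function name `classifyNumeric` and the `Verdict` type are hypothetical, and the function only covers bodies that start with a number; it returns `null` for anything else so that other rules can decide.

```typescript
// Hypothetical sketch of the number-led rules from this commit message.
type Verdict = "math" | "literal";

function classifyNumeric(body: string): Verdict | null {
  const s = body.trim();
  // Pure numbers ("1", "2.5", "10", "50%") read as currency or percentages.
  if (/^\d+(?:\.\d+)?%?$/.test(s)) return "literal";
  // A number followed directly by a letter reads as implicit multiplication: "2.5x", "3y^2".
  if (/^\d+(?:\.\d+)?[A-Za-z]/.test(s)) return "math";
  // A number followed by a LaTeX escape or command: "10\%", "5\cdot3".
  if (/^\d+(?:\.\d+)?\\/.test(s)) return "math";
  // Not a number-led body: leave the decision to other classifier rules.
  return null;
}
```

The second rule is deliberately loose: it would also accept strings such as `5k`, which a production classifier might want to treat differently.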
Previously, the Step 5 regex would greedily match '$5 and $' as a single math expression with content '5 and ', then convert it to '&#36;5 and &#36;' because the classifier correctly identified it as non-math. This was visually correct but had two problems:
1. The greedy match would consume the dollar sign of the next currency token as its closing delimiter, causing cascading replacements.
2. Prose currency like 'These two apples cost $5 and $6' would have its dollar signs converted to HTML entities, which works but is unnecessary noise in the rendered output.
Changes:
- Step 5 regex now uses non-greedy matching (+?) so '$5 and $' is no longer matched as a single pair
- When the classifier rejects a match, the original text is preserved unchanged (return _m) instead of being wrapped in HTML entities
- This keeps dollar signs visible in prose while still preventing them from being parsed as math
All 122 tests pass.
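
The second change, returning the original match when the classifier rejects it, can be sketched as a `String.prototype.replace` callback. Everything here is an assumption for illustration: `rewriteInlineDollars`, `Classifier`, and the `wrapMath` parameter are hypothetical names, and the regex is a simplified stand-in for the real Step 5 pattern.

```typescript
// Hypothetical sketch of a Step 5 style pass over $...$ pairs.
type Classifier = (body: string) => boolean;

function rewriteInlineDollars(
  text: string,
  isMath: Classifier,
  wrapMath: (body: string) => string,
): string {
  // Non-greedy body; "." does not cross newlines.
  const pair = /\$(.+?)\$/g;
  return text.replace(pair, (m: string, body: string) =>
    // Rejected pairs are returned untouched instead of being entity-escaped.
    isMath(body) ? wrapMath(body) : m,
  );
}
```

Because the callback returns a string rather than a replacement pattern, any `$` in the returned text is inserted literally.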
Fix: Inline math rendering for LLM output
Problem
LaTeX source was being rendered as raw text (with red KaTeX errors) in several common scenarios:
- Block math `$$` glued to prose without blank lines
- Single digits (`$1$`, `$2$`) and multi-digit numbers rejected as currency
- Single uppercase letters (`$S$`, `$A$`) rejected as English words
- One-sided comparisons (`< B`, `A <`) not recognized as math
- `$` symbols inside code blocks causing KaTeX parse errors
Solution
1. Block math blank-line repair (mathNormalize.ts Step 2.5)
When `$$` appears after prose (letter or punctuation) without a blank line, insert `\n\n` before it. This satisfies CommonMark's requirement that block math be separated from surrounding content.
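
A minimal sketch of the blank-line repair, assuming a hypothetical function `repairGluedBlockMath`. It treats every other `$$` as an opener so that closing delimiters are left alone; the character class for "letter or punctuation" (ASCII letters, a basic CJK range, and `.!?:`) is an assumption, not the pattern used in mathNormalize.ts.

```typescript
// Hypothetical sketch of the Step 2.5 repair.
function repairGluedBlockMath(src: string): string {
  let seen = 0;
  return src.replace(/\$\$/g, (match: string, offset: number, whole: string) => {
    // Even-numbered occurrences (0, 2, ...) are treated as opening delimiters.
    const isOpener = seen % 2 === 0;
    seen += 1;
    const prev = offset > 0 ? whole[offset - 1] : "";
    // "Glued" means the opener directly follows a letter or sentence punctuation.
    const glued = /[A-Za-z\u4e00-\u9fff.!?:]/.test(prev);
    return isOpener && glued ? "\n\n" + match : match;
  });
}
```

For the input `decomposes as$$`, followed by a formula and a closing `$$` on its own line, only the first delimiter is moved onto its own paragraph.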
2. Improved math classifier (mathClassify.ts)
Added recognition for common minimal-LaTeX patterns:
Pure numbers and letters:
- `$1$`, `$2$`, `$42$`, `$2.5$` → math (counts, indices, values)
- `$S$`, `$A$`, `$I$` → math (set/algebra/group names)
Number combinations:
- `A, B`, `1, 2, 3`, `(A, B)` → math (tuples, pairs)
- `$2.5x$`, `$3y^2$` → math (implicit multiplication)
- `$10\%$`, `$5\cdot3$` → math (math operators)
One-sided comparisons:
- `< B`, `<= 0`, `A <`, `B <=` → math (implicit operand)
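
The one-sided comparison rule can be illustrated with two anchored patterns, one for a leading comparator and one for a trailing comparator. The name `isOneSidedComparison` is hypothetical, and the operand grammar (an identifier, a number, or a LaTeX command) is an assumption. The `>` and `>=` forms are included by symmetry and are not mentioned in the PR.

```typescript
// Hypothetical sketch of one-sided comparison detection.

// Comparator first, operand second: "< B", "<= 0".
const LEADING = /^(?:<=|>=|<|>)\s*(?:[A-Za-z]\w*|\d+(?:\.\d+)?|\\[A-Za-z]+)$/;
// Operand first, comparator second: "A <", "B <=".
const TRAILING = /^(?:[A-Za-z]\w*|\d+(?:\.\d+)?|\\[A-Za-z]+)\s*(?:<=|>=|<|>)$/;

function isOneSidedComparison(body: string): boolean {
  const s = body.trim();
  return LEADING.test(s) || TRAILING.test(s);
}
```

Two-sided comparisons such as `A < B` do not match either pattern here; they are assumed to be covered by the classifier's existing operator rules.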
3. Code block `$` protection (mathNormalize.ts)
Escape `$` to `&#36;` inside code blocks and inline code to prevent KaTeX from attempting to parse them as math. The restoration step does NOT unescape `&#36;` back to `$`, keeping the HTML entities in the final output.
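
A minimal sketch of the escaping step, assuming a hypothetical function `escapeDollarsInCode` (the PR's own helper is `protectMarkdownCode`, whose implementation is not shown here). The segment regex is a simplification: fenced blocks are matched lazily between two fence markers, and inline code is a single-backtick span on one line.

```typescript
// Hypothetical sketch of "$" escaping inside code segments.
const DOLLAR_ENTITY = "&#36;";
// Built with repeat() so the example contains no literal fence marker.
const FENCE_MARK = "`".repeat(3);

// Fenced blocks first, then single-backtick inline spans.
const CODE_SEGMENT = new RegExp(
  FENCE_MARK + "[\\s\\S]*?" + FENCE_MARK + "|`[^`\\n]+`",
  "g",
);

function escapeDollarsInCode(src: string): string {
  return src.replace(CODE_SEGMENT, (segment: string) =>
    // Only dollars inside the matched code segment are rewritten.
    segment.replace(/\$/g, DOLLAR_ENTITY),
  );
}
```

Dollar signs outside code segments are untouched, so real math delimiters in the surrounding prose still reach the math parser.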
4. Lenient fenced code detection (mathNormalize.ts)
Removed the requirement that ``` must appear after a newline. Now accepts ``` anywhere in the text, handling malformed code blocks that are all on one line.
5. Single-line code block fix (mathNormalize.ts)
When the entire document is on one line (no newlines), treat ``` as a simple toggle: first occurrence is opening, second is closing. This handles pasted documentation where code blocks are inline.
6. Preserve prose currency (mathNormalize.ts Step 5)
Changed the regex to non-greedy matching and removed HTML entity conversion for non-math pairs. Now "These two apples cost $5 and $6" preserves its dollar signs unchanged instead of converting them to `&#36;5` and `&#36;6`.
Changes
Files modified
- desktop/frontend/src/components/mathNormalize.ts
  - […] `$$`
  - protectMarkdownCode: escape `$` to `&#36;` in code segments
  - […] `&#36;` back to `$`
  - fencedCodeEnd: removed newline requirement, added single-line toggle logic
- desktop/frontend/src/components/mathClassify.ts
  - […] (`$2.5x$`) or LaTeX (`$10\%$`) as math
- desktop/frontend/src/__tests__/math-golden.test.ts
  - […] `$` symbols
Test coverage
Trade-offs and limitations
Orphan `$$` (model-side issue)
Problem: Model outputs `$$...` without closing `$$`, causing the parser to swallow everything until the next `$$`.
Why not fixed: Every attempt to rescue orphan `$$` from the renderer side made the output worse (whole prose paragraphs wrapped in `$…$`).
Right fix: upstream, via better LLM prompting or a post-generation lint.
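
As an illustration of what a post-generation lint might check, the sketch below flags text with an odd number of `$$` delimiters. The function name `hasOrphanDisplayMath` is hypothetical and the check is not part of this PR. It also ignores code blocks, so a real lint would need to skip those first.

```typescript
// Hypothetical lint sketch: an odd count of "$$" means one delimiter is unpaired.
function hasOrphanDisplayMath(text: string): boolean {
  const count = (text.match(/\$\$/g) ?? []).length;
  return count % 2 === 1;
}
```

A caller could use the result to request a regeneration or to append a warning, rather than trying to repair the output in the renderer.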
Testing
All existing tests pass, plus new regression tests covering:
- `$1$`, `$42$`, `$2.5$` → math
- `$2.5x$` → math
- `$10\%$` → math
- `A, B`, `(A, B)` → math
- `< B`, `A <` → math
- `$S$`, `$A$` → math
- […] `$` → protected
Real-world examples
These examples from user reports now render correctly:
→ No KaTeX errors, $ symbols protected