fix(wiki): render image paths with parens; replace Mistune with markdown-it-py#635
Conversation
The custom marked.js image tokenizer captured URLs with `[^)"]+`, which stops at the first `)` and breaks Frappe uploads named like `/files/image (24).png` (URL-encoded as `/files/image%20(24).png`). The inner `)` was treated as the markdown closer, leaving `.png)` as stray text and the image broken. Allow one level of balanced parens in the URL — matches CommonMark behavior that marked.js had before this custom tokenizer overrode it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same paren bug as the previous commit, but in the server-side Mistune renderer that produces the public page HTML. Mistune 3.2's image parser stops at the first `)`, so `.png)` rendered as `<img src="/files/image%20(14"...>.png)` — broken image plus stray text. Every image on the Frappe Helpdesk docs hit this. - Widen `IMAGE_PATTERN` to allow one level of balanced parens (and the spaces that Frappe URLs already needed) so the pre-encoder sees the whole URL. - Pre-encode `(` and `)` to `%28`/`%29` in image URLs alongside the existing space → `%20` step, since Mistune can't be coaxed into handling literal parens itself. - Rename `_encode_image_url_spaces` → `_encode_image_url_unsafe_chars` to match the broader job. Adds `TestImageUrlParensEncoding` covering literal parens, the encoded-space + literal-paren form Frappe actually emits, title and inline-paragraph variants, and multiple images per document. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WalkthroughThe PR replaces the Mistune renderer with a markdown-it-py pipeline, adds custom token rendering (callouts, images, video), updates IMAGE_PATTERN to allow one level of balanced parentheses in image URLs, rewrites the image URL preprocessor to percent-encode literal spaces while preserving titles, updates imports/dependencies, adjusts heading slug/TOC handling, and adds tests verifying rendered 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Mistune 3.2's image parser stops at the first `)` in a URL even when
the parens are balanced, which is what bit us on Frappe upload paths
like `/files/image (14).png`. The previous commits worked around it
with a paren pre-encoder; this commit removes the root cause by
switching the public-page renderer to markdown-it-py, a CommonMark-
compliant parser that handles balanced parens natively.
Engine swap:
- `mistune>=3.0` → `markdown-it-py>=3.0`, `mdit-py-plugins>=0.4`
- `WikiRenderer` (mistune.HTMLRenderer subclass) replaced with render-
rule overrides on the markdown-it-py renderer: `fence`/`code_block`
(rstrip phantom trailing whitespace) and `image` (video-block
detection + script-tag stripping on alt/title).
- Heading slug + TOC extraction moved out of the renderer's hidden
state into an explicit token walk before render.
- Plugins: `footnote_plugin`, `tasklists_plugin`; `table` and
`strikethrough` are core in markdown-it-py.
Simplifications now that parens parse natively:
- `_encode_image_url_unsafe_chars` → `_encode_image_url_spaces`,
shrunk to a single `.replace(" ", "%20")`. Parens flow through.
- Tests in `TestImageUrlWithParens` (was `TestImageUrlParensEncoding`)
now assert the natural `(14)` form rather than `%2814%29`.
- `html.unescape` import and call dropped (the mistune-era TOC text
was pre-rendered HTML; `renderInlineAsText` returns plain text).
Stays the same — these operate on raw markdown and are engine-agnostic:
- Callout preprocessor (`:::note` blocks → placeholders → HTML)
- Video preprocessor (full-line image markdown with video extension)
- Image-URL space encoder (CommonMark forbids unescaped whitespace
in URLs; Frappe uploads have them)
- Inline-code pipe sentinel for table rows. Confirmed via repro that
markdown-it-py's table plugin has the same bug as Mistune here —
it counts raw `|` per row and rejects the block on mismatch.
All 57 tests pass (52 existing + 5 paren cases added earlier in this
PR, expectations flipped to match the new parser).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
wiki/wiki/markdown.py (1)
134-145:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep block-video handling on the same path for paren filenames and callout bodies.
At Line 399,
_process_videos_with_placeholdersstill relies onVIDEO_MARKDOWN_PATTERN, which rejects the new balanced-parens URL shape, so.mp4)falls through toimage_renderand ends up wrapped as<p><div ...></div></p>. Line 406 has the same outcome for any full-line video inside a callout becausemd.renderskips the video placeholder prepass entirely. Reuse the balanced-parens URL matcher for videos and render callout bodies through the same preprocessing stack as top-level content.Also applies to: 398-399, 406-406
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@wiki/wiki/markdown.py` around lines 134 - 145, The video-placeholder prepass is still using VIDEO_MARKDOWN_PATTERN which rejects URLs with balanced parens, and callout bodies are not preprocessed the same as top-level content; update _process_videos_with_placeholders to use the balanced-parens URL matcher (the new regex you added for balanced filenames) instead of VIDEO_MARKDOWN_PATTERN, and ensure render_inner in _replace_callout_placeholders runs the same preprocessing/video-placeholder pipeline as top-level rendering (i.e., invoke the same prepass that replaces video placeholders before Markdown rendering so full-line videos inside callouts are handled the same way as top-level content).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@wiki/wiki/markdown.py`:
- Around line 326-344: The image_render function currently interpolates src,
alt, and title directly into HTML attributes; update image_render to HTML-escape
these values (after calling _remove_script_tags) before building the tag so
quotes, ampersands and other special chars are encoded for attribute context;
use a suitable utility (e.g., html.escape(value, quote=True) or existing project
HTML-escape helper) on src, alt and title and then use the escaped variables
when constructing the video block and img tag, leaving _is_video_url and the
surrounding logic unchanged.
---
Outside diff comments:
In `@wiki/wiki/markdown.py`:
- Around line 134-145: The video-placeholder prepass is still using
VIDEO_MARKDOWN_PATTERN which rejects URLs with balanced parens, and callout
bodies are not preprocessed the same as top-level content; update
_process_videos_with_placeholders to use the balanced-parens URL matcher (the
new regex you added for balanced filenames) instead of VIDEO_MARKDOWN_PATTERN,
and ensure render_inner in _replace_callout_placeholders runs the same
preprocessing/video-placeholder pipeline as top-level rendering (i.e., invoke
the same prepass that replaces video placeholders before Markdown rendering so
full-line videos inside callouts are handled the same way as top-level content).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a291d110-881f-40eb-b1d9-daf2b00fc78c
📒 Files selected for processing (3)
pyproject.tomlwiki/wiki/markdown.pywiki/wiki/test_markdown.py
| def image_render(tokens, idx, options, env): | ||
| tok = tokens[idx] | ||
| src = tok.attrGet("src") or "" | ||
| alt = _remove_script_tags(tok.content) | ||
| title = _remove_script_tags(tok.attrGet("title") or "") | ||
|
|
||
| if _is_video_url(url): | ||
| title_attr = f' title="{safe_title}"' if safe_title else "" | ||
| if _is_video_url(src): | ||
| title_attr = f' title="{title}"' if title else "" | ||
| data_alt_attr = f' data-alt="{alt}"' if alt else "" | ||
| return ( | ||
| f'<div data-type="video-block" data-src="{src}"{data_alt_attr}>' | ||
| f'<video src="{src}" controls preload="metadata"{title_attr}>' | ||
| f'<source src="{src}" />' | ||
| "</video></div>" | ||
| ) | ||
|
|
||
| s = f'<img src="{src}" alt="{alt}"' | ||
| if safe_title: | ||
| s += f' title="{safe_title}"' | ||
| if title: | ||
| s += f' title="{title}"' | ||
| return s + " />" |
There was a problem hiding this comment.
Escape attribute values before interpolating them into the custom image/video HTML.
src, alt, and title are written into HTML attributes verbatim here. A quote in alt/title breaks the tag and can leak extra attributes into the rendered markup; & in URLs/titles is also emitted raw. Please escape these values for attribute context before concatenation.
Suggested fix
def image_render(tokens, idx, options, env):
tok = tokens[idx]
- src = tok.attrGet("src") or ""
- alt = _remove_script_tags(tok.content)
- title = _remove_script_tags(tok.attrGet("title") or "")
+ src = escapeHtml(tok.attrGet("src") or "")
+ alt = escapeHtml(_remove_script_tags(tok.content))
+ title = escapeHtml(_remove_script_tags(tok.attrGet("title") or ""))
if _is_video_url(src):
title_attr = f' title="{title}"' if title else ""
data_alt_attr = f' data-alt="{alt}"' if alt else ""
return (🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@wiki/wiki/markdown.py` around lines 326 - 344, The image_render function
currently interpolates src, alt, and title directly into HTML attributes; update
image_render to HTML-escape these values (after calling _remove_script_tags)
before building the tag so quotes, ampersands and other special chars are
encoded for attribute context; use a suitable utility (e.g., html.escape(value,
quote=True) or existing project HTML-escape helper) on src, alt and title and
then use the escaped variables when constructing the video block and img tag,
leaving _is_video_url and the surrounding logic unchanged.
Summary
The user-visible bug: Frappe uploads are named like
image (14).png, so wiki markdown commonly contains.png)(encoded space, literal parens). After the v3 / TipTap migration, both renderers stopped handling those parens — the first)was treated as the markdown URL closer, so the image broke and.png)leaked out as text. Every image on the Frappe Helpdesk docs hit this.This PR ships three commits, narrow → broader:
1. TipTap editor fix —
frontend/src/components/tiptap-extensions/image-extension.jsThe custom marked.js tokenizer captured URLs with
[^)"]+, which stops at the first). Widened the URL group to allow one level of balanced parens (matching CommonMark, which is what marked.js did before this tokenizer overrode it).2. Public-page fix —
wiki/wiki/markdown.py(Mistune workaround)Mistune 3.2 has the same paren bug at the library level. Extended the existing image URL pre-encoder to also percent-encode
(→%28and)→%29(already encoded literal spaces); widenedIMAGE_PATTERNto match URLs with balanced parens so the pre-encoder can see them. Added 5 tests covering the paren cases.3. Replace Mistune with markdown-it-py —
wiki/wiki/markdown.pyRather than keep nursing Mistune's parser quirks, swapped it for markdown-it-py, the Python port of the JS markdown-it library. It's CommonMark-compliant, well-maintained, and handles balanced parens natively.
mistune>=3.0→markdown-it-py>=3.0,mdit-py-plugins>=0.4WikiRenderer(mistune.HTMLRenderer subclass) → render-rule overrides on the markdown-it-py renderer (fence/code_blockfor trailing-whitespace trim;imagefor video-block detection + script-tag stripping on alt/title). Heading slug + TOC extraction moved out of hidden renderer state into an explicit token walk before render. Plugins:footnote_plugin,tasklists_plugin(table + strikethrough are core).:::note→ placeholder → HTML), video preprocessor, image-URL space encoder, inline-code pipe sentinel for table rows (verified by repro that markdown-it-py's table plugin has the same bug as Mistune — it counts raw|per row and drops the block on mismatch).<img src="/files/image%20(14).png">. The image URL encoder shrank back to a single.replace(" ", "%20").TestImageUrlParensEncodingrenamed →TestImageUrlWithParens, assertions flipped to the natural form.Net diff in
markdown.py: −47 lines. All 57 tests pass (52 existing + 5 paren cases).Test plan
.png)(or anyimage (N).pngFrappe upload) and confirm the image renders on the public page.*caption*on the next line) still works.and(no parens) still render.|inside inline code), callouts, code fences with language hints, task lists, footnotes, and h2/h3 headings (TOC anchors).*.mp4etc.) still render as the video block.🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
New Features
Tests