Skip to content

fix(wiki): render image paths with parens; replace Mistune with markdown-it-py#635

Merged
NagariaHussain merged 3 commits into
frappe:developfrom
NagariaHussain:fix/image-paths-with-parens
May 27, 2026
Merged

fix(wiki): render image paths with parens; replace Mistune with markdown-it-py#635
NagariaHussain merged 3 commits into
frappe:developfrom
NagariaHussain:fix/image-paths-with-parens

Conversation

@NagariaHussain
Copy link
Copy Markdown
Collaborator

@NagariaHussain NagariaHussain commented May 27, 2026

Summary

The user-visible bug: Frappe uploads are named like image (14).png, so wiki markdown commonly contains ![](/files/image%20(14).png) (encoded space, literal parens). After the v3 / TipTap migration, both renderers stopped handling those parens — the first ) was treated as the markdown URL closer, so the image broke and .png) leaked out as text. Every image on the Frappe Helpdesk docs hit this.

This PR ships three commits, narrow → broader:

1. TipTap editor fix — frontend/src/components/tiptap-extensions/image-extension.js

The custom marked.js tokenizer captured URLs with [^)"]+, which stops at the first ). Widened the URL group to allow one level of balanced parens (matching CommonMark, which is what marked.js did before this tokenizer overrode it).

2. Public-page fix — wiki/wiki/markdown.py (Mistune workaround)

Mistune 3.2 has the same paren bug at the library level. Extended the existing image URL pre-encoder to also percent-encode (%28 and )%29 (already encoded literal spaces); widened IMAGE_PATTERN to match URLs with balanced parens so the pre-encoder can see them. Added 5 tests covering the paren cases.

3. Replace Mistune with markdown-it-py — wiki/wiki/markdown.py

Rather than keep nursing Mistune's parser quirks, swapped it for markdown-it-py, the Python port of the JS markdown-it library. It's CommonMark-compliant, well-maintained, and handles balanced parens natively.

  • Dependency change: mistune>=3.0markdown-it-py>=3.0, mdit-py-plugins>=0.4
  • What was rewritten: WikiRenderer (mistune.HTMLRenderer subclass) → render-rule overrides on the markdown-it-py renderer (fence/code_block for trailing-whitespace trim; image for video-block detection + script-tag stripping on alt/title). Heading slug + TOC extraction moved out of hidden renderer state into an explicit token walk before render. Plugins: footnote_plugin, tasklists_plugin (table + strikethrough are core).
  • What stayed identical: callout preprocessor (:::note → placeholder → HTML), video preprocessor, image-URL space encoder, inline-code pipe sentinel for table rows (verified by repro that markdown-it-py's table plugin has the same bug as Mistune — it counts raw | per row and drops the block on mismatch).
  • Simplifications enabled by the swap: paren encoding dropped — URLs render as <img src="/files/image%20(14).png">. The image URL encoder shrank back to a single .replace(" ", "%20"). TestImageUrlParensEncoding renamed → TestImageUrlWithParens, assertions flipped to the natural form.

Net diff in markdown.py: −47 lines. All 57 tests pass (52 existing + 5 paren cases).

Test plan

  • Open a wiki page whose markdown contains ![](/files/image%20(14).png) (or any image (N).png Frappe upload) and confirm the image renders on the public page.
  • Open the same page in the editor and confirm the image renders, alt/title round-trip, and the Stack Overflow caption pattern (*caption* on the next line) still works.
  • Spot-check plain ![alt](/files/foo.png) and ![alt](/files/foo.png "title") (no parens) still render.
  • Spot-check a real wiki page with tables (especially one with | inside inline code), callouts, code fences with language hints, task lists, footnotes, and h2/h3 headings (TOC anchors).
  • Confirm video URLs (*.mp4 etc.) still render as the video block.

Reviewer note: the engine swap (commit 3) is opt-in scope. If you'd rather merge just commits 1–2 (the targeted fixes) and review the swap separately, say the word and I'll split it out.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Markdown image handling now correctly supports filenames with spaces and one level of parentheses (e.g., "image (24).png"), so such images render reliably.
  • New Features

    • Markdown processing updated: callouts/asides and block-level video embeds are better recognized and rendered; heading anchors and table-of-contents extraction are improved.
  • Tests

    • Added tests covering image URL encoding and rendering for filenames with parentheses.

Review Change Stack

The custom marked.js image tokenizer captured URLs with `[^)"]+`,
which stops at the first `)` and breaks Frappe uploads named like
`/files/image (24).png` (URL-encoded as `/files/image%20(24).png`).
The inner `)` was treated as the markdown closer, leaving `.png)` as
stray text and the image broken.

Allow one level of balanced parens in the URL — matches CommonMark
behavior that marked.js had before this custom tokenizer overrode it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NagariaHussain NagariaHussain marked this pull request as draft May 27, 2026 12:53
Same paren bug as the previous commit, but in the server-side Mistune
renderer that produces the public page HTML. Mistune 3.2's image
parser stops at the first `)`, so `![](/files/image (14).png)` rendered
as `<img src="/files/image%20(14"...>.png)` — broken image plus
stray text. Every image on the Frappe Helpdesk docs hit this.

- Widen `IMAGE_PATTERN` to allow one level of balanced parens (and the
  spaces that Frappe URLs already needed) so the pre-encoder sees the
  whole URL.
- Pre-encode `(` and `)` to `%28`/`%29` in image URLs alongside the
  existing space → `%20` step, since Mistune can't be coaxed into
  handling literal parens itself.
- Rename `_encode_image_url_spaces` → `_encode_image_url_unsafe_chars`
  to match the broader job.

Adds `TestImageUrlParensEncoding` covering literal parens, the
encoded-space + literal-paren form Frappe actually emits, title and
inline-paragraph variants, and multiple images per document.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Walkthrough

The PR replaces the Mistune renderer with a markdown-it-py pipeline, adds custom token rendering (callouts, images, video), updates IMAGE_PATTERN to allow one level of balanced parentheses in image URLs, rewrites the image URL preprocessor to percent-encode literal spaces while preserving titles, updates imports/dependencies, adjusts heading slug/TOC handling, and adds tests verifying rendered for filenames with parentheses and spaces.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 56.52% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main objective: fixing image path rendering with parentheses and replacing Mistune with markdown-it-py as the server-side renderer.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@NagariaHussain NagariaHussain marked this pull request as ready for review May 27, 2026 13:09
Mistune 3.2's image parser stops at the first `)` in a URL even when
the parens are balanced, which is what bit us on Frappe upload paths
like `/files/image (14).png`. The previous commits worked around it
with a paren pre-encoder; this commit removes the root cause by
switching the public-page renderer to markdown-it-py, a CommonMark-
compliant parser that handles balanced parens natively.

Engine swap:
- `mistune>=3.0` → `markdown-it-py>=3.0`, `mdit-py-plugins>=0.4`
- `WikiRenderer` (mistune.HTMLRenderer subclass) replaced with render-
  rule overrides on the markdown-it-py renderer: `fence`/`code_block`
  (rstrip phantom trailing whitespace) and `image` (video-block
  detection + script-tag stripping on alt/title).
- Heading slug + TOC extraction moved out of the renderer's hidden
  state into an explicit token walk before render.
- Plugins: `footnote_plugin`, `tasklists_plugin`; `table` and
  `strikethrough` are core in markdown-it-py.

Simplifications now that parens parse natively:
- `_encode_image_url_unsafe_chars` → `_encode_image_url_spaces`,
  shrunk to a single `.replace(" ", "%20")`. Parens flow through.
- Tests in `TestImageUrlWithParens` (was `TestImageUrlParensEncoding`)
  now assert the natural `(14)` form rather than `%2814%29`.
- `html.unescape` import and call dropped (the mistune-era TOC text
  was pre-rendered HTML; `renderInlineAsText` returns plain text).

Stays the same — these operate on raw markdown and are engine-agnostic:
- Callout preprocessor (`:::note` blocks → placeholders → HTML)
- Video preprocessor (full-line image markdown with video extension)
- Image-URL space encoder (CommonMark forbids unescaped whitespace
  in URLs; Frappe uploads have them)
- Inline-code pipe sentinel for table rows. Confirmed via repro that
  markdown-it-py's table plugin has the same bug as Mistune here —
  it counts raw `|` per row and rejects the block on mismatch.

All 57 tests pass (52 existing + 5 paren cases added earlier in this
PR, expectations flipped to match the new parser).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NagariaHussain NagariaHussain changed the title fix: render image paths containing parentheses fix(wiki): render image paths with parens; replace Mistune with markdown-it-py May 27, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
wiki/wiki/markdown.py (1)

134-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep block-video handling on the same path for paren filenames and callout bodies.

At Line 399, _process_videos_with_placeholders still relies on VIDEO_MARKDOWN_PATTERN, which rejects the new balanced-parens URL shape, so ![](/files/video%20(1).mp4) falls through to image_render and ends up wrapped as <p><div ...></div></p>. Line 406 has the same outcome for any full-line video inside a callout because md.render skips the video placeholder prepass entirely. Reuse the balanced-parens URL matcher for videos and render callout bodies through the same preprocessing stack as top-level content.

Also applies to: 398-399, 406-406

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@wiki/wiki/markdown.py` around lines 134 - 145, The video-placeholder prepass
is still using VIDEO_MARKDOWN_PATTERN which rejects URLs with balanced parens,
and callout bodies are not preprocessed the same as top-level content; update
_process_videos_with_placeholders to use the balanced-parens URL matcher (the
new regex you added for balanced filenames) instead of VIDEO_MARKDOWN_PATTERN,
and ensure render_inner in _replace_callout_placeholders runs the same
preprocessing/video-placeholder pipeline as top-level rendering (i.e., invoke
the same prepass that replaces video placeholders before Markdown rendering so
full-line videos inside callouts are handled the same way as top-level content).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@wiki/wiki/markdown.py`:
- Around line 326-344: The image_render function currently interpolates src,
alt, and title directly into HTML attributes; update image_render to HTML-escape
these values (after calling _remove_script_tags) before building the tag so
quotes, ampersands and other special chars are encoded for attribute context;
use a suitable utility (e.g., html.escape(value, quote=True) or existing project
HTML-escape helper) on src, alt and title and then use the escaped variables
when constructing the video block and img tag, leaving _is_video_url and the
surrounding logic unchanged.

---

Outside diff comments:
In `@wiki/wiki/markdown.py`:
- Around line 134-145: The video-placeholder prepass is still using
VIDEO_MARKDOWN_PATTERN which rejects URLs with balanced parens, and callout
bodies are not preprocessed the same as top-level content; update
_process_videos_with_placeholders to use the balanced-parens URL matcher (the
new regex you added for balanced filenames) instead of VIDEO_MARKDOWN_PATTERN,
and ensure render_inner in _replace_callout_placeholders runs the same
preprocessing/video-placeholder pipeline as top-level rendering (i.e., invoke
the same prepass that replaces video placeholders before Markdown rendering so
full-line videos inside callouts are handled the same way as top-level content).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a291d110-881f-40eb-b1d9-daf2b00fc78c

📥 Commits

Reviewing files that changed from the base of the PR and between 3fd4168 and f808776.

📒 Files selected for processing (3)
  • pyproject.toml
  • wiki/wiki/markdown.py
  • wiki/wiki/test_markdown.py

Comment thread wiki/wiki/markdown.py
Comment on lines +326 to 344
def image_render(tokens, idx, options, env):
tok = tokens[idx]
src = tok.attrGet("src") or ""
alt = _remove_script_tags(tok.content)
title = _remove_script_tags(tok.attrGet("title") or "")

if _is_video_url(url):
title_attr = f' title="{safe_title}"' if safe_title else ""
if _is_video_url(src):
title_attr = f' title="{title}"' if title else ""
data_alt_attr = f' data-alt="{alt}"' if alt else ""
return (
f'<div data-type="video-block" data-src="{src}"{data_alt_attr}>'
f'<video src="{src}" controls preload="metadata"{title_attr}>'
f'<source src="{src}" />'
"</video></div>"
)

s = f'<img src="{src}" alt="{alt}"'
if safe_title:
s += f' title="{safe_title}"'
if title:
s += f' title="{title}"'
return s + " />"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Escape attribute values before interpolating them into the custom image/video HTML.

src, alt, and title are written into HTML attributes verbatim here. A quote in alt/title breaks the tag and can leak extra attributes into the rendered markup; & in URLs/titles is also emitted raw. Please escape these values for attribute context before concatenation.

Suggested fix
 	def image_render(tokens, idx, options, env):
 		tok = tokens[idx]
-		src = tok.attrGet("src") or ""
-		alt = _remove_script_tags(tok.content)
-		title = _remove_script_tags(tok.attrGet("title") or "")
+		src = escapeHtml(tok.attrGet("src") or "")
+		alt = escapeHtml(_remove_script_tags(tok.content))
+		title = escapeHtml(_remove_script_tags(tok.attrGet("title") or ""))
 
 		if _is_video_url(src):
 			title_attr = f' title="{title}"' if title else ""
 			data_alt_attr = f' data-alt="{alt}"' if alt else ""
 			return (
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@wiki/wiki/markdown.py` around lines 326 - 344, The image_render function
currently interpolates src, alt, and title directly into HTML attributes; update
image_render to HTML-escape these values (after calling _remove_script_tags)
before building the tag so quotes, ampersands and other special chars are
encoded for attribute context; use a suitable utility (e.g., html.escape(value,
quote=True) or existing project HTML-escape helper) on src, alt and title and
then use the escaped variables when constructing the video block and img tag,
leaving _is_video_url and the surrounding logic unchanged.

@NagariaHussain NagariaHussain merged commit e183f28 into frappe:develop May 27, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant