
Fix duplicate Table/Chart sections in to_markdown_by_page#1741

Open
randerzander wants to merge 4 commits into NVIDIA:main from randerzander:markdown_fix

Conversation

@randerzander (Collaborator) commented Mar 27, 2026

When running the core pipeline on multimodal_test.pdf and using to_markdown_by_page, tables and charts are duplicated in the markdown text.

Claude's description:
When multiple chunks per page carry table/chart column data, _collect_page_record appended a section per chunk, producing 3× Table/Chart headers with identical content. _dedupe_blocks couldn't catch this because auto-incremented headers made identical blocks appear distinct.

Fix: deduplicate sections by content-only key (stripping the numeric header) before combining with text blocks, and filter out text blocks whose content is already covered by a labeled section.
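The content-only dedup described above can be sketched minimally. This is an illustrative standalone snippet, not the actual code in markdown.py; the data and variable names are invented, and the pattern uses `.+?` so multi-word labels are also stripped:

```python
import re

# Three sections on one page; the first two carry identical table content
# under auto-incremented headers, so a naive exact-match dedup misses them.
blocks = [
    "### Table 1\n\n| Animal | Activity |",
    "### Table 2\n\n| Animal | Activity |",
    "### Chart 1\n\n(chart content)",
]

seen: set[str] = set()
deduped: list[str] = []
for block in blocks:
    # strip the "### Label N" header so only the content is compared
    key = re.sub(r"^### .+? \d+\n\n", "", block.strip())
    if key not in seen:
        seen.add(key)
        deduped.append(block)

print(len(deduped))  # 2 — the two identical tables collapse into one
```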


Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
randerzander requested review from a team as code owners March 27, 2026 17:54
```python
# markdown-formatted table from the first page
>>> chunks[1]["text"]
'| Table | 1 |\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
'| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
```
@randerzander (Collaborator, Author) commented:

Notice here that the text surrounding the table (captured in the crop because it increases relevance for recall) is also included in the markdown output.

TODO: exclude surrounding text from markdown formatting

randerzander requested a review from jioffe502 March 27, 2026 19:50
@greptile-apps bot (Contributor) commented Apr 9, 2026

Greptile Summary

This PR fixes duplicate Table/Chart sections in to_markdown_by_page by deduplicating sections on content alone (stripping the auto-incremented ### Label N header before comparing) and filtering text blocks already covered by a labeled section. The README is also updated to reflect the current non-nested return shape.

  • P1 – regex gap for multi-word labels: The pattern `^### \S+ \d+\n\n` uses `\S+` (which cannot match spaces), so `### Page Image 1\n\n` is never matched and Page Image sections will not be deduplicated — replacing `\S+` with a non-greedy `.+?` resolves this.
  • P2 – no regression test: The specific duplication scenario (multiple page-record chunks sharing identical table/chart column data) has no corresponding test, leaving the fix without a guard against future regression.

Confidence Score: 4/5

Safe to merge after fixing the regex to handle multi-word labels; the primary Table/Chart fix is correct.

A P1 gap exists: the deduplication regex silently fails for any label containing a space (currently only 'Page Image'), meaning that type of duplication is not fixed. The one-character regex fix is straightforward and low-risk, but it should be made before merging to avoid shipping an incomplete fix.

nemo_retriever/src/nemo_retriever/io/markdown.py — review the regex on line 60

Vulnerabilities

No security concerns identified.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/io/markdown.py | Adds content-based section deduplication in to_markdown_by_page, but the regex `\S+` fails to strip headers for multi-word labels like "Page Image", leaving those duplicates un-deduplicated. |
| nemo_retriever/README.md | Documentation update: corrects outdated examples that showed the old nested-by-filename API shape and replaces them with the fixed, non-duplicated output. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[page_content sections list] --> B[Loop over each section block]
    B --> C[Strip header via regex to get content_key]
    C --> D{content_key already seen?}
    D -- No --> E[Add to seen set and deduped list]
    D -- Yes --> F[Discard duplicate]
    E --> G[Filter text_blocks]
    A2[page_content text_blocks] --> G
    G --> H{block.strip in seen set?}
    H -- Yes --> I[Discard: covered by section]
    H -- No --> J[Keep text block]
    J --> K[_dedupe_blocks on combined list]
    E --> K
    K --> L[Rendered page markdown]
```
Reviews (1): Last reviewed commit: "Merge branch 'main' into markdown_fix"

```python
seen_section_content: set[str] = set()
deduped_sections: list[str] = []
for block in page_content.sections:
    content_key = re.sub(r"^### \S+ \d+\n\n", "", block.strip())
```
@greptile-apps bot (Contributor) commented:

P1 Regex fails for multi-word labels like "Page Image"

\S+ matches only non-whitespace, so a section header like ### Page Image 1\n\n is never stripped — \S+ consumes Page, the regex then expects \d+ but sees Image, causing re.sub to return the block unchanged. Duplicate Page Image sections (e.g. a page with multiple OCR image chunks carrying the same content) will not be deduplicated by this new logic.

Replace \S+ with a non-greedy .+? so any label including multi-word ones is correctly stripped: re.sub(r"^### .+? \d+\n\n", "", block.strip())
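The difference between the two patterns can be checked directly. This is a standalone demonstration snippet, not code from the PR; the sample `block` string is invented:

```python
import re

block = "### Page Image 1\n\nsome OCR'd content"

# original pattern: \S+ cannot span the space in "Page Image", so no match
old = re.sub(r"^### \S+ \d+\n\n", "", block)
# suggested pattern: non-greedy .+? matches any label up to the number
new = re.sub(r"^### .+? \d+\n\n", "", block)

print(old == block)  # True — the header was not stripped
print(new)           # some OCR'd content
```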


Comment on lines +56 to +66

```python
# deduplicate sections by content only (ignore auto-incremented header numbers)
seen_section_content: set[str] = set()
deduped_sections: list[str] = []
for block in page_content.sections:
    content_key = re.sub(r"^### \S+ \d+\n\n", "", block.strip())
    if content_key not in seen_section_content:
        seen_section_content.add(content_key)
        deduped_sections.append(block)
# exclude text blocks whose content is already represented by a section
text_blocks = [b for b in page_content.text_blocks if b.strip() not in seen_section_content]
blocks = _dedupe_blocks(text_blocks + deduped_sections)
```
@greptile-apps bot (Contributor) commented:

P2 No regression test for the duplicate-section scenario

The PR fixes a concrete bug (multiple chunks per page each carrying the same table/chart column data producing 3× headers), but no test covers that path. The existing tests in test_io_markdown.py don't construct a record list with multiple entries sharing identical table/chart column entries. Without such a test, a future refactor of _collect_page_record could silently re-introduce the duplication.
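A regression test along these lines could guard that path. Note that `dedupe_sections` below is an illustrative stand-in for the logic in markdown.py, not the module's real API, and the table data is invented:

```python
import re

def dedupe_sections(sections: list[str]) -> list[str]:
    # stand-in for the content-only dedup under test
    seen: set[str] = set()
    out: list[str] = []
    for block in sections:
        key = re.sub(r"^### .+? \d+\n\n", "", block.strip())
        if key not in seen:
            seen.add(key)
            out.append(block)
    return out

def test_duplicate_table_sections_collapse():
    table = "| Animal | Activity | Place |\n| Giraffe | Driving | Beach |"
    # three chunks on one page carried identical table column data
    sections = [f"### Table {i}\n\n{table}" for i in (1, 2, 3)]
    assert dedupe_sections(sections) == [f"### Table 1\n\n{table}"]

test_duplicate_table_sections_collapse()
```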

