
Fix duplicate Table/Chart sections in to_markdown_by_page#1741

Open
randerzander wants to merge 4 commits into NVIDIA:main from randerzander:markdown_fix

Conversation

@randerzander (Collaborator) commented Mar 27, 2026

When running the core pipeline on multimodal_test.pdf and using to_markdown_by_page, tables and charts are duplicated in the markdown text.

Claude's description:
When multiple chunks per page carry table/chart column data, _collect_page_record appended a section per chunk, producing 3× Table/Chart headers with identical content. _dedupe_blocks couldn't catch this because auto-incremented headers made identical blocks appear distinct.

Fix: deduplicate sections by content-only key (stripping the numeric header) before combining with text blocks, and filter out text blocks whose content is already covered by a labeled section.
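The content-only dedup described above can be sketched minimally. This is an illustrative standalone snippet, not the actual code in markdown.py; the data and variable names are invented, and the pattern uses `.+?` so multi-word labels are also stripped:

```python
import re

# Three sections on one page; the first two carry identical table content
# under auto-incremented headers, so a naive exact-match dedup misses them.
blocks = [
    "### Table 1\n\n| Animal | Activity |",
    "### Table 2\n\n| Animal | Activity |",
    "### Chart 1\n\n(chart content)",
]

seen: set[str] = set()
deduped: list[str] = []
for block in blocks:
    # strip the "### Label N" header so only the content is compared
    key = re.sub(r"^### .+? \d+\n\n", "", block.strip())
    if key not in seen:
        seen.add(key)
        deduped.append(block)

print(len(deduped))  # 2 — the two identical tables collapse into one
```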


Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
randerzander requested review from a team as code owners March 27, 2026 17:54
```python
# markdown-formatted table from the first page
>>> chunks[1]["text"]
'| Table | 1 |\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
'| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
```
@randerzander (Collaborator, Author) commented:

Notice here that the text surrounding the table (captured in the crop because it increases relevance for recall) is also included in the markdown output.

TODO: exclude surrounding text from markdown formatting

randerzander requested a review from jioffe502 March 27, 2026 19:50
@greptile-apps bot (Contributor) commented Apr 9, 2026

Greptile Summary

This PR fixes duplicate Table/Chart sections in to_markdown_by_page by deduplicating sections on content alone (stripping the auto-incremented ### Label N header before comparing) and filtering text blocks already covered by a labeled section. The README is also updated to reflect the current non-nested return shape.

  • P1 – regex gap for multi-word labels: The pattern `^### \S+ \d+\n\n` uses `\S+` (which cannot match spaces), so `### Page Image 1\n\n` is never matched and Page Image sections will not be deduplicated — replacing `\S+` with a non-greedy `.+?` resolves this.
  • P2 – no regression test: The specific duplication scenario (multiple page-record chunks sharing identical table/chart column data) has no corresponding test, leaving the fix without a guard against future regression.

Confidence Score: 4/5

Safe to merge after fixing the regex to handle multi-word labels; the primary Table/Chart fix is correct.

A P1 gap exists: the deduplication regex silently fails for any label containing a space (currently only 'Page Image'), meaning that type of duplication is not fixed. The one-character regex fix is straightforward and low-risk, but it should be made before merging to avoid shipping an incomplete fix.

nemo_retriever/src/nemo_retriever/io/markdown.py — review the regex on line 60

Vulnerabilities

No security concerns identified.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/io/markdown.py | Adds content-based section deduplication in to_markdown_by_page, but the regex `\S+` fails to strip headers for multi-word labels like "Page Image", leaving those duplicates un-deduplicated. |
| nemo_retriever/README.md | Documentation update: corrects outdated examples that showed the old nested-by-filename API shape and replaces them with the fixed, non-duplicated output. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[page_content sections list] --> B[Loop over each section block]
    B --> C[Strip header via regex to get content_key]
    C --> D{content_key already seen?}
    D -- No --> E[Add to seen set and deduped list]
    D -- Yes --> F[Discard duplicate]
    E --> G[Filter text_blocks]
    A2[page_content text_blocks] --> G
    G --> H{block.strip in seen set?}
    H -- Yes --> I[Discard: covered by section]
    H -- No --> J[Keep text block]
    J --> K[_dedupe_blocks on combined list]
    E --> K
    K --> L[Rendered page markdown]
```
Reviews (1): Last reviewed commit: "Merge branch 'main' into markdown_fix"

```python
seen_section_content: set[str] = set()
deduped_sections: list[str] = []
for block in page_content.sections:
    content_key = re.sub(r"^### \S+ \d+\n\n", "", block.strip())
```
@greptile-apps bot (Contributor) commented:

P1 Regex fails for multi-word labels like "Page Image"

\S+ matches only non-whitespace, so a section header like ### Page Image 1\n\n is never stripped — \S+ consumes Page, the regex then expects \d+ but sees Image, causing re.sub to return the block unchanged. Duplicate Page Image sections (e.g. a page with multiple OCR image chunks carrying the same content) will not be deduplicated by this new logic.

Replace \S+ with a non-greedy .+? so any label including multi-word ones is correctly stripped: re.sub(r"^### .+? \d+\n\n", "", block.strip())
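The difference between the two patterns can be checked directly. This is a standalone demonstration snippet, not code from the PR; the sample `block` string is invented:

```python
import re

block = "### Page Image 1\n\nsome OCR'd content"

# original pattern: \S+ cannot span the space in "Page Image", so no match
old = re.sub(r"^### \S+ \d+\n\n", "", block)
# suggested pattern: non-greedy .+? matches any label up to the number
new = re.sub(r"^### .+? \d+\n\n", "", block)

print(old == block)  # True — the header was not stripped
print(new)           # some OCR'd content
```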


Comment on lines +56 to +66

```python
# deduplicate sections by content only (ignore auto-incremented header numbers)
seen_section_content: set[str] = set()
deduped_sections: list[str] = []
for block in page_content.sections:
    content_key = re.sub(r"^### \S+ \d+\n\n", "", block.strip())
    if content_key not in seen_section_content:
        seen_section_content.add(content_key)
        deduped_sections.append(block)
# exclude text blocks whose content is already represented by a section
text_blocks = [b for b in page_content.text_blocks if b.strip() not in seen_section_content]
blocks = _dedupe_blocks(text_blocks + deduped_sections)
```
@greptile-apps bot (Contributor) commented:

P2 No regression test for the duplicate-section scenario

The PR fixes a concrete bug (multiple chunks per page each carrying the same table/chart column data producing 3× headers), but no test covers that path. The existing tests in test_io_markdown.py don't construct a record list with multiple entries sharing identical table/chart column entries. Without such a test, a future refactor of _collect_page_record could silently re-introduce the duplication.
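A regression test along these lines could guard that path. Note that `dedupe_sections` below is an illustrative stand-in for the logic in markdown.py, not the module's real API, and the table data is invented:

```python
import re

def dedupe_sections(sections: list[str]) -> list[str]:
    # stand-in for the content-only dedup under test
    seen: set[str] = set()
    out: list[str] = []
    for block in sections:
        key = re.sub(r"^### .+? \d+\n\n", "", block.strip())
        if key not in seen:
            seen.add(key)
            out.append(block)
    return out

def test_duplicate_table_sections_collapse():
    table = "| Animal | Activity | Place |\n| Giraffe | Driving | Beach |"
    # three chunks on one page carried identical table column data
    sections = [f"### Table {i}\n\n{table}" for i in (1, 2, 3)]
    assert dedupe_sections(sections) == [f"### Table 1\n\n{table}"]

test_duplicate_table_sections_collapse()
```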

