Fix duplicate Table/Chart sections in to_markdown_by_page (#1741)
randerzander wants to merge 4 commits into NVIDIA:main from
Conversation
When multiple chunks per page all carry table/chart column data, `_collect_page_record` was appending a section for each chunk, producing 3× Table/Chart headers with identical content. `_dedupe_blocks` could not catch this because auto-incremented headers (`### Table 1`, `### Table 2`) made otherwise-identical blocks appear distinct. Fix: deduplicate sections by a content-only key (stripping the numeric header) before combining with text blocks, and filter out text blocks whose content is already represented by a labeled section.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
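A minimal sketch of the failure mode (illustrative only; the project's actual `_dedupe_blocks` implementation is not shown here): a header-sensitive, exact-match dedup sees the auto-numbered headers as distinct blocks, while a content-only key collapses them.

```python
import re

# Two chunks carried the same table, but auto-incremented headers
# make the rendered blocks differ byte-for-byte.
blocks = ["### Table 1\n\n| a | b |", "### Table 2\n\n| a | b |"]

# Exact-match dedup (stand-in for header-sensitive comparison): both survive.
exact = list(dict.fromkeys(blocks))  # len == 2

# Content-only key: strip the "### <Label> <N>" header before comparing.
keys = {re.sub(r"^### \S+ \d+\n\n", "", b) for b in blocks}  # len == 1
```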
nemo_retriever/README.md (outdated example)

```diff
  # markdown formatted table from the first page
  >>> chunks[1]["text"]
- '| Table | 1 |\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
+ '| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
```
Notice here that the text surrounding the table (captured in crop because it increases relevance for recall) is also included in the markdown format.
ToDo: exclude surrounding text from markdown formatting
Greptile Summary: This PR fixes duplicate Table/Chart sections in `to_markdown_by_page`.
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/io/markdown.py | Adds content-based section deduplication in to_markdown_by_page, but the regex \S+ fails to strip headers for multi-word labels like "Page Image", leaving those duplicates un-deduplicated. |
| nemo_retriever/README.md | Documentation update: corrects outdated examples that showed the old nested-by-filename API shape and replaces them with the fixed, non-duplicated output. |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[page_content sections list] --> B[Loop over each section block]
    B --> C[Strip header via regex to get content_key]
    C --> D{content_key already seen?}
    D -- No --> E[Add to seen set and deduped list]
    D -- Yes --> F[Discard duplicate]
    E --> G[Filter text_blocks]
    A2[page_content text_blocks] --> G
    G --> H{block.strip in seen set?}
    H -- Yes --> I[Discard: covered by section]
    H -- No --> J[Keep text block]
    J --> K[_dedupe_blocks on combined list]
    E --> K
    K --> L[Rendered page markdown]
```
Reviews (1): Last reviewed commit: "Merge branch 'main' into markdown_fix"
```python
seen_section_content: set[str] = set()
deduped_sections: list[str] = []
for block in page_content.sections:
    content_key = re.sub(r"^### \S+ \d+\n\n", "", block.strip())
```
**Regex fails for multi-word labels like "Page Image"**

`\S+` matches only non-whitespace, so a section header like `### Page Image 1\n\n` is never stripped: `\S+` consumes `Page`, the regex then expects `\d+` but sees `Image`, so `re.sub` returns the block unchanged. Duplicate `Page Image` sections (e.g. a page with multiple OCR image chunks carrying the same content) will not be deduplicated by this new logic.

Replace `\S+` with a non-greedy `.+?` so any label, including multi-word ones, is correctly stripped: `re.sub(r"^### .+? \d+\n\n", "", block.strip())`
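The difference is easy to verify in isolation (a standalone sketch of the reviewer's suggestion, not the project's code):

```python
import re

header_blocks = ["### Page Image 1\n\ncontent", "### Table 1\n\ncontent"]

# Current pattern: \S+ stops at whitespace, so "Page Image" never matches
# and the first block passes through unchanged.
old = [re.sub(r"^### \S+ \d+\n\n", "", b.strip()) for b in header_blocks]
# old == ['### Page Image 1\n\ncontent', 'content']

# Suggested pattern: non-greedy .+? spans multi-word labels too.
new = [re.sub(r"^### .+? \d+\n\n", "", b.strip()) for b in header_blocks]
# new == ['content', 'content']
```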
```python
# deduplicate sections by content only (ignore auto-incremented header numbers)
seen_section_content: set[str] = set()
deduped_sections: list[str] = []
for block in page_content.sections:
    content_key = re.sub(r"^### \S+ \d+\n\n", "", block.strip())
    if content_key not in seen_section_content:
        seen_section_content.add(content_key)
        deduped_sections.append(block)
# exclude text blocks whose content is already represented by a section
text_blocks = [b for b in page_content.text_blocks if b.strip() not in seen_section_content]
blocks = _dedupe_blocks(text_blocks + deduped_sections)
```
**No regression test for the duplicate-section scenario**

The PR fixes a concrete bug (multiple chunks per page each carrying the same table/chart column data producing 3× headers), but no test covers that path. The existing tests in `test_io_markdown.py` don't construct a record list with multiple entries sharing identical `table`/`chart` column entries. Without such a test, a future refactor of `_collect_page_record` could silently re-introduce the duplication.
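One way to resolve this: test the dedup behavior directly. Since the full record schema consumed by `to_markdown_by_page` isn't shown here, this sketch extracts the PR's new loop into a hypothetical standalone helper and asserts that three chunks carrying identical table content collapse to a single section.

```python
import re

def dedupe_sections(sections, text_blocks):
    """Hypothetical extraction of the PR's dedup logic, for illustration only."""
    seen: set[str] = set()
    deduped: list[str] = []
    for block in sections:
        # Content-only key: ignore the auto-incremented header number.
        key = re.sub(r"^### \S+ \d+\n\n", "", block.strip())
        if key not in seen:
            seen.add(key)
            deduped.append(block)
    # Drop text blocks already covered by a labeled section.
    texts = [b for b in text_blocks if b.strip() not in seen]
    return texts + deduped

def test_duplicate_table_sections_collapse():
    # Three chunks on one page, all carrying the same table content.
    sections = [f"### Table {i}\n\n| Animal | Place |" for i in (1, 2, 3)]
    out = dedupe_sections(sections, ["| Animal | Place |"])
    assert out == ["### Table 1\n\n| Animal | Place |"]
```

A real regression test would instead build a record list with duplicate `table`/`chart` entries and assert on `to_markdown_by_page`'s rendered output, so the test survives refactors of `_collect_page_record`.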
When running the core pipeline on multimodal_test.pdf and using `to_markdown_by_page`, tables and charts get duplicated in the markdown text.
Claude's description:
When multiple chunks per page carry table/chart column data, `_collect_page_record` appended a section per chunk, producing 3× Table/Chart headers with identical content. `_dedupe_blocks` couldn't catch this because auto-incremented headers made identical blocks appear distinct.

Fix: deduplicate sections by content-only key (stripping the numeric header) before combining with text blocks, and filter out text blocks whose content is already covered by a labeled section.