Fix duplicate Table/Chart sections in to_markdown_by_page #1741
base: main
**nemo_retriever/src/nemo_retriever/io/markdown.py**

```diff
@@ -4,6 +4,7 @@
 from __future__ import annotations
 
+import re
 from collections import defaultdict
 from collections.abc import Iterable, Mapping
 from dataclasses import dataclass, field
```
|
```diff
@@ -52,7 +53,17 @@ def to_markdown_by_page(results: object) -> dict[int, str]:
     rendered: dict[int, str] = {}
     for page_number, page_content in sorted(by_page.items(), key=_page_sort_key):
-        blocks = _dedupe_blocks(page_content.text_blocks + page_content.sections)
+        # deduplicate sections by content only (ignore auto-incremented header numbers)
+        seen_section_content: set[str] = set()
+        deduped_sections: list[str] = []
+        for block in page_content.sections:
+            content_key = re.sub(r"^### \S+ \d+\n\n", "", block.strip())
+            if content_key not in seen_section_content:
+                seen_section_content.add(content_key)
+                deduped_sections.append(block)
+        # exclude text blocks whose content is already represented by a section
+        text_blocks = [b for b in page_content.text_blocks if b.strip() not in seen_section_content]
+        blocks = _dedupe_blocks(text_blocks + deduped_sections)
```
**Comment on lines +56 to +66** (Contributor) — `nemo_retriever/src/nemo_retriever/io/markdown.py`, lines 56-66

**No regression test for the duplicate-section scenario**

The PR fixes a concrete bug (multiple chunks per page, each carrying the same table/chart column data, producing 3× the headers), but no test covers that path. The existing tests in `test_io_markdown.py` don't construct a record list with multiple entries sharing identical `table`/`chart` column entries. Without such a test, a future refactor of `_collect_page_record` could silently re-introduce the duplication.
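The record and page-content shapes used by `nemo_retriever` aren't shown in this diff, so a faithful regression test can't be written from here alone. As a sketch, the dedup loop the PR adds can at least be pinned standalone; `dedupe_sections` below is a hypothetical copy of that loop, not a function in the library:

```python
import re


def dedupe_sections(sections: list[str]) -> list[str]:
    """Standalone copy of the PR's dedup loop: key each section on its
    content with the numbered '### <Label> <n>' header stripped, and
    keep only the first occurrence of each content key."""
    seen: set[str] = set()
    kept: list[str] = []
    for block in sections:
        content_key = re.sub(r"^### \S+ \d+\n\n", "", block.strip())
        if content_key not in seen:
            seen.add(content_key)
            kept.append(block)
    return kept


# three chunks from the same page carrying identical table column data,
# differing only in the auto-incremented header number (the 3x-header bug)
sections = [
    "### Table 1\n\n| a | b |",
    "### Table 2\n\n| a | b |",
    "### Table 3\n\n| a | b |",
]
assert dedupe_sections(sections) == ["### Table 1\n\n| a | b |"]
```

A real test would build the duplicated records and assert that `to_markdown_by_page` emits each table header once per page.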
```diff
         header = f"## Page {page_number}" if page_number != _UNKNOWN_PAGE else "## Page Unknown"
         rendered[page_number] = header + ("\n\n" + "\n\n".join(blocks) if blocks else "\n")
```
`\S+` matches only non-whitespace, so a section header like `### Page Image 1\n\n` is never stripped: `\S+` consumes `Page`, the regex then expects `\d+` but sees `Image`, causing `re.sub` to return the block unchanged. Duplicate `Page Image` sections (e.g. a page with multiple OCR image chunks carrying the same content) will not be deduplicated by this new logic.

Replace `\S+` with a non-greedy `.+?` so any label, including multi-word ones, is correctly stripped: `re.sub(r"^### .+? \d+\n\n", "", block.strip())`
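The failure mode described above can be checked directly; a small standalone sketch (the `block` string is a made-up example header):

```python
import re

block = "### Page Image 1\n\ncontent"

# \S+ cannot cross the space in "Page Image": it consumes "Page",
# the regex then expects \d+ but sees "Image", so nothing is stripped
assert re.sub(r"^### \S+ \d+\n\n", "", block) == block

# non-greedy .+? covers multi-word labels as well as single-word ones
assert re.sub(r"^### .+? \d+\n\n", "", block) == "content"
assert re.sub(r"^### .+? \d+\n\n", "", "### Table 2\n\n| a | b |") == "| a | b |"
```

The `.+?` pattern still stops at the first ` <digits>\n\n` boundary, so single-word labels like `Table` keep the same behavior as before.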