Skip to content

Add illuminator notebook #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

alinaryan
Copy link
Member

@alinaryan alinaryan commented May 5, 2025

This change adds the Illuminator tool’s core functions to the instructlab-knowledge notebook for analyzing a converted document and summarizing merged table cell issues for each table.

Also refactor's the illuminator to accept json as input and adjusts some imports to be relative

@@ -40,37 +41,37 @@ def summarize_tables(doc) -> Tuple[int, List[int]]:

return num_tables, pages


def analyze_pdf_with_docling(file_path) -> Dict[str, Union[int, List[Any], set]]:
def convert_pdf_with_docling(file_path: str) -> DoclingDocument:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to rename this to `convert_to_docling_document since it's not pdf specific anymore. It would also be nice for this function to have an argument on whether or not the markdown should be saved or not. I'm not sure in every case we want that.

@alimaredia
Copy link
Contributor

alimaredia commented May 21, 2025

@alinaryan This PR title mentions adding an illuminator notebook but there isn't on in here. Maybe the title should change to reflect the refactoring you're doing.

Also if you added the illuminator into the instructlab-knowledge notebook do you have an idea of what the calls the notebook would make might look like?

@alinaryan alinaryan force-pushed the add-illuminator-notebook branch 2 times, most recently from 974086f to f6f8fc0 Compare May 21, 2025 20:43
@alinaryan
Copy link
Member Author

This PR title mentions adding an illuminator notebook but there isn't on in here. Maybe the title should change to reflect the refactoring you're doing.
Also if you added the illuminator into the instructlab-knowledge notebook do you have an idea of what the calls the notebook would make might look like?

@alimaredia
Added a commit that adds the notebook and reworded the PR desc. It currently outputs the illuminator results to the notebook output AND to an output file. LMK what you think about this approach

@alinaryan alinaryan force-pushed the add-illuminator-notebook branch from f6f8fc0 to f276f14 Compare May 27, 2025 18:09
@@ -38,7 +38,7 @@
"WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME\n",
"WORKSPACE_DIR.mkdir(exist_ok=True)\n",
"\n",
"SOURCE_DOCUMENT = None # to process a specific document, set its path here; otherwise, the entire source documents repository will be used\n",
"SOURCE_DOCUMENT = \"/home/alina/dev/pdfs/merged_cell_pdfs/BofA_InterestChecking_en_ADA.pdf\" # to process a specific document, set its path here; otherwise, the entire source documents repository will be used\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"SOURCE_DOCUMENT = \"/home/alina/dev/pdfs/merged_cell_pdfs/BofA_InterestChecking_en_ADA.pdf\" # to process a specific document, set its path here; otherwise, the entire source documents repository will be used\n",
"SOURCE_DOCUMENT = None # to process a specific document, set its path here; otherwise, the entire source documents repository will be used\n",

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops missed this! thank you

@iamemilio
Copy link
Contributor

I think its worth going through this and clearing all the outputs from the instructlab-knowledge.ipynb notebook

"id": "2572e2d0-94dc-4ca0-b032-3978af26c9c9",
"metadata": {},
"source": [
"This step guides you through analyzing docling-converted documents for problematic table structures, specifically merged table cells, using core functions from the Illuminator codebase."
Copy link
Contributor

@iamemilio iamemilio May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this text block is an opportunity to explain what your tool does, and why they should optionally use it. I think phrasing this to tell that story might be clearer. Here is an example:

Document conversion isn't always perfect. Data can get distorted or corrupted, which may result in lower quality training outcomes. It is strongly recommended, that you review your converted data. This example uses the Illuminator tool to scan converted documents for common conversion issues.

@kelbrown20 can probably phrase this much better than I can

Copy link

@kelbrown20 kelbrown20 May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The output of document conversion is not always perfect. Data may become distorted or corrupted, which can negatively affect a model's performance after training. While optional, reviewing your converted data is strongly recommended. The following example explains how to use the Illuminator tool to identify common conversion issues."

^ this is what I came up with
I just want to ensure its still technically accurate @iamemilio @alinaryan

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great to me

@iamemilio
Copy link
Contributor

I like the little emoji's in the console output

Signed-off-by: Alina Ryan <[email protected]>
@alinaryan alinaryan force-pushed the add-illuminator-notebook branch from f276f14 to d625f13 Compare May 29, 2025 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants