-
Notifications
You must be signed in to change notification settings - Fork 7
Add illuminator notebook #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
adb4520
to
420d143
Compare
@@ -40,37 +41,37 @@ def summarize_tables(doc) -> Tuple[int, List[int]]: | |||
|
|||
return num_tables, pages | |||
|
|||
|
|||
def analyze_pdf_with_docling(file_path) -> Dict[str, Union[int, List[Any], set]]: | |||
def convert_pdf_with_docling(file_path: str) -> DoclingDocument: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice to rename this to `convert_to_docling_document since it's not pdf specific anymore. It would also be nice for this function to have an argument on whether or not the markdown should be saved or not. I'm not sure in every case we want that.
@alinaryan This PR title mentions adding an illuminator notebook but there isn't on in here. Maybe the title should change to reflect the refactoring you're doing. Also if you added the illuminator into the |
974086f
to
f6f8fc0
Compare
@alimaredia |
Signed-off-by: Alina Ryan <[email protected]>
f6f8fc0
to
f276f14
Compare
@@ -38,7 +38,7 @@ | |||
"WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME\n", | |||
"WORKSPACE_DIR.mkdir(exist_ok=True)\n", | |||
"\n", | |||
"SOURCE_DOCUMENT = None # to process a specific document, set its path here; otherwise, the entire source documents repository will be used\n", | |||
"SOURCE_DOCUMENT = \"/home/alina/dev/pdfs/merged_cell_pdfs/BofA_InterestChecking_en_ADA.pdf\" # to process a specific document, set its path here; otherwise, the entire source documents repository will be used\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"SOURCE_DOCUMENT = \"/home/alina/dev/pdfs/merged_cell_pdfs/BofA_InterestChecking_en_ADA.pdf\" # to process a specific document, set its path here; otherwise, the entire source documents repository will be used\n", | |
"SOURCE_DOCUMENT = None # to process a specific document, set its path here; otherwise, the entire source documents repository will be used\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops missed this! thank you
I think its worth going through this and clearing all the outputs from the |
"id": "2572e2d0-94dc-4ca0-b032-3978af26c9c9", | ||
"metadata": {}, | ||
"source": [ | ||
"This step guides you through analyzing docling-converted documents for problematic table structures, specifically merged table cells, using core functions from the Illuminator codebase." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this text block is an opportunity to explain what your tool does, and why they should optionally use it. I think phrasing this to tell that story might be clearer. Here is an example:
Document conversion isn't always perfect. Data can get distorted or corrupted, which may result in lower quality training outcomes. It is strongly recommended, that you review your converted data. This example uses the Illuminator tool to scan converted documents for common conversion issues.
@kelbrown20 can probably phrase this much better than I can
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The output of document conversion is not always perfect. Data may become distorted or corrupted, which can negatively affect a model's performance after training. While optional, reviewing your converted data is strongly recommended. The following example explains how to use the Illuminator tool to identify common conversion issues."
^ this is what I came up with
I just want to ensure its still technically accurate @iamemilio @alinaryan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds great to me
I like the little emoji's in the console output |
Signed-off-by: Alina Ryan <[email protected]>
f276f14
to
d625f13
Compare
This change adds the Illuminator tool’s core functions to the instructlab-knowledge notebook for analyzing a converted document and summarizing merged table cell issues for each table.
Also refactor's the illuminator to accept json as input and adjusts some imports to be relative