Rename the "Intro" notebooks to call out the specific functionality they support (PDF to Embeddings) #782

Open
Bytes-Explorer opened this issue Nov 6, 2024 · 10 comments

@Bytes-Explorer
Collaborator

No description provided.

@shahrokhDaijavad
Member

@sujee Please suggest a couple of names other than "intro" for this; once we agree on one, you can submit a PR for the name change. I know you use this example in workshops as an "introductory" notebook, and @Bytes-Explorer's suggestion of PDF2Embeddings (its functionality) is a little too "rigid" as a name for an example, so something along the lines of "Run_your_first-pipeline_pdf2embeddings" (seems too long, doesn't it?) would be more appropriate.

@sujee
Contributor

sujee commented Nov 6, 2024

@Bytes-Explorer @shahrokhDaijavad

How about something along the lines of "pdf processing part 1"?

Totally open to suggestions :)

I plan to add other examples along the lines of

  • PDF processing with OCR
  • PDF processing with tables
  • PDF processing removing PII
  • etc

@shahrokhDaijavad
Member

shahrokhDaijavad commented Nov 6, 2024

@sujee I am ok with "PDF processing Part 1", especially if you are planning to add subsequent examples with OCR, Tables, ... and using the PII transform. Of course, the example does a lot more by showing the effectiveness of exact and fuzzy dedup along the way, but we cannot spell out everything in the name.
@Bytes-Explorer what do you think?

@Bytes-Explorer
Collaborator Author

It would be nice to understand the functionality from the name. How about "PDF processing for RAG"? I would also suggest adding a README in the top-level folder that tells a user what they can learn from each example.

@sujee
Contributor

sujee commented Nov 7, 2024

Very good! How about something like:

  • pdf_processing_1_for_RAG
  • pdf_processing_2_handling_PII
  • pdf_processing_3_handling_duplicates
  • etc

@shahrokhDaijavad
Member

Thanks, @Bytes-Explorer and @sujee. Let's go with pdf_processing_1_for_RAG
@sujee Please submit a PR with this new name (and the README explanation of what we learn from this example). We will then use the same PR to incrementally update to release 0.2.2.dev2 AFTER Michele solves the problem with the new Docling.

@shahrokhDaijavad
Member

@sujee Please see my last comment in #763 and, based on that, submit a PR for renaming this example and potentially taking care of pip installing humanfriendly and the parameter for Ray settings on Colab. Thanks.

@shahrokhDaijavad
Member

@sujee I just remembered that this example will also change its flow from "chunking" documents and then "deduplicating" chunks to "deduplicating" documents first and "chunking" next, so the PR should be submitted after that change.
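
To make the reordered flow concrete, here is a minimal sketch of "deduplicate documents first, then chunk". It is only an illustration under stated assumptions and not the project's actual transforms: the function names (`dedup_documents`, `chunk_document`, `dedup_then_chunk`) are hypothetical, dedup is a plain exact-match hash (no fuzzy dedup), and chunking is a simple fixed-size word window.

```python
import hashlib


def dedup_documents(docs):
    """Exact dedup (hypothetical helper): keep the first document per content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def chunk_document(doc, max_words=4):
    """Naive chunker (hypothetical helper): split into windows of at most max_words words."""
    words = doc.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def dedup_then_chunk(docs, max_words=4):
    """New order: drop duplicate documents first, then chunk only the survivors."""
    chunks = []
    for doc in dedup_documents(docs):
        chunks.extend(chunk_document(doc, max_words))
    return chunks


docs = ["a b c d e f", "a b c d e f", "x y z"]
chunks = dedup_then_chunk(docs, max_words=4)
```

One practical effect of this order is that duplicate documents are never chunked at all, so the later steps see fewer chunks.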

@sujee
Contributor

sujee commented Nov 14, 2024

> @sujee I just remembered that this example will also change its flow from "chunking" documents and then "deduplicating" chunks to "deduplicating" documents first and "chunking" next, so the PR should be submitted after that change.

Yes. A few changes are going to go into this example. I need to verify a couple of issues I raised to get this functionality (#605).

@shahrokhDaijavad
Member

Right, @sujee. The blockers you had in #605, #756 and #767 should all be resolved now. Please test.
