PDF parser #2085

@swaroopgv

Question

My requirement is to parse technical documentation, which may contain tables and meaningful images, and provide the contents in JSON format (since it is the suggested format for LLMs?).

Currently I have:

  1. Converted the PDF to Markdown. Since the Markdown output currently doesn't preserve the hierarchy, I have written a helper function that takes the Markdown file, sorts the content based on the number in each section name, puts everything into JSON format, and dumps it to a file.
    (I hope this is the right approach; if there are any better ways, please suggest them.)
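
For reference, the helper described in step 1 can be sketched roughly as follows. This is a minimal illustration, not the exact code in use: it assumes headings look like `## 1.2 Title`, that nesting depth can be read from the dotted section number, and the names `HEADING` and `markdown_to_tree` are hypothetical.

```python
import json
import re

# Assumption: headings carry a dotted section number, e.g. "## 1.2 Installation".
HEADING = re.compile(r"^#+\s+(\d+(?:\.\d+)*)\.?\s+(.*)$")


def markdown_to_tree(markdown_text):
    """Build a nested dict from numbered Markdown headings (hypothetical helper)."""
    root = {"number": "", "title": "root", "content": [], "children": []}
    stack = [root]  # stack[-1] is the section currently being filled
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            number, title = match.groups()
            depth = number.count(".") + 1  # "1" -> 1, "1.2" -> 2, ...
            node = {"number": number, "title": title.strip(),
                    "content": [], "children": []}
            del stack[depth:]  # close sections at the same or a deeper level
            stack[-1]["children"].append(node)
            stack.append(node)
        elif line.strip():
            # Non-heading text belongs to the most recently opened section.
            stack[-1]["content"].append(line.strip())
    return root


if __name__ == "__main__":
    sample = "# 1 Intro\nHello\n## 1.1 Scope\nText\n# 2 Setup\nMore"
    print(json.dumps(markdown_to_tree(sample), indent=2))
```

Unnumbered headings are treated as plain content here, which may or may not be what is wanted for a given document.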

Now:

  2. I also need to read the tables and convert them to JSON.
  3. I need to take the pictures and run them through a computer vision model to extract information about them.
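
For step 2, one possible starting point is to pull pipe-style tables out of the converted Markdown and turn each row into a JSON object. The sketch below uses only the standard library; it assumes the tables are simple pipe tables with a header row and a separator row, does not handle escaped pipes or merged cells, and the function names are hypothetical.

```python
import json


def split_row(line):
    """Split one pipe-table line into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator(cells):
    """True for the header separator row, e.g. |---|:---:|."""
    return all(cell and set(cell) <= set("-: ") for cell in cells)


def markdown_tables_to_records(markdown_text):
    """Return one list of row dicts per pipe table found (hypothetical helper)."""
    tables, block = [], []
    # The trailing empty line flushes a table that ends at end of input.
    for line in markdown_text.splitlines() + [""]:
        if line.strip().startswith("|"):
            block.append(split_row(line))
            continue
        if len(block) >= 2 and is_separator(block[1]):
            header, rows = block[0], block[2:]
            tables.append([dict(zip(header, row)) for row in rows])
        block = []
    return tables


if __name__ == "__main__":
    sample = (
        "Intro\n\n"
        "| Pin | Voltage |\n"
        "|-----|---------|\n"
        "| A1 | 3.3 |\n"
        "| B2 | 5 |\n"
    )
    print(json.dumps(markdown_tables_to_records(sample), indent=2))
```

All cell values stay as strings; converting numbers or units would need rules specific to the documents.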

Questions

  1. What are the available options for this?
  2. If there is a better approach, please suggest it.
