# ⚙️ Process

The process module extracts and standardizes text and images from a wide range of file formats (listed below), making it well suited to building datasets for applications such as RAG and multimodal content generation, and to preprocessing data for LLMs and multimodal LLMs.

## 🔨 Quick Start

### 🧑‍💻 Global installation

Set up the project on each device you want to use, either with our setup script or by inspecting what it does and performing the steps manually.

```bash
pip install -e '.[all]'
```

### 💻 Running locally

You need to specify the input folder by modifying the config file; you can also tweak the other parameters to your needs. Once ready, run the process with the following command:

```bash
python -m mmore process --config-file examples/process/config.yaml
```
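For orientation, a config might look roughly like the sketch below; the key names are illustrative assumptions, so treat `examples/process/config.yaml` as the authoritative schema.

```yaml
# Illustrative sketch only -- key names are assumptions; see examples/process/config.yaml
data_path: /path/to/my/input/folder    # folder to crawl for files to process
output_path: /path/to/output           # where results.jsonl files and extracted images go
```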

The output of the pipeline has the following structure:

```
output_path
├── processors
│   ├── Processor_type_1
│   │   └── results.jsonl
│   ├── Processor_type_2
│   │   └── results.jsonl
│   └── ...
├── merged
│   └── merged_results.jsonl
└── images
```

### 🚀 Running on distributed nodes

We provide a simple bash script to run the process in distributed mode. Call it with your arguments:

```bash
bash scripts/process_distributed.sh -f /path/to/my/input/folder
```

### ⌛ Dashboard UI

Getting a sense of the overall progress of the pipeline can be challenging when running on a large dataset, especially in a distributed environment. You can optionally use the dashboard to monitor the pipeline's progress and visualize results 📈. The dashboard also lets you gracefully stop workers 📉 and monitor their progression.

1. Start the backend on the cluster (see the backend README).
2. Specify the backend URL in the frontend as an environment variable.
3. Start the frontend on your local machine (see the frontend README).
4. Specify the backend URL in the process_config.yaml file, then execute run_process.py as usual.
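As a sketch of step 2, the backend URL is passed to the frontend via an environment variable; the variable name and port below are hypothetical placeholders, so check the frontend README for the actual ones.

```bash
# Hypothetical variable name and port -- see the frontend README for the real values
export BACKEND_URL="http://<cluster-node>:8000"
```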

## 📜 Examples

You can find more example scripts in the `/examples` directory.

## ⚡ Optimization

### 🏎️ Fast mode

For some file types, we provide a fast mode that processes files faster by using a lighter-weight method. To use it, set `use_fast_processors` to `true` in the config file.

Be aware that fast mode might not be as accurate as the default mode, especially for scanned (non-native) PDFs, which may require Optical Character Recognition (OCR) for accurate extraction.
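For example, in the config file this is a single flag (shown in isolation here, alongside your existing settings):

```yaml
use_fast_processors: true   # faster, but may be less accurate (e.g. on scanned PDFs)
```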

### 🚀 Distributed mode

The project is designed to scale easily to a multi-GPU / multi-node environment. To use it, set `distributed` to `true` in the config file and follow the steps described in the Running on distributed nodes section above.
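As with fast mode, this is a single flag in the config file (shown in isolation):

```yaml
distributed: true   # dispatch batches across multiple GPUs / nodes
```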

### 🔧 File type parameters tuning

Many parameters are hardware-dependent and can be customized to suit your needs. For example, you can adjust the processor batch size, dispatcher batch size, and the number of threads per worker to optimize performance.

You can configure parameters by providing a custom config file. You can find an example of a config file in the examples folder.
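As an illustration, the hardware-dependent settings might look like the sketch below; the key names are assumptions, not the exact schema, so check the example config for the real ones.

```yaml
# Hypothetical key names for illustration -- check the example config for the real ones
processor_batch_size: 32     # files handled per processor call (GPU-memory bound)
dispatcher_batch_size: 128   # files sent to each worker at a time
threads_per_worker: 4        # CPU threads available to each worker
```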

🚨 Not all parameters are configurable yet 😉

## 📜 More information on what's under the hood

### 🚧 Pipeline architecture

Our pipeline is a three-step process:

- Crawling: We first crawl the input file/folder to list all the files that need processing (skipping those already processed).
- Dispatching: We then dispatch the files to the workers in batches via a dispatcher. This part is in charge of load balancing across nodes when the project runs in a distributed environment.
- Processing: The workers then process the files, using the appropriate tools for each file type. They extract the text, images, audio, and video frames, and send them to the next step. Our goal is to make it easy to add new processors for new file types, or other kinds of processing for existing file types. A simplified sketch of this flow is shown below.
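The sketch below illustrates the crawl → dispatch → process flow in plain Python. It is not mmore's internal API; every name here is hypothetical and only meant to show how the three stages fit together.

```python
# Illustrative sketch of the crawl -> dispatch -> process flow.
# This is NOT mmore's internal API; all names here are hypothetical.
from pathlib import Path
from typing import Iterable, Iterator, List


def crawl(root: Path, already_processed: set) -> Iterator[Path]:
    """List every file under `root`, skipping those already processed."""
    for path in root.rglob("*"):
        if path.is_file() and str(path) not in already_processed:
            yield path


def dispatch(files: Iterable[Path], batch_size: int) -> Iterator[List[Path]]:
    """Group files into batches; in distributed mode these batches are what
    gets load-balanced across worker nodes."""
    batch: List[Path] = []
    for f in files:
        batch.append(f)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def process(batch: List[Path]) -> List[dict]:
    """Each worker picks the processor matching the file type and extracts
    text, images, audio, or video frames into one sample per file."""
    return [{"file_path": str(f), "text": "", "modalities": []} for f in batch]


if __name__ == "__main__":
    for b in dispatch(crawl(Path("/path/to/my/input/folder"), set()), batch_size=8):
        samples = process(b)
```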

### 🛠️ Used tools

The project supports multiple file types and utilizes various AI-based tools for processing. Below is a table summarizing the supported file types and the corresponding tools (N/A means there is no alternative tool):

| File Type | Default Mode Tool(s) | Fast Mode Tool(s) |
|---|---|---|
| DOCX | python-docx to extract the text and images | N/A |
| MD | markdown for text extraction, markdownify for HTML conversion | N/A |
| PPTX | python-pptx to extract the text and images | N/A |
| XLSX | openpyxl to extract the text and images | N/A |
| TXT | Python built-in library | N/A |
| EML | Python built-in library | N/A |
| MP4, MOV, AVI, MKV, MP3, WAV, AAC | moviepy for video frame extraction; whisper-large-v3-turbo for transcription | whisper-tiny |
| PDF | marker-pdf for OCR and structured data extraction | PyMuPDF for text and image extraction |
| Webpages (TBD) | TODO | selenium to navigate the webpage and extract content; requests for images; trafilatura for content extraction |

We also use Dask distributed to manage the distributed environment.

### 🔧 Customization

The system is designed to be extensible, allowing you to register custom processors for handling new file types or specialized processing. To implement a new processor, you need to inherit from the `Processor` class and implement only two methods:

- accepts: defines the file types your processor supports (e.g. docx)
- process: defines how to process a single file (input: a file of that type; output: a multimodal sample; see the other processors for reference)

See `TextProcessor` in `src/process/processors/text_processor.py` for a minimal example.
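For illustration only, a custom processor might look roughly like the sketch below. The import path, method signatures, and the sample structure are assumptions; use `TextProcessor` as the authoritative reference for the real interface.

```python
# Hypothetical sketch of a custom processor. The import path, method signatures,
# and the sample type are assumptions -- see TextProcessor for the real interface.
from mmore.process.processors.base import Processor  # import path is an assumption


class LogProcessor(Processor):
    @classmethod
    def accepts(cls, file_path: str) -> bool:
        # Declare which file types this processor supports (here: .log files).
        return file_path.lower().endswith(".log")

    def process(self, file_path: str) -> dict:
        # Turn a single file into a multimodal sample (text only, no images here).
        with open(file_path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        return {"text": text, "modalities": []}  # placeholder sample structure
```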