Commit 21768e6

committed Mar 20, 2025
checked repo
1 parent 919b6f8 commit 21768e6

6 files changed

+67
-193
lines changed


‎README.md

+26-107
Original file line number | Diff line number | Diff line change
@@ -17,51 +17,13 @@ A scalable multimodal pipeline for processing, indexing, and querying multimodal
1717
Ever needed to take 8000 PDFs, 2000 videos, and 500 spreadsheets and feed them to an LLM as a knowledge base?
1818
Well, MMORE is here to help you!
1919

20-
## MMORE Installation Guide
20+
## :bulb: Quickstart
2121

22-
### Installation Option 1: pip (recommended)
22+
### Installation
2323

24-
To install all dependencies, run:
24+
#### (Step 0 – Install system dependencies)
2525

26-
```bash
27-
pip install -e '.[all]'
28-
```
29-
30-
To install only processor-related dependencies, run:
31-
32-
```bash
33-
pip install -e '.[processor]'
34-
```
35-
36-
To install only RAG-related dependencies, run:
37-
38-
```bash
39-
pip install -e '.[rag]'
40-
```
41-
42-
---
43-
44-
45-
### Minimal Example
46-
47-
```python
48-
from mmore.process.processors.pdf_processor import PDFProcessor
49-
from mmore.process.processors.base import ProcessorConfig
50-
from mmore.type import MultimodalSample
51-
52-
pdf_file_paths = ["examples/sample_data/pdf/calendar.pdf"]
53-
out_file = "results/example.jsonl"
54-
55-
pdf_processor_config = ProcessorConfig(custom_config={"output_path": "results"})
56-
pdf_processor = PDFProcessor(config=pdf_processor_config)
57-
result_pdf = pdf_processor.process_batch(pdf_file_paths, True, 1) # args: file_paths, fast mode (True/False), num_workers
58-
59-
MultimodalSample.to_jsonl(out_file, result_pdf)
60-
```
61-
62-
### Installation Option 2: uv
63-
64-
#### Step 1: Install system dependencies
26+
Our package requires system dependencies. This snippet will take care of installing them!
6527

6628
```bash
6729
sudo apt update
@@ -70,95 +32,52 @@ sudo apt install -y ffmpeg libsm6 libxext6 chromium-browser libnss3 \
7032
libxext6 libxfixes3 libxrender1 libasound2 libatk1.0-0 libgtk-3-0 libreoffice
7133
```
7234

73-
#### Step 2: Install `uv`
74-
75-
Refer to the [uv installation guide](https://docs.astral.sh/uv/getting-started/installation/) for detailed instructions.
76-
```
77-
curl -LsSf https://astral.sh/uv/install.sh | sh
78-
```
79-
80-
#### Step 3: Clone this repository
35+
#### Step 1 – Install MMORE
8136

82-
```bash
83-
git clone https://github.com/swiss-ai/mmore
84-
cd mmore
85-
```
86-
87-
#### Step 4: Install project and dependencies
37+
To install the package simply run:
8838

8939
```bash
90-
uv sync
40+
pip install -e .
9141
```
9242

93-
For CPU-only installation, use:
43+
To install additional RAG-related dependencies, run:
9444

9545
```bash
96-
uv sync --extra cpu
46+
pip install -e '.[rag]'
9747
```
9848

99-
#### Step 5: Run a test command
100-
101-
Activate the virtual environment before running commands:
49+
> :warning: This is a big package with a lot of dependencies, so we recommend to use `uv` to handle `pip` installations. [Check our tutorial on uv](./docs/uv.md).
10250
103-
```bash
104-
python -m venv .venv
105-
source .venv/bin/activate
106-
```
51+
### Minimal Example
10752

108-
Alternatively, prepend each command with `uv run`:
53+
You can use our predefined CLI commands to execute parts of the pipeline. Note that you might need to prepend `python -m` to the command if the package does not properly create bash aliases.
10954

11055
```bash
11156
# Run processing
112-
python -m mmore process --config-file examples/process/config.yaml
57+
mmore process --config-file examples/process/config.yaml
11358

11459
# Run indexer
115-
python -m mmore index --config-file examples/index/config.yaml
60+
mmore index --config-file examples/index/config.yaml
11661

11762
# Run RAG
118-
python -m mmore rag --config-file examples/rag/api/rag_api.yaml
119-
```
120-
121-
---
122-
123-
### Installation Option 3: Docker
124-
125-
**Note:** For manual installation without Docker, refer to the section below.
126-
127-
#### Step 1: Install Docker
128-
129-
Follow the official [Docker installation guide](https://docs.docker.com/get-started/get-docker/).
130-
131-
#### Step 2: Build the Docker image
132-
133-
```bash
134-
docker build . --tag mmore
135-
```
136-
137-
To build for CPU-only platforms (results in a smaller image size):
138-
139-
```bash
140-
docker build --build-arg PLATFORM=cpu -t mmore .
141-
```
142-
143-
#### Step 3: Start an interactive session
144-
145-
```bash
146-
docker run -it -v ./test_data:/app/test_data mmore
63+
mmore rag --config-file examples/rag/api/rag_api.yaml
14764
```
14865

149-
*Note:* The `test_data` folder is mapped to `/app/test_data` inside the container, corresponding to the default path in `examples/process_config.yaml`.
66+
You can also use our package in python code as shown here:
15067

151-
#### Step 4: Run the application inside the container
68+
```python
69+
from mmore.process.processors.pdf_processor import PDFProcessor
70+
from mmore.process.processors.base import ProcessorConfig
71+
from mmore.type import MultimodalSample
15272

153-
```bash
154-
# Run processing
155-
mmore process --config-file examples/process/config.yaml
73+
pdf_file_paths = ["examples/sample_data/pdf/calendar.pdf"]
74+
out_file = "results/example.jsonl"
15675

157-
# Run indexer
158-
mmore index --config-file examples/index/config.yaml
76+
pdf_processor_config = ProcessorConfig(custom_config={"output_path": "results"})
77+
pdf_processor = PDFProcessor(config=pdf_processor_config)
78+
result_pdf = pdf_processor.process_batch(pdf_file_paths, True, 1) # args: file_paths, fast mode (True/False), num_workers
15979

160-
# Run RAG
161-
mmore rag --config-file examples/rag/api/rag_api.yaml
80+
MultimodalSample.to_jsonl(out_file, result_pdf)
16281
```
16382

16483
---
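
The README example above writes its results with `MultimodalSample.to_jsonl`. As a minimal sketch for inspecting that output, the helper below reads a JSONL file with the standard library only. It assumes the usual JSONL layout of one JSON object per line; `read_jsonl` is a hypothetical helper, not part of MMORE, and the field names inside each record are not shown in this commit, so the code makes no assumption about them.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file.

    Hypothetical helper for illustration; not part of the mmore package.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

For example, `for sample in read_jsonl("results/example.jsonl"): print(sorted(sample))` would list the keys of each record, assuming every record is a JSON object.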

‎docs/docker.md

+27
@@ -0,0 +1,27 @@
1+
### Installation Option 3: Docker
2+
3+
**Note:** For manual installation without Docker, refer to the section below.
4+
5+
#### Step 1: Install Docker
6+
7+
Follow the official [Docker installation guide](https://docs.docker.com/get-started/get-docker/).
8+
9+
#### Step 2: Build the Docker image
10+
11+
```bash
12+
docker build . --tag mmore
13+
```
14+
15+
To build for CPU-only platforms (results in a smaller image size):
16+
17+
```bash
18+
docker build --build-arg PLATFORM=cpu -t mmore .
19+
```
20+
21+
#### Step 3: Start an interactive session
22+
23+
```bash
24+
docker run -it -v ./test_data:/app/test_data mmore
25+
```
26+
27+
*Note:* The `test_data` folder is mapped to `/app/test_data` inside the container, corresponding to the default path in `examples/process_config.yaml`.

‎examples/index/config.yaml

-9
@@ -1,11 +1,3 @@
1-
<<<<<<< HEAD
2-
dense_model:
3-
model_name: BAAI/bge-small-en
4-
sparse_model:
5-
model_name: splade
6-
db:
7-
uri: ./examples/index/sample.db
8-
=======
91
indexer:
102
dense_model:
113
model_name: sentence-transformers/all-MiniLM-L6-v2
@@ -18,4 +10,3 @@ indexer:
1810
name: my_db
1911
collection_name: my_docs
2012
documents_path: 'examples/process/outputs/merged/merged_results.jsonl'
21-
>>>>>>> 21388c46a4f8f731769264e8ef0d85d055f7a6d8

‎examples/rag/config_api.yaml

-11
@@ -8,16 +8,6 @@ rag:
88
# Retriever Config
99
retriever:
1010
db:
11-
<<<<<<< HEAD:examples/rag/config_api.yaml
12-
uri: ./examples/rag/ner.db
13-
hybrid_search_weight: 0.5
14-
k: 5
15-
system_prompt: "Use the following context to answer the questions.\n\nContext:\n{context}"
16-
api:
17-
endpoint: '/rag'
18-
port: 8000
19-
host: 'localhost'
20-
=======
2111
uri: ./proc_demo.db
2212
name: my_db
2313
hybrid_search_weight: 0.5
@@ -30,4 +20,3 @@ mode_args:
3020
endpoint: '/rag'
3121
port: 8000
3222
host: 'localhost'
33-
>>>>>>> 21388c46a4f8f731769264e8ef0d85d055f7a6d8:examples/rag/api/rag_api.yaml
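
Both example configs above had been committed with unresolved merge conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`), which this commit removes. A minimal sketch of a check that could catch such leftovers before they are committed is shown below. The function names and the default `*.yaml` glob are assumptions made for illustration; nothing like this exists in the repository as far as this diff shows.

```python
import re
from pathlib import Path

# Git writes conflict markers at the start of a line:
# "<<<<<<< <ref>", a bare "=======", and ">>>>>>> <ref>".
CONFLICT_MARKER = re.compile(r"^(<{7} |={7}$|>{7} )")


def find_conflict_markers(text: str) -> list[int]:
    """Return the 1-based numbers of lines that start with a conflict marker."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if CONFLICT_MARKER.match(line)
    ]


def scan_tree(root: str, pattern: str = "*.yaml") -> dict[str, list[int]]:
    """Map each matching file under `root` to its offending line numbers."""
    hits = {}
    for path in Path(root).rglob(pattern):
        found = find_conflict_markers(path.read_text(encoding="utf-8"))
        if found:
            hits[str(path)] = found
    return hits
```

Calling `scan_tree("examples")` returns an empty dict when no markers remain. The regular expression deliberately ignores the YAML document separator `---`, so valid multi-document files are not flagged.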

‎pyproject.toml

+14-66
@@ -1,3 +1,7 @@
1+
[build-system]
2+
requires = ["setuptools", "setuptools-scm"]
3+
build-backend = "setuptools.build_meta"
4+
15
[project]
26
name = "mmore"
37
version = "1.0.0"
@@ -19,23 +23,10 @@ dependencies = [
1923
"python-dotenv==1.0.1",
2024
"dacite==1.8.1",
2125
"click>=8.1.7",
22-
"dask-cuda>=24.10.0",
23-
"cuda-python>=12.6.2",
24-
"ucx-py-cu12>=0.40.0",
26+
"dask[distributed]>=2025.2.0",
2527
"pytest>=8.3.4",
2628
"validators==0.34.0",
2729
"httpx==0.27.2",
28-
]
29-
30-
[project.optional-dependencies]
31-
cpu = [
32-
"torch>=2.5.1",
33-
]
34-
cu124 = [
35-
"torch>=2.5.1",
36-
]
37-
38-
process = [
3930
"Pillow",
4031
"PyMuPDF",
4132
"Unidecode",
@@ -69,36 +60,16 @@ process = [
6960
"pymongo==4.9.2"
7061
]
7162

72-
rag = [
73-
"accelerate",
74-
"langchain-anthropic==0.3.4",
75-
"langchain-aws",
76-
"langchain-cohere==0.4.2",
77-
"langchain-huggingface==0.1.2",
78-
"langchain-milvus==0.1.8",
79-
"langchain-mistralai==0.2.7",
80-
"langchain-nvidia-ai-endpoints",
81-
"langchain-openai==0.3.7",
82-
"langchain==0.3.20",
83-
"langdetect>=1.0.9",
84-
"langserve[all]==0.3.1",
85-
"pymilvus[model]==2.5.0",
86-
"ragas==0.2.6",
87-
"nltk>=3.9",
63+
[project.optional-dependencies]
64+
cpu = [
65+
"torch>=2.5.1",
66+
]
67+
cu124 = [
68+
"torch>=2.5.1",
8869
]
8970

90-
all = [
71+
rag = [
9172
"accelerate",
92-
"Pillow",
93-
"PyMuPDF",
94-
"Unidecode",
95-
"bokeh",
96-
"chonkie==0.2.1.post1",
97-
"clean-text",
98-
"datatrove[processing]",
99-
"docx2pdf",
100-
"fastapi==0.115.5",
101-
"fastapi[standard]",
10273
"langchain-anthropic==0.3.4",
10374
"langchain-aws",
10475
"langchain-cohere==0.4.2",
@@ -110,28 +81,9 @@ all = [
11081
"langchain==0.3.20",
11182
"langdetect>=1.0.9",
11283
"langserve[all]==0.3.1",
113-
"lxml_html_clean",
114-
"markdown==3.7",
115-
"markdownify==0.13.1",
116-
"marker-pdf==1.6.0",
117-
"motor==3.6.0",
118-
"moviepy==2.1.1",
119-
"nltk>=3.9",
120-
"openpyxl==3.1.5",
121-
"py7zr==0.22.0",
122-
"pydantic== 2.10.4",
12384
"pymilvus[model]==2.5.0",
124-
"pymongo==4.9.2",
125-
"python-docx",
126-
"python-pptx",
12785
"ragas==0.2.6",
128-
"rarfile==4.2",
129-
"requests==2.28.1",
130-
"selenium==4.27.1",
131-
"surya-ocr>=0.8.3",
132-
"trafilatura==1.4.0",
133-
"validators==0.34.0",
134-
"xlrd==2.0.1",
86+
"nltk>=3.9",
13587
]
13688

13789
[tool.uv]
@@ -160,8 +112,4 @@ url = "https://download.pytorch.org/whl/cu124"
160112
explicit = true
161113

162114
[project.scripts]
163-
mmore = "mmore:__main__"
164-
165-
[build-system]
166-
requires = ["hatchling"]
167-
build-backend = "hatchling.build"
115+
mmore = "mmore:__main__"
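
A `[project.scripts]` value follows the entry-point form `module:attribute`, and the generated console script imports the module and calls the attribute with no arguments. The sketch below uses the standard library to show how the value in this diff is split. The alternative value `mmore.__main__:main` is purely hypothetical: the package source is not part of this commit, so whether a `main` function exists there is an assumption.

```python
from importlib.metadata import EntryPoint


def split_entry_point(value: str):
    """Split an entry-point value into (module, attribute).

    Requires Python 3.9+ for the `module` and `attr` properties.
    """
    entry = EntryPoint(name="mmore", value=value, group="console_scripts")
    return entry.module, entry.attr


# Value taken from this commit's pyproject.toml.
print(split_entry_point("mmore:__main__"))

# Hypothetical alternative pointing at a function inside the submodule.
print(split_entry_point("mmore.__main__:main"))
```

With `mmore:__main__`, the attribute is the name `__main__` looked up on the `mmore` package, which would typically be a submodule rather than a callable. That may be related to the README note about prepending `python -m`, though this diff alone does not confirm it.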

‎requirements.txt

Whitespace-only changes.
