A CPU-optimized Retrieval-Augmented Generation (RAG) system for analyzing scientific papers.
- CPU Optimized: Runs efficiently on local hardware (tested on 6 physical cores) using Qwen2.5-3B (GGUF). Answers take 15-30 seconds to generate.
- Intelligent Retrieval: Vector search using FAISS and a reranker model.
- Layout-Aware Parsing: Handles multi-column scientific PDFs without header/footer noise. Text is recursively split into chunks.
- Incremental Indexing: Only processes new PDFs added to the data directory.
- Verified Citations: The model is instructed to include precise references to the documents it consulted.
- Stream Generation: Improves perceived latency by printing each token as soon as it is generated.
- User Interface: Gradio chatbot interface.
- Paper Selection: Optionally restrict retrieval to a chosen subset of papers.
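The recursive chunking mentioned above can be sketched roughly as follows. This is a simplified stand-in for a recursive character splitter, not the project's actual implementation; the function name and separator list are illustrative:

```python
def recursive_split(text, max_len=500, seps=("\n\n", "\n", " ")):
    """Recursively split text on ever-finer separators until chunks fit max_len."""
    if len(text) <= max_len:
        return [text]
    if not seps:
        # No separators left: hard-cut the text at max_len boundaries.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, *rest = seps
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, max_len, tuple(rest)))
    return chunks
```

Real splitters additionally merge small neighboring pieces back up toward `max_len` and often add overlap between chunks; this sketch only shows the recursive descent through separators.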
- LLM: Qwen2.5-3B-Instruct (quantized Q4_K_M) via llama-cpp-python
- Embeddings: sentence-transformers/all-MiniLM-L6-v2
- Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2
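The retrieve-then-rerank flow behind this stack can be sketched as a two-stage ranking. The snippet below uses pure-Python stand-ins for FAISS and the cross-encoder; the function names and parameters are illustrative, not the project's actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_then_rerank(query_vec, chunks, chunk_vecs, rerank_fn, k=4, top_n=2):
    # Stage 1: fast vector search (FAISS in the real system) keeps the top-k chunks.
    scored = sorted(zip(chunks, chunk_vecs),
                    key=lambda cv: cosine(query_vec, cv[1]), reverse=True)[:k]
    # Stage 2: a cross-encoder rescores each (query, chunk) pair more precisely.
    reranked = sorted(scored, key=lambda cv: rerank_fn(cv[0]), reverse=True)
    return [chunk for chunk, _ in reranked[:top_n]]
```

The design point is that the cheap bi-encoder search narrows thousands of chunks to `k` candidates, so the expensive cross-encoder only scores a handful of pairs.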
This project is built using Python 3.10.12. Use a virtual environment to avoid conflicts.
```
git clone https://github.com/jorgesarrato/galaxy_rag_project.git
cd galaxy_rag_project
```

If you want to use a virtual environment:
```
python3 -m venv rag_env
source rag_env/bin/activate
```

Install dependencies:
```
pip install --upgrade pip
pip install -r requirements.txt
```

Create a `.env` file with the following content:
```
HF_TOKEN=      # your Hugging Face token
DATA_DIR=      # path to PDFs
DB_DIR=        # path to store the vector database
MODEL_DIR=     # path to LLM models
```

Download the LLM model:
```
mkdir <your MODEL_DIR>
hf download bartowski/Qwen2.5-3B-Instruct-GGUF --include "Qwen2.5-3B-Instruct-Q4_K_M.gguf" --local-dir <your MODEL_DIR>
```

In theory the pipeline will download the model for you if you add it to MODEL_MAP; in practice, calling `hf download` manually is usually faster.
Store your PDF files in `data/`, or in the folder you defined as `DATA_DIR` in your `.env`.
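The incremental indexing feature (only new PDFs are processed) can be implemented by recording a content hash of each file already ingested. This is a stdlib-only sketch; the manifest file name and function are illustrative, and the project's actual bookkeeping may differ:

```python
import hashlib
import json
from pathlib import Path

def find_new_pdfs(data_dir, manifest_path="processed.json"):
    """Return PDFs whose content hash is not yet recorded in the manifest."""
    manifest = Path(manifest_path)
    seen = json.loads(manifest.read_text()) if manifest.exists() else {}
    new_files = []
    for pdf in sorted(Path(data_dir).glob("*.pdf")):
        digest = hashlib.sha256(pdf.read_bytes()).hexdigest()
        if seen.get(pdf.name) != digest:
            new_files.append(pdf)          # new or modified file: needs indexing
            seen[pdf.name] = digest
    manifest.write_text(json.dumps(seen, indent=2))
    return new_files
```

Hashing file contents (rather than just names) also catches re-uploaded PDFs whose text changed.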
Execute in terminal mode:
```
python src/main.py
```

Or in app mode, and open the provided local link to chat:

```
python src/main_gradio.py
```

The project can also be run as a containerized service, which avoids local dependency issues and ensures reproducibility across systems.
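Both run modes stream the answer token by token (the Stream Generation feature above). With llama-cpp-python this is typically enabled by passing `stream=True` to the completion call; the sketch below shows the generic pattern with a stand-in generator, since the real call requires a loaded GGUF model:

```python
import sys

def fake_llm_stream(prompt):
    """Stand-in for a streaming LLM call that yields tokens one at a time."""
    for token in ["Dark ", "matter ", "cores ", "form..."]:
        yield token

def stream_answer(prompt, stream_fn=fake_llm_stream, out=sys.stdout):
    """Print each token as it arrives and return the assembled answer."""
    pieces = []
    for token in stream_fn(prompt):
        out.write(token)   # show the token immediately: better perceived latency
        out.flush()
        pieces.append(token)
    return "".join(pieces)
```

The user starts reading after the first token instead of waiting the full 15-30 seconds for the complete answer.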
The containerized option is intended for deploying the API as a service.

If you defined a custom DATA_DIR or DB_DIR, replace "data/" with your data path in this line of the Dockerfile:

```
COPY data/ ./data
```
From the project root:

```
docker build -t rag-api .
```

To ingest your PDF collection and build the vector index (replace with your paths):

```
docker run \
  -v /path/to/pdfs:/app/pdfs \
  -v /path/to/vectors:/app/vectors \
  rag-api python ingest.py
```
Start the API:

```
docker run \
  -p 8000:8000 \
  -v /path/to/pdfs:/data/pdfs \
  -v /path/to/vectors:/data/vectors \
  rag-api
```
The API will be available at http://localhost:8000
Verify the service is running:

```
curl http://localhost:8000/health
```
Send a query, adding the list of papers you want to include:

```
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Explain the physical motivation for cored dark matter profiles",
    "selected_papers": [
      "Rocha_2013_SIDM.pdf",
      "Kaplinghat_2016_DMcores.pdf",
      "Bullock_2017_CDMreview.pdf"
    ]
  }'
```
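The same query can be sent from Python using only the standard library. The endpoint URL and field names follow the curl example above; the helper function names are illustrative:

```python
import json
from urllib import request

def build_query(query, selected_papers=None):
    """Build the JSON body the /query endpoint expects."""
    body = {"query": query}
    if selected_papers:
        body["selected_papers"] = selected_papers
    return body

def ask(query, selected_papers=None, url="http://localhost:8000/query"):
    """POST a query to the running API and return the parsed JSON response."""
    data = json.dumps(build_query(query, selected_papers)).encode()
    req = request.Request(url, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `ask("Explain abundance matching")` searches the whole collection, while passing `selected_papers=["Rocha_2013_SIDM.pdf"]` restricts retrieval to that paper.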
Or omit `selected_papers` to search through your whole collection:

```
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query":"Explain abundance matching"}'
```