
Galaxy RAG Project

A CPU-optimized Retrieval-Augmented Generation (RAG) system for analyzing scientific papers.

Features

  • CPU Optimized: Runs efficiently on local hardware (tested on 6 physical cores) using Qwen2.5-3B (GGUF). Answers take 15-30 seconds to generate.
  • Intelligent Retrieval: Vector search with FAISS followed by a reranker model (see the sketch after this list).
  • Layout-Aware Parsing: Handles multi-column scientific PDFs without header/footer noise. Text is recursively split into chunks.
  • Incremental Indexing: Only new PDFs added to the data directory are processed.
  • Verified Citations: The model is instructed to include precise references to the consulted documents.
  • Streaming Generation: Tokens are printed as soon as they are generated, improving perceived latency.
  • User Interface: Gradio chatbot interface.
  • Paper Selection: Retrieval can optionally be restricted to specific papers.
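
To illustrate the retrieval step, here is a minimal retrieve-then-rerank sketch using the models listed below. It is an illustration of the approach, not the repository's exact code, and the index and chunks arguments are assumed to come from the ingestion step:

import faiss
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query, index, chunks, k=20, top_n=5):
    # 1. Vector search: embed the query and fetch the k nearest chunks from FAISS.
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    candidates = [chunks[i] for i in ids[0] if i != -1]
    # 2. Rerank: score each (query, chunk) pair with the cross-encoder
    #    and keep the top_n highest-scoring chunks.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda t: t[0], reverse=True)
    return [c for _, c in ranked[:top_n]]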

Models

  • LLM: Qwen2.5-3B-Instruct (quantized Q4_K_M) via llama-cpp-python (see the loading sketch below)
  • Embeddings: sentence-transformers/all-MiniLM-L6-v2
  • Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2
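
For reference, a minimal sketch of how a GGUF model can be loaded and streamed with llama-cpp-python (paths and parameters are illustrative, not the repository's exact configuration):

from llama_cpp import Llama

llm = Llama(
    model_path="models/Qwen2.5-3B-Instruct-Q4_K_M.gguf",  # adjust to your MODEL_DIR
    n_ctx=4096,   # context window
    n_threads=6,  # match your physical core count
)

# Streaming keeps perceived latency low: tokens print as soon as they are generated.
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain abundance matching"}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)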

Installation

This project is built using Python 3.10.12. Use a virtual environment to avoid conflicts.

git clone https://github.com/jorgesarrato/galaxy_rag_project.git
cd galaxy_rag_project

If you want to use a virtual environment:

python3 -m venv rag_env
source rag_env/bin/activate

Install dependencies:

pip install --upgrade pip
pip install -r requirements.txt

Create a .env file with the following content:

HF_TOKEN=    # your Hugging Face token
DATA_DIR=    # path to the PDF directory
DB_DIR=      # path to store the vector database
MODEL_DIR=   # path to the LLM models
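
For reference, these variables can be read with python-dotenv (a sketch; the repository may load them differently):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root
DATA_DIR = os.environ["DATA_DIR"]    # where the PDFs live
DB_DIR = os.environ["DB_DIR"]        # where the vector database is stored
MODEL_DIR = os.environ["MODEL_DIR"]  # where the GGUF models are stored
HF_TOKEN = os.environ["HF_TOKEN"]    # used for Hugging Face downloads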

Download the LLM model (replace /path/to/models with your MODEL_DIR):

mkdir -p /path/to/models
hf download bartowski/Qwen2.5-3B-Instruct-GGUF --include "Qwen2.5-3B-Instruct-Q4_K_M.gguf" --local-dir /path/to/models

In principle the pipeline will download the model itself if you include it in MODEL_MAP, but in practice calling hf download manually is faster.
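
If you prefer the automatic route, a MODEL_MAP-driven download might look like the following sketch (the actual structure of MODEL_MAP in this repository may differ):

from huggingface_hub import hf_hub_download

MODEL_MAP = {
    "qwen2.5-3b": {
        "repo_id": "bartowski/Qwen2.5-3B-Instruct-GGUF",
        "filename": "Qwen2.5-3B-Instruct-Q4_K_M.gguf",
    },
}

def ensure_model(name, model_dir):
    # Downloads the file into model_dir unless it is already present there.
    spec = MODEL_MAP[name]
    return hf_hub_download(repo_id=spec["repo_id"], filename=spec["filename"], local_dir=model_dir)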

Data Placement

Store your PDF files in data/ or in the folder you defined as DATA_DIR in your .env.

Usage

Execute in terminal mode:

python src/main.py

Or in app mode, then open the provided local link to chat:

python src/main_gradio.py

Running with Docker

The project can also be run as a containerized service, which avoids local dependency issues and ensures reproducibility across systems.

Choose this option if you plan to deploy the API as a service.

Build the Docker image

If you defined a custom DATA_DIR or DB_DIR, replace data/ with your data path in this line of the Dockerfile:

COPY data/ ./data 

From the project root:

docker build -t rag-api .

Run ingestion (one time)

To ingest your PDF collection and build the vector index (replace with your paths):

docker run --rm \
  -v /path/to/pdfs:/app/pdfs \
  -v /path/to/vectors:/app/vectors \
  rag-api python ingest.py

Run the API service

Start the API:

docker run \
  -p 8000:8000 \
  -v /path/to/pdfs:/data/pdfs \
  -v /path/to/vectors:/data/vectors \
  rag-api

The API will be available at:

http://localhost:8000

Health Check

Verify that the service is running:

curl http://localhost:8000/health

Query the API

Send a query, including the list of papers you want to restrict retrieval to:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Explain the physical motivation for cored dark matter profiles",
    "selected_papers": [
      "Rocha_2013_SIDM.pdf",
      "Kaplinghat_2016_DMcores.pdf",
      "Bullock_2017_CDMreview.pdf"
    ]
  }'

Or omit the list to search across your entire collection:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query":"Explain abundance matching"}'
