A privacy-conscious RAG (Retrieval-Augmented Generation) application that lets you chat with any PDF document. Parsing, chunking, embedding, and vector search all run locally; only your question and the retrieved chunks are sent to Groq's LLM API. Upload a PDF, ask questions in plain English, and get grounded answers with page citations.
Built with LangChain, Groq, HuggingFace, ChromaDB, and Streamlit.
Upload any PDF — a research paper, contract, report, or manual — and ask questions about it in plain English. The app finds the most relevant sections and generates a grounded answer with page citations.
- ✅ Grounded answers — the LLM answers only from your document's content, minimizing hallucinations
- ✅ Page citations — every answer links back to source page numbers
- ✅ Transparent retrieval — inspect the exact chunks used to generate each answer
- ✅ Free to run — Groq's free tier is fast and generous
                 PDF Upload
                  │
                  ▼
┌─────────────────────────────────────────────┐
│                  loader.py                  │
│  PyMuPDF extracts text page by page         │
│  RecursiveCharacterTextSplitter chunks it   │
└─────────────────┬───────────────────────────┘
                  │  list[Document] chunks
                  ▼
┌─────────────────────────────────────────────┐
│                 embedder.py                 │
│  all-MiniLM-L6-v2 converts chunks → vectors │
│  ChromaDB stores vectors in-memory          │
└─────────────────┬───────────────────────────┘
                  │
          User asks a question
                  │
                  ▼
┌─────────────────────────────────────────────┐
│                retriever.py                 │
│  Question → vector → similarity search      │
│  Top-5 chunks injected into prompt          │
│  Llama 3.3 70B (via Groq) generates answer  │
└─────────────────┬───────────────────────────┘
                  │  answer + page citations
                  ▼
         Streamlit UI (app.py)
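
For orientation, here is a minimal sketch of the ingestion path above (loader.py → embedder.py), assuming the LangChain integrations named in the tech stack below; the function name `build_vectorstore` and the exact parameters are illustrative, not the project's actual code.

```python
# Illustrative sketch; the real loader.py and embedder.py may differ in detail.
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

def build_vectorstore(pdf_path: str) -> Chroma:
    # PyMuPDF yields one Document per page, with the page number in metadata.
    pages = PyMuPDFLoader(pdf_path).load()

    # Split pages into overlapping chunks so sentences at boundaries aren't lost.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    chunks = splitter.split_documents(pages)

    # all-MiniLM-L6-v2 runs on CPU; with no persist_directory, Chroma stays in memory.
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return Chroma.from_documents(chunks, embedding=embeddings)
```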
| Component | Tool | Why |
|---|---|---|
| UI | Streamlit | Fast to build, easy to demo |
| LLM | Llama 3.3 70B via Groq | Free API, faster than local inference |
| Embeddings | all-MiniLM-L6-v2 (HuggingFace) | Runs on CPU, no API key needed |
| Vector DB | ChromaDB (in-memory) | Lightweight, no disk writes, works anywhere |
| PDF Parsing | PyMuPDF | Fast, accurate text extraction |
| RAG Framework | LangChain | Industry standard |
doc-qa-assistant/
│
├── app.py # Streamlit UI — chat interface
├── rag/
│ ├── __init__.py
│ ├── loader.py # PDF loading & chunking
│ ├── embedder.py # Embeddings + ChromaDB vector store
│ └── retriever.py # RAG chain — retrieval + answer generation
│
├── .python-version # Pins Python 3.11 (via uv)
├── .streamlit/
│ └── secrets.toml # Local secrets — not committed
├── requirements.txt
└── .gitignore
- Python 3.11
- uv — fast Python package manager
- Groq API key — free, takes 1 minute to get
git clone https://github.com/SameerGadge/doc-qa-assistant.git
cd doc-qa-assistant
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt

This app uses Streamlit secrets for API key management (both locally and in the cloud).
Create .streamlit/secrets.toml in the project root:
GROQ_API_KEY = "gsk_your_key_here"
⚠️ Make sure `.streamlit/secrets.toml` is in your `.gitignore` — never commit API keys.
Get a free key at console.groq.com → API Keys.
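
Inside the app, the key is read through Streamlit's secrets API. The snippet below is a rough sketch of that wiring; the actual code in app.py / retriever.py may organize it differently.

```python
# Sketch: read the Groq key from .streamlit/secrets.toml (or Streamlit Cloud secrets).
import streamlit as st
from langchain_groq import ChatGroq

llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0,
    api_key=st.secrets["GROQ_API_KEY"],
)
```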
uv run streamlit run app.py

Open http://localhost:8501 in your browser.
- Push the repo to GitHub
- Go to share.streamlit.io and connect your repo
- Set Main file path to `app.py`
- Under Advanced Settings → Secrets, add:
GROQ_API_KEY = "gsk_your_key_here"
- Click Deploy
- Upload a PDF using the sidebar file uploader
- Wait for the document to be processed (chunked + embedded)
- Type a question in the chat input
- View the answer with page citations
- Expand "View retrieved context" to see exactly which chunks were used
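
The retrieved-context view in the last step is just a Streamlit expander over the chunks returned by the retriever. A sketch, not necessarily the exact code in app.py:

```python
import streamlit as st

def show_retrieved_context(retrieved_docs):
    """Render the chunks used for the answer, with their source page numbers."""
    with st.expander("View retrieved context"):
        for doc in retrieved_docs:
            st.markdown(f"**Page {doc.metadata.get('page', '?')}**")
            st.write(doc.page_content)
```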
Key parameters can be tuned in each module:
Chunking (rag/loader.py)
chunk_size = 500 # characters per chunk — increase for denser docs
chunk_overlap = 100 # overlap between chunks — increase to avoid boundary loss

Retrieval (rag/retriever.py)
TOP_K = 5 # chunks retrieved per query
temperature = 0 # 0 = factual/deterministic, 1 = creative
LLM_MODEL = "llama-3.3-70b-versatile" # swap to llama-3.1-8b-instant for faster responses

RAG (Retrieval-Augmented Generation) — Instead of fine-tuning a model on your documents, RAG retrieves relevant snippets at query time and injects them into the prompt. The LLM answers using only that context, reducing hallucinations and keeping answers grounded in your data.
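
A minimal sketch of that retrieve-then-generate step, using the configuration above; the prompt wording and the `answer_question` helper are illustrative, not the exact contents of rag/retriever.py.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

PROMPT = ChatPromptTemplate.from_template(
    "Answer only from the context below and cite page numbers.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def answer_question(vectordb, question: str) -> str:
    # Embed the question and pull the 5 most similar chunks from ChromaDB.
    docs = vectordb.similarity_search(question, k=5)
    context = "\n\n".join(
        f"[page {d.metadata.get('page', '?')}] {d.page_content}" for d in docs
    )
    # ChatGroq reads GROQ_API_KEY from the environment (or pass api_key= explicitly).
    llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)
    chain = PROMPT | llm | StrOutputParser()
    return chain.invoke({"context": context, "question": question})
```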
Chunking — LLMs have limited context windows, so documents are split into small overlapping pieces. Overlap ensures sentences that span chunk boundaries aren't lost.
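
A toy snippet, independent of the app code, that makes the overlap visible:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
text = " ".join(f"Sentence {i} of the document." for i in range(100))
chunks = splitter.split_text(text)

# The tail of one chunk reappears at the head of the next, so a sentence that
# straddles a boundary is still present in full somewhere.
print(chunks[0][-60:])
print(chunks[1][:60])
```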
Embeddings — Each chunk is converted into a vector that captures its semantic meaning. Similar meanings produce similar vectors, enabling meaning-based search rather than keyword matching.
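
A quick illustration of that idea with the same model (not part of embedder.py):

```python
import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings

emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
a, b, c = emb.embed_documents([
    "The contract can be terminated with 30 days notice.",
    "Either party may end the agreement after one month's warning.",
    "The annual company picnic is held in July.",
])

def cosine(u, v):
    u, v = np.asarray(u), np.asarray(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # high: same meaning, different wording
print(cosine(a, c))  # low: unrelated topics
```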
Vector Database — ChromaDB stores chunk vectors in-memory per session and performs fast similarity search to find the most relevant chunks for any query.
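
Illustrative only: a throwaway in-memory Chroma store answering a top-k query.

```python
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma.from_texts(
    [
        "Revenue grew 12% in Q3.",
        "The warranty lasts two years.",
        "Returns are accepted within 30 days.",
    ],
    embedding=emb,
)
for doc in store.similarity_search("How long is the warranty?", k=1):
    print(doc.page_content)  # "The warranty lasts two years."
```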
- Support multiple PDFs simultaneously with document selection
- Add hybrid search (semantic + keyword BM25) for better retrieval
- Implement conversation memory for multi-turn follow-up questions
- Evaluate retrieval quality with the RAGAS framework
- Add a reranker to improve chunk ranking
- Export Q&A sessions as a PDF report
MIT License — free to use, modify, and distribute.