arXiv Search • RAG (Chroma) • LangGraph Workflow • Streamlit UI • Gemini (LLM + Embeddings)
An end-to-end research assistant that turns a topic into a grounded, cited report or grounded Q&A by:
- searching arXiv for relevant papers
- downloading PDFs and extracting full text
- chunking + embedding and indexing in ChromaDB
- retrieving evidence (RAG)
- generating outputs that cite only the retrieved sources
- Two modes
  - Research & Report: generates a multi-section report with citations
  - Ask Questions: answers questions grounded in retrieved chunks (with citations)
- RAG with ChromaDB
  - stable chunk IDs + upserts (dedupe-safe)
  - deterministic source numbering to avoid citation drift
- LangGraph orchestration
  - conditional routing between report vs QA
  - optional retry/search when retrieval is weak
- Best-effort PDF ingestion
  - downloads PDFs when available
  - extracts + cleans text
  - section-aware chunking
- Quality helpers
  - report critique + lightweight scoring (with JSON repair fallback)
- Logging
  - console logs + rotating file logs: logs/report.log, logs/qa.log
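The "stable chunk IDs + upserts" and "deterministic source numbering" features listed above could look roughly like the sketch below. It is an illustration, not the project's actual code: `chunk_id`, `number_sources`, the ID layout, and the sample paper IDs are all hypothetical, and a plain dict stands in for the Chroma collection.

```python
import hashlib


def chunk_id(paper_id: str, chunk_index: int, text: str) -> str:
    """Derive an ID that is identical across runs for identical input (hypothetical layout)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{paper_id}:{chunk_index}:{digest}"


def number_sources(paper_ids: list[str]) -> dict[str, int]:
    """Map each distinct paper ID to a 1-based citation number, in sorted order."""
    return {pid: n for n, pid in enumerate(sorted(set(paper_ids)), start=1)}


# A plain dict stands in for the vector store: writing the same ID twice
# overwrites the entry instead of duplicating it, which is what an upsert does.
store: dict[str, str] = {}
for _ in range(2):  # simulate indexing the same paper twice
    for i, text in enumerate(["Intro text", "Method text"]):
        store[chunk_id("2401.00001", i, text)] = text

# Retrieval order varies between runs, but the numbering does not.
sources = number_sources(["2402.00002", "2401.00001", "2402.00002"])
```

With a real Chroma collection, IDs built this way would be passed to `collection.upsert(ids=..., documents=...)`, so re-ingesting a paper replaces its chunks rather than adding copies.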
- Search arXiv for the topic (top-N results)
- Create paper cards (structured summary from abstract)
- Full-text ingest (best effort): download PDF → extract text → clean → chunk
- Index chunks in Chroma (or fallback to abstract-only indexing)
- Retrieve top-k evidence chunks for the topic
- Generate report using only retrieved context + paper cards
- Critique + score the report
- Retrieve top-k evidence chunks for the question
- If retrieval is weak (optional): trigger search/ingest/index and retry retrieval
- Generate answer using only retrieved context and citations
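The Q&A steps above (retrieve, optionally re-ingest when retrieval is weak, then retrieve again) can be sketched in plain Python. This is a simplified stand-in for the LangGraph workflow, not its real definition: `is_weak`, `answer_flow`, the thresholds, and the toy retriever are assumptions made for illustration.

```python
from typing import Callable

Chunk = tuple[str, float]  # (text, similarity score in [0, 1])


def is_weak(chunks: list[Chunk], min_hits: int = 2, min_score: float = 0.5) -> bool:
    """Treat retrieval as weak when too few chunks clear the score threshold (assumed rule)."""
    return sum(1 for _, score in chunks if score >= min_score) < min_hits


def answer_flow(
    question: str,
    retrieve: Callable[[str], list[Chunk]],
    ingest: Callable[[str], None],
    allow_retry: bool = True,
) -> list[Chunk]:
    """Retrieve evidence; if it is weak, ingest more papers and retry once."""
    chunks = retrieve(question)
    if allow_retry and is_weak(chunks):
        ingest(question)  # stands for: search arXiv, ingest PDFs, index chunks
        chunks = retrieve(question)
    return chunks


# Toy stand-ins for the real retriever and ingest pipeline.
index: list[Chunk] = [("unrelated chunk", 0.2)]


def fake_retrieve(question: str) -> list[Chunk]:
    return sorted(index, key=lambda c: c[1], reverse=True)[:4]


def fake_ingest(question: str) -> None:
    index.extend([("relevant chunk A", 0.9), ("relevant chunk B", 0.8)])


evidence = answer_flow("What is RAG?", fake_retrieve, fake_ingest)
```

Capping the flow at a single retry keeps a question with no good sources from looping indefinitely; the answer step then works with whatever evidence was found.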
- Streamlit: UI
- LangGraph: workflow/state machine (report + QA)
- ChromaDB: vector store
- Google Gemini: text generation + embeddings
- arxiv Python client: paper discovery + metadata
- PDF ingest: download → extract → clean → chunk
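The "clean → chunk" half of the PDF ingest chain, combined with section-aware chunking, might look like the sketch below. PDF download and text extraction are omitted. The heading pattern, the `max_chars` limit, and the function names are assumptions, and the character-based split is deliberately crude.

```python
import re

# Assumed heading vocabulary; real papers will need a broader pattern.
HEADING = re.compile(
    r"^(?:\d+(?:\.\d+)*\s+)?"
    r"(Abstract|Introduction|Related Work|Methods?|Experiments|Results|Discussion|Conclusions?)\s*$",
    re.IGNORECASE,
)


def clean(text: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse runs of spaces."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return re.sub(r"[ \t]+", " ", text)


def chunk_by_section(text: str, max_chars: int = 200) -> list[dict]:
    """Split text at heading lines, then cut each section into pieces of at most max_chars."""
    chunks: list[dict] = []
    section = "Preamble"
    buf: list[str] = []

    def flush() -> None:
        body = " ".join(buf).strip()
        for start in range(0, len(body), max_chars):
            chunks.append({"section": section, "text": body[start:start + max_chars]})
        buf.clear()

    for line in clean(text).splitlines():
        match = HEADING.match(line.strip())
        if match:
            flush()  # close the previous section before starting a new one
            section = match.group(1).title()
        elif line.strip():
            buf.append(line.strip())
    flush()
    return chunks


sample = "1 Introduction\nRetrieval helps ground-\ning answers.\n2 Methods\nWe index   chunks."
result = chunk_by_section(sample)
```

Keeping the section name on every chunk lets retrieval results show where in the paper a passage came from.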
app/
  streamlit_app.py
core/
  arxiv_client.py
  citations.py
  config.py
  evaluation.py
  llm_gemini.py
  logging_config.py
  pdf_ingest.py
  rag_chroma.py
  schemas.py
  utils.py
graph/
  research_graph.py
Dockerfile
docker-compose.yml
requirements.txt
README.md
git clone <YOUR_GITHUB_REPO_URL>
cd <YOUR_REPO_FOLDER>
Create a local .env file in the repo root:
GOOGLE_API_KEY=your_key_here
Optional (recommended defaults when running with Docker Compose + Chroma server):
CHROMA_HTTP=1
CHROMA_HOST=chromadb
CHROMA_PORT=8000
ARXIV_MAX_RESULTS=20
DEFAULT_TOP_K=4
FAST_MODE=1
docker compose up --build
- Streamlit UI: http://localhost:8501
- Chroma server: http://localhost:8000
Note: If you open http://localhost:8000/ you may see {"detail":"Not Found"}. That's expected: Chroma's API routes are under /api/...
docker compose down
Local mode supports:
- embedded persistent Chroma (default when CHROMA_HTTP=0)
- Chroma server (when CHROMA_HTTP=1)
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt
Create .env:
GOOGLE_API_KEY=your_key_here
If using a Chroma server:
CHROMA_HTTP=1
CHROMA_HOST=localhost
CHROMA_PORT=8000
streamlit run app/streamlit_app.py
Required
- GOOGLE_API_KEY: Gemini API key
arXiv / retrieval
- ARXIV_MAX_RESULTS: number of arXiv results (default: 5)
- DEFAULT_TOP_K: number of retrieved chunks (default: 6)
Chroma
- CHROMA_HTTP: 1 to use Chroma server, 0 for embedded client
- CHROMA_HOST: host when CHROMA_HTTP=1 (Docker Compose uses chromadb)
- CHROMA_PORT: port when CHROMA_HTTP=1 (default: 8000)
- CHROMA_PERSIST_DIR: embedded persistence path (if CHROMA_HTTP=0)
PDF ingest
- PDF_CACHE_DIR: where PDFs are cached (default: data/pdfs)
Logging
- LOG_DIR: log directory (default: logs)
- LOG_CONSOLE_LEVEL: DEBUG|INFO|WARNING|ERROR (default: INFO)
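A loader for the variables above could be sketched as follows. It is not the contents of `core/config.py`: the `Settings` class and `load_settings` function are hypothetical. The defaults for `CHROMA_HTTP` (embedded) and `CHROMA_HOST` (localhost) are assumptions. `CHROMA_PERSIST_DIR` is left out because no default is documented.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    google_api_key: str
    arxiv_max_results: int
    default_top_k: int
    chroma_http: bool
    chroma_host: str
    chroma_port: int
    pdf_cache_dir: str
    log_dir: str
    log_console_level: str


def load_settings(env=os.environ) -> Settings:
    """Read settings from a mapping, falling back to the documented defaults."""
    key = env.get("GOOGLE_API_KEY", "")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is required")
    return Settings(
        google_api_key=key,
        arxiv_max_results=int(env.get("ARXIV_MAX_RESULTS", "5")),
        default_top_k=int(env.get("DEFAULT_TOP_K", "6")),
        chroma_http=env.get("CHROMA_HTTP", "0") == "1",   # assumed default: embedded client
        chroma_host=env.get("CHROMA_HOST", "localhost"),  # assumed default
        chroma_port=int(env.get("CHROMA_PORT", "8000")),
        pdf_cache_dir=env.get("PDF_CACHE_DIR", "data/pdfs"),
        log_dir=env.get("LOG_DIR", "logs"),
        log_console_level=env.get("LOG_CONSOLE_LEVEL", "INFO").upper(),
    )


# Passing a dict instead of os.environ makes the loader easy to try out.
settings = load_settings({"GOOGLE_API_KEY": "your_key_here", "CHROMA_HTTP": "1"})
```

Failing fast on a missing `GOOGLE_API_KEY` surfaces the problem at startup rather than at the first Gemini call.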
Logging goes to:
- console (INFO by default)
- logs/report.log for report runs
- logs/qa.log for QA runs
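The console-plus-rotating-file setup described above can be approximated with the standard library. This is a sketch, not the contents of `core/logging_config.py`: `build_logger`, the `research.` logger prefix, the log format, and the rotation limits are assumptions, and a temporary directory stands in for `LOG_DIR`.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_logger(name: str, log_dir: str, console_level: str = "INFO") -> logging.Logger:
    """Create a logger that writes to the console and to <log_dir>/<name>.log."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(f"research.{name}")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    if logger.handlers:  # already configured; avoid duplicate handlers
        return logger

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    console = logging.StreamHandler()
    console.setLevel(getattr(logging, console_level.upper()))
    console.setFormatter(fmt)

    # Rotation limits are illustrative assumptions.
    file_handler = RotatingFileHandler(
        os.path.join(log_dir, f"{name}.log"),
        maxBytes=1_000_000,
        backupCount=3,
        encoding="utf-8",
    )
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(fmt)

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger


log_dir = tempfile.mkdtemp()  # stand-in for LOG_DIR
report_log = build_logger("report", log_dir)
qa_log = build_logger("qa", log_dir)
report_log.info("report run started")
qa_log.debug("written to qa.log only; below the console level")
```

Using one logger per pipeline keeps report and Q&A runs in separate files, and the console stays at the configured level while the files capture DEBUG output.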
- Query understanding (LLM-assisted expansion): expand topics into subtopics/keywords to improve arXiv relevance across domains.
- Caching + incremental indexing: persist embeddings and skip re-indexing unchanged papers/chunks (paper_id + chunk hash).
- More robust PDF ingest: retries, alternate endpoints, better text extraction fallbacks, skip noisy pages, improved section detection.
- Evaluation & regression tests: add automated checks for citation mapping, retrieval formatting, scoring JSON, ingestion reliability.
- Observability: structured logs + latency metrics (embedding batches, token usage, failures), optional tracing.
- Deployment: GHCR publishing + GitHub Actions CI/CD, Kubernetes manifests/Helm chart, secrets management patterns.
- User features: save/export report, paper library/bookmarks, search within indexed papers, shareable links.
This project is licensed under the MIT License — see the LICENSE.txt file for details.
Contributions are welcome!
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature
- Make your changes (please include clear documentation / comments)
- Run tests (if applicable): pytest
- Submit a Pull Request with a detailed description of what you changed and why
For questions or issues:
- Review logs:
  - logs/report.log (report pipeline)
  - logs/qa.log (Q&A pipeline)
- Open a GitHub issue and include:
  - steps to reproduce
  - expected vs actual behavior
  - relevant log lines (redact secrets)
  - your environment (OS, Python version, Docker version)

