Autonomous Research Assistant

arXiv Search • RAG (Chroma) • LangGraph Workflow • Streamlit UI • Gemini (LLM + Embeddings)

An end-to-end research assistant that turns a topic into a grounded, cited report or grounded Q&A by:

1) searching arXiv for relevant papers
2) downloading PDFs and extracting full text
3) chunking + embedding and indexing in ChromaDB
4) retrieving evidence (RAG)
5) generating outputs that cite only the retrieved sources

Screenshots

Report tab (report_page.png)

Q&A tab (qa_page.png)

Features

Two modes
- Research & Report: generates a multi-section report with citations
- Ask Questions: answers questions grounded in retrieved chunks (with citations)
RAG with ChromaDB
- stable chunk IDs + upserts (dedupe-safe)
- deterministic source numbering to avoid citation drift
LangGraph orchestration
- conditional routing between report vs QA
- optional retry/search when retrieval is weak
Best-effort PDF ingestion
- downloads PDFs when available
- extracts + cleans text
- section-aware chunking
Quality helpers
- report critique + lightweight scoring (with JSON repair fallback)
Logging
- console logs + rotating file logs: logs/report.log, logs/qa.log
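
To make the "stable chunk IDs + upserts" and "deterministic source numbering" ideas concrete, here is a minimal sketch. It is not the code in core/rag_chroma.py or core/citations.py: the function names, the ID layout, and the choice to number sources by sorted paper ID are all assumptions for illustration, and a plain dict stands in for a Chroma collection (whose `upsert` method has the same overwrite-by-ID behavior).

```python
import hashlib


def chunk_id(paper_id: str, chunk_index: int, text: str) -> str:
    """Build an ID that is identical every time the same chunk is ingested."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    return f"{paper_id}:{chunk_index}:{digest}"


def upsert(store: dict, paper_id: str, chunks: list[str]) -> None:
    """Dict stand-in for a vector-store upsert: the same ID overwrites, never duplicates."""
    for i, text in enumerate(chunks):
        store[chunk_id(paper_id, i, text)] = {"paper_id": paper_id, "text": text}


def number_sources(hits: list[dict]) -> dict[str, int]:
    """Map each paper to a citation number, independent of retrieval order.

    Assumption: numbering follows sorted paper IDs.
    """
    paper_ids = sorted({hit["paper_id"] for hit in hits})
    return {pid: n for n, pid in enumerate(paper_ids, start=1)}


store: dict = {}
upsert(store, "2401.00001", ["Intro text.", "Method text."])
upsert(store, "2401.00001", ["Intro text.", "Method text."])  # re-ingest the same paper
print(len(store))  # 2
```

Because the ID depends only on the paper, the chunk position, and the chunk text, re-running ingestion does not grow the index, and the same set of retrieved papers always maps to the same citation numbers.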

How it works (high level)

Report flow

1) Search arXiv for the topic (top-N results)
2) Create paper cards (structured summary from abstract)
3) Full-text ingest (best effort): download PDF → extract text → clean → chunk
4) Index chunks in Chroma (or fallback to abstract-only indexing)
5) Retrieve top-k evidence chunks for the topic
6) Generate report using only retrieved context + paper cards
7) Critique + score the report
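
The critique step relies on the model returning JSON, which it does not always do cleanly. The sketch below shows one way a "JSON repair fallback" could work. It is an illustration, not the logic in core/evaluation.py: the function names and the `score` / `issues` fields are hypothetical, and the trailing-comma fix is a heuristic that could alter a string value containing `,}`.

```python
import json
import re


def _load_dict(text: str):
    """Return the parsed JSON object, or None if text is not a JSON object."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def parse_critique(raw: str) -> dict:
    """Parse a model reply as JSON, falling back to progressively looser repairs."""
    parsed = _load_dict(raw)
    if parsed is not None:
        return parsed
    # Repair 1: keep only the outermost {...} span (drops surrounding chatter).
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        # Repair 2: remove trailing commas before a closing } or ].
        repaired = re.sub(r",\s*([}\]])", r"\1", match.group(0))
        parsed = _load_dict(repaired)
        if parsed is not None:
            return parsed
    # Last resort: a neutral result so the pipeline can continue.
    return {"score": None, "issues": ["critique output was not valid JSON"]}


reply = 'Sure! {"score": 7, "issues": ["thin evidence",],} Hope that helps.'
print(parse_critique(reply))  # {'score': 7, 'issues': ['thin evidence']}
```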

Q&A flow

1) Retrieve top-k evidence chunks for the question
2) If retrieval is weak (optional): trigger search/ingest/index and retry retrieval
3) Generate an answer using only the retrieved context, with citations
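
The routing and retry behavior can be sketched in plain Python. This is an assumption-laden illustration rather than the graph in graph/research_graph.py: the state keys, node names, and thresholds (`min_hits`, `max_retries`) are hypothetical. In LangGraph, a function shaped like `next_step` is the kind of callable passed to `add_conditional_edges`.

```python
def next_step(state: dict, min_hits: int = 2, max_retries: int = 1) -> str:
    """Routing function: return the name of the node to run next."""
    if state["mode"] == "report":
        return "report"
    weak = len(state["hits"]) < min_hits
    if weak and state["retries"] < max_retries:
        return "search_and_ingest"
    return "answer"


def run_qa(question: str, retrieve, search_and_ingest) -> dict:
    """Drive the Q&A branch: retrieve, optionally ingest more papers, retrieve again."""
    state = {"mode": "qa", "question": question, "hits": retrieve(question), "retries": 0}
    while next_step(state) == "search_and_ingest":
        search_and_ingest(question)
        state["retries"] += 1
        state["hits"] = retrieve(question)
    return state


# Toy stand-ins: an empty index that gains two chunks after one search/ingest pass.
index: list[str] = []
state = run_qa(
    "What is RAG?",
    retrieve=lambda q: list(index),
    search_and_ingest=lambda q: index.extend(["chunk A", "chunk B"]),
)
print(state["retries"], state["hits"])  # 1 ['chunk A', 'chunk B']
```

Capping retries keeps the loop from searching arXiv indefinitely when a question has no good evidence.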

Tech stack

Streamlit: UI
LangGraph: workflow/state machine (report + QA)
ChromaDB: vector store
Google Gemini: text generation + embeddings
arxiv Python client: paper discovery + metadata
PDF ingest: download → extract → clean → chunk
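
As an illustration of the final "chunk" stage, the sketch below splits cleaned text on section headings and then windows each section separately. The heading pattern, chunk size, and overlap are assumptions, and the function names are hypothetical; core/pdf_ingest.py may detect sections differently.

```python
import re

# Assumed heading pattern: an optional section number, then a common section name alone on a line.
HEADING = re.compile(
    r"^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?"
    r"(abstract|introduction|related work|background|methods?|approach|"
    r"experiments?|results|discussion|conclusions?|references)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text: str) -> list[tuple[str, str]]:
    """Return (section_title, body) pairs; text before the first heading is 'preamble'."""
    sections = []
    title, start = "preamble", 0
    for match in HEADING.finditer(text):
        body = text[start:match.start()].strip()
        if body:
            sections.append((title, body))
        title, start = match.group(1).lower(), match.end()
    body = text[start:].strip()
    if body:
        sections.append((title, body))
    return sections


def chunk_sections(text: str, max_chars: int = 1200, overlap: int = 200) -> list[dict]:
    """Window each section on its own so that no chunk straddles two sections."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    step = max_chars - overlap
    for title, body in split_sections(text):
        for offset in range(0, len(body), step):
            chunks.append({"section": title, "text": body[offset:offset + max_chars]})
            if offset + max_chars >= len(body):
                break  # the window already reached the end of this section
    return chunks


paper = "Some title\n\n1 Introduction\nRAG grounds answers.\n\n2 Methods\nWe chunk by section.\n"
for chunk in chunk_sections(paper):
    print(chunk["section"], "->", chunk["text"])
```

Keeping the section name on each chunk lets retrieval results show where in the paper a passage came from.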

Repository layout

app/
  streamlit_app.py
core/
  arxiv_client.py
  citations.py
  config.py
  evaluation.py
  llm_gemini.py
  logging_config.py
  pdf_ingest.py
  rag_chroma.py
  schemas.py
  utils.py
graph/
  research_graph.py
Dockerfile
docker-compose.yml
requirements.txt
README.md

Quickstart (Docker Compose)

1) Clone

git clone <YOUR_GITHUB_REPO_URL>
cd <YOUR_REPO_FOLDER>

2) Create `.env`

Create a local .env file in the repo root:

GOOGLE_API_KEY=your_key_here

Optional (recommended settings when running with Docker Compose + Chroma server; these override the defaults listed under Configuration):

CHROMA_HTTP=1
CHROMA_HOST=chromadb
CHROMA_PORT=8000
ARXIV_MAX_RESULTS=20
DEFAULT_TOP_K=4
FAST_MODE=1

3) Start services

docker compose up --build

4) Open

Streamlit UI: http://localhost:8501
Chroma server: http://localhost:8000

Note: If you open http://localhost:8000/ you may see {"detail":"Not Found"}. That is expected, because Chroma's API routes are under /api/... rather than at the root path.

5) Stop

docker compose down

Quickstart (Local / virtualenv)

Local mode supports:

- embedded persistent Chroma (default when CHROMA_HTTP=0)
- Chroma server (when CHROMA_HTTP=1)
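
A rough sketch of how the two modes could be selected from the environment is shown below. `chromadb.HttpClient` and `chromadb.PersistentClient` are real chromadb constructors, but the helper names and the fallback values (`localhost`, `chroma_data`, and treating an unset CHROMA_HTTP as 0) are assumptions, not necessarily what core/rag_chroma.py does.

```python
import os


def chroma_target(env=os.environ) -> dict:
    """Decide which Chroma backend the settings point at (no network access)."""
    if env.get("CHROMA_HTTP", "0") == "1":
        return {
            "kind": "http",
            "host": env.get("CHROMA_HOST", "localhost"),  # assumed fallback
            "port": int(env.get("CHROMA_PORT", "8000")),
        }
    # Assumed fallback path for the embedded store.
    return {"kind": "embedded", "path": env.get("CHROMA_PERSIST_DIR", "chroma_data")}


def make_client(env=os.environ):
    """Create the matching chromadb client (imported lazily, only when called)."""
    import chromadb

    target = chroma_target(env)
    if target["kind"] == "http":
        return chromadb.HttpClient(host=target["host"], port=target["port"])
    return chromadb.PersistentClient(path=target["path"])


print(chroma_target({"CHROMA_HTTP": "1", "CHROMA_HOST": "chromadb"}))
# {'kind': 'http', 'host': 'chromadb', 'port': 8000}
```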

1) Install

python -m venv .venv

# Windows:
.venv\Scripts\activate

# macOS/Linux:
source .venv/bin/activate

pip install -r requirements.txt

2) Configure

Create .env:

GOOGLE_API_KEY=your_key_here

If using a Chroma server:

CHROMA_HTTP=1
CHROMA_HOST=localhost
CHROMA_PORT=8000

3) Run

streamlit run app/streamlit_app.py

Configuration

Environment variables

Required

GOOGLE_API_KEY — Gemini API key

arXiv / retrieval

ARXIV_MAX_RESULTS — number of arXiv results (default: 5)
DEFAULT_TOP_K — retrieved chunks (default: 6)

Chroma

CHROMA_HTTP — 1 to use Chroma server, 0 for embedded client
CHROMA_HOST — host when CHROMA_HTTP=1 (Docker Compose uses chromadb)
CHROMA_PORT — port when CHROMA_HTTP=1 (default: 8000)
CHROMA_PERSIST_DIR — embedded persistence path (if CHROMA_HTTP=0)

PDF ingest

PDF_CACHE_DIR — where PDFs are cached (default: data/pdfs)

Logging

LOG_DIR — log directory (default: logs)
LOG_CONSOLE_LEVEL — DEBUG|INFO|WARNING|ERROR (default: INFO)
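
For reference, the non-Chroma variables above could be read into a typed settings object along these lines. This is a sketch, not the contents of core/config.py: the `Settings` class and `load_settings` function are hypothetical names, and the defaults are the ones documented in this section.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    google_api_key: str
    arxiv_max_results: int = 5
    default_top_k: int = 6
    pdf_cache_dir: str = "data/pdfs"
    log_dir: str = "logs"
    log_console_level: str = "INFO"


def load_settings(env=os.environ) -> Settings:
    """Read settings from a mapping of environment variables, applying documented defaults."""
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is required")
    level = env.get("LOG_CONSOLE_LEVEL", "INFO").upper()
    if level not in {"DEBUG", "INFO", "WARNING", "ERROR"}:
        raise ValueError(f"Unsupported LOG_CONSOLE_LEVEL: {level}")
    return Settings(
        google_api_key=key,
        arxiv_max_results=int(env.get("ARXIV_MAX_RESULTS", "5")),
        default_top_k=int(env.get("DEFAULT_TOP_K", "6")),
        pdf_cache_dir=env.get("PDF_CACHE_DIR", "data/pdfs"),
        log_dir=env.get("LOG_DIR", "logs"),
        log_console_level=level,
    )


settings = load_settings({"GOOGLE_API_KEY": "your_key_here", "DEFAULT_TOP_K": "4"})
print(settings.default_top_k, settings.arxiv_max_results)  # 4 5
```

Failing fast on a missing key surfaces a configuration problem at startup instead of at the first Gemini call.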

Logging

Logging goes to:

console (INFO by default)
logs/report.log for report runs
logs/qa.log for QA runs
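
A setup like this can be built with the standard library's `RotatingFileHandler`. The sketch below is one possible arrangement, not the code in core/logging_config.py: the function name, logger naming scheme, message format, and rotation limits (1 MB, 3 backups) are assumptions.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


def get_pipeline_logger(name: str, log_dir: str = "logs", console_level: str = "INFO") -> logging.Logger:
    """Return a logger that writes to the console and to <log_dir>/<name>.log with rotation."""
    logger = logging.getLogger(f"research.{name}")
    if logger.handlers:
        # Already configured; avoids duplicate handlers when the app reruns the script.
        return logger

    os.makedirs(log_dir, exist_ok=True)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    console = logging.StreamHandler()
    console.setLevel(console_level)
    console.setFormatter(fmt)

    # Assumed rotation limits: 1 MB per file, 3 backups kept.
    file_handler = RotatingFileHandler(
        os.path.join(log_dir, f"{name}.log"),
        maxBytes=1_000_000,
        backupCount=3,
        encoding="utf-8",
    )
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(fmt)

    logger.setLevel(logging.DEBUG)
    logger.addHandler(console)
    logger.addHandler(file_handler)
    logger.propagate = False  # keep messages out of the root logger
    return logger
```

With the default arguments, `get_pipeline_logger("report")` and `get_pipeline_logger("qa")` would write to logs/report.log and logs/qa.log respectively.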

Future Improvements

Query understanding (LLM-assisted expansion): expand topics into subtopics/keywords to improve arXiv relevance across domains.
Caching + incremental indexing: persist embeddings and skip re-indexing unchanged papers/chunks (paper_id + chunk hash).
More robust PDF ingest: retries, alternate endpoints, better text extraction fallbacks, skip noisy pages, improved section detection.
Evaluation & regression tests: add automated checks for citation mapping, retrieval formatting, scoring JSON, ingestion reliability.
Observability: structured logs + latency metrics (embedding batches, token usage, failures), optional tracing.
Deployment: GHCR publishing + GitHub Actions CI/CD, Kubernetes manifests/Helm chart, secrets management patterns.
User features: save/export report, paper library/bookmarks, search within indexed papers, shareable links.

License

This project is licensed under the MIT License; see the LICENSE file for details.

Contributing

Contributions are welcome!

1) Fork the repository
2) Create a feature branch: `git checkout -b feature/your-feature`
3) Make your changes (please include clear documentation / comments)
4) Run tests (if applicable): `pytest`
5) Submit a Pull Request with a detailed description of what you changed and why

Support

For questions or issues:

Review logs:
- logs/report.log (report pipeline)
- logs/qa.log (Q&A pipeline)
Open a GitHub issue and include:
- steps to reproduce
- expected vs actual behavior
- relevant log lines (redact secrets)
- your environment (OS, Python version, Docker version)
