ahnaf015/autonomous-research-assistant

Autonomous Research Assistant

arXiv Search • RAG (Chroma) • LangGraph Workflow • Streamlit UI • Gemini (LLM + Embeddings)

An end-to-end research assistant that turns a topic into a grounded, cited report or grounded Q&A by:

  • searching arXiv for relevant papers
  • downloading PDFs and extracting full text
  • chunking + embedding and indexing in ChromaDB
  • retrieving evidence (RAG)
  • generating outputs that cite only the retrieved sources

Screenshots

Report tab

Report tab screenshot

Q&A tab

Q&A tab screenshot


Features

  • Two modes
    • Research & Report: generates a multi-section report with citations
    • Ask Questions: answers questions grounded in retrieved chunks (with citations)
  • RAG with ChromaDB
    • stable chunk IDs + upserts (dedupe-safe)
    • deterministic source numbering to avoid citation drift
  • LangGraph orchestration
    • conditional routing between report vs QA
    • optional retry/search when retrieval is weak
  • Best-effort PDF ingestion
    • downloads PDFs when available
    • extracts + cleans text
    • section-aware chunking
  • Quality helpers
    • report critique + lightweight scoring (with JSON repair fallback)
  • Logging
    • console logs + rotating file logs: logs/report.log, logs/qa.log
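
As an illustration of the "stable chunk IDs + upserts" idea above, here is a minimal sketch in plain Python. It is not the code in core/rag_chroma.py: the `chunk_id` and `upsert` helpers, the ID format, and the example paper ID are all assumptions, and a dict stands in for the Chroma collection. The point it shows is that when an ID depends only on the paper, the chunk position, and the chunk text, re-ingesting the same paper overwrites entries instead of duplicating them.

```python
import hashlib


def chunk_id(paper_id: str, chunk_index: int, text: str) -> str:
    """Derive a stable ID from the paper, the chunk position, and the chunk text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{paper_id}:{chunk_index}:{digest}"


def upsert(store: dict, paper_id: str, chunks: list) -> None:
    """Insert or overwrite chunks keyed by stable ID (the dict is a stand-in store)."""
    for i, text in enumerate(chunks):
        store[chunk_id(paper_id, i, text)] = text


store = {}
chunks = ["Intro text", "Method text"]      # hypothetical chunk contents
upsert(store, "2401.00001", chunks)         # hypothetical paper ID
upsert(store, "2401.00001", chunks)         # re-ingesting the same paper
print(len(store))  # 2
```

Because the second call produces the same keys, the store still holds two entries.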

How it works (high level)

Report flow

  1. Search arXiv for the topic (top-N results)
  2. Create paper cards (structured summary from abstract)
  3. Full-text ingest (best effort): download PDF → extract text → clean → chunk
  4. Index chunks in Chroma (or fall back to abstract-only indexing)
  5. Retrieve top-k evidence chunks for the topic
  6. Generate report using only retrieved context + paper cards
  7. Critique + score the report
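
Step 7 relies on the model returning JSON, which can arrive wrapped in extra text or with trailing commas. The sketch below shows one possible shape for a "JSON repair fallback". The function names, the repair steps, and the fallback value are assumptions for illustration, not the logic in core/evaluation.py.

```python
import json
import re


def _try_load(text: str):
    """Return a dict if text parses as a JSON object, else None."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def parse_score_json(raw: str) -> dict:
    """Parse model output as JSON, trying progressively looser repairs."""
    parsed = _try_load(raw)
    if parsed is not None:
        return parsed
    # Repair 1: keep only the outermost {...} span (drops surrounding chatter).
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        # Repair 2: remove trailing commas before a closing brace or bracket.
        candidate = re.sub(r",\s*([}\]])", r"\1", match.group(0))
        parsed = _try_load(candidate)
        if parsed is not None:
            return parsed
    # Last resort: a neutral result so the pipeline can continue.
    return {"score": None, "error": "unparseable"}


raw = 'Score follows: {"score": 7, "notes": ["ok",],} Hope that helps.'
print(parse_score_json(raw))  # {'score': 7, 'notes': ['ok']}
```

The trailing-comma regex is deliberately naive (it would also touch a ",}" inside a string value), which is acceptable for a best-effort fallback but not for general JSON handling.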

Q&A flow

  1. Retrieve top-k evidence chunks for the question
  2. If retrieval is weak (optional): trigger search/ingest/index and retry retrieval
  3. Generate answer using only retrieved context and citations
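
The retry in step 2 can be pictured as a small loop around retrieval. This is a plain-Python sketch, not the LangGraph graph in graph/research_graph.py. The `is_weak` heuristic, its thresholds, and the `retrieve` / `ingest` callables are assumptions made for the example.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (chunk text, similarity score in [0, 1])


def is_weak(hits: List[Hit], min_hits: int = 2, min_score: float = 0.5) -> bool:
    """Treat retrieval as weak when too few hits clear the score threshold."""
    strong = [h for h in hits if h[1] >= min_score]
    return len(strong) < min_hits


def retrieve_with_retry(
    question: str,
    retrieve: Callable[[str], List[Hit]],
    ingest: Callable[[str], None],
    max_retries: int = 1,
) -> List[Hit]:
    """Retrieve; if the result is weak, run search/ingest/index and retrieve again."""
    hits = retrieve(question)
    for _ in range(max_retries):
        if not is_weak(hits):
            break
        ingest(question)
        hits = retrieve(question)
    return hits


# Toy stand-ins for the vector store and the search/ingest/index step.
index: List[Hit] = [("unrelated chunk", 0.2)]


def fake_retrieve(question: str) -> List[Hit]:
    return sorted(index, key=lambda h: h[1], reverse=True)[:4]


def fake_ingest(question: str) -> None:
    index.extend([("relevant chunk A", 0.9), ("relevant chunk B", 0.8)])


hits = retrieve_with_retry("what is RAG?", fake_retrieve, fake_ingest)
print([text for text, _ in hits])
# ['relevant chunk A', 'relevant chunk B', 'unrelated chunk']
```

In a LangGraph workflow the same decision would typically live in a conditional edge that routes either to the ingest node or straight to answer generation.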

Tech stack

  • Streamlit: UI
  • LangGraph: workflow/state machine (report + QA)
  • ChromaDB: vector store
  • Google Gemini: text generation + embeddings
  • arxiv Python client: paper discovery + metadata
  • PDF ingest: download → extract → clean → chunk

Repository layout

app/
  streamlit_app.py
core/
  arxiv_client.py
  citations.py
  config.py
  evaluation.py
  llm_gemini.py
  logging_config.py
  pdf_ingest.py
  rag_chroma.py
  schemas.py
  utils.py
graph/
  research_graph.py
Dockerfile
docker-compose.yml
requirements.txt
README.md

Quickstart (Docker Compose)

1) Clone

git clone <YOUR_GITHUB_REPO_URL>
cd <YOUR_REPO_FOLDER>

2) Create .env

Create a local .env file in the repo root:

GOOGLE_API_KEY=your_key_here

Optional (recommended values when running with Docker Compose and the Chroma server; they override the built-in defaults listed under Configuration):

CHROMA_HTTP=1
CHROMA_HOST=chromadb
CHROMA_PORT=8000
ARXIV_MAX_RESULTS=20
DEFAULT_TOP_K=4
FAST_MODE=1

3) Start services

docker compose up --build

4) Open

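Open the Streamlit UI in your browser. Streamlit listens on port 8501 by default; check docker-compose.yml for the port actually published.

Note: If you open http://localhost:8000/ you may see {"detail":"Not Found"}. That is expected: Chroma's API routes are under /api/....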

5) Stop

docker compose down

Quickstart (Local / virtualenv)

Local mode supports:

  • embedded persistent Chroma (default when CHROMA_HTTP=0)
  • Chroma server (when CHROMA_HTTP=1)

1) Install

python -m venv .venv

# Windows:
.venv\Scripts\activate

# macOS/Linux:
source .venv/bin/activate

pip install -r requirements.txt

2) Configure

Create .env:

GOOGLE_API_KEY=your_key_here

If using a Chroma server:

CHROMA_HTTP=1
CHROMA_HOST=localhost
CHROMA_PORT=8000

3) Run

streamlit run app/streamlit_app.py

Configuration

Environment variables

Required

  • GOOGLE_API_KEY — Gemini API key

arXiv / retrieval

  • ARXIV_MAX_RESULTS — number of arXiv results (default: 5)
  • DEFAULT_TOP_K — retrieved chunks (default: 6)

Chroma

  • CHROMA_HTTP=1 to use Chroma server, CHROMA_HTTP=0 for embedded client
  • CHROMA_HOST — host when CHROMA_HTTP=1 (Docker Compose uses chromadb)
  • CHROMA_PORT — port when CHROMA_HTTP=1 (default: 8000)
  • CHROMA_PERSIST_DIR — embedded persistence path (if CHROMA_HTTP=0)

PDF ingest

  • PDF_CACHE_DIR — where PDFs are cached (default: data/pdfs)

Logging

  • LOG_DIR — directory for log files (default: logs)
  • LOG_CONSOLE_LEVEL=DEBUG|INFO|WARNING|ERROR (default: INFO)
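
For reference, the variables above could be read into a typed settings object along these lines. This is a sketch, not the contents of core/config.py. The `Settings` class and `load_settings` function are hypothetical, and the defaults for CHROMA_HTTP (0) and CHROMA_HOST (localhost) are assumptions, since this README does not state them.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    google_api_key: str
    arxiv_max_results: int
    default_top_k: int
    chroma_http: bool
    chroma_host: str
    chroma_port: int
    chroma_persist_dir: Optional[str]
    pdf_cache_dir: str
    log_dir: str
    log_console_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is required")
    return Settings(
        google_api_key=key,
        arxiv_max_results=int(env.get("ARXIV_MAX_RESULTS", "5")),
        default_top_k=int(env.get("DEFAULT_TOP_K", "6")),
        chroma_http=env.get("CHROMA_HTTP", "0") == "1",    # assumed default: 0
        chroma_host=env.get("CHROMA_HOST", "localhost"),   # assumed default
        chroma_port=int(env.get("CHROMA_PORT", "8000")),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR"),  # no documented default
        pdf_cache_dir=env.get("PDF_CACHE_DIR", "data/pdfs"),
        log_dir=env.get("LOG_DIR", "logs"),
        log_console_level=env.get("LOG_CONSOLE_LEVEL", "INFO").upper(),
    )


settings = load_settings({"GOOGLE_API_KEY": "your_key_here"})
print(settings.arxiv_max_results, settings.default_top_k)  # 5 6
```

Passing a mapping instead of reading `os.environ` directly keeps the loader easy to test.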

Logging

Logging goes to:

  • console (INFO by default)
  • logs/report.log for report runs
  • logs/qa.log for QA runs
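
A setup like the one described can be built with the standard library alone. The sketch below is not the code in core/logging_config.py. The `get_pipeline_logger` helper, the logger naming, and the rotation limits (`maxBytes`, `backupCount`) are assumed values for illustration.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


def get_pipeline_logger(name: str, log_dir: str = "logs",
                        console_level: str = "INFO") -> logging.Logger:
    """Return a logger that writes to <log_dir>/<name>.log and to the console."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(f"research.{name}")
    if logger.handlers:  # already configured; avoid duplicate handlers
        return logger
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    # Rotate at roughly 1 MB and keep three old files (assumed limits).
    file_handler = RotatingFileHandler(
        os.path.join(log_dir, f"{name}.log"),
        maxBytes=1_000_000, backupCount=3, encoding="utf-8",
    )
    file_handler.setFormatter(fmt)

    console = logging.StreamHandler()
    console.setLevel(console_level)  # level names such as "INFO" are accepted
    console.setFormatter(fmt)

    logger.addHandler(file_handler)
    logger.addHandler(console)
    return logger


if __name__ == "__main__":
    get_pipeline_logger("report").info("report run started")
    get_pipeline_logger("qa").info("qa run started")
```

Using one logger per pipeline keeps report and Q&A runs in separate files while sharing the same console output.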

Future Improvements

  • Query understanding (LLM-assisted expansion): expand topics into subtopics/keywords to improve arXiv relevance across domains.
  • Caching + incremental indexing: persist embeddings and skip re-indexing unchanged papers/chunks (paper_id + chunk hash).
  • More robust PDF ingest: retries, alternate endpoints, better text extraction fallbacks, skip noisy pages, improved section detection.
  • Evaluation & regression tests: add automated checks for citation mapping, retrieval formatting, scoring JSON, ingestion reliability.
  • Observability: structured logs + latency metrics (embedding batches, token usage, failures), optional tracing.
  • Deployment: GHCR publishing + GitHub Actions CI/CD, Kubernetes manifests/Helm chart, secrets management patterns.
  • User features: save/export report, paper library/bookmarks, search within indexed papers, shareable links.

License

This project is licensed under the MIT License — see the LICENSE.txt file for details.


Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch
    git checkout -b feature/your-feature
  3. Make your changes (please include clear documentation / comments)
  4. Run tests (if applicable)
    pytest
  5. Submit a Pull Request with a detailed description of what you changed and why

Support

For questions or issues:

  • Review logs:
    • logs/report.log (report pipeline)
    • logs/qa.log (Q&A pipeline)
  • Open a GitHub issue and include:
    • steps to reproduce
    • expected vs actual behavior
    • relevant log lines (redact secrets)
    • your environment (OS, Python version, Docker version)
