Replace gpt-researcher with a minimal in-house research engine #2

@mgoldsborough

Summary

Replace the gpt-researcher dependency with a ~300–400 line in-house research engine. Our actual runtime path uses a small fraction of what gpt-researcher ships, and the transitive dep graph is what's driving the bundle bloat tracked in #1.

This is follow-on work to #1 (.mcpbignore trims, done in 53ffc73, saved ~6 MB uncompressed). That issue's .mcpbignore fixes were banked; pruning individual vendored packages was judged too fragile. This issue is the durable alternative: stop depending on gpt-researcher entirely.

What we actually use from gpt-researcher

I traced the import path against our vendored deps/. In our config (RETRIEVER=tavily, Anthropic for all LLM slots, report_source=web), the runtime hits exactly five responsibilities:

  1. Plan — LLM turns the user query into 3–5 sub-queries
  2. Search — Tavily API per sub-query → URLs + cleaned text
  3. Chunk + embed — tiktoken + OpenAI text-embedding-3-small on retrieved content
  4. Rank — cosine similarity, top-k chunks per sub-query
  5. Write — LLM writes the markdown report from compressed context
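Step 4 is small enough to sketch without a vector store. The following is a minimal illustration in pure Python, assuming embeddings arrive as plain lists of floats. The names `cosine` and `top_k` are hypothetical and not taken from the existing code.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[str]:
    """Return the k chunk texts most similar to query_vec, best first.

    `chunks` is a sequence of (text, embedding) pairs.
    """
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```

For example, `top_k([1.0, 0.0], [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])], k=2)` ranks `"a"` ahead of `"c"` and drops `"b"`.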

Everything else in gpt-researcher (multi-retriever fallbacks, litellm routing, local doc loaders, vector store integrations, browser/nodriver scraping, hybrid/Azure/LangChainDocuments report sources, the DocumentLoader chain that drags in unstructured/spacy/thinc/blis at import time) is never reached.

Scope

Add mcp_research/engine/ with:

| Module | Rough size | Purpose |
| --- | --- | --- |
| planner.py | ~50 lines | Query → sub-queries via Anthropic |
| search.py | ~30 lines | Tavily wrapper (use include_raw_content=True so we don't need our own scraper) |
| chunker.py | ~80 lines | tiktoken-based chunking + OpenAI embeddings |
| retriever.py | ~30 lines | Top-k cosine similarity |
| writer.py | ~50 lines | Context → markdown report via Anthropic |
| orchestrator.py | ~100 lines | The run loop with progress callbacks |
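One possible shape for the orchestrator's run loop is sketched below. It assumes each stage is injected as a plain callable so the loop can be tested without network access. All type aliases, the `Engine` class, the progress fractions, and the synchronous style are assumptions made for illustration (the real worker may well be async), and the chunk/embed and rank steps are collapsed into a single `rank` callable for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stage signatures, chosen for this sketch only.
Planner = Callable[[str], List[str]]                # query -> sub-queries
Searcher = Callable[[str], List[str]]               # sub-query -> raw page texts
Ranker = Callable[[str, Sequence[str]], List[str]]  # sub-query, texts -> top chunks
Writer = Callable[[str, Sequence[str]], str]        # query, context -> markdown
Progress = Callable[[str, float], None]             # stage label, fraction done


@dataclass
class Engine:
    plan: Planner
    search: Searcher
    rank: Ranker
    write: Writer

    def run(self, query: str, on_progress: Progress = lambda stage, frac: None) -> str:
        on_progress("plan", 0.0)
        sub_queries = self.plan(query)

        # Searching and ranking take the first 80% of the progress bar.
        context: List[str] = []
        for i, sub in enumerate(sub_queries):
            on_progress(f"search: {sub}", 0.8 * i / len(sub_queries))
            texts = self.search(sub)
            context.extend(self.rank(sub, texts))

        on_progress("write", 0.8)
        report = self.write(query, context)
        on_progress("done", 1.0)
        return report
```

Under this design, worker.py would pass its existing progress-streaming function as `on_progress`, and tests would construct an `Engine` from fakes instead of patching network clients.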

Existing src/mcp_research/worker.py stays — it already handles entity updates, progress streaming to ctx, timeouts, cancel/failure transitions, and the orphan reaper. Only the GPTResearcher(...) / conduct_research() / write_report() calls get swapped.

Dep footprint

| Package | Size | Purpose |
| --- | --- | --- |
| anthropic | 8 MB | planner + writer |
| openai | 12 MB | embeddings |
| tavily-python | <1 MB | search |
| tiktoken | 3 MB | chunking |
| numpy | 22 MB | cosine sim (optional; pure Python works at 1536-dim) |
| common transitive (httpx, pydantic) | ~5 MB | already required |

Estimated total: ~50 MB uncompressed / ~15 MB compressed. Down from 541 MB / 166 MB at 0.1.0 — roughly 10× smaller.
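The arithmetic behind those figures can be checked quickly. This sketch uses only the sizes quoted in this issue, rounding tavily-python's "<1 MB" up to 1 MB (an assumption); the variable names are illustrative.

```python
# Sizes in MB from the dep footprint table; "<1 MB" is rounded up to 1.
deps_mb = {
    "anthropic": 8,
    "openai": 12,
    "tavily-python": 1,
    "tiktoken": 3,
    "numpy": 22,
    "common transitive": 5,
}

total = sum(deps_mb.values())             # 51, matching the "~50 MB" estimate
without_numpy = total - deps_mb["numpy"]  # 29 if cosine sim is done in pure Python

baseline_uncompressed, baseline_compressed = 541, 166  # 0.1.0 bundle
est_uncompressed, est_compressed = 50, 15              # estimates from this issue

print(total, without_numpy)                                # 51 29
print(round(baseline_uncompressed / est_uncompressed, 1))  # 10.8
print(round(baseline_compressed / est_compressed, 1))      # 11.1
```

Dropping numpy would take the uncompressed estimate from roughly 51 MB to roughly 29 MB, which is the largest single lever left in the table.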

What we lose (and don't use)

  • Multi-retriever support (DDG, Bing, Google, SerpAPI, …) — we hardcode Tavily
  • Multi-LLM routing via litellm — we hardcode Anthropic for fast/smart/strategic slots
  • Local document loaders (PDF, docx, md, csv, xlsx) — we run web-only
  • Vector store integrations (FAISS, Chroma, Pinecone, …) — in-memory is fine for per-run context
  • Browser scraping (nodriver, playwright) — Tavily include_raw_content replaces this
  • Hybrid/Azure/LangChainDocuments report sources — we only use web
  • Report-type variants (outline, detailed, resource report, …) — we only use research_report

Risks

  1. Report quality regression. gpt-researcher's prompts are tuned from many real user runs; ours won't be on day one. This is the primary validation gate. Mitigation: borrow prompt structure from gpt-researcher (Apache 2.0; confirm the license before copying) and run a 10-query eval harness comparing both pipelines side by side.
  2. PDF URLs from Tavily. We'll depend on Tavily's advanced search returning usable content for PDF results. If raw_content is weak for PDFs, fall back to pypdf (5 MB, pure Python — not pymupdf's 51 MB C extension).
  3. Maintenance. ~300–400 lines we own vs chasing gpt-researcher version bumps. The net effect is probably a wash or favorable.

Plan

  1. Feature branch engine/in-house
  2. Implement the six modules above; keep worker.py's public contract unchanged
  3. Write an eval harness: run 10 canned research queries through both engines, diff reports on length, source coverage, factual accuracy, readability
  4. If parity → ship as 0.2.0 with gpt-researcher fully removed
  5. If quality gap → tune prompts once or bail (revisit targeted dep pruning as plan B)
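The automatable half of the eval harness in step 3 could look roughly like the sketch below. It assumes each engine can be wrapped as a `query -> markdown report` callable and uses two cheap proxies: word count for length and distinct cited URLs for source coverage. Factual accuracy and readability still need human or LLM review and are not covered here. All names are hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, List

ReportFn = Callable[[str], str]  # query -> markdown report (assumed interface)

# Rough URL matcher; stops at whitespace and common closing brackets.
URL_RE = re.compile(r"https?://[^\s)\]>]+")


def report_metrics(report: str) -> Dict[str, int]:
    """Cheap proxies: word count and number of distinct cited URLs."""
    return {
        "words": len(report.split()),
        "sources": len(set(URL_RE.findall(report))),
    }


def compare(queries: Iterable[str], baseline: ReportFn, candidate: ReportFn) -> List[dict]:
    """Run every query through both engines and record per-metric deltas."""
    rows = []
    for q in queries:
        base = report_metrics(baseline(q))
        cand = report_metrics(candidate(q))
        rows.append({
            "query": q,
            "baseline": base,
            "candidate": cand,
            "delta": {k: cand[k] - base[k] for k in base},
        })
    return rows
```

A negative `delta["sources"]` across most of the 10 queries would be an early signal of the quality gap described in step 5.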

Acceptance criteria

  • mcp_research/engine/ module implementing all five responsibilities
  • gpt-researcher, langchain-anthropic, and their transitive chain removed from pyproject.toml
  • Existing test suite passes with the FakeGPTR fixture replaced by an equivalent fake engine
  • New eval harness comparing engine vs baseline on 10 queries, checked in under tests/eval/
  • Bundle size <25 MB compressed / <75 MB uncompressed
  • start_research smoke test passes end-to-end on a vanilla agent-platform pod
  • Prompts documented in docs/prompts.md (or inline with provenance if lifted from gpt-researcher)
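The bundle-size criterion is easy to enforce mechanically, for example in CI. A minimal sketch follows, assuming the unpacked bundle directory and the built archive are both available on disk and that 1 MB means 1024 × 1024 bytes (the issue does not say which convention applies). The function names and defaults are illustrative.

```python
import os
from typing import List

MB = 1024 * 1024  # assumption: binary megabytes


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def check_budget(
    unpacked_dir: str,
    archive_path: str,
    max_uncompressed_mb: float = 75,
    max_compressed_mb: float = 25,
) -> List[str]:
    """Return human-readable violations; an empty list means within budget."""
    problems = []
    uncompressed_mb = dir_size_bytes(unpacked_dir) / MB
    compressed_mb = os.path.getsize(archive_path) / MB
    if uncompressed_mb >= max_uncompressed_mb:
        problems.append(f"uncompressed {uncompressed_mb:.1f} MB >= {max_uncompressed_mb} MB")
    if compressed_mb >= max_compressed_mb:
        problems.append(f"compressed {compressed_mb:.1f} MB >= {max_compressed_mb} MB")
    return problems
```

A CI step could fail the build whenever `check_budget(...)` returns a non-empty list.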
