Summary
Replace the gpt-researcher dependency with a ~300–400 line in-house research engine. Our actual runtime path uses a small fraction of what gpt-researcher ships, and the transitive dep graph is what's driving the bundle bloat tracked in #1.
This is follow-on work to #1 (.mcpbignore trims, done in 53ffc73, saved ~6 MB uncompressed). That issue's .mcpbignore fixes were banked; pruning individual vendored packages was judged too fragile. This issue is the durable alternative: stop depending on gpt-researcher entirely.
What we actually use from gpt-researcher
I traced the import path against our vendored deps/. In our config (RETRIEVER=tavily, Anthropic for all LLM slots, report_source=web), the runtime hits exactly five responsibilities:
- Plan: LLM turns the user query into 3–5 sub-queries
- Search: Tavily API per sub-query → URLs + cleaned text
- Chunk + embed: tiktoken + OpenAI text-embedding-3-small on retrieved content
- Rank: cosine similarity, top-k chunks per sub-query
- Write: LLM writes the markdown report from compressed context
Everything else in gpt-researcher (multi-retriever fallbacks, litellm routing, local doc loaders, vector store integrations, browser/nodriver scraping, hybrid/Azure/LangChainDocuments report sources, the DocumentLoader chain that drags in unstructured → spacy/thinc/blis at import time) is never reached.
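As a rough illustration of how small that runtime path is, the five responsibilities can be wired together as a single loop. This is a minimal sketch, not the gpt-researcher implementation: `Chunk`, `run_research`, and every stage signature are hypothetical names chosen for illustration, and chunk + embed is folded into the `rank` callable to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Chunk:
    url: str
    text: str


# Each stage is injected as a plain callable so the loop can be exercised
# without network access. All signatures here are assumptions.
Planner = Callable[[str], Sequence[str]]                     # query -> sub-queries
Searcher = Callable[[str], Sequence[Chunk]]                  # sub-query -> pages
Ranker = Callable[[str, Sequence[Chunk]], Sequence[Chunk]]   # chunk + embed + top-k
Writer = Callable[[str, Sequence[Chunk]], str]               # context -> markdown
Progress = Callable[[str, float], None]                      # (message, fraction)


def run_research(
    query: str,
    plan: Planner,
    search: Searcher,
    rank: Ranker,
    write: Writer,
    on_progress: Progress = lambda message, fraction: None,
) -> str:
    """Run plan -> search -> rank -> write, reporting progress along the way."""
    on_progress("planning", 0.0)
    sub_queries = list(plan(query))
    context: List[Chunk] = []
    for i, sub in enumerate(sub_queries, start=1):
        pages = search(sub)
        context.extend(rank(sub, pages))
        # Reserve the last 20% of the progress range for the write step.
        on_progress(f"searched: {sub}", 0.8 * i / len(sub_queries))
    report = write(query, context)
    on_progress("done", 1.0)
    return report
```

Passing the stages in as callables is one possible design; it would let a fake engine in tests supply canned planners and searchers without touching Tavily, OpenAI, or Anthropic.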
Scope
Add mcp_research/engine/ with:
| Module | Rough size | Purpose |
| --- | --- | --- |
| planner.py | ~50 lines | Query → sub-queries via Anthropic |
| search.py | ~30 lines | Tavily wrapper (use include_raw_content=True so we don't need our own scraper) |
| chunker.py | ~80 lines | tiktoken-based chunking + OpenAI embeddings |
| retriever.py | ~30 lines | Top-k cosine similarity |
| writer.py | ~50 lines | Context → markdown report via Anthropic |
| orchestrator.py | ~100 lines | The run loop with progress callbacks |
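
To give a feel for the chunker.py row, here is a minimal sketch of token-window chunking with overlap. The window size, the overlap, the function names, and the `cl100k_base` encoding choice are all assumptions for illustration; the windowing logic is kept separate from tiktoken so it can be tested on plain integer lists.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], size: int = 512, overlap: int = 64) -> List[List[int]]:
    """Split a token sequence into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


def chunk_text(text: str, size: int = 512, overlap: int = 64) -> List[str]:
    """Token-aware chunking of `text` via tiktoken (imported lazily)."""
    import tiktoken  # assumed installed; the encoding name below is an assumption

    enc = tiktoken.get_encoding("cl100k_base")
    return [enc.decode(ids) for ids in chunk_tokens(enc.encode(text), size, overlap)]
```

Each returned string would then be sent to the embeddings endpoint; batching and retry policy are left out of this sketch.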
Existing src/mcp_research/worker.py stays — it already handles entity updates, progress streaming to ctx, timeouts, cancel/failure transitions, and the orphan reaper. Only the GPTResearcher(...) / conduct_research() / write_report() calls get swapped.
Dep footprint
| Package | Size | Purpose |
| --- | --- | --- |
| anthropic | 8 MB | planner + writer |
| openai | 12 MB | embeddings |
| tavily-python | <1 MB | search |
| tiktoken | 3 MB | chunking |
| numpy | 22 MB | cosine sim (optional; pure Python works at 1536-dim) |
| common transitive (httpx, pydantic) | ~5 MB | already required |
Estimated total: ~50 MB uncompressed / ~15 MB compressed. Down from 541 MB / 166 MB at 0.1.0 — roughly 10× smaller.
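
On the "numpy is optional" point: a pure-Python top-k cosine ranker fits in a few lines using only the standard library. This is a sketch under the assumption that a run ranks at most a few thousand 1536-dim vectors; the function names are hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k chunks most similar to the query, best first."""
    scored = ((i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

If profiling later shows ranking to be a bottleneck, swapping in numpy would only touch these two functions.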
What we lose (and don't use)
- Multi-retriever support (DDG, Bing, Google, SerpAPI, …): we hardcode Tavily
- Multi-LLM routing via litellm: we hardcode Anthropic for fast/smart/strategic slots
- Local document loaders (PDF, docx, md, csv, xlsx): we run web-only
- Vector store integrations (FAISS, Chroma, Pinecone, …): in-memory is fine for per-run context
- Browser scraping (nodriver, playwright): Tavily include_raw_content replaces this
- Hybrid/Azure/LangChainDocuments report sources: we only use web
- Report-type variants (outline, detailed, resource report, …): we only use research_report
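
For the scraper replacement specifically, a search.py wrapper could look roughly like the sketch below. `Page`, `make_client`, and `search_pages` are hypothetical names; the only real API assumed is `TavilyClient.search` with `search_depth`, `include_raw_content`, and `max_results`, returning a dict with a `results` list. The client is passed in so the function can be tested with a fake.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Page:
    url: str
    title: str
    text: str


def make_client(api_key: str) -> Any:
    """Build a Tavily client; imported lazily so tests need no SDK installed."""
    from tavily import TavilyClient

    return TavilyClient(api_key=api_key)


def search_pages(client: Any, query: str, max_results: int = 5) -> List[Page]:
    """Run one sub-query and return pages that came back with usable text."""
    response = client.search(
        query,
        search_depth="advanced",
        include_raw_content=True,
        max_results=max_results,
    )
    pages: List[Page] = []
    for hit in response.get("results", []):
        # raw_content may be missing or None; fall back to the short snippet.
        text = hit.get("raw_content") or hit.get("content") or ""
        if text:
            pages.append(Page(url=hit["url"], title=hit.get("title", ""), text=text))
    return pages
```

Dropping hits with no text at all is a design choice in this sketch, not something the source prescribes; the alternative is to keep them and let the ranker score them low.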
Risks
- Report quality regression. gpt-researcher's prompts are tuned from many real user runs; ours won't be on day one. This is the primary validation gate. Mitigation: borrow prompt structure from gpt-researcher (Apache 2.0; confirm before copying) and run a 10-query eval harness comparing both pipelines side by side.
- PDF URLs from Tavily. We'll depend on Tavily's advanced search returning usable content for PDF results. If raw_content is weak for PDFs, fall back to pypdf (5 MB, pure Python), not pymupdf (51 MB C extension).
- Maintenance. ~300–400 lines we own vs chasing gpt-researcher version bumps. Net is probably a wash or favorable.
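
The PDF fallback could be as small as the sketch below. What counts as "weak" is not defined in this issue, so the `min_chars` threshold is purely an assumption, as are the function names and the injected `fetch` callable (anything that returns the PDF bytes for a URL). The pypdf calls used are `PdfReader` and `page.extract_text()`.

```python
import io
from typing import Callable, Optional


def extract_pdf_text(data: bytes) -> str:
    """Extract plain text from PDF bytes with pypdf (imported lazily)."""
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(data))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def page_text(
    url: str,
    raw_content: Optional[str],
    fetch: Callable[[str], bytes],
    min_chars: int = 500,  # assumed threshold for "weak" content
    extract: Callable[[bytes], str] = extract_pdf_text,
) -> str:
    """Prefer Tavily's raw_content; re-extract PDFs locally only when it is weak."""
    text = raw_content or ""
    looks_like_pdf = url.lower().split("?", 1)[0].endswith(".pdf")
    if looks_like_pdf and len(text) < min_chars:
        try:
            extracted = extract(fetch(url))
        except Exception:
            # Download or parse failure: keep whatever Tavily returned.
            return text
        if len(extracted) > len(text):
            return extracted
    return text
```

Detecting PDFs by URL suffix is a simplification; checking the response Content-Type would be more robust but needs a fetch for every result.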
Plan
- Feature branch engine/in-house
- Implement the six modules above; keep worker.py's public contract unchanged
- Write an eval harness: run 10 canned research queries through both engines, diff reports on length, source coverage, factual accuracy, readability
- If parity → ship as 0.2.0 with gpt-researcher fully removed
- If quality gap → tune prompts once or bail (revisit targeted dep pruning as plan B)
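
The mechanical half of the eval harness (length and source coverage) can be sketched as follows; factual accuracy and readability would still need a human or LLM judge and are not covered here. The engine callables, the row layout, and the URL regex are assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

URL_RE = re.compile(r"https?://[^\s)\]>]+")


@dataclass
class ReportStats:
    words: int
    sources: int  # distinct URLs cited in the report


def report_stats(markdown: str) -> ReportStats:
    """Count whitespace-separated words and distinct cited URLs in a report."""
    urls = {u.rstrip(".,;") for u in URL_RE.findall(markdown)}
    return ReportStats(words=len(markdown.split()), sources=len(urls))


def compare_engines(
    queries: Iterable[str],
    baseline: Callable[[str], str],   # e.g. the gpt-researcher pipeline
    candidate: Callable[[str], str],  # e.g. the in-house engine
) -> List[Dict[str, object]]:
    """Run every query through both engines and tabulate the cheap metrics."""
    rows: List[Dict[str, object]] = []
    for query in queries:
        old = report_stats(baseline(query))
        new = report_stats(candidate(query))
        rows.append({
            "query": query,
            "words_old": old.words,
            "words_new": new.words,
            "sources_old": old.sources,
            "sources_new": new.sources,
            "source_delta": new.sources - old.sources,
        })
    return rows
```

The resulting rows can be dumped to CSV or printed as a table for the side-by-side review.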
Acceptance criteria
- mcp_research/engine/ module implementing all five responsibilities
- gpt-researcher, langchain-anthropic, and their transitive chain removed from pyproject.toml
- FakeGPTR fixture replaced by an equivalent fake engine
- Eval harness in tests/eval/
- start_research smoke test passes end-to-end on a vanilla agent-platform pod
- Prompts documented in docs/prompts.md (or inline with provenance if lifted from gpt-researcher)
Related
- #1: the .mcpbignore fixes from that issue are already in. This issue is the durable followup.