Tyler Blaine Hall edited this page Apr 25, 2025 · 3 revisions

The ast-mcp-server repo is already a solid proof of concept: it wraps Tree-sitter parsing and a lightweight abstract semantic graph (ASG) builder in a Model Context Protocol (MCP) service, so any Claude or other LLM client can pull structural graphs on demand. The design lines up with current best practice: FastMCP scaffolding for tool exposure, Tree-sitter for incremental ASTs, and simple graph edges that can later feed structure-aware models such as GraphCodeBERT.


What’s working well

Clean MCP surface

  • Uses FastMCP so each analysis function is a self-describing “tool,” instantly discoverable by any MCP-aware agent.
  • Caches resources by ast://<hash> and asg://<hash>, which keeps prompts small and encourages re-use across chat turns.
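
To make the caching idea concrete, here is a minimal sketch of content-addressed resource URIs. It is not the repo's implementation: the function names, the in-memory store, and the 16-character SHA-256 prefix are all assumptions chosen for illustration.

```python
import hashlib
import json

# Hypothetical in-memory store; the real server may keep resources elsewhere.
_RESOURCES: dict[str, str] = {}


def resource_uri(kind: str, code: str, language: str) -> str:
    """Derive a stable URI such as ast://<hash> from the language and source text."""
    key = (language + "\0" + code).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()[:16]  # assumed prefix length
    return f"{kind}://{digest}"


def cache_resource(kind: str, code: str, language: str, payload: dict) -> str:
    """Store the payload once and return the short URI to hand back to the client."""
    uri = resource_uri(kind, code, language)
    # setdefault keeps the first payload, so identical input is never stored twice.
    _RESOURCES.setdefault(uri, json.dumps(payload))
    return uri


def read_resource(uri: str) -> dict:
    """Resolve a previously issued URI back to its payload."""
    return json.loads(_RESOURCES[uri])
```

Because the URI depends only on the language and the source text, a client that resubmits the same snippet in a later chat turn gets the same short handle back instead of a second copy of the tree.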

Pragmatic AST pipeline

  • Tree-sitter is the right call for multi-language, incremental parsing; it’s battle-tested in IDEs and can update on every keystroke.
  • build_parsers.py builds the C parsers locally, avoiding binary wheels and keeping install friction low.

Early—but extensible—ASG builder

  • Edges for definitions and references resemble the data-flow edges that GraphCodeBERT uses as a pre-training signal.
  • The separation of parse_code_to_ast → create_asg_from_ast mirrors the Semantic Code Graph literature, making it easy to switch to richer schemas later.
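
The two-stage split can be sketched as follows. This is an illustration only: it swaps in Python's standard-library ast module for Tree-sitter so it runs without extra dependencies, and the names parse_stage and graph_stage, along with the node and edge schema, are hypothetical rather than the repo's actual signatures.

```python
import ast


def parse_stage(code: str) -> dict:
    """Stage 1: turn source text into a plain-dict syntax tree."""
    def to_dict(node: ast.AST) -> dict:
        children = [
            to_dict(child)
            for child in ast.iter_child_nodes(node)
            if not isinstance(child, ast.expr_context)  # skip Load/Store markers
        ]
        out = {"type": type(node).__name__, "children": children}
        if isinstance(node, ast.Name):
            out["name"] = node.id
            out["role"] = "def" if isinstance(node.ctx, ast.Store) else "ref"
            out["line"] = node.lineno
        return out

    return to_dict(ast.parse(code))


def graph_stage(tree: dict) -> dict:
    """Stage 2: derive reference -> definition edges from the dict tree alone."""
    nodes, edges, last_def = [], [], {}

    def walk(node: dict) -> None:
        if "name" in node:
            node_id = len(nodes)
            nodes.append({"id": node_id, "name": node["name"],
                          "role": node["role"], "line": node["line"]})
            if node["role"] == "def":
                last_def[node["name"]] = node_id
            elif node["name"] in last_def:
                # Naive rule: link a read to the most recent binding of that name.
                edges.append({"source": node_id,
                              "target": last_def[node["name"]],
                              "type": "reference"})
        for child in node["children"]:
            walk(child)

    walk(tree)
    return {"nodes": nodes, "edges": edges}
```

Since graph_stage only sees the dict tree, a richer edge schema can replace it without touching the parser stage. The linking rule here is deliberately flat and ignores scopes.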

LLM integration friendly

  • README ships a ready-made Claude desktop config snippet; few open MCP repos do this yet.

Gaps & concrete improvements

Area | Why it matters | Suggested action
Packaging & CI | Easier adoption and repeatability | Add a pyproject.toml, publish to TestPyPI, and wire up GitHub Actions for lint + unit tests
Edge completeness | Data/control-flow edges are partial; complex scopes may resolve incorrectly | Keep a scope stack while walking the AST; look at the graph-based semantics paper for multi-level token→stmt→graph capture
Performance | Large repos will exhaust memory when you JSON-dump whole ASTs | Offer a “diff” mode that returns only changed sub-trees; Tree-sitter can give edit ranges natively
Security | The server runs parsers on arbitrary user input | Sandbox parsing in a restricted process or use seccomp/py-seccomp for Linux targets
Testing corpus | Ensures language coverage and guards refactors | Pull tiny fixtures from GitHub’s corpus-manager samples or Rosetta Code and add pytest golden files
Docs & examples | Drives contributions | Expand the examples/ folder: show a full round-trip where Claude asks for “refactor functions over 20 LOC” and the tool replies with positions
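
The scope-stack suggestion in the "Edge completeness" row could look roughly like this. It is a sketch under assumptions, not the repo's code: it uses Python's standard-library ast module in place of Tree-sitter, the ScopeResolver class is hypothetical, and it ignores global/nonlocal, classes, comprehensions, and default arguments.

```python
import ast


class ScopeResolver(ast.NodeVisitor):
    """Resolve each name read to the innermost enclosing scope that binds it."""

    def __init__(self) -> None:
        self.scopes = [{}]   # stack of {name: binding line}; index 0 is module scope
        self.resolved = []   # (name, use_line, def_line or None)

    def _bind(self, name: str, line: int) -> None:
        self.scopes[-1][name] = line

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self._bind(node.name, node.lineno)  # the function name lives in the outer scope
        # Push a new scope seeded with the positional parameters.
        self.scopes.append({a.arg: node.lineno for a in node.args.args})
        for stmt in node.body:
            self.visit(stmt)
        self.scopes.pop()

    def visit_Assign(self, node: ast.Assign) -> None:
        self.visit(node.value)              # the right-hand side is read before binding
        for target in node.targets:
            self.visit(target)

    def visit_Name(self, node: ast.Name) -> None:
        if isinstance(node.ctx, ast.Store):
            self._bind(node.id, node.lineno)
            return
        for scope in reversed(self.scopes):  # innermost scope wins
            if node.id in scope:
                self.resolved.append((node.id, node.lineno, scope[node.id]))
                return
        self.resolved.append((node.id, node.lineno, None))  # unresolved (e.g. a builtin)
```

With the stack in place, a parameter that shadows a module-level variable resolves to the parameter inside the function and to the module binding outside it, which a flat name-to-definition map gets wrong.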

Road-map ideas

  1. Graph storage back-end
    Persist ASGs in Neo4j or DuckDB and expose a query tool (Cypher for Neo4j, SQL for DuckDB) so LLMs can answer “Which functions mutate global state?” on large codebases.

  2. LLM-guided repair
    Pipe ASG slices into a fine-tuned GraphCodeBERT or Mistral-7B-Code to generate safe patches automatically; send the diff back through MCP.

  3. Language-Server (LSP) bridge
    Create an LSP that proxies to your MCP server; devs would get semantic diagnostics in VS Code while the same graphs feed the chat agent.

  4. Streaming mode
    Upgrade FastMCP handlers to support server-sent events so the client sees partial ASTs/ASGs as soon as they’re ready—useful for real-time copilots.

  5. Benchmark harness
    Integrate the CodeQL Benchmark suite or Defects4J to track “bugs fixed per minute” as you evolve the ASG heuristics.
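
As a rough illustration of the graph storage idea in item 1, the sketch below persists one kind of edge and answers the “Which functions mutate global state?” question with a query. Everything here is an assumption made for the example: it uses the standard-library sqlite3 module as a stand-in for Neo4j or DuckDB, Python's ast module instead of Tree-sitter, a hypothetical edges table, and a narrow definition of mutation (a name declared `global` and then assigned inside the function).

```python
import ast
import sqlite3


def global_writes(code: str):
    """Yield (function, variable) pairs for names declared `global` and then assigned."""
    for fn in ast.walk(ast.parse(code)):
        if not isinstance(fn, ast.FunctionDef):
            continue
        declared = {name for g in ast.walk(fn)
                    if isinstance(g, ast.Global) for name in g.names}
        for node in ast.walk(fn):
            if (isinstance(node, ast.Name)
                    and isinstance(node.ctx, ast.Store)
                    and node.id in declared):
                yield fn.name, node.id


def build_store(code: str) -> sqlite3.Connection:
    """Persist the edges in an in-memory database (a stand-in for a graph store)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE edges (source TEXT, kind TEXT, target TEXT)")
    db.executemany(
        "INSERT INTO edges VALUES (?, 'mutates_global', ?)",
        sorted(set(global_writes(code))),
    )
    return db


def functions_mutating_globals(db: sqlite3.Connection) -> list[str]:
    """The query a storage-backed MCP tool might run on behalf of the LLM."""
    rows = db.execute(
        "SELECT DISTINCT source FROM edges "
        "WHERE kind = 'mutates_global' ORDER BY source"
    )
    return [row[0] for row in rows]
```

Mutation through method calls such as appending to a module-level list is not detected by this rule; catching that would need the richer data-flow edges discussed earlier.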


Verdict

For an initial drop, ast-mcp-server nails the essentials: lightweight, language-agnostic parsing exposed through a forward-looking protocol that big tooling vendors (Anthropic, Replit, Sourcegraph) are converging on. Tighten up packaging, flesh out semantic edges, and add CI tests