14 changes: 7 additions & 7 deletions README.md
@@ -4,7 +4,7 @@
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)
[![CI](https://img.shields.io/github/actions/workflow/status/DeusData/codebase-memory-mcp/dry-run.yml?label=CI)](https://github.com/DeusData/codebase-memory-mcp/actions/workflows/dry-run.yml)
[![Tests](https://img.shields.io/badge/tests-5604_passing-brightgreen)](https://github.com/DeusData/codebase-memory-mcp)
[![Languages](https://img.shields.io/badge/languages-158-orange)](https://github.com/DeusData/codebase-memory-mcp)
[![Languages](https://img.shields.io/badge/languages-159-orange)](https://github.com/DeusData/codebase-memory-mcp)
[![Hybrid LSP](https://img.shields.io/badge/Hybrid_LSP-9_languages-blue)](#hybrid-lsp)
[![Agents](https://img.shields.io/badge/agents-11-purple)](https://github.com/DeusData/codebase-memory-mcp)
[![Pure C](https://img.shields.io/badge/pure_C-zero_dependencies-blue)](https://github.com/DeusData/codebase-memory-mcp)
@@ -16,7 +16,7 @@

**The fastest and most efficient code intelligence engine for AI coding agents.** Full-indexes an average repository in milliseconds, the Linux kernel (28M LOC, 75K files) in 3 minutes. Answers structural queries in under 1ms. Ships as a single static binary for macOS, Linux, and Windows — download, run `install`, done.

High-quality parsing through [tree-sitter](https://tree-sitter.github.io/tree-sitter/) AST analysis across all 158 languages, enhanced with [**Hybrid LSP** semantic type resolution](#hybrid-lsp) for Python, TypeScript / JavaScript / JSX / TSX, PHP, C#, Go, C, C++, Java, Kotlin, and Rust — producing a persistent knowledge graph of functions, classes, call chains, HTTP routes, and cross-service links. 14 MCP tools. Zero dependencies. Plug and play across 11 coding agents.
High-quality parsing through [tree-sitter](https://tree-sitter.github.io/tree-sitter/) AST analysis across all 159 languages, enhanced with [**Hybrid LSP** semantic type resolution](#hybrid-lsp) for Python, TypeScript / JavaScript / JSX / TSX, PHP, C#, Go, C, C++, Java, Kotlin, and Rust — producing a persistent knowledge graph of functions, classes, call chains, HTTP routes, and cross-service links. 14 MCP tools. Zero dependencies. Plug and play across 11 coding agents.

> **Research** — The design and benchmarks behind this project are described in the preprint [*Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP*](https://arxiv.org/abs/2603.27277) (arXiv:2603.27277). Evaluated across 31 real-world repositories: 83% answer quality, 10× fewer tokens, 2.1× fewer tool calls vs. file-by-file exploration.

@@ -32,7 +32,7 @@ High-quality parsing through [tree-sitter](https://tree-sitter.github.io/tree-si

- **Extreme indexing speed** — Linux kernel (28M LOC, 75K files) in 3 minutes. RAM-first pipeline: LZ4 compression, in-memory SQLite, fused Aho-Corasick pattern matching. Memory released after indexing.
- **Plug and play** — single static binary for macOS (arm64/amd64), Linux (arm64/amd64), and Windows (amd64). No Docker, no runtime dependencies, no API keys. Download → `install` → restart agent → done.
- **158 languages** — vendored tree-sitter grammars compiled into the binary. Nothing to install, nothing that breaks.
- **159 languages** — vendored tree-sitter grammars compiled into the binary. Nothing to install, nothing that breaks.
- **120x fewer tokens** — 5 structural queries: ~3,400 tokens vs ~412,000 via file-by-file search. One graph query replaces dozens of grep/read cycles.
- **11 agents, one command** — `install` auto-detects Claude Code, Codex CLI, Gemini CLI, Zed, OpenCode, Antigravity, Aider, KiloCode, VS Code, OpenClaw, and Kiro — configures MCP entries, instruction files, and pre-tool hooks for each.
- **Built-in graph visualization** — 3D interactive UI at `localhost:9749` (optional UI binary variant).
@@ -174,7 +174,7 @@ Removes all agent configs, skills, hooks, and instructions. Does not remove the
- `SEMANTICALLY_RELATED` (vocabulary-mismatch, same-language, score ≥ 0.80)

### Indexing pipeline
- **158 vendored tree-sitter grammars** compiled into the binary
- **159 vendored tree-sitter grammars** compiled into the binary
- **Generic package / module resolution** — bare specifiers like `@myorg/pkg`, `github.com/foo/bar`, `use my_crate::foo` resolved via manifest scanning (`package.json`, `go.mod`, `Cargo.toml`, `pyproject.toml`, `composer.json`, `pubspec.yaml`, `pom.xml`, `build.gradle`, `mix.exs`, `*.gemspec`)
- **Infrastructure-as-code indexing** — Dockerfiles, Kubernetes manifests, Kustomize overlays as graph nodes
- **[Hybrid LSP semantic type resolution](#hybrid-lsp)** for Python, TypeScript / JavaScript / JSX / TSX, PHP, C#, Go, C, C++, Java, Kotlin, and Rust — a lightweight C implementation of language type-resolution algorithms, structurally inspired by and compatible with major language servers including tsserver / typescript-go, pyright, gopls, Roslyn, Eclipse JDT, and rust-analyzer (parameter binding, return-type inference, generic substitution, JSX component dispatch, JSDoc inference for plain JS files, namespace + trait + late-static-binding resolution for PHP, file-scoped namespaces + records + LINQ method syntax for C#, class-hierarchy + overload + lambda resolution for Java, extension-function + scope-function resolution for Kotlin, trait-method + UFCS resolution for Rust)
@@ -542,14 +542,14 @@ codebase-memory-mcp ships a **lightweight C implementation of language type-reso

**Two-layer architecture:**

1. **Tree-sitter pass** — fast, syntactic, runs for every one of the 158 languages. Extracts definitions, calls, imports.
1. **Tree-sitter pass** — fast, syntactic, runs for every one of the 159 languages. Extracts definitions, calls, imports.
2. **Hybrid LSP pass** — type-aware, runs above the tree-sitter pass per-language. Refines call edges using the import graph plus a per-file or pre-built cross-file definition registry. Languages without a Hybrid LSP pass yet fall back to textual resolution, so you always get *some* answer.

The result is a knowledge graph accurate enough to drive `trace_path` across packages, inheritance hierarchies, and stdlib calls — without paying for a language server process per project.

## Language Support

158 languages, all parsed via vendored tree-sitter grammars compiled into the binary. Benchmarked against 64 real open-source repositories (78 to 49K nodes):
159 languages, all parsed via vendored tree-sitter grammars compiled into the binary. Benchmarked against 64 real open-source repositories (78 to 49K nodes):

| Tier | Score | Languages |
|------|-------|-----------|
@@ -574,7 +574,7 @@ src/
traces/ Runtime trace ingestion
ui/ Embedded HTTP server + 3D graph visualization
foundation/ Platform abstractions (threads, filesystem, logging, memory)
internal/cbm/ Vendored tree-sitter grammars (158 languages) + AST extraction engine
internal/cbm/ Vendored tree-sitter grammars (159 languages) + AST extraction engine
```

## Security
1 change: 1 addition & 0 deletions internal/cbm/cbm.h
@@ -170,6 +170,7 @@ typedef enum {
CBM_LANG_QML, // Qt QML (Qt Modeling Language — declarative UI + embedded JS)
CBM_LANG_CFSCRIPT, // CFML script dialect (.cfc components — Lucee/ColdFusion)
CBM_LANG_CFML, // CFML tag dialect (.cfm templates — Lucee/ColdFusion)
CBM_LANG_MOJO, // Mojo (Modular — Python-superset systems language; .mojo / .🔥)
CBM_LANG_COUNT
} CBMLanguage;
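
The new entry sits immediately before `CBM_LANG_COUNT`, which appears to act as a sentinel that sizes the per-language tables (`lang_specs[CBM_LANG_COUNT]`, `LANG_NAMES[CBM_LANG_COUNT]`). A minimal sketch of that pattern follows, assuming a three-entry enum. Every `DEMO_*` and `demo_*` name is hypothetical and not part of the project.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-entry enum; the real CBMLanguage has many more entries. */
typedef enum {
    DEMO_LANG_PYTHON,
    DEMO_LANG_CFML,
    DEMO_LANG_MOJO, /* new entries go immediately before the sentinel */
    DEMO_LANG_COUNT /* sentinel: always equals the number of languages */
} DemoLanguage;

/* Sized by the sentinel, so the table grows with the enum. With designated
 * initializers, any entry that was forgotten is left as NULL. */
static const char *DEMO_LANG_NAMES[DEMO_LANG_COUNT] = {
    [DEMO_LANG_PYTHON] = "Python",
    [DEMO_LANG_CFML] = "CFML",
    [DEMO_LANG_MOJO] = "Mojo",
};

/* Returns the index of the first language without a display name,
 * or DEMO_LANG_COUNT when every entry is filled in. */
int demo_first_unnamed_lang(void) {
    for (int i = 0; i < DEMO_LANG_COUNT; i++) {
        if (DEMO_LANG_NAMES[i] == NULL) {
            return i;
        }
    }
    return DEMO_LANG_COUNT;
}

/* Bounds-checked lookup of a display name. */
const char *demo_lang_name(DemoLanguage lang) {
    assert((int)lang >= 0 && (int)lang < DEMO_LANG_COUNT);
    return DEMO_LANG_NAMES[lang];
}
```

A completeness check such as `demo_first_unnamed_lang() == DEMO_LANG_COUNT` would catch an enum entry that was added without a matching table row. Whether the project runs a check like this is not shown in the diff.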

19 changes: 19 additions & 0 deletions internal/cbm/lang_specs.c
@@ -164,6 +164,7 @@ extern const TSLanguage *tree_sitter_apex(void);
extern const TSLanguage *tree_sitter_soql(void);
extern const TSLanguage *tree_sitter_sosl(void);
extern const TSLanguage *tree_sitter_pine(void);
extern const TSLanguage *tree_sitter_mojo(void);

// -- Empty sentinel --
static const char *empty_types[] = {NULL};
Expand Down Expand Up @@ -205,6 +206,18 @@ static const char *py_var_types[] = {"assignment", "augmented_assignment", NULL}
static const char *py_throw_types[] = {"raise_statement", NULL};
static const char *py_decorator_types[] = {"decorator", NULL};

// ==================== MOJO ====================
// Mojo (Modular) is a Python superset. The lsh/tree-sitter-mojo grammar is
// forked from tree-sitter-python, so its node types mirror Python's except
// for class-like definitions: "struct" and "class" both parse as
// class_definition, while "trait" and the "__extension" form get their own
// nodes. The spec therefore reuses the py_* arrays and overrides only the
// class types. ("fn" and "def" both parse as function_definition;
// compile-time "alias NAME = value" has no dedicated node and is recovered
// as an `assignment`, so it falls under py_var_types like ordinary `var`
// fields.)
static const char *mojo_class_types[] = {"class_definition", "trait_definition",
"extension_definition", NULL};

// ==================== JAVASCRIPT ====================
static const char *js_func_types[] = {"function_declaration", "generator_function_declaration",
"function_expression", "arrow_function",
@@ -2039,6 +2052,12 @@ static const CBMLangSpec lang_specs[CBM_LANG_COUNT] = {
empty_types, empty_types, NULL, empty_types, NULL, NULL, tree_sitter_cfml,
NULL},

// CBM_LANG_MOJO (Python-derived; reuses py_* arrays, only class types differ)
[CBM_LANG_MOJO] = {CBM_LANG_MOJO, py_func_types, mojo_class_types, empty_types, py_module_types,
py_call_types, py_import_types, py_import_from_types, py_branch_types,
py_var_types, py_var_types, py_throw_types, NULL, py_decorator_types,
py_env_funcs, py_env_members, tree_sitter_mojo, NULL},

// CBM_LANG_GLEAM
[CBM_LANG_GLEAM] = {CBM_LANG_GLEAM, gleam_func_types, gleam_class_types, gleam_field_types,
gleam_module_types, gleam_call_types, gleam_import_types, empty_types,
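
The Mojo spec differs from Python's only in its class-type array. The sketch below illustrates how a NULL-terminated node-type array of this shape could be matched against a tree-sitter node type, and why the override matters. The helper `demo_type_in_list` and the contents of `demo_py_class_types` are assumptions made for illustration; the project's actual matching code is not part of this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed Python class types (the real py_class_types is not shown in the diff). */
static const char *demo_py_class_types[] = {"class_definition", NULL};

/* Same contents as mojo_class_types in the diff. */
static const char *demo_mojo_class_types[] = {"class_definition", "trait_definition",
                                              "extension_definition", NULL};

/* Hypothetical helper: returns 1 if node_type appears in the
 * NULL-terminated list, 0 otherwise. */
int demo_type_in_list(const char *const *list, const char *node_type) {
    assert(list != NULL && node_type != NULL);
    for (size_t i = 0; list[i] != NULL; i++) {
        if (strcmp(list[i], node_type) == 0) {
            return 1;
        }
    }
    return 0;
}
```

Under these assumptions, `demo_type_in_list(demo_mojo_class_types, "trait_definition")` returns 1, while the same query against `demo_py_class_types` returns 0. Reusing the Python class array for Mojo would therefore leave traits and extensions unclassified.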
15 changes: 15 additions & 0 deletions scripts/new-languages.json
@@ -1291,5 +1291,20 @@
"filenames": [],
"has_scanner": true,
"module_root": "source_file"
},
{
"name": "mojo",
"enum": "MOJO",
"display": "Mojo",
"ts_func": "tree_sitter_mojo",
"repo": "https://github.com/lsh/tree-sitter-mojo",
"subdir": "",
"extensions": [
".mojo",
".🔥"
],
"filenames": [],
"has_scanner": true,
"module_root": "module"
}
]
5 changes: 5 additions & 0 deletions src/discover/language.c
@@ -187,6 +187,10 @@ static const ext_entry_t EXT_TABLE[] = {
/* Meson */
{".meson", CBM_LANG_MESON},

/* Mojo — .mojo and the .🔥 (U+1F525) fire-emoji extension */
{".mojo", CBM_LANG_MOJO},
{".🔥", CBM_LANG_MOJO},

/* Nix */
{".nix", CBM_LANG_NIX},

@@ -835,6 +839,7 @@ static const char *LANG_NAMES[CBM_LANG_COUNT] = {
[CBM_LANG_APEX] = "Apex",
[CBM_LANG_SOQL] = "SOQL",
[CBM_LANG_SOSL] = "SOSL",
[CBM_LANG_MOJO] = "Mojo",

};
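
The `.🔥` extension is an ordinary byte string once encoded as UTF-8 (U+1F525 becomes `F0 9F 94 A5`), so a byte-wise comparison handles it with no Unicode-specific code. The sketch below shows one way such a table lookup might work. The `demo_*` and `DEMO_*` names are hypothetical, and the project's real lookup function is not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    DEMO_EXT_LANG_UNKNOWN,
    DEMO_EXT_LANG_MESON,
    DEMO_EXT_LANG_MOJO,
    DEMO_EXT_LANG_NIX
} DemoExtLang;

typedef struct {
    const char *ext;
    DemoExtLang lang;
} demo_ext_entry_t;

static const demo_ext_entry_t DEMO_EXT_TABLE[] = {
    {".meson", DEMO_EXT_LANG_MESON},
    {".mojo", DEMO_EXT_LANG_MOJO},
    {".\xF0\x9F\x94\xA5", DEMO_EXT_LANG_MOJO}, /* ".🔥": U+1F525 as UTF-8 bytes */
    {".nix", DEMO_EXT_LANG_NIX},
};

/* Simplified lookup: takes everything from the last '.' in the path and
 * compares it byte-for-byte against each table entry. */
DemoExtLang demo_lang_for_path(const char *path) {
    assert(path != NULL);
    const char *dot = strrchr(path, '.');
    if (dot == NULL) {
        return DEMO_EXT_LANG_UNKNOWN;
    }
    size_t n = sizeof(DEMO_EXT_TABLE) / sizeof(DEMO_EXT_TABLE[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(dot, DEMO_EXT_TABLE[i].ext) == 0) {
            return DEMO_EXT_TABLE[i].lang;
        }
    }
    return DEMO_EXT_LANG_UNKNOWN;
}
```

The emoji is written as hex escapes here so the example does not depend on the source file's encoding. A literal `".🔥"` is equivalent when the file is saved as UTF-8.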

1 change: 1 addition & 0 deletions tests/test_grammar_labels.c
@@ -102,6 +102,7 @@ static const LabelGolden LABEL_GOLDENS[] = {
{"swift", "Class:1,Function:1,Module:1"},
{"scala", "Class:1,Function:1,Method:1,Module:1"},
{"gdscript", "Function:1,Module:1"},
{"mojo", "Class:1,Function:1,Interface:1,Method:1,Module:1,Variable:1"},
{"groovy", "Class:1,Method:1,Module:1"},
{"zig", "Function:2,Module:1"},
{"solidity", "Class:1,Function:1,Method:1,Module:1"},
8 changes: 8 additions & 0 deletions tests/test_grammar_regression.c
@@ -100,6 +100,14 @@ const GrammarCase CBM_GRAMMAR_CASES[] = {
{"swift", CBM_LANG_SWIFT, "a.swift", "func foo() {}\nclass A {}\n", 2, {"foo", "A", NULL}},
{"scala", CBM_LANG_SCALA, "a.scala", "object A {\n def foo() = 1\n}\n", 1, {"A", NULL}},
{"gdscript", CBM_LANG_GDSCRIPT, "a.gd", "func foo():\n pass\n", 1, {"foo", NULL}},
/* Mojo: fn/def -> function, struct -> class, trait -> interface */
{"mojo",
CBM_LANG_MOJO,
"a.mojo",
"fn foo() -> Int:\n return 1\nstruct Bar:\n var x: Int\ntrait Baz:\n fn m(self): "
"...\n",
3,
{"foo", "Bar", "Baz", NULL}},
{"groovy", CBM_LANG_GROOVY, "a.groovy", "class A {\n def foo() {}\n}\n", 1, {"A", NULL}},
{"zig", CBM_LANG_ZIG, "a.zig", "fn foo() void {}\nfn bar() void {}\n", 1, {"foo", NULL}},
{"solidity",
4 changes: 3 additions & 1 deletion tests/test_lang_contract.c
@@ -489,7 +489,9 @@ enum { GRAMMAR_BREADTH_MAX = 300, GRAMMAR_PATH_BUF = 96 };
* sshconfig — discover detects ssh_config / .ssh/config, not the generic name "config" */
static bool grammar_graph_allowlisted(const char *name) {
static const char *allow[] = {"nasm", "dotenv", "jsdoc", "regex",
"gitignore", "gitattributes", "sshconfig", NULL};
"gitignore", "gitattributes", "sshconfig",
/* mojo — grammar files removed pending provenance audit (#737) */
"mojo", NULL};
for (int i = 0; allow[i]; i++) {
if (strcmp(allow[i], name) == 0) {
return true;