diff --git a/README.md b/README.md
index 2a58cb51..6607160e 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
 [![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)
 [![CI](https://img.shields.io/github/actions/workflow/status/DeusData/codebase-memory-mcp/dry-run.yml?label=CI)](https://github.com/DeusData/codebase-memory-mcp/actions/workflows/dry-run.yml)
 [![Tests](https://img.shields.io/badge/tests-5604_passing-brightgreen)](https://github.com/DeusData/codebase-memory-mcp)
-[![Languages](https://img.shields.io/badge/languages-158-orange)](https://github.com/DeusData/codebase-memory-mcp)
+[![Languages](https://img.shields.io/badge/languages-159-orange)](https://github.com/DeusData/codebase-memory-mcp)
 [![Hybrid LSP](https://img.shields.io/badge/Hybrid_LSP-9_languages-blue)](#hybrid-lsp)
 [![Agents](https://img.shields.io/badge/agents-11-purple)](https://github.com/DeusData/codebase-memory-mcp)
 [![Pure C](https://img.shields.io/badge/pure_C-zero_dependencies-blue)](https://github.com/DeusData/codebase-memory-mcp)
@@ -16,7 +16,7 @@
 **The fastest and most efficient code intelligence engine for AI coding agents.** Full-indexes an average repository in milliseconds, the Linux kernel (28M LOC, 75K files) in 3 minutes. Answers structural queries in under 1ms. Ships as a single static binary for macOS, Linux, and Windows — download, run `install`, done.
-High-quality parsing through [tree-sitter](https://tree-sitter.github.io/tree-sitter/) AST analysis across all 158 languages, enhanced with [**Hybrid LSP** semantic type resolution](#hybrid-lsp) for Python, TypeScript / JavaScript / JSX / TSX, PHP, C#, Go, C, C++, Java, Kotlin, and Rust — producing a persistent knowledge graph of functions, classes, call chains, HTTP routes, and cross-service links. 14 MCP tools. Zero dependencies. Plug and play across 11 coding agents.
+High-quality parsing through [tree-sitter](https://tree-sitter.github.io/tree-sitter/) AST analysis across all 159 languages, enhanced with [**Hybrid LSP** semantic type resolution](#hybrid-lsp) for Python, TypeScript / JavaScript / JSX / TSX, PHP, C#, Go, C, C++, Java, Kotlin, and Rust — producing a persistent knowledge graph of functions, classes, call chains, HTTP routes, and cross-service links. 14 MCP tools. Zero dependencies. Plug and play across 11 coding agents.
 > **Research** — The design and benchmarks behind this project are described in the preprint [*Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP*](https://arxiv.org/abs/2603.27277) (arXiv:2603.27277). Evaluated across 31 real-world repositories: 83% answer quality, 10× fewer tokens, 2.1× fewer tool calls vs. file-by-file exploration.
@@ -32,7 +32,7 @@ High-quality parsing through [tree-sitter](https://tree-sitter.github.io/tree-si
 - **Extreme indexing speed** — Linux kernel (28M LOC, 75K files) in 3 minutes. RAM-first pipeline: LZ4 compression, in-memory SQLite, fused Aho-Corasick pattern matching. Memory released after indexing.
 - **Plug and play** — single static binary for macOS (arm64/amd64), Linux (arm64/amd64), and Windows (amd64). No Docker, no runtime dependencies, no API keys. Download → `install` → restart agent → done.
-- **158 languages** — vendored tree-sitter grammars compiled into the binary. Nothing to install, nothing that breaks.
+- **159 languages** — vendored tree-sitter grammars compiled into the binary. Nothing to install, nothing that breaks.
 - **120x fewer tokens** — 5 structural queries: ~3,400 tokens vs ~412,000 via file-by-file search. One graph query replaces dozens of grep/read cycles.
 - **11 agents, one command** — `install` auto-detects Claude Code, Codex CLI, Gemini CLI, Zed, OpenCode, Antigravity, Aider, KiloCode, VS Code, OpenClaw, and Kiro — configures MCP entries, instruction files, and pre-tool hooks for each.
 - **Built-in graph visualization** — 3D interactive UI at `localhost:9749` (optional UI binary variant).
@@ -174,7 +174,7 @@ Removes all agent configs, skills, hooks, and instructions. Does not remove the
 - `SEMANTICALLY_RELATED` (vocabulary-mismatch, same-language, score ≥ 0.80)
 ### Indexing pipeline
-- **158 vendored tree-sitter grammars** compiled into the binary
+- **159 vendored tree-sitter grammars** compiled into the binary
 - **Generic package / module resolution** — bare specifiers like `@myorg/pkg`, `github.com/foo/bar`, `use my_crate::foo` resolved via manifest scanning (`package.json`, `go.mod`, `Cargo.toml`, `pyproject.toml`, `composer.json`, `pubspec.yaml`, `pom.xml`, `build.gradle`, `mix.exs`, `*.gemspec`)
 - **Infrastructure-as-code indexing** — Dockerfiles, Kubernetes manifests, Kustomize overlays as graph nodes
 - **[Hybrid LSP semantic type resolution](#hybrid-lsp)** for Python, TypeScript / JavaScript / JSX / TSX, PHP, C#, Go, C, C++, Java, Kotlin, and Rust — a lightweight C implementation of language type-resolution algorithms, structurally inspired by and compatible with major language servers including tsserver / typescript-go, pyright, gopls, Roslyn, Eclipse JDT, and rust-analyzer (parameter binding, return-type inference, generic substitution, JSX component dispatch, JSDoc inference for plain JS files, namespace + trait + late-static-binding resolution for PHP, file-scoped namespaces + records + LINQ method syntax for C#, class-hierarchy + overload + lambda resolution for Java, extension-function + scope-function resolution for Kotlin, trait-method + UFCS resolution for Rust)
@@ -542,14 +542,14 @@ codebase-memory-mcp ships a **lightweight C implementation of language type-reso
 **Two-layer architecture:**
-1. **Tree-sitter pass** — fast, syntactic, runs for every one of the 158 languages. Extracts definitions, calls, imports.
+1. **Tree-sitter pass** — fast, syntactic, runs for every one of the 159 languages. Extracts definitions, calls, imports.
 2. **Hybrid LSP pass** — type-aware, runs above the tree-sitter pass per-language. Refines call edges using the import graph plus a per-file or pre-built cross-file definition registry.
 Languages without a Hybrid LSP pass yet fall back to textual resolution, so you always get *some* answer. The result is a knowledge graph accurate enough to drive `trace_path` across packages, inheritance hierarchies, and stdlib calls — without paying for a language server process per project.
 ## Language Support
-158 languages, all parsed via vendored tree-sitter grammars compiled into the binary. Benchmarked against 64 real open-source repositories (78 to 49K nodes):
+159 languages, all parsed via vendored tree-sitter grammars compiled into the binary. Benchmarked against 64 real open-source repositories (78 to 49K nodes):
 | Tier | Score | Languages |
 |------|-------|-----------|
@@ -574,7 +574,7 @@ src/
   traces/ Runtime trace ingestion
   ui/ Embedded HTTP server + 3D graph visualization
   foundation/ Platform abstractions (threads, filesystem, logging, memory)
-internal/cbm/ Vendored tree-sitter grammars (158 languages) + AST extraction engine
+internal/cbm/ Vendored tree-sitter grammars (159 languages) + AST extraction engine
 ```
 ## Security
diff --git a/internal/cbm/cbm.h b/internal/cbm/cbm.h
index b68f2f36..e3e085e7 100644
--- a/internal/cbm/cbm.h
+++ b/internal/cbm/cbm.h
@@ -170,6 +170,7 @@ typedef enum {
     CBM_LANG_QML, // Qt QML (Qt Modeling Language — declarative UI + embedded JS)
     CBM_LANG_CFSCRIPT, // CFML script dialect (.cfc components — Lucee/ColdFusion)
     CBM_LANG_CFML, // CFML tag dialect (.cfm templates — Lucee/ColdFusion)
+    CBM_LANG_MOJO, // Mojo (Modular — Python-superset systems language; .mojo / .🔥)
     CBM_LANG_COUNT
 } CBMLanguage;
diff --git a/internal/cbm/lang_specs.c b/internal/cbm/lang_specs.c
index e7c97fcc..7e22ec04 100644
--- a/internal/cbm/lang_specs.c
+++ b/internal/cbm/lang_specs.c
@@ -164,6 +164,7 @@ extern const TSLanguage *tree_sitter_apex(void);
 extern const TSLanguage *tree_sitter_soql(void);
 extern const TSLanguage *tree_sitter_sosl(void);
 extern const TSLanguage *tree_sitter_pine(void);
+extern const TSLanguage *tree_sitter_mojo(void);
 // -- Empty sentinel --
 static const char *empty_types[] = {NULL};
@@ -205,6 +206,18 @@ static const char *py_var_types[] = {"assignment", "augmented_assignment", NULL}
 static const char *py_throw_types[] = {"raise_statement", NULL};
 static const char *py_decorator_types[] = {"decorator", NULL};
+// ==================== MOJO ====================
+// Mojo (Modular) is a Python superset; the lsh/tree-sitter-mojo grammar is
+// forked from tree-sitter-python, so every node type mirrors Python exactly
+// EXCEPT the class array — Mojo's "struct"/"class" both parse as
+// class_definition, but "trait" and the "__extension" form get their own
+// nodes. So the spec reuses the py_* arrays and overrides only the class
+// types. ("fn"/"def" both parse as function_definition; compile-time
+// "alias NAME = value" has no dedicated node and is recovered as an
+// `assignment`, so it falls under py_var_types like ordinary `var` fields.)
+static const char *mojo_class_types[] = {"class_definition", "trait_definition",
+                                         "extension_definition", NULL};
+
 // ==================== JAVASCRIPT ====================
 static const char *js_func_types[] = {"function_declaration", "generator_function_declaration",
                                       "function_expression", "arrow_function",
@@ -2039,6 +2052,12 @@ static const CBMLangSpec lang_specs[CBM_LANG_COUNT] = {
                        empty_types, empty_types, NULL, empty_types, NULL, NULL, tree_sitter_cfml, NULL},
+    // CBM_LANG_MOJO (Python-derived; reuses py_* arrays, only class types differ)
+    [CBM_LANG_MOJO] = {CBM_LANG_MOJO, py_func_types, mojo_class_types, empty_types, py_module_types,
+                       py_call_types, py_import_types, py_import_from_types, py_branch_types,
+                       py_var_types, py_var_types, py_throw_types, NULL, py_decorator_types,
+                       py_env_funcs, py_env_members, tree_sitter_mojo, NULL},
+
     // CBM_LANG_GLEAM
     [CBM_LANG_GLEAM] = {CBM_LANG_GLEAM, gleam_func_types, gleam_class_types, gleam_field_types,
                         gleam_module_types, gleam_call_types, gleam_import_types, empty_types,
diff --git a/scripts/new-languages.json b/scripts/new-languages.json
index b03682e8..2e713fc3 100644
--- a/scripts/new-languages.json
+++ b/scripts/new-languages.json
@@ -1291,5 +1291,20 @@
     "filenames": [],
     "has_scanner": true,
     "module_root": "source_file"
+  },
+  {
+    "name": "mojo",
+    "enum": "MOJO",
+    "display": "Mojo",
+    "ts_func": "tree_sitter_mojo",
+    "repo": "https://github.com/lsh/tree-sitter-mojo",
+    "subdir": "",
+    "extensions": [
+      ".mojo",
+      ".🔥"
+    ],
+    "filenames": [],
+    "has_scanner": true,
+    "module_root": "module"
   }
 ]
\ No newline at end of file
diff --git a/src/discover/language.c b/src/discover/language.c
index 917651d1..6994d99e 100644
--- a/src/discover/language.c
+++ b/src/discover/language.c
@@ -187,6 +187,10 @@ static const ext_entry_t EXT_TABLE[] = {
     /* Meson */
     {".meson", CBM_LANG_MESON},
+    /* Mojo — .mojo and the .🔥 (U+1F525) fire-emoji extension */
+    {".mojo", CBM_LANG_MOJO},
+    {".🔥", CBM_LANG_MOJO},
+
     /* Nix */
     {".nix", CBM_LANG_NIX},
@@ -835,6 +839,7 @@ static const char *LANG_NAMES[CBM_LANG_COUNT] = {
     [CBM_LANG_APEX] = "Apex",
     [CBM_LANG_SOQL] = "SOQL",
     [CBM_LANG_SOSL] = "SOSL",
+    [CBM_LANG_MOJO] = "Mojo",
 };
diff --git a/tests/test_grammar_labels.c b/tests/test_grammar_labels.c
index 21f1a470..75d85c7b 100644
--- a/tests/test_grammar_labels.c
+++ b/tests/test_grammar_labels.c
@@ -102,6 +102,7 @@ static const LabelGolden LABEL_GOLDENS[] = {
     {"swift", "Class:1,Function:1,Module:1"},
     {"scala", "Class:1,Function:1,Method:1,Module:1"},
     {"gdscript", "Function:1,Module:1"},
+    {"mojo", "Class:1,Function:1,Interface:1,Method:1,Module:1,Variable:1"},
     {"groovy", "Class:1,Method:1,Module:1"},
     {"zig", "Function:2,Module:1"},
     {"solidity", "Class:1,Function:1,Method:1,Module:1"},
diff --git a/tests/test_grammar_regression.c b/tests/test_grammar_regression.c
index b578fc0b..52b48f14 100644
--- a/tests/test_grammar_regression.c
+++ b/tests/test_grammar_regression.c
@@ -100,6 +100,14 @@ const GrammarCase CBM_GRAMMAR_CASES[] = {
     {"swift", CBM_LANG_SWIFT, "a.swift", "func foo() {}\nclass A {}\n", 2, {"foo", "A", NULL}},
     {"scala", CBM_LANG_SCALA, "a.scala", "object A {\n def foo() = 1\n}\n", 1, {"A", NULL}},
     {"gdscript", CBM_LANG_GDSCRIPT, "a.gd", "func foo():\n pass\n", 1, {"foo", NULL}},
+    /* Mojo: fn/def -> function, struct -> class, trait -> interface */
+    {"mojo",
+     CBM_LANG_MOJO,
+     "a.mojo",
+     "fn foo() -> Int:\n return 1\nstruct Bar:\n var x: Int\ntrait Baz:\n fn m(self): "
+     "...\n",
+     3,
+     {"foo", "Bar", "Baz", NULL}},
     {"groovy", CBM_LANG_GROOVY, "a.groovy", "class A {\n def foo() {}\n}\n", 1, {"A", NULL}},
     {"zig", CBM_LANG_ZIG, "a.zig", "fn foo() void {}\nfn bar() void {}\n", 1, {"foo", NULL}},
     {"solidity",
diff --git a/tests/test_lang_contract.c b/tests/test_lang_contract.c
index f9f91066..d68aad59 100644
--- a/tests/test_lang_contract.c
+++ b/tests/test_lang_contract.c
@@ -489,7 +489,9 @@ enum { GRAMMAR_BREADTH_MAX = 300, GRAMMAR_PATH_BUF = 96 };
  * sshconfig — discover detects ssh_config / .ssh/config, not the generic name "config" */
 static bool grammar_graph_allowlisted(const char *name) {
     static const char *allow[] = {"nasm", "dotenv", "jsdoc", "regex",
-                                  "gitignore", "gitattributes", "sshconfig", NULL};
+                                  "gitignore", "gitattributes", "sshconfig",
+                                  /* mojo — grammar files removed pending provenance audit (#737) */
+                                  "mojo", NULL};
     for (int i = 0; allow[i]; i++) {
         if (strcmp(allow[i], name) == 0) {
             return true;
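
Review note: the `EXT_TABLE` hunk in `src/discover/language.c` maps both `.mojo` and `.🔥` to `CBM_LANG_MOJO`, but the lookup code itself is not part of this diff. The sketch below is one way such a table could be searched, assuming the extension is compared byte-for-byte with `strcmp`, in which case the emoji extension needs no special handling because U+1F525 is just the four UTF-8 bytes `F0 9F 94 A5`. Every `demo_`/`DEMO_` name is hypothetical and stands in for the real `ext_entry_t` and `CBMLanguage` types, whose actual lookup may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the CBMLanguage enum touched by the diff. */
typedef enum { DEMO_LANG_UNKNOWN = 0, DEMO_LANG_PYTHON, DEMO_LANG_MOJO } demo_lang_t;

/* Hypothetical stand-in for ext_entry_t: one extension mapped to one language. */
typedef struct {
    const char *ext;
    demo_lang_t lang;
} demo_ext_entry_t;

static const demo_ext_entry_t DEMO_EXT_TABLE[] = {
    {".py", DEMO_LANG_PYTHON},
    {".mojo", DEMO_LANG_MOJO},
    /* ".🔥" spelled out as raw UTF-8 bytes so the source file encoding does not matter. */
    {".\xF0\x9F\x94\xA5", DEMO_LANG_MOJO},
};

/* Map a path to a language by comparing its final ".ext" suffix bytewise.
 * Assumption: the real discover code may normalise case or check filenames first. */
demo_lang_t demo_lang_for_path(const char *path) {
    assert(path != NULL);
    const char *dot = strrchr(path, '.'); /* last '.' in the path, or NULL */
    if (dot == NULL) {
        return DEMO_LANG_UNKNOWN;
    }
    size_t count = sizeof(DEMO_EXT_TABLE) / sizeof(DEMO_EXT_TABLE[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(dot, DEMO_EXT_TABLE[i].ext) == 0) {
            return DEMO_EXT_TABLE[i].lang;
        }
    }
    return DEMO_LANG_UNKNOWN;
}
```

Under that assumption a file named `kernel.🔥` resolves the same way as `kernel.mojo`. A regression case using the emoji extension (the added `CBM_GRAMMAR_CASES` entry only exercises `a.mojo`) might be worth considering.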