Skip to content

Latest commit

 

History

History
597 lines (433 loc) · 15.7 KB

File metadata and controls

597 lines (433 loc) · 15.7 KB

Output Schema

Complete reference for read_ast() output columns.

Quick Reference

Default output (20 columns):

Column Type Description
node_id BIGINT Unique node identifier within file
type VARCHAR Tree-sitter AST node type
semantic_type SEMANTIC_TYPE Universal semantic category
flags UTINYINT Node property flags
name VARCHAR Extracted identifier name
signature_type VARCHAR Type/return type information
parameters STRUCT[] Function parameters with names and types
modifiers VARCHAR[] Access modifiers and keywords
annotations VARCHAR Decorator/annotation text
qualified_name LIST<STRUCT> Scope-based path as segment list (semantic_type + name + index), unique within a file
file_path VARCHAR Source file path
language VARCHAR Programming language
start_line UINTEGER Starting line (1-based)
end_line UINTEGER Ending line (1-based)
parent_id BIGINT Parent node ID
depth UINTEGER Tree depth (0 for root)
sibling_index UINTEGER Position among siblings
children_count UINTEGER Direct child count
descendant_count UINTEGER Total descendant count
peek VARCHAR Source code snippet

Additional columns with source := 'full' (22 columns total):

Column Type Description
start_column UINTEGER Starting column (1-based)
end_column UINTEGER Ending column (1-based)
-- Get column positions with source := 'full'
SELECT name, start_line, start_column, end_line, end_column
FROM read_ast('example.py', source := 'full')
WHERE is_function_definition(semantic_type);

Identity Columns

node_id

Type: BIGINT

Unique identifier for each AST node within a file.

SELECT node_id, type, name
FROM read_ast('example.py')
ORDER BY node_id;
  • Starts at 0 for the root node
  • Sequential within each file
  • Use with parent_id for tree traversal

type

Type: VARCHAR

Language-specific AST node type from Tree-sitter.

-- Common types vary by language
SELECT DISTINCT type
FROM read_ast('example.py')
ORDER BY type;

Python examples: function_definition, class_definition, identifier JavaScript examples: function_declaration, class_declaration, identifier


semantic_type

Type: SEMANTIC_TYPE (custom logical type)

Universal semantic category. This is a custom DuckDB type that:

  • Displays as string - Shows DEFINITION_FUNCTION instead of numeric code
  • Supports string comparison - WHERE semantic_type = 'DEFINITION_FUNCTION'
  • Stores efficiently - Underlying storage is UTINYINT for fast comparisons
-- Direct string comparison (natural syntax)
SELECT * FROM read_ast('example.py')
WHERE semantic_type = 'DEFINITION_FUNCTION';

-- Group by semantic type
SELECT semantic_type, COUNT(*)
FROM read_ast('example.py')
GROUP BY semantic_type
ORDER BY COUNT(*) DESC;
  • Cross-language semantic classification
  • See Semantic Types for values

flags

Type: UTINYINT

Bitfield containing node properties. Use helper predicates to check flags:

Flag Predicate Description
IS_CONSTRUCT is_construct(flags) Semantic construct (not punctuation)
IS_EMBODIED is_embodied(flags) / has_body(flags) Has implementation body
IS_DECLARATION_ONLY is_declaration_only(flags) Forward declaration without body
IS_SYNTAX_ONLY is_syntax_only(flags) Pure syntax token (keyword, punctuation)
-- Find function definitions with implementations (not just declarations)
SELECT name, file_path
FROM read_ast('**/*.cpp', ignore_errors := true)
WHERE is_function_definition(semantic_type)
  AND has_body(flags);

-- Find forward declarations only
SELECT name, file_path
FROM read_ast('**/*.h', ignore_errors := true)
WHERE is_function_definition(semantic_type)
  AND is_declaration_only(flags);

Name and Extraction Columns

name

Type: VARCHAR (nullable)

Extracted identifier or name for the node.

SELECT name, type, start_line
FROM read_ast('example.py')
WHERE name IS NOT NULL;
  • NULL for nodes without meaningful names
  • Populated for definitions, identifiers, and named constructs
  • Extraction depends on language-specific rules

signature_type

Type: VARCHAR (nullable)

Type information extracted from the node. Meaning varies by semantic type:

Semantic Type signature_type Contains
DEFINITION_FUNCTION Return type (int, void, *big.Int)
DEFINITION_CLASS Class kind (class, interface, trait, struct)
COMPUTATION_CALL Full call expression (obj.method, pkg.func)
DEFINITION_VARIABLE Variable type
-- Find functions with their return types
SELECT name, signature_type as return_type, start_line
FROM read_ast('src/**/*.go')
WHERE is_function_definition(semantic_type);

-- Find method calls by signature
SELECT file_path, start_line, signature_type
FROM read_ast('src/**/*.cpp')
WHERE is_function_call(semantic_type)
  AND signature_type LIKE '%.empty';

parameters

Type: STRUCT("name" VARCHAR, "type" VARCHAR)[]

Function parameters as an array of structs containing parameter name and type.

-- List functions with their parameters
SELECT
    name,
    parameters,
    array_length(parameters) as param_count
FROM read_ast('example.py')
WHERE is_function_definition(semantic_type);

-- Find functions with specific parameter names
SELECT name, parameters
FROM read_ast('src/**/*.py')
WHERE is_function_definition(semantic_type)
  AND list_contains([p.name FOR p IN parameters], 'self');
  • Empty array [] for functions with no parameters
  • Parameter types may be NULL for dynamically-typed languages

modifiers

Type: VARCHAR[]

Access modifiers, keywords, and other declarative attributes.

-- Find public static methods in Java
SELECT name, modifiers
FROM read_ast('src/**/*.java')
WHERE is_function_definition(semantic_type)
  AND list_contains(modifiers, 'public')
  AND list_contains(modifiers, 'static');

-- Find async functions
SELECT name, file_path
FROM read_ast('src/**/*.js')
WHERE is_function_definition(semantic_type)
  AND list_contains(modifiers, 'async');

Common modifier values by language:

  • Java: public, private, protected, static, final, abstract
  • JavaScript: async, const, let, var
  • Go: var
  • Rust: pub, mut, async

annotations

Type: VARCHAR (nullable)

Decorator or annotation text for the node.

-- Find decorated Python functions
SELECT name, annotations
FROM read_ast('src/**/*.py')
WHERE is_function_definition(semantic_type)
  AND annotations IS NOT NULL;

qualified_name

Type: LIST(STRUCT(semantic_type SEMANTIC_TYPE, name VARCHAR, index INTEGER)) (nullable)

Scope-based definition path that uniquely identifies a node within its file. Each element of the list is one segment of the path from outermost scope to innermost. Built during AST traversal by tracking nested definition scopes.

Segment fields:

Field Type Meaning
semantic_type SEMANTIC_TYPE The semantic role of this scope level (e.g. DEFINITION_FUNCTION, DEFINITION_CLASS, DEFINITION_VARIABLE, DEFINITION_MODULE, EXTERNAL_IMPORT, EXTERNAL_EXPORT). Cast to VARCHAR for the long-form name.
name VARCHAR The identifier name at that scope level.
index INTEGER 1-based occurrence index for disambiguating same-name collisions within a scope. The first x = ... in a scope gets index = 1; the second gets index = 2; and so on. Always explicit.
-- Inspect the raw segment list
SELECT name, qualified_name
FROM read_ast('src/models.py')
WHERE name = 'validate';
-- validate  [{semantic_type: DEFINITION_CLASS,    name: User,     index: 1},
--            {semantic_type: DEFINITION_FUNCTION, name: validate, index: 1}]

-- Filter by semantic role at any depth (e.g. all definitions inside a class)
SELECT name
FROM read_ast('src/**/*.py')
WHERE len(list_filter(qualified_name,
                      s -> is_class_definition(s.semantic_type))) > 0;

-- Innermost enclosing class name
SELECT name,
       (list_reverse(list_filter(qualified_name,
                                 s -> is_class_definition(s.semantic_type))))[1].name AS in_class
FROM read_ast('src/**/*.py')
WHERE is_function_definition(semantic_type);

-- Top-level definitions only (depth = 1)
SELECT name
FROM read_ast('src/**/*.py')
WHERE len(qualified_name) = 1 AND is_function_definition(semantic_type);

-- Innermost segment
SELECT name, qualified_name[-1].name, qualified_name[-1].index
FROM read_ast('src/**/*.py')
WHERE is_definition(semantic_type);

-- Composite join key across parses
SELECT a.name, a.start_line, b.start_line
FROM read_ast('v1/**/*.py') a
JOIN read_ast('v2/**/*.py') b USING (file_path, qualified_name);

Collision disambiguation. Within a single file, each (semantic_type, name) pair in a given scope gets a monotonically increasing index. The first occurrence has index = 1, the second has index = 2, and so on. This makes qualified_name unique within a file for every named definition.

counter = 0    # [{variable, counter, 1}]
counter = 1    # [{variable, counter, 2}]
counter = 2    # [{variable, counter, 3}]

Counters reset at scope boundaries — each function/class gets a fresh counter, so [{function, foo}, {variable, x}] and [{function, bar}, {variable, x}] both use index = 1 for their x.

String form. For display, logging, or test assertions, ast_qualified_name_as_string(qualified_name) renders the list into the legacy bracket form:

SELECT name, ast_qualified_name_as_string(qualified_name)
FROM read_ast('src/models.py')
WHERE name = 'validate';
-- validate  C[User] F[validate]

The bracket form uses single-letter prefixes (F/C/V/M/I/E) and omits the [N] suffix for index = 1, so V[x] means "the first x" and V[x][2] means "the second x in that scope".

Key properties:

  • NULL for non-definition nodes (the qualified_name column holds a NULL list, not an empty one)
  • Only populated for named definition nodes (functions, classes, variables, modules, imports, exports)
  • Available at context := 'normalized' and above
  • Unique within a file — every named definition node gets a distinct segment list
  • Joinable: USING (file_path, qualified_name) works as a cross-parse composite key

Source Location Columns

file_path

Type: VARCHAR

Path to the source file.

SELECT DISTINCT file_path
FROM read_ast('src/**/*.py');
  • Relative paths preserved from input
  • Absolute paths if provided as input

language

Type: VARCHAR

Detected or specified programming language.

SELECT language, COUNT(*) as node_count
FROM read_ast(['**/*.py', '**/*.js'], ignore_errors := true)
GROUP BY language;

start_line / end_line

Type: UINTEGER

Line numbers (1-based).

SELECT name, start_line, end_line, end_line - start_line + 1 as line_count
FROM read_ast('example.py')
WHERE type = 'function_definition';

start_column / end_column

Type: UINTEGER

Column positions (1-based). Only available with source := 'full'.

-- Must use source := 'full' to get column positions
SELECT name, start_line, start_column, end_line, end_column
FROM read_ast('example.py', source := 'full')
WHERE type = 'identifier';

Tree Structure Columns

parent_id

Type: BIGINT (nullable)

Node ID of the parent node.

-- Find children of a specific node
SELECT type, name
FROM read_ast('example.py')
WHERE parent_id = 5;
  • NULL for root node
  • Use for tree traversal

depth

Type: UINTEGER

Tree depth (0 for root).

-- Find deeply nested code
SELECT type, name, depth
FROM read_ast('example.py')
WHERE depth > 5
ORDER BY depth DESC;

sibling_index

Type: UINTEGER

Position among siblings (0-based).

SELECT type, sibling_index
FROM read_ast('example.py')
WHERE parent_id = 0
ORDER BY sibling_index;

children_count

Type: UINTEGER

Number of direct child nodes.

SELECT type, name, children_count
FROM read_ast('example.py')
WHERE children_count > 10;

descendant_count

Type: UINTEGER

Total number of descendant nodes. Useful as a complexity metric.

-- Find complex functions
SELECT name, descendant_count as complexity
FROM read_ast('example.py')
WHERE type = 'function_definition'
ORDER BY complexity DESC;

Content Columns

peek

Type: VARCHAR (nullable)

Source code snippet for the node.

SELECT type, name, peek
FROM read_ast('example.py', peek := 100)
WHERE type = 'function_definition';
  • Size controlled by peek parameter
  • NULL if peek := 'none'
  • Use peek := 'full' for complete source text

Column Availability by Parameter

By context Parameter

Column 'none' 'node_types_only' 'normalized' 'native'
node_id Yes Yes Yes Yes
type Yes Yes Yes Yes
semantic_type No Yes Yes Yes
flags No Yes Yes Yes
name No No Yes Yes
signature_type No No No Yes
parameters No No No Yes
modifiers No No No Yes
annotations No No No Yes
qualified_name No No Yes Yes
file_path Yes Yes Yes Yes
language Yes Yes Yes Yes
start_line Yes Yes Yes Yes
end_line Yes Yes Yes Yes
parent_id Yes* Yes* Yes* Yes
depth Yes* Yes* Yes* Yes
sibling_index Yes* Yes* Yes* Yes
children_count Yes* Yes* Yes* Yes
descendant_count Yes* Yes* Yes* Yes
peek Yes** Yes** Yes** Yes**

* Depends on structure parameter ** Depends on peek parameter

By source Parameter

Column 'none' 'path' 'lines_only' 'lines' 'full'
file_path No Yes Yes Yes Yes
start_line No No Yes Yes Yes
end_line No No Yes Yes Yes
start_column No No No No Yes
end_column No No No No Yes
-- Default: line positions only
SELECT start_line, end_line FROM read_ast('example.py');

-- With source := 'full': includes column positions
SELECT start_line, start_column, end_line, end_column
FROM read_ast('example.py', source := 'full');

Working with Output

Create Table from Results

CREATE TABLE ast_cache AS
SELECT * FROM read_ast('src/**/*.py', ignore_errors := true);

-- Query cached data
SELECT type, COUNT(*) FROM ast_cache GROUP BY type;

Export to Parquet

COPY (
    SELECT file_path, type, name, start_line, semantic_type
    FROM read_ast('src/**/*.py')
) TO 'ast_export.parquet';

Join with Other Data

-- Join with file metadata
SELECT
    ast.file_path,
    ast.name,
    files.last_modified
FROM read_ast('src/**/*.py') ast
JOIN file_metadata files ON ast.file_path = files.path
WHERE ast.type = 'function_definition';

Next Steps