[Code Search, I] Explore existing syntactic search #1125

Closed
EagleoutIce opened this issue Nov 7, 2024 · 3 comments
@EagleoutIce

We are not the first to search for syntactic structures in the code.
For example, ast-grep relies on tree-sitter to provide a lot of powerful pattern-matching capabilities. tree-sitter offers an R grammar (tree-sitter-r) as well as Node.js bindings, so there should be a way to use them. From what I can tell, this should already be capable of much of the syntactic matching that we are interested in (although finding "all function calls" this way is another question).

So please, just to get started, have a look at the ast-grep system and report on whether you find it suitable for our use-case of querying R code with syntactic patterns.

@EagleoutIce EagleoutIce added the search Code Search (what is possible in the backend of the search API) label Nov 7, 2024
@EagleoutIce EagleoutIce added this to the Code Search milestone Nov 7, 2024
@EagleoutIce EagleoutIce changed the title [Code Search, I] Explore existing syntact search [Code Search, I] Explore existing syntactic search Nov 7, 2024
@Ellpeck

Ellpeck commented Nov 19, 2024

We currently have to prebuild the tree-sitter-r package manually because it is not on npm. To do so, clone the repository and run

npm i
npm x -- tree-sitter generate
npm x -- prebuildify --strip --arch x64 --target 20.9.0
npm x -- prebuildify --strip --arch arm64 --target 20.9.0

Then, either copy the generated directories from prebuilds into the .tgz, or run npm pack to create an OS-specific package.

Pre-building WASM binaries is also possible using a similar setup.

Sourced from this official GitHub Action.


UPDATE: Actually, to build the tree-sitter-r package, it may be worth considering this documentation, specifically this part about running wasm in node.js. It seems like this would spare us from having to make these prebuilds for everything.

@Ellpeck

Ellpeck commented Nov 19, 2024

The API seems pretty nice, I made this playground as a quick example:

import Parser from 'tree-sitter';
import tree_sitter_r from 'tree-sitter-r';

if(require.main === module) {
	const parser = new Parser();
	parser.setLanguage(tree_sitter_r);
	const sourceCode = `
sum <- 0
product <- 1
w <- 7
N <- 10

for (i in 1:(N-1)) {
  sum <- sum + i + w
  product <- product * i
}

cat("Sum:", sum, "\n")
cat("Product:", product, "\n")
`;
	const tree = parser.parse(sourceCode);
	console.log(tree.rootNode.toString());
}

with the output being

(program (binary_operator lhs: (identifier) rhs: (float)) (binary_operator lhs: (identifier) rhs: (float)) (binary_operator lhs: (identifier) rhs: (float)) (binary_operator lhs: (identifier) rhs: (float)) (for_statement variable: (identifier) sequence: (binary_operator lhs: (float) rhs: (parenthesized_expression body: (binary_operator lhs: (identifier) rhs: (float)))) body: (braced_expression body: (binary_operator lhs: (identifier) rhs: (binary_operator lhs: (binary_operator lhs: (identifier) rhs: (identifier)) rhs: (identifier))) body: (binary_operator lhs: (identifier) rhs: (binary_operator lhs: (identifier) rhs: (identifier))))) (call function: (identifier) arguments: (arguments argument: (argument value: (string content: (string_content))) (comma) argument: (argument value: (identifier)) (comma) argument: (argument value: (string content: (string_content))))) (call function: (identifier) arguments: (arguments argument: (argument value: (string content: (string_content))) (comma) argument: (argument value: (identifier)) (comma) argument: (argument value: (string content: (string_content))))))
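
Since rootNode.toString() prints everything on one line, the output is hard to read. Here is a minimal sketch of an indenting printer for that S-expression string. The helper name prettySexp is hypothetical and not part of tree-sitter, and the sketch assumes the simple format shown above (parentheses, node types, and field labels ending in a colon).

```typescript
// Hypothetical helper (not part of tree-sitter): re-indents the one-line
// S-expression produced by rootNode.toString() so that each node starts
// on its own line.
function prettySexp(text: string, indent = '  '): string {
	// Tokens are "(", ")", or a run of characters without spaces/parens
	// (node types such as "call" or field labels such as "lhs:").
	const tokens = text.match(/[()]|[^\s()]+/g) ?? [];
	const lines: string[] = [];
	let depth = 0;
	let pendingField = '';
	for(let i = 0; i < tokens.length; i++) {
		const tok = tokens[i];
		if(tok === '(') {
			// the token right after "(" is the node type
			const type = tokens[++i];
			lines.push(indent.repeat(depth) + pendingField + '(' + type);
			pendingField = '';
			depth++;
		} else if(tok === ')') {
			// closing parens stay on the line of the last opened node
			depth--;
			lines[lines.length - 1] += ')';
		} else {
			// anything else is a field label that belongs to the next node
			pendingField = tok + ' ';
		}
	}
	return lines.join('\n');
}

const pretty = prettySexp('(program (binary_operator lhs: (identifier) rhs: (float)))');
console.log(pretty);
// (program
//   (binary_operator
//     lhs: (identifier)
//     rhs: (float)))
```

Passing the full output from above through the same function should work the same way, as long as the format does not contain tokens this sketch does not expect.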

There are also several functions for querying specific nodes or specific parts of the tree, and printing out the full tree reveals that it has a lot of additional info:

Tree {
  input: '\n' +
    'sum <- 0\n' +
    'product <- 1\n' +
    'w <- 7\n' +
    'N <- 10\n' +
    '\n' +
    'for (i in 1:(N-1)) {\n' +
    '  sum <- sum + i + w\n' +
    '  product <- product * i\n' +
    '}\n' +
    '\n' +
    'cat("Sum:", sum, "\n' +
    '")\n' +
    'cat("Product:", product, "\n' +
    '")\n',
  getText: [Function: getTextFromString],
  language: {
    name: 'r',
    language: [External: 1df339b47c0],
    nodeTypeInfo: [
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object]
    ],
    nodeSubclasses: [
      [class SyntaxNode],
      [class IdentifierNode extends SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class EscapeSequenceNode extends SyntaxNode],
      [class ReturnNode extends SyntaxNode],
      [class NextNode extends SyntaxNode],
      [class BreakNode extends SyntaxNode],
      [class TrueNode extends SyntaxNode],
      [class FalseNode extends SyntaxNode],
      [class NullNode extends SyntaxNode],
      [class InfNode extends SyntaxNode],
      [class NanNode extends SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class DotsNode extends SyntaxNode],
      [class DotDotINode extends SyntaxNode],
      [class CommentNode extends SyntaxNode],
      [class CommaNode extends SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class ProgramNode extends SyntaxNode],
      [class FunctionDefinitionNode extends SyntaxNode],
      [class ParametersNode extends SyntaxNode],
      [class ParameterNode extends SyntaxNode],
      [class SyntaxNode],
      [class SyntaxNode],
      [class IfStatementNode extends SyntaxNode],
      [class ForStatementNode extends SyntaxNode],
      [class WhileStatementNode extends SyntaxNode],
      [class RepeatStatementNode extends SyntaxNode],
      [class BracedExpressionNode extends SyntaxNode],
      [class ParenthesizedExpressionNode extends SyntaxNode],
      [class CallNode extends SyntaxNode],
      [class SubsetNode extends SyntaxNode],
      [class Subset2Node extends SyntaxNode],
      [class ArgumentsNode extends SyntaxNode],
      [class ArgumentsNode extends SyntaxNode],
      [class ArgumentsNode extends SyntaxNode],
      [class SyntaxNode],
      ... 36 more items
    ]
  }
}
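
One detail visible in the input field of the dump: the "\n" inside the JavaScript template literal is turned into a real line break before the R parser ever sees it, which is why the R strings are split across two lines and contain no escape sequence. If the backslash should reach the R code, String.raw is one option. The following sketch only shows the string handling and leaves the parser out; the variable names are made up for illustration.

```typescript
// In a normal template literal, \n is interpreted by JavaScript,
// so the R source ends up with a line break inside the string.
const cooked = `cat("Sum:", sum, "\n")`;

// String.raw keeps the backslash, so the R source contains the
// two characters "\" and "n", as it would in an .R file.
const raw = String.raw`cat("Sum:", sum, "\n")`;

console.log(cooked.includes('\n')); // true: contains a line break
console.log(raw.includes('\n'));    // false: no line break
console.log(raw.includes('\\n'));   // true: backslash followed by n
```

Escaping the backslash by hand (writing "\\n" inside the normal template literal) would have the same effect.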

@Ellpeck Ellpeck closed this as completed Nov 19, 2024
@Ellpeck

Ellpeck commented Nov 19, 2024

Here's a version of the sandbox code using wasm!

import Parser from 'web-tree-sitter';

if(require.main === module) {
	void doIt();
}

async function doIt() {
	await Parser.init();
	const parser = new Parser();
	parser.setLanguage(await Parser.Language.load(`${__dirname}/tree-sitter-r.wasm`));
	const sourceCode = `
sum <- 0
product <- 1
w <- 7
N <- 10

for (i in 1:(N-1)) {
  sum <- sum + i + w
  product <- product * i
}

cat("Sum:", sum, "\n")
cat("Product:", product, "\n")
`;
	const tree = parser.parse(sourceCode);
	console.log(tree.rootNode.toString());
}
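
Both variants give us a tree whose nodes expose type and children, so the "all function calls" question from the issue description can at least be approached syntactically with a plain traversal. This is a minimal sketch under that assumption: NodeLike and findByType are hypothetical names, the interface only lists the two properties the traversal needs, and the mock tree stands in for a real tree.rootNode so the snippet runs without the parser.

```typescript
// Hypothetical minimal view of a syntax node: only the two properties
// used below. Real tree-sitter nodes offer much more.
interface NodeLike {
	type: string;
	children: ReadonlyArray<NodeLike | null>;
}

// Collects all nodes of the given type in source order (pre-order).
function findByType(root: NodeLike, type: string): NodeLike[] {
	const found: NodeLike[] = [];
	const stack: NodeLike[] = [root];
	while(stack.length > 0) {
		const node = stack.pop() as NodeLike;
		if(node.type === type) {
			found.push(node);
		}
		// push children in reverse so the first child is visited first
		for(let i = node.children.length - 1; i >= 0; i--) {
			const child = node.children[i];
			if(child) {
				stack.push(child);
			}
		}
	}
	return found;
}

// Mock tree standing in for tree.rootNode, roughly `x <- 1; f(g())`.
const leaf = (type: string): NodeLike => ({ type, children: [] });
const mockTree: NodeLike = {
	type: 'program',
	children: [
		{ type: 'binary_operator', children: [leaf('identifier'), leaf('float')] },
		{ type: 'call', children: [
			leaf('identifier'),
			{ type: 'arguments', children: [
				{ type: 'call', children: [leaf('identifier'), leaf('arguments')] }
			] }
		] }
	]
};

console.log(findByType(mockTree, 'call').length); // 2
```

Note that this only finds syntactic call nodes. Calls hidden behind operators such as `<-` or `+`, which the grammar reports as binary_operator, would not show up this way.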
