The bson-transpilers package uses ANTLR4 to create a parse tree. As ANTLR is written in Java, you will need to set up a few tools before being able to compile this locally.
Make sure you have Java installed:
$ brew cask install java
I strongly suggest using an IDE that will help you visualize ANTLR trees (JetBrains has a good plugin).
Otherwise, you can use the Java version of the grammar and compile it with:
javac <Language>*.java && grun <Language> <StartRule> -gui
This might be helpful.
Make sure you have run the following from the root directory:
$ npm run bootstrap
Then compile and run tests locally with:
$ npm run compile && npm run test
You can provide a few environment variables to restrict the tests to specific input and output languages. If none are provided, everything will run.
- INPUT=: comma-separated input languages you want to test
- OUTPUT=: comma-separated output languages you want to test. Also called "target" language.
- MODE=: comma-separated names of the test files (without .yaml) that you want to run
OUTPUT=csharp INPUT=shell MODE=native,bson npm run test
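As a rough sketch of how comma-separated filters like these could be interpreted (the helpers `parseFilter` and `shouldRun` are hypothetical names for illustration and are not taken from the test runner):

```javascript
// Hypothetical helper: turn "a,b" into ['a', 'b'].
// An unset or empty value means "no filter", represented as null.
function parseFilter(value) {
  if (!value) return null;
  return value.split(',').map((s) => s.trim()).filter(Boolean);
}

// A test case runs when every filter is either unset or lists the matching value.
function shouldRun(env, { input, output, mode }) {
  const checks = [
    [parseFilter(env.INPUT), input],
    [parseFilter(env.OUTPUT), output],
    [parseFilter(env.MODE), mode],
  ];
  return checks.every(([allowed, value]) => allowed === null || allowed.includes(value));
}

const env = { OUTPUT: 'csharp', INPUT: 'shell', MODE: 'native,bson' };
console.log(shouldRun(env, { input: 'shell', output: 'csharp', mode: 'bson' }));   // true
console.log(shouldRun(env, { input: 'python', output: 'csharp', mode: 'bson' }));  // false
console.log(shouldRun({}, { input: 'python', output: 'java', mode: 'native' }));   // true
```

With no variables set, every filter is null, which matches the "everything will run" behaviour described above.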
See also the original presentation: https://drive.google.com/file/d/1jvwtR3k9oBUzIjL4z_VtpHvdWahfcjTK/view
Similar to how many transpilers work, this package parses the input string into a tree and then generates code from the tree using the Visitor pattern.
Parsing and tree generation are handled by ANTLR4.
The grammar files are located in the grammars folder, and the JavaScript parser/lexer/etc. generated from the grammar are located in lib/antlr. To make changes to the grammar, you have to modify the .g4 file in grammars, then run npm run compile. You should never directly modify files in lib.
Each grammar will generate a tree that is unique to the grammar. This means that for equivalent input code, the tree generated from ECMAScript.g4 is not going to be exactly the same as the tree generated by Python3.g4. As a result, we need different visitor classes for each tree.
ANTLR generates a "shell" visitor class for each tree in lib/antlr/<grammar name>Visitor.js. It contains an empty method for each node in the parse tree.
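To make the idea of a "shell" visitor concrete, here is a minimal sketch of the Visitor pattern. It does not use the ANTLR runtime: the class names and the node shape (`type`, `children`, `text`) are assumptions made only for this illustration.

```javascript
// Stand-in for an ANTLR-generated visitor: one method per node type, each of
// which does nothing except visit the children. The node shape is assumed.
class ShellVisitor {
  visit(node) {
    return this[`visit${node.type}`](node);
  }
  visitChildren(node) {
    return (node.children || []).map((child) => this.visit(child));
  }
  visitProgram(node) { return this.visitChildren(node); }
  visitNumber(node) { return this.visitChildren(node); }
  visitAdd(node) { return this.visitChildren(node); }
}

// A subclass overrides only the methods where it needs to produce output.
class PrintingVisitor extends ShellVisitor {
  visitProgram(node) { return this.visitChildren(node).join('\n'); }
  visitNumber(node) { return node.text; }
  visitAdd(node) { return this.visitChildren(node).join(' + '); }
}

const tree = {
  type: 'Program',
  children: [{
    type: 'Add',
    children: [{ type: 'Number', text: '1' }, { type: 'Number', text: '2' }],
  }],
};
console.log(new PrintingVisitor().visit(tree)); // 1 + 2
```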
Because the project is designed to handle multiple input languages and multiple output languages, the tree visitation stage is split into parts. The first part is handled in the visitor class defined in codegeneration/<input language>/Visitor.js. It is a (sub)subclass of the "empty" visitor class generated in lib/antlr. This visitor class is specific to the input language and can only visit a tree generated by that grammar. The visitor visits each node and uses a string template defined in either the symbol table or the type table to generate code in the output language.
For expressions that are too complex for a string template, the visitor will call an emit method defined in the Generator. The general rule is that emit methods aren't required unless you're doing something very unusual or you need to do some tree manipulation, since the templates only have access to the arguments that are sent to them (usually the function name and args).
If the node requires special treatment for all output languages, the visitor will define a process<type> method that will do some pre-processing before calling the appropriate string template or emit method. An example is processDate in the JS visitor, which constructs a date object from the input and passes it to the Date template.
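The interplay between templates, emit methods, and process methods can be sketched roughly as follows. Everything here is hypothetical: the class names, the `generate` and `render` helpers, the node shape, and the generated output syntax are assumptions for illustration, not the package's actual code.

```javascript
// Hypothetical dispatch: a process method pre-processes the node if one exists;
// rendering prefers a string template and falls back to an emit method.
class SketchVisitor {
  constructor(templates) {
    this.templates = templates;
  }
  generate(node) {
    const processName = `process${node.type}`;
    if (typeof this[processName] === 'function') {
      return this[processName](node);
    }
    return this.render(node.type, node.name, node.args);
  }
  render(type, name, args) {
    const template = this.templates[type];
    if (template) return template(name, args);
    const emitName = `emit${type}`;
    if (typeof this[emitName] === 'function') return this[emitName](name, args);
    throw new Error(`No template or emit method for ${type}`);
  }
  // Pre-processing shared by every output language: normalise the date first.
  processDate(node) {
    const iso = new Date(node.args[0]).toISOString();
    return this.render('Date', node.name, [iso]);
  }
}

class SketchGenerator extends SketchVisitor {
  // Reached only because no template is registered for this type.
  emitRegex(name, args) {
    return `re.compile(r"${args[0]}")`;
  }
}

// Illustrative templates; the output syntax is made up for the sketch.
const sketchTemplates = {
  Int32: (name, args) => `int(${args[0]})`,
  Date: (name, args) => `parse_date('${args[0]}')`,
};
const gen = new SketchGenerator(sketchTemplates);
console.log(gen.generate({ type: 'Int32', name: 'Int32', args: ['5'] }));
console.log(gen.generate({ type: 'Regex', name: 'RegExp', args: ['^a'] }));
console.log(gen.generate({ type: 'Date', name: 'Date', args: ['2020-01-01T00:00:00Z'] }));
```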
There is a lot of repeated code between input-language-specific visitors, so a lot of that has been moved into codegeneration/CodeGenerationVisitor.js. Each input-language-specific visitor will define get methods that provide a layer of indirection around the tree, so that the generic visitor and the generators do not need to know or care which tree they are visiting. The CodeGenerationVisitor also includes helper methods that every input language may use.
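A small sketch of that indirection, under the assumption of two made-up grammars that name equivalent nodes differently. The class names, the `generateCall` helper, and the node fields are all hypothetical.

```javascript
// Shared logic talks to the tree only through get* methods.
const Generic = (Base) => class extends Base {
  generateCall(node) {
    const name = this.getFunctionName(node);
    const args = this.getArguments(node);
    return `${name}(${args.join(', ')})`;
  }
};

class Empty {} // stands in for the ANTLR-generated visitor

// Two hypothetical grammars with different node field names.
class JsLikeVisitor extends Generic(Empty) {
  getFunctionName(node) { return node.callee; }
  getArguments(node) { return node.arguments; }
}
class PyLikeVisitor extends Generic(Empty) {
  getFunctionName(node) { return node.atom; }
  getArguments(node) { return node.arglist; }
}

console.log(new JsLikeVisitor().generateCall({ callee: 'ObjectId', arguments: ["'x'"] }));
console.log(new PyLikeVisitor().generateCall({ atom: 'ObjectId', arglist: ["'x'"] }));
```

Both calls print the same string, even though the shared `generateCall` never looks at the grammar-specific field names.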
The Generator is the other half of the tree visitation stage. Each output language will have a Generator class defined in codegeneration/<output language>/Generator.js. The Generator class generates code, so it is specific to the output language. The Generator class is a subclass of the input language's visitor class.
So, for example, when translating from JS to Python, the order of inheritance will be:
- lib/antlr/ECMAScriptVisitor.js ["empty" superclass, specific to the tree built by ANTLR]
- codegeneration/CodeGenerationVisitor.js ["generic" visitor, shared between all languages]
- codegeneration/javascript/Visitor.js [specific to input language]
- codegeneration/python/Generator.js [specific to output language]
For nodes that cannot be translated using templates, the Generator class will define a method called emit<type>, which takes in a tree node and some optional metadata, and returns the transformed string. Modifying output in the Generator should only be done if it's not possible to modify it using a string template.
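The inheritance chain can be sketched with class factories that each take their superclass as an argument. The stub class and the `describe` method are assumptions added purely so that the chain is visible when the sketch runs.

```javascript
// Stands in for lib/antlr/ECMAScriptVisitor.js in this sketch.
class ANTLRVisitorStub {
  describe() { return ['antlr']; }
}

// Each layer is a function from a superclass to a subclass.
const getCodeGenerationVisitor = (Base) => class CodeGenerationVisitor extends Base {
  describe() { return [...super.describe(), 'generic']; }
};
const getJavascriptVisitor = (Base) => class Visitor extends Base {
  describe() { return [...super.describe(), 'javascript input']; }
};
const getPythonGenerator = (Base) => class Generator extends Base {
  describe() { return [...super.describe(), 'python output']; }
};

const Transpiler = getPythonGenerator(
  getJavascriptVisitor(getCodeGenerationVisitor(ANTLRVisitorStub))
);
console.log(new Transpiler().describe().join(' <- '));
// antlr <- generic <- javascript input <- python output
```

Because every layer is parameterised by its superclass, any input-language visitor can be combined with any output-language generator without writing a class per pair.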
When the visitor reaches a function call, variable, attribute access, or other "identifier" expression, it needs a way of knowing what that symbol evaluates to in order to know whether it is valid. Each input language has its own set of symbols that are part of the language. The majority of symbols supported in the input languages are BSON types (e.g. Int32, ObjectId, etc.), but there are a few native types like RegExp and Date that are not BSON-specific. In order for the transpiler to know if a symbol is undefined, we store symbol information in a Symbol Table.
The visitor uses the symbol table to determine if a symbol is undefined, but the symbol table also stores some metadata so the visitor can do type and other validity checks. The symbols are defined in YAML in the symbols/<input language>/symbols.yaml file. A symbol definition looks like:
Decimal128:
    id: "Decimal128"
    callable: *constructor
    args:
        - [ *ObjectType ]
    type: *Decimal128Type
    attr:
        fromString:
            id: "fromString"
            callable: *func
            args:
                - [ *StringType ]
            type: *Decimal128Type
            attr: {}
            template: *Decimal128SymbolFromStringTemplate
            argsTemplate: *Decimal128SymbolFromStringArgsTemplate
    template: *Decimal128SymbolTemplate
    argsTemplate: *Decimal128SymbolArgsTemplate
Field | Data |
---|---|
id | The name of the attribute. Mostly used for error reporting and the emit or process method names. |
callable | Used for determining if the symbol can be part of a function call. There are three types of symbols: (1) *func: a function call. If the symbol is found as the "left-hand side" of a function call, it is valid. (2) *constructor: also a function call, but may require a new keyword if the output language requires it. (3) *var: a variable. Indicates to the transpiler that the symbol cannot be invoked, i.e. <symbol>() is invalid. |
args | Used for type checking. If the symbol is callable, i.e. a *func or *constructor, then the expected arguments are defined here as an array. So, for example, if the function takes in a string as the first arg and an optional second arg that can be an object or array, args will look like [ [*StringType], [*ObjectType, *ArrayType, null] ]. Null indicates optional. |
type | The type that the symbol evaluates to. If it is a variable, it will be the type of the variable. If it is a function, it will be the return type. See the Types section. |
attr | Used for determining if attribute access is valid. This field is also a symbol table, but a namespace-prefixed one. So for the example above, Decimal128.fromString is a valid attribute, so we need to define the symbol fromString in the same way we defined the Decimal128 symbol. |
template | Used for code generation. See the Templates section. |
argsTemplate | Used for code generation. See the Templates section. |
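As an illustration of how the args field could drive type checking, here is a minimal sketch. The string constants stand in for the YAML aliases, and `checkArguments` is a hypothetical helper, not the transpiler's own implementation.

```javascript
// Hypothetical type markers standing in for *StringType, *ObjectType, *ArrayType.
const StringType = 'string';
const ObjectType = 'object';
const ArrayType = 'array';

// expected: one array of allowed types per position; a null inside that array
// marks the argument as optional.
// actual: the types of the arguments found at the call site.
function checkArguments(expected, actual) {
  if (actual.length > expected.length) return false;
  return expected.every((allowed, i) => {
    if (i >= actual.length) return allowed.includes(null); // missing: must be optional
    return allowed.includes(actual[i]);
  });
}

const expectedArgs = [[StringType], [ObjectType, ArrayType, null]];
console.log(checkArguments(expectedArgs, [StringType]));             // true
console.log(checkArguments(expectedArgs, [StringType, ArrayType]));  // true
console.log(checkArguments(expectedArgs, []));                       // false
console.log(checkArguments(expectedArgs, [StringType, StringType])); // false
```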
Each input language also has a set of types that are part of the language. The set of types that are universal for all languages (e.g. "primitives" and "literals" like string, integer, etc.) are defined in the file symbols/basic_types.yaml. Types that are specific to the input language are defined in symbols/<input language>/types.yaml. These include BSON types, i.e. classes like ObjectId, and language-specific types like RegExp and Date. The types are defined in the same pattern as the symbols and contain the same metadata as the symbols.
NOTE: It is important not to mix up symbols and types. They can share the same identifier and carry the same kind of metadata, but we have to make a distinction between them because otherwise we will end up with invalid code.
The symbol ObjectId has attributes like ObjectId.fromString(...) and is a constructor, so ObjectId() is valid. The type ObjectId has attributes like ObjectId().toString() and is a variable, so ObjectId()() is not valid and will error with ObjectId() is not callable or a similar error. You can kind of think of types as instantiated symbols, if that's helpful. So: ObjectId.toString() and ObjectId().fromString('x') are both invalid, while ObjectId().toString() and ObjectId.fromString('x') are both valid.
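The symbol/type distinction can be sketched with two heavily reduced tables. The table contents and the helpers `getAttr`, `call`, and `isValid` are assumptions for illustration only.

```javascript
// Hypothetical, reduced tables: the type describes instances, the symbol
// describes the name as it appears in source code.
const ObjectIdType = {
  id: 'ObjectId',
  callable: 'var',
  attr: { toString: { id: 'toString', callable: 'func', type: 'string', attr: {} } },
};
const symbols = {
  ObjectId: {
    id: 'ObjectId',
    callable: 'constructor',
    type: ObjectIdType,
    attr: { fromString: { id: 'fromString', callable: 'func', type: ObjectIdType, attr: {} } },
  },
};

function getAttr(entry, name) {
  // Own-property check so inherited names such as "toString" are not found by accident.
  if (!Object.prototype.hasOwnProperty.call(entry.attr, name)) {
    throw new Error(`${name} is not an attribute of ${entry.id}`);
  }
  return entry.attr[name];
}
function call(entry) {
  if (entry.callable === 'var') throw new Error(`${entry.id}() is not callable`);
  return entry.type; // calling yields the return type, an "instantiated" symbol
}
const isValid = (expression) => {
  try { expression(); return true; } catch (err) { return false; }
};

const { ObjectId } = symbols;
console.log(isValid(() => call(getAttr(call(ObjectId), 'toString'))));   // ObjectId().toString()
console.log(isValid(() => call(getAttr(ObjectId, 'fromString'))));       // ObjectId.fromString('x')
console.log(isValid(() => call(getAttr(ObjectId, 'toString'))));         // ObjectId.toString()
console.log(isValid(() => call(getAttr(call(ObjectId), 'fromString')))); // ObjectId().fromString('x')
console.log(isValid(() => call(call(ObjectId))));                        // ObjectId()()
```

The first two checks succeed and the last three fail, matching the valid and invalid expressions listed above.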
The symbol table includes an additional piece of metadata, called a template. Templates are functions that accept strings and return strings, and are responsible for doing the string transformations from one language's syntax to another language's syntax. They are defined in symbols/<output language>/templates.yaml. This is where the majority of code generation happens, so the templates are specific to the output language. Some templates take additional arguments, which are commented in symbols/sample_template.yaml. Templates can be split into template and argsTemplate. For symbols that are function calls, the argsTemplate is a function that gets applied to the arguments in case they need rearranging between languages.
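A minimal sketch of how a template and an argsTemplate could combine for one call expression. The `render` helper, the function signatures, and the sample symbols are assumptions; the real template signatures are documented in the sample template file.

```javascript
// Hypothetical renderer: the template produces the left-hand side, the
// argsTemplate produces the argument list. Missing templates fall back to
// the name as written and a plain comma-separated argument list.
function render(symbol, name, args) {
  const lhs = symbol.template ? symbol.template(name) : name;
  const rhs = symbol.argsTemplate
    ? symbol.argsTemplate(name, ...args)
    : `(${args.join(', ')})`;
  return lhs + rhs;
}

// An output language that needs "new" in front of constructors.
const objectIdSymbol = {
  template: (lhs) => `new ${lhs}`,
  argsTemplate: (lhs, id) => (id === undefined ? '()' : `(${id})`),
};
// An output language whose equivalent function takes the arguments in reverse order.
const swappedSymbol = {
  argsTemplate: (lhs, first, second) => `(${second}, ${first})`,
};

console.log(render(objectIdSymbol, 'ObjectId', ["'abc'"])); // new ObjectId('abc')
console.log(render(swappedSymbol, 'pair', ['1', '2']));     // pair(2, 1)
console.log(render({}, 'f', ['a', 'b']));                   // f(a, b)
```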
The entry point to the project is index.js. It currently exports two APIs: compiling a string given an inputLang and an outputLang, and accessing the import statements you might need for a language.
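As a sketch of how a caller might select a transpiler from an object keyed first by input language and then by output language: the `compile` method name, the fake transpilers, and the `getTranspilerFor` helper are all assumptions made for this example and are not confirmed by this document.

```javascript
// Stand-in for the object exported by index.js. The transpilers here are
// fakes that only tag the input so the sketch can run on its own.
const fakeExports = {
  javascript: {
    python: { compile: (code) => `# python for: ${code}` },
  },
  shell: {
    python: { compile: (code) => `# python for: ${code}` },
  },
};

// Hypothetical helper that fails with a readable message for unknown pairs.
function getTranspilerFor(exportsObject, inputLang, outputLang) {
  const byInput = exportsObject[inputLang];
  if (!byInput) throw new Error(`Unsupported input language: ${inputLang}`);
  const transpiler = byInput[outputLang];
  if (!transpiler) throw new Error(`Cannot compile ${inputLang} to ${outputLang}`);
  return transpiler;
}

const transpilerSketch = getTranspilerFor(fakeExports, 'shell', 'python');
console.log(transpilerSketch.compile("{x: '1'}")); // # python for: {x: '1'}
```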
To construct a transpiler, index.js needs the following components:
- lib/antlr/<ANTLR tree visitor - The ANTLR-generated visitor for the generated parse tree.
- codegeneration/CodeGenerationVisitor.js - The "generic" visitor for all input languages.
- codegeneration/<input language>/Visitor.js - The visitor for the specific input language.
- codegeneration/<output language>/Generator.js - The generator for the specific output language.
- lib/symbol-table/<input language>to<output language>.js - The symbol table for the input+output combination.
- CodeGenerationVisitor: place to store repeated code between all visitors.
- Visitor: visits nodes; processes the input language via process methods and sends information to either the output language's template found in the Symbol Template or an emit method in the Generator.
- Generator: processes the output language via emit methods, which take in tree nodes and return strings.
- Symbol Table: a directory of the defined symbols, types and their metadata, including templates.
- Symbol Template: does string manipulation to provide output.
The class hierarchy of any transpiler is ANTLRVisitor <-- CodeGenerationVisitor <-- [input-language-specific] Visitor <-- [output-language-specific] Generator.
Method name | Summary |
---|---|
visit<*> | Always auto-generated by ANTLR, overridden by language-specific visitors. Controls the flow of the tree visitation. |
get<*> | Wrapper methods around nodes, defined by language-specific visitors. They exist because different grammars call equivalent nodes different names. |
generate<*> | Shared logic between all input languages. Defined in CodeGenerationVisitor and called from language-specific visitor methods. |
return<*> | Separates string generation from object generation. Defined in CodeGenerationVisitor or object/Generator. |
process<*> | Special case for entire input language. Defined in visitors, called programmatically within the visitor. |
emit<*> | Special case for entire output language. Defined in generators, called programmatically by the visitors. |
Tests are written in YAML. Each YAML file requires two keys. The first is 'runner', a function that runs each individual test. Its signature is always (it, expect, input, output, transpiler, test), where test is a copy of a single YAML test. The second key is 'tests', an array of individual tests. Each test has three fields: 'description' (string), 'input', and 'output'. The 'input' and 'output' fields are objects whose keys are languages and whose values are the code to be compiled or the code that is expected, respectively.
The test runner is test/run-yaml.test.js. If it's not possible to run the tests you want to write using run-yaml.test.js, you can add your YAML file to test/yaml/edge-cases and write your own runner, although I doubt that will end up being needed.
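A runner with the documented signature might look roughly like the sketch below. The way the transpiler is indexed, the `compile` method name, and the `fakeIt` / `fakeExpect` stand-ins for the test framework are all assumptions, so treat this as an illustration of the shape rather than the project's actual runner.

```javascript
// Hypothetical runner: compile the input for one language pair and compare it
// with the expected output from the YAML test.
const runner = (it, expect, input, output, transpiler, test) => {
  it(`${test.description}: ${input} to ${output}`, () => {
    const actual = transpiler[input][output].compile(test.input[input]);
    expect(actual).to.equal(test.output[output]);
  });
};

// Minimal stand-ins so the sketch runs without a test framework.
const executed = [];
const fakeIt = (name, fn) => { fn(); executed.push(name); };
const fakeExpect = (actual) => ({
  to: {
    equal: (expected) => {
      if (actual !== expected) throw new Error(`expected ${expected}, got ${actual}`);
    },
  },
});
// Fake transpiler that only quotes the key of this one sample document.
const fakeTranspiler = {
  shell: { python: { compile: (code) => code.replace('x:', "'x':") } },
};
const sampleTest = {
  description: 'Document',
  input: { shell: "{x: '1'}" },
  output: { python: "{'x': '1'}" },
};

runner(fakeIt, fakeExpect, 'shell', 'python', fakeTranspiler, sampleTest);
console.log(executed); // [ 'Document: shell to python' ]
```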
- Create a directory in the symbols directory for your output language:
mkdir symbols/<output lang>
- Create a templates.yaml file to store your language's templates. Inside, you'll probably want to copy the contents of the symbols/sample_template.yaml file. That file also includes comments on which template functions require unusual arguments.
cp symbols/sample_template.yaml symbols/<output lang>/templates.yaml
- Add your new language to the compile-symbol-table.js file:
const outputLangs = [
  'java',
  'shell',
  'python',
  'csharp',
  'javascript',
  'mylanguage',
];
- You should now run npm run compile to generate a complete symbol table. This will be generated in lib/symbol-table/javascriptto<output lang> and lib/symbol-table/shellto<output lang>.
- You will have to require the generated symbol tables in index.js:
const javascript<output lang>symbols = require('./lib/symbol-table/javascriptto<output lang>');
const get<output lang>Generator = require('./codegeneration/<output lang>/Generator.js');

// and then add another export to module.exports at the bottom of the file:
module.exports = {
  javascript: {
    <output lang>: getTranspiler(
      getJSTree,
      getJavascriptVisitor(getCodeGenerationVisitor(JavascriptANTLRVisitor)),
      get<output lang>Generator,
      javascript<output lang>symbols
    )
  },
  /* ... and every other language that can compile to your language.
   * Make sure you update the getTree method, as well as the input-language
   * specific visitor and the ANTLR visitor to match the input lang. */
};
- We still don't have the Generator.js file required above, so that won't quite work yet. Next, create a new directory in codegeneration for your output language:
mkdir codegeneration/<output lang>
- And create a generator file:
touch codegeneration/<output lang>/Generator.js
- Most of the Generators are empty nowadays because the work has been moved into the templates. I would expect that 90% of the time, what you're trying to do can be done in the templates without changing the template signature. However, if you really can't do it in the templates, you can copy any of the existing generators and fill in any emit methods you require. The class should be created with a superclass that is passed to the function, like so:
/*
 * Class for handling edge cases for node code generation. Defines "emit" methods.
 */
module.exports = (Visitor) => class Generator extends Visitor {
  constructor() {
    super();
  }

  // Example emit method: takes a tree node and returns the generated string.
  emitX(ctx) {
    // ...
  }
};
- Next thing is tests! You must go through each test file and add the results of compiling each input into your output language under the output field.
Document:
  - input:
      javascript: "{x: '1'}"
      shell: "{x: '1'}"
      python: "{'x': '1'}"
    output:
      javascript: "{\n 'x': '1'\n}"
      python: "{\n 'x': '1'\n}"
      java: 'eq("x", "1")'
      csharp: 'new BsonDocument("x", "1")'
      shell: "{\n 'x': '1'\n}"
      <your output language>: ...
Make sure to add your output language to the outputLanguages array at the beginning of run-yaml.test.js, and to the list near the end of functions.test.js.
TODO!