-
Notifications
You must be signed in to change notification settings - Fork 7
Parsing
Charlie has in-built support for parsing, through a three-step model:
- Lexing: a string or input file is sequentialised into a sequence of tokens (e.g., IDENTIFIER("myname"), OPENBRACKET, DECLARE, ...)
- Parsing: the tokens are turned into a parse tree
- Reading: the parse tree is turned into internal structures like terms and a TRS
(Parsing and reading can be done in one go, but for the sake of modularity and code reusability, this is not recommended.)
Unlike common approaches, lexing and parsing are not done through parser generators like Antlr, but rather use an internal library (in the charlie.parser.lib package). This has the advantages of not requiring the programmer to learn a separate language to do quite complicated things and offering full flexibility to for instance change the lexer halfway through a file, but does mean some things are harder than they would be in a traditional parser generator.
We will discuss how to use this package through the example of the higher-order "ari" format that is the standard in the higher-order termination competition. (At the time of writing, this format was present in Charlie, but it may have been extended since then.) This format is defined by:
trs ::= (FORMAT higher-order) sort+ fun+ hrule+
sort ::= (SORT IDENTIFIER )
fun ::= (FUN IDENTIFIER type )
type ::= IDENTIFIER | (-> type+ IDENTIFIER )
term ::= IDENTIFIER | ( IDENTIFIER term+ ) | (LAMBDA (var+) term)
var ::= (IDENTIFIER type )
rule ::= (RULE term term)
Here, FORMAT, SORT, FUN, LAMBDA and RULE are keywords (and should be used uncapitalised); -> and ( and ) are all pre-defined as well, and for now let us say that IDENTIFIER is a nonempty string of characters in a-z, A-Z, 0-9, or any of ~ ! @ $ % ^ & * _ - + = < > . ? / that does not start with a digit. Whitespace is ignored. Any occurrence of ; starts a comment, which runs until the end of the line.
The first step is to define a lexer. A lexer reads a file or string, and turns it into a sequence of tokens: a token (charlie.parser.lib.Token) keeps track of a position in a file or string (e.g., append.trs:5:3), a token name (e.g., "IDENTIFIER") and the text that the token represents (e.g., "append").
It is worth noting that a lexer does not read a whole file or string in one go: the text is read token by token, so there is never a long list of tokens in memory, and the programmer could make changes to the lexer after any given token). When reading a file, there is never more than one line read into memory.
The core part of defining a lexer is setting up the tokens. This is done by (a) creating a number of string constants identifying the names of the tokens we will use, and (b) defining a regular expression for each token. The possible components of regular expressions in Java are explained in https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html .
In our example, we create the following file AriTokenData.java:
import java.io.IOException;
import charlie.parser.lib.*;
/**
* This file defines the tokens used to lex and parse a file or string using the higher-order
* ARI format.
*/
public class AriTokenData {
public static String FORMAT = "FORMAT";
public static String SORT = "SORT";
public static String FUN = "FUN";
public static String RULE = "LAMBDA";
public static String IDENTIFIER = "IDENTIFIER";
public static String BRACKETOPEN = "BRACKETOPEN";
public static String BRACKETCLOSE = "BRACKETCLOSE";
public static String ARROW = "ARROW";
private static String _letter = "[a-z]|[A-Z]";
private static String _legalchar = "~|!|@|\\$|%|\\^|&|\\*|_|-|\\+|=|<|>|\\.|\\?|\\/";
private static String _digit = "[0-9]";
/* Next, we define the regular expressions for all tokens. */
private static String[] tokens = new String[] {
"format" , FORMAT,
"sort" , SORT,
"fun" , FUN,
"rule" , RULE,
"->" , ARROW,
"(" + _letter + "|" + _legalchar + ")(" + _letter + "|" + _legalchar + "|" + _digit + ")*"
, IDENTIFIER,
"\\(" , BRACKETOPEN,
"\\)" , BRACKETCLOSE,
"\\s" , Token.SKIP,
";.*$" , Token.SKIP,
};
public static Lexer getStringLexer(String str) {
return LexerFactory.createStringLexer(tokens, str);
}
public static Lexer getFileLexer(String filename) throws IOException {
return LexerFactory.createFileLexer(tokens, filename);
}
}
What this does is:
- It defines constants FORMAT, SORT, FUN, RULE, LAMBDA, IDENTIFIER, BRACKETOPEN, BRACKETCLOSE and ARROW, so that the parser (which we will write next) can refer to them. These are currently just defined as the string describing the constant, but we could use any string, including a string with spaces in it. (However, for error messages it is typically useful if the name is clear). The only strings you are not allowed to use are the reserved names: "EOF", "CATCHALL", and "SKIP".
''Note:'' you don't have to define these constants: you can just the strings describing the names instead in the parser. However, doing that is very sensitive to spelling errors, so it is likely safer to work with the constants. - We create a token array: an array that contains pairs (regular expression, token name for that regular expression). Some notes:
- For convenience of writing this list down, this is a single array of strings rather than an array of pairs. The regular expressions are always given at the even indexes (0, 2, 4, ...), and the corresponding names at the odd indexes (1, 3, 5, ...).
- It is fine to provide multiple expressions for the same token, and whenever we encounter one of them, it will yield an expression by that name. For example, if we allowed the keywords to be presented either in lowercase or uppercase, we could have used: private static String[] tokens = new String[] { "format" , FORMAT, "FORMAT" , FORMAT, "sort" , SORT, "SORT" , SORT, ...
- Token.SKIP is a special token which means "if you encounter this token while reading, ignore it and return the next one instead". Hence, in our example comments and whitespace are ignored.
- When going through a string or file, a lexer will find the longest string starting at the current position in the string or file that matches any of the regular expressions in the list. If more than one regular expression matches this length, then the first one in the list is returned. For example, if we have a string "-> ", then the longest string that is matched is "->". This could either be an ARROW or IDENTIFIER, so the token ARROW("->") is returned because this regular expression occurs first in the list. But if we encounter the string "->a" instead, then the lexer returns IDENTIFIER("->a"), because "->a" is longer than "->".
- Regular expressions are not required to be exhaustive. If we are going through a file or string that has a character or word that is not matched by any regular expression in the list, then a single character token with name Token.CATCHALL is returned.
- While the example uses a static array, this is not mandatory. It is perfectly fine for the array to be dynamically generated, so you could also for instance make custom parsers that use regular expressions (partially) defined by the user. Changing the array after it has been used to construct the lexer does not alter the lexer (but we will see below how lexers can be altered on the fly).
- Tokens may not be on more than one line! This is to avoid having to read a whole file into memory just to check whether a regular expression matches (and also to secure some efficiency with regular expressions matching a very long string). We will discuss below how to get around this restriction for specific multiline tokens.
- It creates two lexers: one that reads tokens from a given string, and one that reads tokens from an input file. Here, Lexer is an interface that just defines one function: public Token nextToken() throws LexerException. Whenever you call nextToken(), one token is returned, and the position in the file or string progresses to just behind this token. If the end of the file or string is reached, the lexer will always return a special token named Token.EOF. If something goes wrong while reading -- including file reading errors -- a LexerException is thrown, but the lexer will always try to move past the problem so that the next time you call nextToken() you will hopefully not get the same exception.
One of the advantage of separating lexing and parser is that we can unit test the lexer on its own. See below an example file that tests all functionality of this lexer.
public class AriTokensTest {
private Lexer createLexer(String str) {
return AriTokenData.getStringLexer(str);
}
@Test
public void testLexSimpleIdentifier() throws LexerException {
Lexer lexer = createLexer("myfun");
assertTrue(lexer.nextToken().toString().equals("1:1: myfun (IDENTIFIER)"));
assertTrue(lexer.nextToken().isEof());
}
@Test
public void testDigitsAndSpecialCharacters() throws LexerException {
Lexer lexer = createLexer("a1^%!2@>9 &45 -");
assertTrue(lexer.nextToken().toString().equals("1:1: a1^%!2@>9 (IDENTIFIER)"));
assertTrue(lexer.nextToken().toString().equals("1:11: &45 (IDENTIFIER)"));
assertTrue(lexer.nextToken().toString().equals("1:15: - (IDENTIFIER)"));
}
@Test
public void testIdentifierCannotStartWithDigit() throws LexerException {
Lexer lexer = createLexer("1ab3");
assertTrue(lexer.nextToken().toString().equals("1:1: 1 (CATCHALL)"));
assertTrue(lexer.nextToken().toString().equals("1:2: ab3 (IDENTIFIER)"));
}
@Test
public void testLexIllegalIdentifier() throws LexerException {
Lexer lexer = createLexer(" #");
assertTrue(lexer.nextToken().toString().equals("1:2: # (CATCHALL)"));
assertTrue(lexer.nextToken().isEof());
}
@Test
public void testIdentifierCanStartWithArrow() throws LexerException {
Lexer lexer = createLexer("->ab");
assertTrue(lexer.nextToken().toString().equals("1:1: ->ab (IDENTIFIER)"));
}
@Test
public void testLexComments() throws LexerException {
Lexer lexer = createLexer("a b\nc ; d e \n f ;g");
assertTrue(lexer.nextToken().toString().equals("1:1: a (IDENTIFIER)"));
assertTrue(lexer.nextToken().toString().equals("1:3: b (IDENTIFIER)"));
assertTrue(lexer.nextToken().toString().equals("2:1: c (IDENTIFIER)"));
assertTrue(lexer.nextToken().toString().equals("3:2: f (IDENTIFIER)"));
assertTrue(lexer.nextToken().isEof());
}
@Test
public void testAllBasicTokens() throws LexerException {
Lexer lexer = createLexer("(format sort) fun\n rule -> rules ->a?+-@$%^&*b");
assertTrue(lexer.nextToken().toString().equals("1:1: ( (BRACKETOPEN)"));
assertTrue(lexer.nextToken().toString().equals("1:2: format (FORMAT)"));
assertTrue(lexer.nextToken().toString().equals("1:9: sort (SORT)"));
assertTrue(lexer.nextToken().toString().equals("1:13: ) (BRACKETCLOSE)"));
assertTrue(lexer.nextToken().toString().equals("1:15: fun (FUN)"));
assertTrue(lexer.nextToken().toString().equals("2:2: rule (RULE)"));
assertTrue(lexer.nextToken().toString().equals("2:7: -> (ARROW)"));
assertTrue(lexer.nextToken().toString().equals("2:10: rules (IDENTIFIER)"));
assertTrue(lexer.nextToken().toString().equals("2:16: ->a?+-@$%^&*b (IDENTIFIER)"));
assertTrue(lexer.nextToken().isEof());
}
}
If a lexer like the one above satisfies your need, feel free to skip on to the section on parsing. However, for those cases where more is required, we here discuss the other support for defining lexers.
In our language, -> is an ARROW token, but ->a is an IDENTIFIER. Suppose that this is not what we want: we do not want to allow -> to occur: any such occurrence should be marked as an ARROW. So the string "a->b" would then be lexed as IDENTIFIER, ARROW, IDENTIFIER.
This is difficult to do in a regular expression if we do still want to allow both - and > to occur inside an identifier, but it is possible using lookahead. Our lexer works as desired if we replace the definition of legalchar by:
private static String _legalchar = "~|!|@|\\$|%|\\^|&|\\*|_|(-(?!>))|\\+|=|<|>|\\.|\\?|\\/";
Here, instead of supporting the character - outright, we support it only if not followed by a >. Java regular expressions also support lookbehind. We can see that this works using the test:
@Test
public void testIdentifierCanNOTStartWithArrow() throws LexerException {
Lexer lexer = createLexer("->c->ab->");
assertTrue(lexer.nextToken().toString().equals("1:1: -> (ARROW)"));
assertTrue(lexer.nextToken().toString().equals("1:3: c (IDENTIFIER)"));
assertTrue(lexer.nextToken().toString().equals("1:4: -> (ARROW)"));
assertTrue(lexer.nextToken().toString().equals("1:6: ab (IDENTIFIER)"));
assertTrue(lexer.nextToken().toString().equals("1:8: -> (ARROW)"));
}
But now suppose that you are not such a hero with regular expressions. It is actually often possible to avoid complex regular expressions and instead combine lexers to great effect.
Lexers are designed using the adapter pattern. That is, we can make alternative classes implementing Lexer which take an existing Lexer as an argument, delegate some work to it, but adapt the results. For example, we would ideally make something like the following (warning: deliberately not complete!)
class ArrowSensitiveLexer implements Lexer {
private Lexer _mylexer;
Token _saved;
public ArrowSensitiveLexer(Lexer l) {
_mylexer = l;
_saved = null;
}
public Token nextToken() {
Token ret = _saved == null ? _mylexer.nextToken() : _saved;
_saved = null;
if (!ret.getName().equals(AriTokenData.IDENTIFIER)) return ret;
if (ret.indexOf("->") < 0) return ret;
// TODO: split ret.getText() into identifiers and arrows, return the first, and store the rest as one token in _saved
}
}
To do this sort of thing, parser.lib provides several pre-existing Lexer-adapters and inheritables. In our case, rather than finishing the code above (which would require manually messing with tokens), we will use the TokenEditLexer. This abstract class handles the logic in nextToken(), and only requires the inheriting class to specify which tokens it should pay attention to, and what should be done with them. Thus, we would add the following code in AriTokenData.java:
public static class AriSplitLexer extends TokenEditLexer {
public AriSplitLexer(Lexer lexer) {
super(lexer, IDENTIFIER);
}
protected void modifyToken(Token token) throws LexerException {
String txt = token.getText();
int index = txt.indexOf("->");
int offset = 0;
while (index >= 0) {
if (index > offset) storeToken(token, offset, IDENTIFIER, txt.substring(offset, index));
storeToken(token, index, ARROW, "->");
offset = index + 2;
index = txt.indexOf("->", offset);
}
if (offset < txt.length()) storeToken(token, offset, IDENTIFIER, txt.substring(offset));
}
}
We also replace getStringLexer by:
public static Lexer getStringLexer(String str) {
Lexer lexer = LexerFactory.createStringLexer(tokens, str);
lexer = new AriSplitLexer(lexer);
return lexer;
}
And do similar for getFileLexer.
In the AriSplitLexer, the constructor tells the inherited class which lexer should be used as a basis, and which token to pay attention to. Then whenever nextToken() is called, lexer.nextToken() is investigated: if this returns anything other than IDENTIFIER the token is returned unmodified, but if it returns IDENTIFIER then modifyToken() is called. We have implemented this function to store all the identifers and arrows occurring in the given token separately. Subsequent calls to nextToken() will now return the stored tokens (unmodified, even if they are IDENTIFIERs) until the stored tokens are exhausted, and the next token is read from lexer (and potentially passed to modifyToken).
Here, storeToken takes an existing token and an offset, token name and text, and creates a new token with the given name and text, starting offset positions to the right of the position of the given token.
While here we used a TokenEditLexer simply to avoid having to write a complicated regular expression, their potential is much greater. For example, there is nothing stopping us from creating a TokenEditLexer that checks for infix symbols from a given list of user-defined strings that might be dynamically updated even during parsing.
So what else can we do with these lexers? Well, there's no obvious limit because you can just define your own Lexer that takes a given Lexer as an argument, but for many practical situations default lexer adapters have been defined.
An example that we may want to use is the error lexer. In our example token list, it is very possible to write something that cannot be tokenised into legal tokens -- for example unicode characters or something like ######. This will return a CATCHALL token, but we may not want to deal with that when we are writing a parser. So, let us throw an error whenever we encounter such a token.
We do this by updating our lexer as follows:
public static Lexer getStringLexer(String str) {
Lexer lexer = LexerFactory.createStringLexer(tokens, str);
lexer = new AriSplitLexer(lexer);
lexer = LexerFactory.createErrorLexer(lexer, Token.CATCHALL,
"Unexpected symbol: @TEXT@. This symbol is not permitted in an ARI input file.");
return lexer;
}
(And similar for getFileLexer.) Now when we for example try to parse the string " #", we get the following error message:
charlie.parser.lib.LexerException: 1:2: Unexpected symbol: #. This symbol is not permitted in an ARI input file.
The syntax to create an Error lexer is simply LexerFactory.createErrorLexer([existing lexer from which we will read tokens], [name of the token for which an error should be generated if it occurs], [string to print when a token with this name is encountered]). In the string, any occurrence of @TEXT@ is replaced by the text of the illegal token.
After the illegal token is encountered, the lexer does move past this token before throwing a LexerException; hence, the next call to nextToken() will return the next token, not another exception:
@Test
public void testIdentifierCannotStartWithDigit() throws LexerException {
Lexer lexer = createLexer("1ab3");
assertThrows(LexerException.class, () -> lexer.nextToken());
assertTrue(lexer.nextToken().toString().equals("1:2: ab3 (IDENTIFIER)"));
}
Of course, while we here used the "catch all" token that should usually be represented by an error, you can define your own tokens that should result in an exception. Typical examples are unterminated strings, or non-existing escape codes.
We have seen how to use lexers directly, and how to adapt existing lexers. For the next part, it will be useful to know a bit more about the inner workings of these lexers.
There are three different kinds of lexers:
- Lexers: the base interface, and the only one we have used so far. It only defines a function nextToken(), and gives no promises where this token comes from.
- Changeable lexers (this interface extends Lexer): these lexers have the possibility to change their token data (i.e., the regular expressions they look for and the corresponding token names). This allows a programmer to for instance define multiple modes. To ensure that this is actually correct, all instances of ChangeableLexer must commit to not progressing the reading of file or string beyond the last token that was returned. This is why for instance a TokenEditLexer is not a changeable lexer: if a long token is read but split in two, and then the internal token list is changed by the programmer, the next returned token would be incorrect as it still represents a token of the old list.
- Token queues (this interface extends Lexer): these lexers explicitly have an option to push back (an arbitrary number of) tokens that have already been read. This is very useful when writing a parser, where it is often needed to know what the next few tokens are to be sure which grammar rule is applicable (as we will discuss later on this page).
The lexers returned from LexerFactory.createStringLexer and LexerFactory.createFileLexer are both instances of ChangeableLexer. Most of the others are not changeable, because even though for instance the error lexer does not progress its input beyond the last read token, changing the underlying tokens would alter its functionality, and consequently introduce a kind of coupling that does not seem appropriate for library code. However, you are of course free to make your own ChangeableLexer variations of such adapters.
You can change the tokens used in a changeable lexer for instance as in the following example:
TokenFinder backup = _mychangeablelexer.getCurrentTokenData();
_mychangeablelexer.changeTokenData(LexerFactory.createTokenData(newtokens));
// do stuff with the alternative tokens
_mychangeablelexer.changeTokenData(backup);
Of course, you don't have to save a backup: if the tokens are permanently changed, you can alter them on the fly.
An important restriction of lexers in charlie.parser.lib is that files or strings are read one line at a time. This means that the regular expression for a Token cannot match past a newline. For most tokens this is fine, but it becomes problematic if we for instance want to support multiline comments or strings.
Fortunately, these sorts of tokens can usually be handled in a ChangeableLexer by temporarily switching tokens when the start of a potential multiline token is encountered. To make this easy, the LongTokenLexer aims to support those multiline tokens that have a shape [start] [middle]* [end], where the regular expressions for [start], [middle] and [end] do not include newlines.
To use this lexer adapter, set up a changeable lexer which recognises the start of the long token as a token. The LongTokenLexer should be initialised with the underlying lexer, starting token, regular expressions defining the middle and end point of the long token, and the name of the result token (or Token.SKIP if the whole long token should be skipped).
Let us continue our AriTokenData example. The actual definition of an "identifier" in the ari format also allows for tokens of the form |text|, where text is a string built of arbitrary characters, including newlines, but only excluding | and . This can be set up as follows. We update String tokens[] to include:
"\\|" , "BAR",
(We don't create a constant for this token name because it will never be returned.)
Then we change our getStringLexer function as follows:
public static Lexer getStringLexer(String str) {
ChangeableLexer clexer = LexerFactory.createStringLexer(tokens, str);
clexer = LexerFactory.createLongTokenLexer(lexer, "BAR", "[^|\\\\]", "\\|", IDENTIFIER);
Lexer lexer = LexerFactory.createErrorLexer(lexer, Token.CATCHALL,
"Unexpected symbol: @TEXT@. This symbol is not permitted in an ARI input file.");
return new AriSplitLexer(lexer);
}
Note that the order matters: we first create a long token lexer, and then an error lexer on top of it. The other way around is not possible because the error lexer is not changeable (and if it was, it would mess up the error messages by catching occurrences of "\" in the middle of a long token).
Our example lexer already supports comments: whenever ; is typed, the rest of the string until the end of the line is a comment. (In fact, in this regular expression we take advantage of the fact that matching is always until the end of the line.)
However, many programming languages support multiline comments: comments of the form /* ... */ or (* ... *). We could handle this through a LongTokenLexer (which is given a token with the result name Token.SKIP), but what if nested comments should be supported? For example, what if /* this /* is */ a nested comment */ were allowed as a full comment? This would be hard to do with a regular expresion.
Again, we have a dedicated lexer adapter to solve this problem. As an argument, it takes a lexer, and the token names for the start and end of a comment. For example:
lexer = LexerFactory.createNestedCommentRemoverLexer(lexer, "COMMENTOPEN", "COMMENTCLOSE");
The resulting lexer skips over any text in a nested comment block. If instead you want a non-nested comment, this is similarly done using:
lexer = LexerFactory.createBasicCommentRemoverLexer(lexer, "COMMENTOPEN", "COMMENTCLOSE");
TODO: describe the overall idea
TODO: things to discuss:
- what's there
- how to finish our parser generator