diff --git a/docs/grammar.md b/docs/grammar.md index 6cf802403..2686ca4fa 100644 --- a/docs/grammar.md +++ b/docs/grammar.md @@ -51,53 +51,48 @@ Lark begins the parse with the rule 'start', unless specified otherwise in the o Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects, for the shape of the generated parse-tree, and the automatic construction of the lexer (aka tokenizer, or scanner). +## EBNF Expressions -## Terminals - -Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals. - -**Syntax:** - -```html - [. ] : -``` - -Terminal names must be uppercase. +The EBNF expression in a Lark terminal definition is a sequence of items to be matched. +Each item is one of: -Literals can be one of: +* `TERMINAL` - Another terminal, which cannot be defined in terms of this terminal. +* `"string literal"` - Literal, to be matched as-is. +* `"string literal"i` - Literal, to be matched case-insensitively. +* `/regexp literal/[imslux]` - Regular expression literal. Can include the Python stdlib's `re` [flags `imslux`](https://docs.python.org/3/library/re.html#contents-of-module-re). -* `"string"` -* `/regular expression+/` -* `"case-insensitive string"i` -* `/re with flags/imulx` -* Literal range: `"a".."z"`, `"1".."9"`, etc. +* `"character".."character"` - Literal range. The range represents all values between the two literals, inclusive. +* `(item item ..)` - Group items. +* `(item | item | ..)` - Alternate items. +* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match. +* `[item | item | ..]` - Maybe with alternates. Same as `(item | item | ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match. 
+* `item?` - Zero or one instances of item (a "maybe") +* `item*` - Zero or more instances of item +* `item+` - One or more instances of item +* `item ~ n` - Exactly *n* instances of item +* `item ~ n..m` - Between *n* to *m* instances of item -Terminals also support grammar operators, such as `|`, `+`, `*` and `?`. +The EBNF expression in a Lark rule definition is also a sequence of the same set of items to be matched, with one addition: -Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed). +* `rule` - A rule, which can include recursive use of this rule. -### Templates +## Terminals -Templates are expanded when preprocessing the grammar. +Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals. -Definition syntax: +**Syntax:** -```ebnf - my_template{param1, param2, ...}: +```html + [. ] : ``` -Use syntax: +Terminal names must be uppercase. They must start with an underscore (`_`) or a letter (`A` through `Z`), and may be composed of letters, underscores, and digits (`0` through `9`). Terminal names that start with "_" will not be included in the parse tree, unless the `keep_all_tokens` option is specified, or unless they are part of a containing terminal. Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed). -```ebnf -some_rule: my_template{arg1, arg2, ...} -``` +See [EBNF Expressions](#ebnf-expressions) above for the list of items that a terminal can match. -Example: -```ebnf -_separated{x, sep}: x (sep x)* // Define a sequence of 'x sep x sep x ...' +### Templates -num_list: "[" _separated{NUMBER, ","} "]" // Will match "[1, 2, 3]" etc. -``` +Templates are not allowed with terminals. ### Priority @@ -122,7 +117,7 @@ SIGNED_INTEGER: / /x ``` -Supported flags are one of: `imslux`. See Python's regex documentation for more details on each one. +Supported flags are one of: `imslux`. 
See Python's [regex documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax) for more details on each one. Regexps/strings of different flags can only be concatenated in Python 3.6+ @@ -196,29 +191,19 @@ _ambig **Syntax:** ```html - : [-> ] + : [-> ] | ... ``` -Names of rules and aliases are always in lowercase. +Names of rules and aliases are always in lowercase. They must start with an underscore (`_`) or a letter (`a` through `z`), and may be composed of letters, underscores, and digits (`0` through `9`). Rule names that start with "_" will be inlined into their containing rule. Rule definitions can be extended to the next line by using the OR operator (signified by a pipe: `|` ). -An alias is a name for the specific rule alternative. It affects tree construction. +An alias is a name for the specific rule alternative. It affects tree construction (see [Shaping the tree](tree_construction#shaping_the_tree)). +The effect of a rule on the parse tree can be specified by modifiers. The `!` modifier causes the rule to keep all its tokens, regardless of whether they are named or not. The `?` modifier causes the rule to be inlined if it only has a single child. The `?` modifier cannot be used on rules whose names start with an underscore. -Each item is one of: - -* `rule` -* `TERMINAL` -* `"string literal"` or `/regexp literal/` -* `(item item ..)` - Group items -* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match. -* `item?` - Zero or one instances of item ("maybe") -* `item*` - Zero or more instances of item -* `item+` - One or more instances of item -* `item ~ n` - Exactly *n* instances of item -* `item ~ n..m` - Between *n* to *m* instances of item (not recommended for wide ranges, due to performance issues) +See [EBNF Expressions](#ebnf-expressions) above for the list of items that a rule can match. 
**Examples:** ```perl @@ -230,6 +215,29 @@ expr: expr operator expr four_words: word ~ 4 ``` +### Templates + +Templates are expanded when preprocessing rules in the grammar. + +Definition syntax: + +```ebnf + my_template{param1, param2, ...}: +``` + +Use syntax: + +```ebnf +some_rule: my_template{arg1, arg2, ...} +``` + +Example: +```ebnf +_separated{x, sep}: x (sep x)* // Define a sequence of 'x sep x sep x ...' + +num_list: "[" _separated{NUMBER, ","} "]" // Will match "[1, 2, 3]" etc. +``` + ### Priority Like terminals, rules can be assigned a priority. Rule priorities are signed @@ -297,12 +305,24 @@ Note that `%ignore` directives cannot be imported. Imported rules will abide by Declare a terminal without defining it. Useful for plugins. +**Syntax:** +```html +%declare +%declare +``` + ### %override Override a rule or terminals, affecting all references to it, even in imported grammars. Useful for implementing an inheritance pattern when importing grammars. +**Syntax:** +```html +%override +%override +``` + **Example:** ```perl %import my_grammar (start, number, NUMBER) @@ -319,6 +339,12 @@ Useful for splitting up a definition of a complex rule with many different optio Can also be used to implement a plugin system where a core grammar is extended by others. +**Syntax:** +```html +%extend ... additional terminal alternate ... +%extend ... additional rule alternate ... +``` + **Example:** ```perl diff --git a/lark/grammars/lark.lark b/lark/grammars/lark.lark index cdb4d1ca7..0c072eed8 100644 --- a/lark/grammars/lark.lark +++ b/lark/grammars/lark.lark @@ -1,5 +1,15 @@ # Lark grammar of Lark's syntax # Note: Lark is not bootstrapped, its parser is implemented in load_grammar.py +# This grammar matches that one, but does not enforce some rules that it does. 
+# If you want to enforce those, you can pass the "LarkValidator" over +# the parse tree, like this: + +# from lark import Lark +# from lark.lark_validator import LarkValidator +# +# lark_parser = Lark.open_from_package("lark", "grammars/lark.lark", parser="lalr") +# parse_tree = lark_parser.parse(my_grammar) +# LarkValidator.validate(parse_tree) start: (_item? _NL)* _item? @@ -7,50 +17,54 @@ _item: rule | token | statement -rule: RULE rule_params priority? ":" expansions -token: TOKEN token_params priority? ":" expansions +rule: rule_modifiers RULE rule_params priority ":" expansions +token: TOKEN priority? ":" expansions + +rule_modifiers: RULE_MODIFIERS? rule_params: ["{" RULE ("," RULE)* "}"] -token_params: ["{" TOKEN ("," TOKEN)* "}"] -priority: "." NUMBER +priority: ("." NUMBER)? statement: "%ignore" expansions -> ignore | "%import" import_path ["->" name] -> import | "%import" import_path name_list -> multi_import - | "%override" rule -> override_rule + | "%override" (rule | token) -> override | "%declare" name+ -> declare + | "%extend" (rule | token) -> extend !import_path: "."? name ("." name)* name_list: "(" name ("," name)* ")" -?expansions: alias (_VBAR alias)* +expansions: alias (_VBAR alias)* -?alias: expansion ["->" RULE] +?alias: expansion ("->" RULE)? -?expansion: expr* +expansion: expr* -?expr: atom [OP | "~" NUMBER [".." NUMBER]] +?expr: atom (OP | "~" NUMBER (".." NUMBER)?)? ?atom: "(" expansions ")" | "[" expansions "]" -> maybe | value -?value: STRING ".." STRING -> literal_range +value: STRING ".." STRING -> literal_range | name | (REGEXP | STRING) -> literal - | name "{" value ("," value)* "}" -> template_usage + | RULE "{" value ("," value)* "}" -> template_usage name: RULE | TOKEN _VBAR: _NL? "|" OP: /[+*]|[?](?![a-z])/ -RULE: /!?[_?]?[a-z][_a-z0-9]*/ +RULE_MODIFIERS: /(!|![?]?|[?]!?)(?=[_a-z])/ +RULE: /_?[a-z][_a-z0-9]*/ TOKEN: /_?[A-Z][_A-Z0-9]*/ STRING: _STRING "i"? 
REGEXP: /\/(?!\/)(\\\/|\\\\|[^\/])*?\/[imslux]*/ _NL: /(\r?\n)+\s*/ +BACKSLASH: /\\[ ]*\n/ %import common.ESCAPED_STRING -> _STRING %import common.SIGNED_INT -> NUMBER @@ -60,3 +74,4 @@ COMMENT: /\s*/ "//" /[^\n]/* | /\s*/ "#" /[^\n]/* %ignore WS_INLINE %ignore COMMENT +%ignore BACKSLASH diff --git a/lark/lark_validator.py b/lark/lark_validator.py new file mode 100644 index 000000000..530165a52 --- /dev/null +++ b/lark/lark_validator.py @@ -0,0 +1,292 @@ +from typing import Any, Dict, List + +from .exceptions import GrammarError +from .grammar import TOKEN_DEFAULT_PRIORITY, RuleOptions +from .lexer import Token +from .load_grammar import eval_escaping +from .tree import Tree + +class Definition: + def __init__(self, is_term, tree, params=(), options=None): + self.is_term = is_term + self.tree = tree + self.params = tuple(params) + +class LarkValidator: + """ + Checks a grammar parsed by `lark.lark` for validity using a variety of checks similar to what + `load_grammar.py` does on parser creation. The only stable public entry point is + `LarkValidator.validate(tree)`. + + Checks: + - Illegal constructs not prevented by the grammar: + - `alias` not in the top expansions of a rule + - Incorrect `%ignore` lines + - Invalid literals (like newlines inside of regex without the `x` flag) + - Rules used inside of Terminals + - Undefined symbols + - Incorrectly used templates + """ + + @classmethod + def validate(cls, tree: Tree): + """ + Checks a grammar parsed by `lark.lark` for validity using a variety of checks similar to what + `load_grammar.py` does on parser creation. 
+ + Checks: + - Illegal constructs not prevented by the grammar: + - `alias` not in the top expansions of a rule + - Incorrect `%ignore` lines + - Invalid literals (like newlines inside of regex without the `x` flag) + - Rules used inside of Terminals + - Undefined symbols + - Incorrectly used templates + """ + visitor = cls(tree) + visitor._cross_check_symbols() + visitor._resolve_term_references() + visitor._check_literals(tree) + return tree + + def __init__(self, tree: Tree): + self._definitions: Dict[str, Definition] = {} + self._ignore_names: List[str] = [] + self._load_grammar(tree) + + def _check_literals(self, tree: Tree) -> None: + for literal in tree.find_data("literal"): + self._literal(literal) + + def _cross_check_symbols(self) -> None: + # Based on load_grammar.GrammarBuilder.validate() + for name, d in self._definitions.items(): + params = d.params + definition = d.tree + for i, p in enumerate(params): + if p in self._definitions: + raise GrammarError("Template Parameter conflicts with rule %s (in template %s)" % (p, name)) + if p in params[:i]: + raise GrammarError("Duplicate Template Parameter %s (in template %s)" % (p, name)) + # Remaining checks don't apply to abstract rules/terminals (i.e., created with %declare) + if definition and isinstance(definition, Tree): + for template in definition.find_data('template_usage'): + if d.is_term: + raise GrammarError("Templates not allowed in terminals") + sym = template.children[0].data + args = template.children[1:] + if sym not in params: + if sym not in self._definitions: + raise GrammarError(f"Template '{sym}' used but not defined (in {('rule', 'terminal')[d.is_term]} {name})") + if len(args) != len(self._definitions[sym].params): + expected, actual = len(self._definitions[sym].params), len(args) + raise GrammarError(f"Wrong number of template arguments used for {sym} " + f"(expected {expected}, got {actual}) (in {('rule', 'terminal')[d.is_term]} {name})") + for sym in 
_find_used_symbols(definition): + if sym not in self._definitions and sym not in params: + raise GrammarError(f"{('Rule', 'Terminal')[sym.isupper()]} '{sym}' used but not defined (in {('rule', 'terminal')[d.is_term]} {name})") + if not set(self._definitions).issuperset(self._ignore_names): + raise GrammarError("Terminals %s were marked to ignore but were not defined!" % (set(self._ignore_names) - set(self._definitions))) + + def _declare(self, stmt: Tree) -> None: + for symbol in stmt.children: + if isinstance(symbol, Tree) and symbol.data == 'name': + symbol = symbol.children[0] + if not isinstance(symbol, Token) or symbol.type != "TOKEN": + raise GrammarError("Expecting terminal name") + self._define(symbol.value, True, None) + + def _define(self, name: str, is_term: bool, exp: "Tree|None", params: List[str] = [], options: Any = None, *, override: bool = False, extend: bool = False) -> None: + # Based on load_grammar.GrammarBuilder._define() + if name in self._definitions: + if not override and not extend: + raise GrammarError(f"{('Rule', 'Terminal')[is_term]} '{name}' defined more than once") + if extend: + base_def = self._definitions[name] + if is_term != base_def.is_term: + raise GrammarError(f"Cannot extend {('rule', 'terminal')[is_term]} {name} - one is a terminal, while the other is not.") + if tuple(params) != base_def.params: + raise GrammarError(f"Cannot extend {('rule', 'terminal')[is_term]} with different parameters: {name}") + if base_def.tree is None: + raise GrammarError(f"Can't extend {('rule', 'terminal')[is_term]} {name} - it is abstract.") + if name.startswith('__'): + raise GrammarError(f'Names starting with double-underscore are reserved (Error at {name})') + if is_term: + if options and not isinstance(options, int): + raise GrammarError(f"Terminals require a single int as 'options' (e.g. 
priority), got {type(options)}") + else: + if options and not isinstance(options, RuleOptions): + raise GrammarError("Rules require a RuleOptions instance as 'options'") + self._definitions[name] = Definition(is_term, exp, params) + + def _extend(self, stmt: Tree) -> None: + definition = stmt.children[0] + if definition.data == 'token': + name = definition.children[0] + if name not in self._definitions: + raise GrammarError(f"Can't extend terminal {name} as it wasn't defined before") + self._token(definition, extend=True) + else: # definition.data == 'rule' + name = definition.children[1] + if name not in self._definitions: + raise GrammarError(f"Can't extend rule {name} as it wasn't defined before") + self._rule(definition, extend=True) + + def _ignore(self, stmt: Tree) -> None: + # Children: expansions + # - or - + # Children: token + exp_or_name = stmt.children[0] + if isinstance(exp_or_name, str): + self._ignore_names.append(exp_or_name) + else: + assert isinstance(exp_or_name, Tree) + t = exp_or_name + if t.data == 'expansions' and len(t.children) == 1: + t2 ,= t.children + if t2.data=='expansion': + if len(t2.children) > 1: + raise GrammarError("Bad %ignore - must have a Terminal or other value.") + item ,= t2.children + if item.data == 'value': + item ,= item.children + if isinstance(item, Token): + # Keep terminal name, no need to create a new definition + self._ignore_names.append(item.value) + return + if item.data == 'name': + token ,= item.children + if isinstance(token, Token) and token.type == "TOKEN": + # Keep terminal name, no need to create a new definition + self._ignore_names.append(token.value) + return + name = '__IGNORE_%d'% len(self._ignore_names) + self._ignore_names.append(name) + self._definitions[name] = Definition(True, t, options=TOKEN_DEFAULT_PRIORITY) + + def _literal(self, tree: Tree) -> None: + # Based on load_grammar.GrammarBuilder.literal_to_pattern(). 
+ assert tree.data == 'literal' + literal = tree.children[0] + assert isinstance(literal, Token) + v = literal.value + flag_start = max(v.rfind('/'), v.rfind('"'))+1 + assert flag_start > 0 + flags = v[flag_start:] + if literal.type == 'STRING' and '\n' in v: + raise GrammarError('You cannot put newlines in string literals') + if literal.type == 'REGEXP' and '\n' in v and 'x' not in flags: + raise GrammarError('You can only use newlines in regular expressions ' + 'with the `x` (verbose) flag') + v = v[:flag_start] + assert v[0] == v[-1] and v[0] in '"/' + x = v[1:-1] + s = eval_escaping(x) + if s == "": + raise GrammarError("Empty terminals are not allowed (%s)" % literal) + + def _load_grammar(self, tree: Tree) -> None: + for stmt in tree.children: + if stmt.data == 'declare': + self._declare(stmt) + elif stmt.data == 'extend': + self._extend(stmt) + elif stmt.data == 'ignore': + self._ignore(stmt) + elif stmt.data in ['import', 'multi_import']: + # TODO How can we process imports in the validator? 
+ pass + elif stmt.data == 'override': + self._override(stmt) + elif stmt.data == 'rule': + self._rule(stmt) + elif stmt.data == 'token': + self._token(stmt) + else: + assert False, f"Unknown statement type: {stmt}" + + def _override(self, stmt: Tree) -> None: + definition = stmt.children[0] + if definition.data == 'token': + name = definition.children[0] + if name not in self._definitions: + raise GrammarError(f"Cannot override a nonexisting terminal {name}") + self._token(definition, override=True) + else: # definition.data == 'rule' + name = definition.children[1] + if name not in self._definitions: + raise GrammarError(f"Cannot override a nonexisting rule {name}") + self._rule(definition, override=True) + + def _resolve_term_references(self) -> None: + # Based on load_grammar.resolve_term_references() + # and the bottom of load_grammar.GrammarBuilder.load_grammar() + term_dict = { name: d.tree + for name, d in self._definitions.items() + if d.is_term + } + while True: + changed = False + for name, token_tree in term_dict.items(): + if token_tree is None: # Terminal added through %declare + continue + for exp in token_tree.find_data('value'): + item ,= exp.children + if isinstance(item, Tree) and item.data == 'name' and isinstance(item.children[0], Token) and item.children[0].type == 'RULE' : + raise GrammarError("Rules aren't allowed inside terminals (%s in %s)" % (item, name)) + elif isinstance(item, Token): + try: + term_value = term_dict[item.value] + except KeyError: + raise GrammarError("Terminal used but not defined: %s" % item.value) + assert term_value is not None + exp.children[0] = term_value + changed = True + else: + assert isinstance(item, Tree) + if not changed: + break + + for name, term in term_dict.items(): + if term: # Not just declared + for child in term.children: + ids = [id(x) for x in child.iter_subtrees()] + if id(term) in ids: + raise GrammarError("Recursion in terminal '%s' (recursion is only allowed in rules, not terminals)" % name) + 
+ def _rule(self, tree, override=False, extend=False) -> None: + # Children: modifiers, name, params, priority, expansions + name = tree.children[1] + if tree.children[0].data == "rule_modifiers" and tree.children[0].children: + modifiers = tree.children[0].children[0] + if '?' in modifiers and name.startswith('_'): + raise GrammarError("Inlined rules (_rule) cannot use the ?rule modifier.") + if tree.children[2].children[0] is not None: + params = [t.value for t in tree.children[2].children] # For the grammar parser + else: + params = [] + self._define(name, False, tree.children[4], params=params, override=override, extend=extend) + + def _token(self, tree, override=False, extend=False) -> None: + # Children: name, priority, expansions + # - or - + # Children: name, expansions + if tree.children[1].data == "priority" and tree.children[1].children: + opts = int(tree.children[1].children[0]) # priority + else: + opts = TOKEN_DEFAULT_PRIORITY + for item in tree.children[-1].find_data('alias'): + raise GrammarError("Aliasing not allowed in terminals (You used -> in the wrong place)") + self._define(tree.children[0].value, True, tree.children[-1], [], opts, override=override, extend=extend) + +def _find_used_symbols(tree) -> List[str]: + # Based on load_grammar.GrammarBuilder._find_used_symbols() + assert tree.data == 'expansions' + results = [] + for expansion in tree.find_data('expansion'): + for item in expansion.scan_values(lambda t: True): + if isinstance(item, Tree) and item.data == 'name': + results.append(item.data) + elif isinstance(item, Token) and item.type not in ['NUMBER', 'OP', 'STRING', 'REGEXP']: + results.append(item.value) + return results diff --git a/lark/load_grammar.py b/lark/load_grammar.py index 362a845d2..08531391b 100644 --- a/lark/load_grammar.py +++ b/lark/load_grammar.py @@ -660,6 +660,9 @@ def maybe(self, expr): def alias(self, t): raise GrammarError("Aliasing not allowed in terminals (You used -> in the wrong place)") + def 
template_usage(self, t): + raise GrammarError("Templates not allowed in terminals") + def value(self, v): return v[0] @@ -1099,9 +1102,10 @@ def __init__(self, global_keep_all_tokens: bool=False, import_paths: Optional[Li self._definitions: Dict[str, Definition] = {} self._ignore_names: List[str] = [] - def _grammar_error(self, is_term, msg, *names): + def _grammar_error(self, msg, *subs): args = {} - for i, name in enumerate(names, start=1): + for i, sub in enumerate(subs, start=1): + name, is_term = sub postfix = '' if i == 1 else str(i) args['name' + postfix] = name args['type' + postfix] = lowercase_type = ("rule", "terminal")[is_term] @@ -1127,28 +1131,28 @@ def _check_options(self, is_term, options): def _define(self, name, is_term, exp, params=(), options=None, *, override=False): if name in self._definitions: if not override: - self._grammar_error(is_term, "{Type} '{name}' defined more than once", name) + self._grammar_error("{Type} '{name}' defined more than once", (name, is_term)) elif override: - self._grammar_error(is_term, "Cannot override a nonexisting {type} {name}", name) + self._grammar_error("Cannot override a nonexisting {type} {name}", (name, is_term)) if name.startswith('__'): - self._grammar_error(is_term, 'Names starting with double-underscore are reserved (Error at {name})', name) + self._grammar_error('Names starting with double-underscore are reserved (Error at {name})', (name, is_term)) self._definitions[name] = Definition(is_term, exp, params, self._check_options(is_term, options)) def _extend(self, name, is_term, exp, params=(), options=None): if name not in self._definitions: - self._grammar_error(is_term, "Can't extend {type} {name} as it wasn't defined before", name) + self._grammar_error("Can't extend {type} {name} as it wasn't defined before", (name, is_term)) d = self._definitions[name] if is_term != d.is_term: - self._grammar_error(is_term, "Cannot extend {type} {name} - one is a terminal, while the other is not.", name) + 
self._grammar_error("Cannot extend {type} {name} - one is a terminal, while the other is not.", (name, is_term)) if tuple(params) != d.params: - self._grammar_error(is_term, "Cannot extend {type} with different parameters: {name}", name) + self._grammar_error("Cannot extend {type} with different parameters: {name}", (name, is_term)) if d.tree is None: - self._grammar_error(is_term, "Can't extend {type} {name} - it is abstract.", name) + self._grammar_error("Can't extend {type} {name} - it is abstract.", (name, is_term)) # TODO: think about what to do with 'options' base = d.tree @@ -1164,7 +1168,9 @@ def _ignore(self, exp_or_name): t = exp_or_name if t.data == 'expansions' and len(t.children) == 1: t2 ,= t.children - if t2.data=='expansion' and len(t2.children) == 1: + if t2.data=='expansion': + if len(t2.children) > 1: + raise GrammarError("Bad %ignore - must have a Terminal or other value.") item ,= t2.children if item.data == 'value': item ,= item.children @@ -1238,7 +1244,6 @@ def _unpack_definition(self, tree, mangle): def load_grammar(self, grammar_text: str, grammar_name: str="", mangle: Optional[Callable[[str], str]]=None) -> None: tree = _parse_grammar(grammar_text, grammar_name) - imports: Dict[Tuple[str, ...], Tuple[Optional[str], Dict[str, str]]] = {} for stmt in tree.children: @@ -1269,13 +1274,14 @@ def load_grammar(self, grammar_text: str, grammar_name: str="", mangle: Optio self._ignore(*stmt.children) elif stmt.data == 'declare': for symbol in stmt.children: - assert isinstance(symbol, Symbol), symbol - is_term = isinstance(symbol, Terminal) + if isinstance(symbol, NonTerminal): + raise GrammarError("Expecting terminal name") + assert isinstance(symbol, Terminal), symbol if mangle is None: name = symbol.name else: name = mangle(symbol.name) - self._define(name, is_term, None) + self._define(name, True, None) elif stmt.data == 'import': pass else: @@ -1358,15 +1364,15 @@ def validate(self) -> None: args = temp.children[1:] if sym not in params: if 
sym not in self._definitions: - self._grammar_error(d.is_term, "Template '%s' used but not defined (in {type} {name})" % sym, name) + self._grammar_error("Template '%s' used but not defined (in {type} {name})" % sym, (name, d.is_term)) if len(args) != len(self._definitions[sym].params): expected, actual = len(self._definitions[sym].params), len(args) - self._grammar_error(d.is_term, "Wrong number of template arguments used for {name} " - "(expected %s, got %s) (in {type2} {name2})" % (expected, actual), sym, name) + self._grammar_error("Wrong number of template arguments used for {name} " + "(expected %s, got %s) (in {type2} {name2})" % (expected, actual), (sym, sym.isupper()), (name, d.is_term)) for sym in _find_used_symbols(exp): if sym not in self._definitions and sym not in params: - self._grammar_error(d.is_term, "{Type} '{name}' used but not defined (in {type2} {name2})", sym, name) + self._grammar_error("{Type} '{name}' used but not defined (in {type2} {name2})", (sym, sym.isupper()), (name, d.is_term)) if not set(self._definitions).issuperset(self._ignore_names): raise GrammarError("Terminals %s were marked to ignore but were not defined!" 
% (set(self._ignore_names) - set(self._definitions))) diff --git a/tests/__main__.py b/tests/__main__.py index c5298a770..875b5b715 100644 --- a/tests/__main__.py +++ b/tests/__main__.py @@ -8,7 +8,6 @@ from .test_trees import TestTrees from .test_tools import TestStandalone from .test_cache import TestCache -from .test_grammar import TestGrammar from .test_reconstructor import TestReconstructor from .test_tree_forest_transformer import TestTreeForestTransformer from .test_lexer import TestLexer @@ -26,6 +25,7 @@ from .test_logger import Testlogger from .test_parser import * # We define __all__ to list which TestSuites to run +from .test_grammar import * # We define __all__ to list which TestSuites to run if sys.version_info >= (3, 10): from .test_pattern_matching import TestPatternMatching diff --git a/tests/test_grammar.py b/tests/test_grammar.py index 624b0799a..59121f2c9 100644 --- a/tests/test_grammar.py +++ b/tests/test_grammar.py @@ -1,4 +1,5 @@ from __future__ import absolute_import +import re import os from unittest import TestCase, main @@ -6,26 +7,36 @@ from lark import Lark, Token, Tree, ParseError, UnexpectedInput from lark.load_grammar import GrammarError, GRAMMAR_ERRORS, find_grammar_errors, list_grammar_imports from lark.load_grammar import FromPackageLoader +from lark.lark_validator import LarkValidator -class TestGrammar(TestCase): - def setUp(self): - pass +__all__ = ['TestGrammarLarkOnly'] +class LarkDotLark: + def __init__(self, grammar, **kwargs): + options = {} + options.update(kwargs) + if "start" in options and options["start"] != "start": + # We're not going to parse with the parser, so just override it. 
+ options["start"] = "start" + lark_parser = Lark.open_from_package("lark", "grammars/lark.lark", **options) + tree = lark_parser.parse(grammar) + LarkValidator.validate(tree) + + def parse(self, text: str, start=None, on_error=None): + raise Exception("Cannot test cases with lark.lark that try to parse using the tested grammar.") + + +# Test cases that LarkDotLark can't implement +class TestGrammarLarkOnly(TestCase): + # Needs rewriting to work with lark.lark def test_errors(self): for msg, examples in GRAMMAR_ERRORS: for example in examples: - try: - p = Lark(example) - except GrammarError as e: - assert msg in str(e) - else: - assert False, "example did not raise an error" - - def test_empty_literal(self): - # Issues #888 - self.assertRaises(GrammarError, Lark, "start: \"\"") + with self.subTest(example=example): + self.assertRaisesRegex(GrammarError, re.escape(msg), Lark, example) + # Cannot test cases with lark.lark that try to parse using the tested grammar. def test_ignore_name(self): spaces = [] p = Lark(""" @@ -36,8 +47,8 @@ def test_ignore_name(self): assert p.parse("a b") == p.parse("a b") assert len(spaces) == 5 - - def test_override_rule(self): + # Test fails for lark.lark because it does not execute %import. + def test_override_rule1(self): # Overrides the 'sep' template in existing grammar to add an optional terminating delimiter # Thus extending it beyond its original capacity p = Lark(""" @@ -51,16 +62,15 @@ def test_override_rule(self): b = p.parse('[1, 2, 3, ]') assert a == b - self.assertRaises(GrammarError, Lark, """ + # Test fails for lark.lark because it does not execute %import. + def test_override_rule2(self): + self.assertRaisesRegex(GrammarError, r"Rule 'delim' used but not defined \(in rule sep\)", Lark, """ %import .test_templates_import (start, sep) %override sep{item}: item (delim item)* delim? """, source_path=__file__) - self.assertRaises(GrammarError, Lark, """ - %override sep{item}: item (delim item)* delim? 
- """, source_path=__file__) - + # Test fails for lark.lark because it does not execute %import. def test_override_terminal(self): p = Lark(""" @@ -73,7 +83,8 @@ def test_override_terminal(self): a = p.parse('cd') self.assertEqual(a.children[0].children, [Token('A', 'c'), Token('B', 'd')]) - def test_extend_rule(self): + # Test fails for lark.lark because it does not execute %import. + def test_extend_rule1(self): p = Lark(""" %import .grammars.ab (startab, A, B, expr) @@ -82,10 +93,7 @@ def test_extend_rule(self): a = p.parse('abab') self.assertEqual(a.children[0].children, ['a', Tree('expr', ['b', 'a']), 'b']) - self.assertRaises(GrammarError, Lark, """ - %extend expr: B A - """) - + # Test fails for lark.lark because it does not execute %import. def test_extend_term(self): p = Lark(""" %import .grammars.ab (startab, A, B, expr) @@ -95,6 +103,7 @@ def test_extend_term(self): a = p.parse('acbb') self.assertEqual(a.children[0].children, ['a', Tree('expr', ['c', 'b']), 'b']) + # Cannot test cases with lark.lark that try to parse using the tested grammar. 
def test_extend_twice(self): p = Lark(""" start: x+ @@ -106,41 +115,8 @@ def test_extend_twice(self): assert p.parse("abccbba") == p.parse("cbabbbb") - def test_undefined_ignore(self): - g = """!start: "A" - - %ignore B - """ - self.assertRaises( GrammarError, Lark, g) - - g = """!start: "A" - - %ignore start - """ - self.assertRaises( GrammarError, Lark, g) - - def test_alias_in_terminal(self): - g = """start: TERM - TERM: "a" -> alias - """ - self.assertRaises( GrammarError, Lark, g) - - def test_undefined_rule(self): - self.assertRaises(GrammarError, Lark, """start: a""") - - def test_undefined_term(self): - self.assertRaises(GrammarError, Lark, """start: A""") - - def test_token_multiline_only_works_with_x_flag(self): - g = r"""start: ABC - ABC: / a b c - d - e f - /i - """ - self.assertRaises( GrammarError, Lark, g) - - def test_import_custom_sources(self): + # Test fails for lark.lark because it does not execute %import. + def test_import_custom_sources1(self): custom_loader = FromPackageLoader(__name__, ('grammars', )) grammar = """ @@ -153,6 +129,7 @@ def test_import_custom_sources(self): self.assertEqual(p.parse('ab'), Tree('start', [Tree('startab', [Tree('ab__expr', [Token('ab__A', 'a'), Token('ab__B', 'b')])])])) + # Test fails for lark.lark because it does not execute %import. def test_import_custom_sources2(self): custom_loader = FromPackageLoader(__name__, ('grammars', )) @@ -165,6 +142,7 @@ def test_import_custom_sources2(self): x = p.parse('N') self.assertEqual(next(x.find_data('rule_to_import')).children, ['N']) + # Test fails for lark.lark because it does not execute %import. def test_import_custom_sources3(self): custom_loader2 = FromPackageLoader(__name__) grammar = """ @@ -175,7 +153,8 @@ def test_import_custom_sources3(self): x = p.parse('12 capybaras') self.assertEqual(x.children, ['12', 'capybaras']) - def test_find_grammar_errors(self): + # Test forces use of Lark. 
+ def test_find_grammar_errors1(self): text = """ a: rule b rule @@ -186,6 +165,8 @@ def test_find_grammar_errors(self): assert [e.line for e, _s in find_grammar_errors(text)] == [3, 5] + # Test forces use of Lark. + def test_find_grammar_errors2(self): text = """ a: rule b rule @@ -197,6 +178,8 @@ def test_find_grammar_errors(self): assert [e.line for e, _s in find_grammar_errors(text)] == [3, 4, 6] + # Test forces use of Lark. + def test_find_grammar_errors3(self): text = """ a: rule @#$#@$@&& b: rule @@ -209,7 +192,8 @@ def test_find_grammar_errors(self): x = find_grammar_errors(text) assert [e.line for e, _s in find_grammar_errors(text)] == [2, 6] - def test_ranged_repeat_terms(self): + # Cannot test cases with lark.lark that try to parse using the tested grammar. + def test_ranged_repeat_terms1(self): g = u"""!start: AAA AAA: "A"~3 """ @@ -218,6 +202,8 @@ def test_ranged_repeat_terms(self): self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AA') self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAAA') + # Cannot test cases with lark.lark that try to parse using the tested grammar. + def test_ranged_repeat_terms2(self): g = u"""!start: AABB CC AABB: "A"~0..2 "B"~2 CC: "C"~1..2 @@ -231,7 +217,8 @@ def test_ranged_repeat_terms(self): self.assertRaises((ParseError, UnexpectedInput), l.parse, u'ABB') self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAAABB') - def test_ranged_repeat_large(self): + # Test depends on Lark. + def test_ranged_repeat_large1(self): g = u"""!start: "A"~60 """ l = Lark(g, parser='lalr') @@ -240,6 +227,8 @@ def test_ranged_repeat_large(self): self.assertRaises(ParseError, l.parse, u'A' * 59) self.assertRaises((ParseError, UnexpectedInput), l.parse, u'A' * 61) + # Cannot test cases with lark.lark that try to parse using the tested grammar. 
+ def test_ranged_repeat_large2(self): g = u"""!start: "A"~15..100 """ l = Lark(g, parser='lalr') @@ -249,6 +238,8 @@ def test_ranged_repeat_large(self): else: self.assertRaises(UnexpectedInput, l.parse, u'A' * i) + # Cannot test cases with lark.lark that try to parse using the tested grammar. + def test_ranged_repeat_large3(self): # 8191 is a Mersenne prime g = u"""start: "A"~8191 """ @@ -257,48 +248,186 @@ def test_ranged_repeat_large(self): self.assertRaises(UnexpectedInput, l.parse, u'A' * 8190) self.assertRaises(UnexpectedInput, l.parse, u'A' * 8192) + # Cannot test cases with lark.lark that try to parse using the tested grammar. def test_large_terminal(self): g = "start: NUMBERS\n" g += "NUMBERS: " + '|'.join('"%s"' % i for i in range(0, 1000)) l = Lark(g, parser='lalr') for i in (0, 9, 99, 999): - self.assertEqual(l.parse(str(i)), Tree('start', [str(i)])) + with self.subTest(i=i): + self.assertEqual(l.parse(str(i)), Tree('start', [str(i)])) for i in (-1, 1000): - self.assertRaises(UnexpectedInput, l.parse, str(i)) + with self.subTest(i=i): + self.assertRaises(UnexpectedInput, l.parse, str(i)) + # Test forces use of Lark. def test_list_grammar_imports(self): - grammar = """ - %import .test_templates_import (start, sep) - - %override sep{item, delim}: item (delim item)* delim? - %ignore " " - """ - - imports = list_grammar_imports(grammar, [os.path.dirname(__file__)]) - self.assertEqual({os.path.split(i)[-1] for i in imports}, {'test_templates_import.lark', 'templates.lark'}) - - imports = list_grammar_imports('%import common.WS', []) - assert len(imports) == 1 and imports[0].pkg_name == 'lark' + grammar = """ + %import .test_templates_import (start, sep) - def test_inline_with_expand_single(self): - grammar = r""" - start: _a - !?_a: "A" + %override sep{item, delim}: item (delim item)* delim? 
+        %ignore " "
         """
-        self.assertRaises(GrammarError, Lark, grammar)
+        imports = list_grammar_imports(grammar, [os.path.dirname(__file__)])
+        self.assertEqual({os.path.split(i)[-1] for i in imports}, {'test_templates_import.lark', 'templates.lark'})
+
+        imports = list_grammar_imports('%import common.WS', [])
+        assert len(imports) == 1 and imports[0].pkg_name == 'lark'
+
+    # Cannot test cases with lark.lark that try to parse using the tested grammar.
     def test_line_breaks(self):
         p = Lark(r"""start: "a" \
                 "b"
             """)
         p.parse('ab')
 
+
+# Tests that both Lark and LarkDotLark can implement
+def _make_tests(parser):
+    class _TestGrammar(TestCase):
+        def test_empty_literal(self):
+            # Issue #888
+            self.assertRaisesRegex(GrammarError, r"Empty terminals are not allowed \(\"\"\)", parser, "start: \"\"")
+
+        def test_override_rule3(self):
+            self.assertRaisesRegex(GrammarError, "Cannot override a nonexisting rule sep", parser, """
+            %override sep{item}: item (delim item)* delim?
+            """, source_path=__file__)
+
+        def test_extend_rule2(self):
+            self.assertRaisesRegex(GrammarError, "Can't extend rule expr as it wasn't defined before", parser, """
+            %extend expr: B A
+            """)
+
+        def test_undefined_ignore1(self):
+            g = """!start: "A"
+
+            %ignore B
+            """
+            self.assertRaisesRegex(GrammarError, "Terminals {'B'} were marked to ignore but were not defined!", parser, g)
+
+        def test_undefined_ignore2(self):
+            g = """!start: "A"
+
+            %ignore start
+            """
+            self.assertRaisesRegex(GrammarError, "Rules aren't allowed inside terminals ", parser, g)
+
+        def test_alias_in_terminal(self):
+            g = """start: TERM
+            TERM: "a" -> alias
+            """
+            self.assertRaisesRegex(GrammarError, r"Aliasing not allowed in terminals \(You used -> in the wrong place\)", parser, g)
+
+        def test_undefined_rule(self):
+            self.assertRaisesRegex(GrammarError, r"Rule 'a' used but not defined \(in rule start\)", parser, """start: a""")
+
+        def test_undefined_term(self):
+            self.assertRaisesRegex(GrammarError, r"Terminal 'A' used but not defined 
\(in rule start\)", parser, """start: A""")
+
+        def test_token_multiline_only_works_with_x_flag(self):
+            g = r"""start: ABC
+            ABC: / a b c
+                 d
+                 e f
+                 /i
+            """
+            self.assertRaisesRegex(GrammarError, r"You can only use newlines in regular expressions with the `x` \(verbose\) flag", parser, g)
+
+        def test_inline_with_expand_single(self):
+            grammar = r"""
+            start: _a
+            !?_a: "A"
+            """
+            self.assertRaisesRegex(GrammarError, r"Inlined rules \(_rule\) cannot use the \?rule modifier", parser, grammar)
+
+        def test_declare_rule(self):
+            g = """
+            %declare a
+            start: b
+            b: "c"
+            """
+            self.assertRaisesRegex(GrammarError, "Expecting terminal name", parser, g)
+
+        def test_declare_token(self):
+            g = """
+            %declare A
+            start: b
+            b: "c"
+            """
+            parser(g)
+
+        def test_ignore_multiple(self):
+            g = """
+            %ignore A B
+            start: rule1
+            rule1: "c"
+            A: "a"
+            B: "b"
+            """
+            self.assertRaisesRegex(GrammarError, "Bad %ignore - must have a Terminal or other value", parser, g)
+
+        def test_no_rule_aliases_below_top_level(self):
+            g = """start: rule
+            rule: ("a" -> alias
+                 | "b")
+            """
+            self.assertRaisesRegex(GrammarError, "Rule 'alias' used but not defined", parser, g)
+
+        def test_no_term_templates(self):
+            g = """start: TERM
+            separated{x, sep}: x (sep x)*
+            TERM: separated{"A", " "}
+            """
+            self.assertRaisesRegex(GrammarError, "Templates not allowed in terminals", parser, g)
+
+        def test_term_no_call_rule(self):
+            g = """start: TERM
+            TERM: rule
+            rule: "a"
+            """
+            self.assertRaisesRegex(GrammarError, "Rules aren't allowed inside terminals", parser, g)
+
+        def test_no_rule_modifiers_in_references(self):
+            g = """start: rule1
+            rule1: !?rule2
+            rule2: "a"
+            """
+            self.assertRaisesRegex(GrammarError, "Expecting a value", parser, g)
+
+        def test_rule_modifier_query_bang(self):
+            g = """start: rule1
+            rule1: rule2
+            ?!rule2: "a"
+            """
+            parser(g)
+
+        def test_alias_top_level_ok(self):
+            g = """
+            start: rule1
+            rule1: rule2 -> alias2
+            rule2: "a"
+            """
+            parser(g)
+
+        def 
test_terminal_alias_bad(self): + g = """ + start: rule1 + rule1: TOKEN2 + TOKEN2: "a" -> alias2 + """ + self.assertRaisesRegex(GrammarError, "Aliasing not allowed in terminals", parser, g) + _NAME = "TestGrammar" + parser.__name__ + _TestGrammar.__name__ = _NAME + _TestGrammar.__qualname__ = _NAME + globals()[_NAME] = _TestGrammar + __all__.append(_NAME) +for parser in [Lark, LarkDotLark]: + _make_tests(parser) if __name__ == '__main__': main()
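
The `_make_tests` factory added in this patch builds one `TestCase` subclass per parser entry point, renames it, and publishes it through `globals()` so that unittest discovery collects a separate class for each parser instead of overwriting a single name. A minimal standalone sketch of the same pattern, for reference (the `ParserA`/`ParserB` stand-ins are hypothetical and independent of lark):

```python
import unittest

# Toy stand-ins for the two parser entry points exercised by the real suite.
def ParserA(text):
    return text

def ParserB(text):
    return text

def _make_tests(parser):
    # Build a fresh TestCase subclass that closes over `parser`.
    class _TestGrammar(unittest.TestCase):
        def test_roundtrip(self):
            self.assertEqual(parser("abc"), "abc")

    name = "TestGrammar" + parser.__name__
    _TestGrammar.__name__ = name
    _TestGrammar.__qualname__ = name
    # Publishing under a unique module-level name lets unittest discovery
    # find one generated class per parser.
    globals()[name] = _TestGrammar

for parser in [ParserA, ParserB]:
    _make_tests(parser)
```

Because each generated class closes over its own `parser` argument, every test body must call `parser`, never a specific parser class directly; a stray hard-coded call would silently test the same implementation twice.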