-
Notifications
You must be signed in to change notification settings - Fork 411
Description
[pyiceberg_diag.html](https://github.com/user-attachments/files/24314065/pyiceberg_diag.html)
Feature Request / Improvement
I'm happy to see that pyparsing is part of the pyiceberg project. I've attached two files that are railroad diagrams for this parser:
- pyiceberg_diag.html - browseable railroad diagram (SVG), non-terrminal elements are links to subdiagrams
- pyiceberg_parser.png - static PNG version of the same diagram
Overall I think the parser is well-structured, and makes good use of the infix_notation, one_of, and DelimitedList helpers.
My suggestions are largely cosmetic, but may add some parse-time performance improvements.
Recommendations to use set_name are either for better diagramming, better exception messages, or both.
set_results_name() vs. set_name()
I'd like to point out the distinction between set_name and set_results_name. set_name is most useful to describe the expression itself, in abstract. set_results_name should be used to mark expressions with some contextual name, so that their parsed value can be retrieved by that name. I usually think of set_name describes what the expression is, and set_results_name is how the expression is used.
For example, these expressions should probably use set_name, not set_results_name:
boolean = one_of(["true", "false"], caseless=True).set_results_name("boolean")
string = sgl_quoted_string.set_results_name("raw_quoted_string")
decimal = common.real().set_results_name("decimal")
integer = common.signed_integer().set_results_name("integer")
literal = Group(string | decimal | integer | boolean).set_results_name("literal")
literal_set = Group(
DelimitedList(string) | DelimitedList(decimal) | DelimitedList(integer) | DelimitedList(boolean)
).set_results_name("literal_set")You'll also find that exception messages are less cryptic with more use of set_name.
Word(nums).parse_string("NAN")
# raises ParseException: Expected W:(0-9), found 'NAN'
Word(nums).set_name("integer").parse_string("NAN")
# raises ParseException: Expected integer, found 'NAN'There are 11 calls to set_results_name.
Many of the set_results_name calls are probably better as set_name calls. I try to reserve set_results_name for expressions that need to be referenced in a parse action, or other post-parsing code. "op" is a good example of using set_results_name.
"column" is an excellent example of being an exception to this guideline. It is used in many places, but always as the subject of some larger expression. "column" should be defined using both set_results_name and set_name.
Lastly, you can keep your set_name and set_results_name calls to a minimum:
- insert a call to pyparsing.autoname_elements(), which will call set_name on all locally defined ParserElements that have not already been given a name (making this call before calling the
infix_notationfunction will use the names as part of that functions internal expression building) - use the
expr("results_name")short-cut forexpr.set_results_name("results_name")
Collapse IS/IS NOT calls
You can reduce the number of terms in your parser by merging IS and IS NOT expressions (fewer terms can translate into faster parsing). Here you define separate is_null and not_null expressions in building null_check:
is_null = column + IS + NULL
not_null = column + IS + NOT + NULL
null_check = is_null | not_nullI propose defining an IS_NOT expression, so that null_check can simplify down (and also reduce 2 parse action functions down to one):
IS_NOT = (IS + NOT).add_parse_action(lambda: "IS_NOT")
null_check = column + (IS_NOT | IS)("op") + NULL
@null_check.set_parse_action
def _(result: ParseResults) -> BooleanExpression:
expr_class = IsNull if result.op == "IS" else NotNull
return expr_class(result.column)Similar treatment can be given to IN/NOT IN and LIKE/NOT LIKE.
Creating the diagram
I wrote this little script to build the attached diagrams:
import pyiceberg.expressions.parser as ibp
ibp.boolean_expression.create_diagram("pyiceberg_parser_diag.html")Run this before making any changes, and you'll get a diagram that is difficult to navigate, probably even difficult to view! Make a few set_name calls (like adding set_name("column") to column) and regenerate the diagram and see how the structure becomes clearer. The non-terminals in the diagram are actually links to their corresponding sub-diagrams, so navigation in a large diagram is much easier. I've gone back and refactored a number of the pyparsing examples, after generating their diagrams.
Other notes
Not really a pyparsing note, but just offering this alternative style for your if-elif-else chains:
@left_ref.set_parse_action
def _(result: ParseResults) -> BooleanExpression:
op_classes = {
">": GreaterThan,
">=": GreaterThanOrEqual,
"<": LessThan,
"<=": LessThanOrEqual,
"=": EqualTo,
"==": EqualTo,
"!=": NotEqualTo,
"<>": NotEqualTo,
}
op_class = op_classes.get(result.op)
if op_class is not None:
return op_class(result.column, result.literal)
raise ValueError(f"Unsupported operation type: {result.op!r}")I encourage people at work, when echoing erroneous values in an exception, to add the "!r" format marker. This encloses the value in single quotes and expands any non-printing characters to hex notation. A big help for when the parser accidentally includes trailing space in the operator string, and so it doesn't match any of the expected patterns.
Finally
Again, your parser is fine as-is, these suggestions just may make it a bit easier to maintain, and even a bit faster.