Original:
Named captures would be awesome. Here's my idea: add a third member to the captures struct in MatchState, which would be a fixed array of char. start_capture would then read the upcoming characters, and if a named capture is detected, obtain the name and insert it into the array, followed by '\0'. (We can't just store a pointer in the MatchState, since then the array would expire when start_capture returns, or worse, when the inner if block ends. I'd rather not get into malloc stuff here.) Then it would proceed with the match as normal. In order to allow position captures to be named, the check for a position capture would be moved out of match and into start_capture.
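To make the first idea concrete, here is a rough sketch of the extended struct. It is a trimmed-down stand-in for the real MatchState, not the actual lstrlib.c definition; MAXNAMELEN and set_capture_name are hypothetical names I made up for illustration, and 32 is assumed as the capture limit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LUA_MAXCAPTURES 32   /* assumed capture limit */
#define MAXNAMELEN 15        /* hypothetical cap on name length */

/* Simplified MatchState: only the fields relevant to captures. */
typedef struct MatchState {
  const char *src_init;
  const char *src_end;
  int level;                    /* number of captures opened so far */
  struct {
    const char *init;           /* start of the captured text */
    ptrdiff_t len;              /* length of the captured text */
    char name[MAXNAMELEN + 1];  /* proposed third member; "" = unnamed */
  } capture[LUA_MAXCAPTURES];
} MatchState;

/* Copy `len` bytes of `name` into capture slot `l` and terminate with '\0'.
   Returns 1 on success, 0 if the name does not fit. */
static int set_capture_name(MatchState *ms, int l, const char *name, size_t len) {
  assert(l >= 0 && l < LUA_MAXCAPTURES);
  if (len > MAXNAMELEN) return 0;
  memcpy(ms->capture[l].name, name, len);
  ms->capture[l].name[len] = '\0';
  return 1;
}
```

Because the name is copied into storage owned by the MatchState, it stays valid after start_capture returns, which is the point of avoiding a bare pointer.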
Alternatively, a separate array can be added to MatchState, mapping group names to group numbers. This would simplify backreferences, but complicate result table building, and would also require either a NULL sentinel or an extra member tracking the number of named groups. I think this is better, though.
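A minimal sketch of this alternative, using the "extra member" variant rather than a NULL sentinel. NameMap, NamedGroup, namemap_find and namemap_add are all hypothetical names, and the limits are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LUA_MAXCAPTURES 32   /* assumed capture limit */
#define MAXNAMELEN 15        /* hypothetical cap on name length */

typedef struct NamedGroup {
  char name[MAXNAMELEN + 1];
  int index;                 /* capture number this name refers to */
} NamedGroup;

typedef struct NameMap {
  int count;                 /* number of named groups registered so far */
  NamedGroup entry[LUA_MAXCAPTURES];
} NameMap;

/* Return the capture number registered for `name`, or -1 if unknown. */
static int namemap_find(const NameMap *m, const char *name, size_t len) {
  int i;
  for (i = 0; i < m->count; i++) {
    const char *n = m->entry[i].name;
    if (strlen(n) == len && memcmp(n, name, len) == 0)
      return m->entry[i].index;
  }
  return -1;
}

/* Register `name` for capture `index`.
   Returns 1 on success, 0 if the name is empty, too long, a duplicate,
   or the map is full. */
static int namemap_add(NameMap *m, const char *name, size_t len, int index) {
  NamedGroup *g;
  assert(index >= 0 && index < LUA_MAXCAPTURES);
  if (len == 0 || len > MAXNAMELEN || m->count >= LUA_MAXCAPTURES) return 0;
  if (namemap_find(m, name, len) >= 0) return 0;  /* duplicate name */
  g = &m->entry[m->count++];
  memcpy(g->name, name, len);
  g->name[len] = '\0';
  g->index = index;
  return 1;
}
```

A backreference then only needs one namemap_find call, while the result table builder has to walk the map to attach names to numbered captures.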
Table matching functions would include named captures in the table (obviously), but it's a bit tricky, because Lua (unlike Python) does not distinguish between indexing and attribute access, and some special fields are already defined. Thus, a new table, groups, will be added; it will hold all captures, both numbered and named, in place of the main table. Next, to avoid having it shadowed by a capture named "expand", the expand method will be placed directly in the table. Finally, the table's __index metamethod will point to the groups subtable, thereby making all groups accessible by direct indexing, except for those that share the name of a special field. (Note that this will require a separate metatable for each result table, instead of a single metatable stored in the registry that all result tables share. expand will also have to be modified to pull from the groups subtable.)
startpos and endpos would also carry the named fields, but fortunately they do not suffer from this problem. The documentation will officially recommend accessing named captures through groups rather than directly, especially when the name is unknown (e.g. user input). Using the groups subtable will also make pairs iteration over captures easier.
The syntax would be the same as in most regex flavors: (?<name>...). Omitting the subpattern (...) would turn it into a position capture. Backreferences within the pattern would use the PCRE syntax (but with Lua's escape character) %k<name>, while group references in a gsub replacement pattern would use simply %<name>. As a bonus, these backreference syntaxes can be overloaded to allow referencing numbered captures 10 or higher.
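Here's a sketch of how the parsing side might look. parse_group_name, ref_number and the REF_* constants are hypothetical helpers, not existing lstrlib.c functions. The first reads the `?<name>` header right after an opening parenthesis; the second decides whether the text inside `%k<...>` or `%<...>` is a number or a name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define LUA_MAXCAPTURES 32   /* assumed capture limit */
#define REF_NAME (-1)        /* reference is a name; look it up separately */
#define REF_BAD  (-2)        /* malformed reference */

/* `p` points just past '('. If the pattern continues with "?<name>",
   store the name bounds and return a pointer past '>'.
   Returns NULL for an ordinary capture or a malformed header. */
static const char *parse_group_name(const char *p, const char *p_end,
                                    const char **name, size_t *len) {
  const char *s;
  assert(p <= p_end);
  if (p_end - p < 3 || p[0] != '?' || p[1] != '<') return NULL;
  s = p + 2;
  p = s;
  while (p < p_end && *p != '>') p++;
  if (p == p_end || p == s) return NULL;  /* unterminated or empty name */
  *name = s;
  *len = (size_t)(p - s);
  return p + 1;  /* if this points at ')', it is a named position capture */
}

/* Interpret the text between '<' and '>' in a reference.
   All digits: return that number. Otherwise REF_NAME, or REF_BAD when
   the text is empty, mixes digits with other characters, or is too large. */
static int ref_number(const char *text, size_t len) {
  size_t i;
  int n = 0;
  if (len == 0) return REF_BAD;
  if (!isdigit((unsigned char)text[0])) return REF_NAME;
  for (i = 0; i < len; i++) {
    if (!isdigit((unsigned char)text[i])) return REF_BAD;
    n = n * 10 + (text[i] - '0');
    if (n > LUA_MAXCAPTURES) return REF_BAD;
  }
  return n;
}
```

Since a name can never start with a digit, a leading digit unambiguously means a numbered reference, which is what lets %k<12> and %<12> coexist with named references.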
Valid group names would be anything that is a valid variable name in Lua, but with a cap on length (maybe 15). Duplicate names would throw an error when encountered. Backreferences to non-existent groups would also throw an error.
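The name check itself could be as small as the sketch below. is_valid_group_name is a hypothetical helper and 15 is the tentative cap mentioned above. It only checks the character rules (letter or underscore first, then letters, digits, underscores) and does not reject Lua reserved words, which a full "valid variable name" test would also need to do.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAXNAMELEN 15   /* tentative cap on name length */

/* Return 1 if `name` (len bytes) looks like a Lua identifier of at most
   MAXNAMELEN characters, 0 otherwise. Reserved words are not checked. */
static int is_valid_group_name(const char *name, size_t len) {
  size_t i;
  assert(name != NULL);
  if (len == 0 || len > MAXNAMELEN) return 0;
  if (!(isalpha((unsigned char)name[0]) || name[0] == '_')) return 0;
  for (i = 1; i < len; i++) {
    if (!(isalnum((unsigned char)name[i]) || name[i] == '_')) return 0;
  }
  return 1;
}
```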
Backwards-compatibility: I do not consider this proposed syntax to be backward incompatible with PUC-Rio Lua. In PUC-Lua, (? will open a capture and then match ? literally, but this is undefined. The manual defines ? as a special character which always has special meaning; the fact that it matches literally when following another special sequence (e.g. try it after another quantifier) is thus undefined behavior, and I do not consider a change in behavior that was previously undefined to be backwards-incompatible. Likewise, %< in a PUC-Lua replacement string would throw an error for an invalid escape; thus, all replacement strings that previously worked will continue to work. Finally, %k is currently not a defined character class, and thus the fact that it matches itself is undefined. Therefore, named captures can be included in the basic functions, even though their only use will be backreferences (since the basic functions don't have a means to return names).
(I think this explanation is more long-winded than the implementation will be. :P)
EDIT: Here's a proper task list:
- MatchState
- start_capture() to parse the name
- match() to parse and handle named and named-style numbered backreferences
- build_result_table to check for and handle named captures
- add_s() to handle named and named-style numbered backreferences
- matchobj_expand() to handle named and named-style numbered backreferences