Tutorial
Here's an oprex to match CSS color (hashtag-syntax only, e.g. #ff0000). It compiles to #(?:[\dA-Fa-f]{6}|[\dA-Fa-f]{3}).
/hash/hexes/
    hash = '#'
    hexes = <<|
              |6 of hexdigit
              |3 of hexdigit

        hexdigit: digit A..F a..f
The first line:
/hash/hexes/
specifies that we want a regex that matches hash-then-hexes. We then define what hash and hexes are, using an indented sub-block:
--
hash = '#'
This defines hash as a string literal. It should match # literally.
--
hexes = <<|
This defines hexes as an alternation. The <<| starts an alternation.
--
|6 of hexdigit
This is the first alternative in our alternation: match hexdigit six times.
The of keyword is oprex's operator for doing quantification/repetition.
All | in an alternation must vertically align.
--
|3 of hexdigit
This is the second alternative in our alternation: match hexdigit three times. For matching e.g. #f00
--
Then we have a blank line.
In oprex, a blank line means end-of-alternation/lookaround, so here we close our alternation block.
--
The definition of hexes above refers to something called hexdigit, so now we need to define it. Again, to define something we use an indented sub-block, this time with deeper indentation since the definition of hexes is already indented:
hexdigit: digit A..F a..f
This defines hexdigit as a character-class. Character classes are defined using colon : then, after the colon, we list the character-class' members, separated by space:
- digit is a built-in character-class (it compiles to regex's \d). Character classes can include other character classes.
- A..F and a..f are character ranges.
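As a quick sanity check, the compiled regex quoted at the top of this example can be exercised with Python's built-in re module. This is only a minimal sketch: the pattern is copied from the text above, while the helper name is_css_hex_color and the use of fullmatch (to reject trailing characters) are choices made for this illustration, not part of oprex.

```python
import re

# Regex copied from the tutorial text: the stated output of the CSS-color oprex.
CSS_COLOR = re.compile(r"#(?:[\dA-Fa-f]{6}|[\dA-Fa-f]{3})")


def is_css_hex_color(text):
    """Return True if the whole string is '#' followed by 6 or 3 hex digits."""
    return CSS_COLOR.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["#ff0000", "#f00", "#ff00", "ff0000", "#ggg"]:
        print(sample, is_css_hex_color(sample))
```

Four hex digits (#ff00) are rejected because neither alternative, six digits or three digits, can consume the whole string.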
--
The following oprex matches stuff-inside-brackets, where the brackets can be one of (), {}, <> or [].
- Sample matches: <html>, {x < 0}, [1] and (see footnote [1])
- Will not match e.g. f(g(x))
- Will only partially match {{citation needed}}, as {{citation needed}
- To also match those, we can combine this example with the balanced parentheses example. But to keep it short and clear, we'll go with the above restrictions.
The output regex will be:
(?>(?P<paren>\()|(?P<curly>\{)|(?P<angle><)|(?P<square>\[))
(?:[^(){}<>\[\]]|(?s:.)(?<!(?>(?(paren)\)|(?(curly)\}|(?(angle)>|(?(square)\]|(?!))))))))*+
(?>(?(paren)\)|(?(curly)\}|(?(angle)>|(?(square)\]|(?!))))))
Here's the oprex:
-- Line 0, things after "--" are comments
/open/contents?/close/                          -- Line 1
    open = @|                                   -- 2
            |paren                              -- 3
            |curly                              -- 4
            |angle                              -- 5
            |square                             -- 6
                                                -- 7
        [paren]: (                              -- 8
        [curly]: {                              -- 9
        [angle]: <                              -- 10
        [square]: [                             -- 11
                                                -- 12
    close = @|                                  -- 13
             |[paren] ? ')'                     -- 14
             |[curly] ? '}'                     -- 15
             |[angle] ? '>'                     -- 16
             |[square] ? ']'                    -- 17
             |FAIL!                             -- 18
                                                -- 19
    contents = @1.. of <<|                      -- 20
                         |not: ( ) { } < > [ ]  -- 21
                         |not_close             -- 22
                                                -- 23
        not_close = <@>                         -- 24
                 |any|                          -- 25
          <!close|                              -- 26
                                                -- 27
As you can see, comments in oprex start with -- (a la SQL), not with # like in Python. The reason for that is demonstrated by Line 21. Also, while the first and last lines must be blank, comments-only lines count as blank.
--
Line 1: contents? means that contents is optional. The ? behaves just like ? in regex.
Line 2: @| starts an alternation, just like <<| before (see the previous CSS-color example). The difference is @ means atomic while << means allow backtracking. So, alternations started using @| will be atomic while ones started using <<| may backtrack. (If you don't know about regex atomic vs. backtracking, here's an excellent reference).
Line 7: Blank line (comments-only lines are treated as blank). It marks the end-of-alternation.
--
Lines 8-11: [paren] [curly] [angle] [square] define named capture groups. (Capture groups are a regex basic, so we will not cover them in an oprex tutorial.) They need to be defined as capture groups because we refer back to them later, on lines 14-17.
--
close = @|                -- 13
         |[paren] ? ')'   -- 14
         |[curly] ? '}'   -- 15
         |[angle] ? '>'   -- 16
         |[square] ? ']'  -- 17
         |FAIL!           -- 18
Line 13: @| start-of-alternation, atomic.
Lines 14-17 are conditionals, e.g. [paren] ? ')' means: match a literal ) only if capture-group paren is defined. In the definition of open (lines 2-7), only one of paren, curly, angle, square will be defined because they are in an alternation. So close must match ) if open is (, } if open is {, etc.
Line 18: FAIL! compiles to regex (?!). It always fails. So if none of the conditions is met, the match should just fail.
Line 19: Blank line, end-of-alternation.
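The conditional-plus-FAIL! idea can be tried in isolation with Python's built-in re module, which supports (?(name)yes|no) conditionals and the always-failing (?!). The pattern below is a hand-written, simplified stand-in limited to two bracket kinds, without the atomic groups or the lookbehind of the real output; it is not what oprex generates, and the names BRACKETED and matches_bracketed are hypothetical.

```python
import re

# Simplified sketch (hand-written, NOT oprex output):
#   open  -> captures "(" as group "paren" or "{" as group "curly"
#   close -> ")" if "paren" was captured, "}" if "curly" was captured,
#            otherwise (?!) which always fails (the role of FAIL!)
BRACKETED = re.compile(
    r"(?:(?P<paren>\()|(?P<curly>\{))"   # open
    r"[^(){}]*"                          # contents: no bracket characters
    r"(?(paren)\)|(?(curly)\}|(?!)))"    # close: depends on which group is set
)


def matches_bracketed(text):
    """Return True if text is one opener, bracket-free contents, matching closer."""
    return BRACKETED.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["(abc)", "{abc}", "(abc}", "{abc)"]:
        print(sample, matches_bracketed(sample))
```

Because only one of the two capture groups can be set by the opening alternation, a mismatched closer such as (abc} is rejected.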
--
Line 20: @1.. of <<|
- of is the quantification operator.
- @1.. is the quantifier. Again, @ means atomic, i.e. make it a possessive quantifier. 1.. means one-or-more. @1.. compiles to regex ++.
- Things after of are what-should-be-repeated. So here we are quantifying an alternation.
- <<| starts an alternation (one that allows backtracking, as previously explained).
Line 21: not: ( ) { } < > [ ]
- not: is the oprex operator for a negated character-class.
- ( ) { } < > [ ] are the character-class members, separated by space.
- Line 21 also demonstrates why comments are started with --: a # there would be interpreted as a character-class member.
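For reference, Line 21 corresponds to the negated class [^(){}<>\[\]] in the output regex shown earlier. A small sketch with Python's re module shows what that class accepts; the helper name is_non_bracket is made up for this illustration.

```python
import re

# Negated character class copied from the output regex above:
# any single character that is not one of ( ) { } < > [ ]
NON_BRACKET = re.compile(r"[^(){}<>\[\]]")


def is_non_bracket(ch):
    """Return True if ch is exactly one character and not a bracket."""
    return NON_BRACKET.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in ["x", "#", " ", "(", "]"]:
        print(repr(ch), is_non_bracket(ch))
```

Note that # is an ordinary character as far as the class is concerned, which is why it cannot double as a comment marker on such a line.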
Line 23: Blank line, end-of-alternation.
--
Lines 24-26:
not_close = <@>  -- 24
         |any|   -- 25
  <!close|       -- 26
- <@> starts a lookaround block. The @ reminds you that lookarounds are atomic.
- any is an oprex built-in variable. It matches any character, like regex's . with DOTALL turned on.
- <!close| is a negative lookbehind. < means lookbehind. ! negates. close is already defined on Line 13.
- Like in alternation, all corresponding | in a lookaround must vertically align.
Line 27: Blank line, end-of-lookaround. Like an alternation, a lookaround block is closed with a blank line.
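To see the "any character, then check it was not a closer" idea on its own, here is a sketch with the closer fixed to ). In the real output the lookbehind contains the whole conditional close expression, which Python's built-in re module would reject as a non-fixed-width lookbehind; the pattern below is therefore a hand-written simplification, and the names NOT_CLOSE_PAREN and chars_not_close are hypothetical.

```python
import re

# Hand-written stand-in for not_close when the closer is known to be ")":
#   (?s:.)   any character, with DOTALL on (so newlines count too)
#   (?<!\))  negative lookbehind: the character just consumed was not ")"
NOT_CLOSE_PAREN = re.compile(r"(?s:.)(?<!\))")


def chars_not_close(text):
    """Return every character of text that is not the closer ')'."""
    return NOT_CLOSE_PAREN.findall(text)


if __name__ == "__main__":
    print(chars_not_close("a)\nb"))
```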
The Tutorial should explain most of the syntax used in Examples. The rest are covered here:
The IPv4 Address, Date, and Time examples use the String-of-Digits Range Literal feature to match number-strings between two specified values (rejecting both non-numeric strings and numbers whose value is outside the range):
- byte = '0'..'255' (leading-zero not allowed, e.g. won't match 007)
  hh = '1'..'12'
- mm = '01'..'12' (leading-zero mandatory for single-digits, e.g. will match 02 but not 2)
  dd = '01'..'31'
  mm = ss = '00'..'59'
- HH = 'o0'..'23' (the o means optional leading-zero, e.g. will match both 2 and 02)
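The three leading-zero conventions can be made concrete with plain regexes. The patterns below are hand-written to express the same constraints as four of the range literals above; they are an assumption for illustration and are not claimed to be what oprex emits. The names RANGES and in_range are hypothetical.

```python
import re

# Hand-written equivalents of the range literals (NOT oprex output).
RANGES = {
    # '0'..'255': no leading zeros allowed
    "byte": r"25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]",
    # '1'..'12': no leading zeros allowed
    "hh": r"1[0-2]|[1-9]",
    # '01'..'12': leading zero mandatory for single digits (month)
    "mm": r"0[1-9]|1[0-2]",
    # 'o0'..'23': leading zero optional
    "HH": r"2[0-3]|[01]?[0-9]",
}


def in_range(name, text):
    """Return True if the whole string is a number-string allowed by the named range."""
    return re.fullmatch(RANGES[name], text) is not None


if __name__ == "__main__":
    print(in_range("byte", "255"), in_range("byte", "256"), in_range("byte", "007"))
    print(in_range("mm", "02"), in_range("mm", "2"))
    print(in_range("HH", "2"), in_range("HH", "02"), in_range("HH", "24"))
```

Writing these alternations by hand is error-prone, which is the motivation for having a range literal in the first place.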
In the Date example:
/yyyy/separator/mm/=separator/dd/
And the Quoted String example:
/opening_quote/contents/=opening_quote/
The =separator and =opening_quote parts are Backreferences.
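In regex terms, a backreference requires the same text to appear again. The sketch below uses Python's re module with a named group and (?P=separator). The pattern is a simplified, hand-written stand-in rather than the Date example's actual output, and it assumes the separator is either - or /, which the text above does not specify.

```python
import re

# Simplified date pattern (hand-written, NOT oprex output).
# Assumption: separator is "-" or "/"; year/month/day are plain digit runs.
DATE = re.compile(
    r"(?P<yyyy>[0-9]{4})"
    r"(?P<separator>[-/])"   # first separator is captured...
    r"(?P<mm>[0-9]{2})"
    r"(?P=separator)"        # ...and must be repeated exactly here
    r"(?P<dd>[0-9]{2})"
)


def has_consistent_separators(text):
    """Return True if text looks like yyyy?mm?dd with the same separator twice."""
    return DATE.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["2020-01-31", "2020/01/31", "2020-01/31"]:
        print(sample, has_consistent_separators(sample))
```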
(ignorecase) in Time and (unicode) in Password Checks are examples of Flags usage.
The __ in BEGIN-something-END example:
/begin/__/end/
And in Password Checks:
has_number = /__?/digit/
has_min_2_symbols = 2 of /__?/non-alnum/
Are sample uses of the Match-Until Operator.
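One common way to express "match anything until X" in plain regex is a lazy quantifier. The sketch below approximates the snippets above that way in Python. Whether oprex emits this exact form is not stated here, so treat the mapping as an assumption; the literals BEGIN and END and all function names are hypothetical.

```python
import re

# Assumption: "anything until X" is approximated with a lazy quantifier,
# with DOTALL ((?s)) so that "anything" may include newlines.
HAS_NUMBER = re.compile(r"(?s).*?[0-9]")                        # /__?/digit/
HAS_MIN_2_SYMBOLS = re.compile(r"(?s)(?:.*?[^0-9A-Za-z]){2}")   # 2 of /__?/non-alnum/
BEGIN_END = re.compile(r"(?s)BEGIN(.+?)END")                    # /begin/__/end/


def has_number(password):
    """True if some digit can be reached from the start of the password."""
    return HAS_NUMBER.match(password) is not None


def has_min_2_symbols(password):
    """True if at least two non-alphanumeric characters occur."""
    return HAS_MIN_2_SYMBOLS.match(password) is not None


def between_markers(text):
    """Return the text between BEGIN and the nearest following END, or None."""
    found = BEGIN_END.search(text)
    return found.group(1) if found else None


if __name__ == "__main__":
    print(has_number("abc1"), has_min_2_symbols("a!b?"))
    print(repr(between_markers("BEGIN a END b END")))
```

The lazy quantifier stops at the first END rather than the last, which is the "until" behaviour.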
In Comma-Separated Values:
more_values = /comma/value/more_values?/
Here more_values refers to itself. This is an example of Recursion. Other examples can be seen in E-mail Address:
subdomain = /hostname/dot/subdomain?/
Balanced Parentheses:
balanced_parens = /open/contents?/close/
    open: (
    close: )
    contents = @1.. of <<|
                         |non_parens
                         |balanced_parens
And Palindrome:
palindrome = <<|
               |/letter/palindrome/=letter/
               |/letter/=letter/
               |letter
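Python's built-in re module has no recursive patterns, but the structure of the Palindrome definition can be mirrored with an ordinary recursive function. This is a sketch of the grammar's logic rather than of oprex itself, and it assumes letter means a single alphabetic character; the function name is_palindrome is made up for the illustration.

```python
def is_palindrome(text):
    """Recursive check mirroring the three alternatives of the Palindrome oprex.

    Assumption: `letter` is a single alphabetic character.
    """
    if not text or not text[0].isalpha():
        return False                      # every alternative starts with a letter
    if len(text) == 1:
        return True                       # |letter
    if text[0] != text[-1]:
        return False                      # =letter must repeat the first letter
    if len(text) == 2:
        return True                       # |/letter/=letter/
    return is_palindrome(text[1:-1])      # |/letter/palindrome/=letter/


if __name__ == "__main__":
    for sample in ["racecar", "abba", "a", "ab", "abca"]:
        print(sample, is_palindrome(sample))
```

The last return is the self-reference: the middle of the string must itself be a palindrome, just as palindrome appears inside its own first alternative.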
In Quoted String:
*) quote: ' "
*) comma: ,
The *) in the definitions of quote and comma marks them as Global Variables, which makes them accessible in later, different scopes.
In Comma-Separated Values, Balanced Parentheses, and Palindrome:
//value/more_values?//
/non_parens?/balanced_parens/non_parens?/.
./palindrome/.
./, /., and // are Anchors.
- IPv4 Address
- BEGIN-something-END
- Date
- Time
- Blood Type
- Quoted String
- Comma-Separated Values
- Password Checks
- Balanced Parentheses
- Number-string Range Literal
- Backreference
- Flags
- Match-anything-until
- Recursion
- Global Variables
- Anchors
- Built-in Character Classes
- Built-in Expressions
- Special Built-ins: WOB, wordchar, non-linechar