Commit d082e20

Add named captures
1 parent 63f5ff3 commit d082e20

File tree: 1 file changed (+93 / -12 lines)

source-code/regexes/regexes.ipynb

Lines changed: 93 additions & 12 deletions
@@ -316,16 +316,17 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Composition"
+"# Maintainable expressions"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "Sophisticated regular expressions tend to be very hard to read. There are a couple of things you can do to mitigate that issue.\n",
-"* Use `re.VERBOSE` so that you can add whitespace and comments to the regular expression definitions.\n",
-"* Use composition, i.e., define regular expressions that describe part of the match, and compose those to match the entire expression."
+"* Use composition, i.e., define regular expressions that describe part of the match, and compose those to match the entire expression.\n",
+"* Use named captures.\n",
+"* Use `re.VERBOSE` so that you can add whitespace and comments to the regular expression definitions."
 ]
 },
 {
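The renamed section now lists named captures alongside composition and `re.VERBOSE`. As a minimal illustration of the feature this commit introduces (not taken from the notebook; the pattern and sample string here are made up), a group written as `(?P<name>...)` can be read back by name:

```python
import re

# Minimal named-capture sketch; pattern and input are illustrative only.
match = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', '2021-08-25')
print(match.group('year'))  # '2021'
print(match.groupdict())    # {'year': '2021', 'month': '08', 'day': '25'}
```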
@@ -457,13 +458,86 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Although the final regular expression is still rather long, it is easier to read and to maintain. Using `re.VERBOSE` and triple-quoted strings further helps to make the regular expression more maintainable."
+"To avoid a long and tedious argument list, it is more convenient to store the subexpressions in a dictionary."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 15,
 "metadata": {},
+"outputs": [],
+"source": [
+"regex_parts = {\n",
+" 'date': r'\\d{4}-\\d{2}-\\d{2}',\n",
+" 'time': r'\\d{2}:\\d{2}:\\d{2}\\.\\d+',\n",
+"}"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Overall, this can be further improved by using named capture groups."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 16,
+"metadata": {},
+"outputs": [],
+"source": [
+"regex_parts['datetime'] = r'(?P<datetime>{date}\\s+{time})'.format(**regex_parts)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Now the match can be retrieved by name rather than by index, which makes the code less error-prone and more robust to change."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 17,
+"metadata": {},
+"outputs": [
+{
+"data": {
+"text/plain": [
+"'2021-08-25 17:04:23.439405'"
+]
+},
+"execution_count": 17,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"match = re.match(regex_parts['datetime'], log_entry)\n",
+"match.group('datetime')"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 18,
+"metadata": {},
+"outputs": [],
+"source": [
+"regex_parts['log_level'] = r'\\[(?P<log_level>\\w+)\\]'\n",
+"regex_parts['log_msg'] = r'end\\s+process\\s+(?P<process_id>\\d+)\\s+exited\\s+with\\s+(?P<exit_status>\\d+)'"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Although the final regular expression is still rather long, it is easier to read and to maintain. Using `re.VERBOSE` and triple-quoted strings further helps to make the regular expression more maintainable."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 19,
+"metadata": {},
 "outputs": [
 {
 "name": "stdout",
@@ -478,15 +552,22 @@
 ],
 "source": [
 "regex = re.compile(r'''\n",
-" ({date}\\s+{time})\\s+ # date-time, up to microsecond precision\n",
-" {level}\\s*:\\s* # log level of the log message\n",
-" {msg} # actual log message\n",
-" '''.format(date=date, time=time, level=level, msg=msg), re.VERBOSE)\n",
+" {datetime}\\s+ # date-time, up to microsecond precision\n",
+" {log_level}\\s*:\\s* # log level of the log message\n",
+" {log_msg} # actual log message\n",
+" '''.format(**regex_parts), re.VERBOSE)\n",
 "match = regex.match(log_entry)\n",
-"print(f'datetime = {match.group(1)}')\n",
-"print(f'log level: {match.group(2)}')\n",
-"print(f'process = {match.group(3)}')\n",
-"print(f'exit status = {match.group(4)}')"
+"print(f\"datetime = {match.group('datetime')}\")\n",
+"print(f\"log level: {match.group('log_level')}\")\n",
+"print(f\"process = {match.group('process_id')}\")\n",
+"print(f\"exit status = {match.group('exit_status')}\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"**Note:** up to Python 3.9 (and perhaps later versions), f-strings cannot contain backslashes, hence the use of the `format` method for string substitution."
 ]
 }
 ],
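Taken together, the final cell composes all the named parts into a single `re.VERBOSE` pattern and reads each field back by name. A self-contained sketch of that end result, again using a hypothetical `log_entry` since the diff does not show the notebook's own, might look like this:

```python
import re

# Hypothetical log entry matching the format the pattern expects.
log_entry = '2021-08-25 17:04:23.439405 [ERROR]: end process 42 exited with 1'

regex_parts = {
    'date': r'\d{4}-\d{2}-\d{2}',
    'time': r'\d{2}:\d{2}:\d{2}\.\d+',
}
regex_parts['datetime'] = r'(?P<datetime>{date}\s+{time})'.format(**regex_parts)
regex_parts['log_level'] = r'\[(?P<log_level>\w+)\]'
regex_parts['log_msg'] = (r'end\s+process\s+(?P<process_id>\d+)'
                          r'\s+exited\s+with\s+(?P<exit_status>\d+)')

# re.VERBOSE allows whitespace and comments inside the pattern; str.format is
# used for substitution, following the notebook's note on f-strings and backslashes.
regex = re.compile(r'''
    {datetime}\s+       # date-time, up to microsecond precision
    {log_level}\s*:\s*  # log level of the log message
    {log_msg}           # actual log message
    '''.format(**regex_parts), re.VERBOSE)

match = regex.match(log_entry)
print(f"datetime = {match.group('datetime')}")
print(f"log level: {match.group('log_level')}")
print(f"process = {match.group('process_id')}")
print(f"exit status = {match.group('exit_status')}")
```

Because every group is named, adding or reordering parts of the pattern does not break the retrieval code, which is exactly the robustness the changed markdown cells point out.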
