|
316 | 316 | "cell_type": "markdown",
|
317 | 317 | "metadata": {},
|
318 | 318 | "source": [
|
319 |
| - "# Composition" |
| 319 | + "# Maintainable expressions" |
320 | 320 | ]
|
321 | 321 | },
|
322 | 322 | {
|
323 | 323 | "cell_type": "markdown",
|
324 | 324 | "metadata": {},
|
325 | 325 | "source": [
|
326 | 326 | "Sophisticated regular expressions tend to be very hard to read. There are a couple of things you can do to mitigate that issue.\n",
|
327 |
| - "* Use `re.VERBOSE` so that you can add whitespace and comments to the regular expression defintions.\n", |
328 |
| - "* Use composition, i.e., define regular expressions that describe part of the match, and compose those t match the entire expression." |
| 327 | + "* Use composition, i.e., define regular expressions that describe part of the match, and compose those to match the entire expression.\n",
| 328 | + "* Use named captures.\n", |
| 329 | + "* Use `re.VERBOSE` so that you can add whitespace and comments to the regular expression definitions."
329 | 330 | ]
|
330 | 331 | },
|
331 | 332 | {
|
|
457 | 458 | "cell_type": "markdown",
|
458 | 459 | "metadata": {},
|
459 | 460 | "source": [
|
460 |
| - "Although the final regular expression is still rather long, it is easier to read and to maintain. Using `re.VERBOSE` and triple-quoted strings helps to further make the regular expression more maintainable." |
| 461 | + "To avoid a long and tedious argument list, it is more convenient to store the subexpressions in a dictionary." |
461 | 462 | ]
|
462 | 463 | },
|
463 | 464 | {
|
464 | 465 | "cell_type": "code",
|
465 | 466 | "execution_count": 15,
|
466 | 467 | "metadata": {},
|
| 468 | + "outputs": [], |
| 469 | + "source": [ |
| 470 | + "regex_parts = {\n", |
| 471 | + " 'date': r'\\d{4}-\\d{2}-\\d{2}',\n", |
| 472 | + " 'time': r'\\d{2}:\\d{2}:\\d{2}\\.\\d+',\n", |
| 473 | + "}" |
| 474 | + ] |
| 475 | + }, |
| 476 | + { |
| 477 | + "cell_type": "markdown", |
| 478 | + "metadata": {}, |
| 479 | + "source": [ |
| 480 | + "This can be further improved by using named capture groups." |
| 481 | + ] |
| 482 | + }, |
| 483 | + { |
| 484 | + "cell_type": "code", |
| 485 | + "execution_count": 16, |
| 486 | + "metadata": {}, |
| 487 | + "outputs": [], |
| 488 | + "source": [ |
| 489 | + "regex_parts['datetime'] = r'(?P<datetime>{date}\\s+{time})'.format(**regex_parts)" |
| 490 | + ] |
| 491 | + }, |
| 492 | + { |
| 493 | + "cell_type": "markdown", |
| 494 | + "metadata": {}, |
| 495 | + "source": [ |
| 496 | + "Now the match can be retrieved by name rather than by index, which makes the code less error-prone and more robust to change." |
| 497 | + ] |
| 498 | + }, |
| 499 | + { |
| 500 | + "cell_type": "code", |
| 501 | + "execution_count": 17, |
| 502 | + "metadata": {}, |
| 503 | + "outputs": [ |
| 504 | + { |
| 505 | + "data": { |
| 506 | + "text/plain": [ |
| 507 | + "'2021-08-25 17:04:23.439405'" |
| 508 | + ] |
| 509 | + }, |
| 510 | + "execution_count": 17, |
| 511 | + "metadata": {}, |
| 512 | + "output_type": "execute_result" |
| 513 | + } |
| 514 | + ], |
| 515 | + "source": [ |
| 516 | + "match = re.match(regex_parts['datetime'], log_entry)\n", |
| 517 | + "match.group('datetime')" |
| 518 | + ] |
| 519 | + }, |
| 520 | + { |
| 521 | + "cell_type": "code", |
| 522 | + "execution_count": 18, |
| 523 | + "metadata": {}, |
| 524 | + "outputs": [], |
| 525 | + "source": [ |
| 526 | + "regex_parts['log_level'] = r'\\[(?P<log_level>\\w+)\\]'\n", |
| 527 | + "regex_parts['log_msg'] = r'end\\s+process\\s+(?P<process_id>\\d+)\\s+exited\\s+with\\s+(?P<exit_status>\\d+)'" |
| 528 | + ] |
| 529 | + }, |
| 530 | + { |
| 531 | + "cell_type": "markdown", |
| 532 | + "metadata": {}, |
| 533 | + "source": [ |
| 534 | + "Although the final regular expression is still rather long, it is easier to read and to maintain. Using `re.VERBOSE` with triple-quoted strings makes it more maintainable still." |
| 535 | + ] |
| 536 | + }, |
| 537 | + { |
| 538 | + "cell_type": "code", |
| 539 | + "execution_count": 19, |
| 540 | + "metadata": {}, |
467 | 541 | "outputs": [
|
468 | 542 | {
|
469 | 543 | "name": "stdout",
|
|
478 | 552 | ],
|
479 | 553 | "source": [
|
480 | 554 | "regex = re.compile(r'''\n",
|
481 |
| - " ({date}\\s+{time})\\s+ # date-time, up to microsecond presision\n", |
482 |
| - " {level}\\s*:\\s* # log level of the log message\n", |
483 |
| - " {msg} # actual log message\n", |
484 |
| - " '''.format(date=date, time=time, level=level, msg=msg), re.VERBOSE)\n", |
| 555 | + " {datetime}\\s+ # date-time, up to microsecond precision\n", |
| 556 | + " {log_level}\\s*:\\s* # log level of the log message\n", |
| 557 | + " {log_msg} # actual log message\n", |
| 558 | + " '''.format(**regex_parts), re.VERBOSE)\n", |
485 | 559 | "match = regex.match(log_entry)\n",
|
486 |
| - "print(f'datetime = {match.group(1)}')\n", |
487 |
| - "print(f'log level: {match.group(2)}')\n", |
488 |
| - "print(f'process = {match.group(3)}')\n", |
489 |
| - "print(f'exit status = {match.group(4)}')" |
| 560 | + "print(f\"datetime = {match.group('datetime')}\")\n", |
| 561 | + "print(f\"log level: {match.group('log_level')}\")\n", |
| 562 | + "print(f\"process = {match.group('process_id')}\")\n", |
| 563 | + "print(f\"exit status = {match.group('exit_status')}\")" |
| 564 | + ] |
| 565 | + }, |
| 566 | + { |
| 567 | + "cell_type": "markdown", |
| 568 | + "metadata": {}, |
| 569 | + "source": [ |
| 570 | + "**Note:** up to Python 3.9 (and perhaps later versions), f-string expressions cannot contain backslashes, hence the use of the `format` method for string substitution." |
490 | 571 | ]
|
491 | 572 | }
|
492 | 573 | ],
|
|