Commit d082e20

Add named captures
1 parent 63f5ff3 commit d082e20

File tree: 1 file changed (+93 / -12 lines)

source-code/regexes/regexes.ipynb

Lines changed: 93 additions & 12 deletions
@@ -316,16 +316,17 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Composition"
+"# Maintainable expressions"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "Sophisticated regular expressions tend to be very hard to read. There are a couple of things you can do to mitigate that issue.\n",
-"* Use `re.VERBOSE` so that you can add whitespace and comments to the regular expression definitions.\n",
-"* Use composition, i.e., define regular expressions that describe part of the match, and compose those to match the entire expression."
+"* Use composition, i.e., define regular expressions that describe part of the match, and compose those to match the entire expression.\n",
+"* Use named captures.\n",
+"* Use `re.VERBOSE` so that you can add whitespace and comments to the regular expression definitions."
 ]
 },
 {
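The renamed section now lists named captures alongside composition and `re.VERBOSE`. As a minimal illustration of the feature this commit introduces (not taken from the notebook; the pattern and sample string here are made up), a group written as `(?P<name>...)` can be read back by name:

```python
import re

# Minimal named-capture sketch; pattern and input are illustrative only.
match = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', '2021-08-25')
print(match.group('year'))  # '2021'
print(match.groupdict())    # {'year': '2021', 'month': '08', 'day': '25'}
```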
@@ -457,13 +458,86 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Although the final regular expression is still rather long, it is easier to read and to maintain. Using `re.VERBOSE` and triple-quoted strings further helps to make the regular expression more maintainable."
+"To avoid a long and tedious argument list, it is more convenient to store the subexpressions in a dictionary."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 15,
 "metadata": {},
+"outputs": [],
+"source": [
+"regex_parts = {\n",
+" 'date': r'\\d{4}-\\d{2}-\\d{2}',\n",
+" 'time': r'\\d{2}:\\d{2}:\\d{2}\\.\\d+',\n",
+"}"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Overall, this can be further improved by using named capture groups."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 16,
+"metadata": {},
+"outputs": [],
+"source": [
+"regex_parts['datetime'] = r'(?P<datetime>{date}\\s+{time})'.format(**regex_parts)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Now the match can be retrieved by name rather than by index, which makes the code less error-prone and more robust to change."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 17,
+"metadata": {},
+"outputs": [
+{
+"data": {
+"text/plain": [
+"'2021-08-25 17:04:23.439405'"
+]
+},
+"execution_count": 17,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"match = re.match(regex_parts['datetime'], log_entry)\n",
+"match.group('datetime')"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 18,
+"metadata": {},
+"outputs": [],
+"source": [
+"regex_parts['log_level'] = r'\\[(?P<log_level>\\w+)\\]'\n",
+"regex_parts['log_msg'] = r'end\\s+process\\s+(?P<process_id>\\d+)\\s+exited\\s+with\\s+(?P<exit_status>\\d+)'"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Although the final regular expression is still rather long, it is easier to read and to maintain. Using `re.VERBOSE` and triple-quoted strings further helps to make the regular expression more maintainable."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 19,
+"metadata": {},
 "outputs": [
 {
 "name": "stdout",
@@ -478,15 +552,22 @@
 ],
 "source": [
 "regex = re.compile(r'''\n",
-" ({date}\\s+{time})\\s+ # date-time, up to microsecond precision\n",
-" {level}\\s*:\\s* # log level of the log message\n",
-" {msg} # actual log message\n",
-" '''.format(date=date, time=time, level=level, msg=msg), re.VERBOSE)\n",
+" {datetime}\\s+ # date-time, up to microsecond precision\n",
+" {log_level}\\s*:\\s* # log level of the log message\n",
+" {log_msg} # actual log message\n",
+" '''.format(**regex_parts), re.VERBOSE)\n",
 "match = regex.match(log_entry)\n",
-"print(f'datetime = {match.group(1)}')\n",
-"print(f'log level: {match.group(2)}')\n",
-"print(f'process = {match.group(3)}')\n",
-"print(f'exit status = {match.group(4)}')"
+"print(f\"datetime = {match.group('datetime')}\")\n",
+"print(f\"log level: {match.group('log_level')}\")\n",
+"print(f\"process = {match.group('process_id')}\")\n",
+"print(f\"exit status = {match.group('exit_status')}\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"**Note:** up to Python 3.9 (and perhaps later versions), f-strings cannot contain backslashes, hence the use of the `format` method for string substitution."
 ]
 }
 ],
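Taken together, the final cell composes all the named parts into a single `re.VERBOSE` pattern and reads each field back by name. A self-contained sketch of that end result, again using a hypothetical `log_entry` since the diff does not show the notebook's own, might look like this:

```python
import re

# Hypothetical log entry matching the format the pattern expects.
log_entry = '2021-08-25 17:04:23.439405 [ERROR]: end process 42 exited with 1'

regex_parts = {
    'date': r'\d{4}-\d{2}-\d{2}',
    'time': r'\d{2}:\d{2}:\d{2}\.\d+',
}
regex_parts['datetime'] = r'(?P<datetime>{date}\s+{time})'.format(**regex_parts)
regex_parts['log_level'] = r'\[(?P<log_level>\w+)\]'
regex_parts['log_msg'] = (r'end\s+process\s+(?P<process_id>\d+)'
                          r'\s+exited\s+with\s+(?P<exit_status>\d+)')

# re.VERBOSE allows whitespace and comments inside the pattern; str.format is
# used for substitution, following the notebook's note on f-strings and backslashes.
regex = re.compile(r'''
    {datetime}\s+       # date-time, up to microsecond precision
    {log_level}\s*:\s*  # log level of the log message
    {log_msg}           # actual log message
    '''.format(**regex_parts), re.VERBOSE)

match = regex.match(log_entry)
print(f"datetime = {match.group('datetime')}")
print(f"log level: {match.group('log_level')}")
print(f"process = {match.group('process_id')}")
print(f"exit status = {match.group('exit_status')}")
```

Because every group is named, adding or reordering parts of the pattern does not break the retrieval code, which is exactly the robustness the changed markdown cells point out.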
