From 3bace0ab40f16e0b539ee6416d6adbaf9379e910 Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Wed, 5 Mar 2025 17:58:33 +0100
Subject: [PATCH 1/5] Start on the Identifiers section

---
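Note: the new opening sentence says that NAME tokens cover identifiers,
keywords, and soft keywords alike. Below is a minimal sketch of how that can
be observed, assuming only the standard-library ``tokenize`` and ``keyword``
modules. The helper ``classify_names`` is hypothetical and not part of this
patch.

```python
import io
import keyword
import tokenize


def classify_names(source):
    """Return (string, kind) pairs for every NAME token in *source*.

    Hypothetical helper for illustration only.
    """
    readline = io.StringIO(source).readline
    result = []
    for tok in tokenize.generate_tokens(readline):
        if tok.type != tokenize.NAME:
            continue  # skip operators, numbers, NEWLINE, ENDMARKER, ...
        if keyword.iskeyword(tok.string):
            kind = "keyword"
        elif keyword.issoftkeyword(tok.string):
            # Only reserved in specific contexts; the tokenizer cannot tell.
            kind = "soft keyword"
        else:
            kind = "identifier"
        result.append((tok.string, kind))
    return result


pairs = classify_names("match = 1 if x else None\n")
for name, kind in pairs:
    print(f"{name!r}: {kind}")
```

Every word in the sample line comes back as a NAME token. Telling the three
kinds apart takes a lookup after tokenizing, which matches the statement that
the soft-keyword distinction is made by the parser.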
 Doc/reference/lexical_analysis.rst | 43 ++++++++++++++++++------------
 1 file changed, 26 insertions(+), 17 deletions(-)
diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index ff801a7d4fc494..7ecdef822a425a 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -277,29 +277,38 @@ Identifiers and keywords
 
 .. index:: identifier, name
 
-Identifiers (also referred to as *names*) are described by the following lexical
-definitions.
+:data:`~token.NAME` tokens represent *identifiers*, *keywords*, and
+*soft keywords*.
 
-The syntax of identifiers in Python is based on the Unicode standard annex
-UAX-31, with elaboration and changes as defined below; see also :pep:`3131` for
-further details.
-
-Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
-include the uppercase and lowercase letters ``A`` through
-``Z``, the underscore ``_`` and, except for the first character, the digits
+Within the ASCII range (U+0001..U+007F), the valid characters for names
+include the uppercase and lowercase letters (``A`` through
+``Z``), the underscore ``_`` and, except for the first character, the digits
 ``0`` through ``9``.
-Python 3.0 introduced additional characters from outside the ASCII range (see
-:pep:`3131`).  For these characters, the classification uses the version of the
-Unicode Character Database as included in the :mod:`unicodedata` module.
 
-Identifiers are unlimited in length.  Case is significant.
+Names must contain at least one character, but have no upper length limit.
+Case is significant.
+
+Besides ``A-Z`` and ``0-9``, names can also use "letter-like" and "number-like"
+characters from outside the ASCII range.  For these characters, the
+classification uses the version of the Unicode Character Database as included
+in the :mod:`unicodedata` module.
+
+The exact definition of "letter-like" and "number-like" characters is based on
+the Unicode standard annex `UAX-31`_, with elaboration and changes as
+defined below. See also :pep:`3131` for further details.
+
+All identifiers are converted into the normal form NFKC while parsing;
+comparison of identifiers is based on NFKC.
+
+Formally, names are described by the following lexical definitions.
 
 .. productionlist:: python-grammar
-   identifier: `xid_start` `xid_continue`*
+   NAME: `xid_start` `xid_continue`*
    id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
    id_continue: <all characters in `id_start`, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
    xid_start: <all characters in `id_start` whose NFKC normalization is in "id_start xid_continue*">
    xid_continue: <all characters in `id_continue` whose NFKC normalization is in "id_continue*">
+   identifier: <`NAME`, except keywords>
 
 The Unicode category codes mentioned above stand for:
 
@@ -318,14 +327,14 @@ The Unicode category codes mentioned above stand for:
   compatibility
 * *Other_ID_Continue* - likewise
 
-All identifiers are converted into the normal form NFKC while parsing; comparison
-of identifiers is based on NFKC.
-
 A non-normative HTML file listing all valid identifier characters for Unicode
 16.0.0 can be found at
 https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt
 
 
+.. _UAX-31: https://www.unicode.org/reports/tr31/
+
+
 .. _keywords:
 
 Keywords

From 1b4e8fa368e385a09d4cae31878a1826e857eaed Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Wed, 19 Mar 2025 17:57:28 +0100
Subject: [PATCH 2/5] Lexical analysis: improve section on names

---
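Note: a rough sketch of the ``id_start`` / ``id_continue`` sets listed in this
patch, assuming the general categories reported by ``unicodedata.category``.
It leaves out ``Other_ID_Start`` and ``Other_ID_Continue``, which
``unicodedata`` does not expose, so it only approximates the rules. All helper
names are hypothetical, and ``str.isidentifier`` remains the reliable check.

```python
import unicodedata

# General categories taken from the id_start / id_continue lists in the patch.
# Other_ID_Start and Other_ID_Continue are deliberately omitted (assumption:
# they are not available through unicodedata), so this is an approximation.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_ONLY_CATEGORIES = {"Nd", "Pc", "Mn", "Mc"}


def in_id_start(ch):
    """Approximate membership in id_start for a single character."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def in_id_continue(ch):
    """Approximate membership in id_continue for a single character."""
    return (in_id_start(ch)
            or unicodedata.category(ch) in ID_CONTINUE_ONLY_CATEGORIES)


def looks_like_name(text):
    """Check the NFKC-normalized text against the two sets."""
    normalized = unicodedata.normalize("NFKC", text)
    if not normalized:
        return False  # names must contain at least one character
    return (in_id_start(normalized[0])
            and all(in_id_continue(ch) for ch in normalized[1:]))


candidates = ["spam_1", "1spam", "na\u00efve", "a-b", "\ufb01le"]
for candidate in candidates:
    print(repr(candidate), looks_like_name(candidate),
          candidate.isidentifier())

# NFKC conversion while parsing: the ligature U+FB01 becomes "fi",
# so the assignment below binds the plain ASCII name "file".
namespace = {}
exec("\ufb01le = 1", namespace)
print("file" in namespace)
```

For these samples the sketch agrees with ``str.isidentifier``. The last lines
show the effect of the NFKC sentence: a name written with the ``U+FB01``
ligature is stored under its normalized spelling.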
 Doc/reference/lexical_analysis.rst | 104 ++++++++++++++++-------------
 1 file changed, 59 insertions(+), 45 deletions(-)

diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index 7ecdef822a425a..9257113617930a 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -272,8 +272,8 @@ possible string that forms a legal token, when read from left to right.
 
 .. _identifiers:
 
-Identifiers and keywords
-========================
+Names (identifiers and keywords)
+================================
 
 .. index:: identifier, name
 
@@ -281,51 +281,62 @@ Identifiers and keywords
 *soft keywords*.
 
 Within the ASCII range (U+0001..U+007F), the valid characters for names
-include the uppercase and lowercase letters (``A`` through
-``Z``), the underscore ``_`` and, except for the first character, the digits
+include the uppercase and lowercase letters (``A`` through ``Z`` and ``a`` to
+``z``), the underscore ``_`` and, except for the first character, the digits
 ``0`` through ``9``.
 
 Names must contain at least one character, but have no upper length limit.
 Case is significant.
 
-Besides ``A-Z`` and ``0-9``, names can also use "letter-like" and "number-like"
-characters from outside the ASCII range.  For these characters, the
-classification uses the version of the Unicode Character Database as included
-in the :mod:`unicodedata` module.
+Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
+and "number-like" characters from outside the ASCII range, as detailed below.
 
-The exact definition of "letter-like" and "number-like" characters is based on
-the Unicode standard annex `UAX-31`_, with elaboration and changes as
-defined below. See also :pep:`3131` for further details.
+All identifiers are converted into the `normalization form`_ NFKC while
+parsing; comparison of identifiers is based on NFKC.
 
-All identifiers are converted into the normal form NFKC while parsing;
-comparison of identifiers is based on NFKC.
+Formally, the first character of a normalized identifier must belong to the
+set ``id_start``, which is the union of:
 
-Formally, names are described by the following lexical definitions.
+* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
+* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
+* Unicode category ``<Lt>`` - titlecase letters
+* Unicode category ``<Lm>`` - modifier letters
+* Unicode category ``<Lo>`` - other letters
+* Unicode category ``<Nl>`` - letter numbers
+* {``"_"``} - the underscore
+* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
+  to support backwards compatibility
 
-.. productionlist:: python-grammar
-   NAME: `xid_start` `xid_continue`*
-   id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
-   id_continue: <all characters in `id_start`, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
-   xid_start: <all characters in `id_start` whose NFKC normalization is in "id_start xid_continue*">
-   xid_continue: <all characters in `id_continue` whose NFKC normalization is in "id_continue*">
-   identifier: <`NAME`, except keywords>
-
-The Unicode category codes mentioned above stand for:
-
-* *Lu* - uppercase letters
-* *Ll* - lowercase letters
-* *Lt* - titlecase letters
-* *Lm* - modifier letters
-* *Lo* - other letters
-* *Nl* - letter numbers
-* *Mn* - nonspacing marks
-* *Mc* - spacing combining marks
-* *Nd* - decimal numbers
-* *Pc* - connector punctuations
-* *Other_ID_Start* - explicit list of characters in `PropList.txt
-  <https://www.unicode.org/Public/16.0.0/ucd/PropList.txt>`_ to support backwards
-  compatibility
-* *Other_ID_Continue* - likewise
+The remaining characters must belong to the set ``id_continue``, which is the
+union of:
+
+* all characters in ``id_start``
+* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
+* Unicode category ``<Pc>`` - connector punctuations
+* Unicode category ``<Mn>`` - nonspacing marks
+* Unicode category ``<Mc>`` - spacing combining marks
+* ``<Other_ID_Continue>`` - another explicit set of characters in
+  `PropList.txt`_ to support backwards compatibility
+
+Unicode categories use the version of the Unicode Character Database as
+included in the :mod:`unicodedata` module.
+
+These sets are based on the Unicode standard annex `UAX-31`_.
+See also :pep:`3131` for further details.
+
+Even more formally, names are described by the following lexical definitions:
+
+.. grammar-snippet::
+   :group: python-grammar
+
+   NAME:         `xid_start` `xid_continue`*
+   id_start:     <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>
+   id_continue:  `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
+   xid_start:    <all characters in `id_start` whose NFKC normalization is
+                  in (`id_start` `xid_continue`*)>
+   xid_continue: <all characters in `id_continue` whose NFKC normalization is
+                  in (`id_continue`*)>
+   identifier:   <`NAME`, except keywords>
 
 A non-normative HTML file listing all valid identifier characters for Unicode
 16.0.0 can be found at
@@ -333,6 +344,8 @@ https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt
 
 
 .. _UAX-31: https://www.unicode.org/reports/tr31/
+.. _PropList.txt: https://www.unicode.org/Public/16.0.0/ucd/PropList.txt
+.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
 
 
 .. _keywords:
@@ -344,7 +357,7 @@ Keywords
    single: keyword
    single: reserved word
 
-The following identifiers are used as reserved words, or *keywords* of the
+The following names are used as reserved words, or *keywords* of the
 language, and cannot be used as ordinary identifiers.  They must be spelled
 exactly as written here:
 
@@ -368,18 +381,19 @@ Soft Keywords
 
 .. versionadded:: 3.10
 
-Some identifiers are only reserved under specific contexts. These are known as
-*soft keywords*.  The identifiers ``match``, ``case``, ``type`` and ``_`` can
-syntactically act as keywords in certain contexts,
+Some names are only reserved under specific contexts. These are known as
+*soft keywords*:
+
+- ``match``, ``case``, and ``_``, when used in the :keyword:`match` statement.
+- ``type``, when used in the :keyword:`type` statement.
+
+These syntactically act as keywords in their specific contexts,
 but this distinction is done at the parser level, not when tokenizing.
 
 As soft keywords, their use in the grammar is possible while still
 preserving compatibility with existing code that uses these names as
 identifier names.
 
-``match``, ``case``, and ``_`` are used in the :keyword:`match` statement.
-``type`` is used in the :keyword:`type` statement.
-
 .. versionchanged:: 3.12
    ``type`` is now a soft keyword.
 

From 4f334598f54339bca0900c1cffc7c3b6091d124d Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Thu, 20 Mar 2025 11:26:48 +0100
Subject: [PATCH 3/5] Update Doc/reference/lexical_analysis.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/reference/lexical_analysis.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index 9257113617930a..5664537633b978 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -338,7 +338,7 @@ Even more formally, names are described by the following lexical definitions:
                   in (`id_continue`*)>
    identifier:   <`NAME`, except keywords>
 
-A non-normative HTML file listing all valid identifier characters for Unicode
+A non-normative text file listing all valid identifier characters for Unicode
 16.0.0 can be found at
 https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt
 

From 912bf0014aa9128cf09ebadcb04598c27b90dffa Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Wed, 7 May 2025 16:29:30 +0200
Subject: [PATCH 4/5] Apply suggestions from code review

---
 Doc/reference/lexical_analysis.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index 5664537633b978..a7bb692eeae73e 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -281,8 +281,8 @@ Names (identifiers and keywords)
 *soft keywords*.
 
 Within the ASCII range (U+0001..U+007F), the valid characters for names
-include the uppercase and lowercase letters (``A`` through ``Z`` and ``a`` to
-``z``), the underscore ``_`` and, except for the first character, the digits
+include the uppercase and lowercase letters (``A-Z`` and ``a-z``),
+the underscore ``_`` and, except for the first character, the digits
 ``0`` through ``9``.
 
 Names must contain at least one character, but have no upper length limit.

From e6101d1809e20fc059324872c9dc09ee7d38ecbc Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Wed, 7 May 2025 16:49:43 +0200
Subject: [PATCH 5/5] Use an inline link for DerivedCoreProperties.txt

---
 Doc/reference/lexical_analysis.rst | 7 ++++---
 Tools/unicode/makeunicodedata.py   | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index a7bb692eeae73e..0a4e918dc1e447 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -338,13 +338,14 @@ Even more formally, names are described by the following lexical definitions:
                   in (`id_continue`*)>
    identifier:   <`NAME`, except keywords>
 
-A non-normative text file listing all valid identifier characters for Unicode
-16.0.0 can be found at
-https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt
+A non-normative listing of all valid identifier characters as defined by
+Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
+Character Database.
 
 
 .. _UAX-31: https://www.unicode.org/reports/tr31/
 .. _PropList.txt: https://www.unicode.org/Public/16.0.0/ucd/PropList.txt
+.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt
 .. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
 
 
diff --git a/Tools/unicode/makeunicodedata.py b/Tools/unicode/makeunicodedata.py
index 889ae8fc869b8a..d4cca68c3e3e71 100644
--- a/Tools/unicode/makeunicodedata.py
+++ b/Tools/unicode/makeunicodedata.py
@@ -43,7 +43,7 @@
 # When changing UCD version please update
 #   * Doc/library/stdtypes.rst, and
 #   * Doc/library/unicodedata.rst
-#   * Doc/reference/lexical_analysis.rst (two occurrences)
+#   * Doc/reference/lexical_analysis.rst (three occurrences)
 UNIDATA_VERSION = "16.0.0"
 UNICODE_DATA = "UnicodeData%s.txt"
 COMPOSITION_EXCLUSIONS = "CompositionExclusions%s.txt"