gh-127833: lexical analysis: Improve section on Names #131474

Open. Wants to merge 3 commits into base: main
123 changes: 73 additions & 50 deletions Doc/reference/lexical_analysis.rst
@@ -272,60 +272,82 @@ possible string that forms a legal token, when read from left to right.

.. _identifiers:

Identifiers and keywords
========================
Names (identifiers and keywords)
================================

.. index:: identifier, name

Identifiers (also referred to as *names*) are described by the following lexical
definitions.
:data:`~token.NAME` tokens represent *identifiers*, *keywords*, and
*soft keywords*.

The syntax of identifiers in Python is based on the Unicode standard annex
UAX-31, with elaboration and changes as defined below; see also :pep:`3131` for
further details.

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
include the uppercase and lowercase letters ``A`` through
``Z``, the underscore ``_`` and, except for the first character, the digits
Within the ASCII range (U+0001..U+007F), the valid characters for names
include the uppercase and lowercase letters (``A`` through ``Z`` and ``a`` to
``z``), the underscore ``_`` and, except for the first character, the digits
Comment on lines +284 to +285
Contributor

Suggested change
include the uppercase and lowercase letters (``A`` through ``Z`` and ``a`` to
``z``), the underscore ``_`` and, except for the first character, the digits
include the uppercase and lowercase letters, the underscore ``_`` and, except
for the first character, the digits

Do we really need to explain this here too?

This is just below:

Besides A-Z, a-z, _ and 0-9 ...

Member Author

I'd definitely keep it here; making it clear that “letters in the ASCII range” is A-Z and a-z.
Perhaps shorten it:

Suggested change
include the uppercase and lowercase letters (``A`` through ``Z`` and ``a`` to
``z``), the underscore ``_`` and, except for the first character, the digits
include the uppercase and lowercase letters (``A-Z`` and ``a-z``),
the underscore ``_`` and, except for the first character, the digits

I'd be OK with deduplicating the later occurrence, but something like “ASCII characters as listed above” is almost as long as repeating the lists. This isn't code, it can be a little WET.

Contributor

That would be better :-)

``0`` through ``9``.
Python 3.0 introduced additional characters from outside the ASCII range (see
:pep:`3131`). For these characters, the classification uses the version of the
Unicode Character Database as included in the :mod:`unicodedata` module.

Identifiers are unlimited in length. Case is significant.
Names must contain at least one character, but have no upper length limit.
Case is significant.

.. productionlist:: python-grammar
identifier: `xid_start` `xid_continue`*
id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
id_continue: <all characters in `id_start`, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
xid_start: <all characters in `id_start` whose NFKC normalization is in "id_start xid_continue*">
xid_continue: <all characters in `id_continue` whose NFKC normalization is in "id_continue*">

The Unicode category codes mentioned above stand for:

* *Lu* - uppercase letters
* *Ll* - lowercase letters
* *Lt* - titlecase letters
* *Lm* - modifier letters
* *Lo* - other letters
* *Nl* - letter numbers
* *Mn* - nonspacing marks
* *Mc* - spacing combining marks
* *Nd* - decimal numbers
* *Pc* - connector punctuations
* *Other_ID_Start* - explicit list of characters in `PropList.txt
<https://www.unicode.org/Public/16.0.0/ucd/PropList.txt>`_ to support backwards
compatibility
* *Other_ID_Continue* - likewise

All identifiers are converted into the normal form NFKC while parsing; comparison
of identifiers is based on NFKC.

A non-normative HTML file listing all valid identifier characters for Unicode
Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
and "number-like" characters from outside the ASCII range, as detailed below.

All identifiers are converted into the `normalization form`_ NFKC while
parsing; comparison of identifiers is based on NFKC.
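The normalization step above can be observed from Python itself. The following is a minimal sketch, not part of the proposed text: it assumes the ligature U+FB01 as an example character and uses ``unicodedata.normalize`` together with ``exec`` to show that a name written with the ligature is bound under its NFKC form.

```python
import unicodedata

# U+FB01 is LATIN SMALL LIGATURE FI; NFKC decomposes it into "f" + "i".
ligature_name = "\ufb01le"
normalized = unicodedata.normalize("NFKC", ligature_name)

# Assigning through the ligature spelling binds the normalized name,
# because the parser converts identifiers to NFKC.
namespace = {}
exec(ligature_name + " = 1", namespace)

print(normalized)           # file
print("file" in namespace)  # True
```

The ligature spelling itself never appears as a key in ``namespace``; only the normalized ``file`` does.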

Formally, the first character of a normalized identifier must belong to the
set ``id_start``, which is the union of:

* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
* Unicode category ``<Lt>`` - titlecase letters
* Unicode category ``<Lm>`` - modifier letters
* Unicode category ``<Lo>`` - other letters
* Unicode category ``<Nl>`` - letter numbers
* {``"_"``} - the underscore
* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
to support backwards compatibility

The remaining characters must belong to the set ``id_continue``, which is the
union of:

* all characters in ``id_start``
* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
* Unicode category ``<Pc>`` - connector punctuations
* Unicode category ``<Mn>`` - nonspacing marks
* Unicode category ``<Mc>`` - spacing combining marks
* ``<Other_ID_Continue>`` - another explicit set of characters in
`PropList.txt`_ to support backwards compatibility

Unicode categories use the version of the Unicode Character Database as
included in the :mod:`unicodedata` module.
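As an illustration of the two sets, here is a rough sketch built on ``unicodedata.category``. The helper names ``approx_id_start`` and ``approx_id_continue`` are hypothetical, and the sketch is only an approximation: it ignores the ``Other_ID_Start`` and ``Other_ID_Continue`` properties and does not apply NFKC normalization.

```python
import unicodedata

# General categories listed for id_start and the extra ones for id_continue.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = {"Nd", "Pc", "Mn", "Mc"}


def approx_id_start(ch):
    """Approximate membership in id_start (ignores Other_ID_Start)."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch):
    """Approximate membership in id_continue (ignores Other_ID_Continue)."""
    return (approx_id_start(ch)
            or unicodedata.category(ch) in ID_CONTINUE_CATEGORIES)


for ch in ("A", "_", "9", "-", "\u0301"):
    print(repr(ch), approx_id_start(ch), approx_id_continue(ch))
```

For real code, ``str.isidentifier()`` is the reliable check; the sketch only mirrors the category lists above.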

These sets are based on the Unicode standard annex `UAX-31`_.
See also :pep:`3131` for further details.

Even more formally, names are described by the following lexical definitions:

.. grammar-snippet::
:group: python-grammar

NAME: `xid_start` `xid_continue`*
id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>
id_continue: `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
xid_start: <all characters in `id_start` whose NFKC normalization is
in (`id_start` `xid_continue`*)>
xid_continue: <all characters in `id_continue` whose NFKC normalization is
in (`id_continue`*)>
identifier: <`NAME`, except keywords>
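The last rule (an identifier is a ``NAME`` that is not a keyword) can be approximated at runtime. This is a small sketch, not part of the proposed text; the helper name ``is_identifier`` is hypothetical and combines ``str.isidentifier()`` with ``keyword.iskeyword()``.

```python
import keyword


def is_identifier(name):
    """Return True if name is a valid NAME that is not a (hard) keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ("spam", "class", "match", "1abc", ""):
    print(repr(candidate), is_identifier(candidate))
```

Soft keywords such as ``match`` pass this check, since ``keyword.iskeyword()`` only reports hard keywords.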

A non-normative text file listing all valid identifier characters for Unicode
16.0.0 can be found at
Contributor

Suggested change
16.0.0 can be found at
15.1.0 can be found at

Unicode categories use the version of the Unicode Character Database as
included in the :mod:`unicodedata` module.

According to the docs that is 15.1.0 at the moment, maybe we should link to that one instead?

Member Author

Contributor

Oh apologies

https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt
Comment on lines 342 to 343
Contributor

Maybe the link could be moved below too? Inline links are nice ;-)



.. _UAX-31: https://www.unicode.org/reports/tr31/
.. _PropList.txt: https://www.unicode.org/Public/16.0.0/ucd/PropList.txt
.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms


.. _keywords:

Keywords
@@ -335,7 +357,7 @@ Keywords
single: keyword
single: reserved word

The following identifiers are used as reserved words, or *keywords* of the
The following names are used as reserved words, or *keywords* of the
language, and cannot be used as ordinary identifiers. They must be spelled
exactly as written here:

@@ -359,18 +381,19 @@ Soft Keywords

.. versionadded:: 3.10

Some identifiers are only reserved under specific contexts. These are known as
*soft keywords*. The identifiers ``match``, ``case``, ``type`` and ``_`` can
syntactically act as keywords in certain contexts,
Some names are only reserved under specific contexts. These are known as
*soft keywords*:

- ``match``, ``case``, and ``_``, when used in the :keyword:`match` statement.
- ``type``, when used in the :keyword:`type` statement.

These syntactically act as keywords in their specific contexts,
but this distinction is done at the parser level, not when tokenizing.

As soft keywords, their use in the grammar is possible while still
preserving compatibility with existing code that uses these names as
identifier names.
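A brief sketch of what this means in practice, not part of the proposed text: the hypothetical helper ``name_tokens`` uses the ``tokenize`` module to show that keywords and soft keywords both come out of the tokenizer as ``NAME`` tokens, and ``exec`` is used to show that ``match`` and ``case`` still work as ordinary names.

```python
import io
import keyword
import tokenize


def name_tokens(source):
    """Return the strings of all NAME tokens in source."""
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.NAME]


# "if" (a keyword) and "match" (here an ordinary name) both tokenize as NAME.
names = name_tokens("if match:\n    pass\n")

# Soft keywords are not reported as (hard) keywords ...
match_is_hard_keyword = keyword.iskeyword("match")

# ... and remain usable as ordinary names.
namespace = {}
exec("match = 1\ncase = match + 1\n", namespace)

print(names)
print(match_is_hard_keyword, namespace["case"])
```

Whether ``match`` acts as a keyword is decided only later, by the parser, based on the surrounding syntax.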

``match``, ``case``, and ``_`` are used in the :keyword:`match` statement.
``type`` is used in the :keyword:`type` statement.

.. versionchanged:: 3.12
``type`` is now a soft keyword.
