This proposal specifies lexical rules for constant characters in Carbon:
Put character literals in single quotes, like 'a'. Character literals work
like numeric literals:
- Every different literal value has its own type.
- The literal itself doesn't have a bit width as a consequence. Instead, variables use explicitly sized character types and character literals can be converted to these types when representable.
- A character literal must contain exactly one code point.
This follows the plan from open design idea #1934: Character Literals.
Carbon currently has no lexical syntax for character literals, and only provides string literals and numeric literals. We wish to provide a distinct lexical syntax for character literals versus string literals.
The advantage of having an explicit character type fundamentally comes down to characters being represented as integers whereas strings are represented as buffers. This will allow characters to have different operations, and be more familiar to use. For example:
if (c >= 'A' and c <= 'Z') {
  c += 'a' - 'A';
}
The example above shows how characters support operations similar to those on integers. Comparison and arithmetic operators provide an intuitive way to work with characters, and remove the type conversions and extra control flow that would otherwise be needed to work with a single-element string. See Rationale for more examples where characters are more appropriate than strings.
A character literal is, by definition, a type of literal for representing a single character's value within the source code of a computer program. Character literals differ in minor ways between languages but are fundamentally designed for the same purpose. Languages that have a dedicated character data type generally include character literals, for example C++, Java, and Swift. Other languages that lack a distinct character type, like Python, use strings of length one to serve the same purpose as a character data type. For more information, see Character Literals Wiki and Character Literals DBpedia.
Put character literals in single quotes, like 'a'. Character literals work
like numeric literals:
- Every different literal value has its own type.
- The literal itself doesn't have a bit width as a consequence. Instead, variables use explicitly sized character types and character literals can be converted to these types when representable. Follows the plan from #1934.
- A character literal will model single Unicode code points that have a single concrete numerical representation. We will not be supporting other formulations like code unit sequences or grapheme clusters as these will be modeled with normal string literals.
- A character literal is a sequence of UTF-8 code units, enclosed in single-quote delimiters ('), that must be a valid encoding. This matches the UTF-8 encoding of Carbon source files.
- A character literal must encode exactly one code point.
- It supports addition and subtraction, as described below.
- Character literals will support the relevant subset of the backslash (\) escape sequences in string literals, including \t, \n, \r, \", \', \\, \0, and \u{HHHH...}. See String Literals: Escape sequence.
  - Escape sequences which would result in non-UTF-8 encodings or more than one code point are not included.
  - The escape of an embedded newline is also excluded as it isn't expected to be relevant for character literals.
We will not support:
- character literals that don't contain exactly one Unicode code point;
- multi-line literals;
- "raw" literals (using #'x'#);
- \x escape sequences;
- character literals with a single quote (') or backslash (\), except as part of an escape sequence;
- empty character literals ('');
- a backslash followed by an (unescaped) newline;
- ASCII control codes (0...31), including whitespace characters other than word space (tab, line feed, carriage return, form feed, and vertical tab), except when specified with an escape sequence.
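The lexical rules above can be sketched as a small checker. This is a minimal illustration written in Python, not Carbon toolchain code: the names decode_char_literal, _decode_escape, and SIMPLE_ESCAPES are hypothetical, and two details are assumptions rather than rules stated here: hexadecimal digits are accepted in either case, and surrogate values in \u{...} are rejected because they cannot be encoded as UTF-8.

```python
import string

# Hypothetical table of the single-character escapes listed in the proposal.
SIMPLE_ESCAPES = {
    "t": "\t", "n": "\n", "r": "\r", '"': '"', "'": "'", "\\": "\\", "0": "\0",
}


def _decode_escape(rest: str) -> int:
    """Decode the text that follows the backslash in a character literal."""
    if rest in SIMPLE_ESCAPES:
        return ord(SIMPLE_ESCAPES[rest])
    if rest.startswith("u{") and rest.endswith("}"):
        digits = rest[2:-1]
        if not digits or any(d not in string.hexdigits for d in digits):
            raise ValueError("\\u{...} needs at least one hexadecimal digit")
        value = int(digits, 16)
        # Assumption: surrogates are rejected because UTF-8 cannot encode them.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError("escape does not name an encodable code point")
        return value
    # Covers \x escapes, a lone backslash, and anything else not listed.
    raise ValueError(f"unsupported escape sequence: \\{rest}")


def decode_char_literal(source: str) -> int:
    """Return the code point of a literal such as 'a', or raise ValueError."""
    if len(source) < 2 or source[0] != "'" or source[-1] != "'":
        raise ValueError("literal must be enclosed in single quotes")
    body = source[1:-1]
    if body == "":
        raise ValueError("empty character literal")
    if body[0] == "\\":
        return _decode_escape(body[1:])
    if len(body) != 1:
        raise ValueError("literal must contain exactly one code point")
    if body == "'":
        raise ValueError("a single quote must be escaped")
    if ord(body) < 0x20:
        raise ValueError("control characters must be written as escapes")
    return ord(body)


print(decode_char_literal("'a'"))    # 97
print(decode_char_literal("'\\n'"))  # 10
```

Literals such as '', 'AB', and '\x41' raise ValueError in this sketch, mirroring the unsupported cases listed above.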
For the time being, Carbon will support three character types: Char8, Char16,
and Char32. These types are capable of representing both code units and code
points. It’s important to note that the support for different UTF-encoding code
unit types will be addressed in a separate proposal. Please refer to the UTF
code unit types proposal for more information on that topic.
In Carbon, the type CharN represents a character, where N corresponds to the
bit size of the character type (8, 16, or 32). We will only allow character
literals that map directly to a complete value of a code point. Here are
examples of character literals for each specific type:
Char8: The character literal consists of a single Unicode code point that fits in a single UTF-8 code unit, that is, a value of at most 0x7F. For example:
let allowed: Char8 = 'a';
In this example, the character literal 'a' corresponds to the Unicode code
point 97, which is within the valid range of Char8 since 97 is less than or
equal to 0x7F.
Char16: The character literal represents a Unicode code point that can be represented within 16 bits. Here’s an example:
let smiley: Char16 = '\u{263A}';
The character literal '\u{263A}' represents the white smiling face symbol,
which has the Unicode code point 9786. Since 9786 is less than or equal to
0xFFFF, it can be represented within 16 bits and assigned to a variable of type
Char16. An emoji such as '\u{1F600}' (code point 128512) exceeds 0xFFFF and
would need Char32 instead.
Char32: This character type allows the representation of Unicode code points within 32 bits. Here’s an example:
let musicalNote: Char32 = '🎵';
In this case, the character literal '🎵' corresponds to the musical note emoji
with the Unicode code point 127925. Since 127925 falls within the range that
can be represented by Char32, it can be assigned to a variable of type Char32.
By restricting character literals to those that can be directly mapped to code points within the specific character types, we ensure accurate representation and compatibility with the chosen character encoding scheme.
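As a rough illustration of the representability rule, the following Python sketch checks whether a code point fits a given character type. The function literal_fits and the table CHAR_TYPE_MAX are hypothetical names. The limits assume that Char8 accepts values up to 0x7F and Char16 up to 0xFFFF, as this proposal describes, and that Char32 accepts every code point up to the Unicode maximum of 0x10FFFF.

```python
# Hypothetical upper bounds for literals converting to each character type.
CHAR_TYPE_MAX = {"Char8": 0x7F, "Char16": 0xFFFF, "Char32": 0x10FFFF}


def literal_fits(code_point: int, char_type: str) -> bool:
    """Return True if a literal with this code point converts to char_type."""
    return 0 <= code_point <= CHAR_TYPE_MAX[char_type]


print(literal_fits(ord("a"), "Char8"))       # True: 97 <= 0x7F
print(literal_fits(ord("\u00e9"), "Char8"))  # False: 0xE9 > 0x7F
print(literal_fits(ord("\u00e9"), "Char16")) # True: 0xE9 <= 0xFFFF
print(literal_fits(0x1F3B5, "Char32"))       # True: the musical note emoji
```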
Character literals representing a single code point support the following operators:
- Comparison: <, >, <=, >=, ==
- Plus: +. This doesn't concatenate, but allows numerically adjusting the value:
  - Only one operand may be a character literal, the other must be an integer literal.
  - The result is the character literal whose numeric value is the sum of the numeric values of the operands. If that sum is not a valid Unicode code point, it is an error.
- Subtract: -. This will subtract the value of the two characters, or a character followed by an integer literal:
  - If the - is used between two character literals, the result will be an integer constant. For example, 'z' - 'a' is equivalent to 25.
  - If the - is used between a character literal followed by an integer literal, this will produce a character constant. For example, 'z' - 4 is equivalent to 'v'.
  - If the - is used between an integer literal followed by a character literal, as in 100 - 'a', this will be rejected unless the integer is cast to a character.
There is intentionally no implicit conversion from character literals to integer types, but explicit conversions are permitted between character literals and integer types. Carbon will separate the integer types from character types entirely.
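The operator rules can be modeled with a small Python class. This is only a sketch under stated assumptions: CharLiteral is a hypothetical name, Python integers stand in for integer literals, and allowing the integer operand on either side of + is an assumption, since the rules only say that exactly one operand may be a character literal.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class CharLiteral:
    """Hypothetical model of a character literal as a single code point."""

    code_point: int

    def __post_init__(self):
        if not 0 <= self.code_point <= 0x10FFFF:
            raise ValueError("result is not a valid Unicode code point")

    def __add__(self, other):
        # character + integer adjusts the value; character + character
        # falls through to NotImplemented and raises TypeError.
        if isinstance(other, int):
            return CharLiteral(self.code_point + other)
        return NotImplemented

    # Assumption: the integer operand may appear on either side of +.
    __radd__ = __add__

    def __sub__(self, other):
        if isinstance(other, CharLiteral):
            # character - character yields an integer.
            return self.code_point - other.code_point
        if isinstance(other, int):
            # character - integer yields a character.
            return CharLiteral(self.code_point - other)
        return NotImplemented

    # No __rsub__ is defined, so integer - character raises TypeError.


z = CharLiteral(ord("z"))
a = CharLiteral(ord("a"))
print(z - a)                          # 25
print(z - 4 == CharLiteral(ord("v"))) # True
print(a < z)                          # True
```

Evaluating 100 - a in this model raises TypeError, which corresponds to the rejected integer-minus-character case.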
This proposal supports the goal of making Carbon code
easy to read, understand, and write.
Adding a dedicated character literal supports clean, readable, concise code and
is a familiar concept that will make it easier to adopt Carbon coming from
other languages. Having a distinct character literal will also allow us to
support useful operations designed to manipulate the literal's value. When
working with an explicit character type we can use operators that have unique
behavior; for example, say we wanted to advance a character to the next
literal. In other languages the + operator is often used for concatenation, so
using a String will produce a type error: "a" + 1. However, with a character
literal, we can support operations for these use cases:
var b: Char8;
b = 'a' + 1;
b + 1 == 'c';
See Operations and No Distinct Character Literal for more information.
Further, this design follows other standards set in place by previous proposals. For example, it follows the escape sequences from String Literals: Escaping Sequence, and represents characters as integers with behaviour in line with Integer Literals.
This also supports our goal for Interoperability with and migration from existing C++ code by ensuring that every kind of character literal that exists in C++ can be represented in a Carbon character literal. This is done in a way that is natural to adopt, understand, and read, by having explicit character types mapped to the C++ character types and the correct associated encoding.
Finally, the choice to use Unicode and UTF-8 by default reflects the Carbon goal to prioritize modern OS platforms, hardware architectures, and environments. This reflects the growing adoption of UTF-8.
Unlike C++, Carbon will separate the integer and the character types. We
considered using u8, u16, and u32 instead of Char8, Char16, and Char32 to
reduce the number of different types users needed to be aware of and convert
between. We decided against it because it came with a number of disadvantages:
- u8, u16, and u32 have the wrong arithmetic semantics: we don't want wrapping, and many uN operations, like multiplication, division, and shift, are not meaningful on code units. There may be rare cases where you want to use those operations, such as if you're implementing a conversion to or from code units. But in those rare cases it would be reasonable for the user to convert to an integer type to perform that operation and convert back when done.
- Some operations want to be able to tell the difference between values that are intended to be UTF-8 instead of having no specified encoding.
- Some operations want to be able to know that they've been given text rather than random bytes of data. For example, Print(0x41 as u8) would be expected to print "65" while Print('\u{41}') and Print(0x41 as Char8) would be expected to print "A".
- It's useful for developers to document the intended meaning of a value, and using a distinct type is one way to do that.
See the UTF code unit types proposal for more information about UTF encoding types, which are left to a future proposal.
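The printing distinction can be illustrated with two wrapper types that hold the same numeric value. This Python sketch is only an analogy: the classes U8 and Char8 below are hypothetical stand-ins, not Carbon's types, and Python's print stands in for Print.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class U8:
    """Hypothetical stand-in for an 8-bit unsigned integer."""

    value: int

    def __str__(self) -> str:
        # An integer prints as its decimal digits.
        return str(self.value)


@dataclass(frozen=True)
class Char8:
    """Hypothetical stand-in for an 8-bit character type."""

    value: int

    def __str__(self) -> str:
        # A character prints as the text it represents.
        return chr(self.value)


print(U8(0x41))     # 65
print(Char8(0x41))  # A
```

Because the two values carry different types, an operation can choose the right behavior without guessing what the bits mean.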
In principle, a character literal can be represented by reusing string
literals, similar to how Python handles character literals. However, this would
prevent performing operations on characters as integers. For example, the +
operator on strings is used for concatenation, but + on a character would
change its value.
// `digit` must be in the range 0..9.
fn DigitToChar(digit: i32) -> Char8 {
  return '0' + digit;
}
Furthermore, many properties of Unicode characters are defined on ranges of code points, motivating supporting comparison operators on code points.
fn IsDingBatCodePoint(c: Char32) -> bool {
  return c >= '\u{2700}' and c <= '\u{27BF}';
}
No support is proposed for prefix declarations like u, U, or L. In practice
they are used to specify the character literal types and their encoding in
languages like C and C++. There are several benefits to omitting prefix
declarations: improved readability, and a simpler way of determining a
character's type and encoding. When declaring a character literal, the type is
based on the contents of the character, so that var c: Char8 = 'a' is valid
because 'a' can be converted to Char8. To support prefix declarations, we would
need to extend our type system with other explicit type checks like those in
C++: UTF-16 u', UTF-32 U', and wide characters L'. This would be more familiar
for individuals coming to Carbon from a C++ background, and would simplify our
approach to C++ interoperability. However, it would come at the cost of
diverging from existing standards; for example, Proposal 142 states that all
Carbon source code should be UTF-8 encoded. Prefix declarations would also
detract from the readability of character literals and increase the complexity
of character literal types.
This proposal does not support numeric escape sequences using \x. This
simplifies the design of character types and literals, making them only
represent code points and not code units. However, it makes character literals
less consistent with string literals, since the two now accept different escape
sequences. We don't want to remove numeric escape sequences from string
literals, because strings need them for use cases like representing invalid
encodings.
This approach has the additional concern that if character literals don't
support numeric escape sequences, developers may choose to use numeric literals
instead, at a cost of type-safety and readability. For example, it isn't clear
in var first_digit: Char8 = 0; whether 0 is supposed to be a NUL character or
the encoding of the '0' character (48). We addressed this concern, and type
safety concerns about distinguishing numbers and characters, by making the
integer to character conversions explicit.
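To make the distinction between code points and code units concrete, the following Python sketch lists the UTF-8 code units for a few code points and shows a byte sequence that is not valid UTF-8. The helper names utf8_code_units and is_valid_utf8 are hypothetical and exist only for this illustration.

```python
def utf8_code_units(code_point: int) -> list[int]:
    """Return the UTF-8 code units (bytes) that encode one code point."""
    return list(chr(code_point).encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a complete, valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# One code point may need one or several code units.
print([hex(u) for u in utf8_code_units(0x61)])     # ['0x61']
print([hex(u) for u in utf8_code_units(0xE9)])     # ['0xc3', '0xa9']
print([hex(u) for u in utf8_code_units(0x1F600)])  # ['0xf0', '0x9f', '0x98', '0x80']

# A single numeric byte can be a code unit without being a code point.
print(is_valid_utf8(b"\xc3"))      # False: a lead byte with no continuation
print(is_valid_utf8(b"\xc3\xa9"))  # True: the full encoding of U+00E9
```

A byte such as 0xC3 is meaningful inside a string but does not identify a character on its own, which is the kind of value a \x escape in a character literal would have to represent.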
Rather than explicitly limiting character literals to a more integer-like representation of a single Unicode code point, we could allow character literals to represent other formulations, such as grapheme clusters and non-code-point code units. What humans tend to think of as a "character" corresponds to a "grapheme cluster." The encoding of a grapheme cluster can be arbitrarily long and complex, which would sacrifice the ability to perform integer operations. If we wanted to add support for other character formulations, we would need to use separate spellings to represent a small set of operations that are today expressed with integer-based math on C++'s character literals. This includes things like converting an integer between 0 and 9 into the corresponding digit character, or computing the difference between two digits or two other characters. For these reasons, we have decided to start out by representing character literals as single Unicode code points following a more integer-like model. However, this topic should be revisited if we find that there is a significant need for the additional functionality and attendant complexity of these other character formulations.
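A short Python sketch shows why a grapheme cluster does not fit an integer-like model: what a reader sees as one character can span several code points. The helper code_points is a hypothetical name, and the sample strings are chosen only for illustration.

```python
import unicodedata


def code_points(text: str) -> list[int]:
    """Return the Unicode code points that make up the text."""
    return [ord(ch) for ch in text]


# "e" followed by a combining acute accent displays as one character
# but consists of two code points.
decomposed = "e\u0301"
print(len(code_points(decomposed)))  # 2

# NFC normalization composes this particular pair into one code point.
composed = unicodedata.normalize("NFC", decomposed)
print(len(code_points(composed)))    # 1

# A family emoji joins three emoji with zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(code_points(family)))      # 5
```

Only the composed form would be expressible as a character literal under this proposal; the other two would be written as string literals.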
There have been several ideas and discussions around how we would like to handle UTF code units. This section will hopefully provide some guidance for a future proposal when the topic is revisited for how we would like to build out encoding/decoding for character literals.
We will have the types Char8, Char16, and Char32 representing code units in
UTF-8, UTF-16, and UTF-32. We will not support all code units, only those which
map directly to the complete value of a code point. However, character literals
will use their own types distinct from these:
- We will support value preserving implicit conversions from character literals to code point or code unit types. In particular, a character literal converts to a Char8 UTF-8 code unit if it is less than or equal to 0x7F, and to a Char16 UTF-16 code unit if it is less than or equal to 0xFFFF.
- Conversions from string or character literals to a non-value-preserving encoding must be explicit.
- Conversions from string literals to Unicode strings are implicit, even though the numeric values of the encoding may change.
We can tell whether a particular literal is representable in the variable's type by looking only at the types.
let allowed: Char8 = 'a';
The above is allowed because the type of 'a' is the character literal
consisting of the single Unicode code point 97, which can be converted to Char8
since 97 is less than or equal to 0x7F.
let error1: Char8 = '😃';
let error2: Char8 = 'AB';
However, these should produce errors. The type of '😃' is the character literal
consisting of the single Unicode code point 0x1F603, which is greater than
0x7F. The type of 'AB' is a character literal that is a sequence of two Unicode
code points, which has no conversion to a type that only handles a single UTF-8
code unit.
Both '\n' and '\u{A}' represent the same character and so have the same type.
However, explicitly converting this character literal to another character set
might result in a character with a different value that still represents the
newline character.
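The last point can be illustrated in Python, where two spellings of the newline compare equal, while encoding it in a non-ASCII-based character set changes the stored value. The choice of the EBCDIC code page cp037 is an assumption made for this example; the proposal does not name any particular character set.

```python
# Two spellings of the same code point, U+000A.
newline = "\n"
same_newline = "\u000A"
print(newline == same_newline)  # True

# Encoding into an EBCDIC code page stores a different numeric value.
ascii_value = newline.encode("ascii")
ebcdic_value = newline.encode("cp037")
print(ascii_value != ebcdic_value)  # True

# Decoding with the same code page recovers the newline character.
print(ebcdic_value.decode("cp037") == newline)  # True
```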