Commit 02b5a6f

Minor typo and/or grammar fixes
1 parent e9b83b9 commit 02b5a6f

11 files changed: +48 -52 lines

definitions.rst (+9 -9)

@@ -78,8 +78,8 @@ different implementations of character strings:
 * array of 16 bits unsigned integers with :ref:`surrogate pairs
 <surrogates>` (:ref:`UTF-16 <utf16>`): full Unicode range

-UCS-4 use twice as much memory than UCS-2, but it supports all Unicode
-character. UTF-16 is a compromise between UCS-2 and UCS-4: characters in the
+UCS-4 uses twice as much memory than UCS-2, but it supports all Unicode
+characters. UTF-16 is a compromise between UCS-2 and UCS-4: characters in the
 BMP range use one UTF-16 unit (16 bits), characters outside this range use two
 UTF-16 units (a :ref:`surrogate pair <surrogates>`, 32 bits). This advantage is
 also the main disadvantage of this kind of character string.
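(Aside, not part of the commit: a minimal Python sketch of the point corrected above — a BMP character fits in a single UTF-16 unit, while a non-BMP character needs a surrogate pair, i.e. two units.)

    # count UTF-16 units by encoding to UTF-16-LE and dividing the byte length by 2
    bmp = "\u00e9"          # é, inside the BMP
    non_bmp = "\U0001F600"  # a character outside the BMP
    print(len(bmp.encode("utf-16-le")) // 2)      # 1 unit
    print(len(non_bmp.encode("utf-16-le")) // 2)  # 2 units: a surrogate pair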
@@ -116,7 +116,7 @@ to :ref:`ASCII <ascii>` is called an "ASCII encoded string", or simply an
 "ASCII string".

 The :ref:`character range <charset>` supported by a byte string depends on its
-encoding, because an encoding is associated to a :ref:`charset <charset>`. For
+encoding, because an encoding is associated with a :ref:`charset <charset>`. For
 example, an ASCII string can only store characters in the range U+0000—U+007F.

 The encoding is not stored explicitly in a byte string. If the encoding is not
@@ -168,21 +168,21 @@ An **encoding** describes how to :ref:`encode <encode>` :ref:`code points <code
 point>` to bytes and how to :ref:`decode <decode>` :ref:`bytes <bytes>` to code
 points.

-An encoding is always associated to a :ref:`charset <charset>`. For example,
-the UTF-8 encoding is associated to the Unicode charset. So we can say that an
+An encoding is always associated with a :ref:`charset <charset>`. For example,
+the UTF-8 encoding is associated with the Unicode charset. So we can say that an
 encoding :ref:`encodes <encode>` characters to bytes and decode bytes to characters, or more
 generally, it encodes a :ref:`character string <str>` to a :ref:`byte string
 <bytes>` and decodes a byte string to a character string.

-The 7 and 8 bits charsets have most simple encoding: store a code point as a
-single byte. These charsets are also called encodings, it is easy to confuse
+The 7 and 8 bits charsets have the simplest encoding: store a code point as a
+single byte. Since these charsets are also called encodings, it is easy to confuse
 them. The best example is the :ref:`ISO-8859-1 encoding <ISO-8859-1>`: all of
 the 256 possible bytes are considered as 8 bit code points (0 through 255) and
-are associated to characters. For example, the character A (U+0041) has the
+are mapped to characters. For example, the character A (U+0041) has the
 code point 65 (0x41 in hexadecimal) and is stored as the byte ``0x41``.

 Charsets with more than 256 entries cannot encode all code points into a single
-byte. The encoding encode all code points into byte sequences of the same
+byte. The encoding encodes all code points into byte sequences of the same
 length or of variable length. For example, :ref:`UTF-8` is a variable length
 encoding: code points lower than 128 use a single byte, whereas higher code
 points take 2, 3 or 4 bytes. The :ref:`UCS-2 <ucs2>` encoding encodes all
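(Aside, not part of the commit — the hunk is cut off here by the diff context. A quick Python illustration of the variable-length behaviour described above:)

    # UTF-8 uses 1 byte below U+0080, then 2, 3 or 4 bytes for higher code points
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(repr(ch), [hex(b) for b in ch.encode("utf-8")])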

encodings.rst (+3 -5)

@@ -8,7 +8,7 @@ Encodings

 There are many encodings around the world. Before Unicode, each manufacturer
 invented its own encoding to fit its client market and its usage. Most
-encodings are incompatible on at least one code, except some exceptions.
+encodings are incompatible on at least one code, with some exceptions.
 A document stored in :ref:`ASCII` can be read using :ref:`ISO-8859-1` or
 UTF-8, because ISO-8859-1 and UTF-8 are supersets of ASCII. Each encoding can
 have multiple aliases, examples:
@@ -55,7 +55,7 @@ these numbers should be reliable. In 2001, the most used encodings were:

 In december 2007, for the first time: :ref:`UTF-8` becomes the most used encoding
 (near 25%). In january 2010, UTF-8 was close to 50%, and ASCII and Western
-Europe encodings were near 20%. The usage of the other encodings don't change.
+Europe encodings were near 20%. The usage of other encodings doesn't change.

 .. todo:: add an explicit list of top3 in 2010

@@ -101,13 +101,11 @@ Handle undecodable bytes and unencodable characters
 Undecodable byte sequences
 ''''''''''''''''''''''''''

-When a :ref:`byte string <bytes>` is :ref:`decoded <decode>` from an encoding, the decoder may
+When a :ref:`byte string <bytes>` is :ref:`decoded <decode>`, the decoder may
 fail to decode a specific byte sequence. For example, ``0x61 0x62 0x63 0xE9``
 is not decodable from :ref:`ASCII` nor :ref:`UTF-8`, but it is decodable from
 :ref:`ISO-8859-1`.

-.. TODO:: NELLE "is decoded from an encoding" => "is decoded"
-
 Some encodings are able to decode any byte sequences. All encodings of the
 :ref:`ISO-8859 family <ISO-8859>` have this property, because all of the 256
 code points of these 8 bits encodings are assigned.
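(Aside, not part of the commit: the example bytes from this hunk, checked in Python:)

    # 0x61 0x62 0x63 0xE9 fails to decode from ASCII and UTF-8, but not from ISO-8859-1
    data = b"\x61\x62\x63\xe9"
    for encoding in ("ascii", "utf-8", "iso-8859-1"):
        try:
            print(encoding, repr(data.decode(encoding)))
        except UnicodeDecodeError as err:
            print(encoding, "failed:", err.reason)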

good_practices.rst (+4 -4)

@@ -15,8 +15,8 @@ To limit or avoid issues with Unicode, try to follow these rules:
 * always store and manipulate text as :ref:`character strings <str>`
 * if you have to encode text and you can choose the encoding: prefer the :ref:`UTF-8` encoding.
 It is able to encode all Unicode 6.0 characters (including :ref:`non-BMP
-characters <bmp>`), has no endian issue, is well supported by most
-programs, and its good compromise is size.
+characters <bmp>`), does not depend on endianness, is well supported by most
+programs, and its size is a good compromise.

 .. TODO:: problem grammatical dans la dernière phrase du dernier point

@@ -58,7 +58,7 @@ arguments and environment variables). The ``unicodedata`` module is a first
 step for a full Unicode support.

 Most UNIX and Windows programs don't support Unicode. Firefox web browser and
-OpenOffice.org office suite have a full Unicode support. Slowly, more and more programs
+OpenOffice.org office suite have full Unicode support. Slowly, more and more programs
 have basic Unicode support.

 .. NELLE : juste en anecdote: OOo supporte complétement l'unicode, mais les
@@ -67,7 +67,7 @@ have basic Unicode support.

 Je pense qu'elle va être remise un jour ou un autre dans ces branches.

-Don't expect to have directly a full Unicode support: it requires a lot of work. Your
+Don't expect to have full Unicode support directly: it requires a lot of work. Your
 project may be fully Unicode compliant for a specific task (e.g. :ref:`filenames <filename>`), but
 only have basic Unicode support for the other parts of the project.

guess_encoding.rst (+1 -1)

@@ -125,7 +125,7 @@ UTF-8.

 .. highlight:: c

-Example of a strict :ref:`C <c>` function to check if a string is encoded to
+Example of a strict :ref:`C <c>` function to check if a string is encoded with
 UTF-8. It rejects :ref:`overlong sequences <strict utf8 decoder>` (e.g. ``0xC0
 0x80``) and :ref:`surrogate characters <surrogates>` (e.g. ``0xED 0xB2 0x80``,
 U+DC80). ::
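(Aside, not part of the commit — the C function body itself lies outside this hunk. Python's strict UTF-8 decoder applies the same two checks, which can be verified with:)

    # both the overlong sequence and the encoded surrogate must be rejected
    for raw in (b"\xc0\x80", b"\xed\xb2\x80"):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(raw, "rejected:", err.reason)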

historical_encodings.rst (+7 -9)

@@ -5,18 +5,16 @@ Between 1950 and 2000, each manufacturer and each operating system created its
 own 8 bits encoding. The problem was that 8 bits (256 code points) are not
 enough to store any character, and so the encoding tries to fit the user's
 language. Most 8 bits encodings are able to encode multiple languages, usually
-geograpically close (e.g. ISO-8859-1 is intented for Western Europe).
+geographically close (e.g. ISO-8859-1 is intented for Western Europe).

 .. TODO:: NELLE : "the problem was" & "The problem is" est plus une expression
 francaise traduite: ce n'est pas faux grammaticallement en anglais, mais ne
 sonne pas juste:

 8 bits (256 code points) are not enought so store all (Unicode?) characters

-It was difficult to exchange documents of different languages, because if a
-document was encoded to an encoding different than the user encoding, it
-leaded to :ref:`mojibake <mojibake>`.
-
+It was difficult to exchange documents with different languages, because using an
+invalid encoding while loading the document leads to :ref:`mojibake <mojibake>`.

 .. TODO:: NELLE : un exemple serait le bienvenu

@@ -89,9 +87,9 @@ Year Norm Description Variant
 1987 ISO 8859-2 Central European: Croatian, Polish, Czech, … cp1250
 1988 ISO 8859-3 South European: Turkish and Esperanto -
 1988 ISO 8859-4 North European -
-1988 ISO 8859-5 Latin/Cyrillic: Macadonian, Russian, … KOI family
+1988 ISO 8859-5 Latin/Cyrillic: Macedonian, Russian, … KOI family
 1987 ISO 8859-6 Latin/Arabic: Arabic language characters cp1256
-1987 ISO 8859-7 Latin/Greek: modern greek language cp1253
+1987 ISO 8859-7 Latin/Greek: modern Greek language cp1253
 1988 ISO 8859-8 Latin/Hebrew: modern Hebrew alphabet cp1255
 1989 ISO 8859-9 Turkish: Largely the same as ISO 8859-1 cp1254
 1992 ISO 8859-10 Nordic: a rearrangement of Latin-4 -
@@ -172,8 +170,8 @@ cp1252
 ''''''

 Windows :ref:`code page <codepage>` 1252, best known as cp1252, is a variant
-of :ref:`ISO-8859-1`. It is the default encoding of all English and western
-europe Windows setups. It is used as a fallback by web browsers if the webpage
+of :ref:`ISO-8859-1`. It is the default encoding of all English and Western
+Europe Windows setups. It is used as a fallback by web browsers if the webpage
 doesn't provide any encoding information (not in HTML, nor in HTTP).

 cp1252 shares 224 code points with ISO-8859-1, the range 0x80—0x9F (32
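(Aside, not part of the commit — the hunk is cut off mid-sentence by the diff context. A small Python check of how the 0x80—0x9F range differs between the two encodings:)

    # the same byte decodes to different characters in cp1252 and ISO-8859-1
    byte = b"\x80"
    print(byte.decode("cp1252"))                 # '€' (U+20AC)
    print(hex(ord(byte.decode("iso-8859-1"))))   # 0x80, a C1 control character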

issues.rst (+3 -3)

@@ -7,7 +7,7 @@ Security vulnerabilities
 Special characters
 ''''''''''''''''''

-Fullwidth (U+FF01—U+FF60) and halfwidth (U+FF61—U+FFEE) characters has been
+Fullwidth (U+FF01—U+FF60) and halfwidth (U+FF61—U+FFEE) characters have been
 used in 2007 to bypass security checks. Examples with the :ref:`Unicode
 normalization <normalization>`:
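(Aside, not part of the commit — the examples themselves lie outside this hunk. A minimal Python sketch of the idea: a naive filter that only looks for the ASCII characters misses the fullwidth forms, but NFKC normalization folds them back:)

    import unicodedata
    payload = "\uff1cscript\uff1e"                 # fullwidth < and > (U+FF1C, U+FF1E)
    print(unicodedata.normalize("NFKC", payload))  # '<script>'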
@@ -32,7 +32,7 @@ IDS/IPS/WAF Bypass Vulnerability
 Non-strict UTF-8 decoder: overlong byte sequences and surrogates
 ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

-An :ref:`UTF-8 <utf8>` :ref:`decoder <decode>` have to reject overlong byte sequences, or an attacker can use
+An :ref:`UTF-8 <utf8>` :ref:`decoder <decode>` has to reject overlong byte sequences, or an attacker can use
 them to bypass security checks (e.g. check rejecting string containing nul bytes,
 ``0x00``). For example, ``0xC0 0x80`` byte sequence must raise an error and
 not be decoded as U+0000, and "." (U+002E) can be encoded to ``0xC0 0xAE`` (two
@@ -57,7 +57,7 @@ to encode and decode surrogates.

 .. note::

-The :ref:`Java <java>` and Tcl languages uses a variant of :ref:`UTF-8`
+The :ref:`Java <java>` and Tcl languages use a variant of :ref:`UTF-8`
 which encodes the nul character (U+0000) as the overlong byte sequence
 ``0xC0 0x80``, instead of ``0x00``, for practical reasons.

nightmare.rst (+3 -3)

@@ -4,7 +4,7 @@ Unicode nightmare
 :ref:`Unicode <unicode>` is the nightmare of many developers (and users) for
 different, and sometimes good reasons.

-In the 1980's, only few people read documents in languages other their mother
+In the 1980s, only few people read documents in languages other their mother
 tongue and English. A computer supported only a small number of
 languages, the user configured his region to support languages of close
 countries. Memories and disks were expensive, all applications were written to
@@ -30,14 +30,14 @@ whereas UNIX and BSD operating systems use bytes.

 Mixing documents stored as bytes is possible, even if they use different
 encodings, but leads to :ref:`mojibake <mojibake>`. Because libraries and programs do also ignore
-encode and decode :ref:`warnings or errors <errors>`, write a single character with a diacritic
+encode and decode :ref:`warnings or errors <errors>`, writing a single character with a diacritic
 (any non-:ref:`ASCII` character) is sometimes enough to get an error.

 Full Unicode support is complex because the Unicode charset is bigger than any
 other charset. For example, :ref:`ISO-8859-1` contains 256 code points including 191
 characters, whereas Unicode version 6.0 contains :ref:`248,966
 assigned code points <unicode stats>`. The Unicode standard is larger than just a
-charset: it explains also how to display characters (e.g. left-to-right for
+charset: it also explains how to display characters (e.g. left-to-right for
 English and right-to-left for persian), how to :ref:`normalize <normalization>` a :ref:`character string <str>`
 (e.g. precomposed characters versus the decomposed form), etc.

operating_systems.rst (+2 -2)

@@ -325,9 +325,9 @@ If the console is unable to render a character, it tries to use a
 replacment character can be found, "?" (U+003F) is displayed instead.

 In a console (``cmd.exe``), ``chcp`` command can be used to display or to
-change the :ref:`OEM code page <codepage>` (and console code page). Change the
+change the :ref:`OEM code page <codepage>` (and console code page). Changing the
 console code page is not a good idea because the ANSI API of the console still
-expect characters encoded to the previous console code page.
+expects characters encoded to the previous console code page.

 .. seealso::

programming_languages.rst (+6 -6)

@@ -229,7 +229,7 @@ synchronization between C (``std*``) and C++ (``std::c*``) streams using: ::
 Python
 ------

-Python supports Unicode since its version 2.0 released in october 2000.
+Python supports Unicode since its version 2.0 released in October 2000.
 :ref:`Byte <bytes>` and :ref:`Unicode <str>` strings store their length, so
 it's possible to embed nul byte/character.

@@ -263,7 +263,7 @@ In Python 2, ``str + unicode`` gives ``unicode``: the byte string is
 because it was the source of a lot of confusion. At the same time, it was not
 possible to switch completely to Unicode in 2000: computers were slower and
 there were fewer Python core developers. It took 8 years to switch completely to
-Unicode: Python 3 was relased in december 2008.
+Unicode: Python 3 was relased in December 2008.

 Narrow build of Python 2 has a partial support of :ref:`non-BMP <bmp>`
 characters. The unichr() function raises an error for code bigger than U+FFFF,
@@ -331,7 +331,7 @@ It is possible to make Python 2 behave more like Python 3 with
 Codecs
 ''''''

-The ``codecs`` and ``encodings`` module provide text encodings. They supports a lot of
+The ``codecs`` and ``encodings`` modules provide text encodings. They support a lot of
 encodings. Some examples: ASCII, ISO-8859-1, UTF-8, UTF-16-LE,
 ShiftJIS, Big5, cp037, cp950, EUC_JP, etc.
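(Aside, not part of the commit: the codec names listed in this hunk can be resolved through the codec registry:)

    import codecs
    # every name below maps to a registered codec in CPython
    for name in ("ascii", "iso-8859-1", "utf-8", "utf-16-le", "shift_jis", "big5"):
        print(name, "->", codecs.lookup(name).name)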
@@ -411,7 +411,7 @@ filesystem encoding, ``sys.getfilesystemencoding()``:
 Python uses the ``strict`` :ref:`error handler <errors>` in Python 2, and
 ``surrogateescape`` (PEP 383) in Python 3. In Python 2, if ``os.listdir(u'.')``
 cannot decode a filename, it keeps the bytes filename unchanged. Thanks to
-``surrogateescape``, decode a filename does never fail in Python 3. But
+``surrogateescape``, decoding a filename never fails in Python 3. But
 encoding a filename can fail in Python 2 and 3 depending on the filesystem
 encoding. For example, on Linux with the C locale, the Unicode filename
 ``"h\xe9.py"`` cannot be encoded because the filesystem encoding is ASCII.
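(Aside, not part of the commit: what the corrected sentence means in practice, sketched with an explicit ASCII encoding standing in for the filesystem encoding:)

    # decoding with surrogateescape never fails and round-trips the original bytes
    raw = b"h\xe9.py"
    name = raw.decode("ascii", "surrogateescape")
    print(name)                                            # 'h\udce9.py'
    print(name.encode("ascii", "surrogateescape") == raw)  # True
    try:
        name.encode("ascii")       # strict encoding of the same name can still fail
    except UnicodeEncodeError as err:
        print("encode failed:", err.reason)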
@@ -496,7 +496,7 @@ In PHP 5, a literal string (e.g. ``"abc"``) is a :ref:`byte string <bytes>`.
 PHP has no :ref:`character string <str>` type, only a "string" type which is a
 :ref:`byte string <bytes>`.

-PHP have "multibyte" functions to manipulate byte strings using their encoding.
+PHP has "multibyte" functions to manipulate byte strings using their encoding.
 These functions have an optional encoding argument. If the encoding is not
 specified, PHP uses the default encoding (called "internal encoding"). Some
 multibyte functions:
@@ -522,7 +522,7 @@ process byte strings as UTF-8 encoded strings.

 .. todo:: u flag: instead of which encoding?

-PHP includes also a binding of the :ref:`iconv <iconv>` library.
+PHP also includes a binding for the :ref:`iconv <iconv>` library.

 * ``iconv()``: :ref:`decode <decode>` a :ref:`byte string <bytes>` from an
 encoding and :ref:`encode <encode>` to another encoding, you can use

unicode.rst (+3 -3)

@@ -95,8 +95,8 @@ Normalization
 Unicode standard explains how to decompose a character. For example, the precomposed
 character ç (U+00C7, Latin capital letter C with cedilla) can be written as
 the sequence of two characters: {¸ (U+0327, Combining cedilla), c (U+0043, Latin capital letter C)}.
-This decomposition can be useful to search a substring in a
-text, e.g. remove diacritic is pratical for the user. The decomposed form is
+This decomposition can be useful when searching for a substring in a
+text, e.g. removing the diacritic is pratical for the user. The decomposed form is
 called Normal Form D (**NFD**) and the precomposed form is called Normal Form
 C (**NFC**).
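(Aside, not part of the commit: the NFD/NFC round trip for the character discussed in this hunk, using the standard library:)

    import unicodedata
    nfc = "\u00c7"                                   # precomposed form
    nfd = unicodedata.normalize("NFD", nfc)          # decomposed form
    print(len(nfc), len(nfd))                        # 1 code point vs. 2 code points
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True: recomposition restores NFC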
@@ -108,7 +108,7 @@ C (**NFC**).
 | NFD  | ¸c     | {U+0327, U+0043} |
 +------+--------+------------------+

-Unicode database contains also a compatibility layer: if a character cannot be
+Unicode database also contains a compatibility layer: if a character cannot be
 rendered (no font contain the requested character) or encoded to a specific
 encoding, Unicode proposes a :ref:`replacment character sequence which looks
 like the character <translit>`, but may have a different meaning.

unicode_encodings.rst (+7 -7)

@@ -71,11 +71,11 @@ code points, whereas UCS-2 is limited to :ref:`BMP <bmp>` characters. These
 encodings are practical because the length in units is the number of
 characters.

-**UTF-16** and **UTF-32** encodings use, respectivelly, 16 and 32 bits units.
+**UTF-16** and **UTF-32** encodings use, respectively, 16 and 32 bits units.
 UTF-16 encodes code points bigger than U+FFFF using two units: a
 :ref:`surrogate pair <surrogates>`. UCS-2 can be :ref:`decoded <decode>` from UTF-16. UTF-32
 is also supposed to use more than one unit for big code points, but in
-practical, it only requires one unit to store all code points of Unicode 6.0.
+practice, it only requires one unit to store all code points of Unicode 6.0.
 That's why UTF-32 and UCS-4 are the same encoding.

 +----------+-----------+-----------------+
@@ -116,10 +116,10 @@ Byte order marks (BOM)
 ----------------------

 :ref:`UTF-16 <utf16>` and :ref:`UTF-32 <utf32>` use units bigger than 8 bits,
-and so hit endian issue. A single unit can be stored in the big endian (most
-significant bits first) or little endian (less significant bits first). BOM
-are short byte sequences to indicate the encoding and the endian. It's the
-U+FEFF code point encoded to the UTF encodings.
+and so are sensitive to endianness. A single unit can be stored as big endian (most
+significant bits first) or little endian (less significant bits first). BOM
+is a short byte sequence to indicate the encoding and the endian. It's the
+U+FEFF code point encoded with the given UTF encoding.

 Unicode defines 6 different BOM:
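(Aside, not part of the commit — the BOM table itself lies outside this hunk. Since a BOM is just U+FEFF run through an encoding, the byte sequences can be reproduced directly:)

    # each BOM is simply U+FEFF encoded with the corresponding UTF encoding
    for encoding in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        print(encoding, "\ufeff".encode(encoding))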
@@ -147,7 +147,7 @@ UTF-32 in the host endian without BOM. On Windows, "UTF-16" usually means
 UTF-16-LE.

 Some Windows applications, like notepad.exe, use UTF-8 BOM, whereas many
-applications are unable to detect the BOM, and so the BOM causes troubles.
+applications are unable to detect the BOM, and so the BOM causes trouble.
 UTF-8 BOM should not be used for better interoperability.

 .. todo:: which troubles?
