Commit 02b5a6f

Minor typo and/or grammar fixes
1 parent e9b83b9 commit 02b5a6f

11 files changed: +48 -52 lines

definitions.rst (+9 -9)

@@ -78,8 +78,8 @@ different implementations of character strings:
 * array of 16 bits unsigned integers with :ref:`surrogate pairs
 <surrogates>` (:ref:`UTF-16 <utf16>`): full Unicode range

-UCS-4 use twice as much memory than UCS-2, but it supports all Unicode
-character. UTF-16 is a compromise between UCS-2 and UCS-4: characters in the
+UCS-4 uses twice as much memory than UCS-2, but it supports all Unicode
+characters. UTF-16 is a compromise between UCS-2 and UCS-4: characters in the
 BMP range use one UTF-16 unit (16 bits), characters outside this range use two
 UTF-16 units (a :ref:`surrogate pair <surrogates>`, 32 bits). This advantage is
 also the main disadvantage of this kind of character string.
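(Aside, not part of the commit: a minimal Python sketch of the point corrected above — a BMP character fits in a single UTF-16 unit, while a non-BMP character needs a surrogate pair, i.e. two units.)

    # count UTF-16 units by encoding to UTF-16-LE and dividing the byte length by 2
    bmp = "\u00e9"          # é, inside the BMP
    non_bmp = "\U0001F600"  # a character outside the BMP
    print(len(bmp.encode("utf-16-le")) // 2)      # 1 unit
    print(len(non_bmp.encode("utf-16-le")) // 2)  # 2 units: a surrogate pair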
@@ -116,7 +116,7 @@ to :ref:`ASCII <ascii>` is called an "ASCII encoded string", or simply an
 "ASCII string".

 The :ref:`character range <charset>` supported by a byte string depends on its
-encoding, because an encoding is associated to a :ref:`charset <charset>`. For
+encoding, because an encoding is associated with a :ref:`charset <charset>`. For
 example, an ASCII string can only store characters in the range U+0000—U+007F.

 The encoding is not stored explicitly in a byte string. If the encoding is not
@@ -168,21 +168,21 @@ An **encoding** describes how to :ref:`encode <encode>` :ref:`code points <code
 point>` to bytes and how to :ref:`decode <decode>` :ref:`bytes <bytes>` to code
 points.

-An encoding is always associated to a :ref:`charset <charset>`. For example,
-the UTF-8 encoding is associated to the Unicode charset. So we can say that an
+An encoding is always associated with a :ref:`charset <charset>`. For example,
+the UTF-8 encoding is associated with the Unicode charset. So we can say that an
 encoding :ref:`encodes <encode>` characters to bytes and decode bytes to characters, or more
 generally, it encodes a :ref:`character string <str>` to a :ref:`byte string
 <bytes>` and decodes a byte string to a character string.

-The 7 and 8 bits charsets have most simple encoding: store a code point as a
-single byte. These charsets are also called encodings, it is easy to confuse
+The 7 and 8 bits charsets have the simplest encoding: store a code point as a
+single byte. Since these charsets are also called encodings, it is easy to confuse
 them. The best example is the :ref:`ISO-8859-1 encoding <ISO-8859-1>`: all of
 the 256 possible bytes are considered as 8 bit code points (0 through 255) and
-are associated to characters. For example, the character A (U+0041) has the
+are mapped to characters. For example, the character A (U+0041) has the
 code point 65 (0x41 in hexadecimal) and is stored as the byte ``0x41``.

 Charsets with more than 256 entries cannot encode all code points into a single
-byte. The encoding encode all code points into byte sequences of the same
+byte. The encoding encodes all code points into byte sequences of the same
 length or of variable length. For example, :ref:`UTF-8` is a variable length
 encoding: code points lower than 128 use a single byte, whereas higher code
 points take 2, 3 or 4 bytes. The :ref:`UCS-2 <ucs2>` encoding encodes all
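(Aside, not part of the commit — the hunk is cut off here by the diff context. A quick Python illustration of the variable-length behaviour described above:)

    # UTF-8 uses 1 byte below U+0080, then 2, 3 or 4 bytes for higher code points
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(repr(ch), [hex(b) for b in ch.encode("utf-8")])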

encodings.rst (+3 -5)

@@ -8,7 +8,7 @@ Encodings

 There are many encodings around the world. Before Unicode, each manufacturer
 invented its own encoding to fit its client market and its usage. Most
-encodings are incompatible on at least one code, except some exceptions.
+encodings are incompatible on at least one code, with some exceptions.
 A document stored in :ref:`ASCII` can be read using :ref:`ISO-8859-1` or
 UTF-8, because ISO-8859-1 and UTF-8 are supersets of ASCII. Each encoding can
 have multiple aliases, examples:
@@ -55,7 +55,7 @@ these numbers should be reliable. In 2001, the most used encodings were:

 In december 2007, for the first time: :ref:`UTF-8` becomes the most used encoding
 (near 25%). In january 2010, UTF-8 was close to 50%, and ASCII and Western
-Europe encodings were near 20%. The usage of the other encodings don't change.
+Europe encodings were near 20%. The usage of other encodings doesn't change.

 .. todo:: add an explicit list of top3 in 2010

@@ -101,13 +101,11 @@ Handle undecodable bytes and unencodable characters
 Undecodable byte sequences
 ''''''''''''''''''''''''''

-When a :ref:`byte string <bytes>` is :ref:`decoded <decode>` from an encoding, the decoder may
+When a :ref:`byte string <bytes>` is :ref:`decoded <decode>`, the decoder may
 fail to decode a specific byte sequence. For example, ``0x61 0x62 0x63 0xE9``
 is not decodable from :ref:`ASCII` nor :ref:`UTF-8`, but it is decodable from
 :ref:`ISO-8859-1`.

-.. TODO:: NELLE "is decoded from an encoding" => "is decoded"
-
 Some encodings are able to decode any byte sequences. All encodings of the
 :ref:`ISO-8859 family <ISO-8859>` have this property, because all of the 256
 code points of these 8 bits encodings are assigned.
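(Aside, not part of the commit: the example bytes from this hunk, checked in Python:)

    # 0x61 0x62 0x63 0xE9 fails to decode from ASCII and UTF-8, but not from ISO-8859-1
    data = b"\x61\x62\x63\xe9"
    for encoding in ("ascii", "utf-8", "iso-8859-1"):
        try:
            print(encoding, repr(data.decode(encoding)))
        except UnicodeDecodeError as err:
            print(encoding, "failed:", err.reason)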

good_practices.rst (+4 -4)

@@ -15,8 +15,8 @@ To limit or avoid issues with Unicode, try to follow these rules:
 * always store and manipulate text as :ref:`character strings <str>`
 * if you have to encode text and you can choose the encoding: prefer the :ref:`UTF-8` encoding.
 It is able to encode all Unicode 6.0 characters (including :ref:`non-BMP
-characters <bmp>`), has no endian issue, is well supported by most
-programs, and its good compromise is size.
+characters <bmp>`), does not depend on endianness, is well supported by most
+programs, and its size is a good compromise.

 .. TODO:: problem grammatical dans la dernière phrase du dernier point

@@ -58,7 +58,7 @@ arguments and environment variables). The ``unicodedata`` module is a first
 step for a full Unicode support.

 Most UNIX and Windows programs don't support Unicode. Firefox web browser and
-OpenOffice.org office suite have a full Unicode support. Slowly, more and more programs
+OpenOffice.org office suite have full Unicode support. Slowly, more and more programs
 have basic Unicode support.

 .. NELLE : juste en anecdote: OOo supporte complétement l'unicode, mais les
@@ -67,7 +67,7 @@ have basic Unicode support.

 Je pense qu'elle va être remise un jour ou un autre dans ces branches.

-Don't expect to have directly a full Unicode support: it requires a lot of work. Your
+Don't expect to have full Unicode support directly: it requires a lot of work. Your
 project may be fully Unicode compliant for a specific task (e.g. :ref:`filenames <filename>`), but
 only have basic Unicode support for the other parts of the project.

guess_encoding.rst (+1 -1)

@@ -125,7 +125,7 @@ UTF-8.

 .. highlight:: c

-Example of a strict :ref:`C <c>` function to check if a string is encoded to
+Example of a strict :ref:`C <c>` function to check if a string is encoded with
 UTF-8. It rejects :ref:`overlong sequences <strict utf8 decoder>` (e.g. ``0xC0
 0x80``) and :ref:`surrogate characters <surrogates>` (e.g. ``0xED 0xB2 0x80``,
 U+DC80). ::
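(Aside, not part of the commit — the C function body itself lies outside this hunk. Python's strict UTF-8 decoder applies the same two checks, which can be verified with:)

    # both the overlong sequence and the encoded surrogate must be rejected
    for raw in (b"\xc0\x80", b"\xed\xb2\x80"):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(raw, "rejected:", err.reason)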

historical_encodings.rst (+7 -9)

@@ -5,18 +5,16 @@ Between 1950 and 2000, each manufacturer and each operating system created its
 own 8 bits encoding. The problem was that 8 bits (256 code points) are not
 enough to store any character, and so the encoding tries to fit the user's
 language. Most 8 bits encodings are able to encode multiple languages, usually
-geograpically close (e.g. ISO-8859-1 is intented for Western Europe).
+geographically close (e.g. ISO-8859-1 is intented for Western Europe).

 .. TODO:: NELLE : "the problem was" & "The problem is" est plus une expression
 francaise traduite: ce n'est pas faux grammaticallement en anglais, mais ne
 sonne pas juste:

 8 bits (256 code points) are not enought so store all (Unicode?) characters

-It was difficult to exchange documents of different languages, because if a
-document was encoded to an encoding different than the user encoding, it
-leaded to :ref:`mojibake <mojibake>`.
-
+It was difficult to exchange documents with different languages, because using an
+invalid encoding while loading the document leads to :ref:`mojibake <mojibake>`.

 .. TODO:: NELLE : un exemple serait le bienvenu

@@ -89,9 +87,9 @@ Year Norm Description Variant
 1987 ISO 8859-2 Central European: Croatian, Polish, Czech, … cp1250
 1988 ISO 8859-3 South European: Turkish and Esperanto -
 1988 ISO 8859-4 North European -
-1988 ISO 8859-5 Latin/Cyrillic: Macadonian, Russian, … KOI family
+1988 ISO 8859-5 Latin/Cyrillic: Macedonian, Russian, … KOI family
 1987 ISO 8859-6 Latin/Arabic: Arabic language characters cp1256
-1987 ISO 8859-7 Latin/Greek: modern greek language cp1253
+1987 ISO 8859-7 Latin/Greek: modern Greek language cp1253
 1988 ISO 8859-8 Latin/Hebrew: modern Hebrew alphabet cp1255
 1989 ISO 8859-9 Turkish: Largely the same as ISO 8859-1 cp1254
 1992 ISO 8859-10 Nordic: a rearrangement of Latin-4 -
@@ -172,8 +170,8 @@ cp1252
 ''''''

 Windows :ref:`code page <codepage>` 1252, best known as cp1252, is a variant
-of :ref:`ISO-8859-1`. It is the default encoding of all English and western
-europe Windows setups. It is used as a fallback by web browsers if the webpage
+of :ref:`ISO-8859-1`. It is the default encoding of all English and Western
+Europe Windows setups. It is used as a fallback by web browsers if the webpage
 doesn't provide any encoding information (not in HTML, nor in HTTP).

 cp1252 shares 224 code points with ISO-8859-1, the range 0x80—0x9F (32
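(Aside, not part of the commit — the hunk is cut off mid-sentence by the diff context. A small Python check of how the 0x80—0x9F range differs between the two encodings:)

    # the same byte decodes to different characters in cp1252 and ISO-8859-1
    byte = b"\x80"
    print(byte.decode("cp1252"))                 # '€' (U+20AC)
    print(hex(ord(byte.decode("iso-8859-1"))))   # 0x80, a C1 control character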

issues.rst (+3 -3)

@@ -7,7 +7,7 @@ Security vulnerabilities
 Special characters
 ''''''''''''''''''

-Fullwidth (U+FF01—U+FF60) and halfwidth (U+FF61—U+FFEE) characters has been
+Fullwidth (U+FF01—U+FF60) and halfwidth (U+FF61—U+FFEE) characters have been
 used in 2007 to bypass security checks. Examples with the :ref:`Unicode
 normalization <normalization>`:
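(Aside, not part of the commit — the examples themselves lie outside this hunk. A minimal Python sketch of the idea: a naive filter that only looks for the ASCII characters misses the fullwidth forms, but NFKC normalization folds them back:)

    import unicodedata
    payload = "\uff1cscript\uff1e"                 # fullwidth < and > (U+FF1C, U+FF1E)
    print(unicodedata.normalize("NFKC", payload))  # '<script>'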
@@ -32,7 +32,7 @@ IDS/IPS/WAF Bypass Vulnerability
 Non-strict UTF-8 decoder: overlong byte sequences and surrogates
 ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

-An :ref:`UTF-8 <utf8>` :ref:`decoder <decode>` have to reject overlong byte sequences, or an attacker can use
+An :ref:`UTF-8 <utf8>` :ref:`decoder <decode>` has to reject overlong byte sequences, or an attacker can use
 them to bypass security checks (e.g. check rejecting string containing nul bytes,
 ``0x00``). For example, ``0xC0 0x80`` byte sequence must raise an error and
 not be decoded as U+0000, and "." (U+002E) can be encoded to ``0xC0 0xAE`` (two
@@ -57,7 +57,7 @@ to encode and decode surrogates.

 .. note::

-The :ref:`Java <java>` and Tcl languages uses a variant of :ref:`UTF-8`
+The :ref:`Java <java>` and Tcl languages use a variant of :ref:`UTF-8`
 which encodes the nul character (U+0000) as the overlong byte sequence
 ``0xC0 0x80``, instead of ``0x00``, for practical reasons.

nightmare.rst (+3 -3)

@@ -4,7 +4,7 @@ Unicode nightmare
 :ref:`Unicode <unicode>` is the nightmare of many developers (and users) for
 different, and sometimes good reasons.

-In the 1980's, only few people read documents in languages other their mother
+In the 1980s, only few people read documents in languages other their mother
 tongue and English. A computer supported only a small number of
 languages, the user configured his region to support languages of close
 countries. Memories and disks were expensive, all applications were written to
@@ -30,14 +30,14 @@ whereas UNIX and BSD operating systems use bytes.

 Mixing documents stored as bytes is possible, even if they use different
 encodings, but leads to :ref:`mojibake <mojibake>`. Because libraries and programs do also ignore
-encode and decode :ref:`warnings or errors <errors>`, write a single character with a diacritic
+encode and decode :ref:`warnings or errors <errors>`, writing a single character with a diacritic
 (any non-:ref:`ASCII` character) is sometimes enough to get an error.

 Full Unicode support is complex because the Unicode charset is bigger than any
 other charset. For example, :ref:`ISO-8859-1` contains 256 code points including 191
 characters, whereas Unicode version 6.0 contains :ref:`248,966
 assigned code points <unicode stats>`. The Unicode standard is larger than just a
-charset: it explains also how to display characters (e.g. left-to-right for
+charset: it also explains how to display characters (e.g. left-to-right for
 English and right-to-left for persian), how to :ref:`normalize <normalization>` a :ref:`character string <str>`
 (e.g. precomposed characters versus the decomposed form), etc.

operating_systems.rst (+2 -2)

@@ -325,9 +325,9 @@ If the console is unable to render a character, it tries to use a
 replacment character can be found, "?" (U+003F) is displayed instead.

 In a console (``cmd.exe``), ``chcp`` command can be used to display or to
-change the :ref:`OEM code page <codepage>` (and console code page). Change the
+change the :ref:`OEM code page <codepage>` (and console code page). Changing the
 console code page is not a good idea because the ANSI API of the console still
-expect characters encoded to the previous console code page.
+expects characters encoded to the previous console code page.

 .. seealso::

programming_languages.rst (+6 -6)

@@ -229,7 +229,7 @@ synchronization between C (``std*``) and C++ (``std::c*``) streams using: ::
 Python
 ------

-Python supports Unicode since its version 2.0 released in october 2000.
+Python supports Unicode since its version 2.0 released in October 2000.
 :ref:`Byte <bytes>` and :ref:`Unicode <str>` strings store their length, so
 it's possible to embed nul byte/character.

@@ -263,7 +263,7 @@ In Python 2, ``str + unicode`` gives ``unicode``: the byte string is
 because it was the source of a lot of confusion. At the same time, it was not
 possible to switch completely to Unicode in 2000: computers were slower and
 there were fewer Python core developers. It took 8 years to switch completely to
-Unicode: Python 3 was relased in december 2008.
+Unicode: Python 3 was relased in December 2008.

 Narrow build of Python 2 has a partial support of :ref:`non-BMP <bmp>`
 characters. The unichr() function raises an error for code bigger than U+FFFF,
@@ -331,7 +331,7 @@ It is possible to make Python 2 behave more like Python 3 with
 Codecs
 ''''''

-The ``codecs`` and ``encodings`` module provide text encodings. They supports a lot of
+The ``codecs`` and ``encodings`` modules provide text encodings. They support a lot of
 encodings. Some examples: ASCII, ISO-8859-1, UTF-8, UTF-16-LE,
 ShiftJIS, Big5, cp037, cp950, EUC_JP, etc.
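(Aside, not part of the commit: the codec names listed in this hunk can be resolved through the codec registry:)

    import codecs
    # every name below maps to a registered codec in CPython
    for name in ("ascii", "iso-8859-1", "utf-8", "utf-16-le", "shift_jis", "big5"):
        print(name, "->", codecs.lookup(name).name)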
@@ -411,7 +411,7 @@ filesystem encoding, ``sys.getfilesystemencoding()``:
 Python uses the ``strict`` :ref:`error handler <errors>` in Python 2, and
 ``surrogateescape`` (PEP 383) in Python 3. In Python 2, if ``os.listdir(u'.')``
 cannot decode a filename, it keeps the bytes filename unchanged. Thanks to
-``surrogateescape``, decode a filename does never fail in Python 3. But
+``surrogateescape``, decoding a filename never fails in Python 3. But
 encoding a filename can fail in Python 2 and 3 depending on the filesystem
 encoding. For example, on Linux with the C locale, the Unicode filename
 ``"h\xe9.py"`` cannot be encoded because the filesystem encoding is ASCII.
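(Aside, not part of the commit: what the corrected sentence means in practice, sketched with an explicit ASCII encoding standing in for the filesystem encoding:)

    # decoding with surrogateescape never fails and round-trips the original bytes
    raw = b"h\xe9.py"
    name = raw.decode("ascii", "surrogateescape")
    print(name)                                            # 'h\udce9.py'
    print(name.encode("ascii", "surrogateescape") == raw)  # True
    try:
        name.encode("ascii")       # strict encoding of the same name can still fail
    except UnicodeEncodeError as err:
        print("encode failed:", err.reason)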
@@ -496,7 +496,7 @@ In PHP 5, a literal string (e.g. ``"abc"``) is a :ref:`byte string <bytes>`.
 PHP has no :ref:`character string <str>` type, only a "string" type which is a
 :ref:`byte string <bytes>`.

-PHP have "multibyte" functions to manipulate byte strings using their encoding.
+PHP has "multibyte" functions to manipulate byte strings using their encoding.
 These functions have an optional encoding argument. If the encoding is not
 specified, PHP uses the default encoding (called "internal encoding"). Some
 multibyte functions:
@@ -522,7 +522,7 @@ process byte strings as UTF-8 encoded strings.

 .. todo:: u flag: instead of which encoding?

-PHP includes also a binding of the :ref:`iconv <iconv>` library.
+PHP also includes a binding for the :ref:`iconv <iconv>` library.

 * ``iconv()``: :ref:`decode <decode>` a :ref:`byte string <bytes>` from an
 encoding and :ref:`encode <encode>` to another encoding, you can use

unicode.rst (+3 -3)

@@ -95,8 +95,8 @@ Normalization
 Unicode standard explains how to decompose a character. For example, the precomposed
 character ç (U+00C7, Latin capital letter C with cedilla) can be written as
 the sequence of two characters: {¸ (U+0327, Combining cedilla), c (U+0043, Latin capital letter C)}.
-This decomposition can be useful to search a substring in a
-text, e.g. remove diacritic is pratical for the user. The decomposed form is
+This decomposition can be useful when searching for a substring in a
+text, e.g. removing the diacritic is pratical for the user. The decomposed form is
 called Normal Form D (**NFD**) and the precomposed form is called Normal Form
 C (**NFC**).
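(Aside, not part of the commit: the NFD/NFC round trip for the character discussed in this hunk, using the standard library:)

    import unicodedata
    nfc = "\u00c7"                                   # precomposed form
    nfd = unicodedata.normalize("NFD", nfc)          # decomposed form
    print(len(nfc), len(nfd))                        # 1 code point vs. 2 code points
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True: recomposition restores NFC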
@@ -108,7 +108,7 @@ C (**NFC**).
 | NFD  | ¸c     | {U+0327, U+0043} |
 +------+--------+------------------+

-Unicode database contains also a compatibility layer: if a character cannot be
+Unicode database also contains a compatibility layer: if a character cannot be
 rendered (no font contain the requested character) or encoded to a specific
 encoding, Unicode proposes a :ref:`replacment character sequence which looks
 like the character <translit>`, but may have a different meaning.

unicode_encodings.rst (+7 -7)

@@ -71,11 +71,11 @@ code points, whereas UCS-2 is limited to :ref:`BMP <bmp>` characters. These
 encodings are practical because the length in units is the number of
 characters.

-**UTF-16** and **UTF-32** encodings use, respectivelly, 16 and 32 bits units.
+**UTF-16** and **UTF-32** encodings use, respectively, 16 and 32 bits units.
 UTF-16 encodes code points bigger than U+FFFF using two units: a
 :ref:`surrogate pair <surrogates>`. UCS-2 can be :ref:`decoded <decode>` from UTF-16. UTF-32
 is also supposed to use more than one unit for big code points, but in
-practical, it only requires one unit to store all code points of Unicode 6.0.
+practice, it only requires one unit to store all code points of Unicode 6.0.
 That's why UTF-32 and UCS-4 are the same encoding.

 +----------+-----------+-----------------+
@@ -116,10 +116,10 @@ Byte order marks (BOM)
 ----------------------

 :ref:`UTF-16 <utf16>` and :ref:`UTF-32 <utf32>` use units bigger than 8 bits,
-and so hit endian issue. A single unit can be stored in the big endian (most
-significant bits first) or little endian (less significant bits first). BOM
-are short byte sequences to indicate the encoding and the endian. It's the
-U+FEFF code point encoded to the UTF encodings.
+and so are sensitive to endianness. A single unit can be stored as big endian (most
+significant bits first) or little endian (less significant bits first). BOM
+is a short byte sequence to indicate the encoding and the endian. It's the
+U+FEFF code point encoded with the given UTF encoding.

 Unicode defines 6 different BOM:
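(Aside, not part of the commit — the BOM table itself lies outside this hunk. Since a BOM is just U+FEFF run through an encoding, the byte sequences can be reproduced directly:)

    # each BOM is simply U+FEFF encoded with the corresponding UTF encoding
    for encoding in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        print(encoding, "\ufeff".encode(encoding))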
@@ -147,7 +147,7 @@ UTF-32 in the host endian without BOM. On Windows, "UTF-16" usually means
 UTF-16-LE.

 Some Windows applications, like notepad.exe, use UTF-8 BOM, whereas many
-applications are unable to detect the BOM, and so the BOM causes troubles.
+applications are unable to detect the BOM, and so the BOM causes trouble.
 UTF-8 BOM should not be used for better interoperability.

 .. todo:: which troubles?
