File specification for the diacritics database
- 1. What Information Must Be Collected
- 2. Master Branch
- 3. Dist Branch
1. What Information Must Be Collected
Since there's no trustworthy and complete source, all diacritic, ligature and symbol mapping information must be collected manually. Meta information must also be collected for each language, such as links to sources documenting the characters.
2. Master Branch
The master branch holds the source files.
The structure of this branch is divided into the following folders:
database/
├── spec/
└── src/
The spec/ folder is used to specify the structure.
The structure in the src/ folder can look like:
src/
├── de/
│ └── de.js
└── fr/
└── fr.js
If there are language variants it could look like:
src/
├── de/
│ ├── de.js
│ ├── at.js
│ └── ch.js
└── fr/
└── fr.js
NOTE: German does not have a language variant that differs from the root language; therefore the at.js and ch.js files do not actually exist.
- As there might be multiple files for a language, each language has its own folder.
- The language files in this folder will be in a .json format and may include comments (they will be removed before handling the contents).
- Folder and file names must be according to ISO 639-1 for the root language.
- Each language variant has its own file:
  - If the variants (e.g. de-DE, de-AT, de-CH) don't have any differences from the root language, then only the root file is necessary.
  - When creating a language variant file, all mappings need to be created in full (redundantly), not incrementally.
  - Any variant file will be contained within the root language folder.
  - The variant file will be named according to IETF language tag extended language (extlang) subtag. So, if a de_AT would be necessary, then the file would be named at.js; by the extended language subtag and in all lower-case.
  - See this table for a quick reference of available IETF language tags.
- Changes should only be done within the source files.
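The naming rules above can be sketched as a small lookup. This is only an illustration: the function name sourcePathFor is hypothetical and not part of the project, and it assumes tags of the form "de" or "de_AT" (or "de-AT").

```javascript
// Hypothetical helper (not part of the project): maps a language tag such as
// "de" or "de_AT" to the path of its source file, following the rules above.
function sourcePathFor(tag) {
  // Accept both "de_AT" and "de-AT" separators.
  const [root, variant] = tag.split(/[-_]/);
  if (!/^[A-Za-z]{2}$/.test(root || "")) {
    throw new Error(`Expected an ISO 639-1 root language code, got "${tag}"`);
  }
  const folder = root.toLowerCase();
  // Root files reuse the folder name; variant files use the lower-cased subtag.
  const file = (variant || root).toLowerCase();
  return `src/${folder}/${file}.js`;
}

console.log(sourcePathFor("de"));    // src/de/de.js
console.log(sourcePathFor("de_AT")); // src/de/at.js
```

Whether a variant file is needed at all still depends on the variant differing from the root language; the helper only computes where such a file would live.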
Files within the src/ folder are called language files.
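Because language files may include comments that are removed before the contents are handled, a loader needs a stripping step before JSON.parse. The following is a minimal sketch of one way to do that, not the project's actual build code; the name stripJsonComments is an assumption, and it handles only // and /* */ comments outside of string literals.

```javascript
// Removes // line comments and /* block */ comments that sit outside of
// string literals, so the remainder can be handed to JSON.parse.
function stripJsonComments(text) {
  let out = "";
  let inString = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    const next = text[i + 1];
    if (inString) {
      out += ch;
      if (ch === "\\") {
        // Copy the escaped character verbatim so \" does not end the string.
        out += next === undefined ? "" : next;
        i++;
      } else if (ch === '"') {
        inString = false;
      }
    } else if (ch === '"') {
      inString = true;
      out += ch;
    } else if (ch === "/" && next === "/") {
      // Skip to the end of the line, keeping the newline itself.
      while (i < text.length && text[i] !== "\n") i++;
      if (i < text.length) out += "\n";
    } else if (ch === "/" && next === "*") {
      // Skip to the closing */ (or to the end of input if it is missing).
      const end = text.indexOf("*/", i + 2);
      i = end === -1 ? text.length : end + 1;
    } else {
      out += ch;
    }
  }
  return out;
}

const source = `{
  // Where the mapping information was found
  "metadata": { "source": ["https://example.org/de//chars"] }, /* illustrative URL */
  "data": { "ü": { "case": "lower", "mapping": { "base": "u" } } }
}`;
const parsed = JSON.parse(stripJsonComments(source));
console.log(parsed.metadata.source[0]); // https://example.org/de//chars
```

Tracking string state matters here because source URLs contain "//", which a naive regular expression would treat as the start of a comment.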
Example structure for the German language source file, which only contains lower case characters for brevity:
{
  "metadata": {
    "alphabet": "Latn",
    "continent": [
      "EU"
    ],
    "language": "German",
    "languageNative": "Deutsch",
    "source": [
      "https://en.wikipedia.org/wiki/German_orthography#Special_characters"
    ]
  },
  "data": {
    "ü": {
      "case": "lower",
      "mapping": {
        "base": "u",
        "decompose": {
          "value": "ue"
        }
      }
    },
    "ö": {
      "case": "lower",
      "mapping": {
        "base": "o",
        "decompose": {
          "value": "oe"
        }
      }
    },
    "ä": {
      "case": "lower",
      "mapping": {
        "base": "a",
        "decompose": {
          "value": "ae"
        }
      }
    },
    "ß": {
      "case": "lower",
      "mapping": {
        "decompose": {
          "value": "ss"
        }
      }
    }
  }
}
metadata.alphabet
Required
Type: String
An alphabet code based on ISO 15924 that specifies the associated alphabet.
metadata.continent
Required
Type: Array
An array of continent codes based on ISO-3166 that specifies the associated continent.
metadata.language
Required
Type: String
The associated language written in English.
metadata.languageNative
Required
Type: String
The associated language written in the native language.
metadata.variant
Optional
Type: String
The associated language variant written in English, if applicable. For example, if the IETF language tag was de_AT, this entry would include the full name of the variant country, Austria for the AT variant.
Optional
Type: String
The associated language variant written in the native language, if applicable (the native-language counterpart of metadata.variant). Required in case a metadata.variant is provided.
metadata.source
Optional
Type: Array
An array containing links to sources for the diacritics, ligatures and symbols, which include mapping and other details.
data
Required
Type: Object
An object containing the actual mapping information. Every character has its own object key. The value for each diacritic, ligature or symbol is an object specified below:
data.{character}.case
Required
Type: String
The case property takes one of three values:
- 'upper' - Signifies that the character is an uppercase (capitalized) character.
- 'lower' - Signifies that the character is a lowercase character.
- 'none' - Signifies that the character has no case; e.g. punctuation (¿, ¡ and ⸘) and Arabic characters.
data.{character}.mapping
Required
Type: Object
An object containing mapping values for the given character. This must contain a base or decompose value, or both. It cannot be empty.
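The two rules stated so far (a valid case value, and a mapping that contains base, decompose or both) can be checked mechanically. The sketch below is illustrative only; validateEntry is a hypothetical name and the project may validate its files differently.

```javascript
// Returns a list of problems for one entry of the "data" object; an empty
// list means the entry satisfies the two rules described above.
function validateEntry(character, entry) {
  const problems = [];
  if (!["upper", "lower", "none"].includes(entry.case)) {
    problems.push(`${character}: "case" must be "upper", "lower" or "none"`);
  }
  const mapping = entry.mapping;
  if (mapping === null || typeof mapping !== "object") {
    problems.push(`${character}: "mapping" must be an object`);
  } else if (!("base" in mapping) && !("decompose" in mapping)) {
    problems.push(`${character}: "mapping" needs "base", "decompose" or both`);
  }
  return problems;
}

// A valid entry yields no problems; an empty mapping is reported.
console.log(validateEntry("\u00df", { case: "lower", mapping: { decompose: { value: "ss" } } }));
console.log(validateEntry("\u00fc", { case: "lower", mapping: {} }));
```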
data.{character}.mapping.base
Optional
Type: String
This is the base of the character (e.g. the diacritic ü has a base of u, an unaccented character).
data.{character}.mapping.decompose
Optional
Type: Object
This is the character, or combination of characters, used to represent the diacritic (e.g. ö decomposes into oe in German), ligature (e.g. æ decomposes into ae), or symbol (e.g. ‽ decomposes into ?!).
Each decompose value may be needed in lower case, upper case or title case. The entries for these forms are included within the decompose object and depend on the 'case' setting.
Optional
Type: String
This title-case value is only available when the character case is set to 'upper' and the decompose value comprises two or more characters. It is provided because the case of the first character may differ from that of the remaining character(s), as happens when a diacritic is the first character of a word written in title case. Include the necessary changes in this entry (e.g. Æ decomposes into Ae in title case).
data.{character}.mapping.decompose.value
Optional
Type: String
This value contains the decompose character or characters, in either all lower case or all upper case depending on the case setting. Include the necessary changes in this entry (e.g. æ decomposes into ae in all lower case, or Æ decomposes into AE in all upper case).
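To show how a consumer might use these mapping values, here is a minimal sketch that replaces each mapped character with its decompose value and falls back to its base. The function name transliterate and the preference for decompose over base are assumptions made for illustration; the title-case entry is deliberately not handled.

```javascript
// Replaces every character that has an entry in `data` with its decompose
// value, falling back to its base; characters without an entry are kept.
function transliterate(text, data) {
  return Array.from(text, (ch) => {
    const mapping = data[ch] && data[ch].mapping;
    if (!mapping) return ch;
    if (mapping.decompose && typeof mapping.decompose.value === "string") {
      return mapping.decompose.value;
    }
    return typeof mapping.base === "string" ? mapping.base : ch;
  }).join("");
}

// Keys are written as escapes so the code points are unambiguous.
const data = {
  "\u00fc": { case: "lower", mapping: { base: "u", decompose: { value: "ue" } } }, // ü
  "\u00df": { case: "lower", mapping: { decompose: { value: "ss" } } },            // ß
  "\u00e9": { case: "lower", mapping: { base: "e" } },                             // é
};
console.log(transliterate("Gr\u00fc\u00dfe", data)); // Gruesse
console.log(transliterate("caf\u00e9", data));       // cafe
```

A lookup like this only matches the exact code points used as keys, which is why the visually identical equivalents described for the dist branch also need to be mapped.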
3. Dist Branch
The dist branch contains data generated from the src/ files of the master branch, combined into a single file, diacritics.json. This file has a structure almost identical to that of the files within src/ and is valid .json, which cannot contain comments.
This distribution file is generated automatically by a build running on Travis CI whenever a pull request targeting the master branch is merged. This workflow was chosen to allow contributions without a Git clone or any use of the command line.
The file will be generated to a uniquely named folder, e.g. v1. In case the JSON structure changes, it's then possible to generate it to a different folder. This makes sure that components depending on the old file structure will still work.
While the dist branch contains all data, a server-side component is used to serve and filter them.
The structure will use the language folder name (ISO 639-1) as the key. This key will contain data for the root language and any variants. This structure is used to simplify access to the data through the API; see the API spec for more details.
Within each language entry:
- The root language code will be added as an entry that uses the same language code (e.g. "de": { "de": {...} }). The value will contain the internal file data (e.g. from de/de.js).
- Any variants of the language that differ from the root language will also be included. The value will contain the internal file data of the named variant file.
In this example, German and French languages are included:
{
  "de": {
    "de": {
      "metadata": {...},
      "data": {...}
    },
    "at": {
      "metadata": {...},
      "data": {...}
    }
  },
  "fr": {
    "fr": {
      "metadata": {...},
      "data": {...}
    }
  }
}
NOTE: German does not have a language variant that differs from the root, so the at entry would not exist in the final data. It is shown here only as an example!
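The grouping described above can be sketched as follows. This is not the project's build script; combineLanguageFiles is a hypothetical name, and the input is assumed to be an object whose keys are paths relative to src/ and whose values are the already parsed file contents.

```javascript
// Groups parsed language files by root language. Input keys are paths
// relative to src/, e.g. "de/de.js"; values are the parsed file contents.
function combineLanguageFiles(files) {
  const dist = {};
  for (const [path, content] of Object.entries(files)) {
    const match = /^([a-z]{2})\/([a-z]+)\.js$/.exec(path);
    if (!match) {
      throw new Error(`Unexpected language file path: ${path}`);
    }
    const [, language, name] = match;
    // The folder name becomes the outer key, the file name the inner key.
    dist[language] = dist[language] || {};
    dist[language][name] = content;
  }
  return dist;
}

const dist = combineLanguageFiles({
  "de/de.js": { metadata: { language: "German" }, data: {} },
  "fr/fr.js": { metadata: { language: "French" }, data: {} },
});
console.log(Object.keys(dist));    // [ 'de', 'fr' ]
console.log(Object.keys(dist.de)); // [ 'de' ]
```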
Besides combining all the data into one file, the build automatically adds visual equivalents to diacritics.json. Each diacritic, ligature or symbol may have several identical-looking characters that have different Unicode code points. To make sure the mapping process catches all of them, these equivalents need to be mapped too.
These equivalents are added to the language file under data.{character}.equivalents ({character} is a placeholder for a character).
Example (only showing one lower case diacritic):
{
  "de": {
    "de": {
      "metadata": {...},
      "data": {
        "ü": {
          "case": "lower",
          "mapping": {
            "base": "u",
            "decompose": {
              "value": "ue"
            }
          },
          "equivalents": [
            {
              "raw": "ü",
              "unicode": "\\u00fc",
              "htmlDecimal": "&#252;",
              "htmlHex": "&#xfc;",
              "encodedUri": "%C3%BC",
              "htmlEntity": "&uuml;"
            },
            {
              "raw": "ü",
              "unicode": "u\\u0308",
              "htmlDecimal": "u&#776;",
              "htmlHex": "u&#x308;",
              "encodedUri": "u%CC%88"
            }
          ]
        }
      }
    }
  }
}
data.{character}.equivalents
Required
Type: Array
An array of equivalent objects, consisting of the original character and any additional characters created through normalization.
data.{character}.equivalents.raw
Required
Type: String
Contains a rendered equivalent character (e.g. ü).
data.{character}.equivalents.unicode
Required
Type: String
Contains an escaped unicode (hex) value of the character (e.g. \u00f6).
data.{character}.equivalents.htmlDecimal
Required
Type: String
Contains a HTML entity in decimal format (e.g. &#246;).
data.{character}.equivalents.htmlHex
Required
Type: String
Contains a HTML entity in hex format (e.g. &#xf6;).
data.{character}.equivalents.htmlEntity
Optional
Type: String
Contains a named HTML entity if one exists (e.g. &ouml;) - ref.
data.{character}.equivalents.encodedUri
Required
Type: String
Contains a URL encoded value (e.g. %C3%B6) - see percent encoding.
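As an illustration of how such equivalents could be derived, the sketch below collects the Unicode normalization forms of a character and describes each one with the entries listed above. It is an assumption about the approach, not the project's build code: the names describeForm and buildEquivalents are hypothetical, htmlEntity is left out because it needs a lookup table of named entities, and code points above U+FFFF are not handled.

```javascript
// Builds one equivalents object for a given string form of a character.
function describeForm(raw) {
  const codePoints = Array.from(raw, (ch) => ch.codePointAt(0));
  // ASCII characters stay literal; everything else is written as an escape.
  const render = (escape) =>
    codePoints
      .map((cp) => (cp < 0x80 ? String.fromCodePoint(cp) : escape(cp)))
      .join("");
  return {
    raw,
    // Only correct for code points up to U+FFFF (assumption of this sketch).
    unicode: render((cp) => "\\u" + cp.toString(16).padStart(4, "0")),
    htmlDecimal: render((cp) => `&#${cp};`),
    htmlHex: render((cp) => `&#x${cp.toString(16)};`),
    encodedUri: encodeURIComponent(raw),
  };
}

// Starts with the original character, then adds each distinct result of the
// four standard normalization forms.
function buildEquivalents(character) {
  const forms = new Set([
    character,
    ...["NFC", "NFD", "NFKC", "NFKD"].map((form) => character.normalize(form)),
  ]);
  return Array.from(forms, describeForm);
}

// For U+00FC this yields two forms: the precomposed character and
// "u" followed by the combining diaeresis U+0308.
console.log(buildEquivalents("\u00fc"));
```

Using a Set keeps the original character first and drops normalization results that are identical to a form already collected.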
Required
Type: Array
An array of countries where the given language is officially spoken. This entry is automatically added by the build script, with language information obtained directly from Unicode's CLDR supplemental languageData and cross-referenced with supplemental territoryInfo databases.
The provided language tags are based on IETF BCP 47 with some minor differences; please see the CLDR spec for more details. For a quick reference, use this IETF-language-tags searchable table ("territory" column).