GNUAspell/aspell-lang
2010-11-21

This is an experimental version of the Aspell language toolkit
(temporary name). It can be used to create dictionaries for both
Aspell 0.50 and 0.60.

**********************************************************************
Getting Started
**********************************************************************

Since Aspell is 8-bit internally you need to first decide on a charset
to use. See the section "Provided Character Sets" for a list of
available character sets. If none of the character sets are adequate
then you need to create a new one for your language. If this is
necessary please email [email protected] for help.

Now cd to the location of the aspell-lang package (the directory this
file is in) and run

  ./pre LANG CHARSET

where LANG is the ISO language code for your language and CHARSET is
the charset you decided on. This will create a directory LANG with the
following files in it:

  info
  LANG.dat
  LANG.wl
  Copyright
  proc (symbolic link)

and possibly

  CHARSET.cset
  CHARSET.cmap
  misc/CHARSET.txt

Edit the "info" and "Copyright" files as appropriate. See the next
section for what these fields mean.

If you chose a charset which Aspell provides then the default encoding
will be that charset. If you would rather use "utf-8" then uncomment
the line "data-encoding utf-8" in LANG.dat.

Replace LANG.wl with a small word list for your language.

Now to build the word list:

  ./proc
  ./configure
  make

And if all goes well you should have a very basic dictionary for your
language. You can install it if you want using "make install". If you
want to make a dictionary package use "make dist".

Please see "Adding Support For Other Languages" in the Aspell manual
and the rest of this document for where to go from here.

When you have something ready to distribute, check over the
requirements section in this file, and once you are reasonably sure
you have something ready to upload send it to [email protected].
**********************************************************************
Draft Documentation on the layout of Aspell dicts packages
**********************************************************************

The overall goal of Aspell dicts is to provide a uniform method to
distribute dictionaries for Aspell for any language that Aspell
supports.

This documentation is still in an early stage and rather incomplete.
It is meant to give you enough of an overview so you know what is
going on, but probably won't be enough information for you to actually
create a distribution.

Layout of the Distribution:

An Aspell Word List Package contains several types of files, many of
them generated by the proc script.

These must be provided:

  info: the main file which contains all of the important word lists
  *.wl: word list files
  Copyright: the copyright notice
  ??.dat: The language data file

Several optional ones:

  additional language data files (must be listed under data-file)
  COPYING: The actual license agreement. Automatically provided for
    some licenses
  doc/*: additional documentation
  misc/*: other files to include in the distribution

and finally some automatically generated or provided ones:

  configure: the configure script which finds the appropriate paths
    and generates the actual makefile. This file needs to be copied
    from the aspell-gen package.
  ??.dat: the data file for the language.
  *.multi: the dictionary files
  Makefile.pre: the makefile which configure uses.

*** Format of the Info File

(Note: For a better idea of how this file is laid out see some of the
sample info files included)

The info file is the main file which contains most of the information.
It is expected to be in utf-8. It has two types of entries: single
value settings and group settings.

Single value settings have the form:

  <key> <value>

Group settings have the form:

  <group key>:
    <key> <value>
    <key> <value>
    ...

If there is ANY whitespace before a key it is assumed to belong to a
group entry.
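As an illustration of the two entry types, here is a minimal Python sketch of a reader for this layout. It is not the parser used by the proc script. The function name parse_info and the sample values are hypothetical, and the sketch assumes that a group header is a line consisting of a key followed by ':' with no value.

```python
def parse_info(text):
    """Split info-file text into single value settings and group settings.

    Returns (singles, groups), where singles is a list of (key, value)
    pairs and groups is a list of (group_key, [(key, value), ...]).
    Lists are used because keys such as "alias" or "add" may repeat.
    """
    singles = []
    groups = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        parts = raw.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if raw[0].isspace():
            # ANY leading whitespace means the entry belongs to a group.
            if not groups:
                raise ValueError("indented entry before any group: %r" % raw)
            groups[-1][1].append((key, value))
        elif key.endswith(":") and not value:
            # Assumption: "<group key>:" alone on a line opens a group.
            groups.append((key[:-1], []))
        else:
            singles.append((key, value))
    return singles, groups


# Hypothetical sample, not taken from a real dictionary package.
SAMPLE = """\
name_english Example
lang xx
author:
  name A. Nonymous
  email user at example org
dict:
  name xx
  add xx
"""

singles, groups = parse_info(SAMPLE)
print(singles)
print(groups)
```

Keeping the entries as ordered lists rather than dictionaries preserves repeated keys and multiple author or dict groups.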
The following single value settings are mandatory:

  name_english: The English name of the language
  lang: The language code
  copyright: The copyright, one of:
      LGPLv2.1
      LGPLv3
      GPLv2
      GPLv3
      FDLv1.1
      FDLv1.2
      Artistic
      Copyrighted (Copyright message must remain)
      Free Software (Meets FSF definition of free)
      Open Source (Meets OSI definition)
      Public Domain (i.e. none)
      Other
      Unknown
  version: A version string
  complete: "true" if the dictionary is reasonably complete, "almost"
    if it is close, "false" otherwise, or "unknown"
  accurate: "true" if the dictionary is accurate (i.e. every word is
    valid), "false" otherwise, or "unknown"

In addition there must be at least one of each of the following group
entries:

  author:
    name: The name of the author written using the Latin script,
      preferably spelled in English. Accents are allowed.
    name-native: The name of the author written in the native script
      and spelling.
    email: The email address of the author. The email needs to be
      translated into an anti-spam version: each '.' is replaced with
      a space and '@' is replaced with ' at '. For example
      "[email protected]" becomes "kevina at gnu org".
    maintainer: Set to 'true' if this person actively maintains the
      Aspell version of the word list. Set to 'false' or leave out
      otherwise.

  Multiple author groups may be specified.

  dict: The defining entry for a dictionary
    name: The name of this dict
    alias: An alternate name (may be repeated)
    add: A word list to add (may be repeated)

  Multiple dictionaries may be defined. If a particular dictionary
  should not have an awli entry associated with it add "awli false".
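The anti-spam translation required for the "email" key of an author group can be sketched in a few lines of Python. The function name antispam_email and the address used are hypothetical; only the two substitutions come from the description above.

```python
def antispam_email(address):
    """Return the anti-spam form of an email address for the info file."""
    # '@' becomes ' at ' and every '.' becomes a space.
    return address.replace("@", " at ").replace(".", " ")


# Hypothetical address used purely for illustration.
print(antispam_email("user@example.org"))  # prints: user at example org
```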
Dictionary names should be of the form

  <code>[_<country>][-<jargon>][-<size>]

where <country> is the two letter ISO 3166 country code, which should
be in all upper case, <jargon> is any extra information to distinguish
the dictionary from other dictionaries, and <size> is the dictionary
size, which should be a two digit number that roughly follows these
guidelines:

  10: tiny
  20: really small
  30: small
  40: med-small
  50: med
  60: med-large (the default size)
  70: large
  80: huge
  90: insane

See SCOWL (http://wordlist.sourceforge.net) for an example of how
these sizes are used.

Aliases for individual dictionaries can automatically be created if a
global alias line is defined. Each global alias represents a part of a
dictionary name. For example:

  alias fr francais french
  alias 40 sml small

will cause the following aliases to automatically be generated:

  francais-40 francais-sml francais-small
  french-40 french-sml french-small
  fr-sml fr-small

Aliases normally do not have awli entries associated with them. If you
wish a particular alias to have an awli entry simply tag ":awli" after
the alias. For example

  alias en_GB en:awli

If an alias has an awli entry associated with it the final alias must
be of the proper form.

In addition to the above the info file can also contain the following
optional entries:

  data-file: Additional language data files to be installed. May be
    given multiple times for more than one file.
  readme-extra: A text file in the doc/ directory to be appended to
    the end of the README file. If it is not in utf-8 then the
    encoding it is in should be specified after the file name
    (separated by a space).
  doc-encoding: The encoding the documentation should be in
  alt-encoding: Alternate encoding for documentation. Each entry
    should have the form "<encoding> <ext>".
  url: URL of the official version of the dictionary for Aspell
  source_url: URL of the original word list
  source_version: Version of the original word list used
  name_ascii: The language name spelled in its own language using
    only ASCII characters
  name_native: Like above but not limited to ASCII characters or the
    Latin script.
  copyright_desc: A BRIEF description of the copyright if the
    copyright line doesn't adequately describe it
  notes: A BRIEF description of any major problems with this
    dictionary, other than being incomplete or inaccurate, such as
    being too large.
  mode: Controls if the dictionary package will be created for Aspell
    0.50 or 0.60. Either "aspell5" or "aspell6". The default is
    "aspell6".

And a bunch of other entries which I will document later.

*** The *.wl/*.cwl

For each add entry in the dict entry there should in general be one
word list. Each of these word lists will be compiled into a separate
hash file, so you should keep the number to a minimum. Each file is
expected to have the following format:

  <code>[-...].wl

These files will be compressed for you with prezip-bin and renamed to
*.cwl.

*** Copyright file

The copyright file simply states the terms under which this word list
is available. If the license is a standard one or is more than a
paragraph or so, the actual license should be included in a separate
file "COPYING". If you are using one of the GNU licenses the COPYING
file will automatically be generated for you.

*** Running proc

Once the info file is created you are ready to run the proc script.
The proc script needs to be copied or linked into the current
directory for things to work correctly. Once that is done, simply
type:

  perl proc create

and if there are no errors you should have the above listed generated
files.
To try building a word list run configure with

  ./configure

and then to build and install it

  make
  make install

To create a distribution do a

  make dist

**********************************************************************
Requirements in order to be uploaded to ftp.gnu.org
**********************************************************************

The number one requirement is that the dictionary package MUST be made
using "make dist" using the "proc" script as previously described.
This will check for a large number of things.

When building the dictionary there should, in general, not be any
warnings.

The version string must end in "-NUM" where NUM is generally 0. This
is to allow for minor updates. In addition there should not be any
other "-" in the version string.

"name_native" should be given a value if it is different from the
English spelling.

The "complete" and "accurate" fields should have a value other than
unknown.

If the dictionary package is based on another dictionary, then
"source_version" and probably "url" should be given a value. Also, the
version string should be made to resemble the upstream version to make
the relationship clear.

If one of the authors plans to act as the maintainer for the
dictionary package add the line "maintainer true" for that author.
There may be more than one maintainer.

The file Copyright should contain a clear copyright notice, which
includes the owner of the copyright. It should be something like:

  Copyright (c) YEAR by SO AND SO under the WHAT.

The copyright must meet the FSF definition of free. See
http://www.gnu.org/licenses/license-list.html

**********************************************************************
GNU Aspell mkchardata Perl script and Unicode data file
**********************************************************************

This version of mkchardata will only work for GNU Aspell 0.60 or better.
It will not work for Aspell 0.50 or any of the Aspell 0.51/0.60
snapshots before 2004-03-02.

The mkchardata Perl script will read in one or more textual reference
tables and convert them into Aspell character data files. Its usage is

  mkchardata <textual reference table(s)>

The files "unicode.txt" and "decomp.txt" are expected to be in the
current directory.

mkchardata will convert each textual reference table to an Aspell
character data file and normalization map file. It expects the table
to be in the form

  0x?? 0x???? # ...

where 0x?? is the 8-bit character value in hex and 0x???? is the
Unicode value. Anything after the '#' is ignored.

Ranges can also be specified in the form

  0x??..0x?? = 0x????..0x???? # ...

The table may alternatively have the form:

  =?? U+???? ...

Another file can be included by using:

  include <file name>

The directive

  == <charset>

indicates that the _unicode mapping_ is the same for the current file
as it is in <charset>. The only difference is the character
properties.

The directives:

  no-latin
  letter <char>
  letters <char> <char> ...
  vowel <char>
  vowels <char> <char> ...
  case <upper> <lower> [<title>]

can be used to customize the character properties. None of these
affect the actual mapping.

The "no-latin" line can be used to avoid marking Latin letters as part
of a word. It is useful if the charset is based on an existing one
which maps the Latin letters but your language is not written using
the Latin script.

The "letter" or "letters" directives can be used to indicate that an
accented letter is really a unique letter and not a letter with an
accent. Each <char> is a single pre-composed character in UTF-8 or a
Unicode code point of the form (U+)XXXX where XXXX is in hex.

The "vowel" or "vowels" directive can be used to identify the vowels
of a language. If used it is necessary to list ALL vowels of the
language. If not specified then the information is taken from the
unicode data file. Specifying a character here implies "letter".
The "case" directive can be used to identify special case rules which
are different from the Unicode default, such as the rules involving
the dotless I for Turkish.

See the file l-tr.txt for an example of the "letter" and "case"
directives.

As of Aspell 0.60 the following characters may be remapped:

  01-0F (  1- 15) # Control characters
  11-1F ( 17- 31) # Control characters
  41-5A ( 65- 90) # Uppercase Latin alphabet
  61-7A ( 97-122) # Lowercase Latin alphabet
  80-FF (128-255)

giving you a total of 210 characters to work with
(15 + 15 + 26 + 26 + 128).

If your language uses characters not found in iso-8859-1 (code points
U+00 to U+FF) you might want to look over unicode.txt and make sure
everything is correct for your language. If you find any errors please
send them to me at [email protected].

**********************************************************************
Provided Character Sets
**********************************************************************

INCLUDED WITH ASPELL:

ISO-8859:
  iso-8859-1  - Latin1 (Western)
  iso-8859-2  - Latin2 (Central European)
  iso-8859-3  - Latin3 (South European)
  iso-8859-4  - Latin4 (Old Baltic)
  iso-8859-5  - Cyrillic
  iso-8859-6  - Arabic
  iso-8859-7  - Greek
  iso-8859-8  - Hebrew
  iso-8859-9  - Latin5 (Turkish)
  iso-8859-10 - Latin6 (Nordic)
  iso-8859-11 - Thai
  iso-8859-13 - Latin7 (Baltic)
  iso-8859-14 - Latin8 (Celtic)
  iso-8859-15 - Latin9 (New Western)
  iso-8859-16 - Latin10 (Romanian)
  See http://aspell.net/charsets/iso8859.html

Microsoft Code Pages:
  cp1250 - Central European (Latin)
  cp1251 - Cyrillic
  cp1252 - Western (Latin)
  cp1253 - Greek
  cp1254 - Turkish (Latin)
  cp1255 - Hebrew
  cp1256 - Arabic
  cp1257 - Baltic (Latin)
  cp1258 - Vietnamese (Latin)
  See http://aspell.net/charsets/codepages.html

Cyrillic:
  koi8-r
  koi8-u - Ukrainian
  iso-8859-5
  cp1251
  See http://aspell.net/charsets/cyrillic.html

OTHERS:

These mappings are available under the maps/ directory. If you use
one of them for your dictionary it should be included with the
tarball.
You can convert all of them to Aspell's charset files by using:

  perl mkchardata maps/*.txt

Since there is the possibility of two different dictionaries providing
the same charset file, DO NOT modify the mappings or the charset
files. If you wish to customize one for your language rename it to
l-<lang>.cset.

These are like the base character set except that the C0 and C1
control areas were remapped to include any decomposed letter found in
the Unicode blocks "Latin-1 Supplement" and "Latin Extended-A" and any
combining marks used in any of the Latin Unicode blocks "Latin-1
Supplement", "Latin Extended-A", "Latin Extended-B", "Latin Extended
Additional".

  iso-8859-1-u
  iso-8859-2-u
  iso-8859-3-u
  iso-8859-4-u
  iso-8859-9-u
  iso-8859-10-u
  iso-8859-13-u
  iso-8859-14-u
  iso-8859-15-u
  iso-8859-16-u

These are identical to the base character set except that Latin
letters are not used, so that Aspell won't flag words written using
the Latin script as incorrect.

  cp1251-nl
  cp1253-nl
  cp1255-nl
  cp1256-nl
  iso-8859-5-nl
  iso-8859-6-nl
  iso-8859-7-nl
  iso-8859-8-nl
  iso-8859-11-nl
  koi8-r-nl
  koi8-u-nl

Vietnamese:
  viscii
  tcvn3

Other standard mappings:
  iso-6438 - Extended African Latin Alphabet

Simple Unicode mappings:
  u-armn - Armenian   (U+0530..U+058F to 0xA0..0xFF)
  u-beng - Bengali    (U+0980..U+09FF to 0x80..0xFF)
  u-deva - Devanagari (U+0900..U+097F to 0x80..0xFF)
  u-geor - Georgian   (U+10A0..U+10FF to 0xA0..0xFF)
  u-gujr - Gujarati   (U+0A80..U+0AFF to 0x80..0xFF)
  u-guru - Gurmukhi   (U+0A00..U+0A7F to 0x80..0xFF)
  u-knda - Kannada    (U+0C80..U+0CFF to 0x80..0xFF)
  u-mong - Mongolian  (U+1800..U+187F to 0x80..0xFF)
  u-mymr - Myanmar    (U+1000..U+105F to 0xA0..0xFF)
  u-orya - Oriya      (U+0B00..U+0B7F to 0x80..0xFF)
  u-sinh - Sinhala    (U+0D80..U+0DFF to 0x80..0xFF)
  u-taml - Tamil      (U+0B80..U+0BFF to 0x80..0xFF)
  u-telu - Telugu     (U+0C01..U+0C7F to 0x80..0xFF)
  u-tglg - Tagalog    (U+1700..U+171F to 0xA0..0xBF)
  u-thaa - Thaana     (U+0780..U+07BF to 0xC0..0xFF)

Not so simple Unicode mappings:
  u-mlym - Malayalam
  u-hebr - Hebrew
Special mappings using private use characters:
  s-ethi - Ethiopic

The Latin letters are not used in any of the above Unicode mappings.

Language specific mappings:

Unlike the other mappings, it is permissible to modify these. However,
to avoid future problems, please let me know about the changes at
[email protected].

  l-az - Azerbaijani
  l-fa - Persian
  l-ky - Kirghiz
  l-sr - Serbian (supports both the Cyrillic and Latin script)
  l-tg - Tajik
  l-tr - Turkish (iso-8859-9 with special case rules for dotless I)
  l-uz - Uzbek

Some other language specific mappings are also available which I
created for various people. Most have not been used in an official
dictionary yet and might still be incomplete.
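For anyone inspecting or writing one of the textual reference tables described in the mkchardata section, the following Python sketch reads the two "0x" line forms (single pairs and ranges) into a byte-to-code-point dictionary. It is not part of mkchardata. The name parse_table and the sample table are hypothetical, the sketch assumes that a range maps consecutive byte values to consecutive code points, and it silently skips the "=?? U+????" form, "include", "==" and the property directives.

```python
import re

# "0x?? 0x????": one 8-bit value mapped to one Unicode value.
PAIR = re.compile(r"^0x([0-9A-Fa-f]{2})\s+0x([0-9A-Fa-f]{4,})$")
# "0x??..0x?? = 0x????..0x????": a range of values.
RANGE = re.compile(
    r"^0x([0-9A-Fa-f]{2})\.\.0x([0-9A-Fa-f]{2})\s*=\s*"
    r"0x([0-9A-Fa-f]{4,})\.\.0x([0-9A-Fa-f]{4,})$"
)


def parse_table(text):
    """Return {byte value: Unicode code point} for the lines understood."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # anything after '#' is ignored
        if not line:
            continue
        pair = PAIR.match(line)
        if pair:
            mapping[int(pair.group(1), 16)] = int(pair.group(2), 16)
            continue
        rng = RANGE.match(line)
        if rng:
            b0, b1, u0, u1 = (int(g, 16) for g in rng.groups())
            if b1 - b0 != u1 - u0:
                raise ValueError("range lengths differ: %r" % raw)
            # Assumption: ranges pair up consecutive values in order.
            for offset in range(b1 - b0 + 1):
                mapping[b0 + offset] = u0 + offset
        # Every other kind of line is skipped in this sketch.
    return mapping


# Hypothetical sample table, not copied from the maps/ directory.
SAMPLE = """\
0x41 0x0041 # LATIN CAPITAL LETTER A
0x80..0x82 = 0x0980..0x0982 # first three code points of a block
vowels a e i o u
"""

table = parse_table(SAMPLE)
print({hex(k): hex(v) for k, v in sorted(table.items())})
```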