i18n: european portuguese word list #6044
base: develop
Imported from Dicionários Natura at https://natura.di.uminho.pt/download/sources/Dictionaries/wordlists/
The list still needs trimming to remove words that would be difficult to memorize.
Hi @lisbonjoker! I'd like to understand better how to review your PR. Could you please give some more context on that word list and what made you choose it? For example, I'd love to hear more about:
How is the word list licensed?
The dictionaries are covered by the GPL, LGPL, and MPL licenses (or at least one of them).

How is it composed? By whom?
The Natura Project is a small research group in Natural Language Processing at the Department of Computer Science, University of Minho. It is part of a larger Language Processing and Specification group. More in: https://natura.di.uminho.pt/wiki/doku.php?id=dicionarios:main
Current management: José João Almeida
Other collaborators: Rui Vilela

What words are in it, are those verbs, nouns, adjectives, adverbs? Where do they come from?
It is a list of Portuguese words that also contains proper names, acronyms, abbreviations, and common loanwords. The list is derived from the Jspell dictionary for morphological analysis.

What purpose would it fulfill in the context of SecureDrop?
It would let European Portuguese speakers use SecureDrop in their own variant of the language. There is a big difference between PT-BR and PT-EU, and some words from one variant are unused or unrecognized in the other.
Don't know if this is helpful to this conversation, but: in an effort to cut this very long list down to a length closer to the existing SecureDrop wordlists, I took the most frequently appearing words from Portuguese Wikipedia articles (with help from this project), then filtered out any and all words NOT on this 994,951-word list. I then removed any and all words with accented characters or non-UTF-8 characters (I think), all words not between 3 and 15 characters, and any Roman numerals. (Notably I didn't filter out profane words.) I arbitrarily chose to make this new list 10,000 words. The result was this wordlist. Hope this helps -- sorry if it derails things.
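For reviewers who want to reproduce or tweak this, the filtering steps described above could be sketched roughly as follows. This is not the script that was actually used: the function names, the lowercasing and de-duplication, and the Roman numeral regex are all assumptions made for illustration, and "no accented characters" is approximated here as "ASCII letters only".

```python
import re

# Assumption: "Roman numerals" means well-formed numerals such as "xiv" or "mcm".
# The lookahead rejects the empty string, which the rest of the pattern would accept.
_ROMAN = re.compile(
    r"^(?=[ivxlcdm]+$)m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)


def is_candidate(word, dictionary):
    """Return True if `word` passes every filter described in the comment above."""
    return (
        word in dictionary              # must appear in the large Natura-derived list
        and word.isascii()              # drops accented characters
        and word.isalpha()              # drops digits, hyphens, apostrophes (assumption)
        and 3 <= len(word) <= 15        # length bounds from the comment
        and not _ROMAN.match(word)      # drops Roman numerals
    )


def build_wordlist(words_by_frequency, dictionary, size=10_000):
    """Keep the first `size` acceptable words, preserving frequency order.

    `words_by_frequency` is an iterable of words, most frequent first.
    `dictionary` is a set of lowercase words used for the membership check.
    """
    seen = set()
    result = []
    for raw in words_by_frequency:
        word = raw.strip().lower()      # normalising case is an assumption
        if word in seen or not is_candidate(word, dictionary):
            continue
        seen.add(word)
        result.append(word)
        if len(result) == size:
            break
    return result
```

Note that a strict Roman numeral pattern like this one also discards ordinary words that happen to be valid numerals (for example "mix"), and that no profanity filter is applied, matching the caveat above.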
We're tentatively interested in evaluating this as part of other localization efforts for v2.12. As a prerequisite, we'd like to review how we've added and maintained our other word lists so far.