- https://github.com/Ousret/charset_normalizer
- Survey of the language identification problem https://arxiv.org/pdf/1804.08186.pdf
- Mega dataset http://www.cs.cmu.edu/~ralf/langid.html
- Language dataset for Langid https://github.com/saffsd/langid.py
- Language dataset for Langdetect https://dumps.wikimedia.org/backup-index.html
- Language dataset for CLD2 https://github.com/CLD2Owners/cld2/blob/master/internal/test_shuffle_1000_48_666.utf8.gz
- Language dataset for Lingua https://wortschatz.uni-leipzig.de/de
- (There is no way in hell UDHR is good data for language identification)
- https://github.com/shivam5992/textstat
- In case we need to use browsers https://github.com/cgiffard/TextStatistics.js
- Flesch is an all-rounder
- For Education, use Coleman-Liau (Dale-Chall/SMOG/FOG)
- For Manuals, use ARI (Linsear)
- For Publishing, use Coleman-Liau or LIX (Dale-Chall/SMOG/FOG)
- ELIXIV initiative (Coleman-Liau or ARI score should be less than 9)
- ELIXIV for LIX? it should be under 40 (why TF is it not linear!?)
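
A rough sketch of the three scores the thresholds above refer to, in plain Python so the arithmetic is visible. The formulas are the standard published ones for Coleman-Liau, ARI and LIX; the tokenizer is a naive regex and is an assumption, so numbers will differ slightly from textstat or TextStatistics.js. The helper names (`text_counts`, `passes_thresholds`) are made up for this note and are not part of any library.

```python
import re


def text_counts(text):
    """Naive counts: words, sentences, letters, alphanumerics, long words."""
    # Assumption: a word is a run of letters/digits, optionally with one apostrophe part.
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)
    # Assumption: a sentence ends at any run of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "letters": sum(ch.isalpha() for ch in text),
        "alnum": sum(ch.isalnum() for ch in text),
        # LIX counts words longer than six letters as "long".
        "long_words": sum(1 for w in words if len(w.replace("'", "")) > 6),
    }


def coleman_liau(text):
    c = text_counts(text)
    letters_per_100 = c["letters"] / c["words"] * 100
    sentences_per_100 = c["sentences"] / c["words"] * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def ari(text):
    c = text_counts(text)
    return 4.71 * c["alnum"] / c["words"] + 0.5 * c["words"] / c["sentences"] - 21.43


def lix(text):
    c = text_counts(text)
    return c["words"] / c["sentences"] + 100 * c["long_words"] / c["words"]


def passes_thresholds(text, grade_limit=9, lix_limit=40):
    """Hypothetical check against the limits noted above (grade < 9, LIX < 40)."""
    return {
        "coleman_liau": coleman_liau(text) < grade_limit,
        "ari": ari(text) < grade_limit,
        "lix": lix(text) < lix_limit,
    }


if __name__ == "__main__":
    sample = "The cat sat on the mat. It was a sunny day."
    print(text_counts(sample))
    print(round(coleman_liau(sample), 2), round(ari(sample), 2), lix(sample))
    print(passes_thresholds(sample))
```

Coleman-Liau and ARI both output an approximate US grade level, while LIX is the average sentence length plus the percentage of long words, so its scale is different and the two limits are not expected to line up one-to-one.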
- https://github.com/languagetool-org/languagetool
- http://mentalfloss.com/article/18661/quick-10-10-longest-novels-ever
- http://www.vulture.com/2016/11/long-books-worth-your-time.html
- https://themillions.com/2007/09/world-longest-novel.html
- My top three
- The shitpost novel - The Blah Story by Nigel Tomm (3.3 Mil)
- A romance novel - Marienbad My Love by Mark Leach (2.5 Mil)
- Pre-scientology - Mission Earth by L. Ron Hubbard (1.2 Mil)
- WHO AUTOMATED THIS? https://github.com/johnafish/duree
- bad reference (needs cleaning) https://github.com/shaoxiongji/awesome-knowledge-graph
- https://github.com/Accenture/AmpliGraph