High-performance program to spell-check and auto-correct large documents

MS-Hall-Git/spell-check

spell-check

Automatically find and fix spelling errors in READMEs and other documents written without a word processor. spell-check is a fast command-line application that spell-checks large text files (books, GitHub files, assignments, etc.) and autocorrects misspelled words using a probabilistic model. The program is optimized for speed and can check over 1 million words in less than 1 second.

Checking Documents

Download and Install from source

$ cd ~/Downloads
$ wget -O spell-check-2.0.zip https://github.com/madhav-datt/spell-check/archive/v2.0.zip
$ unzip spell-check-2.0.zip
$ mv spell-check-2.0 spell-check
$ chmod +x spell-check/install
$ sudo spell-check/install

Running spellchecker

$ spellcheck /path/to/file/file_to_be_checked

The program supports spell-checking and autocorrect for txt and PDF files. You can also batch process multiple files inside a directory.
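
The exact batch syntax is not shown here, so the following is only a sketch of one way to check every supported file in a directory using the single-file invocation documented above. The helper names `commands_for` and `commands_for_directory` are hypothetical and are not part of spell-check.

```python
from pathlib import Path

# Extensions the README lists as supported.
SUPPORTED = {".txt", ".pdf"}


def commands_for(paths):
    """Build one `spellcheck <file>` command per supported file (hypothetical helper)."""
    ordered = sorted(Path(p) for p in paths)
    return [["spellcheck", str(p)] for p in ordered if p.suffix.lower() in SUPPORTED]


def commands_for_directory(directory):
    """Collect the regular files directly inside `directory` and build their commands."""
    return commands_for(p for p in Path(directory).iterdir() if p.is_file())
```

Each returned command can then be run with `subprocess.run(command, check=True)`, assuming `spellcheck` is installed and on the PATH.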

Sample spell-check usage

Output

The program outputs a list of all misspelled words along with suggested corrections, followed by file-checking benchmarks.

Sample spell-check output

oficiel, which was intended to be official, has no suggested correction because it is more than one edit away from any correctly spelled word. Read more about this here.
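
To illustrate the edit-distance limit, here is a minimal Python sketch of candidate generation in the style of Peter Norvig's well-known spelling corrector. It is not the program's actual implementation, and it assumes that one "edit" means a single deletion, transposition, replacement, or insertion. The three-word dictionary is a toy stand-in for the real one.

```python
import string


def edits1(word):
    """Return the strings reachable from `word` with one edit (may include `word` itself)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggestions(word, dictionary):
    """Dictionary words within one edit of `word`, in alphabetical order."""
    return sorted(edits1(word) & dictionary)


# Hypothetical toy dictionary, for illustration only.
DICTIONARY = {"official", "office", "spelling"}

print(suggestions("speling", DICTIONARY))  # one insertion away from "spelling"
print(suggestions("oficiel", DICTIONARY))  # two edits from "official", so nothing is found
```

Under these assumptions, oficiel needs both an inserted f and an e replaced by a to become official, so no single edit reaches it.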

spell-check will fix spelling errors due to missing spaces using a segmentation algorithm. Read more here.
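
The segmentation algorithm itself is described elsewhere. As a rough illustration of the general idea only, the sketch below splits a run-together string into the most probable sequence of known words using dynamic programming over word frequencies. The word counts are hypothetical, and the program's real algorithm may differ.

```python
import math

# Hypothetical toy word counts, for illustration only.
WORD_COUNTS = {"historical": 120, "data": 900, "a": 5000, "his": 800, "at": 1500}
TOTAL = sum(WORD_COUNTS.values())
MAX_LEN = max(len(word) for word in WORD_COUNTS)


def segment(text):
    """Split `text` into known words, maximizing the sum of log word probabilities.

    Returns a list of words, or None if no split into known words exists.
    """
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_LEN), end):
            piece = text[start:end]
            if piece in WORD_COUNTS and best[start][1] is not None:
                score = best[start][0] + math.log(WORD_COUNTS[piece] / TOTAL)
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]


print(" ".join(segment("historicaldata")))
```

With these toy counts, the only way to cover historicaldata with known words is historical followed by data.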

Sample spell-check segmentation output

Benchmarks

The speed and accuracy figures below are approximate values averaged over multiple input text files and documents.

Spellcheck Speed

Optimized for speed: spell-check can check over 1 million words in less than 1 second.

Misspelled Words          122.5
Words in Dictionary       368895
Words in Text Document    10237257.5
Time in Loading Data      0.34 seconds
Time in Checking Text     0.56 seconds
Time in Correcting Text   0.01 seconds
Time in Unloading Data    0.15 seconds
Total Time                0.98 seconds

Sample speed benchmark

Autocorrect Accuracy

Accuracy is calculated on inputs from Roger Mitton's Birkbeck spelling error corpus from the Oxford Text Archive. On a development set of 250 test cases (including context-based mistakes involving correctly spelled words), spell-check has an accuracy of around 66% overall and close to 80% for misspelled words with an edit distance of one.
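
For readers who want to run a similar measurement, here is a minimal sketch of how such an accuracy figure can be computed: the fraction of (misspelling, intended word) pairs for which the corrector returns the intended word. The `accuracy` function, the toy corrector, and the three test pairs are hypothetical and are not taken from the Birkbeck corpus or from the program's own evaluation code.

```python
def accuracy(test_cases, correct):
    """Fraction of (misspelling, intended) pairs that `correct` maps to the intended word."""
    hits = sum(1 for wrong, intended in test_cases if correct(wrong) == intended)
    return hits / len(test_cases)


# Hypothetical stand-in corrector: fixes two known words, leaves everything else unchanged.
KNOWN_FIXES = {"speling": "spelling", "teh": "the"}


def toy_correct(word):
    return KNOWN_FIXES.get(word, word)


# Hypothetical test pairs, for illustration only.
CASES = [("speling", "spelling"), ("teh", "the"), ("oficiel", "official")]

print(f"{accuracy(CASES, toy_correct):.0%}")
```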

Word Frequency Data Details

Read about the data, its sources, the processing of raw word data, word frequencies, and the probabilistic model for word correction here.
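
As a rough illustration of frequency-based correction in general, the sketch below ranks candidate corrections by their relative frequency and picks the most probable one. The counts and the `best_candidate` helper are hypothetical, and the program's actual model is the one described in the linked details.

```python
# Hypothetical toy word counts, for illustration only.
WORD_COUNTS = {"there": 2500, "three": 900, "the": 50000}
TOTAL = sum(WORD_COUNTS.values())


def probability(word):
    """Relative frequency of `word` in the toy counts (0.0 for unknown words)."""
    return WORD_COUNTS.get(word, 0) / TOTAL


def best_candidate(candidates):
    """Return the known candidate with the highest probability, or None if none is known."""
    known = [c for c in candidates if c in WORD_COUNTS]
    return max(known, key=probability) if known else None


print(best_candidate(["there", "three", "thele"]))
```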

Known Issues

  • No context-based or grammar checking:

    Their is nothing to be done here.

    will be treated as a correct sentence and not be changed to

    There is nothing to be done here.

  • Words with an edit distance greater than 1 cannot be corrected: oficiel won't be corrected to official.

  • Doesn't fix spelling errors due to missing spaces.

    historicaldata

    will be found as misspelled, but won't be corrected to

    historical data

  • Please report bugs and issues here.
