diff --git a/README.md b/README.md
index f4aa424..384f49c 100644
--- a/README.md
+++ b/README.md
@@ -14,9 +14,15 @@ The classes defined are listed below:
 Provides a basic implementations of some popular edit distance methods
 (currently, Levenshtein and indel) applied to arrays of objects.
 
+[distance.BagOfWords](https://github.com/impactcentre/ocrevalUAtion/blob/master/src/main/java/eu/digitisation/distance/BagOfWords.java)
+Computes distances between two bags of words (order independent distance).
+
+[distance.EditTable]
+Compact storage for a large table containing four basic edit operations.
+
 [distance.StringEditDistance](https://github.com/impactcentre/ocrevalUAtion/blob/master/src/main/java/eu/digitisation/distance/StringEditDistance.java)
 Provides basic implementations of some popular edit distance methods
-operating on strings (currently, Levenshtein and indel).
+operating on strings (currently, Levenshtein, Damerau-Levenshtein, and indel).
 
 [distance.TextFileEncoder](https://github.com/impactcentre/ocrevalUAtion/blob/master/src/main/java/eu/digitisation/distance/TextFileEncoder.java)
 Encode a text file as an array of Integers (one code per word).
@@ -25,9 +31,14 @@ Encode a text file as an array of Integers (one code per word).
 Transform text according to a mapping between (source, target)
 Unicode character sequences.
 
+[io.StringNormalizer]
+Normalizes strings: collapse whitespace and use composed form (see java.text.Normalizer.Form)
+
 [io.TextContent](https://github.com/impactcentre/ocrevalUAtion/blob/master/src/main/java/eu/digitisation/io/TextContent.java)
 Reads and normalizes text from file content,
-and optionally applies a CharFilter.
+and optionally applies a CharFilter. Now, it supports text files and PAGE XML files (selects only those
+elements listed in a properties file, TOC-entry, heading,
+drop-capital, paragraph).
 
 [io.UnicodeReader](https://github.com/impactcentre/ocrevalUAtion/blob/master/src/main/java/eu/digitisation/io/UnicodeReader.java)
 Transformations between Unicode strings and codepoints.
@@ -45,6 +56,13 @@ Standard operations on arrays: sum, average, max, min, standard deviation.
 Counts the number of different objects, a map between
 objects and integers which can be incremented and decremented.
 
+[math.BiCounter](https://github.com/impactcentre/ocrevalUAtion/blob/master/src/main/java/eu/digitisation/math/BiCounter.java)
+Counts the number of different pairs of objects, a map between
+pairs of objects and integers which can be incremented and decremented.
+
+[math.Pair]
+A pair of objects.
+
 [ocr.ErrorMeasure](https://github.com/impactcentre/ocrevalUAtion/blob/master/src/main/java/eu/digitisation/ocr/ErrorMeasure.java)
 The main class which computes character and word error rates.
 
@@ -56,10 +74,6 @@ PAGE-XML regions order in the document can differ form reading order.
 This class makes the order of elements in the document consistent
 with the reading order stored therein.
 
-[Page.TextContent](https://github.com/impactcentre/ocrevalUAtion/blob/master/src/main/java/eu/digitisation/Page/TextContent.java)
-Textual content in a PAGE XML: selects only those
-elements listed in a properties file (TOC-entry, heading,
-drop-capital, paragraph).
 
 [Page.TextRegion](https://github.com/impactcentre/ocrevalUAtion/blob/master/src/main/java/eu/digitisation/Page/TextRegion.java)
 A TextRegion in a PAGE-XML document.
diff --git a/src/main/java/eu/digitisation/distance/BagOfWords.java b/src/main/java/eu/digitisation/distance/BagOfWords.java
index 621a4bc..670e919 100644
--- a/src/main/java/eu/digitisation/distance/BagOfWords.java
+++ b/src/main/java/eu/digitisation/distance/BagOfWords.java
@@ -24,7 +24,7 @@ import java.util.logging.Logger;
 
 
 /**
- *
+ * Computes distances between two bags of words (order independent distance)
  * @author R.C.C.
  */
 public class BagOfWords {