Stefan Weil edited this page May 11, 2020 · 13 revisions

Training Fraktur with Neue Zürcher Zeitung



Training set 1

All original lines were randomly split into 90 % for training and 10 % for evaluation. Three line images were skipped because they exceeded the width limit in the current lstmtrain. Training is running for 10 epochs.
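A random 90/10 split of this kind can be sketched in a few lines of Python. This is only an illustration, not the code used for this training run: the function name `split_lines`, the fixed seed and the placeholder line identifiers are all assumptions.

```python
import random

def split_lines(line_ids, train_ratio=0.9, seed=0):
    """Shuffle line identifiers and split them into training and evaluation lists."""
    shuffled = list(line_ids)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical identifiers standing in for the ground-truth line images.
lines = [f"line_{i:05d}" for i in range(1000)]
train, evaluation = split_lines(lines)
print(len(train), len(evaluation))  # 900 100
```

Because the split is made per line and not per page, lines from the same page can end up in both lists.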

make -r MODEL_NAME=nzz-new GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=388550 training

Current CER: 1.384 %, CPU time: 28:33 h
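The MAX_ITERATIONS values on this page are consistent with 10 epochs over a fixed number of training lines, assuming that one iteration processes one line image. The page does not state the line counts, so the figures below are derived under that assumption, and the helper `lines_per_epoch` is hypothetical.

```python
def lines_per_epoch(max_iterations, epochs):
    """Return the number of training lines implied by an iteration budget.

    Assumes one iteration corresponds to one line image.
    """
    lines, remainder = divmod(max_iterations, epochs)
    if remainder:
        raise ValueError("iteration budget is not a whole number of epochs")
    return lines

print(lines_per_epoch(388550, 10))  # 38855 (training set 1)
print(lines_per_epoch(387780, 10))  # 38778 (training set 2)
```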

Training set 2

Ten pages (the same as in the original test) were used for evaluation. The code for lstmtrain was modified to allow line images with a width of up to 4096 px. Training is running for 10 epochs, once with the default network specification and once with an alternate specification that scales line images to a height of 64 px.

make -r MODEL_NAME=nzz-ref GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=387780 training

Current CER: 1.115 %, CPU time: 33 h

make -r MODEL_NAME=nzz-64 GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=387780 NET_SPEC="[1,64,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c148]" training

Current CER: 2.852 %, CPU time: 15:43 h

Intermediate results

An intermediate model nzz-64_2.852_48798_92600.traineddata achieves an accuracy of 95.48 % (CER 4.52 %) on the evaluation set of 10 pages. The confusion list shows that there are some likely transcription errors in the ground truth data. These will be fixed using GTCheck before we try the next iteration.
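The accuracy figure in the report follows directly from its character and error counts, and the character error rate is its complement. A small sketch of that arithmetic, with the hypothetical helper name `accuracy_and_cer`:

```python
def accuracy_and_cer(characters, errors):
    """Return (accuracy, CER) in percent from a character count and an error count."""
    cer = 100 * errors / characters
    return 100 - cer, cer

# Counts taken from the accuracy report for the 10 evaluation pages.
accuracy, cer = accuracy_and_cer(characters=207870, errors=9396)
print(f"accuracy {accuracy:.2f} %, CER {cer:.2f} %")  # accuracy 95.48 %, CER 4.52 %

# The error total is the sum of insertions, substitutions and deletions.
assert 1640 + 5829 + 1927 == 9396
```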

UNLV-ISRI OCR Accuracy Report Version 5.1
  207870   Characters
    9396   Errors
   95.48%  Accuracy

       0   Reject Characters
       0   Suspect Markers
       0   False Marks
    0.00%  Characters Marked
   95.48%  Accuracy After Correction

     Ins    Subst      Del   Errors
       0        0        0        0   Marked
    1640     5829     1927     9396   Unmarked
    1640     5829     1927     9396   Total

   Count   Missed   %Right
   26448      309    98.83   ASCII Spacing Characters
    6038      504    91.65   ASCII Special Symbols
    2234      238    89.35   ASCII Digits
    9991      712    92.87   ASCII Uppercase Letters
  163159     5706    96.50   ASCII Lowercase Letters
  207870     7469    96.41   Total

  Errors   Marked   Correct-Generated
      89        0   {}-{ }
      83        0   {}-{,}
      80        0   {h}-{b}
      72        0   {f}-{s}
      67        0   { }-{}
      51        0   {u}-{n}
      49        0   {}-{n}
      49        0   {}-{s}
      47        0   {}-{b}
      46        0   {l}-{t}
      46        0   {n}-{u}
      43        0   {,}-{}
      43        0   {.}-{,}
      42        0   {i}-{}
      42        0   {}-{.}
      41        0   {}-{t}
      40        0   {r}-{n}
      39        0   {.}-{}
      38        0   {g}-{a}
      36        0   {B}-{V}
      35        0   {e}-{}
      33        0   {l}-{}
      33        0   {}-{i}
      32        0   {N}-{R}
      32        0   {r}-{}
      31        0   {t}-{}
      30        0   {}-{a}
      30        0   {}-{u}
      28        0   {s}-{f}
      28        0   {s}-{}
      27        0   {c}-{e}
      27        0   {k}-{t}
      26        0   {i}-{t}
      26        0   {}-{8}
      26        0   {}-{e}
      25        0   {d}-{b}
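Several confusions in the list appear in both directions, for example {f}-{s} and {s}-{f}. When looking for systematic problems it can help to merge such pairs. The following sketch parses a few rows copied from the report and sums both directions; the regular expression and the names `ROW` and `merge_confusions` are assumptions made for this illustration and not part of the ISRI tools.

```python
import re
from collections import Counter

# Matches rows such as "      80        0   {h}-{b}".
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+\{(.*?)\}-\{(.*)\}\s*$")

def merge_confusions(rows):
    """Sum error counts of confusion pairs regardless of direction."""
    merged = Counter()
    for row in rows:
        match = ROW.match(row)
        if not match:
            continue  # skip headers and blank lines
        errors, _marked, correct, generated = match.groups()
        merged[tuple(sorted((correct, generated)))] += int(errors)
    return merged

# A few rows copied from the confusion list.
rows = [
    "      80        0   {h}-{b}",
    "      72        0   {f}-{s}",
    "      51        0   {u}-{n}",
    "      46        0   {n}-{u}",
    "      28        0   {s}-{f}",
    "      25        0   {d}-{b}",
]
merged = merge_confusions(rows)
for pair, errors in merged.most_common():
    print(pair, errors)
```

For these rows the f/s pair adds up to 100 errors and the n/u pair to 97, ahead of b/h with 80.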

