See https://github.com/impresso/NZZ-black-letter-ground-truth.
All original lines were randomly split into 90 % for training and 10 % for evaluation.
Three line images were skipped because they exceeded the width limit in the current lstmtrain.
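A random 90/10 split of this kind can be sketched as follows. This is only an illustration, not the procedure that was actually used here: the function name `split_lines`, the fixed seed and the line names are hypothetical.

```python
import random


def split_lines(names, eval_fraction=0.10, seed=0):
    """Shuffle line names reproducibly and return (training, evaluation)."""
    shuffled = sorted(names)               # fixed starting order, independent of input order
    random.Random(seed).shuffle(shuffled)  # seeded so the split can be repeated
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical line names; real ones would come from the ground truth directory.
names = [f"line_{i:05d}" for i in range(1000)]
train, evaluation = split_lines(names)
print(len(train), len(evaluation))
```

Seeding the shuffle keeps the evaluation set stable between runs, which matters when CER values from different trainings are compared.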
Training is running for 10 epochs.
make -r MODEL_NAME=nzz-new GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=388550 training
Current CER: 1.384 %, CPU time: 28:33 h
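The value of `MAX_ITERATIONS` looks like the number of training lines multiplied by the number of epochs. That relation is an assumption; the page does not state the line count. Under that assumption, a small sketch recovers the implied number of training lines (the helper names are hypothetical):

```python
def max_iterations(training_lines, epochs):
    """Iterations needed to see every training line `epochs` times."""
    return training_lines * epochs


def lines_from_iterations(iterations, epochs):
    """Inverse of max_iterations; fails if the numbers do not divide evenly."""
    lines, remainder = divmod(iterations, epochs)
    if remainder:
        raise ValueError("iterations is not a whole multiple of epochs")
    return lines


# 388550 and 10 are taken from the make command and the text above.
implied_lines = lines_from_iterations(388550, 10)
print(implied_lines)
```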
10 pages (same as in original test) were used for evaluation.
The code for lstmtrain was modified to allow line images with a width of up to 4096 px.
Training is running for 10 epochs with the default network specification and
with an alternate specification which scales the line images to a height of 64 px.
make -r MODEL_NAME=nzz-ref GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=387780 training
Current CER: 1.115 %, CPU time: 33 h
make -r MODEL_NAME=nzz-64 GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=387780 NET_SPEC="[1,64,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c148]" training
Current CER: 2.852 %, CPU time: 15:43 h
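In a VGSL network specification the leading group gives the input as batch, height, width and depth, so the 64 px height is the second number of `NET_SPEC`. The following sketch only reads that input group; it is not a full VGSL parser, and the function name `input_shape` is hypothetical.

```python
import re


def input_shape(net_spec):
    """Return the input group of a VGSL spec as a dict of integers."""
    match = re.match(r"\s*\[\s*(\d+),(\d+),(\d+),(\d+)", net_spec)
    if match is None:
        raise ValueError("spec does not start with [batch,height,width,depth")
    batch, height, width, depth = (int(value) for value in match.groups())
    return {"batch": batch, "height": height, "width": width, "depth": depth}


spec = "[1,64,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c148]"
shape = input_shape(spec)
print(shape["height"])  # a width of 0 stands for variable-width input
```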
An intermediate model nzz-64_2.852_48798_92600.traineddata achieves an accuracy of 95.48 % (CER 4.52 %) on the evaluation set of 10 pages. The confusion list shows that there are probably some transcription errors in the ground truth data. These will be fixed using GTCheck before we try the next iteration.
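The accuracy figure and the corresponding CER follow directly from the character and error counts in the report below. A minimal check, with a hypothetical helper name:

```python
def accuracy_and_cer(characters, errors):
    """Return (accuracy, CER) as fractions of the total character count."""
    cer = errors / characters
    return 1.0 - cer, cer


# Counts taken from the accuracy report.
characters, errors = 207870, 9396
insertions, substitutions, deletions = 1640, 5829, 1927

accuracy, cer = accuracy_and_cer(characters, errors)
print(f"accuracy {accuracy:.2%}, CER {cer:.2%}")
print(insertions + substitutions + deletions == errors)
```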
UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
207870 Characters
9396 Errors
95.48% Accuracy
0 Reject Characters
0 Suspect Markers
0 False Marks
0.00% Characters Marked
95.48% Accuracy After Correction
Ins Subst Del Errors
0 0 0 0 Marked
1640 5829 1927 9396 Unmarked
1640 5829 1927 9396 Total
Count Missed %Right
26448 309 98.83 ASCII Spacing Characters
6038 504 91.65 ASCII Special Symbols
2234 238 89.35 ASCII Digits
9991 712 92.87 ASCII Uppercase Letters
163159 5706 96.50 ASCII Lowercase Letters
207870 7469 96.41 Total
Errors Marked Correct-Generated
89 0 {}-{ }
83 0 {}-{,}
80 0 {h}-{b}
72 0 {f}-{s}
67 0 { }-{}
51 0 {u}-{n}
49 0 {}-{n}
49 0 {}-{s}
47 0 {}-{b}
46 0 {l}-{t}
46 0 {n}-{u}
43 0 {,}-{}
43 0 {.}-{,}
42 0 {i}-{}
42 0 {}-{.}
41 0 {}-{t}
40 0 {r}-{n}
39 0 {.}-{}
38 0 {g}-{a}
36 0 {B}-{V}
35 0 {e}-{}
33 0 {l}-{}
33 0 {}-{i}
32 0 {N}-{R}
32 0 {r}-{}
31 0 {t}-{}
30 0 {}-{a}
30 0 {}-{u}
28 0 {s}-{f}
28 0 {s}-{}
27 0 {c}-{e}
27 0 {k}-{t}
26 0 {i}-{t}
26 0 {}-{8}
26 0 {}-{e}
25 0 {d}-{b}
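To find candidates for ground truth corrections, the confusion list can be grouped by kind of error. The sketch below assumes the line layout shown above (`count marked {correct}-{generated}`); the regular expression and the category names are illustrative and do not handle braces inside the confused strings.

```python
import re
from collections import Counter

# count, marked, {correct}-{generated}
CONFUSION = re.compile(r"^\s*(\d+)\s+(\d+)\s+\{(.*?)\}-\{(.*)\}\s*$")


def parse_confusions(lines):
    """Return (count, correct, generated) for every line that matches."""
    rows = []
    for line in lines:
        match = CONFUSION.match(line)
        if match:
            rows.append((int(match.group(1)), match.group(3), match.group(4)))
    return rows


def classify(correct, generated):
    """Name the kind of confusion from the two sides of the pair."""
    if not correct:
        return "generated_only"   # OCR output has text the ground truth lacks
    if not generated:
        return "correct_only"     # ground truth text is missing from the OCR output
    return "substitution"


sample = ["89 0 {}-{ }", "80 0 {h}-{b}", "72 0 {f}-{s}", "67 0 { }-{}"]
totals = Counter()
for count, correct, generated in parse_confusions(sample):
    totals[classify(correct, generated)] += count
print(dict(totals))
```

Pairs such as `{f}-{s}` and `{s}-{f}` are typical for the long s in black letter, so they are a natural first place to look for transcription errors.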