Add Javanese Script for jav-java #126
Comments
You can try this, as it will be faster than training from scratch. Please post links to Javanese script related resources below. If there is a transliterator which converts Javanese in Latin script to Javanese script, it can be used to convert the files for lang jav as a start. |
See https://github.com/tesseract-ocr/langdata/tree/master/jav https://github.com/tesseract-ocr/langdata/blob/master/README.md |
Before training, he should try the best/fast jav.traineddata. |
jav is Javanese language in Latin script.
He wants it in Javanese script
This might be similar to Thai/Khmer - could try using that to train from. |
http://unicode.org/udhr/d/udhr_jav_java.html Universal Declaration of Human Rights - Javanese (Javanese) |
https://jv.wikipedia.org/wiki/Parembugan:Joko_Widodo Most of Javanese wikipedia seems to be in Latin script. |
Did you unpack jav from best/fast? |
Hello, thanks a lot for your help. I appreciate it. Sorry, I always thought that we need images as training data, but that is not the case for Tesseract 4.0. Another question: do we have to collect all 500,000 text lines before beginning the training? |
100 lines will work only for fine tuning. But you can give it a try to get familiar with training process. |
@amitdo I had only looked at langdata. I checked just now after your post. The unicharset in both is in Latin Script only. See below for tessdata_fast version
|
@topherseance Please see attached zip file which has a test training for Javanese including both Javanese and Latin script. Only trained (replace a layer) up to about 7% accuracy on the small training data that I could gather. Keep us updated on your progress with training. |
@Shreeshrii hi, I am quite interested in this post, could you give me training data from this? I need to generate Javanese script training data compatible with tesseract 3.04/3.05. I want to use that training data on an Android device; I use tess-two, which is not yet compatible with tesseract 4. |
The requirements of training data for tesseract 3.0x are quite different from those for 4.0.0 LSTM training. You can use jav-java text from UDHR or wikipedia as linked in posts above. |
Hello, sorry for the hiatus, I had other tasks to do. I have only found 2 Javanese fonts so far: I tried to create a starter traineddata for Noto Sans Javanese by using the command below, and it worked successfully:
But when I tried to do the same for Tuladha Jejeg font, it shows this error:
The
(taken from here) Opened the Another info, Javanese script, by Unicode standard, has glyph-combining letters. (See Pasangan) Thanks before |
Text2image uses Pango for font rendering. It is possible that it does not support the SIL graphite fonts. I also get errors for Annapurna SIL devanagari font and do not use it. |
I think I had used a couple more fonts |
I see, but why does the resulting .tif image seem to be rendered correctly? |
We plan on using the OCR for old textbook scans written in javanese script. |
Just tested with other text strings, some of them worked, some did not. here's what we found: |
see tesseract-ocr/tesseract#1038
There may not be any existing Javanese script related rules. These will need to be added.
On Mon, Jul 9, 2018 at 6:50 PM Shree Devi Kumar <[email protected]>
wrote:
… Word started with a combiner:0xa9bc
Word started with a combiner:0xa981
Normalization failed for string 'ꦼꦁ'
Word started with a combiner:0xa983
Normalization failed for string 'ꦃ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦠ'
Word started with a combiner:0xa9bc
Normalization failed for string 'ꦼ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦲ'
Word started with a combiner:0xa9b6
Word started with a combiner:0xa981
Normalization failed for string 'ꦶꦁ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa983
Normalization failed for string 'ꦃ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Please look at the validation/normalization rules for Indic scripts in the
code. Something there may be triggering these errors.
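The "Word started with a combiner" messages above can be reproduced outside of Tesseract to find the affected words in a training text. The following is a minimal Python sketch, assuming that "combiner" roughly corresponds to the Unicode general categories Mn/Mc/Me; it is not Tesseract's validator, and the function name `words_starting_with_mark` is made up for illustration.

```python
import unicodedata

def words_starting_with_mark(line: str) -> list:
    """Return the space-separated words whose first code point is a combining mark.

    Assumption: a 'combiner' is any character whose Unicode general
    category starts with 'M' (Mn, Mc or Me). Tesseract's own rules may differ.
    """
    return [
        word
        for word in line.split()
        if unicodedata.category(word[0]).startswith("M")
    ]

# Sample: a plain word, then two words that start with marks reported in the log
# (U+A9BC followed by U+A981, and U+A983 on its own).
sample = "\ua9b2\ua9a4\ua995\ua9ab\ua98f \ua9bc\ua981 \ua983"
for word in words_starting_with_mark(sample):
    print(f"starts with combiner U+{ord(word[0]):04X}: {word!r}")
```

Running this over each line of the training text should show whether the flagged words come from stray spaces in the middle of a syllable or from the text itself.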
On Mon, Jul 9, 2018 at 5:20 PM topherseance ***@***.***>
wrote:
> Just tested with other text strings, some of them worked, some did not.
> Here's what we found:
> ꦲꦤꦕꦫꦏ simple phrase, no glyph-combining --> success
> ꦲꦤꦕꦫꦏꦮꦺꦤ꧀ꦠꦼ uses glyph-combining: Pasangan <https://en.wikipedia.org/wiki/Javanese_script#Pasangan> --> success
> ꦲꦤꦕꦫꦏꦮꦺꦤ꧀ꦠꦼꦮꦸ uses glyph-combining: Sandhangan <https://en.wikipedia.org/wiki/Javanese_script#Sandhangan> --> failure
|
I ignored the errors and continued with training using 5 fonts which seem to cover javanese code range.
|
Can you please share the commands and steps you did for the above training? I still can't get the training to work successfully. I used the "training from scratch" method. Again, sorry if it is a newbie mistake.
I did run the Each line contains |
Found another font: Prada |
Start of Javanese script unicode range may need to be added to
https://github.com/tesseract-ocr/tesseract/blob/master/src/training/validator.h
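For reference, the Javanese block in Unicode spans U+A980 to U+A9DF. Below is a small Python sketch (not Tesseract code) that uses those bounds to report characters in a training text that fall outside the block; the constant and function names are hypothetical.

```python
JAVANESE_START = 0xA980  # first code point of the Unicode Javanese block
JAVANESE_END = 0xA9DF    # last code point of the Unicode Javanese block

def outside_javanese(text: str) -> dict:
    """Count non-whitespace characters that fall outside the Javanese block."""
    counts = {}
    for ch in text:
        if ch.isspace() or JAVANESE_START <= ord(ch) <= JAVANESE_END:
            continue
        key = f"U+{ord(ch):04X}"
        counts[key] = counts.get(key, 0) + 1
    return counts

# Two Javanese letters, three Latin letters and a zero-width space.
print(outside_javanese("\ua9b2\ua9a4 abc\u200b"))
```

Leftover Latin letters or invisible format characters in a converted training text would show up in this count.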
…On Thu, Jul 19, 2018 at 3:29 AM topherseance ***@***.***> wrote:
Found another font: Prada
<https://sites.google.com/site/fontsundaprada/unduh-font-sunda-prada/prada.ttf>
|
The text is clearly not encoded in utf-8. |
|
I collected some Javanese aksara text here, probably several thousand text lines: |
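@Shreeshrii when you run your scripts, layertrain.sh or plustrain.sh, did you receive the "Encoding of string failed" error? I ran the script and still got it; the log is quoted in full in the reply below.
Checked the encoding of jav.training_text, I guess it is encoded in UTF-8.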
|
Are you getting the error on all lines of training text or just some lines?
I have had the error before but not with the current set of files.
What is your locale?
…On Mon 6 Aug, 2018, 12:47 PM topherseance, ***@***.***> wrote:
@Shreeshrii <https://github.com/Shreeshrii> when you run your scripts,
layertrain.sh or plustrain.sh, did you receive the Encoding of string
failed error?
I ran the script and still got this:
File /tmp/tmp.fAmoYPWBIL/jav_java/jav_java.Carakan_Anyar.exp1.lstmf page 569 :
Mean rms=5.024%, delta=42.402%, train=100.11%(100%), skip ratio=61.7%
Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8b ffffffea ffffffa6 ffffffb1 ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffff81 20 ffffffea ffffffa6 ffffffa2 ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffffaa ffffffea ffffffa6 ffffffaa ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffb6 20 ffffffea ffffffa6 ffffffa2 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa6 ffffffa9
Can't encode transcription: 'ꦮꦺꦠꦤ꧀ꦱꦶꦁ ꦢꦶꦪꦪꦲꦶ ꦢꦸꦩ' in language ''
Iteration 1171: ALIGNED TRUTH : ꦩꦭꦁꦲꦠꦺꦤꦶ ꦧꦸꦩꦶ ꦧꦸꦩ꧀ꦥꦼꦠ꧀
Iteration 1171: BEST OCR TEXT :
File /tmp/tmp.fAmoYPWBIL/jav_java/jav_java.Tuladha_Jejeg.exp0.lstmf page 480 :
Mean rms=5.024%, delta=42.397%, train=100.111%(100%), skip ratio=61.7%
Iteration 1172: ALIGNED TRUTH : ꦄꦢꦩ꧀ ꦩꦭꦶꦏ꧀ ꦮ꦳ꦶꦢꦺꦪꦺꦴ ꦒ
Iteration 1172: BEST OCR TEXT :
File /tmp/tmp.fAmoYPWBIL/jav_java/jav_java.Carakan_Anyar.exp-1.lstmf page 40 :
Mean rms=5.023%, delta=42.37%, train=100.113%(100%), skip ratio=61.7%
Iteration 1173: ALIGNED TRUTH : ꦏꦤ꧀ꦕ꧀ꦂꦶꦠ꧀ ꦏꦤ꧀ꦛꦶꦁ
Iteration 1173: BEST OCR TEXT :
Checked the encoding of jav.training_text, I guess it is encoded in UTF-8
***@***.***:~/tesseract/langdata/jav$ file -i jav.training_text
jav.training_text: text/plain; charset=utf-8
|
Just some lines, I guess. My locale is EN. |
There is probably some invisible code or character that is not in
unicharset. You can try to identify it from the text and provided codes. If
it is only a few lines, you can ignore.
I have tried more training but still not getting much better results, error
rate is around 7% on training set.
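One way to identify the character from the provided codes is to decode the "Failure bytes" dump. The Python sketch below assumes the `ffffff` prefixes are sign-extended single bytes, so only the low 8 bits of each token are kept; the helper name `decode_failure_bytes` is made up for illustration. The first three bytes in the reported dump, e2 80 8b, are the UTF-8 encoding of U+200B ZERO WIDTH SPACE, which is an invisible character.

```python
import unicodedata

def decode_failure_bytes(dump: str) -> str:
    """Turn a 'Failure bytes:' hex dump into text.

    Assumption: tokens such as 'ffffffe2' are sign-extended bytes, so only
    the low 8 bits of each token are kept before decoding as UTF-8.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8", errors="replace")

# First tokens of the dump reported in this thread.
dump = (
    "ffffffe2 ffffff80 ffffff8b ffffffea ffffffa6 ffffffb1 "
    "ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffff81 20"
)
text = decode_failure_bytes(dump)
for ch in text:
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}")
```

If U+200B is not in the unicharset, removing it from the training text (or adding it deliberately) would be the next thing to try.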
…On Mon 6 Aug, 2018, 10:31 PM topherseance, ***@***.***> wrote:
Just some lines, I guess.
My locale is EN.
|
My locale is en_us.utf8
That might make some difference in the display of the codes.
|
Ok, I think I got the reason for the error. It is related to the text not
passing the 'normalization' rules as setup for the script.
For Javanese, I copied rules from existing languages but these need to be
verified and corrected.
https://github.com/tesseract-ocr/tesseract/blob/master/src/training/validate_javanese.cpp
|
I am getting the following kind of errors. Please check whether the
Javanese text is valid.
Word started with a combiner:0xa9ba
Word started with a combiner:0xa9b4
Normalization failed for string 'ꦺꦴ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦲꦶꦁ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string '꧇ꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦩ꧀ꦥꦸꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦏ꧀ꦢꦴꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string '꧇ꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦲꦶꦁꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦗꦿꦴꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦧꦼꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦤ꧀ꦱꦴꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦮꦶꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦤ꧀ꦏꦁꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦒꦴꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦠ꧀ꦤꦶꦺ'
Invalid start of grapheme sequence:M=0xa9ba
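To see which code points the validation rules reject most often, the messages above can be tallied. This is a rough Python sketch that assumes the log uses only the two message shapes quoted in this thread; the pattern and the function name `tally_offenders` are illustrative, not part of Tesseract.

```python
import re
from collections import Counter

# Assumption: these are the only two message shapes of interest, with an
# optional one-letter class prefix such as "M=" before the hex code.
PATTERN = re.compile(
    r"(Word started with a combiner|Invalid start of grapheme sequence)"
    r":(?:\w=)?0x([0-9a-fA-F]+)"
)

def tally_offenders(log: str) -> Counter:
    """Count how often each code point is named in validator messages."""
    counts = Counter()
    for message, hex_code in PATTERN.findall(log):
        counts[(message, f"U+{int(hex_code, 16):04X}")] += 1
    return counts

log = """Word started with a combiner:0xa9ba
Word started with a combiner:0xa9b4
Invalid start of grapheme sequence:M=0xa9ba
Invalid start of grapheme sequence:M=0xa9ba"""

for (message, code), n in tally_offenders(log).most_common():
    print(f"{n:3d}  {code}  {message}")
```

A tally dominated by one code point, as with 0xa9ba here, suggests a single rule to check rather than many unrelated problems in the text.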
|
@topherseance
The LSTM training langdata for Javanese in Latin script is now available at
https://github.com/tesseract-ocr/langdata_lstm/tree/master/jav
You can convert it to aksara jawa to use for training.
|
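Done converting this file to aksara jawa: https://github.com/tesseract-ocr/langdata_lstm/blob/master/jav/jav.training_text Result: https://github.com/topherseance/javanese-aksara-training-text But what about the other files? For example, .numbers, .wordlist. Is the .numbers file correct? It seems to contain random letters. |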
Thanks.
Please convert the wordlist also.
The numbers file should actually just show patterns where numbers are
found. Please look at the file for English as a sample.
…On Tue, Aug 14, 2018 at 11:05 AM topherseance ***@***.***> wrote:
Done converting this file to aksara jawa:
https://github.com/tesseract-ocr/langdata_lstm/blob/master/jav/jav.training_text
Result:
https://github.com/topherseance/javanese-aksara-training-text
But what about the other files?
For example, .numbers, .wordlist.
Is the .numbers file correct? Seems to contain random letters..
|
The conversion does not retain any spaces between words in the lines. It
seems that Javanese script does not require the spaces, but maybe it will
help in training to have words in a sentence separated by space.
|
Updated my repo to include the conversion result with whitespace: https://github.com/topherseance/javanese-aksara-training-text |
Thank you.
…On Thu, Aug 16, 2018 at 12:22 AM, topherseance ***@***.***> wrote:
Updated my repo to include conversion result with whitespaces:
https://github.com/topherseance/javanese-aksara-training-text
https://github.com/topherseance/javanese-aksara-training-text/blob/master/with-whitespace-combined.txt
|
Hi, first, I'm sorry if this is a different topic, but I think it is quite related. I've opened an issue at #152 (Balinese script OCR) but was still confused (newbie syndrome) until I finally landed here. I'm ready to collect training text but am still on hold because of the same situation as with the Javanese fonts: Balinese script has the Bali SImbar Dwijendra font (see the posted issue), which is most similar to the ancient script but not yet tested for training (I'm afraid of the same incompatibility issue as with Tuladha Jejeg; I will check soon). On the other hand, Balinese script also has Noto Sans/Serif Balinese from Google. Also, I've downloaded https://github.com/Shreeshrii/tessdata_jav_java, and its README.md says "Source code changes will be needed in tesseract... " Could you direct me on how to use all the material here, since Javanese script has had a big influence on Balinese script? Geographically, Bali and Java are also neighbors. Thank you very much in advance for your kind attention. |
I had done aksara jawa training and created two traineddata files - see links given in https://github.com/Shreeshrii/tessdata_jav_java/blob/master/README.md The changes to tesseract codebase were made via: tesseract-ocr/tesseract@0eb7be1#diff-eaafd22a79065f5b8d28318d482e650d tesseract-ocr/tesseract@7957288#diff-eaafd22a79065f5b8d28318d482e650d tesseract-ocr/tesseract@b34cf9d#diff-eaafd22a79065f5b8d28318d482e650d |
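Thanks for the quick response @Shreeshrii. Here is the updated situation:
1. In the attachment, we have 2 Balinese Unicode fonts, namely Vimala (most similar to the non-Unicode Bali SImbar Dwijendra) and Noto Sans Balinese (like Javanese, which has Noto Sans Javanese).
2. I want to use https://github.com/Shreeshrii/tessdata_jav_java as a base for training with my Balinese training text. See the attachment for the Balinese version of Article 1 of the Universal Declaration of Human Rights (https://en.wikipedia.org/wiki/Balinese_script). As for the three-letter code for that text, I don't know: jav for Javanese, bal for Balinese?
The question is how do I do that? I have spent several hours trying to learn and work out a strategy but am still far from it. |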
UDHR is a small text. You will need larger text for training.
LSTM training takes time, days and weeks.
…On Mon, Mar 23, 2020, 10:17 gindrawan ***@***.***> wrote:
Thanks for the quick response @Shreeshrii <https://github.com/Shreeshrii>
Here is the update condition:
1. At the attachment, we have 2 fonts with Balinese-unicode, namely
Vimala (most similar to the non-Balinese-unicode Bali SImbar Dwijendra) and
Noto Sans Balinese (like Javanese that has Noto Sans Javanese).
2. I want to use https://github.com/Shreeshrii/tessdata_jav_java as a
base for training with my Balinese training text. See the attachment for
the Balinese version of Article 1 of the Universal Declaration of Human
Rights (https://en.wikipedia.org/wiki/Balinese_script). And about
three code for that text, I don't know, jav for javanese, bal for balinese?
The question is how do I do that? For several hours try to learn and get
the strategy but still far away..
bal.training_text.txt
<https://github.com/tesseract-ocr/langdata/files/4367222/bal.training_text.txt>
balinese-unicode.zip
<https://github.com/tesseract-ocr/langdata/files/4367183/balinese-unicode.zip>
|
Yes, I know that. I want to start from a small training text first and incrementally add more later (if possible) while getting a better understanding of the learning process. I already have a larger training text in Noto Sans Balinese (up to 30 thousand words, possibly doubling that for Vimala). The number will most likely continue to grow, since there are other sources that haven't been processed yet. I don't know if that number is enough. |
Language and script codes follow the assigned names as per standards bodies.
Balinese language three letter code is ban. Balinese script is bali. The
script can be used for a couple of other languages also.
|
Ok, thanks. I'll post the update at #152. |
@topherseance: if you're still looking for the Javanese OCR, a team in UKDW is working on it.
|
@Shreeshrii & @topherseance: there are more than 20 Javanese script fonts available here: |
@bennylin Are these Unicode fonts? |
Yes |
Are there any labelled datasets with scanned images and their Unicode groundtruth transcription that can be used for training/testing tesseract's jav-java traineddata? What accuracy did the UKDW ocr achieve? |
I'm not in the loop for the research. You might want to contact Dr. Lucia Krisnawati for that. |
Originally posted in forum
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/8r8YOQgTBT4/xHpCTp9DAwAJ