Requirements for Indian languages #10

vermaprashant1 · 2021-03-17T07:02:24Z

TDIL (Technology Development for Indian Languages) has collated requirements for Indian languages, covering Hindi language variations (different keystrokes and spelling variations) with examples, that need to be addressed and reflected in the string-searching recommendation. Kindly guide us on further actions.

r12a · 2021-08-11T16:12:03Z

@vermaprashant1 could you provide a link to the requirements document you created, so that Addison and i can review it?

vermaprashant1 · 2021-08-14T14:47:45Z

@richard, please refer to the draft requirements document
ILanguage-requirement document-character-model.pdf
which covers Hindi as an initial language. It has been circulated to experts for input. The document is being extended to cover variations in other Indian languages apart from Hindi.

vermaprashant1 · 2021-12-17T05:35:12Z

@r12a, can you please share your feedback on this document?

vermaprashant1 · 2022-01-31T05:45:32Z

@richard, please refer to the draft requirements document ILanguage-requirement document-character-model.pdf which covers Hindi as an initial language. It has been circulated to experts for input. The document is being extended to cover variations in other Indian languages apart from Hindi.

@r12a please send your feedback on the shared document. We are also investigating the variations and rules for 5 additional languages and will share them soon.

r12a · 2022-02-09T15:37:37Z

@vermaprashant1 sorry it's taken me so long to get to this. Here are my comments.

[1] I think the document would be much clearer if at the beginning you separated out more cleanly the various ways in which words can be encoded differently, and then after that point out the consequences and proposed advice. I would start with a list of problem cases that would include the following:

1. spelling variants such as the alternation between syllable-final /n/ or nasalisation (eg. the word Hindi) – note that spelling variants occur in most languages, and so it's something any search engine typically has to consider - what other common alternative spellings occur in Hindi besides LA vs LLA (which you mention almost in passing without any examples)? It would be good to have a list of at least the more common ones.
2. the choice of characters to represent nuktas (with a little more detail) – this is a little complicated in Devanagari because normalisation produces different results for different visual combinations, see https://r12a.github.io/scripts/devanagari/#nukta_encoding
3. inappropriate combinations that look the same visually – you don't mention these at all, but it's a significant issue for indic scripts. See examples of this for vowel-sign and independent vowel representation at https://r12a.github.io/scripts/devanagari/#vowelsign_encoding and https://r12a.github.io/scripts/devanagari/#vowelsign_encoding2
4. any combinations of combining characters with a single base that can be typed and stored in an order that causes problems - often this is resolved during normalisation, but there are problematic cases that are not resolved by normalising the text - similar issues are motivating some folks involved with Unicode to produce rendering guidelines for Thai, Khmer and Arabic scripts - these advise reordering of specific sequences of characters so as to produce consistent ordering and ensure that the text renders correctly when displayed. Again, you don't mention any such combinations, and i haven't researched this either yet for Devanagari.
5. Matching needs to decide what to do when format characters appear in text, eg. ZWJ, ZWNJ. In languages like Persian, these can affect the semantics of the text, but i suspect that in Devanagari that is not the case, and they can just be ignored. It's worth checking the full list of invisible characters that may appear in Devanagari text.
6. Graphically similar but semantically different (confusable) code points - i would probably put the OM in this category.

Such an analysis would need to indicate which alternations in sequence are handled by normalisation. Normalisation should be expected as a given, always, before matching, so it's the ones that normalisation doesn't fix that we are particularly interested in.
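As a minimal sketch of the nukta point above, the following Python snippet (standard library `unicodedata` only) shows how NFC treats two Devanagari letters with nukta differently. The constant and function names are illustrative and do not come from the requirements document. QA (U+0958) is a composition exclusion, so NFC leaves it decomposed, whereas NNNA (U+0929) is recomposed; either way, normalising both sides before comparison makes the alternative encodings match.

```python
import unicodedata

# Two encodings of Devanagari QA
QA_PRECOMPOSED = "\u0958"        # DEVANAGARI LETTER QA
QA_DECOMPOSED = "\u0915\u093C"   # KA + SIGN NUKTA

# Two encodings of Devanagari NNNA
NNNA_PRECOMPOSED = "\u0929"      # DEVANAGARI LETTER NNNA
NNNA_DECOMPOSED = "\u0928\u093C" # NA + SIGN NUKTA


def nfc(text: str) -> str:
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


def matches_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return nfc(a) == nfc(b)


# QA is a composition exclusion: NFC yields the decomposed sequence.
print(nfc(QA_PRECOMPOSED) == QA_DECOMPOSED)
# NNNA is not excluded: NFC yields the precomposed letter.
print(nfc(NNNA_DECOMPOSED) == NNNA_PRECOMPOSED)
# Without normalisation the raw strings differ; with it they match.
print(QA_PRECOMPOSED == QA_DECOMPOSED, matches_after_nfc(QA_PRECOMPOSED, QA_DECOMPOSED))
```

Cases like these are already covered by normalisation, so they would belong in the "handled" column of such an analysis.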

It would be interesting to explore what equivalences need to be made for string matching of identifiers (eg. the HTML/CSS case) vs. full text search. For example, in English, spelling differences such as 'internationalization' vs. 'internationalisation' are not seen as equivalent, and maybe the anusvara-conjunct alternation is the same. In full text search, however, searching for one should probably find the other.
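The distinction between the two kinds of matching can be sketched in Python as follows. Everything here is an assumption made for illustration: the function names, the tiny variant table, and the decision to drop ZWJ/ZWNJ in full text search (which follows the suspicion voiced above for Devanagari and would be wrong for languages such as Persian). It is not a proposed algorithm.

```python
import unicodedata

# The word "Hindi" spelled with anusvara and with a conjunct.
HINDI_ANUSVARA = "\u0939\u093F\u0902\u0926\u0940"
HINDI_CONJUNCT = "\u0939\u093F\u0928\u094D\u0926\u0940"

# Hypothetical variant table; a real one would be curated per language.
SPELLING_VARIANTS = [
    {"internationalization", "internationalisation"},
    {HINDI_ANUSVARA, HINDI_CONJUNCT},
]

# Assumption: these format characters may be ignored for Devanagari search.
FORMAT_CHARS = {"\u200C", "\u200D"}  # ZWNJ, ZWJ


def identifier_match(a: str, b: str) -> bool:
    """Identifier-style matching: normalise, then compare exactly."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def search_key(text: str) -> str:
    """Normalise and drop the format characters listed above."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if ch not in FORMAT_CHARS)


def full_text_match(query: str, word: str) -> bool:
    """Looser matching: equal keys, or both in one variant group."""
    q, w = search_key(query), search_key(word)
    if q == w:
        return True
    return any(q in group and w in group for group in SPELLING_VARIANTS)


print(identifier_match(HINDI_ANUSVARA, HINDI_CONJUNCT))  # False
print(full_text_match(HINDI_ANUSVARA, HINDI_CONJUNCT))   # True
```

Keeping the variant table separate from normalisation means the identifier case is unaffected by whatever spelling equivalences a search application decides to add.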

[2] Section 2.2.

It is requires by the Unicode to store and interchanged the characters in the same logical order or we can say that order that user typed through the keyboards

The initial sentence gives the impression that the Unicode Standard requires that users type keys on the keyboard in a particular order. What the standard actually says is that the stored order of characters typically corresponds to the order in which they are typed, but there is no expectation at all about how the keyboard should actually function, as long as it produces an appropriate sequencing of characters in the end: combining marks after base characters, virama between conjunct parts, etc.

Given that, i'm not sure what point you want to make in section 2.2. Any decent keyboard should allow the user to produce good Unicode character sequences, and any kbd that doesn't should be avoided.

[3] Are there different concerns for other languages using devanagari? - eg. i'm thinking about the eye-lash RA in Marathi.

[4] It would be very much easier for me to review your document if it was available in HTML, rather than PDF form. I'd be able to make annotations on the document for my reference, and i'd be able to copy-paste examples for exploration without the junk that PDF produces.

hope that helps.

vermaprashant1 · 2022-02-10T05:22:45Z

Dear Richard, Greetings. Thanks for sharing your valuable inputs. As that draft dates from a long time back, we have already revised the character model requirements document with requirements for 5 additional languages, collected from various language experts. We will also go through your comments and revise the document accordingly wherever required. I will share it with you soon. Thanks, Prashant


r12a · 2022-02-10T09:59:09Z

Sorry for the delay. I will look at your revised document. (Please point to an HTML file, if that's possible.)

In the meantime, could you take a look at w3c/iip#119 (comment) for me? Thanks.

vermaprashant1 · 2022-06-29T11:25:08Z

@r12a
please find the revised document, which covers the requirements and variations of 6 Indian languages.

aphillips · 2022-06-30T16:08:29Z

@vermaprashant1

Hello Prashant,

I was looking over your document today in preparation for taking up some of the text into string-search. I intend to add questions to this thread as I go through your document. For starters, I found this paragraph:

Bengali, is one of the notorious languages with regard to spelling variation. The different spellings of word having same meaning are accepted in the Bengali language, it should be treated as different word although have same meaning. It has 5000+ words which record spelling variations. Typically, spelling variation ranges from 2 to 8 words. Majority of words have 2 variations; some have 3, 4, 8 and more variations. At least there is one word that records 16 spelling variations. Nearly 80% words show two spelling, 7% words show three variations, 7% words show four spellings, and 6% words show more than four variations.

Can you clarify that "different spellings of word... should be treated as different word" means that spelling variations should be treated as if they were different words (not matching)? I know that's what the sentence means, but want to be sure that this was your intention.

Do Bengali users expect document searches not to provide spelling variation matches at all? Or are there features in some programs for users to find such matches?

I will ask other clarifications as I work through the document. Thank you so much for providing this information!!

Various changes in preparation for editing this document to address w3c#10.
- Updated respec to no longer use respec-common
- Removed "conformance" section (since this is Note track)
- Some amount of line-joining
- Fixed several typos
- Copied in shared local.css stylesheet and incorporated the one local style we were using
- Copied in our more-modern "special markup" block
- Made all references informative

r12a · 2022-07-01T09:12:19Z

Can you clarify that "different spellings of word... should be treated as different word" means that spelling variations should be treated as if they were different words (not matching)?

I suspect that the word 'not' is missing between 'should' and 'be'.

asmusf · 2022-07-02T00:30:28Z

I would like to understand whether any of the alternate spellings or alternate code point sequences involve sequences that are listed as "do not use" in the Unicode standard. (Unfortunately, you'll need to read these from tables in the script chapter, they are not defined in any data files). For syntactic elements, editing tools etc. should probably flag any attempted use of "do not use" sequences.

Significant editing of the document in preparation for importing some of the material found in the supplied Indic doc. I basically rewrote section 2. This includes starting to bring in the list of issues we filed against `FindText` back in the day.

vermaprashant1 · 2022-07-13T11:10:43Z

@aphillips

Please find below the feedback received from the Bengali expert.

I was looking over your document today in preparation for taking up some of the text into string-search. I intend to add questions to this thread as I go through your document. For starters, I found this paragraph:

Bengali, is one of the notorious languages with regard to spelling variation. The different spellings of word having same meaning are accepted in the Bengali language, it should be treated as different word although have same meaning. It has 5000+ words which record spelling variations. Typically, spelling variation ranges from 2 to 8 words. Majority of words have 2 variations; some have 3, 4, 8 and more variations. At least there is one word that records 16 spelling variations. Nearly 80% words show two spelling, 7% words show three variations, 7% words show four spellings, and 6% words show more than four variations.

Can you clarify that "different spellings of word... should be treated as different word" means that spelling variations should be treated as if they were different words (not matching)? I know that's what the sentence means, but want to be sure that this was your intention.

Yes! That is the argument. Because, in a text, you never know which spelling will be used by the text creator, and if your inbuilt system does not have all possible variants, then predicting the right spelling matches will be quite problematic.

Do Bengali users expect document searches not to provide spelling variation matches at all? Or are there features in some programs for users to find such matches?

If a document search system can capture all possible variations of all the words that show spelling variations, there is no problem. The reality is that to date we have not come across any such system that can predict all possible variations of spelling. I have not even come across any database that records all possible spelling variations of Bengali words.

r12a · 2022-07-13T11:47:56Z

@aphillips in case it helps, it's much easier to understand what's going on here if you copy the Bengali examples to the bengali character app, then highlight the text line by line and click on Trans-literate. For a slightly deeper investigation, then click on Analyse text. This link will get you started.

r12a · 2022-07-13T12:08:01Z

@vermaprashant1 The section about Encoding variations lists for Bengali the circumgraph vowel signs which are canonically equivalent in Unicode. It doesn't however mention combinations which are not recommended, such as
অা [U+0985 BENGALI LETTER A + U+09BE BENGALI VOWEL SIGN AA] instead of
আ [U+0986 BENGALI LETTER AA]

There are a fair number of these in Indian scripts, esp. for letters with nukta. Is it something you think should be in the document? (I haven't looked closely yet at all the language sections.) This is a misspelling rather than an alternative spelling.
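To illustrate why this case differs from the canonically equivalent circumgraph vowel signs, here is a small Python sketch using only the standard library. The `NOT_RECOMMENDED` table and the `find_not_recommended` helper are hypothetical and contain just the one sequence mentioned above; a real table would have to be compiled by hand from the script chapters of the Unicode Standard.

```python
import unicodedata

# Hypothetical table: not-recommended sequence -> preferred code point(s).
NOT_RECOMMENDED = {
    "\u0985\u09BE": "\u0986",  # LETTER A + VOWEL SIGN AA -> LETTER AA
}


def find_not_recommended(text: str) -> list[tuple[int, str, str]]:
    """Return (offset, sequence, preferred) for each flagged sequence."""
    hits = []
    for seq, preferred in NOT_RECOMMENDED.items():
        start = text.find(seq)
        while start != -1:
            hits.append((start, seq, preferred))
            start = text.find(seq, start + 1)
    return hits


# The circumgraph vowel sign O is canonically equivalent to E + AA,
# so NFC unifies the two encodings.
print(unicodedata.normalize("NFC", "\u09C7\u09BE") == "\u09CB")  # True

# LETTER AA has no canonical decomposition, so NFC does not repair
# the not-recommended sequence; it has to be detected separately.
print(unicodedata.normalize("NFC", "\u0985\u09BE") == "\u0986")  # False
print(find_not_recommended("\u0985\u09BE"))
```

An editing tool could use such a check to flag the sequence, while a search implementation might instead map it to the preferred form before matching.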

aphillips · 2022-07-25T20:04:12Z

@vermaprashant1 Note: the document linked to here: https://tdil.meity.gov.in/WSI/ILs-variations.html is on a server with an expired certificate (it expired at midnight on 24 July), so I can't view it currently.

vermaprashant1 · 2022-08-05T07:43:10Z

@vermaprashant1 The section about Encoding variations lists for Bengali the circumgraph vowel signs which are canonically equivalent in Unicode. It doesn't however mention combinations which are not recommended, such as অা [U+0985 BENGALI LETTER A + U+09BE BENGALI VOWEL SIGN AA] instead of আ [U+0986 BENGALI LETTER AA]

There are a fair number of these in Indian scripts, esp. for letters with nukta. Is it something you think should be in the document? (I haven't looked closely yet at all the language sections.) This is a misspelling rather than an alternative spelling.

Here is the feedback received from the Bengali expert:

These circumgraph vowel signs are typically known as vowel allographs. In Bengali, these are called svarachinha ("vowel signs").
In total, nine (9) vowel graphemes have these allographs: ā-kār, i-kār, ī-kār, u-kār, ū-kār, e-kār, ai-kār, o-kār, and au-kār.
Each vowel allograph must be assigned a unique Unicode value.
Vowel allographs are never combined with vowel graphemes. They can only be combined with consonants and clusters (conjuncts).
আ (ā) is not a combination of অ (a) and া-কার (ā-kār). আ (ā) is a completely separate character with a unique Unicode value. Similarly, অ (a) is a separate character with another unique Unicode value. There should be no confusion regarding this.

We have not included combinations that are not recommended in the document. It covers only alternative spellings and encodings, and usages followed by particular communities.

aphillips · 2022-08-05T15:15:25Z

@vermaprashant1 Thanks for your reply. Note that the expiration of the certificate on the meity.gov.in server means we don't have access to the document. Would it be possible for you to send me a copy to use as a reference?

vermaprashant1 · 2022-08-08T05:30:30Z

Please find the document: ILs-text_variations-final.pdf

Adding examples stripped from the document supplied in w3c#10. Some of these are unclear and need more work.

vermaprashant1 · 2022-09-29T08:42:37Z

Please find the document: ILs-text_variations-final.pdf

@r12a any update on this file?

@r12a

* Remove the extraneous links in code points for Bengali example
* Refactor the example table for other script
* Minor text tweaks

@r12a The Gujarati example looks like it is misspelled, although I took it from the doc in w3c#10. I eliminated the Odia example because it was two seemingly unrelated strings. It would be nice to have anuswara/visarga/candrabindu examples from other scripts to put here. Do you have any handy?

* Homogenize language tags
* Fix table organization to be consistent
* Removed Gujarati example that wasn't explained in w3c#10
* Replaced 'hindi' with 'snake'
* Replaced 'many' with 'several'

aphillips mentioned this issue Jun 30, 2022

Modernization in preparation to edit #11

Merged

aphillips mentioned this issue Jul 4, 2022

First pass at preparing for action 1164 #12

Merged


vermaprashant1 closed this as completed Aug 8, 2022

vermaprashant1 reopened this Aug 8, 2022

aphillips added a commit to aphillips/string-search that referenced this issue Aug 15, 2022

Incorporating some examples

5ae09d9
