Requirements for Indian languages #10

Open
vermaprashant1 opened this issue Mar 17, 2021 · 20 comments

@vermaprashant1

TDIL (Technology Development for Indian Languages) has collated requirements for Indian languages, covering variations in Hindi (different keystroke sequences and spelling variations), with examples that need to be addressed and reflected in the string-searching recommendations. Kindly guide us on further actions.

@r12a
Contributor

r12a commented Aug 11, 2021

@vermaprashant1 could you provide a link to the requirements document you created, so that Addison and i can review it?

@vermaprashant1
Author

@richard, please refer to the draft requirements document ILanguage-requirement document-character-model.pdf, which covers Hindi as an initial language. It has been circulated to experts for their input. The document will be extended further to cover variations in other Indian languages apart from Hindi.

@vermaprashant1
Author

@r12a, can you please share your feedback on this document?

@vermaprashant1
Author

@r12a please send your feedback on the shared document. We are also investigating the variations and rules for five additional languages and will share them soon.

@r12a
Contributor

r12a commented Feb 9, 2022

@vermaprashant1 sorry it's taken me so long to get to this. Here are my comments.

[1] I think the document would be much clearer if at the beginning you separated out more cleanly the various ways in which words can be encoded differently, and then after that point out the consequences and proposed advice. I would start with a list of problem cases that would include the following:

  1. spelling variants such as the alternation between syllable-final /n/ or nasalisation (eg. the word Hindi) – note that spelling variants occur in most languages, and so it's something any search engine typically has to consider - what other common alternative spellings occur in Hindi besides LA vs LLA (which you mention almost in passing without any examples)? It would be good to have a list of at least the more common ones.
  2. the choice of characters to represent nuktas (with a little more detail) – this is a little complicated in Devanagari because normalisation produces different results for different visual combinations, see https://r12a.github.io/scripts/devanagari/#nukta_encoding
  3. inappropriate combinations that look the same visually – you don't mention these at all, but it's a significant issue for indic scripts. See examples of this for vowel-sign and independent vowel representation at https://r12a.github.io/scripts/devanagari/#vowelsign_encoding and https://r12a.github.io/scripts/devanagari/#vowelsign_encoding2
  4. any combinations of combining characters with a single base that can be typed and stored in an order that causes problems - often this is resolved during normalisation, but there are problematic cases that are not resolved by normalising the text - similar issues are motivating some folks involved with Unicode to produce rendering guidelines for Thai, Khmer and Arabic scripts - these advise reordering of specific sequences of characters so as to produce consistent ordering and ensure that the text renders correctly when displayed. Again, you don't mention any such combinations, and i haven't researched this either yet for Devanagari.
  5. Matching needs to decide what to do when format characters appear in text, eg. ZWJ, ZWNJ. In languages like Persian, these can affect the semantics of the text, but i suspect that in Devanagari that is not the case, and they can just be ignored. It's worth checking the full list of invisible characters that may appear in Devanagari text.
  6. Graphically similar but semantically different (confusable) code points - i would probably put the OM in this category.

Such an analysis would need to indicate which alternations in sequence are handled by normalisation. Normalisation should be expected as a given, always, before matching, so it's the ones that normalisation doesn't fix that we are particularly interested in.
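
To illustrate the point about nuktas in item 2, here is a minimal Python sketch using only the standard `unicodedata` module. The helper names (`nfc`, `code_points`) are made up for this example. It shows that NFC yields a decomposed sequence for one nukta letter (QA, which is a composition exclusion) but a precomposed code point for another (NNNA), so matching code cannot assume that "normalised" means "precomposed" or "decomposed" across the board.

```python
import unicodedata

def nfc(s: str) -> str:
    """Return the NFC-normalised form of s (hypothetical helper for this sketch)."""
    return unicodedata.normalize("NFC", s)

def code_points(s: str) -> str:
    """Render a string as a list of U+XXXX code points for inspection."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

qa_precomposed = "\u0958"        # DEVANAGARI LETTER QA
qa_sequence = "\u0915\u093C"     # KA + NUKTA
nnna_precomposed = "\u0929"      # DEVANAGARI LETTER NNNA
nnna_sequence = "\u0928\u093C"   # NA + NUKTA

# QA is excluded from composition, so NFC leaves (or produces) KA + NUKTA.
print(code_points(nfc(qa_precomposed)))
# NNNA is not excluded, so NFC composes NA + NUKTA into the single code point.
print(code_points(nfc(nnna_sequence)))

# Either way, both spellings of each letter compare equal once normalised.
print(nfc(qa_precomposed) == nfc(qa_sequence))
print(nfc(nnna_precomposed) == nfc(nnna_sequence))
```

Cases like these are handled by normalisation; the cases of interest for the document are the ones where the two normalised strings still differ.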

It would be interesting to explore what equivalences need to be made for string matching of identifiers (eg. the HTML/CSS case) vs. full text search. For example, in English, spelling differences such as 'internationalization' vs. 'internationalisation' are not seen as equivalent for identifiers, and maybe the anusvara-conjunct alternation is the same. In full text search, however, searching for one should probably find the other.
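
A rough sketch of that two-level idea, in Python. Everything here is an assumption made for illustration: the function names, the decision to drop ZWJ/ZWNJ for search (which would need checking for Devanagari, as noted in item 5), and the single folding rule (NA + VIRAMA before a dental consonant is treated like ANUSVARA). A real search system would need a much fuller, language-specific rule set.

```python
import re
import unicodedata

ZWJ, ZWNJ = "\u200D", "\u200C"
NA, VIRAMA, ANUSVARA = "\u0928", "\u094D", "\u0902"
DENTALS = "\u0924\u0925\u0926\u0927"  # TA, THA, DA, DHA

# Hypothetical folding rule: NA + VIRAMA immediately before a dental consonant.
_NA_CONJUNCT = re.compile(f"{NA}{VIRAMA}(?=[{DENTALS}])")

def identifier_key(s: str) -> str:
    """Strict key: normalisation only, so spelling variants stay distinct."""
    return unicodedata.normalize("NFC", s)

def search_key(s: str) -> str:
    """Loose key for full text search: normalise, drop joiners, fold one variant."""
    s = unicodedata.normalize("NFC", s)
    s = s.replace(ZWJ, "").replace(ZWNJ, "")   # assumption: joiners carry no meaning here
    return _NA_CONJUNCT.sub(ANUSVARA, s)

hindi_conjunct = "\u0939\u093F\u0928\u094D\u0926\u0940"  # spelled with NA + VIRAMA + DA
hindi_anusvara = "\u0939\u093F\u0902\u0926\u0940"        # spelled with ANUSVARA

print(identifier_key(hindi_conjunct) == identifier_key(hindi_anusvara))  # distinct identifiers
print(search_key(hindi_conjunct) == search_key(hindi_anusvara))          # same search key
```

Keeping the two keys as separate functions makes the policy choice explicit: identifiers only get the equivalences that normalisation provides, while search opts in to additional folding.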

[2] Section 2.2.

It is requires by the Unicode to store and interchanged the characters in the same logical order or we can say that order that user typed through the keyboards

The initial sentence gives the impression that the Unicode Standard requires that users type keys on the keyboard in a particular order. What the standard actually says is that the stored order of characters typically corresponds to the order in which they are typed, but there is no expectation at all about how the keyboard should actually function, as long as it produces an appropriate sequencing of characters in the end: combining marks after base characters, virama between conjunct parts, etc.

Given that, i'm not sure what point you want to make in section 2.2. Any decent keyboard should allow the user to produce good Unicode character sequences, and any kbd that doesn't should be avoided.
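
As a small illustration of stored (logical) order, the Python snippet below lists the code points of two short Devanagari strings using the standard `unicodedata` module. The `describe` helper is a name invented for this example. The vowel sign I is stored after its base consonant even though it is displayed to the left of it, and the virama sits between the two parts of the conjunct, whatever keys the user pressed to get there.

```python
import unicodedata

def describe(s: str):
    """Return (code point, character name, general category) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c)) for c in s]

ki = "\u0915\u093F"          # KA followed by VOWEL SIGN I (rendered with the sign on the left)
ksha = "\u0915\u094D\u0937"  # KA, VIRAMA, SSA (rendered as a conjunct)

for label, text in (("ki", ki), ("ksha", ksha)):
    print(label)
    for cp, name, category in describe(text):
        print(f"  {cp}  {category}  {name}")
```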

[3] Are there different concerns for other languages using devanagari? - eg. i'm thinking about the eye-lash RA in Marathi.

[4] It would be very much easier for me to review your document if it was available in HTML, rather than PDF form. I'd be able to make annotations on the document for my reference, and i'd be able to copy-paste examples for exploration without the junk that PDF produces.

hope that helps.

@vermaprashant1
Author

vermaprashant1 commented Feb 10, 2022 via email

@r12a
Contributor

r12a commented Feb 10, 2022

Sorry for the delay. I will look at your revised document. (Please point to an HTML file, if that's possible.)

In the meantime, could you take a look at w3c/iip#119 (comment) for me? Thanks.

@vermaprashant1
Author

@r12a Please find the revised document, which covers the requirements and variations for six Indian languages.

@aphillips
Contributor

@vermaprashant1

Hello Prashant,

I was looking over your document today in preparation for taking up some of the text into string-search. I intend to add questions to this thread as I go through your document. For starters, I found this paragraph:

Bengali, is one of the notorious languages with regard to spelling variation. The different spellings of word having same meaning are accepted in the Bengali language, it should be treated as different word although have same meaning. It has 5000+ words which record spelling variations.Typically, spelling variation ranges from 2 to 8 words.Majority of words have 2 variations; some have 3, 4, 8 and more variations. At least there is one word that records 16 spelling variations.Nearly 80% words show two spelling, 7% words show three variations, 7% words show four spellings, and 6% words show more than four variations.

Can you clarify that "different spellings of word... should be treated as different word" means that spelling variations should be treated as if they were different words (not matching)? I know that's what the sentence means, but want to be sure that this was your intention.

Do Bengali users expect document searches not to provide spelling variation matches at all? Or are there features in some programs for users to find such matches?

I will ask other clarifications as I work through the document. Thank you so much for providing this information!!

aphillips added a commit to aphillips/string-search that referenced this issue Jun 30, 2022
Various changes in preparation for editing this document to address w3c#10.

- Updated respec to no longer use respec-common
- Removed "conformance" section (since this is Note track)
- Some amount of line-joining
- Fixed several typos
- Copied in shared local.css stylesheet and incorporated the one local style we were using
- Copied in our more-modern "special markup" block
- Made all references informative
@r12a
Contributor

r12a commented Jul 1, 2022

Can you clarify that "different spellings of word... should be treated as different word" means that spelling variations should be treated as if they were different words (not matching)?

I suspect that the word 'not' is missing between 'should' and 'be'.

@asmusf

asmusf commented Jul 2, 2022

I would like to understand whether any of the alternate spellings or alternate code point sequences involve sequences that are listed as "do not use" in the Unicode standard. (Unfortunately, you'll need to read these from tables in the script chapter, they are not defined in any data files). For syntactic elements, editing tools etc. should probably flag any attempted use of "do not use" sequences.

aphillips added a commit to aphillips/string-search that referenced this issue Jul 4, 2022
Significant editing of the document in preparation for importing some of the material found in the supplied Indic doc. I basically rewrote section 2. This includes starting to bring in the list of issues we filed against `FindText` back in the day.
@vermaprashant1
Author

@aphillips

Please find below the feedback received from a Bengali expert.

Can you clarify that "different spellings of word... should be treated as different word" means that spelling variations should be treated as if they were different words (not matching)? I know that's what the sentence means, but want to be sure that this was your intention.

Yes! That is the argument. Because, in a text, you never know which spelling will be used by the text creator, and if your inbuilt system does not have all possible variants, then predicting the right spelling matches will be quite problematic.

Do Bengali users expect document searches not to provide spelling variation matches at all? Or are there features in some programs for users to find such matches?

If a document search system can capture all possible variations of all the words that show spelling variations, there is no problem. The reality is that to date we have not come across any such system that can predict all possible variations of spelling. I have not even come across any database that records all possible spelling variations of Bengali words.

@r12a
Contributor

r12a commented Jul 13, 2022

@aphillips in case it helps, it's much easier to understand what's going on here if you copy the Bengali examples to the bengali character app, then highlight the text line by line and click on Trans-literate. For a slightly deeper investigation, then click on Analyse text. This link will get you started.

@r12a
Contributor

r12a commented Jul 13, 2022

@vermaprashant1 The section about Encoding variations lists for Bengali the circumgraph vowel signs which are canonically equivalent in Unicode. It doesn't however mention combinations which are not recommended, such as
অা [U+0985 BENGALI LETTER A + U+09BE BENGALI VOWEL SIGN AA] instead of
আ [U+0986 BENGALI LETTER AA]

There are a fair number of these in Indian scripts, esp. for letters with nukta. Is it something you think should be in the document? (I haven't looked closely yet at all the language sections.) This is a misspelling rather than an alternative spelling.
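
To make the distinction concrete, here is a minimal Python sketch. The `NOT_RECOMMENDED` table and the `flag_not_recommended` function are hypothetical names for this example, and the table holds only the one sequence mentioned above; a real tool would need the full lists from the relevant script chapters of the Unicode Standard. The sketch shows that a two-part (circumgraph) vowel sign is unified by NFC, whereas the non-recommended A + VOWEL SIGN AA sequence survives normalisation unchanged and has to be detected separately.

```python
import unicodedata

# Illustrative table only: non-recommended sequence -> preferred code point.
NOT_RECOMMENDED = {
    "\u0985\u09BE": "\u0986",  # BENGALI LETTER A + VOWEL SIGN AA -> BENGALI LETTER AA
}

def flag_not_recommended(text: str):
    """Return (index, sequence, preferred) for each non-recommended sequence found."""
    hits = []
    for bad, preferred in NOT_RECOMMENDED.items():
        start = text.find(bad)
        while start != -1:
            hits.append((start, bad, preferred))
            start = text.find(bad, start + 1)
    return sorted(hits)

ko_precomposed = "\u0995\u09CB"      # KA + VOWEL SIGN O
ko_sequence = "\u0995\u09C7\u09BE"   # KA + VOWEL SIGN E + VOWEL SIGN AA
a_aa_sequence = "\u0985\u09BE"       # the non-recommended spelling of LETTER AA

# Canonically equivalent spellings are unified by normalisation.
print(unicodedata.normalize("NFC", ko_sequence) == ko_precomposed)
# The non-recommended sequence is not: NFC leaves it as two code points.
print(unicodedata.normalize("NFC", a_aa_sequence) == "\u0986")
# So it has to be flagged (or mapped) by a separate check.
print(flag_not_recommended(a_aa_sequence + "\u09AE"))
```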

@aphillips
Contributor

@vermaprashant1 Note: the document linked to here: https://tdil.meity.gov.in/WSI/ILs-variations.html is on a server with an expired certificate (it expired at midnight on 24 July), so I can't view it currently.

@vermaprashant1
Author

Regarding @r12a's comment above on combinations that are not recommended, here is the feedback received from a Bengali expert:

  1. These circumgraph vowel signs are typically known as vowel allographs. In Bengali, these are called 'svarachinha' ("vowel signs").
  2. In total, nine (9) vowel graphemes have these allographs: ā-kār, i-kār, ī-kār, u-kār, ū-kār, e-kār, ai-kār, o-kār, and au-kār.
  3. Each vowel allograph must be assigned a unique Unicode value.
  4. Vowel allographs are never combined with vowel graphemes. They can only be combined with consonants and clusters (conjuncts).
  5. আ (ā) is not a combination of অ (a) and া-কার (ā-kār). আ (ā) is a completely separate character with a unique Unicode value. Similarly, অ (a) is a separate character with another unique Unicode value. There should be no confusion regarding this.

We have not included combinations which are not recommended in the document. It covers only alternative spellings and encodings that are in actual use by a particular community.

@aphillips
Contributor

@vermaprashant1 Thanks for your reply. Note that the expiration of the certificate on the meity.gov.in server means we don't have access to the document. Would it be possible for you to send me a copy to use as a reference?

@vermaprashant1
Author

Please find the document: ILs-text_variations-final.pdf

aphillips added a commit to aphillips/string-search that referenced this issue Aug 15, 2022
Adding examples stripped from the document supplied in w3c#10. Some of these are unclear and need more work.
@vermaprashant1
Author

@r12a any update on this file?

aphillips added a commit to aphillips/string-search that referenced this issue Nov 18, 2022
* Remove the extraneous links in code points for Bengali example
* Refactor the example table for other script
* Minor text tweaks

@r12a The Gujarati example looks like it is misspelled, although I took it from the doc in w3c#10. I eliminated the Odia example because it was two seemingly unrelated strings. It would be nice to have anuswara/visarga/candrabindu examples from other scripts to put here. Do you have any handy?
aphillips added a commit to aphillips/string-search that referenced this issue Nov 29, 2022
* Homogenize language tags
* Fix table organization to be consistent
* Removed Gujarati example that wasn't explained in w3c#10
* Replaced 'hindi' with 'snake'
* Replaced 'many' with 'several'