Requirements for Indian languages #10
Comments
@vermaprashant1 could you provide a link to the requirements document you created, so that Addison and i can review it? |
@richard, please refer to the draft requirements document. |
@r12a, can you please share your feedback on this document? |
@r12a please send your feedback on the shared document. We are also investigating the variations and rules for five additional languages and will share them soon. |
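@vermaprashant1 sorry it's taken me so long to get to this. Here are my comments.
[1] I think the document would be much clearer if at the beginning you separated out more cleanly the various ways in which words can be encoded differently, and then after that point out the consequences and proposed advice. I would start with a list of problem cases that would include the following:
1. spelling variants such as the alternation between syllable-final /n/ or nasalisation (eg. the word Hindi) – note that spelling variants occur in most languages, and so it's something any search engine typically has to consider - what other common alternative spellings occur in Hindi besides LA vs LLA (which you mention almost in passing without any examples)? It would be good to have a list of at least the more common ones.
2. the choice of characters to represent nuktas (with a little more detail) – this is a little complicated in Devanagari because normalisation produces different results for different visual combinations, see https://r12a.github.io/scripts/devanagari/#nukta_encoding
3. inappropriate combinations that look the same visually – you don't mention these at all, but it's a significant issue for indic scripts. See examples of this for vowel-sign and independent vowel representation at https://r12a.github.io/scripts/devanagari/#vowelsign_encoding and https://r12a.github.io/scripts/devanagari/#vowelsign_encoding2
4. any combinations of combining characters with a single base that can be typed and stored in an order that causes problems - often this is resolved during normalisation, but there are problematic cases that are not resolved by normalising the text - similar issues are motivating some folks involved with Unicode to produce rendering guidelines for Thai, Khmer and Arabic scripts - these advise reordering of specific sequences of characters so as to produce consistent ordering and ensure that the text renders correctly when displayed. Again, you don't mention any such combinations, and i haven't researched this either yet for Devanagari.
5. Matching needs to decide what to do when format characters appear in text, eg. ZWJ, ZWNJ. In languages like Persian, these can affect the semantics of the text, but i suspect that in Devanagari that is not the case, and they can just be ignored. It's worth checking the full list of invisible characters that may appear in Devanagari text.
6. Graphically similar but semantically different (confusable) code points - i would probably put the OM in this category.
Such an analysis would need to indicate which alternations in sequence are handled by normalisation. Normalisation should be expected as a given, always, before matching, so it's the ones that normalisation doesn't fix that we are particularly interested in.
It would be interesting to explore what equivalences need to be made for string matching of identifiers (eg. the HTML/CSS case) vs. full text search. For example, in English spelling differences such as 'internationalization' vs. 'internationalisation' are not seen as equivalent, and maybe the anusvara-conjunct alternate is the same. In full text search, however, searching for one should probably find the other.
[2] Section 2.2.
"It is requires by the Unicode to store and interchanged the characters in the same logical order or we can say that order that user typed through the keyboards"
The initial sentence gives the impression that the Unicode Standard requires that users type keys on the keyboard in a particular order. What the standard actually says is that the stored order of characters typically corresponds to the order in which they are typed, but there is no expectation at all about how the keyboard should actually function, as long as it produces an appropriate sequencing of characters in the end: combining marks after base characters, virama between conjunct parts, etc. Given that, i'm not sure what point you want to make in section 2.2. Any decent keyboard should allow the user to produce good Unicode character sequences, and any kbd that doesn't should be avoided.
[3] Are there different concerns for other languages using Devanagari? - eg. i'm thinking about the eye-lash RA in Marathi.
[4] It would be very much easier for me to review your document if it was available in HTML, rather than PDF form. I'd be able to make annotations on the document for my reference, and i'd be able to copy-paste examples for exploration without the junk that PDF produces.
hope that helps. |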
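To make the point about normalising before matching concrete, here is a minimal Python sketch using the standard `unicodedata` module. It compares two Devanagari letters with nukta, each in a precomposed and a decomposed spelling, and shows that NFC settles on a different form for each, while the two spellings of the same letter still compare equal once both are normalised. The helper names `nfc` and `codepoints` are illustrative only and do not come from the requirements document.

```python
import unicodedata


def nfc(text):
    """Normalise to NFC, as a matcher would do before any comparison."""
    return unicodedata.normalize("NFC", text)


def codepoints(text):
    """Render a string as space-separated U+XXXX labels for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# QA: precomposed U+0958 vs. KA + NUKTA (U+0915 U+093C).
# U+0958 is a composition exclusion, so NFC gives the two-character sequence.
qa_precomposed = "\u0958"
qa_sequence = "\u0915\u093C"

# NNNA: precomposed U+0929 vs. NA + NUKTA (U+0928 U+093C).
# U+0929 is not excluded, so NFC gives the single precomposed character.
nnna_precomposed = "\u0929"
nnna_sequence = "\u0928\u093C"

for label, a, b in [
    ("QA", qa_precomposed, qa_sequence),
    ("NNNA", nnna_precomposed, nnna_sequence),
]:
    # Raw strings differ, normalised strings match.
    print(label, a == b, nfc(a) == nfc(b), codepoints(nfc(a)))
```

Because both spellings converge under NFC, these nukta cases fall into the "handled by normalisation" group; the cases that need separate advice are the ones where the normalised strings still differ.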
Dear Richard,
Greetings.
Thanks for sharing your valuable input. As that was a long time ago, we have
already revised the character model requirements document to add requirements
for five more languages. These were collected from various language
experts. We will also go through your comments and revise the document
accordingly wherever required. I will share it with you soon.
Thanks,
Prashant
On Wed, Feb 9, 2022 at 7:37 AM r12a wrote:
@vermaprashant1 sorry it's taken me
so long to get to this. Here are my comments.
[1] I think the document would be much clearer if at the beginning you
separated out more cleanly the various ways in which words can be encoded
differently, and then *after that* point out the consequences and
proposed advice. I would start with a list of problem cases that would
include the following:
1. spelling variants such as the alternation between syllable-final
/n/ or nasalisation (eg. the word Hindi) – note that spelling variants
occur in most languages, and so it's something any search engine typically
has to consider - what other common alternative spellings occur in Hindi
besides LA vs LLA (which you mention almost in passing without any
examples)? It would be good to have a list of at least the more common ones.
2. the choice of characters to represent nuktas (with a little more
detail) – this is a little complicated in Devanagari because normalisation
produces different results for different visual combinations, see
https://r12a.github.io/scripts/devanagari/#nukta_encoding
3. inappropriate combinations that look the same visually – you don't
mention these at all, but it's a significant issue for indic scripts. See
examples of this for vowel-sign and independent vowel representation at
https://r12a.github.io/scripts/devanagari/#vowelsign_encoding and
https://r12a.github.io/scripts/devanagari/#vowelsign_encoding2
4. any combinations of combining characters with a single base that
can be typed and stored in an order that causes problems - often this is
resolved during normalisation, but there are problematic cases that are not
resolved by normalising the text - similar issues are motivating some folks
involved with Unicode to produce rendering guidelines for Thai, Khmer and
Arabic scripts - these advise reordering of specific sequences of
characters so as to produce consistent ordering and ensure that the text
renders correctly when displayed. Again, you don't mention any such
combinations, and i haven't researched this either yet for Devanagari.
5. Matching needs to decide what to do when format characters appear
in text, eg. ZWJ, ZWNJ. In languages like Persian, these can affect the
semantics of the text, but i suspect that in Devanagari that is not the
case, and they can just be ignored. It's worth checking the full list of
invisible characters that may appear in Devanagari text.
6. Graphically similar but semantically different (confusable) code
points - i would probably put the OM in this category.
Such an analysis would need to indicate which alternations in sequence are
handled by normalisation. Normalisation should be expected as a given,
always, before matching, so it's the ones that normalisation doesn't fix
that we are particularly interested in.
It would be interesting to explore what equivalences need to be
made for string matching of identifiers (eg. the HTML/CSS case) vs. full
text search. For example, in English spelling differences such as
'internationalization' vs. 'internationalisation' are not seen as
equivalent, and maybe the anusvara-conjunct alternate is the same. In full
text search, however, searching for one should probably find the other.
[2] Section 2.2.
"It is requires by the Unicode to store and interchanged the characters in
the same logical order or we can say that order that user typed through the
keyboards"
The initial sentence gives the impression that the Unicode Standard
requires that users type keys on the keyboard in a particular order. What
the standard actually says is that the stored order of characters
*typically* corresponds to the order in which they are typed, but there
is no expectation at all about how the keyboard should actually function,
as long as it produces an appropriate sequencing of characters in the end:
combining marks after base characters, virama between conjunct parts, etc.
Given that, i'm not sure what point you want to make in section 2.2. Any
decent keyboard should allow the user to produce good Unicode character
sequences, and any kbd that doesn't should be avoided.
[3] Are there different concerns for other languages using Devanagari? -
eg. i'm thinking about the eye-lash RA in Marathi.
[4] It would be very much easier for me to review your document if it was
available in HTML, rather than PDF form. I'd be able to make annotations on
the document for my reference, and i'd be able to copy-paste examples for
exploration without the junk that PDF produces.
hope that helps.
--
Thanks & Regards,
Prashant Verma I Program Manager
Web Standardization Initiative(WSI) , MeitY
New Delhi
Cell : +91-8800521042
Website : http://tdil.meity.gov.in/WSI/AboutWSI.aspx
|
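On the question of format characters raised in the review comments (ZWJ and ZWNJ), the sketch below shows one way a matcher might treat them. It is only a sketch under the assumption, stated tentatively in the review, that these characters can be ignored for Devanagari; that is why ignoring them is a switch and not hard-wired. The names `format_chars_in` and `match_key` are hypothetical.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def format_chars_in(text):
    """List the format characters (general category Cf) found in the text."""
    return [f"U+{ord(ch):04X}" for ch in text if unicodedata.category(ch) == "Cf"]


def match_key(text, ignore_format=True):
    """Build a comparison key: NFC first, then optionally drop ZWJ/ZWNJ."""
    text = unicodedata.normalize("NFC", text)
    if ignore_format:
        text = text.replace(ZWJ, "").replace(ZWNJ, "")
    return text


# KA + VIRAMA + SSA, stored with and without a ZWJ after the virama.
plain = "\u0915\u094D\u0937"
with_zwj = "\u0915\u094D\u200D\u0937"

print(format_chars_in(with_zwj))
print(match_key(plain) == match_key(with_zwj))
print(match_key(plain, ignore_format=False) == match_key(with_zwj, ignore_format=False))
```

`format_chars_in` is also a quick way to survey which invisible characters actually occur in a body of text before deciding which of them a matcher should skip.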
Sorry for the delay. I will look at your revised document. (Please point to an HTML file, if that's possible.) In the meantime, could you take a look at w3c/iip#119 (comment) for me? Thanks. |
Hello Prashant, I was looking over your document today in preparation for taking up some of the text into string-search. I intend to add questions to this thread as I go through your document. For starters, I found this paragraph:
Can you clarify that "different spellings of word... should be treated as different word" means that spelling variations should be treated as if they were different words (not matching)? I know that's what the sentence means, but want to be sure that this was your intention. Do Bengali users expect document searches not to provide spelling variation matches at all? Or are there features in some programs for users to find such matches? I will ask for other clarifications as I work through the document. Thank you so much for providing this information!! |
Various changes in preparation for editing this document to address w3c#10.
- Updated respec to no longer use respec-common
- Removed "conformance" section (since this is Note track)
- Some amount of line-joining
- Fixed several typos
- Copied in shared local.css stylesheet and incorporated the one local style we were using
- Copied in our more-modern "special markup" block
- Made all references informative
I suspect that the word 'not' is missing between 'should' and 'be'. |
I would like to understand whether any of the alternate spellings or alternate code point sequences involve sequences that are listed as "do not use" in the Unicode standard. (Unfortunately, you'll need to read these from tables in the script chapter, they are not defined in any data files). For syntactic elements, editing tools etc. should probably flag any attempted use of "do not use" sequences. |
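A tool that flags such sequences could start from a small lookup table. The Python sketch below is only an illustration: the four Devanagari entries are a small subset of the independent-vowel cases and should be verified against the tables in the Devanagari section of the Unicode Standard before any real use. `DO_NOT_USE` and `find_discouraged` are hypothetical names.

```python
import unicodedata

# Discouraged sequence -> preferred single code point (illustrative subset,
# to be checked against the Unicode Standard's Devanagari tables).
DO_NOT_USE = {
    "\u0905\u093E": "\u0906",  # A + vowel sign AA  -> letter AA
    "\u090F\u0947": "\u0910",  # E + vowel sign E   -> letter AI
    "\u0905\u094B": "\u0913",  # A + vowel sign O   -> letter O
    "\u0905\u094C": "\u0914",  # A + vowel sign AU  -> letter AU
}


def find_discouraged(text):
    """Return sorted (offset, sequence, preferred) tuples for every hit."""
    hits = []
    for seq, preferred in DO_NOT_USE.items():
        start = text.find(seq)
        while start != -1:
            hits.append((start, seq, preferred))
            start = text.find(seq, start + 1)
    return sorted(hits)


# The same word stored with the discouraged sequence and with the preferred letter.
discouraged = "\u0905\u093E\u092E"   # A + vowel sign AA + MA
preferred = "\u0906\u092E"           # AA + MA

print(find_discouraged(discouraged))
print(find_discouraged(preferred))
# Normalisation does not repair this: the two strings stay different under NFC.
print(unicodedata.normalize("NFC", discouraged) == unicodedata.normalize("NFC", preferred))
```

Since normalisation leaves these pairs distinct, an editing tool has to flag them at input time, which is the suggestion made above for syntactic elements.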
Significant editing of the document in preparation for importing some of the material found in the supplied Indic doc. I basically rewrote section 2. This includes starting to bring in the list of issues we filed against `FindText` back in the day.
Please find below the feedback received from the Bengali expert.
Yes! That is the argument. Because, in a text, you never know which spelling will be used by the text creator, and if your inbuilt system does not have all possible variants, then predicting the right spelling matches will be quite problematic.
If a document search system can capture all possible variations of all the words that show spelling variations, there is no problem. The reality is that to date we have not come across any such system that can predict all possible variations of spelling. I have not even come across any database that records all possible spelling variations of Bengali words. |
@aphillips in case it helps, it's much easier to understand what's going on here if you copy the Bengali examples to the bengali character app, then highlight the text line by line and click on |
@vermaprashant1 The section about Encoding variations lists for Bengali the circumgraph vowel signs which are canonically equivalent in Unicode. It doesn't however mention combinations which are not recommended, such as There are a fair number of these in Indian scripts, esp. for letters with nukta. Is it something you think should be in the document? (I haven't looked closely yet at all the language sections.) This is a misspelling rather than an alternative spelling. |
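For reference, the canonical equivalence of the Bengali circumgraph vowel signs can be checked with a few lines of Python and the standard `unicodedata` module. This is a minimal sketch; `canonically_equivalent` is an illustrative helper name, not something taken from the document under review.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


KA = "\u0995"  # BENGALI LETTER KA, used as the base consonant

# Vowel sign O: single code point U+09CB vs. U+09C7 + U+09BE.
o_single = KA + "\u09CB"
o_parts = KA + "\u09C7\u09BE"

# Vowel sign AU: single code point U+09CC vs. U+09C7 + U+09D7.
au_single = KA + "\u09CC"
au_parts = KA + "\u09C7\u09D7"

print(o_single == o_parts, canonically_equivalent(o_single, o_parts))
print(au_single == au_parts, canonically_equivalent(au_single, au_parts))
# NFC recomposes the two-part spellings into the single code points.
print(unicodedata.normalize("NFC", o_parts) == o_single)
```

These pairs are resolved by normalising before matching, so they need no additional matching rule; combinations that are not canonically equivalent are a separate case.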
@vermaprashant1 Note: the document linked to here: https://tdil.meity.gov.in/WSI/ILs-variations.html is on a server with an expired certificate (it expired at midnight on 24 July), so I can't view it currently. |
Here is the feedback received from the Bengali expert:
We have not included combinations which are not recommended in the document. It covers only alternative spellings/encodings, and facts about what is used by a particular community. |
@vermaprashant1 Thanks for your reply. Note that the expiration of the certificate on the meity.gov.in server means we don't have access to the document. Would it be possible for you to send me a copy to use as a reference? |
Please find the document. |
Adding examples stripped from the document supplied in w3c#10. Some of these are unclear and need more work.
@r12a any update on this file? |
* Remove the extraneous links in code points for Bengali example
* Refactor the example table for other script
* Minor text tweaks
@r12a The Gujarati example looks like it is misspelled, although I took it from the doc in w3c#10. I eliminated the Odia example because it was two seemingly unrelated strings. It would be nice to have anuswara/visarga/candrabindu examples from other scripts to put here. Do you have any handy?
* Homogenize language tags
* Fix table organization to be consistent
* Removed Gujarati example that wasn't explained in w3c#10
* Replaced 'hindi' with 'snake'
* Replaced 'many' with 'several'
TDIL (Technology Development for Indian Languages) has collated Indian language requirements concerning Hindi language variations (different keystrokes and spelling variations), with examples that need to be considered and reflected in the string searching recommendations. Kindly guide us on further actions.
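As an illustration of the kind of Hindi spelling variation at stake, the Python sketch below folds one well-known alternation for full-text search: a nasal consonant plus virama before a stop of the same class, versus an anusvara (as in the two common spellings of the word Hindi). It is a sketch under stated assumptions, not a recommendation: the class table `VARGA` and the function `search_key` are hypothetical, the fold is applied blindly without any dictionary, and the thread suggests this kind of folding suits full-text search but probably not identifier matching.

```python
import re
import unicodedata

ANUSVARA = "\u0902"
VIRAMA = "\u094D"

# Class nasal -> the stops of the same class (assumed fold table).
VARGA = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # NGA before KA, KHA, GA, GHA
    "\u091E": "\u091A\u091B\u091C\u091D",  # NYA before CA, CHA, JA, JHA
    "\u0923": "\u091F\u0920\u0921\u0922",  # NNA before TTA, TTHA, DDA, DDHA
    "\u0928": "\u0924\u0925\u0926\u0927",  # NA  before TA, THA, DA, DHA
    "\u092E": "\u092A\u092B\u092C\u092D",  # MA  before PA, PHA, BA, BHA
}

# One pattern per class: nasal + virama, only when a same-class stop follows.
_PATTERNS = [
    re.compile(f"{nasal}{VIRAMA}(?=[{stops}])") for nasal, stops in VARGA.items()
]


def search_key(text):
    """Normalise, then rewrite nasal + virama before a same-class stop as anusvara."""
    text = unicodedata.normalize("NFC", text)
    for pattern in _PATTERNS:
        text = pattern.sub(ANUSVARA, text)
    return text


hindi_conjunct = "\u0939\u093F\u0928\u094D\u0926\u0940"  # spelled with NA + VIRAMA + DA
hindi_anusvara = "\u0939\u093F\u0902\u0926\u0940"        # spelled with ANUSVARA + DA

print(hindi_conjunct == hindi_anusvara)
print(search_key(hindi_conjunct) == search_key(hindi_anusvara))
```

A rule like this only covers one predictable alternation; variants that follow no regular pattern would still need a curated list of equivalents.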