Need guidance on DOMString vs USVString #93
Also see whatwg/webidl#84.
I also wanted to loop in @mgiuca because we had some discussion around when to use one or the other in the web app manifest spec.
As this is from a merged PR, I am quoting Matt here:
> I would strongly recommend not going against Web IDL here, and keeping the default as DOMString for things that aren't meant to be sent over the wire. See my reasoning at w3c/css-houdini-drafts#687 (comment).
>
> I don't agree with the framing of the OP, that using USVString is in any way migrating the web to a world where unpaired surrogates are disallowed. Instead, it's just adding an extra unnecessary processing step. Unpaired surrogates flow freely throughout all JavaScript code, and using USVString just forces browsers to preprocess them into replacement characters before operating on them internally (e.g., when sending them over HTTP). It doesn't in any way move the needle on unpaired surrogates in the ecosystem in general.
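To make that preprocessing step concrete, here is a quick sketch, runnable in a browser console: the `URL` constructor takes USVString arguments, while `document.title` is a DOMString attribute.

```ts
// A lone (unpaired) high surrogate: a valid JavaScript string,
// but not a sequence of Unicode scalar values.
const lone = "a\uD800b";

// USVString boundary: the URL constructor converts its argument to a
// scalar value string first, so U+D800 becomes U+FFFD (which is then
// UTF-8 percent-encoded in the query as %EF%BF%BD).
const url = new URL("https://example.com/?q=" + lone);
console.log(url.search); // "?q=a%EF%BF%BDb"

// DOMString boundary: the lone surrogate passes through untouched and
// reads back exactly as written.
document.title = lone;
console.log(document.title === lone); // true
```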
@domenic What do you think about my classification: "if the string characters are displayed to the user or leave the browser, then it should be a USVString"? I agree with you that converting everything to USVString isn't going to migrate away from unpaired surrogates being allowed. Strings passed internally (that do not leave the JavaScript context) should remain as DOMString, an unverified sequence of UTF-16 code units. But it's not just exposing strings to the network that should require them to be processed as Unicode characters.
@mgiuca we rejected that classification before, since we haven't changed the existing user-facing APIs that would fall under it.
Right, but those could be grandfathered instances (too hard to change), while new advice might be to use USVString. (Obviously the web must make affordances for new design advice without having to update all the old interfaces; for instance, new async APIs should return Promises, but we haven't fixed the old non-Promise APIs.)
Compared to those improvements, it's unclear how this is an improvement (unless you actually use UTF-8 strings underneath). As @domenic keeps pointing out, unless you need UTF-8, all this adds is overhead.
Right, I disagree with "shown to the user". I think "some system outside JavaScript" is potentially the right generalization of "sent to the network"; not sure.
So being shown as part of the Android system UI would qualify as "some system outside JavaScript" in this case, which fits with why we use it in the web app manifest: some system outside JavaScript, such as being shown as part of native UI, being sent over the network, and the like.
I guess then we should probably stick with the network boundary.
Right, it's coming back to me now. The generalization is just "sent to systems which are known to, in an interoperable way, not allow unpaired surrogates". URL encoding is one such system: there are specification algorithms which any spec that uses URLs must call into, and which simply do not work on unpaired surrogates.

OS APIs are not a case of this. Some of them might accept unpaired surrogates (all of Windows), some might not. In any case, OS APIs are not interoperable. We leave it to the OS-specific implementation code, not to the web layer, to deal with the unpaired surrogates.

On the other hand, if, for example, there were some cross-platform Bluetooth standard for sending 140-character messages that, per the Bluetooth standards folks, did not accept unpaired surrogates, then using USVString for the Web140CharacterBluetoothAPI would make sense. You'd write the spec text starting with DOMString, insert a step to convert to a scalar value string before calling the Bluetooth standard's algorithm, and then notice that you could save yourself one step by just using USVString instead.

That's the line. Only use USVString when it saves you a step of standards text, by rolling that step into the Web IDL type conversion.
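A sketch of that equivalence, sticking with the hypothetical Web140CharacterBluetoothAPI (`bluetoothSend` and `toScalarValueString` are illustrative names, not real APIs): the explicit spec step and the USVString type conversion do exactly the same work.

```ts
// Hypothetical sink representing "the Bluetooth standard's algorithm".
declare function bluetoothSend(message: string): void;

// What the spec step "convert to a scalar value string" does, written
// out as code: every lone surrogate becomes U+FFFD.
function toScalarValueString(s: string): string {
  let out = "";
  for (const ch of s) {            // iterates by code point; a lone
    const cp = ch.codePointAt(0)!; // surrogate comes through by itself
    out += cp >= 0xd800 && cp <= 0xdfff ? "\uFFFD" : ch;
  }
  return out;
}

// Formulation 1: a DOMString parameter plus an explicit conversion step.
function send_v1(message: string): void {
  const scalar = toScalarValueString(message); // the extra spec step
  bluetoothSend(scalar);
}

// Formulation 2: declare the parameter as USVString in the Web IDL; the
// bindings perform the same conversion before the algorithm ever runs,
// and the spec text loses one step.
function send_v2(message: string /* already a scalar value string */): void {
  bluetoothSend(message);
}
```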
That seems wrong to me. Shouldn't it read "sent to systems which are not known to, in an interoperable way, allow unpaired surrogates"? I think I get where you're coming from: if there's a chance the data will be allowed through unmodified (e.g., because it's Windows), then do not pre-mangle it "just in case". Systems that need to mangle it will do so (in an unspecified way), but systems that don't need to mangle it will have the least possible data loss.

However, I think interoperability should override that concern: if a string containing an unpaired surrogate works properly on Windows, but gets mangled (or perhaps completely fails) on other platforms, that is an interop problem, especially if developers are only testing on Windows. On the other hand, if the spec guarantees that unpaired surrogates will be transformed into U+FFFD on all platforms, then it works the same everywhere.

FYI, I finally found the discussion where this was decided for Web Share (it was tricky to find, thanks to GitHub burying code review comments): w3c/web-share#20, then click "Show outdated" on the comment on …
Can you write a test case to show the kind of non-interoperability you are talking about? I think in the end the practical rule is pretty simple: always start with DOMString. If you notice yourself inserting a step to censor unpaired surrogates into replacement characters, then you can switch to USVString in order to get the type system to do that for you. It sounds like in the web share case you'd need that step, at least once the web share target spec gets introduced, so there it makes sense. Without web share target, you wouldn't, I believe.
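A sketch of the kind of test case being asked for (hypothetical: it assumes a Web Share `title` typed as DOMString, i.e. with no USVString conversion; per the w3c/web-share#20 discussion above, the shipped API went the other way):

```ts
// What reaches the OS share sheet for a lone surrogate would be
// platform-dependent if title were a DOMString:
async function shareWithLoneSurrogate(): Promise<void> {
  await navigator.share({ title: "a\uD800b", url: "https://example.com/" });
  // Windows (UTF-16 native):  might pass \uD800 through to the OS as-is.
  // UTF-8-based platforms:    must do *something* -- U+FFFD? '?'? reject?
  // With a USVString title, every platform receives "a\uFFFDb".
}
```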
I could certainly write the Web Share spec in such a way that I don't need to fix up unpaired surrogates. I just say "pass the string as a sequence of UTF-16 code units to the operating system", and let it be implied that a non-UTF-16-based operating system needs to figure out what to do with those code units. (But then I think such a spec is negligent, because it isn't giving implementors guidance on what to do here, so different implementations are able to come to different conclusions: one might insert U+FFFD, another might insert '?', and a third might return an error.)

The fact that Web Share Target could come along later and accept those unpaired surrogates without issue doesn't help. It just adds another case where (like Windows) unpaired surrogates aren't a problem, but it doesn't fix the fact that you have not specified what to do in other situations where they are a problem.

Changing tack: is there any good reason why unpaired surrogates should pass through unmodified, in any new API that sends strings out to the operating system or some other service outside of the web? There is some utility in allowing unpaired surrogates in strings that stay within an application; for example, you can hide non-character data in them, as Python cleverly does with its "surrogateescape" error handler. But that is only meaningful within an application; those unpaired surrogates are, by definition, completely meaningless when passed to a service outside of the web, which is expecting a valid string. I can't think of any legitimate need for a website to send what amounts to a corrupted string to a service outside of the web.
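A minimal sketch of that Python trick, rendered in TypeScript for illustration (not Python's actual implementation): undecodable bytes are smuggled into the lone-low-surrogate range U+DC80–U+DCFF so they round-trip inside an application, even though the resulting string is deliberately not valid Unicode.

```ts
// Decode ASCII-ish bytes, escaping anything >= 0x80 into lone low
// surrogates (U+DC80..U+DCFF), in the style of Python's surrogateescape.
function decodeWithSurrogateEscape(bytes: Uint8Array): string {
  let out = "";
  for (const b of bytes) {
    out += String.fromCharCode(b < 0x80 ? b : 0xdc00 + b);
  }
  return out;
}

// Reverse the escaping to recover the original bytes exactly.
function encodeWithSurrogateEscape(s: string): Uint8Array {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    bytes[i] = c >= 0xdc80 && c <= 0xdcff ? c - 0xdc00 : c;
  }
  return bytes;
}

const raw = new Uint8Array([0x68, 0x69, 0xff]); // "hi" + an invalid byte
const s = decodeWithSurrogateEscape(raw);       // contains lone U+DCFF
console.log(encodeWithSurrogateEscape(s));      // [0x68, 0x69, 0xff] again
```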
In general, it's the same as the reasons that the letter "A" should pass through unmodified. There's an arbitrary number of transformations you could apply to a JavaScript string. We shouldn't unnecessarily apply them.
If your specification interacts with some other, well-specified system which states in its own standard that it only expects a scalar value string (not, e.g., a Unicode 16-bit string of the type that JavaScript/Java/Win32/etc. use), then yes, you'd need to insert a conversion step into your spec, and yes, using USVString to get the type system to do that for you is a good idea.
You'll have to explicitly state those reasons, because from my perspective there are some very good reasons why the letter 'A' should pass through unmodified while the 16-bit integer 0xD800 should not.
The types of "arbitrary" transformations you're talking about (that we shouldn't apply capriciously) are things like composition/decomposition, trimming, removal of control characters, case normalization, etc. All of these transformations change the text content (code points) of a valid string. Whereas the USVString preprocessing step makes no changes to a legal UTF-16 string; it just takes illegal sequences of integers disguised as UTF-16, and creates a legal UTF-16 string. It's a more fundamental (more "necessary") transformation, since it's about ensuring the basic validity of the string data type.

Once again, I have no problem with DOMString being used inside JavaScript, since all the pieces inside the web are expected to work with unpaired surrogates. But we should strive to design web APIs that communicate only valid strings with the outside world.
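The "makes no changes to a legal UTF-16 string" claim is easy to check against ES2024's `String.prototype.toWellFormed()`, which performs exactly the lone-surrogate-to-U+FFFD replacement under discussion:

```ts
// Well-formed strings pass through unchanged, including astral
// characters encoded as proper surrogate pairs.
console.log("café 𝄞".toWellFormed() === "café 𝄞"); // true

// Only ill-formed sequences -- lone surrogates -- are rewritten.
console.log("a\uD800b".toWellFormed() === "a\uFFFDb");         // true
console.log("\uD834\uDD1E".toWellFormed() === "\uD834\uDD1E"); // true (a valid pair)
```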
I think we should assume that external systems expect valid strings unless we are directly interacting with a standard that we know accepts a DOMString-compatible string, rather than starting from the assumption that external systems are OK with corrupted string data unless explicitly stated. (Most of the "external" systems we're talking about are not known to the standard, since they will usually be "the user's operating system", "the user's SMS app", "the user's contacts database", or something like that. So a decree that standards should "use DOMString unless we know the other system expects a valid string" will lead to most APIs using DOMString, while a decree that standards should "use USVString unless we know the other system is DOMString-compatible" will lead to most APIs using USVString. In other words, we'll be using the default most of the time.)
We should not make assumptions about what other systems expect without an interoperable specification or test showing that they expect scalar value strings.
Yes, that's the intention.
This patch adjusts `ScriptURLString` to be a union including `USVString`, not `DOMString`. The advice in [WebIDL][1] isn't exactly clear, but it boils down to @domenic's notes in [whatwg/webidl#84][2] and [w3ctag/design-principles#93][3]. Long story short, URLs are `USVString`. This patch adjusts our implementation to match.

[1]: https://heycam.github.io/webidl/#idl-USVString
[2]: whatwg/webidl#84 (comment)
[3]: w3ctag/design-principles#93 (comment)

Change-Id: I9bf1240b421287d7d9c291b13d887ca981a66231
Hey, I want to know: how do you convert between those types of strings?
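In engine terms there is only one string type, so "converting" a DOMString to a USVString just means replacing each lone surrogate with U+FFFD. A sketch of the options (ES2024 added built-ins for exactly this):

```ts
const s = "a\uD800b"; // contains a lone high surrogate

// ES2024: built-in support.
s.isWellFormed(); // false -- a DOMString-style string, not a scalar value string
s.toWellFormed(); // "a\uFFFDb" -- the USVString conversion, as a method

// Pre-ES2024: a regex that matches only *lone* surrogates. The
// lookaround assertions keep valid surrogate pairs intact.
const loneSurrogate =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;
s.replace(loneSurrogate, "\uFFFD"); // "a\uFFFDb"

// The other direction needs no conversion: every USVString is already
// a valid DOMString.
```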
There has been debate in the CSSWG on using DOMString vs USVString in APIs. CSSOM currently has a CSSOMString union allowing engines to use either (most apparently use USVString currently).
WebIDL currently recommends DOMString except where USVStrings are required. https://heycam.github.io/webidl/#idl-USVString
There's an outstanding question if we want to attempt to migrate the web platform to a world where unpaired surrogates are no longer allowed for APIs except where needed, or just decide that we're stuck with the legacy behavior.
See w3c/css-houdini-drafts#687 and w3c/csswg-drafts#1217 for related discussion.