Need guidance on DOMString vs USVString #93

Closed
plinss opened this issue Apr 6, 2018 · 19 comments · Fixed by #199

@plinss
Member

plinss commented Apr 6, 2018

There has been debate in the csswg on using DOMString vs USVString in APIs. CSSOM currently has a CSSOMString union allowing engines to use either (most apparently use USVString currently).

WebIDL currently recommends DOMString except where USVStrings are required. https://heycam.github.io/webidl/#idl-USVString

There's an outstanding question of whether we want to attempt to migrate the web platform to a world where unpaired surrogates are no longer allowed in APIs except where needed, or just decide that we're stuck with the legacy behavior.

See w3c/css-houdini-drafts#687 and w3c/csswg-drafts#1217 for related discussion.

@dbaron
Member

dbaron commented Apr 6, 2018

Also see whatwg/webidl#84.

torgo added the "Status: In Progress" label Apr 6, 2018
@kenchris
Contributor

kenchris commented Apr 6, 2018

I also wanted to loop in @mgiuca because we had some discussion around when to use one or the other in the web app manifest spec.

@kenchris
Contributor

kenchris commented Apr 6, 2018

As this is from a merged PR, I am quoting Matt here

One point that keeps coming up is DOMString vs USVString. The difference is DOMString allows illegal surrogate pairs while USVString does not. The WebIDL spec tells you not to use USVString unless you have a good reason (the main accepted use of USVString is URLs). My opinion is the opposite; that USVString should generally be preferred because illegal surrogate pairs are meaningless, and cannot be represented outside of UTF-16. I think the rule of thumb should be that if the string characters are displayed to the user or leave the browser, then it should be a USVString, but if it's an internal string only (e.g., an ID string) then it can be a DOMString.

I successfully argued this logic in Web Share which is why the ShareData fields are USVStrings.

At the very least, all URL fields should be USVStrings. For Manifest, many of the strings will be exported out to the OS (e.g., on shortcuts) and illegal surrogate pairs may not be representable. I would advise the following strings be USVString:

name, short_name, description, scope, start_url, related_applications: all fields, categories, Image.src (but not size). Agree/disagree?
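
To make the distinction in the quoted comment concrete, here is a minimal TypeScript sketch that checks whether a JavaScript string contains any unpaired surrogate code unit, that is, whether it could already be treated as a scalar value string without modification. The function name `isScalarValueString` is hypothetical and does not come from any specification.

```typescript
// Hypothetical helper: returns true when every surrogate code unit in `text`
// belongs to a well-formed high+low pair (no lone surrogates anywhere).
function isScalarValueString(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: it must be followed by a low surrogate.
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) {
        return false;
      }
      i++; // Skip the low surrogate that was just validated.
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate that was not consumed as part of a pair.
      return false;
    }
  }
  return true;
}

const validPair = "\ud83d\ude00"; // U+1F600 encoded as a surrogate pair
const trailingHigh = "abc\ud800"; // high surrogate with no partner
const validPairIsFine = isScalarValueString(validPair); // true
const trailingHighIsFine = isScalarValueString(trailingHigh); // false
```

Engines that implement ES2024 expose the same check as the built-in `String.prototype.isWellFormed()`.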

@domenic
Member

domenic commented Apr 6, 2018

I would strongly recommend not going against Web IDL here, and keeping the default as DOMString for things that aren't meant to be sent over the wire. See my reasoning at w3c/css-houdini-drafts#687 (comment)

I don't agree with the framing of the OP, that using USVString is in any way migrating the web to a world where unpaired surrogates are disallowed. Instead, it's just adding an extra unnecessary processing step. Unpaired surrogates flow freely throughout all JavaScript code, and using USVString just forces browsers to preprocess them into replacement characters before operating on them internally. (E.g., sending them over HTTP.) It doesn't in any way move the needle on unpaired surrogates in the ecosystem in general.

@mgiuca

mgiuca commented Apr 9, 2018

@domenic What do you think about my classification?

"if the string characters are displayed to the user or leave the browser, then it should be a USVString"

I agree with you that converting everything to USVString isn't going to migrate away from unpaired surrogates being allowed. Strings passed internally (that do not leave the JavaScript context) should remain as DOMString, an unverified sequence of UTF-16 code units. But it's not just exposing strings to the network that should require them to be processed as Unicode characters.

  • If their primary purpose is to be shown to the user, then they're going to be processed at some point; may as well be explicit in the type that unpaired surrogates will be converted into U+FFFD. (Then you guarantee consistent treatment of these strings across browsers, because it's done in the WebIDL layer, not a UI layer.)
  • If they are designed to interface with some system outside of JavaScript (e.g., in Web Share, the strings do not go out to the network directly, but they can be sent to the operating system), similarly, we want consistent treatment of unpaired surrogates.

@annevk
Member

annevk commented Apr 9, 2018

@mgiuca we rejected that classification before, since we haven't changed title et al., or Text nodes, to be lone-surrogate safe. Rendering already has to deal with lone surrogates.

@mgiuca

mgiuca commented Apr 9, 2018

Right, but those could be grandfathered instances (too hard to change), while new advice might be to use USVString for cases that match the above.

(Obviously the web must make affordances for new design advice without having to update all the old interfaces; for instance, new async APIs should return Promises, but we haven't fixed setTimeout or XMLHttpRequest.)

@annevk
Member

annevk commented Apr 9, 2018

Compared to those improvements it's unclear how this is an improvement (unless you actually use UTF-8 strings underneath). As @domenic keeps pointing out, unless you need UTF-8, all this adds is overhead.

@domenic
Member

domenic commented Apr 9, 2018

Right, I disagree with "shown to the user". I think "some system outside JavaScript" is potentially the right generalization of "sent to the network"; not sure.

@kenchris
Contributor

kenchris commented Apr 9, 2018

So being shown as part of the Android system UI would qualify as "some system outside JavaScript" in this case, which fits with why we use it in the web app manifest.

Some system outside JavaScript, such as being shown as part of native UI, being sent over the network, and the like.

@domenic
Member

domenic commented Apr 9, 2018

I guess then we should probably stick with the network boundary.

@domenic
Member

domenic commented Apr 9, 2018

Right, it's coming back to me now. The generalization is just "sent to systems which are known to, in an interoperable way, not allow unpaired surrogates". URL encoding is one such system. There are specification algorithms which any spec that uses URLs must call into which simply do not work on unpaired surrogates.
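
As an analogous illustration of an algorithm that simply does not work on unpaired surrogates, ECMAScript's built-in `encodeURIComponent` throws a `URIError` when given one. Note that this is the ECMAScript function, not the URL Standard's own algorithms, and the wrapper name `tryPercentEncode` is hypothetical; this is only a sketch of the failure mode.

```typescript
// Hypothetical wrapper around the built-in encodeURIComponent.
// Returns the percent-encoded string, or null when the input contains
// a lone surrogate (the built-in throws a URIError in that case).
function tryPercentEncode(value: string): string | null {
  try {
    return encodeURIComponent(value);
  } catch (error) {
    if (error instanceof URIError) {
      return null;
    }
    throw error; // Anything else is unexpected; let it propagate.
  }
}

const encodedWellFormed = tryPercentEncode("caf\u00e9"); // "caf%C3%A9"
const encodedLoneSurrogate = tryPercentEncode("caf\ud800"); // null
```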

OS APIs are not a case of this. Some of them might accept unpaired surrogates (all of Windows), some might not. In any case, OS APIs are not interoperable. We leave it to the OS-specific implementation code, not to the web layer, to deal with the unpaired surrogates.

On the other hand if, for example, there was some cross-platform Bluetooth standard for sending 140 character messages that, per the Bluetooth standards folks, did not accept unpaired surrogates, then using USVString for the Web140CharacterBluetoothAPI would make sense. Because, you'd write the spec text starting with DOMString, insert a step to convert to a scalar value string before calling the Bluetooth standard's algorithm, and then notice that you could save yourself one step by just using USVString instead.

That's the line. Only use USVString when it saves you a step of standards text, by rolling that step into the Web IDL type conversion.

@mgiuca

mgiuca commented Apr 10, 2018

OS APIs are not a case of this. Some of them might accept unpaired surrogates (all of Windows), some might not. In any case, OS APIs are not interoperable. We leave it to the OS-specific implementation code, not to the web layer, to deal with the unpaired surrogates.

That seems wrong to me. Shouldn't it read "sent to systems which are not known to, in an interoperable way, not allow unpaired surrogates"? So if you know it allows unpaired surrogates (e.g., it's being put into a JavaScript string), then use DOMString. But if you can't be certain that it will allow unpaired surrogates (e.g., it's being sent to an OS interface), then use USVString, to guarantee well-defined behaviour on all platforms.

I think I get where you're coming from --- if there's a chance the data will be allowed through unmodified (e.g., because it's Windows), then do not pre-mangle it "just in case". Systems that need to mangle it will do so (in an unspecified way), but systems that don't need to mangle it will have the least possible data loss. However, I think interoperability should override that concern: if a string containing an unpaired surrogate works properly on Windows, but gets mangled (or perhaps completely fails) on other platforms, that is an interop problem especially if developers are only testing on Windows. On the other hand, if the spec guarantees that unpaired surrogates will be transformed into U+FFFD on all platforms, then it works the same everywhere.

FYI I finally found the discussion where this was decided for Web Share (it was tricky to find thanks to GitHub burying code review comments): w3c/web-share#20, then click "Show outdated" on the comment on docs/interface.md. The discussion is short but it ends with @annevk saying "it sounds like a good reason to avoid surrogates". Does this mean we have to reverse that decision and go back to DOMStrings?

@domenic
Member

domenic commented Apr 10, 2018

Can you write a test case to show the kind of non-interoperability you are talking about?

I think in the end the practical rule is pretty simple: always start with DOMString. If you notice yourself inserting a step to censor unpaired surrogates into replacement characters, then you can switch to USVString in order to get the type system to do that for you.

It sounds like in the Web Share case you'd need that step, at least once the Web Share Target spec gets introduced. So there it makes sense. Without Web Share Target, you wouldn't, I believe.

@mgiuca

mgiuca commented Apr 10, 2018

I could certainly write the Web Share spec in such a way that I don't need to fix up unpaired surrogates. I just say "pass the string as a sequence of UTF-16 code units to the operating system", and let it be implied that a non-UTF-16-based operating system needs to figure out what to do with those code units. (But then I think such a spec is negligent because it isn't giving implementors guidance on what to do here, so different implementations are able to come to different conclusions: one might insert U+FFFD, another might insert '?', and a third might return an error.)
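
The divergence described above can be sketched in TypeScript. The three `shareOnPlatform…` functions are hypothetical stand-ins for platform-specific glue code, not real APIs; each one is a plausible reading of an underspecified "pass the code units to the OS" step.

```typescript
// True when the code unit at `index` is a surrogate without a valid partner.
function isLoneSurrogateAt(text: string, index: number): boolean {
  const unit = text.charCodeAt(index);
  if (unit >= 0xd800 && unit <= 0xdbff) {
    const next = text.charCodeAt(index + 1); // NaN past the end of the string
    return !(next >= 0xdc00 && next <= 0xdfff);
  }
  if (unit >= 0xdc00 && unit <= 0xdfff) {
    const prev = text.charCodeAt(index - 1); // NaN before the start of the string
    return !(prev >= 0xd800 && prev <= 0xdbff);
  }
  return false;
}

// Replaces every lone surrogate with `replacement`, leaving valid pairs intact.
function replaceLoneSurrogates(text: string, replacement: string): string {
  let result = "";
  for (let i = 0; i < text.length; i++) {
    result += isLoneSurrogateAt(text, i) ? replacement : text.charAt(i);
  }
  return result;
}

// Hypothetical implementation 1: substitutes U+FFFD.
function shareOnPlatformOne(text: string): string {
  return replaceLoneSurrogates(text, "\ufffd");
}

// Hypothetical implementation 2: substitutes "?".
function shareOnPlatformTwo(text: string): string {
  return replaceLoneSurrogates(text, "?");
}

// Hypothetical implementation 3: refuses the string entirely.
function shareOnPlatformThree(text: string): string {
  for (let i = 0; i < text.length; i++) {
    if (isLoneSurrogateAt(text, i)) {
      throw new RangeError("string contains an unpaired surrogate");
    }
  }
  return text;
}
```

All three agree on well-formed input and disagree only on strings containing an unpaired surrogate, which is exactly the case a USVString argument would normalize before any platform code runs.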

The fact that Web Share Target could come along later and accept those unpaired surrogates without issue doesn't help. It just adds another case where (like Windows) unpaired surrogates aren't a problem, but it doesn't fix the fact that you have not specified what to do in other situations where they are a problem.

Changing tack: Is there any good reason why unpaired surrogates should pass through unmodified? (That is, in any new API that sends strings out to the operating system or some other service outside of the web.) There is some utility in allowing unpaired surrogates in strings that stay within an application (for example, you can hide non-character data in them, as Python cleverly does with its "surrogateescape" error handler). But that is only meaningful within an application; those unpaired surrogates are, by definition, completely meaningless when passed to a service outside of the web, which is expecting a valid string. I can't think of any legitimate need for a website to send what amounts to a corrupted string to a service outside of the web.

@domenic
Member

domenic commented Apr 10, 2018

Changing tack: Is there any good reason why unpaired surrogates should pass through unmodified?

In general, it's the same as the reasons that the letter "A" should pass through unmodified. There's an arbitrary number of transformations you could apply to a JavaScript string. We shouldn't unnecessarily apply them.

passed to a service outside of the web, which is expecting a valid string.

If your specification interacts with some other, well-specified system which states in its own standard that it only expects a scalar value string (not, e.g., a Unicode 16-bit string of the type that JavaScript/Java/Win32/etc. use)... then yes, you'd need to insert a conversion step into your spec, and yes, using USVString to get the type system to do that for you is a good idea.

@mgiuca

mgiuca commented Apr 10, 2018

In general, it's the same as the reasons that the letter "A" should pass through unmodified.

You'll have to explicitly state those reasons, because from my perspective there are some very good reasons why the letter 'A' should pass through unmodified while the 16-bit integer 0xD800 should not.

  • 'A' (or, the 16-bit integer 0x0041 in UTF-16) is a character, represented by a Unicode code point (U+0041). 0xD800 is not a character. It has no code point associated with it.
  • The basic purpose of passing a string through any API is to communicate textual content, and 'A' makes up a basic unit of text. 0xD800 does not; nobody has any reason to deliver a text message containing this integer. (Even an ASCII control character, even though non-textual, has some potentially semantic content so should be preserved, but 0xD800 is not a control character.)
  • 'A' (the code point U+0041) can be represented in all three of the primary Unicode encodings (UTF-8, UTF-16 and UTF-32) as well as many other encodings, so we can be confident that it will pass through any re-encodings that might be applied to the string by the system outside of the user agent. (We'd be less confident about a non-ASCII character, but that would still survive encoding into any of the 3 primary encodings.) 0xD800 cannot survive the UTF-16 decoding process. Technically, a string containing an unpaired surrogate isn't valid UTF-16, even though it will happily sit in an array allocated for that purpose. (So even if it's passed into a system that natively uses UTF-16, there's no guarantee it will work properly, since as I said earlier, it's a corrupted string.)
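
The contrast drawn in the list above can be sketched as a small UTF-16 decoder in TypeScript. The function name `decodeUtf16` is hypothetical; it maps code units to Unicode scalar values and gives up on ill-formed input, which is one way (an assumption, since strictness varies between decoders) that a consumer outside the browser might behave.

```typescript
// Hypothetical strict decoder: turns the UTF-16 code units of `text` into
// Unicode scalar values. Returns null when the sequence is ill-formed,
// i.e. when it contains an unpaired surrogate.
function decodeUtf16(text: string): number[] | null {
  const scalars: number[] = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) {
        return null; // High surrogate without a following low surrogate.
      }
      // Combine the pair into a supplementary-plane scalar value.
      scalars.push(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00));
      i++;
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return null; // Low surrogate that does not follow a high surrogate.
    } else {
      scalars.push(unit); // BMP code unit maps directly to a scalar value.
    }
  }
  return scalars;
}

const decodedA = decodeUtf16("A"); // [0x41]
const decodedPair = decodeUtf16("\ud83d\ude00"); // [0x1f600]
const decodedLone = decodeUtf16("\ud800"); // null
```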

The types of "arbitrary" transformations you're talking about (that we shouldn't apply capriciously) are things like composition/decomposition, trimming, removal of control characters, case normalization, etc. All of these transformations change the text content (code points) of a valid string. Whereas the USVString preprocessing step makes no changes to a legal UTF-16 string; it just takes illegal sequences of integers disguised as UTF-16, and creates a legal UTF-16 string. It's a more fundamental (more "necessary") transformation since it's about ensuring basic validity of the string data type.

Once again, I have no problem with DOMString being used inside JavaScript, since all the pieces inside the web are expected to work with unpaired surrogates. But we should strive to design web APIs that communicate only valid strings with the outside world.

If your specification interacts with some other, well-specified system which states in its own standard that it only expects a scalar value string ...

I think we should assume that external systems expect valid strings, unless directly interacting with a standard that we know accepts a DOMString-compatible string. Rather than starting from an assumption that external systems are OK with corrupted string data, unless explicitly stated. (Most of the "external" systems we're talking about are not known to the standard, since they will usually be "the user's operating system", "the user's SMS app" or "the user's contacts database" or something like that. So a decree that standards should "Use DOMString unless we know the other system expects a valid string" will lead to most APIs using DOMString, while a decree that standards should "Use USVString unless we know the other system is DOMString-compatible" will lead to most APIs using USVString. In other words, we'll be using the default most of the time.)

@domenic
Member

domenic commented Apr 10, 2018

We should not make assumptions about what other systems expect without an interoperable specification or test showing that they expect scalar value strings.

So a decree that standards should "Use DOMString unless we know the other system expects a valid string" will lead to most APIs using DOMString

Yes, that's the intention.

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Dec 5, 2018
This patch adjusts `ScriptURLString` to be a union including `USVString`,
not `DOMString`. The advice in [WebIDL][1] isn't exactly clear, but it
boils down to @domenic's notes in [whatwg/webidl#84][2] and
[w3ctag/design-principles#93][3].

Long story short, URLs are `USVString`. This patch adjusts our
implementation to match.

[1]: https://heycam.github.io/webidl/#idl-USVString
[2]: whatwg/webidl#84 (comment)
[3]: w3ctag/design-principles#93 (comment)

Change-Id: I9bf1240b421287d7d9c291b13d887ca981a66231
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Dec 6, 2018
hober self-assigned this May 18, 2020
@yousef312

Hey, I want to know how to convert between these types of strings.
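
Going from a USVString to a DOMString needs no conversion, because every scalar value string is already a valid DOMString. Going the other way, the conversion discussed throughout this thread replaces each unpaired surrogate with U+FFFD. A minimal TypeScript sketch of that replacement follows; the function name `toScalarValueString` is hypothetical, and in engines that implement ES2024 the built-in `String.prototype.toWellFormed()` performs the same replacement.

```typescript
// Hypothetical helper: returns a copy of `input` in which every unpaired
// surrogate code unit is replaced by U+FFFD REPLACEMENT CHARACTER.
// Valid surrogate pairs and all other code units are copied unchanged.
function toScalarValueString(input: string): string {
  let result = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Well-formed pair: keep both code units.
        result += input.charAt(i) + input.charAt(i + 1);
        i++;
      } else {
        result += "\ufffd"; // Lone high surrogate.
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      result += "\ufffd"; // Lone low surrogate.
    } else {
      result += input.charAt(i);
    }
  }
  return result;
}

const unchanged = toScalarValueString("A\ud83d\ude00"); // same as the input
const repaired = toScalarValueString("A\ud800B"); // "A\ufffdB"
```

The conversion is lossy: once a lone surrogate has become U+FFFD, the original code unit cannot be recovered.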
