Commit 723256c

[Proposal] Add "String Encoding Names" proposal.
This proposal allows `String.Encoding` to be converted to and from various names. For example:

```swift
print(String.Encoding.utf8.ianaName!) // Prints "UTF-8"
print(String.Encoding(ianaName: "ISO_646.irv:1991") == .ascii) // Prints "true"
```
1 parent edcf9ac commit 723256c

1 file changed

+317
-0
lines changed

@@ -0,0 +1,317 @@
1+
# String Encoding Names
2+
3+
* Proposal: Not assigned yet <!-- [FOU-NNNN](NNNN-filename.md) -->
4+
* Author(s): [YOCKOW](https://GitHub.com/YOCKOW)
5+
* Review Manager: TBD
6+
* Status: **Awaiting review**
7+
<!-- * Bug: *if applicable* [apple/swift#NNNN](https://github.com/apple/swift-foundation/issues/NNNNN) -->
8+
* Implementation: [StringEncodingNameImpl/StringEncodingName.swift](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/main/Sources/StringEncodingNameImpl/StringEncodingName.swift) *(Awaiting implementation on [swiftlang/swift-foundation](https://github.com/swiftlang/swift-foundation))*
9+
<!-- * Previous Proposal: *if applicable* [FOU-XXXX](XXXX-filename.md) -->
10+
<!-- * Previous Revision: *if applicable* [1](https://github.com/apple/swift-evolution/blob/...commit-ID.../proposals/NNNN-filename.md) -->
11+
* Review: ([Pitch](https://forums.swift.org/t/pitch-foundation-string-encoding-names/74623))
12+
13+
14+
## Revision History
15+
16+
### [Pitch#1](https://gist.github.com/YOCKOW/f5a385e3c9e2d0c97f3340a889f57a16/d76651bf4375164f6a46df792fccd74955a4733a)
17+
18+
- Features
19+
  * Fully compatible with CoreFoundation.
20+
    + Planned to add static properties corresponding to `kCFStringEncoding*`.
21+
  * Spelling of getter/initializer was `ianaCharacterSetName`.
22+
- Pros
23+
  * Easy to migrate from CoreFoundation.
24+
- Cons
25+
  * Propagating undesirable legacy conversions into current Swift Foundation.
26+
  * Including string encodings which might not be supported by Swift Foundation.
27+
28+
29+
### [Pitch#2](https://gist.github.com/YOCKOW/f5a385e3c9e2d0c97f3340a889f57a16/215404d620b41119a8a03ec1a51e725eb09be4b6)
30+
31+
- Features
32+
  * Consulting both [IANA Character Sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml) and [WHATWG Encoding Standard](https://encoding.spec.whatwg.org/).
33+
    + Making a compromise between them.
34+
  * Spelling of getter/initializer was `name`.
35+
- Pros
36+
  * Easy to communicate with API.
37+
- Cons
38+
  * Hard for users to comprehend conversions.
39+
  * Difficult to maintain the API in a consistent way.
40+
41+
### [Pitch#3](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.1.0/proposal/NNNN-String-Encoding-Names.md), [Pitch#4](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.2.1/proposal/NNNN-String-Encoding-Names.md)
42+
43+
- Features
44+
  * Consulting both [IANA Character Sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml) and [WHATWG Encoding Standard](https://encoding.spec.whatwg.org/).
45+
  * Following ["Charset Alias Matching"](https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching) rule defined in UTS#22 to parse IANA Charset Names.
46+
  * Separated getters/initializers for them.
47+
    + #3: `charsetName` and `standardName` respectively.
48+
    + #4: `name(.iana)` and `name(.whatwg)` for getters; `init(iana:)` and `init(whatwg:)` for initializers.
49+
- Pros
50+
  * Users can recognize which kind of conversion is used.
51+
- Cons
52+
  * Not reflecting the fact that WHATWG's Encoding Standard provides not only string encoding names but also implementations to encode/decode data.
53+
54+
### [Pitch#5](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.3.1/proposal/NNNN-String-Encoding-Names.md)
55+
56+
- Features
57+
  * Withdrew support for [WHATWG Encoding Standard](https://encoding.spec.whatwg.org/).
58+
  * Following ["Charset Alias Matching"](https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching) rule defined in UTS#22 to parse IANA Charset Names.
59+
  * Spelling of getter/initializer was `name`.
60+
  * "Fixed" some behaviour of parsing, which differs from CoreFoundation.
61+
- Pros
62+
  * Simple API to use.
63+
- Cons
64+
  * It was unclear that IANA names were used.
65+
  * The parsing behavior was complex and unpredictable.
66+
67+
68+
### [Pitch#6](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.4.0/proposal/NNNN-String-Encoding-Names.md), Proposal#1
69+
70+
This version.
71+
72+
73+
## Introduction
74+
75+
This proposal allows `String.Encoding` to be converted to and from various names.
76+
77+
For example:
78+
79+
```swift
80+
print(String.Encoding.utf8.ianaName!) // Prints "UTF-8"
81+
print(String.Encoding(ianaName: "ISO_646.irv:1991") == .ascii) // Prints "true"
82+
```
83+
84+
85+
## Motivation
86+
87+
String encoding names are widely used in computer networking and other areas. For instance, you often see them in HTTP headers such as `Content-Type: text/plain; charset=UTF-8` or in XML documents with declarations such as `<?xml version="1.0" encoding="Shift_JIS"?>`.
88+
89+
Therefore, it is necessary to parse and generate such names.
90+
91+
92+
### Current solution
93+
94+
Swift lacks the necessary APIs, requiring the use of `CoreFoundation` (hereinafter called "CF") as described below.
95+
96+
```swift
97+
extension String.Encoding {
98+
  var nameInLegacyWay: String? {
99+
    // 1. Convert `String.Encoding` value to the `CFStringEncoding` value.
100+
    // NOTE: The raw value of `String.Encoding` is the same as the value of `NSStringEncoding`,
101+
    // while it is not equal to the value of `CFStringEncoding`.
102+
    let cfStrEncValue: CFStringEncoding = CFStringConvertNSStringEncodingToEncoding(self.rawValue)
103+
104+
    // 2. Convert it to the name where its type is `CFString?`
105+
    let cfStrEncName: CFString? = CFStringConvertEncodingToIANACharSetName(cfStrEncValue)
106+
107+
    // 3. Convert `CFString` to Swift's `String`.
108+
    // NOTE: Unfortunately they cannot be implicitly cast on Linux.
109+
    let charsetName: String? = cfStrEncName.flatMap {
110+
      let bufferSize = CFStringGetMaximumSizeForEncoding(
111+
        CFStringGetLength($0),
112+
        kCFStringEncodingASCII
113+
      ) + 1
114+
      let buffer = UnsafeMutablePointer<CChar>.allocate(capacity: bufferSize)
115+
      defer {
116+
        buffer.deallocate()
117+
      }
118+
      guard CFStringGetCString($0, buffer, bufferSize, kCFStringEncodingASCII) else {
119+
        return nil
120+
      }
121+
      return String(utf8String: buffer)
122+
    }
123+
    return charsetName
124+
  }
125+
126+
  init?(fromNameInLegacyWay charsetName: String) {
127+
    // 1. Convert `String` to `CFString`
128+
    let cfStrEncName: CFString = charsetName.withCString { (cString: UnsafePointer<CChar>) -> CFString in
129+
      return CFStringCreateWithCString(nil, cString, kCFStringEncodingASCII)
130+
    }
131+
132+
    // 2. Convert it to `CFStringEncoding`
133+
    let cfStrEncValue: CFStringEncoding = CFStringConvertIANACharSetNameToEncoding(cfStrEncName)
134+
135+
    // 3. Check whether or not it's valid
136+
    guard cfStrEncValue != kCFStringEncodingInvalidId else {
137+
      return nil
138+
    }
139+
140+
    // 4. Convert `CFStringEncoding` value to `String.Encoding` value
141+
    self.init(rawValue: CFStringConvertEncodingToNSStringEncoding(cfStrEncValue))
142+
  }
143+
}
144+
```
145+
146+
147+
### What's the problem with the current solution?
148+
149+
- It is complicated to use multiple CF functions to get a simple value. That's not *Swifty*.
150+
- CF functions are legacy APIs that do not always meet modern requirements.
151+
- CF APIs are not officially intended to be called directly from Swift on non-Darwin platforms.
152+
153+
154+
## Proposed solution
155+
156+
The solution is straightforward.
157+
We introduce a computed property that returns the name and an initializer that creates an instance from a name, as shown below.
158+
159+
```swift
160+
extension String.Encoding {
161+
  /// The name of this encoding, compatible with the names in the IANA "charset" registry.
162+
  public var ianaName: String? { get }
163+
164+
  /// Creates an instance from a name in the IANA "charset" registry.
165+
  public init?(ianaName: String)
166+
}
167+
```
168+
169+
## Detailed design
170+
171+
This proposal refers to "[Character Sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml)" published by IANA.
172+
173+
One of the reasons for this is that the World Wide Web Consortium (W3C) recommends using IANA "charset" names in XML[^XML-IANA-charset-names] and asserts that any IANA "charset" name may be used in HTTP headers[^HTTP-IANA-charset-names].
174+
175+
[^XML-IANA-charset-names]: https://www.w3.org/TR/xml11/#charencoding
176+
[^HTTP-IANA-charset-names]: https://www.w3.org/International/articles/http-charset/index#charset
177+
178+
Another reason is that CF claims that IANA "charset" names are used, as implied by its function names[^CF-IANA-function-names].
179+
180+
[^CF-IANA-function-names]: [`CFStringConvertIANACharSetNameToEncoding`](https://developer.apple.com/documentation/corefoundation/cfstringconvertianacharsetnametoencoding(_:)) and [`CFStringConvertEncodingToIANACharSetName`](https://developer.apple.com/documentation/corefoundation/cfstringconvertencodingtoianacharsetname(_:))
181+
182+
However, as mentioned above, CF APIs are sometimes outdated.
183+
Furthermore, CF parses "charset" names inconsistently[^CF-inconsistent-parse].
184+
Therefore, we shouldn't adopt CF-like behavior without modifications. Nevertheless, adjusting it only partially would itself be unpredictable and complex.
185+
186+
[^CF-inconsistent-parse]: https://forums.swift.org/t/pitch-foundation-string-encoding-names/74623/53
187+
188+
Accordingly, this proposal suggests a simple correspondence between `String.Encoding` instances and IANA names:
189+
190+
191+
| `String.Encoding` | IANA "charset" Name |
192+
|----------------------|---------------------|
193+
| `.ascii` | US-ASCII |
194+
| `.iso2022JP` | ISO-2022-JP |
195+
| `.isoLatin1` | ISO-8859-1 |
196+
| `.isoLatin2` | ISO-8859-2 |
197+
| `.japaneseEUC` | EUC-JP |
198+
| `.macOSRoman` | macintosh |
199+
| `.nextstep` | *n/a* |
200+
| `.nonLossyASCII` | *n/a* |
201+
| `.shiftJIS` | Shift_JIS |
202+
| `.symbol` | *n/a* |
203+
| `.unicode`/`.utf16` | UTF-16 |
204+
| `.utf16BigEndian` | UTF-16BE |
205+
| `.utf16LittleEndian` | UTF-16LE |
206+
| `.utf32` | UTF-32 |
207+
| `.utf32BigEndian` | UTF-32BE |
208+
| `.utf32LittleEndian` | UTF-32LE |
209+
| `.utf8` | UTF-8 |
210+
| `.windowsCP1250` | windows-1250 |
211+
| `.windowsCP1251` | windows-1251 |
212+
| `.windowsCP1252` | windows-1252 |
213+
| `.windowsCP1253` | windows-1253 |
214+
| `.windowsCP1254` | windows-1254 |
215+
216+
217+
### `String.Encoding` to Name
218+
219+
- Unlike CF, upper-case letters may be used.
220+
  * `var ianaName` returns the *Preferred MIME Name* or *Name* of the encoding defined in "IANA Character Sets".
221+
222+
223+
### Name to `String.Encoding`
224+
225+
- `init(ianaName:)` adopts case-insensitive comparison with *Preferred MIME Name*, *Name*, and *Aliases*.
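
The two rules above can be sketched as a plain table lookup. The following is a minimal illustration under stated assumptions, not the proposed implementation: `sketchTable`, `sketchIANAName`, and `init(sketchIANAName:)` are hypothetical names, and the table holds only three encodings with a few of their registered IANA aliases.

```swift
import Foundation

extension String.Encoding {
  // Hypothetical, abbreviated table: one row per encoding with its preferred
  // IANA name and a few aliases (not the full registry).
  private static let sketchTable: [(encoding: String.Encoding, preferred: String, aliases: [String])] = [
    (.ascii, "US-ASCII", ["ANSI_X3.4-1968", "ISO_646.irv:1991", "csASCII"]),
    (.utf8, "UTF-8", ["csUTF8"]),
    (.shiftJIS, "Shift_JIS", ["MS_Kanji", "csShiftJIS"]),
  ]

  /// Encoding to name: the preferred name, or `nil` if the table has no row.
  var sketchIANAName: String? {
    Self.sketchTable.first(where: { $0.encoding == self })?.preferred
  }

  /// Name to encoding: case-insensitive comparison with the preferred name
  /// and the aliases. No other normalization is applied.
  init?(sketchIANAName name: String) {
    let key = name.lowercased()
    let match = Self.sketchTable.first(where: { row in
      row.preferred.lowercased() == key
        || row.aliases.contains(where: { $0.lowercased() == key })
    })
    guard let row = match else { return nil }
    self = row.encoding
  }
}

print(String.Encoding.utf8.sketchIANAName ?? "nil")
print(String.Encoding(sketchIANAName: "iso_646.IRV:1991") == .ascii)
print(String.Encoding(sketchIANAName: "Shift-JIS") == nil)
```

Because only case differences are ignored in this sketch, a spelling such as `Shift-JIS` does not match `Shift_JIS`.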
226+
227+
228+
## Source compatibility
229+
230+
The changes proposed here are purely additive. However, care must be taken when migrating from CF APIs.
231+
232+
233+
## Implications on adoption
234+
235+
This feature can be freely adopted and un-adopted in source code with no deployment constraints and without affecting source compatibility.
236+
237+
238+
## Future directions
239+
240+
`String.init(data:encoding:)` and `String.data(using:)` will be implemented more appropriately[^string-data-regression].
241+
242+
[^string-data-regression]: https://github.com/swiftlang/swift-foundation/issues/1015
243+
244+
245+
Hopefully, a cascade of follow-up changes like the following could be expected in the longer term.
246+
247+
- General string decoders/encoders and their protocols (for example, as suggested in "[Unicode Processing APIs](https://forums.swift.org/t/pitch-unicode-processing-apis/69294)") could be implemented.
248+
249+
- Some types which provide both their names and decoders/encoders could be implemented in order to bind names tightly to implementations.
250+
  * There would be a type for WHATWG Encoding Standard which defines both names and implementations.
251+
252+
<details><summary>They would look like...</summary><div>
253+
254+
```swift
255+
public protocol StrawmanStringEncodingProtocol {
256+
  static func encoding(for name: String) -> Self?
257+
  var name: String? { get }
258+
  var encoder: (any StringToByteStreamEncoder)? { get }
259+
  var decoder: (any ByteStreamToUnicodeScalarsDecoder)? { get }
260+
}
261+
262+
public struct IANACharset: StrawmanStringEncodingProtocol {
263+
  public static let utf8: IANACharset = ...
264+
  public static let shiftJIS: IANACharset = ...
265+
  :
266+
  :
267+
}
268+
269+
public struct WHATWGEncoding: StrawmanStringEncodingProtocol {
270+
  public static let utf8: WHATWGEncoding = ...
271+
  public static let eucJP: WHATWGEncoding = ...
272+
  :
273+
  :
274+
}
275+
```
276+
277+
</div></details>
278+
279+
- `String.Encoding` might be deprecated as a matter of course in the distant future.
280+
281+
282+
## Alternatives considered
283+
284+
### Following "Charset Alias Matching"
285+
286+
[UTS#22](https://www.unicode.org/reports/tr22/tr22-8.html) defines the "Charset Alias Matching" rule.
287+
ICU adopts that rule and CF partially depends on ICU.
288+
On the other hand, there don't seem to be any specifications that require "Charset Alias Matching".
289+
Moreover, some risks may be inherent in such a tolerant rule.
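
To make the tolerance concrete, here is a minimal sketch of the matching key as we read the rule in UTS#22: drop every character other than ASCII letters and digits, lowercase the letters, and drop each `0` that does not follow a digit. The function name `uts22MatchingKey` is hypothetical, and the sketch is an interpretation of the rule, not the code used by ICU or CF.

```swift
/// Hypothetical helper: builds the comparison key described by
/// "Charset Alias Matching" (as interpreted in this sketch).
func uts22MatchingKey(_ name: String) -> String {
  var key = ""
  // Only ASCII letters and digits survive; everything else is deleted.
  for ch in name where ch.isASCII {
    if ch.isLetter {
      // Map A-Z to a-z.
      key += ch.lowercased()
    } else if ch.isNumber {
      // Going left to right, delete a "0" unless the previously kept
      // character is a digit.
      let precededByDigit = key.last?.isNumber ?? false
      if ch != "0" || precededByDigit {
        key.append(ch)
      }
    }
  }
  return key
}

print(uts22MatchingKey("UTF-8"))
print(uts22MatchingKey("u.t.f-008"))
print(uts22MatchingKey("utf-80"))
```

Two names match when their keys are equal, so `UTF-8`, `utf8`, and `u.t.f-008` all match, while `utf-80` does not.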
290+
291+
One possible solution may be letting users choose which rule should be used:
292+
```swift
293+
extension String.Encoding {
294+
  public enum NameParsingStrategy {
295+
    case uts22
296+
    case caseInsensitiveComparison
297+
  }
298+
299+
  public init?(ianaName: String, strategy: NameParsingStrategy = .caseInsensitiveComparison) {
300+
    ...
301+
  }
302+
}
303+
```
304+
305+
306+
### Adopting the WHATWG Encoding Standard (as well)
307+
308+
There is another standard for string encodings which is published by WHATWG: "[Encoding Standard](https://encoding.spec.whatwg.org/)".
309+
While it may claim that it could replace IANA's Character Sets, it focuses entirely on Web browsers and their JavaScript APIs.
310+
Furthermore, it binds names tightly to implementations.
311+
Since `String.Encoding` is just a `RawRepresentable` type whose `RawValue` is `UInt`, it is more universal but more loosely bound to implementations.
312+
As a result, WHATWG Encoding Standard doesn't easily align with `String.Encoding`. So it is just mentioned in "Future Directions".
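
The loose binding can be seen directly: any `UInt` produces a `String.Encoding` value, whether or not an encoder or decoder exists for it. A minimal sketch follows; the raw value `12345` is an arbitrary number chosen for illustration and is assumed not to correspond to any supported encoding.

```swift
import Foundation

// Known encodings are just well-known raw values
// (the same values as `NSStringEncoding`).
let utf8 = String.Encoding(rawValue: 4)
print(utf8 == .utf8) // Prints "true"

// An arbitrary raw value still yields a value of the type, although
// nothing ties it to a name or to an implementation.
let unknown = String.Encoding(rawValue: 12345)
print(unknown.rawValue) // Prints "12345"
```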
313+
314+
315+
## Acknowledgments
316+
317+
Thanks to everyone who gave me advice on the pitch thread, especially [@benrimmington](https://github.com/benrimmington) and [@xwu](https://github.com/xwu), who helped channel their concerns into this proposal at a very early stage.
