|
| 1 | +# String Encoding Names |
| 2 | + |
| 3 | +* Proposal: Not assigned yet <!-- [FOU-NNNN](NNNN-filename.md) --> |
| 4 | +* Author(s): [YOCKOW](https://GitHub.com/YOCKOW) |
| 5 | +* Review Manager: TBD |
| 6 | +* Status: **Awaiting review** |
| 7 | +<!-- * Bug: *if applicable* [apple/swift#NNNN](https://github.com/apple/swift-foundation/issues/NNNNN) --> |
| 8 | +* Implementation: [StringEncodingNameImpl/StringEncodingName.swift](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/main/Sources/StringEncodingNameImpl/StringEncodingName.swift) *(Awaiting implementation on [swiftlang/swift-foundation](https://github.com/swiftlang/swift-foundation))* |
| 9 | +<!-- * Previous Proposal: *if applicable* [FOU-XXXX](XXXX-filename.md) --> |
| 10 | +<!-- * Previous Revision: *if applicable* [1](https://github.com/apple/swift-evolution/blob/...commit-ID.../proposals/NNNN-filename.md) --> |
| 11 | +* Review: ([Pitch](https://forums.swift.org/t/pitch-foundation-string-encoding-names/74623)) |
| 12 | + |
| 13 | + |
| 14 | +## Revision History |
| 15 | + |
| 16 | +### [Pitch#1](https://gist.github.com/YOCKOW/f5a385e3c9e2d0c97f3340a889f57a16/d76651bf4375164f6a46df792fccd74955a4733a) |
| 17 | + |
| 18 | +- Features |
| 19 | + * Fully compatible with CoreFoundation. |
| 20 | + + Planned to add static properties corresponding to `kCFStringEncoding*`. |
| 21 | + * Spelling of getter/initializer was `ianaCharacterSetName`. |
| 22 | +- Pros |
| 23 | + * Easy to migrate from CoreFoundation. |
| 24 | +- Cons |
| 25 | + * Propagating undesirable legacy conversions into current Swift Foundation. |
| 26 | + * Including string encodings which might not be supported by Swift Foundation. |
| 27 | + |
| 28 | + |
| 29 | +### [Pitch#2](https://gist.github.com/YOCKOW/f5a385e3c9e2d0c97f3340a889f57a16/215404d620b41119a8a03ec1a51e725eb09be4b6) |
| 30 | + |
| 31 | +- Features |
| 32 | + * Consulting both [IANA Character Sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml) and [WHATWG Encoding Standard](https://encoding.spec.whatwg.org/). |
| 33 | + + Making a compromise between them. |
| 34 | + * Spelling of getter/initializer was `name`. |
| 35 | +- Pros |
| 36 | + * Easy to communicate with API. |
| 37 | +- Cons |
| 38 | + * Hard for users to comprehend conversions. |
| 39 | + * Difficult to maintain the API in a consistent way. |
| 40 | + |
| 41 | +### [Pitch#3](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.1.0/proposal/NNNN-String-Encoding-Names.md), [Pitch#4](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.2.1/proposal/NNNN-String-Encoding-Names.md) |
| 42 | + |
| 43 | +- Features |
| 44 | + * Consulting both [IANA Character Sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml) and [WHATWG Encoding Standard](https://encoding.spec.whatwg.org/). |
| 45 | + * Following ["Charset Alias Matching"](https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching) rule defined in UTS#22 to parse IANA Charset Names. |
| 46 | + * Separated getters/initializers for them. |
| 47 | + + #3: `charsetName` and `standardName` respectively. |
| 48 | + + #4: `name(.iana)` and `name(.whatwg)` for getters; `init(iana:)` and `init(whatwg:)` for initializers. |
| 49 | +- Pros |
| 50 | + * Users can recognize which kind of conversion is used. |
| 51 | +- Cons |
| 52 | + * Not reflecting the fact that the WHATWG Encoding Standard provides not only string encoding names but also implementations to encode/decode data. |
| 53 | + |
| 54 | +### [Pitch#5](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.3.1/proposal/NNNN-String-Encoding-Names.md) |
| 55 | + |
| 56 | +- Features |
| 57 | + * Withdrew support for [WHATWG Encoding Standard](https://encoding.spec.whatwg.org/). |
| 58 | + * Following ["Charset Alias Matching"](https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching) rule defined in UTS#22 to parse IANA Charset Names. |
| 59 | + * Spelling of getter/initializer was `name`. |
| 60 | + * "Fixed" some behavior of parsing, which differs from CoreFoundation. |
| 61 | +- Pros |
| 62 | + * Simple API to use. |
| 63 | +- Cons |
| 64 | + * It was unclear that IANA names were used. |
| 65 | + * The parsing behavior was complex and unpredictable. |
| 66 | + |
| 67 | + |
| 68 | +### [Pitch#6](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.4.0/proposal/NNNN-String-Encoding-Names.md), Proposal#1 |
| 69 | + |
| 70 | +This version. |
| 71 | + |
| 72 | + |
| 73 | +## Introduction |
| 74 | + |
| 75 | +This proposal allows `String.Encoding` to be converted to and from various names. |
| 76 | + |
| 77 | +For example: |
| 78 | + |
| 79 | +```swift |
| 80 | +print(String.Encoding.utf8.ianaName!) // Prints "UTF-8" |
| 81 | +print(String.Encoding(ianaName: "ISO_646.irv:1991") == .ascii) // Prints "true" |
| 82 | +``` |
| 83 | + |
| 84 | + |
| 85 | +## Motivation |
| 86 | + |
| 87 | +String encoding names are widely used in computer networking and other areas. For instance, you often see them in HTTP headers such as `Content-Type: text/plain; charset=UTF-8` or in XML documents with declarations such as `<?xml version="1.0" encoding="Shift_JIS"?>`. |
| 88 | + |
| 89 | +Therefore, it is necessary to parse and generate such names. |
| 90 | + |
| 91 | + |
| 92 | +### Current solution |
| 93 | + |
| 94 | +Swift lacks the necessary APIs, requiring the use of `CoreFoundation` (hereinafter called "CF") as described below. |
| 95 | + |
| 96 | +```swift |
| 97 | +extension String.Encoding { |
| 98 | + var nameInLegacyWay: String? { |
| 99 | + // 1. Convert `String.Encoding` value to the `CFStringEncoding` value. |
| 100 | + // NOTE: The raw value of `String.Encoding` is the same as the value of `NSStringEncoding`, |
| 101 | + // while it is not equal to the value of `CFStringEncoding`. |
| 102 | + let cfStrEncValue: CFStringEncoding = CFStringConvertNSStringEncodingToEncoding(self.rawValue) |
| 103 | + |
| 104 | + // 2. Convert it to the name where its type is `CFString?` |
| 105 | + let cfStrEncName: CFString? = CFStringConvertEncodingToIANACharSetName(cfStrEncValue) |
| 106 | + |
| 107 | + // 3. Convert `CFString` to Swift's `String`. |
| 108 | + // NOTE: Unfortunately, they cannot be implicitly cast on Linux. |
| 109 | + let charsetName: String? = cfStrEncName.flatMap { |
| 110 | + let bufferSize = CFStringGetMaximumSizeForEncoding( |
| 111 | + CFStringGetLength($0), |
| 112 | + kCFStringEncodingASCII |
| 113 | + ) + 1 |
| 114 | + let buffer = UnsafeMutablePointer<CChar>.allocate(capacity: bufferSize) |
| 115 | + defer { |
| 116 | + buffer.deallocate() |
| 117 | + } |
| 118 | + guard CFStringGetCString($0, buffer, bufferSize, kCFStringEncodingASCII) else { |
| 119 | + return nil |
| 120 | + } |
| 121 | + return String(utf8String: buffer) |
| 122 | + } |
| 123 | + return charsetName |
| 124 | + } |
| 125 | + |
| 126 | + init?(fromNameInLegacyWay charsetName: String) { |
| 127 | + // 1. Convert `String` to `CFString` |
| 128 | + let cfStrEncName: CFString = charsetName.withCString { (cString: UnsafePointer<CChar>) -> CFString in |
| 129 | + return CFStringCreateWithCString(nil, cString, kCFStringEncodingASCII) |
| 130 | + } |
| 131 | + |
| 132 | + // 2. Convert it to `CFStringEncoding` |
| 133 | + let cfStrEncValue: CFStringEncoding = CFStringConvertIANACharSetNameToEncoding(cfStrEncName) |
| 134 | + |
| 135 | + // 3. Check whether or not it's valid |
| 136 | + guard cfStrEncValue != kCFStringEncodingInvalidId else { |
| 137 | + return nil |
| 138 | + } |
| 139 | + |
| 140 | + // 4. Convert `CFStringEncoding` value to `String.Encoding` value |
| 141 | + self.init(rawValue: CFStringConvertEncodingToNSStringEncoding(cfStrEncValue)) |
| 142 | + } |
| 143 | +} |
| 144 | +``` |
| 145 | + |
| 146 | + |
| 147 | +### What's the problem with the current solution? |
| 148 | + |
| 149 | +- It is complicated to use multiple CF functions to get a simple value. That's not *Swifty*. |
| 150 | +- CF functions are legacy APIs that do not always meet modern requirements. |
| 151 | +- CF APIs are not officially intended to be called directly from Swift on non-Darwin platforms. |
| 152 | + |
| 153 | + |
| 154 | +## Proposed solution |
| 155 | + |
| 156 | +The solution is straightforward. |
| 157 | +We introduce a computed property that returns the name and an initializer that creates an instance from a name, as shown below. |
| 158 | + |
| 159 | +```swift |
| 160 | +extension String.Encoding { |
| 161 | + /// The name of this encoding, compatible with the names in the IANA "charset" registry. |
| 162 | + public var ianaName: String? |
| 163 | + |
| 164 | + /// Creates an instance from a name in the IANA "charset" registry. |
| 165 | + public init?(ianaName: String) |
| 166 | +} |
| 167 | +``` |
| 168 | + |
| 169 | +## Detailed design |
| 170 | + |
| 171 | +This proposal refers to "[Character Sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml)" published by IANA. |
| 172 | + |
| 173 | +One reason is that the World Wide Web Consortium (W3C) recommends using IANA "charset" names in XML[^XML-IANA-charset-names] and states that any IANA "charset" name may be used in HTTP headers[^HTTP-IANA-charset-names]. |
| 174 | + |
| 175 | +[^XML-IANA-charset-names]: https://www.w3.org/TR/xml11/#charencoding |
| 176 | +[^HTTP-IANA-charset-names]: https://www.w3.org/International/articles/http-charset/index#charset |
| 177 | + |
| 178 | +Another reason is that CF claims that IANA "charset" names are used, as implied by its function names[^CF-IANA-function-names]. |
| 179 | + |
| 180 | +[^CF-IANA-function-names]: [`CFStringConvertIANACharSetNameToEncoding`](https://developer.apple.com/documentation/corefoundation/cfstringconvertianacharsetnametoencoding(_:)) and [`CFStringConvertEncodingToIANACharSetName`](https://developer.apple.com/documentation/corefoundation/cfstringconvertencodingtoianacharsetname(_:)) |
| 181 | + |
| 182 | +However, as mentioned above, CF APIs are sometimes outdated. |
| 183 | +Furthermore, CF parses "charset" names inconsistently[^CF-inconsistent-parse]. |
| 184 | +Therefore, we shouldn't adopt CF-like behavior without modifications. However, partially adjusting that behavior would itself be unpredictable and complex. |
| 185 | + |
| 186 | +[^CF-inconsistent-parse]: https://forums.swift.org/t/pitch-foundation-string-encoding-names/74623/53 |
| 187 | + |
| 188 | +Accordingly, this proposal suggests only a simple correspondence between `String.Encoding` instances and IANA names: |
| 189 | + |
| 190 | + |
| 191 | +| `String.Encoding` | IANA "charset" Name | |
| 192 | +|----------------------|---------------------| |
| 193 | +| `.ascii` | US-ASCII | |
| 194 | +| `.iso2022JP` | ISO-2022-JP | |
| 195 | +| `.isoLatin1` | ISO-8859-1 | |
| 196 | +| `.isoLatin2` | ISO-8859-2 | |
| 197 | +| `.japaneseEUC` | EUC-JP | |
| 198 | +| `.macOSRoman` | macintosh | |
| 199 | +| `.nextstep` | *n/a* | |
| 200 | +| `.nonLossyASCII` | *n/a* | |
| 201 | +| `.shiftJIS` | Shift_JIS | |
| 202 | +| `.symbol` | *n/a* | |
| 203 | +| `.unicode`/`.utf16` | UTF-16 | |
| 204 | +| `.utf16BigEndian` | UTF-16BE | |
| 205 | +| `.utf16LittleEndian` | UTF-16LE | |
| 206 | +| `.utf32` | UTF-32 | |
| 207 | +| `.utf32BigEndian` | UTF-32BE | |
| 208 | +| `.utf32LittleEndian` | UTF-32LE | |
| 209 | +| `.utf8` | UTF-8 | |
| 210 | +| `.windowsCP1250` | windows-1250 | |
| 211 | +| `.windowsCP1251` | windows-1251 | |
| 212 | +| `.windowsCP1252` | windows-1252 | |
| 213 | +| `.windowsCP1253` | windows-1253 | |
| 214 | +| `.windowsCP1254` | windows-1254 | |
| 215 | + |
| 216 | + |
| 217 | +### `String.Encoding` to Name |
| 218 | + |
| 219 | +- Unlike CF, the returned name may contain upper-case letters. |
| 220 | + * `var ianaName` returns the *Preferred MIME Name* or the *Name* of the encoding as defined in "IANA Character Sets". |
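| | + |
| | +The getter's behavior can be sketched roughly as a table lookup. This is only an illustration under stated assumptions: `sketchedIANANames` and `sketchedIANAName(of:)` are hypothetical names, not part of the proposed API, and the table covers just a few rows of the correspondence table above. |
| | + |
| | +```swift |
| | +import Foundation |
| | + |
| | +// Hypothetical table, not the proposed implementation; it covers only a few |
| | +// rows of the correspondence table above. |
| | +let sketchedIANANames: [String.Encoding: String] = [ |
| | +  .ascii: "US-ASCII", |
| | +  .utf8: "UTF-8", |
| | +  .utf16: "UTF-16", |
| | +  .shiftJIS: "Shift_JIS", |
| | +  .macOSRoman: "macintosh", |
| | +] |
| | + |
| | +// Hypothetical stand-in for `ianaName`. |
| | +func sketchedIANAName(of encoding: String.Encoding) -> String? { |
| | +  // Encodings without a registered name (e.g. `.nextstep`) yield `nil`. |
| | +  return sketchedIANANames[encoding] |
| | +} |
| | + |
| | +print(sketchedIANAName(of: .shiftJIS) ?? "n/a")  // Prints "Shift_JIS" |
| | +print(sketchedIANAName(of: .nextstep) ?? "n/a")  // Prints "n/a" |
| | +``` |
| | + |
| | +The names are stored with their registered capitalization, so the result is returned as-is rather than lower-cased. |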
| 221 | + |
| 222 | + |
| 223 | +### Name to `String.Encoding` |
| 224 | + |
| 225 | +- `init(ianaName:)` adopts case-insensitive comparison with *Preferred MIME Name*, *Name*, and *Aliases*. |
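| | + |
| | +A minimal sketch of such case-insensitive matching follows. It assumes a hand-written excerpt of the registry; `sketchedRegistry` and `sketchedEncoding(ianaName:)` are hypothetical names used only for illustration, and the actual implementation may be organized differently. |
| | + |
| | +```swift |
| | +import Foundation |
| | + |
| | +// Hypothetical excerpt of the registry: the first entry of each list is the |
| | +// preferred name and the remaining entries are aliases. |
| | +let sketchedRegistry: [(encoding: String.Encoding, names: [String])] = [ |
| | +  (encoding: .ascii, names: ["US-ASCII", "ISO_646.irv:1991", "csASCII"]), |
| | +  (encoding: .utf8, names: ["UTF-8", "csUTF8"]), |
| | +  (encoding: .shiftJIS, names: ["Shift_JIS", "MS_Kanji", "csShiftJIS"]), |
| | +] |
| | + |
| | +// Hypothetical stand-in for `init?(ianaName:)`. |
| | +func sketchedEncoding(ianaName: String) -> String.Encoding? { |
| | +  let target = ianaName.lowercased() |
| | +  for entry in sketchedRegistry { |
| | +    // Compare against every known name, ignoring case only. |
| | +    if entry.names.contains(where: { $0.lowercased() == target }) { |
| | +      return entry.encoding |
| | +    } |
| | +  } |
| | +  return nil |
| | +} |
| | + |
| | +print(sketchedEncoding(ianaName: "iso_646.IRV:1991") == .ascii)  // Prints "true" |
| | +print(sketchedEncoding(ianaName: "utf8") == nil)  // Prints "true" |
| | +``` |
| | + |
| | +In this sketch, punctuation is significant: "utf8" is not accepted because only the case is ignored. A looser rule such as "Charset Alias Matching" in UTS#22 would treat it as equivalent to "UTF-8". |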
| 226 | + |
| 227 | + |
| 228 | +## Source compatibility |
| 229 | + |
| 230 | +The changes proposed here are purely additive. However, care must be taken when migrating from CF APIs. |
| 231 | + |
| 232 | + |
| 233 | +## Implications on adoption |
| 234 | + |
| 235 | +This feature can be freely adopted and un-adopted in source code with no deployment constraints and without affecting source compatibility. |
| 236 | + |
| 237 | + |
| 238 | +## Future directions |
| 239 | + |
| 240 | +`String.init(data:encoding:)` and `String.data(using:)` will be implemented more appropriately[^string-data-regression]. |
| 241 | + |
| 242 | +[^string-data-regression]: https://github.com/swiftlang/swift-foundation/issues/1015 |
| 243 | + |
| 244 | + |
| 245 | +In the longer term, this could hopefully trigger a cascade of changes such as the following. |
| 246 | + |
| 247 | +- General string decoders/encoders and their protocols (for example, as suggested in "[Unicode Processing APIs](https://forums.swift.org/t/pitch-unicode-processing-apis/69294)") could be implemented. |
| 248 | + |
| 249 | +- Some types which provide both their names and decoders/encoders could be implemented in order to bind names tightly to implementations. |
| 250 | + * There would be a type for the WHATWG Encoding Standard, which defines both names and implementations. |
| 251 | + |
| 252 | +<details><summary>They would look like...</summary><div> |
| 253 | + |
| 254 | +```swift |
| 255 | +public protocol StrawmanStringEncodingProtocol { |
| 256 | + static func encoding(for name: String) -> Self? |
| 257 | + var name: String? { get } |
| 258 | + var encoder: (any StringToByteStreamEncoder)? { get } |
| 259 | + var decoder: (any ByteStreamToUnicodeScalarsDecoder)? { get } |
| 260 | +} |
| 261 | + |
| 262 | +public struct IANACharset: StrawmanStringEncodingProtocol { |
| 263 | + public static let utf8: IANACharset = ... |
| 264 | + public static let shiftJIS: IANACharset = ... |
| 265 | + : |
| 266 | + : |
| 267 | +} |
| 268 | + |
| 269 | +public struct WHATWGEncoding: StrawmanStringEncodingProtocol { |
| 270 | + public static let utf8: WHATWGEncoding = ... |
| 271 | + public static let eucJP: WHATWGEncoding = ... |
| 272 | + : |
| 273 | + : |
| 274 | +} |
| 275 | +``` |
| 276 | + |
| 277 | +</div></details> |
| 278 | + |
| 279 | +- `String.Encoding` might naturally be deprecated in the distant future. |
| 280 | + |
| 281 | + |
| 282 | +## Alternatives considered |
| 283 | + |
| 284 | +### Following "Charset Alias Matching" |
| 285 | + |
| 286 | +[UTS#22](https://www.unicode.org/reports/tr22/tr22-8.html) defines the "Charset Alias Matching" rule. |
| 287 | +ICU adopts that rule, and CF partially depends on ICU. |
| 288 | +On the other hand, there does not seem to be any specification that requires "Charset Alias Matching". |
| 289 | +Moreover, such a tolerant rule may carry inherent risks. |
| 290 | + |
| 291 | +One possible solution may be letting users choose which rule should be used: |
| 292 | +```swift |
| 293 | +extension String.Encoding { |
| 294 | + public enum NameParsingStrategy { |
| 295 | + case uts22 |
| 296 | + case caseInsensitiveComparison |
| 297 | + } |
| 298 | + |
| 299 | + public init?(ianaName: String, strategy: NameParsingStrategy = .caseInsensitiveComparison) { |
| 300 | + ... |
| 301 | + } |
| 302 | +} |
| 303 | +``` |
| 304 | + |
| 305 | + |
| 306 | +### Adopting the WHATWG Encoding Standard (as well) |
| 307 | + |
| 308 | +There is another standard for string encodings which is published by WHATWG: "[Encoding Standard](https://encoding.spec.whatwg.org/)". |
| 309 | +Although it may claim that it could replace IANA's Character Sets, it focuses entirely on Web browsers and their JavaScript APIs. |
| 310 | +Furthermore, it binds names tightly to implementations. |
| 311 | +Since `String.Encoding` is just a `RawRepresentable` type whose `RawValue` is `UInt`, it is more universal but more loosely bound to implementations. |
| 312 | +As a result, the WHATWG Encoding Standard doesn't easily align with `String.Encoding`, so it is only mentioned in "Future directions". |
| 313 | + |
| 314 | + |
| 315 | +## Acknowledgments |
| 316 | + |
| 317 | +Thanks to everyone who gave me advice on the pitch thread, especially to [@benrimmington](https://github.com/benrimmington) and [@xwu](https://github.com/xwu), who channeled their concerns into this proposal at a very early stage. |