Commit 17908fe

Delivered 2022-02-17 at TTT
0 parents  commit 17908fe

6 files changed

+221
-0
lines changed

figs/ascii.png

34.8 KB

figs/bigger.jpeg

80.9 KB

figs/card.png

4.42 MB

figs/cp437.png

2.45 KB

figs/shocked.gif

884 KB

strings.md

+221
@@ -0,0 +1,221 @@
1+
---
2+
theme: gaia
3+
_class: lead
4+
paginate: true
5+
backgroundColor: #fff
6+
backgroundImage: url('https://marp.app/assets/hero-background.svg')
7+
style: |
8+
section.photo h1,section.photo h2,section.photo h3,section.photo h4,section.photo h5,section.photo h6 {
9+
background-color: #888;
10+
color: #FFF;
11+
}
12+
h6 {
13+
font-size: 30%;
14+
}
15+
img[alt~="centre"] {
16+
display: block;
17+
margin: 0 auto;
18+
}
19+
marp: true
20+
---
21+
22+
# Strings and OsStr: A wild ride through the history of Unicode
23+
24+
#### Jonathan Pallant
25+
26+
---
27+
28+
# A Journey...
29+
30+
1. A String is just a String, right?
31+
1. A Brief History of the String
32+
1. Not all Strings are alike
33+
34+
---
35+
36+
# A String is just a String, right?
37+
38+
* String
39+
* Byte String
40+
* OS String
41+
* C Strings
42+
43+
---
44+
45+
## String
46+
47+
```rust
let s: String = "Hi 😀!".to_owned();
dbg!(&s);
dbg!(s.len());           // length in bytes: 8
dbg!(s.bytes().count()); // also 8, one item per UTF-8 byte
dbg!(s.chars().count()); // 5, one item per Unicode Scalar Value
```
54+
55+
[▶️](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=dc9bead8ea15ce3bc95fb4b87fbcc963)
56+
57+
* A Vector of `u8` inside
58+
* Iterates as `char` (a 32-bit Unicode Scalar Value)
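
Because the bytes inside are UTF-8, not every byte offset is a valid place to cut a `String`. A minimal sketch of that, using only the standard library (the helper name `safe_slice` is made up for illustration):

```rust
/// Returns `s[start..end]`, or `None` if either index is out of range
/// or falls inside a multi-byte UTF-8 sequence.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let s = "Hi 😀!";
    // Print the byte offset at which each `char` starts.
    for (offset, ch) in s.char_indices() {
        println!("{offset}: {ch:?}");
    }
    // The emoji occupies bytes 3..7, so this slice is fine...
    println!("{:?}", safe_slice(s, 3, 7));
    // ...but cutting it in half is refused rather than panicking.
    println!("{:?}", safe_slice(s, 3, 5));
}
```

Indexing with `&s[3..5]` would panic here; `str::get` is the non-panicking form.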
59+
60+
---
61+
62+
## Byte String
63+
64+
```rust
// `b"..."` is a `&[u8; 13]`; `to_owned()` copies it into an owned array.
let s: [u8; 13] = b"Hello, world!".to_owned();
dbg!(&s);
dbg!(s.len()); // 13 bytes
```
69+
70+
[▶️](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=27b67623710f3e5125163db204ff709d)
71+
72+
* Iterates as octets (`u8`)
73+
* A fixed-size array of octets (`u8`) here; `Vec<u8>` is the growable form
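
A byte string carries no promise about its encoding, so turning it into text is a checked operation. A small sketch under that assumption (`bytes_as_text` is a hypothetical helper, not part of the talk):

```rust
/// Tries to view raw bytes as UTF-8 text without copying.
fn bytes_as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    let fixed: &[u8; 13] = b"Hello, world!";
    // `to_vec` gives the growable, heap-allocated form.
    let mut growable: Vec<u8> = fixed.to_vec();
    println!("{:?}", bytes_as_text(&growable));

    // Append a byte that can never appear in valid UTF-8.
    growable.push(0xFF);
    println!("{:?}", bytes_as_text(&growable));
}
```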
74+
75+
---
76+
77+
# A Brief History of the String
78+
79+
---
80+
81+
## The Punched Card
82+
83+
![centre h:500px](./figs/card.png)
84+
85+
<!-- Contains the EBCDIC character set -->
86+
87+
---
88+
89+
## Character Encoding
90+
91+
* Computers work in numbers
92+
* Humans like to write words
93+
* Words are made of characters
94+
* Technically grapheme clusters
95+
* Is ï one character or two?
96+
* We need a conversion table!
97+
* AKA: A Character Set
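
The "one character or two?" question can be made concrete. The sketch below compares the precomposed form of ï (U+00EF) with the decomposed form (`i` followed by U+0308, the combining diaeresis). The helper name `describe` is made up for illustration, and counting true grapheme clusters needs a library outside `std` (the `unicode-segmentation` crate is one option).

```rust
/// Returns (length in UTF-8 bytes, number of `char`s).
fn describe(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let precomposed = "\u{00EF}"; // ï as a single scalar value
    let decomposed = "i\u{0308}"; // i + combining diaeresis

    println!("{precomposed}: {:?}", describe(precomposed));
    println!("{decomposed}: {:?}", describe(decomposed));
    // They usually render the same, but they do not compare equal.
    println!("equal? {}", precomposed == decomposed);
}
```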
98+
99+
---
100+
101+
## American Standard Code for Information Interchange
102+
103+
* Morse Code
104+
* Telegraph / Baudot codes
105+
* BCD
106+
* EBCDIC
107+
* ASA X3.4-1963
108+
* aka ASCII
109+
110+
<!-- X3 committee of the American Standards Association -->
111+
112+
---
113+
114+
## An ASCII Table
115+
116+
![centre h:500px](./figs/ascii.png)
117+
118+
<!-- Let's encode H e l l o -->
119+
120+
<!-- Now let's encode t s c h ü s s -->
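
The two encoding exercises in the notes can be sketched in a few lines. `encode_ascii` is a hypothetical helper: it returns the byte values when every character is in the 7-bit ASCII table and gives up otherwise.

```rust
/// Encodes `text` as ASCII bytes, or returns `None` if any character
/// falls outside the 7-bit table.
fn encode_ascii(text: &str) -> Option<Vec<u8>> {
    if text.is_ascii() {
        Some(text.as_bytes().to_vec())
    } else {
        None
    }
}

fn main() {
    println!("{:?}", encode_ascii("Hello"));
    // "ü" has no slot in ASCII, so this one fails.
    println!("{:?}", encode_ascii("tschüss"));
}
```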
121+
122+
---
123+
124+
![centre h:500px](./figs/shocked.gif)
125+
126+
---
127+
128+
## What if we used the eighth bit?
129+
130+
* We get 128 more characters!
131+
132+
![centre h:400px](./figs/cp437.png)
133+
134+
---
135+
136+
## More standards are required...
137+
138+
* MS-DOS Code Page 437, 850, ...
139+
* Windows Code Page 1252, 1250, ...
140+
* Macintosh Code Page 1275, 1282, ...
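
The trouble with many code pages is that one byte means different things depending on which table you pick. A rough sketch: `decode_latin1` relies on ISO 8859-1 matching the first 256 Unicode code points (Windows-1252 agrees with it for bytes `0xA0..=0xFF`), while `decode_cp437_sample` is a deliberately partial, hand-written table for illustration only. Both function names are invented here.

```rust
/// ISO 8859-1 maps each byte directly to the code point of the same value.
fn decode_latin1(byte: u8) -> char {
    byte as char
}

/// A tiny, incomplete excerpt of Code Page 437 (illustration only).
fn decode_cp437_sample(byte: u8) -> Option<char> {
    match byte {
        0x20..=0x7E => Some(byte as char), // printable ASCII range
        0x81 => Some('ü'),
        0xFC => Some('ⁿ'),
        _ => None, // not covered by this sketch
    }
}

fn main() {
    let byte = 0xFC;
    println!("Latin-1: {}", decode_latin1(byte));
    println!("CP437:   {:?}", decode_cp437_sample(byte));
}
```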
141+
142+
---
143+
144+
## OK, one Standard to Rule Them All then
145+
146+
> Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.
147+
148+
---
149+
150+
## OK, let's go!
151+
152+
* Microsoft used it in Windows
153+
* Sun used it in Java
154+
* Netscape used it in JavaScript
155+
* The Standard C Library added `wcslen` and friends
156+
157+
---
158+
159+
## Unicode 2.0 in 1996...
160+
161+
![centre h:400px](./figs/bigger.jpeg)
162+
163+
* Unicode Transformation Format 16 (UTF-16) arrives
164+
165+
---
166+
167+
## Isn't this the *worst* of everything?
168+
169+
* Unit length != number of characters
170+
* Not ASCII compatible
171+
* Enter Plan 9 and UTF-8...
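
To see why unit length stops matching character count, here is a sketch that encodes the earlier example string as UTF-16 with the standard library. The helper name `utf16_units` is made up for illustration.

```rust
/// Encodes `s` as a sequence of 16-bit UTF-16 code units.
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    let s = "Hi 😀!";
    let units = utf16_units(s);
    // The emoji lies outside the first 65,536 code points, so it needs
    // a surrogate pair: two 16-bit units for one scalar value.
    println!("{} chars, {} units", s.chars().count(), units.len());
    println!("{units:04X?}");

    // A surrogate on its own is not valid UTF-16.
    println!("{:?}", String::from_utf16(&[0xD83D]).is_err());
}
```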
172+
173+
---
174+
175+
## UTF-8
176+
177+
* Variable-length encoding
178+
* Can encode any Unicode Scalar Value as one, two, three or four bytes.
179+
* Unit length != number of characters
180+
* `0b0xxxxxxx`
181+
* `0b110xxxxx 0b10xxxxxx`
182+
* `0b1110xxxx 0b10xxxxxx 0b10xxxxxx`
* `0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx`
183+
184+
<!-- order matters! Not all 8-bit sequences are valid UTF-8 -->
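
The bit patterns above can be checked against real characters. This sketch uses `char::encode_utf8` from the standard library; the wrapper `utf8_bytes` is a hypothetical helper.

```rust
/// Returns the UTF-8 encoding of a single `char`.
fn utf8_bytes(ch: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // four bytes is the longest possible encoding
    ch.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    for ch in ['A', 'ü', '€', '😀'] {
        let bytes = utf8_bytes(ch);
        // Print each byte in binary so the leading-bit patterns are visible.
        let bits: Vec<String> = bytes.iter().map(|b| format!("{b:08b}")).collect();
        println!("{ch}: {} byte(s): {}", bytes.len(), bits.join(" "));
    }

    // A two-byte lead (0xC3) followed by a non-continuation byte is invalid.
    println!("{:?}", std::str::from_utf8(&[0xC3, 0x28]).is_err());
}
```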
185+
186+
---
187+
188+
## Are we done now?
189+
190+
* POSIX says file names are an array of 8-bit values
191+
* Windows says file names are an array of 16-bit `wchar_t`
192+
* :(
193+
194+
---
195+
196+
# Not all Strings are alike
197+
198+
* `String`/`&str`/`"hi"`
199+
* use this by default
200+
* `Vec<u8>`/`&[u8]`/`b"hi"`
201+
* use for exchanging data with 8-bit / ASCII systems
202+
* `OsString`/`OsStr`
203+
* use for exchanging data with your Operating System
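
A sketch of what the OS string types buy you: converting to `&str` is fallible, because the operating system may hand back a name that is not valid Unicode. The helper names are invented for illustration, and the non-UTF-8 part is guarded with `cfg(unix)` because it uses the Unix-only `OsStringExt` trait.

```rust
use std::ffi::{OsStr, OsString};
use std::path::Path;

/// Returns the file name as text only if it is valid Unicode.
fn file_name_as_text(path: &Path) -> Option<&str> {
    path.file_name().and_then(OsStr::to_str)
}

/// Builds a name containing a byte sequence that is not valid UTF-8.
#[cfg(unix)]
fn non_utf8_name() -> OsString {
    use std::os::unix::ffi::OsStringExt;
    // 0xFC is "ü" in Latin-1 but is not valid UTF-8 on its own.
    OsString::from_vec(vec![b't', b's', b'c', b'h', 0xFC, b's', b's'])
}

fn main() {
    // Going from `&str` to an OS string always works.
    let owned: OsString = OsString::from("hi");
    println!("{owned:?}");

    println!("{:?}", file_name_as_text(Path::new("/tmp/tschüss.txt")));

    #[cfg(unix)]
    {
        let odd = non_utf8_name();
        println!("{:?}", odd.to_str()); // no lossless conversion exists
        println!("{}", odd.to_string_lossy()); // bad bytes become U+FFFD
    }
}
```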
204+
205+
---
206+
207+
## C Strings?
208+
209+
* `CString`/`CStr`
210+
* use for exchanging data with 8-bit C APIs
211+
* null-terminated
212+
* Might not be UTF-8
213+
* https://docs.rs/widestring/
214+
* use for exchanging data with 'wide' C APIs
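
A minimal sketch of the 8-bit case with `std::ffi::CString`. The helper `to_c_string` is hypothetical, and no real C function is called here.

```rust
use std::ffi::{CStr, CString};

/// Converts Rust text to an owned C string.
/// Fails if the text contains an interior NUL byte.
fn to_c_string(text: &str) -> Option<CString> {
    CString::new(text).ok()
}

fn main() {
    let c = to_c_string("hi").expect("no interior NUL");
    // The terminator is added for us.
    println!("{:?}", c.as_bytes_with_nul());

    // Going back to `&str` is checked, since C strings might not be UTF-8.
    let borrowed: &CStr = c.as_c_str();
    println!("{:?}", borrowed.to_str());

    // A NUL in the middle would truncate the string on the C side,
    // so construction is refused.
    println!("{:?}", to_c_string("h\0i").is_none());
}
```

When calling a real C API, `c.as_ptr()` yields the `*const c_char` to pass across the boundary; the `CString` must outlive that call.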
215+
216+
---
217+
218+
# <!-- fit --> Questions?
219+
220+
221+
