Lies, Damn Lies and Benchmarks
amnon
* Story
.image https://c1.staticflickr.com/3/2759/4273477501_c0e077e5f6_b.jpg 600 _
* What is our lamp?
	go test -bench .

	func BenchmarkMyFun(b *testing.B) {
		for i := 0; i < b.N; i++ {
			MyFun()
		}
	}
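A benchmark often needs setup that should not be counted in the measurement. A minimal sketch, assuming a hypothetical buildInput helper and a variant of MyFun that takes its input as an argument: b.ResetTimer restarts the clock after setup.

	func BenchmarkMyFunWithSetup(b *testing.B) {
		data := buildInput() // hypothetical expensive setup
		b.ResetTimer()       // exclude the setup from the measurement
		for i := 0; i < b.N; i++ {
			MyFun(data)
		}
	}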
* Myths
- numbers do not lie
- nanoseconds matter
- faster means faster
- large programs are like small ones, but more so
* nanoseconds matter?
Go is usually fast enough
* real program latencies
dominated by
- network delays
- queueing
- database access
- lock contention
- garbage collection
- cache misses
* faster instructions -> faster programs?
or more stall cycles?
* Moore's law mismatch
- CPU speed doubles every 2 years
- Memory speed doubles every 10 years
.image https://dave.cheney.net/wp-content/uploads/2014/06/Gocon-2014-10.jpg
* Caches
.image https://i.stack.imgur.com/bmuak.jpg _ 800
* Numbers all programmers should know
from Jeff Dean
	L1 cache reference                           0.5 ns
	Branch mispredict                              5 ns
	L2 cache reference                             7 ns                      14x L1 cache
	Mutex lock/unlock                             25 ns
	Main memory reference                        100 ns                      20x L2 cache, 200x L1 cache
	Compress 1K bytes with Zippy               3,000 ns        3 us
	Send 1K bytes over 1 Gbps network         10,000 ns       10 us
	Read 4K randomly from SSD*               150,000 ns      150 us          ~1GB/sec SSD
	Read 1 MB sequentially from memory       250,000 ns      250 us
	Round trip within same datacenter        500,000 ns      500 us
	Read 1 MB sequentially from SSD*       1,000,000 ns    1,000 us    1 ms  ~1GB/sec SSD, 4x memory
	Disk seek                             10,000,000 ns   10,000 us   10 ms  20x datacenter roundtrip
	Read 1 MB sequentially from disk      20,000,000 ns   20,000 us   20 ms  80x memory, 20x SSD
	Send packet CA->Netherlands->CA      150,000,000 ns  150,000 us  150 ms
* Caches
Caches get larger with every generation.

Skylake-SP, 28 cores:

- 32KB instruction + 32KB data L1 per core - 4 cycles
- 1MB L2 per core - 12-64 cycles
- ~1.3MB per core of shared L3 - 42 cycles
- DRAM - 42 cycles + 51 ns

But a cache only helps if the data you need is already in it:

- temporal locality
- spatial locality
- eviction
- stall cycles

Pointer chasing defeats all attempts to prefetch - see the linked list example below.
* problem with micro-benchmarks
- the working set fits in the cache, so every access is a hit - real programs are rarely that lucky (see the sketch below)
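One way to expose the difference is to compare a warm, cache-resident benchmark with one whose working set is far larger than the cache. A minimal sketch, not from the deck's code (sizes and names are illustrative):

	import (
		"math/rand"
		"testing"
	)

	// Warm: an 8KB slice revisited every iteration stays resident in L1,
	// so this measures only the happy path.
	func BenchmarkWarm(b *testing.B) {
		xs := make([]int64, 1<<10)
		var total int64
		for i := 0; i < b.N; i++ {
			total += xs[i%len(xs)]
		}
		_ = total
	}

	// Cold: random access over 128MB makes most loads miss every cache
	// level, so the benchmark finally pays DRAM latency.
	func BenchmarkCold(b *testing.B) {
		xs := make([]int64, 1<<24)
		idx := rand.Perm(len(xs))
		b.ResetTimer() // exclude the permutation setup
		var total int64
		for i := 0; i < b.N; i++ {
			total += xs[idx[i%len(idx)]]
		}
		_ = total
	}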
* Instrumentation
- performance counters
- Intel VTune
- perf [[https://perf.wiki.kernel.org/]] - example below
- oprofile
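On Linux, perf reads the CPU's performance counters directly. A sketch of one way to count cache misses for a Go benchmark (the events are standard perf events; the benchmark name matches the example later in this deck):

	go test -c -o bench.test
	perf stat -e cycles,instructions,cache-references,cache-misses \
		./bench.test -test.bench=Sum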
* Go Can Help
- Embedding
- related data located together
- Slices
- data stored in order of traversal
- modern caches detect strided access and prefetch the data
- int8, int16, int32 etc.
- use smaller types where possible - more values fit into a cache line
These features reduce allocations and pointer chasing, and help produce cache-friendly code - a sketch follows.
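A minimal sketch of the first two points (the types are illustrative, not from the deck):

	// Embedding: Pos and Vel are stored inline in Particle,
	// so one Particle is one contiguous block of memory.
	type Point struct{ X, Y int32 } // int32 keeps a Point at 8 bytes: 8 Points per 64-byte cache line

	type Particle struct {
		Pos Point
		Vel Point
	}

	// A slice of values is one allocation, traversed in address order -
	// the prefetcher can stream it.
	var particles []Particle

	// A slice of pointers scatters the elements across the heap,
	// so traversal becomes pointer chasing.
	var scattered []*Particle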
* Example Linked List
.code llist.go /START OMIT/,/END OMIT/
.code llist.go /START2 OMIT/,/END2 OMIT/
.code llist.go /STARTS OMIT/,/ENDS OMIT/
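llist.go itself is not reproduced in these slides. The comparison it sets up is presumably along these lines (a sketch with assumed names, not necessarily the file's contents): summing the same values through a linked list and through a slice.

	type Node struct {
		Value int
		Next  *Node
	}

	// SumList walks the list; each step loads a node whose address is
	// unknown until the previous node has arrived from memory.
	func SumList(head *Node) int {
		sum := 0
		for n := head; n != nil; n = n.Next {
			sum += n.Value
		}
		return sum
	}

	// SumSlice reads contiguous memory that the prefetcher can stream.
	func SumSlice(xs []int) int {
		sum := 0
		for _, x := range xs {
			sum += x
		}
		return sum
	}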
* Linked Lists are EVIL
.image https://cdn-images-1.medium.com/max/400/1*1B4X5Jbe5LNXeZNWx2msyw.gif
* Timings
.code llist_test.go /STARTBENCH OMIT/,/ENDBENCH OMIT/
	goos: darwin
	goarch: amd64
	BenchmarkSumV10-4        100000000        12.9 ns/op
	BenchmarkSumV100-4        20000000        89.0 ns/op
	BenchmarkSumV1000-4        2000000         805 ns/op
	BenchmarkSumV10000-4        200000        8210 ns/op
	BenchmarkSumV100000-4        20000       80502 ns/op
	BenchmarkSum10-4         100000000        15.5 ns/op
	BenchmarkSum100-4         10000000         217 ns/op
	BenchmarkSum1000-4         1000000        2054 ns/op
	BenchmarkSum10000-4          30000       62039 ns/op
	BenchmarkSum100000-4          1000     2851967 ns/op
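Benchmark numbers also vary from run to run, so a single figure can mislead. A common way to compare before and after a change (benchstat is a separate tool from golang.org/x/perf/cmd/benchstat; the file names are illustrative):

	go test -bench . -count=10 > old.txt
	# ...apply the optimization...
	go test -bench . -count=10 > new.txt
	benchstat old.txt new.txt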
* results
.image timings.png
* Resources
- Dave Cheney: How to write benchmarks in Go [[https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go]]
- Intel microarchitectures [[https://en.wikichip.org/wiki/intel/microarchitectures]]
- Ulrich Drepper: What every programmer should know about memory [[https://lwn.net/Articles/250967/]]
* Summary
- Don't do it yet
- Optimize data rather than code
- Allocations matter
- Use embedding
- Slices are your friend
- Don't spend all your time looking under the lamp