Lies, Damn Lies and Benchmarks
amnon
* Story
.image https://c1.staticflickr.com/3/2759/4273477501_c0e077e5f6_b.jpg 600 _
* What is our lamp?
	go test -bench .

	func BenchmarkMyFun(b *testing.B) {
		for i := 0; i < b.N; i++ {
			MyFun()
		}
	}
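A benchmark often needs setup that should not be counted in the measurement. A minimal sketch, assuming a hypothetical buildInput helper and a variant of MyFun that takes its input as an argument: b.ResetTimer restarts the clock after setup.

	func BenchmarkMyFunWithSetup(b *testing.B) {
		data := buildInput() // hypothetical expensive setup
		b.ResetTimer()       // exclude the setup from the measurement
		for i := 0; i < b.N; i++ {
			MyFun(data)
		}
	}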
* Myths
- numbers do not lie
- nanoseconds matter
- faster means faster
- large programs are like small ones, but more so
* nanoseconds matter?
Go is usually fast enough
* real program latencies
dominated by
- network delays
- queueing
- database access
- lock contention
- garbage collection
- cache misses
* faster instructions -> faster programs?
or more stall cycles?
* Moore's law mismatch
- CPU speed doubles every 2 years
- Memory speed doubles every 10 years
.image https://dave.cheney.net/wp-content/uploads/2014/06/Gocon-2014-10.jpg
* Caches
.image https://i.stack.imgur.com/bmuak.jpg _ 800
* Numbers all programmers should know
from Jeff Dean
	L1 cache reference                           0.5 ns
	Branch mispredict                              5 ns
	L2 cache reference                             7 ns                      14x L1 cache
	Mutex lock/unlock                             25 ns
	Main memory reference                        100 ns                      20x L2 cache, 200x L1 cache
	Compress 1K bytes with Zippy               3,000 ns        3 us
	Send 1K bytes over 1 Gbps network         10,000 ns       10 us
	Read 4K randomly from SSD*               150,000 ns      150 us          ~1GB/sec SSD
	Read 1 MB sequentially from memory       250,000 ns      250 us
	Round trip within same datacenter        500,000 ns      500 us
	Read 1 MB sequentially from SSD*       1,000,000 ns    1,000 us    1 ms  ~1GB/sec SSD, 4x memory
	Disk seek                             10,000,000 ns   10,000 us   10 ms  20x datacenter roundtrip
	Read 1 MB sequentially from disk      20,000,000 ns   20,000 us   20 ms  80x memory, 20x SSD
	Send packet CA->Netherlands->CA      150,000,000 ns  150,000 us  150 ms
* Caches
Caches get larger with every generation.

Skylake-SP, 28 cores:

- 32KB instruction + 32KB data L1 per core - 4 cycles
- 1MB L2 per core - 12-64 cycles
- ~1.3MB per core of shared L3 - 42 cycles
- DRAM - 42 cycles + 51 ns

But a cache only helps if the data you need is already in it:

- temporal locality
- spatial locality
- eviction
- stall cycles

Pointer chasing defeats all attempts to prefetch - see the linked list example below.
* problem with micro-benchmarks
- the working set fits in the cache, so every access is a hit - real programs are rarely that lucky (see the sketch below)
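One way to expose the difference is to compare a warm, cache-resident benchmark with one whose working set is far larger than the cache. A minimal sketch, not from the deck's code (sizes and names are illustrative):

	import (
		"math/rand"
		"testing"
	)

	// Warm: an 8KB slice revisited every iteration stays resident in L1,
	// so this measures only the happy path.
	func BenchmarkWarm(b *testing.B) {
		xs := make([]int64, 1<<10)
		var total int64
		for i := 0; i < b.N; i++ {
			total += xs[i%len(xs)]
		}
		_ = total
	}

	// Cold: random access over 128MB makes most loads miss every cache
	// level, so the benchmark finally pays DRAM latency.
	func BenchmarkCold(b *testing.B) {
		xs := make([]int64, 1<<24)
		idx := rand.Perm(len(xs))
		b.ResetTimer() // exclude the permutation setup
		var total int64
		for i := 0; i < b.N; i++ {
			total += xs[idx[i%len(idx)]]
		}
		_ = total
	}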
* Instrumentation
- performance counters
- Intel VTune
- perf [[https://perf.wiki.kernel.org/]] - example below
- oprofile
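On Linux, perf reads the CPU's performance counters directly. A sketch of one way to count cache misses for a Go benchmark (the events are standard perf events; the benchmark name matches the example later in this deck):

	go test -c -o bench.test
	perf stat -e cycles,instructions,cache-references,cache-misses \
		./bench.test -test.bench=Sum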
* Go Can Help
- Embedding
- related data located together
- Slices
- data stored in order of traversal
- modern caches detect strided access and prefetch the data
- int8, int16, int32 etc.
- use smaller types where possible - more values fit into a cache line
These features reduce allocations and pointer chasing, and help produce cache-friendly code - a sketch follows.
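A minimal sketch of the first two points (the types are illustrative, not from the deck):

	// Embedding: Pos and Vel are stored inline in Particle,
	// so one Particle is one contiguous block of memory.
	type Point struct{ X, Y int32 } // int32 keeps a Point at 8 bytes: 8 Points per 64-byte cache line

	type Particle struct {
		Pos Point
		Vel Point
	}

	// A slice of values is one allocation, traversed in address order -
	// the prefetcher can stream it.
	var particles []Particle

	// A slice of pointers scatters the elements across the heap,
	// so traversal becomes pointer chasing.
	var scattered []*Particle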
* Example Linked List
.code llist.go /START OMIT/,/END OMIT/
.code llist.go /START2 OMIT/,/END2 OMIT/
.code llist.go /STARTS OMIT/,/ENDS OMIT/
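llist.go itself is not reproduced in these slides. The comparison it sets up is presumably along these lines (a sketch with assumed names, not necessarily the file's contents): summing the same values through a linked list and through a slice.

	type Node struct {
		Value int
		Next  *Node
	}

	// SumList walks the list; each step loads a node whose address is
	// unknown until the previous node has arrived from memory.
	func SumList(head *Node) int {
		sum := 0
		for n := head; n != nil; n = n.Next {
			sum += n.Value
		}
		return sum
	}

	// SumSlice reads contiguous memory that the prefetcher can stream.
	func SumSlice(xs []int) int {
		sum := 0
		for _, x := range xs {
			sum += x
		}
		return sum
	}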
* Linked Lists are EVIL
.image https://cdn-images-1.medium.com/max/400/1*1B4X5Jbe5LNXeZNWx2msyw.gif
* Timings
.code llist_test.go /STARTBENCH OMIT/,/ENDBENCH OMIT/
	goos: darwin
	goarch: amd64
	BenchmarkSumV10-4        100000000        12.9 ns/op
	BenchmarkSumV100-4        20000000        89.0 ns/op
	BenchmarkSumV1000-4        2000000         805 ns/op
	BenchmarkSumV10000-4        200000        8210 ns/op
	BenchmarkSumV100000-4        20000       80502 ns/op
	BenchmarkSum10-4         100000000        15.5 ns/op
	BenchmarkSum100-4         10000000         217 ns/op
	BenchmarkSum1000-4         1000000        2054 ns/op
	BenchmarkSum10000-4          30000       62039 ns/op
	BenchmarkSum100000-4          1000     2851967 ns/op
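Benchmark numbers also vary from run to run, so a single figure can mislead. A common way to compare before and after a change (benchstat is a separate tool from golang.org/x/perf/cmd/benchstat; the file names are illustrative):

	go test -bench . -count=10 > old.txt
	# ...apply the optimization...
	go test -bench . -count=10 > new.txt
	benchstat old.txt new.txt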
* results
.image timings.png
* Resources
- Dave Cheney: How to write benchmarks in Go [[https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go]]
- Intel microarchitectures [[https://en.wikichip.org/wiki/intel/microarchitectures]]
- Ulrich Drepper: What every programmer should know about memory [[https://lwn.net/Articles/250967/]]
* Summary
- Don't do it yet
- Optimize data rather than code
- Allocations matter
- Use embedding
- Slices are your friend
- Don't spend all your time looking under the lamp