<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Store Half Byte-Reverse Indexed</title><link>https://sthbrx.github.io/</link><description>A Power Technical Blog</description><atom:link href="https://sthbrx.github.io/rss.xml" rel="self"></atom:link><lastBuildDate>Tue, 22 Mar 2016 18:00:00 +1100</lastBuildDate><item><title>Getting logs out of things</title><link>https://sthbrx.github.io/blog/2016/03/22/getting-logs-out-of-things/</link><description><p>Here at OzLabs, we have an unfortunate habit of making our shiny Power computers very sad, which is a common problem in systems programming and kernel hacking. When this happens, we like having logs. In particular, we like to have the kernel log and the OPAL firmware log, which are, very surprisingly, rather helpful when debugging kernel and firmware issues.</p>
<p>Here's how to get them.</p>
<h2>From userspace</h2>
<p>You're lucky enough that your machine is still up, yay! As every Linux sysadmin knows, you can just grab the kernel log using <code>dmesg</code>.</p>
<p>As for the OPAL log: we can simply ask OPAL to tell us where its log is located in memory, copy it from there, and hand it over to userspace. In Linux, as per standard Unix conventions, we do this by exposing the log as a file, which can be found in <code>/sys/firmware/opal/msglog</code>.</p>
<p>Annoyingly, the <code>msglog</code> file reports itself as size 0 (I'm not sure exactly why, but I <em>think</em> it's due to limitations in sysfs), so if you try to copy the file with <code>cp</code>, you end up with just a blank file. However, you can read it with <code>cat</code> or <code>less</code>.</p>
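<p>As an illustration, here's a rough Python equivalent of what <code>cat</code> is doing (a sketch only: it simply keeps reading until EOF instead of trusting the advertised size):</p>

```python
def read_sysfs_log(path="/sys/firmware/opal/msglog"):
    """Read a sysfs file that advertises size 0 by pulling chunks until EOF.

    Tools like cp trust the advertised size and copy nothing; reading
    until EOF, as cat does, retrieves the real contents.
    """
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```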
<h2>From <code>xmon</code></h2>
<p><code>xmon</code> is a really handy in-kernel debugger for PowerPC that allows you to do basic debugging over the console without hooking up a second machine to use with <code>kgdb</code>. On our development systems, we often configure <code>xmon</code> to automatically begin debugging whenever we hit an oops or panic (using <code>xmon=on</code> on the kernel command line, or the <code>XMON_DEFAULT</code> Kconfig option). It can also be manually triggered:</p>
<div class="highlight"><pre>root@p86:~# echo x &gt; /proc/sysrq-trigger
sysrq: SysRq : Entering xmon
cpu 0x7: Vector: 0 at [c000000fcd717a80]
pc: c000000000085ad8: sysrq_handle_xmon+0x68/0x80
lr: c000000000085ad8: sysrq_handle_xmon+0x68/0x80
sp: c000000fcd717be0
msr: 9000000000009033
current = 0xc000000fcd689200
paca = 0xc00000000fe01c00 softe: 0 irq_happened: 0x01
pid = 7127, comm = bash
Linux version 4.5.0-ajd-11118-g968f3e3 (ajd@ka1) (gcc version 5.2.1 20150930 (GCC) ) #1 SMP Tue Mar 22 17:01:58 AEDT 2016
enter ? for help
7:mon&gt;
</pre></div>
<p>From <code>xmon</code>, simply type <code>dl</code> to dump out the kernel log. If you'd like to page through the log rather than dump the entire thing at once, use <code>#&lt;n&gt;</code> to split it into groups of <code>n</code> lines.</p>
<p>Until recently, it wasn't as easy to extract the OPAL log without knowing magic offsets. A couple of months ago, I was debugging a nasty CAPI issue and got rather frustrated by this, so one day when I had a couple of hours free I <a href="http://patchwork.ozlabs.org/patch/581775/">refactored</a> the existing sysfs interface and <a href="http://patchwork.ozlabs.org/patch/581774/">added</a> the <code>do</code> command to <code>xmon</code>. These patches will be included from kernel 4.6-rc1 onwards.</p>
<p>When you're done, <code>x</code> will attempt to recover the machine and continue, <code>zr</code> will reboot, and <code>zh</code> will halt.</p>
<h2>From the FSP</h2>
<p>Sometimes, not even <code>xmon</code> will help you. In production environments, you're not generally going to start a debugger every time you have an incident. Additionally, a serious hardware error can cause a 'checkstop', which completely halts the system. (Thankfully, end users don't see this very often, but kernel developers, on the other hand...)</p>
<p>This is where the Flexible Service Processor, or FSP, comes in. The FSP is an IBM-developed baseboard management controller used on most IBM-branded Power Systems machines, and is responsible for a whole range of things, including monitoring system health. Among its many capabilities, the FSP can automatically take "system dumps" when fatal errors occur, capturing designated regions of memory for later debugging. System dumps can be configured and triggered via the FSP's web interface, which is beyond the scope of this post but is <a href="https://www.ibm.com/support/knowledgecenter/POWER8/p8ha5/mainstoragedump.htm?cp=POWER8%2F1-3-14-2">documented</a> in IBM Power Systems user manuals.</p>
<p>How does the FSP know what to capture? As it turns out, skiboot (the firmware which implements OPAL) maintains a <a href="https://github.com/open-power/skiboot/blob/master/hw/fsp/fsp-mdst-table.c">Memory Dump Source Table</a> which tells the FSP which memory regions to dump. MDST updates are recorded in the OPAL log:</p>
<div class="highlight"><pre><span class="p">[</span><span class="mi">2690088026</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="nl">MDST</span><span class="p">:</span> <span class="n">Max</span> <span class="n">entries</span> <span class="k">in</span> <span class="n">MDST</span> <span class="nl">table</span> <span class="p">:</span> <span class="mi">256</span>
<span class="p">[</span><span class="mi">2690090666</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="nl">MDST</span><span class="p">:</span> <span class="n">Addr</span> <span class="o">=</span> <span class="mh">0x31000000</span> <span class="p">[</span><span class="nl">size</span> <span class="p">:</span> <span class="mh">0x100000</span> <span class="n">bytes</span><span class="p">]</span> <span class="n">added</span> <span class="n">to</span> <span class="n">MDST</span> <span class="n">table</span><span class="p">.</span>
<span class="p">[</span><span class="mi">2690093767</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="nl">MDST</span><span class="p">:</span> <span class="n">Addr</span> <span class="o">=</span> <span class="mh">0x31100000</span> <span class="p">[</span><span class="nl">size</span> <span class="p">:</span> <span class="mh">0x100000</span> <span class="n">bytes</span><span class="p">]</span> <span class="n">added</span> <span class="n">to</span> <span class="n">MDST</span> <span class="n">table</span><span class="p">.</span>
<span class="p">[</span><span class="mi">2750378890</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="nl">MDST</span><span class="p">:</span> <span class="n">Table</span> <span class="n">updated</span><span class="p">.</span>
<span class="p">[</span><span class="mi">11199672771</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="nl">MDST</span><span class="p">:</span> <span class="n">Addr</span> <span class="o">=</span> <span class="mh">0x1fff772780</span> <span class="p">[</span><span class="nl">size</span> <span class="p">:</span> <span class="mh">0x200000</span> <span class="n">bytes</span><span class="p">]</span> <span class="n">added</span> <span class="n">to</span> <span class="n">MDST</span> <span class="n">table</span><span class="p">.</span>
<span class="p">[</span><span class="mi">11215193760</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="nl">MDST</span><span class="p">:</span> <span class="n">Table</span> <span class="n">updated</span><span class="p">.</span>
<span class="p">[</span><span class="mi">28031311971</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="nl">MDST</span><span class="p">:</span> <span class="n">Table</span> <span class="n">updated</span><span class="p">.</span>
<span class="p">[</span><span class="mi">28411709421</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="nl">MDST</span><span class="p">:</span> <span class="n">Addr</span> <span class="o">=</span> <span class="mh">0x1fff830000</span> <span class="p">[</span><span class="nl">size</span> <span class="p">:</span> <span class="mh">0x100000</span> <span class="n">bytes</span><span class="p">]</span> <span class="n">added</span> <span class="n">to</span> <span class="n">MDST</span> <span class="n">table</span><span class="p">.</span>
<span class="p">[</span><span class="mi">28417251110</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="nl">MDST</span><span class="p">:</span> <span class="n">Table</span> <span class="n">updated</span><span class="p">.</span>
</pre></div>
<p>In the above log, we see four entries: the skiboot/OPAL log, the <a href="https://github.com/open-power/hostboot">hostboot</a> runtime log, the petitboot Linux kernel log (which doesn't make it into the final dump) and the real Linux kernel log. skiboot obviously adds the OPAL and hostboot logs to the MDST early in boot, but it also exposes the <a href="https://github.com/open-power/skiboot/blob/master/doc/opal-api/opal-register-dump-region-101.txt"><code>OPAL_REGISTER_DUMP_REGION</code></a> call which can be used by the operating system to register additional regions. Linux uses this to <a href="https://github.com/torvalds/linux/blob/master/arch/powerpc/platforms/powernv/opal.c#L608">register the kernel log buffer</a>. If you're a kernel developer, you could potentially use the OPAL call to register your own interesting bits of memory.</p>
<p>So, the MDST is all set up, we go about doing our business, and suddenly we checkstop. The FSP does its sysdump magic and a few minutes later it reboots the system. What now?</p>
<ul>
<li>
<p>After we come back up, the FSP notifies OPAL that a new dump is available. Linux exposes the dump to userspace under <code>/sys/firmware/opal/dump/</code>.</p>
</li>
<li>
<p><a href="https://sourceforge.net/projects/linux-diag/files/ppc64-diag/">ppc64-diag</a> is a suite of utilities that assist in manipulating FSP dumps, including the <code>opal_errd</code> daemon. <code>opal_errd</code> monitors new dumps and saves them in <code>/var/log/dump/</code> for later analysis.</p>
</li>
<li>
<p><code>opal-dump-parse</code> (also in the <code>ppc64-diag</code> suite) can be used to extract the sections we care about from the dump:</p>
<div class="highlight"><pre>root@p86:/var/log/dump# opal-dump-parse -l SYSDUMP.842EA8A.00000001.20160322063051
|---------------------------------------------------------|
|ID SECTION SIZE|
|---------------------------------------------------------|
|1 Opal-log 1048576|
|2 HostBoot-Runtime-log 1048576|
|128 printk 1048576|
|---------------------------------------------------------|
List completed
root@p86:/var/log/dump# opal-dump-parse -s 1 SYSDUMP.842EA8A.00000001.20160322063051
Captured log to file Opal-log.842EA8A.00000001.20160322063051
root@p86:/var/log/dump# opal-dump-parse -s 2 SYSDUMP.842EA8A.00000001.20160322063051
Captured log to file HostBoot-Runtime-log.842EA8A.00000001.20160322063051
root@p86:/var/log/dump# opal-dump-parse -s 128 SYSDUMP.842EA8A.00000001.20160322063051
Captured log to file printk.842EA8A.00000001.20160322063051
</pre></div>
</li>
</ul>
<p>There are various other types of dumps and logs that I won't get into here. I'm probably obliged to say that if you're having problems out in the wild, you should contact your friendly local IBM Service Representative...</p>
<h2>Acknowledgements</h2>
<p>Thanks to <a href="https://flamingspork.com">Stewart Smith</a> for pointing me in the right direction regarding FSP sysdumps and related tools.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Donnellan</dc:creator><pubDate>Tue, 22 Mar 2016 18:00:00 +1100</pubDate><guid>tag:sthbrx.github.io,2016-03-22:blog/2016/03/22/getting-logs-out-of-things/</guid><category>debugging</category><category>skiboot</category><category>OPAL</category><category>FSP</category><category>kernel</category><category>development</category><category>OpenPOWER</category></item><item><title>The Elegance of the Plaintext Patch</title><link>https://sthbrx.github.io/blog/2016/03/22/the-elegance-of-the-plaintext-patch/</link><description><p>I've only been working on the Linux kernel for a few months. Before that, I worked with proprietary source control at work and common tools like GitHub at home. The concept of the mailing list seemed obtuse to me. If I noticed a problem with some program, I'd be willing to open an issue on GitHub but not to send an email to a mailing list. Who still uses those, anyway?</p>
<p>Starting out with the kernel meant I had to figure this email thing out. <code>git format-patch</code> and <code>git send-email</code> take most of the pain out of formatting and submitting a patch, which is nice. The patch files generated by <code>format-patch</code> open nicely in Emacs by default, showing all whitespace and letting you pick up any irregularities. <code>send-email</code> means you can send it to yourself or a friend first, finding anything that looks stupid before being exposed to the public.</p>
<p>And then what? You've sent an email. It gets sent to hundreds or thousands of people. Nowhere near that many will read it. Some might miss it due to their mail server going down, or the list flagging your post as spam, or requiring moderation. Some recipients will be bots that archive mail on the list, or publish information about the patch. If you haven't formatted it correctly, someone will let you know quickly. If your patch is important or controversial, you'll have all sorts of responses. If your patch is small or niche, you might not ever hear anything back.</p>
<p>I remember when I sent my first patch. I was talking to a former colleague who didn't understand the patch/mailing list workflow at all. I sent him a link to my patch on a mail archive. I explained it like a pull request - here's my code, you can find the responses. What's missing, compared to a GitHub-esque pull request? We don't know what tests it passed. We don't know if it's been merged yet, or if the maintainer has looked at it. It takes a bit of digging around to find out who's commented on it. If it's part of a series, that's awkward to find out as well. What about revisions of a series? That's another pain point.</p>
<p>Luckily, these problems do have solutions. <a href="http://jk.ozlabs.org/projects/patchwork/">Patchwork</a>, written by fellow OzLabs member <a href="http://jk.ozlabs.org">Jeremy Kerr</a>, changes the way we work with patches. Project maintainers rely on Patchwork instances, such as <a href="https://patchwork.ozlabs.org">https://patchwork.ozlabs.org</a>, for their day-to-day workflow: tagging reviewers, marking the status of patches, keeping track of tests, acks, reviews and comments in one place. Missing from this picture is support for series and revisions, which is a feature that's being developed by the <a href="https://www.freedesktop.org/wiki/">freedesktop</a> project. You can check out their changes in action <a href="https://patchwork.freedesktop.org">here</a>.</p>
<p>So, Patchwork helps patches and email catch up to what GitHub has in terms of ease of information. We're still missing testing and other hooks. What about review? What can we do with email, compared to GitHub and the like?</p>
<p>In my opinion, the biggest feature of email is the ease of review. Just reply inline and you're done. There's inline commenting on GitHub and GitLab, which works well but is a bit tacky: people commenting on the same thing overlap and conflict, and each comment generates a notification (which can be an email until you turn that off). Plus, since it's email, it's really easy to bring in additional people to the conversation as necessary. If there's a super lengthy technical discussion in the kernel, it might just take Linus to resolve.</p>
<p>There are alternatives to just replying to email, too, such as <a href="https://www.gerritcodereview.com/">Gerrit</a>. Gerrit's pretty popular, and has a huge amount of features. I understand why people use it, though I'm not much of a fan. The reason is that it doesn't add to the email workflow; it replaces it. Plaintext email is supported on pretty much any device, with a bunch of different programs. From the goals of Patchwork: "patchwork should supplement mailing lists, not replace them".</p>
<p>Linus Torvalds famously explained why he prefers email over GitHub pull requests <a href="https://github.com/torvalds/linux/pull/17">here</a>, using <a href="https://groups.google.com/forum/#!topic/linux.kernel/w957vpu3PPU">this</a> pull request from Ben Herrenschmidt as an example of why git's own pull request format is superior to that of GitHub. Damien Lespiau, who is working on the freedesktop Patchwork fork, <a href="http://damien.lespiau.name/2016/02/augmenting-mailing-lists-with-patchwork.html">outlines on his blog</a> all the issues he has with mailing list workflows and why he thinks mailing lists are a relic of the past. His work on Patchwork has gone a long way to help fix those problems; however, I don't think mailing lists are outdated and superseded. I think they are timeless. They are a technology-agnostic, simple and free system that will still be around if GitHub dies or alienates its community.</p>
<p>That said, there's still the case of the missing features. What about automated testing? What about developer feedback? What about making a maintainer's life easier? We've been working on improving these issues, and I'll outline how we're approaching them in a future post.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Russell Currey</dc:creator><pubDate>Tue, 22 Mar 2016 13:53:00 +1100</pubDate><guid>tag:sthbrx.github.io,2016-03-22:blog/2016/03/22/the-elegance-of-the-plaintext-patch/</guid><category>development</category><category>education</category><category>kernel</category><category>patches</category></item><item><title>No Network For You</title><link>https://sthbrx.github.io/blog/2016/03/21/no-network-for-you/</link><description><p>In POWER land <a href="https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface">IPMI</a> is mostly known as the method to access the machine's console and start interacting with Petitboot. However it also has a plethora of other features, handily described in the 600ish page <a href="http://www.intel.com/content/www/us/en/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html">IPMI specification</a> (which you can go read yourself).</p>
<p>One especially relevant feature for Petitboot, however, is the 'chassis bootdev' command, which you can use to tell Petitboot to ignore any existing boot order and only consider boot options of the type you specify (e.g. 'network', 'disk', or 'setup' to not boot at all). Support for this has been in Petitboot for a while and should work on just about any machine you can get your hands on.</p>
<h2>Network Overrides</h2>
<p>Over in OpenPOWER<sup>1</sup> land however, someone took this idea and pushed it further - why not allow the network configuration to be overwritten too? This isn't in the IPMI spec, but if you cast your gaze down to page 398 where the spec lays out the entire format of the IPMI request, there is a certain field named "OEM Parameters". This is an optional amount of space set aside for whatever you like, which in this case is going to be data describing an override of the network config.</p>
<p>This allows a user to tell Petitboot over IPMI to either:</p>
<ul>
<li>Disable the network completely,</li>
<li>Set a particular interface to use DHCP, or</li>
<li>Set a particular interface to use a specific static configuration.</li>
</ul>
<p>Any of these options will cause any existing network configurations to be ignored.</p>
<h2>Building the Request</h2>
<p>Since this is an OEM-specific command, your average ipmitool package isn't going to have a nice way of making this request, such as 'chassis bootdev network'. Rather, you need to do something like this:</p>
<div class="highlight"><pre><span class="x">ipmitool -I lanplus -H </span><span class="p">$</span><span class="nv">yourbmc</span><span class="x"> -U </span><span class="p">$</span><span class="nv">user</span><span class="x"> -P </span><span class="p">$</span><span class="nv">pass</span><span class="x"> raw 0x00 0x08 0x61 0x80 0x21 0x70 0x62 0x21 0x00 0x01 0x06 0x04 0xf4 0x52 0x14 0xf3 0x01 0xdf 0x00 0x01 0x0a 0x3d 0xa1 0x42 0x10 0x0a 0x3d 0x2 0x1</span>
</pre></div>
<p>Horrific, right? In the near future the Petitboot tree will include a helper program to format this request for you, but in the meantime (and for future reference), let's lay out how to put this together:</p>
<div class="highlight"><pre>Specify the &quot;chassis bootdev&quot; command, field 96, data field 1:
0x00 0x08 0x61 0x80
Unique value that Petitboot recognises:
0x21 0x70 0x62 0x21
Version field (1)
0x00 0x01 .. ..
Size of the hardware address (6):
.. .. 0x06 ..
Size of the IP address (IPv4/IPv6):
.. .. .. 0x04
Hardware (MAC) address:
0xf4 0x52 0x14 0xf3
0x01 0xdf .. ..
&#39;Ignore flag&#39; and DHCP/Static flag (DHCP is 0)
.. .. 0x00 0x01
(Below fields only required if setting a static IP)
IP Address:
0x0a 0x3d 0xa1 0x42
Subnet Mask (eg, /16):
0x10 .. .. ..
Gateway IP Address:
.. 0x0a 0x3d 0x02
0x01
</pre></div>
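<p>To make that a little less error-prone, here's a hypothetical Python helper that assembles the request bytes from a MAC address and an optional static IPv4 configuration, following the layout above (this is my own sketch, not the helper planned for the Petitboot tree):</p>

```python
import ipaddress

def build_override(mac, ip=None, prefix=None, gateway=None):
    """Assemble the raw IPMI bytes for a Petitboot network override.

    With only a MAC, requests DHCP for that interface; with ip, prefix
    and gateway, requests a static configuration. A sketch only: the
    function name and interface are made up, the byte layout is the one
    described above.
    """
    req = [0x00, 0x08, 0x61, 0x80]       # chassis bootdev, field 96, data field 1
    req += [0x21, 0x70, 0x62, 0x21]      # magic value Petitboot recognises
    req += [0x00, 0x01]                  # version 1
    req += [0x06, 0x04]                  # MAC length, IPv4 address length
    req += [int(octet, 16) for octet in mac.split(":")]    # hardware address
    req += [0x00, 0x01 if ip else 0x00]  # ignore flag, static(1)/DHCP(0)
    if ip:
        req += list(ipaddress.IPv4Address(ip).packed)      # IP address
        req.append(prefix)                                 # subnet prefix length
        req += list(ipaddress.IPv4Address(gateway).packed) # gateway address
    return " ".join("0x%02x" % b for b in req)
```

<p>Feeding its output (for the example MAC and addresses above) to <code>ipmitool ... raw</code> reproduces the request from the earlier command, with zero-padded bytes.</p>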
<p>Clearing a network override is as simple as making a request empty aside from the header:</p>
<div class="highlight"><pre>0x00 0x08 0x61 0x80 0x21 0x70 0x62 0x21 0x00 0x01 0x00 0x00
</pre></div>
<p>You can also read back the request over IPMI with this request:</p>
<div class="highlight"><pre>0x00 0x09 0x61 0x00 0x00
</pre></div>
<p>That's it! Ideally this is something you would be scripting rather than bashing out on the keyboard - the main use case at the moment is as a way to force a machine to netboot against a known good source, rather than whatever may be available on its other interfaces.</p>
<p>[1] The reason this is only available on OpenPOWER machines at the moment is that support for the IPMI command itself depends on the BMC firmware, and non-OpenPOWER machines use an FSP which is a different platform.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Samuel Mendoza-Jonas</dc:creator><pubDate>Mon, 21 Mar 2016 15:23:00 +1100</pubDate><guid>tag:sthbrx.github.io,2016-03-21:blog/2016/03/21/no-network-for-you/</guid><category>petitboot</category><category>power</category><category>p8</category><category>openpower</category><category>goodposts</category><category>realcontent</category><category>ipmi</category><category>bmc</category><category>based16</category></item><item><title>And now for something completely different: approximate computing</title><link>https://sthbrx.github.io/blog/2016/03/15/and-now-for-something-completely-different-approximate-computing/</link><description><p>In early February I had the opportunity to go to the NICTA Systems Summer School, where Cyril and I were invited to represent IBM. There were a number of excellent talks across a huge range of systems-related subjects, but the one that has stuck with me the most was a talk given by <a href="http://homes.cs.washington.edu/~luisceze/">Luis Ceze</a> on a topic called approximate computing. So here, in hopes that you too find it interesting, is a brief run-down on what I learned.</p>
<p>Approximate computing is fundamentally about trading off accuracy for something else - often speed or power consumption. Initially this sounded like a very weird proposition: computers do things like 'running your operating system' and 'reading from and writing to disks': things you need to always be absolutely correct if you want anything vaguely resembling reliability. It turns out that this is actually not as big a roadblock as I had assumed - you can work around it fairly easily.</p>
<p>The model proposed for approximate computing is as follows. You divide your computation up into two classes: 'precise', and 'approximate'. You use 'precise' computations when you need to get exact answers: so for example if you are constructing a JPEG file, you want the JPEG header to be exact. Then you have approximate computations: so for example the contents of your image can be approximate.</p>
<p>For correctness, you have to establish some boundaries: you say that precise data can be used in approximate calculations, but that approximate data isn't allowed to cross back over and pollute precise calculations. This, while intuitively correct, poses some problems in practice: when you want to write out your approximate JPEG data, you need an operation that allows you to 'bless' (or in their terms 'endorse') some approximate data so it can be used in the precise file system operations.</p>
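<p>Here's a toy Python analogue of that endorse boundary (EnerJ itself is Java with compiler enforcement; all the names below are made up for illustration):</p>

```python
class Approx:
    """Toy analogue of an 'approximate' annotation: a wrapper marking a
    value as approximate data."""
    def __init__(self, value):
        self.value = value

def endorse(x):
    # Explicitly bless approximate data for use on a precise code path.
    return x.value if isinstance(x, Approx) else x

def precise_write(header, payload):
    # Precise code refuses raw approximate data: it must be endorsed first.
    if isinstance(payload, Approx):
        raise TypeError("approximate data must be endorsed before precise use")
    return header + bytes(payload)

pixels = Approx([128, 64, 32])                       # approximate image contents
jpeg = precise_write(b"\xff\xd8", endorse(pixels))   # the header stays precise
```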
<p>In the talk we were shown an implementation of this model in Java, called <a href="http://sampa.cs.washington.edu/research/approximation/enerj.html">EnerJ</a>. EnerJ allows you to label variables with either <code>@Precise</code> if you're dealing with precise data, or <code>@Approx</code> if you're dealing with approximate data. The compiler was modified so that it would do all sorts of weird things when it knew it was dealing with approximate data: for example, drop loop iterations entirely, do things in entirely non-deterministic ways - all sorts of fun stuff. It turns out this works surprisingly well.</p>
<p>However, approximate computing really shines when you can bring it all the way down to the hardware level. The first thing they tried was a CPU with both 'approximate' and precise execution engines, but this turned out not to have the power savings hoped for. What seemed to work really well was a model where some approximate calculations could be identified ahead of time, and then replaced with neural networks in hardware. These neural networks approximated the calculations, but did so at significantly lower power levels. This sounded like a really promising concept, and it will be interesting to see if this goes anywhere over the next few years.</p>
<p>There's a lot of work evaluating the quality of the approximate result, both for cases where the set of inputs is known and where it is not. This is largely beyond my understanding, so I'll simply refer you to some of the papers <a href="http://sampa.cs.washington.edu/research/approximation/enerj.html">listed on the website</a>.</p>
<p>The final thing covered in the talk was bringing approximate computing into current paradigms by just being willing to accept higher user-visible error rates. For example, they hacked up a network stack to accept packets with invalid checksums. This has had mixed results so far. A question I had (but didn't get around to asking!) would be whether the mathematical properties of checksums (i.e. that they can correct a certain number of bit errors) could be used to correct some of the errors, rather than just accepting/rejecting them blindly. Perhaps by first attempting to correct errors using the checksums, we will be able to fix the simpler errors, reducing the error rate visible to the user.</p>
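<p>As a sketch of that idea, here's a brute-force single-bit corrector in Python, using CRC32 as a stand-in for the real packet checksum (an assumption purely for illustration; a real stack would use the packet's own checksum algorithm):</p>

```python
import zlib

def correct_single_bit(data, expected_crc):
    """Try to repair a single-bit error by brute force against a CRC32.

    Flips each bit in turn and returns the candidate whose CRC matches,
    or None if no single-bit flip explains the mismatch.
    """
    if zlib.crc32(data) == expected_crc:
        return data                        # nothing to correct
    for i in range(len(data) * 8):
        candidate = bytearray(data)
        candidate[i // 8] ^= 1 << (i % 8)  # flip bit i
        if zlib.crc32(candidate) == expected_crc:
            return bytes(candidate)
    return None
```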
<p>Overall, I found the NICTA Systems Summer School to be a really interesting experience (and I hope to blog more about it soon). If you're a university student in Australia, or an academic, see if you can make it in 2017!</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Axtens</dc:creator><pubDate>Tue, 15 Mar 2016 11:30:00 +1100</pubDate><guid>tag:sthbrx.github.io,2016-03-15:blog/2016/03/15/and-now-for-something-completely-different-approximate-computing/</guid><category>nicta</category><category>conferences</category></item><item><title>linux.conf.au 2016: A set of brief thoughts</title><link>https://sthbrx.github.io/blog/2016/03/15/linuxconfau-2016-a-set-of-brief-thoughts/</link><description><p>Recently most of us attended LCA2016. This is one set of reflections on what we heard and what we've thought since. (Hopefully not the only set of reflections that will be posted on this blog either!)</p>
<p>LCA was 2 days of miniconferences plus 3 days of talks. Here, I've picked some of the more interesting talks I attended, and I've written down some thoughts. If you find the thoughts interesting, you can click through and watch the whole talk video, because LCA is awesome like that.</p>
<h4>Life is better with Rust's community automation</h4>
<p><a href="https://www.youtube.com/watch?v=dIageYT0Vgg">This talk</a> is probably the one that's had the biggest impact on our team so far. We were really impressed by the community automation that Rust has: the way they can respond to pull requests from new community members in a way that lets them keep their code quality high and be nice to everyone at the same time.</p>
<p>The system that they've developed is fascinating (and seems fantastic). However, their system uses pull requests, while we use mailing lists. Pull requests are easy, because GitHub has good hook support, but how do we link mailing lists to an automatic test system?</p>
<p>As it turns out, this is something we're working on: we already have <a href="http://patchwork.ozlabs.org/">Patchwork</a>, and <a href="https://openpower.xyz/">Jenkins</a>: how do we link them? We have something brewing, which we'll open source real soon now - stay tuned!</p>
<h4>Usable formal methods - are we there yet?</h4>
<p>I liked <a href="https://www.youtube.com/watch?v=RxHjhBVOCSU">this talk</a>, as I have a soft spot for formal methods (as I have a soft spot for maths). It covers applying a bunch of static analysis and some of the less intrusive formal methods (in particular <a href="http://www.cprover.org/cbmc/">cbmc</a>) to an operating system kernel. They were looking at eChronos rather than Linux, but it's still quite an interesting set of results.</p>
<p>We've also tried to increase our use of static analysis, which has already found a <a href="http://patchwork.ozlabs.org/patch/580629/">real bug</a>. We're hoping to scale this up, especially the use of sparse and cppcheck, but we're a bit short on developer cycles for it at the moment.</p>
<h4>Adventures in OpenPower Firmware</h4>
<p>Stewart Smith - another OzLabber - gave <a href="https://www.youtube.com/watch?v=a4XGvssR-ag">this talk</a> about, well, OpenPOWER firmware. This is a large part of our lives in OzLabs, so it's a great way to get a picture of what we do each day. It's also a really good explanation of the open source stack we have: a POWER8 CPU runs open-source code from the first cycle.</p>
<h4>What Happens When 4096 Cores All Do <code>synchronize_rcu_expedited()</code>?</h4>
<p>Paul McKenney is a parallel programming genius - he literally <a href="https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html">'wrote the book'</a> (or at least, wrote <em>a</em> book!) on it. <a href="https://www.youtube.com/watch?v=1nfpjHTWaUc">His talk</a> is - as always - a brain-stretching look at parallel programming within the RCU subsystem of the Linux kernel. In particular, the tree structure for locking that he presents is really interesting and quite a clever way of scaling what at first seems to be a necessarily global lock.</p>
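As a back-of-the-envelope sketch of why the tree helps (my own illustration, not code from the talk): with a fanout of 16, which mirrors the kernel's default CONFIG_RCU_FANOUT, each CPU only contends on its leaf node's lock, and only one winner per node proceeds to the parent, so the root lock never sees more than 16 contenders no matter how many CPUs there are.

```c
#include <assert.h>

/* Smallest number of tree levels such that fanout^levels >= ncpus.
 * Each extra level multiplies the number of CPUs one lock can "cover". */
static int tree_levels(long ncpus, long fanout)
{
    int levels = 0;
    long span = 1;
    while (span < ncpus) {
        span *= fanout;
        levels++;
    }
    return levels;
}
```

So 4096 CPUs need only three levels, and a global operation costs each CPU a handful of lock acquisitions instead of all 4096 CPUs hammering a single global lock.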
<p>I'd also really recommend <a href="https://www.youtube.com/watch?v=tFmajPt0_hI">RCU Mutation Testing</a>, from the kernel miniconf, also by Paul.</p>
<h4>What I've learned as the kernel docs maintainer</h4>
<p>As an extra bonus: I mention <a href="https://www.youtube.com/watch?v=gsJXf6oSbAE">this talk</a>, just to say "why on earth have we still not fixed the Linux kernel <a href="https://www.kernel.org/doc/linux/README">README</a>"?!!?</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Axtens</dc:creator><pubDate>Tue, 15 Mar 2016 11:30:00 +1100</pubDate><guid>tag:sthbrx.github.io,2016-03-15:blog/2016/03/15/linuxconfau-2016-a-set-of-brief-thoughts/</guid><category>lca2016</category><category>conferences</category></item><item><title>Learning From the Best</title><link>https://sthbrx.github.io/blog/2016/03/03/learning-from-the-best/</link><description><p>When I first started at IBM I knew how to alter Javascript and compile it. This is because of my many years playing Minecraft (yes, I am a nerd). Now I have leveled up! I can understand and use Bash, Assembly, Python, Ruby and C! Writing full programs in any of these languages is a very difficult prospect, but nonetheless achievable with what I know now, whereas two weeks ago it would have been impossible. Working here even for a short time has been an amazing learning experience for me, plus it looks great on a resume! Learning how to write C has been one of the most useful things I have learnt. I have already written programs for use both in and out of IBM. The first program I wrote was the standard newbie 'hello world' exercise. I have now expanded on that program so that it now says, "Hello world! This is Callum Scarvell". This is done using strings that recognise my name as a set character. Then I used a header file called conio.h or curses.h to recognise 'cal' as the short form of my name, so now I can abbreviate my name more easily. Here's what the code looks like:</p>
<div class="highlight"><pre><span class="cp">#include &lt;stdio.h&gt;</span>
<span class="cp">#include &lt;string.h&gt;</span>
<span class="cp">#include &lt;curses.h&gt;</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">&quot;Hello, World! This Is cal&quot;</span><span class="p">);</span>
<span class="kt">char</span> <span class="n">first_name</span><span class="p">[]</span> <span class="o">=</span> <span class="s">&quot;Callum&quot;</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">last_name</span><span class="p">[]</span> <span class="o">=</span> <span class="s">&quot;Scarvell&quot;</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">name</span><span class="p">[</span><span class="mi">100</span><span class="p">];</span>
<span class="cm">/* testing code */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">strncmp</span><span class="p">(</span><span class="n">first_name</span><span class="p">,</span> <span class="s">&quot;Callum&quot;</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">strncmp</span><span class="p">(</span><span class="n">last_name</span><span class="p">,</span> <span class="s">&quot;Scarvell&quot;</span><span class="p">,</span><span class="mi">100</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">last_name</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="sc">&#39;S&#39;</span><span class="p">;</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="s">&quot;%s %s&quot;</span><span class="p">,</span> <span class="n">first_name</span><span class="p">,</span> <span class="n">last_name</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">strncmp</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="s">&quot;Callum Scarvell&quot;</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">&quot;This is %s</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">,</span><span class="n">name</span><span class="p">);</span>
<span class="p">}</span>
<span class="cm">/*printf(&quot;actual string is -%s-\n&quot;,name);*/</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">Name_Rec</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">,</span><span class="n">k</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">a</span><span class="p">[</span><span class="mi">30</span><span class="p">],</span><span class="n">b</span><span class="p">[</span><span class="mi">30</span><span class="p">];</span>
<span class="n">clrscr</span><span class="p">();</span>
<span class="n">puts</span><span class="p">(</span><span class="s">&quot;Callum Scarvell : </span><span class="se">\n</span><span class="s">&quot;</span><span class="p">);</span>
<span class="n">gets</span><span class="p">(</span><span class="n">a</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">&quot;</span><span class="se">\n\n</span><span class="s">cal : </span><span class="se">\n\n</span><span class="s">%c&quot;</span><span class="p">,</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">!=</span><span class="sc">&#39;\0&#39;</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
</pre></div>
<p>The last two lines have been left out to make it a challenge to recreate. Feel free to test your own knowledge of C to finish the program! My ultimate goal for this program is to make it generate the text 'Hello World! This is Callum Scarvell's computer. Everybody else beware!' (which is easy) then import it into the Linux kernel to the profile login screen. Then I will have my own unique copy of the kernel. And I could call myself an LSD (Linux system developer). That's just a small pet project I have been working on in my time here. Another pet project of mine is my own very altered copy of the open source game NetHack. It's written in C as well and is very easy to tinker with. I have been able to do things like set my character's starting hit points to 40, give my character awesome starting gear and keep save files even after the death of a character. These are just a couple of small projects that made learning C so much easier and a lot more fun. And the whole time I was learning C, Ruby, or Python I had some of the best system developers in the world showing me the ropes. This made things even easier, and much more comprehensive. So really it's no surprise that in three short weeks I managed to learn almost four different languages and how to run a blog from the raw source code. The knowledge given to me by the OzLabs team is priceless and invaluable. I will forever remember all the new faces and what they taught me. And the <em>Linux Gods</em> will answer your prayers whether by e-mail or in person, because they walk among us! So if you ever get an opportunity to do work experience, an internship or a graduate placement, take the chance to do it, because you will learn many things that are not taught in school.</p>
<p>If you would like to review the source code for the blog or my work in general you can find me at <a href="https://github.com/CallumScar">CallumScar.github.com</a> or find me on Facebook, <a href="https://www.facebook.com/callum.scarvell/about">Callum Scarvell</a>. </p>
<p>And a huge thank you to the OzLabs team for taking me on for the three weeks and for teaching me so much! I am forever indebted to everyone here. </p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Callum Scarvell</dc:creator><pubDate>Thu, 03 Mar 2016 00:00:00 +1100</pubDate><guid>tag:sthbrx.github.io,2016-03-03:blog/2016/03/03/learning-from-the-best/</guid><category>education</category><category>work experience</category><category>Linux Gods</category></item><item><title>Work Experience At Ozlabs</title><link>https://sthbrx.github.io/blog/2016/02/25/work-experience-at-ozlabs/</link><description><p>As a recent year twelve graduate my knowledge of computer science was very limited and my ability to write working programs was all but nonexistent. So you can imagine my excitement when I heard of an opening for work experience with IBM's internationally renowned Ozlabs team, or as I knew them, the <em>Linux Gods</em>. On my first day of working at Ozlabs I learnt more about programming than in six years of secondary education. I met most of the Ozlabs team and made connections that will certainly help with my pursuit of a career in IT. Because in business it's who you know more than what you know, and now that I know the guys at Ozlabs, I know how to write code and run it on my own Linux distro. And on top of all the extremely valuable knowledge, I am on a first-name basis with the <em>Linux Gods</em> at the LTC.</p>
<p>After my first week at Ozlabs I cloned this blog from Octopress and reformatted it for the Pelican static site generator. For those who don't know, Octopress is a Ruby-based static site generator, so converting the embedded Ruby gems to Pelican's Python code was no easy task for this newbie. Luckily I had a team of some of the best software developers in the world to help and teach me their ways. After we sorted the change from Ruby to Python and I was able to understand both languages, I presented my work to the team. They then decided to throw me a curve ball, as they did not like any of Pelican's default themes; instead they wanted the original Octopress theme on the new blog. This is how I learnt GitHub is my bestest friend, because some kind soul had already converted the Ruby theme into Python and it ran perfectly!</p>
<p>Now it was a simple task of reformatting the Ruby-gem text files into Markdown, which is Pelican compatible (which is why we chose Pelican in the first place). So now we had a working Pelican blog with the Octopress theme, with one issue: it was very annoying to navigate. Using my newly learned skills and understanding of Python I inserted tags, categories, web links and a navigation bar, and I started learning how to code C. And it all worked fine! That was what I, a newbie, could accomplish in one week. I still have two more weeks left here and I have plenty of really interesting work left to do. This has been one of the greatest learning experiences of my life and I would do it again if I could! So if you are looking for experience in IT or software development, look no further, because you could be learning to code from the people who wrote the language itself. The <em>Linux Gods</em>.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Callum Scarvell</dc:creator><pubDate>Thu, 25 Feb 2016 00:00:00 +1100</pubDate><guid>tag:sthbrx.github.io,2016-02-25:blog/2016/02/25/work-experience-at-ozlabs/</guid><category>Work Experience</category><category>Ozlabs</category></item><item><title>Panic, flushing and compromise</title><link>https://sthbrx.github.io/blog/2016/02/15/panic/</link><description><p>This is a tale of a simple problem, with a relatively simple solution, that ended up being pretty complicated.</p>
<p>The BMC of OpenPOWER machines expose a serial console. It's pretty useful for getting information as the system is booting, or when it's having issues and the network is down. OpenPOWER machines also have runtime firmware, namely <a href="https://github.com/open-power/skiboot">skiboot</a>, which the Linux kernel calls to make certain things happen. One of those is writing to the serial console. There's a function that <a href="https://github.com/open-power/skiboot/blob/master/core/opal.c">skiboot exposes</a>, <code>opal_poll_events()</code> (which then calls <code>opal_run_pollers()</code>), which the kernel calls frequently. Among other things, it performs a partial flush of the serial console. And that all works fine...until the kernel panics.</p>
<p>Well, the kernel is in panic. Who cares if it flushes the console? It's dead. It doesn't need to do anything else.</p>
<p>Oh, right. It prints the reason it panicked. Turns out that's pretty useful.</p>
<p>There's a pretty simple fix here that we can push into the firmware. Most kernels are configured to reboot after panic, typically with some delay. In OpenPOWER, the kernel reboots by calling into skiboot with the <code>opal_cec_reboot()</code> function. So all we need to do is flush out the console buffer:</p>
<div class="highlight"><pre><span class="k">static</span> <span class="n">int64</span> <span class="nf">opal_cec_reboot</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">&quot;OPAL: Reboot request...</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">);</span>
<span class="n">console_complete_flush</span><span class="p">();</span> <span class="c1">// &lt;-- what I added</span>
<span class="c1">// rebooting stuff happens here...</span>
<span class="k">return</span> <span class="n">OPAL_SUCCESS</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
<p>Writing a complete flushing function was pretty easy; then I called it from the power-down and reboot functions. Easy, and all nicely contained in firmware.</p>
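The shape of that firmware change can be modelled in userspace (a toy model: skiboot's real console code drives a UART under a lock, and the names here just mimic its <code>console_complete_flush()</code>):

```c
#include <assert.h>
#include <stddef.h>

#define CONSOLE_BUF_SIZE 4096
#define FLUSH_CHUNK 64

/* A circular output buffer: tail chases head. */
static char out_buf[CONSOLE_BUF_SIZE];
static size_t out_head, out_tail;

/* Partial flush: emit up to FLUSH_CHUNK pending bytes per call,
 * as a poller would. Returns the number of bytes still pending. */
static size_t console_flush_partial(void)
{
    size_t n = 0;
    while (out_tail != out_head && n < FLUSH_CHUNK) {
        /* a real implementation would write out_buf[out_tail] to the UART */
        out_tail = (out_tail + 1) % CONSOLE_BUF_SIZE;
        n++;
    }
    return (out_head + CONSOLE_BUF_SIZE - out_tail) % CONSOLE_BUF_SIZE;
}

/* Complete flush, as called on power-down/reboot: drain everything. */
static void console_complete_flush(void)
{
    while (console_flush_partial() != 0)
        ;
}
```

The complete flush is just the partial flush in a loop, which is exactly the refactoring described further down.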
<p>Now, what if the kernel isn't configured to reboot after panic? Or, what if the reboot timer is really long? Do you want to wait 3 minutes to see your panic output? Probably not. We need to call the pollers after panic.</p>
<p>First, I had to figure out what the kernel actually <em>does</em> when it panics. Let's have a look at the <a href="https://github.com/torvalds/linux/blob/master/kernel/panic.c">panic function itself</a> to figure out where we could work some code in.</p>
<p>In the <code>panic()</code> function, the easiest place I found to put in some code was <code>panic_blink()</code>. This is supposed to be a function to blink the LEDs on your keyboard when the kernel is panicking, but we could set it to <code>opal_poll_events()</code> and it'd work fine. There, problem solved!</p>
<p>Oh, wait. That will never get accepted upstream, ever. Let's try again.</p>
<p>Well, there are <code>#ifdef</code>s in the code that are architecture specific, for s390 and SPARC. I could add an <code>#ifdef</code> to check if we're an OpenPOWER machine, and if so, run the pollers a bunch of times. That would also involve including architecture specific code from <code>arch/powerpc</code>, and that's somewhat gross. Maybe I could upstream this, but it'd be difficult. There must be a better way.</p>
<p>As a kernel noob, I found myself digging into what every function called by <code>panic()</code> actually did, to see if there's a way I could use it. I looked over it at first, but eventually I started looking harder at this line:</p>
<div class="highlight"><pre> <span class="n">kmsg_dump</span><span class="p">(</span><span class="n">KMSG_DUMP_PANIC</span><span class="p">);</span>
</pre></div>
<p>It turns out <code>kmsg_dump()</code> does what it says: dumps messages from the kernel. Different parts of the kernel can register their own dumpers, so the kernel can have a variety of dumpers for different purposes. One existing example in OpenPOWER is a kmsg dumper that stores messages in <code>nvram</code> (non-volatile RAM), so you can find it after you reboot.</p>
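The registration pattern looks roughly like this. This is a simplified userspace mock of the kernel's <code>kmsg_dump_register()</code> interface, for illustration only; the real API keeps the dumper list under a lock and returns an error code:

```c
#include <assert.h>
#include <stddef.h>

enum kmsg_dump_reason { KMSG_DUMP_OOPS, KMSG_DUMP_PANIC };

/* Each subsystem embeds one of these and supplies a dump callback. */
struct kmsg_dumper {
    void (*dump)(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason);
    struct kmsg_dumper *next;
};

static struct kmsg_dumper *dumpers;

/* Mock of kmsg_dump_register(): add a dumper to the list. */
static void kmsg_dump_register(struct kmsg_dumper *dumper)
{
    dumper->next = dumpers;
    dumpers = dumper;
}

/* Mock of kmsg_dump(): the kernel calls this at oops/panic time, and
 * every registered dumper gets a chance to act on the reason. */
static void kmsg_dump(enum kmsg_dump_reason reason)
{
    for (struct kmsg_dumper *d = dumpers; d; d = d->next)
        d->dump(d, reason);
}
```

The flushing "dumper" shown later hangs off exactly this hook, ignoring every reason except <code>KMSG_DUMP_PANIC</code>.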
<p>Well, we don't really want to dump any output, it's already been sent to the output buffer. We just need to flush it. Pretty simple, just call <code>opal_poll_events()</code> a whole bunch of times, right? That <em>would</em> work, though it'd be nice to have a better way than just calling the pollers. Instead, we can add a new API call to skiboot specifically for console flushing, and call it from the kmsg dumper.</p>
<p>Initially, I wired up the skiboot complete console flushing function to a new OPAL API call, and called that from the kernel. After some feedback, this was refactored into a partial, incremental flush so it was more generic. I also had to consider what happened if the machine was running a newer kernel and an older skiboot, so if the skiboot version didn't have my new flushing call it would fall back to calling the pollers an arbitrary amount of times.</p>
<p>In the end, it looks like this:</p>
<div class="highlight"><pre><span class="cm">/*</span>
<span class="cm"> * Console output is controlled by OPAL firmware. The kernel regularly calls</span>
<span class="cm"> * OPAL_POLL_EVENTS, which flushes some console output. In a panic state,</span>
<span class="cm"> * however, the kernel no longer calls OPAL_POLL_EVENTS and the panic message</span>
<span class="cm"> * may not be completely printed. This function does not actually dump the</span>
<span class="cm"> * message, it just ensures that OPAL completely flushes the console buffer.</span>
<span class="cm"> */</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">force_opal_console_flush</span><span class="p">(</span><span class="k">struct</span> <span class="n">kmsg_dumper</span> <span class="o">*</span><span class="n">dumper</span><span class="p">,</span>
<span class="k">enum</span> <span class="n">kmsg_dump_reason</span> <span class="n">reason</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="kt">int64_t</span> <span class="n">ret</span><span class="p">;</span>
<span class="cm">/*</span>
<span class="cm"> * Outside of a panic context the pollers will continue to run,</span>
<span class="cm"> * so we don&#39;t need to do any special flushing.</span>
<span class="cm"> */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">reason</span> <span class="o">!=</span> <span class="n">KMSG_DUMP_PANIC</span><span class="p">)</span>
<span class="k">return</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">opal_check_token</span><span class="p">(</span><span class="n">OPAL_CONSOLE_FLUSH</span><span class="p">))</span> <span class="p">{</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">opal_console_flush</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ret</span> <span class="o">==</span> <span class="n">OPAL_UNSUPPORTED</span> <span class="o">||</span> <span class="n">ret</span> <span class="o">==</span> <span class="n">OPAL_PARAMETER</span><span class="p">)</span>
<span class="k">return</span><span class="p">;</span>
<span class="cm">/* Incrementally flush until there&#39;s nothing left */</span>
<span class="k">while</span> <span class="p">(</span><span class="n">opal_console_flush</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">!=</span> <span class="n">OPAL_SUCCESS</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="cm">/*</span>
<span class="cm"> * If OPAL_CONSOLE_FLUSH is not implemented in the firmware,</span>
<span class="cm"> * the console can still be flushed by calling the polling</span>
<span class="cm"> * function enough times to flush the buffer. We don&#39;t know</span>
<span class="cm"> * how much output still needs to be flushed, but we can be</span>
<span class="cm"> * generous since the kernel is in panic and doesn&#39;t need</span>
<span class="cm"> * to do much else.</span>
<span class="cm"> */</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_NOTICE</span> <span class="s">&quot;opal: OPAL_CONSOLE_FLUSH missing.</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">1024</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">opal_poll_events</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></div>
<p>You can find the full code in-tree <a href="https://github.com/torvalds/linux/blob/master/arch/powerpc/platforms/powernv/opal-kmsg.c">here</a>.</p>
<p>And thus, panic messages now roam free 'cross the countryside, causing developer frustration around the world. At least now they know why they're frustrated.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Russell Currey</dc:creator><pubDate>Mon, 15 Feb 2016 14:23:00 +1100</pubDate><guid>tag:sthbrx.github.io,2016-02-15:blog/2016/02/15/panic/</guid><category>openpower</category></item><item><title>Evolving into a systems programmer</title><link>https://sthbrx.github.io/blog/2015/11/06/evolving-into-a-systems-programmer/</link><description><p>In a previous life I tutored first year computing. The university I
attended had a policy of using C to introduce first years to programming.
One of the most rewarding aspects of teaching is opening doors of
possibility to people by sharing my knowledge.</p>
<p>Over the years I had a mixture of computer science or computer engineering
students as well as other disciplines of engineering who were required to
learn the basics (notably electrical and mechanical). Each class was
different and the initial knowledge always varied greatly. The beauty of
teaching C meant that there was never someone who truly knew it all, heck,
I didn't and still don't. The other advantage of teaching C is that I could
very quickly spot the hackers: the shy person at the back of the room whose
eyes light up when you know you've correctly explained pointers (to them
anyway) or when asked "What happens if you use a negative index into an
array" and the smile they would make upon hearing "What do you think happens".</p>
<p>Right there I would see the makings of a hacker, and this post is dedicated
to you or to anyone who wants to be a hacker. I've been asked "What did you
do to get where you are?", "How do I get into Linux?" (vague much) at
careers fairs. I never quite know what to say, so here goes a braindump.</p>
<p>Start with the basics. One of the easiest ways we tested the first years was
to tell them they couldn't use parts of libc. That was a great exam: setting
aside those who didn't read the question and used <code>strlen()</code> when they were
explicitly told they couldn't <code>#include &lt;string.h&gt;</code>, a true hacker doesn't
need libc, and understands it won't always be there. I thought of this example
because only two weeks ago I was writing code in an environment where I
didn't have libc. OK, sure, if you've got it, use it; just don't crumble
when you don't. Oh, how I wish I could have told those students who argued
that it was a pointless question that they were objectively wrong.</p>
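For instance, the classic version of that exam question (my reconstruction, not the actual paper): walk the string yourself instead of reaching for libc.

```c
#include <assert.h>
#include <stddef.h>

/* strlen() without <string.h>: count bytes until the NUL terminator. */
static size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}
```

Trivial with libc on hand; the point is knowing what the library is doing for you so you can do it yourself when it isn't there.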
<p>Be a fan of assembly, don't be afraid of it, it doesn't bite and it can be
a lot of fun. I wouldn't encourage you to dive right into the PowerISA,
it's intense but perhaps understand the beauty of GCC, know what it's doing
for you. There are a variety of little 8-bit processors you can play with
these days.</p>
<p>At all levels of my teaching I saw almost everyone get something which
'worked', and that's fine, it probably does; but I'm here to tell you that
it doesn't work until you know why it works. I'm all for the 'try it and
see' approach, but once you've tried it you have to explain why the
behaviour changed, otherwise you didn't fix it. As an extension to that,
know how your tools work. I don't think anyone would expect you to be able
to write tools with the complexity of GCC or GDB or Valgrind, but
have a rough idea as to how they achieve their goals.</p>
<p>A hacker is paranoid, yes, <code>malloc()</code> fails. Linux might just decide now
isn't a good time for you to <code>open()</code> and your <code>fopen()</code> calling function had
better be cool with that. A hacker also doesn't rely on the kindness of the
operating system; there's an <code>munmap()</code> for a reason. Nor should you even
completely trust it, what are you leaving around in memory?</p>
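A sketch of what that paranoia looks like in practice (illustrative only): every libc call here is treated as fallible, and each failure is unwound rather than assumed away.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a freshly allocated, NUL-terminated buffer.
 * Returns NULL on any error; *len_out gets the length on success. */
static char *read_file(const char *path, size_t *len_out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;                    /* missing file, EMFILE, EACCES... */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    if (fseek(f, 0, SEEK_SET) != 0) { fclose(f); return NULL; }
    char *buf = malloc((size_t)size + 1);
    if (!buf) { fclose(f); return NULL; }   /* yes, malloc() fails */
    if (fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);
        fclose(f);
        return NULL;
    }
    fclose(f);
    buf[size] = '\0';
    if (len_out)
        *len_out = (size_t)size;
    return buf;
}
```

Half the lines are error handling; that ratio is normal for systems code.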
<p>Above all, do it for the fun of it. So many of my students asked how I
knew everything I knew (I was only a year ahead of them in my first year of
teaching) and the answer, put simply: write code on a Saturday night.</p>
<p>None of these things do or don't make you a hacker, being a hacker is a
frame of mind and a way of thinking but all of the above helps.</p>
<p>Unfortunately there isn't a single path, I might even say it is a path that
chooses you. Odds are you're here because you approached me at some point
and asked me one of those questions I never quite know how to answer.
Perhaps this is the path, at the very least you're asking questions and
approaching people. I hope I did on the day, but once again, all the very
best with your endeavours into the future</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cyril Bur</dc:creator><pubDate>Fri, 06 Nov 2015 11:13:00 +1100</pubDate><guid>tag:sthbrx.github.io,2015-11-06:blog/2015/11/06/evolving-into-a-systems-programmer/</guid><category>education</category><category>offtopic</category></item><item><title>What the HILE is this?</title><link>https://sthbrx.github.io/blog/2015/11/03/what-the-hile-is-this/</link><description><p>One of the cool features of POWER8 processors is the ability to run in either big- or little-endian mode. Several distros are already available in little-endian, but up until recently Petitboot has remained big-endian. While it has no effect on the OS, building Petitboot little-endian has its advantages, such as making support for vendor tools easier.
So it should just be a matter of compiling Petitboot LE, right? Well...</p>
<h3>Switching Endianness</h3>
<p>Endianness, and several other things besides, are controlled by the Machine State Register (MSR). Each processor in a machine has an MSR, and each bit of the MSR controls some aspect of the processor such as 64-bit mode or enabling interrupts. To switch endianness we set the LE bit (63) to 1.</p>
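A small gotcha when turning "bit 63" into a mask: Power documentation numbers bits from the most significant end, so MSR bit 63 is actually the least significant bit of the register. A quick sketch of the conversion (the resulting MSR_LE value of 0x1 matches the Linux kernel's definition):

```c
#include <assert.h>
#include <stdint.h>

/* Convert an IBM-numbered bit (bit 0 = most significant) of a 64-bit
 * register into a conventional mask. */
static uint64_t ibm_bit64(unsigned int n)
{
    return 1ULL << (63 - n);
}

#define MSR_LE ibm_bit64(63)   /* little-endian mode: mask 0x1 */
```

So "setting bit 63" means OR-ing in 0x1, not 0x8000000000000000.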
<p>When a processor first starts up it defaults to big-endian (bit 63 = 0). However, the processor doesn't actually know the endianness of the kernel code it is about to execute - either it is big-endian and everything is fine, or it isn't and the processor will very quickly try to execute an illegal instruction.</p>
<p>The solution to this is an amazing little snippet of code in <a href="https://github.com/torvalds/linux/blob/master/arch/powerpc/boot/ppc_asm.h#L65">arch/powerpc/boot/ppc_asm.h</a> (follow the link to see some helpful commenting):</p>
<div class="highlight"><pre><span class="cp">#define FIXUP_ENDIAN</span>
<span class="n">tdi</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mh">0x48</span><span class="p">;</span>
<span class="n">b</span> <span class="err">$</span><span class="o">+</span><span class="mi">36</span><span class="p">;</span>
<span class="p">.</span><span class="kt">long</span> <span class="mh">0x05009f42</span><span class="p">;</span>
<span class="p">.</span><span class="kt">long</span> <span class="mh">0xa602487d</span><span class="p">;</span>
<span class="p">.</span><span class="kt">long</span> <span class="mh">0x1c004a39</span><span class="p">;</span>
<span class="p">.</span><span class="kt">long</span> <span class="mh">0xa600607d</span><span class="p">;</span>
<span class="p">.</span><span class="kt">long</span> <span class="mh">0x01006b69</span><span class="p">;</span>
<span class="p">.</span><span class="kt">long</span> <span class="mh">0xa6035a7d</span><span class="p">;</span>
<span class="p">.</span><span class="kt">long</span> <span class="mh">0xa6037b7d</span><span class="p">;</span>
<span class="p">.</span><span class="kt">long</span> <span class="mh">0x2400004c</span>
</pre></div>
<p>By some amazing coincidence if you take the opcode for <code>tdi 0, 0, 0x48</code> and flip the order of the bytes it forms the opcode for <code>b . + 8</code>. So if the kernel is big-endian, the processor will jump to the next instruction after this snippet. However if the kernel is little-endian we execute the next 8 instructions. These are written in reverse so that if the processor isn't in the right endian it interprets them backwards, executing the instructions shown in the linked comments above, resulting in MSR<sub>LE</sub> being set to 1.</p>
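The "amazing coincidence" is easy to check by assembling the two instructions by hand (a sketch using the encodings from the Power ISA: <code>tdi</code> is primary opcode 2 with a 16-bit immediate, and <code>b</code> is primary opcode 18 with a word-aligned byte displacement in its low 26 bits):

```c
#include <assert.h>
#include <stdint.h>

/* tdi TO,RA,SI: primary opcode 2, then TO, RA, and a 16-bit immediate. */
static uint32_t ppc_tdi(uint32_t to, uint32_t ra, uint32_t si)
{
    return (2u << 26) | (to << 21) | (ra << 16) | (si & 0xffffu);
}

/* b target: primary opcode 18; byte displacement with AA = LK = 0. */
static uint32_t ppc_b(int32_t displacement)
{
    return (18u << 26) | ((uint32_t)displacement & 0x03fffffcu);
}

/* Reverse the byte order of a 32-bit word, which is effectively what a
 * wrong-endian processor does when it fetches the instruction. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}
```

So if the CPU's endianness matches the kernel's, <code>tdi 0, 0, 0x48</code> is a no-op (TO = 0 traps on nothing) and <code>b $+36</code> skips the trampoline; if it doesn't match, the very same word is fetched as <code>b . + 8</code>, jumping into the byte-reversed trampoline that sets MSR<sub>LE</sub>.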
<p>When booting a little-endian kernel all of the above works fine - but there is a problem for Petitboot that will become apparent a little further down...</p>
<h3>Petitboot's Secret Sauce</h3>
<p>The main feature of Petitboot is that it is a full (but small!) Linux kernel and userspace which scans all available devices and presents possible boot options. To boot an available operating system Petitboot needs to start executing the OS's kernel, which it accomplishes via <a href="https://en.wikipedia.org/wiki/Kexec">kexec</a>. Simply speaking kexec loads the target kernel into memory, shuts the current system down most of the way, and at the last moment sets the instruction pointer to the start of the target kernel. From there it's like booting any other kernel, including the FIXUP_ENDIAN section above.</p>
<h3>We've Booted! Wait...</h3>
<p>So our LE Petitboot kernel boots fine thanks to FIXUP_ENDIAN, we kexec into some other kernel... and everything falls to pieces.<br />
The problem is we've unwittingly changed one of the assumptions of booting a kernel; namely that MSR<sub>LE</sub> defaults to zero. When kexec-ing from an LE kernel we start executing the next kernel in LE mode. This itself is ok, the FIXUP_ENDIAN macro will handle the switch if needed. The problem is that the FIXUP_ENDIAN macro is relatively recent, first entering the kernel in early 2014. So if we're booting, say, an old Fedora 19 install with a v3.9 kernel - things go very bad, very quickly.</p>
<h3>Fix #1</h3>
<p>The solution seems pretty straightforward: find where we jump into the next kernel, and just before that make sure we reset the LE bit in the MSR. That's exactly what <a href="https://github.com/antonblanchard/kexec-lite/commit/150b14e76a4b51f865b929ad9a9bf4133e2d3af7">this patch</a> to kexec-lite does.<br />
That worked up until I tested on a machine with more than one CPU. Remembering that the MSR is processor-specific, we also have to <a href="https://github.com/torvalds/linux/commit/ffebf5f391dfa9da3e086abad3eef7d3e5300249">reset the endianness of each secondary CPU</a>.<br />
Now things are looking good! All the CPUs are reset to big-endian, the target kernel boots fine, and then... 'recursive interrupts?!'</p>
<h3>HILE</h3>
<p>Skipping the debugging process that led to this (hint: <a href="https://www.flamingspork.com/blog/2014/12/03/running-skiboot-opal-on-the-power8-simulator/">mambo</a> is actually a pretty cool tool), this was the sequence of steps leading up to the problem:</p>
<ul>
<li>Little-endian Petitboot kexecs into a big-endian kernel</li>
<li>All CPUs are reset to big-endian</li>
<li>The big-endian kernel begins to boot successfully</li>
<li>Somewhere in the device-tree parsing code we take an exception</li>
<li>Execution jumps to the exception handler at <a href="https://github.com/torvalds/linux/blob/master/arch/powerpc/kernel/exceptions-64s.S#L199">0x300</a></li>
<li>I notice that MSR<sub>LE</sub> is set to 1</li>
<li>WHAT WHY IS THE LE BIT IN THE MSR SET TO 1</li>
<li>We fail to read the first instruction at 0x300 because it's written in big-endian, so we jump to the exception handler at 0x300... oh no.</li>
</ul>
<p>And then we very busily execute nothing until the machine is killed. I spend some time staring incredulously at my screen, then appeal to a <a href="https://github.com/torvalds/linux/blob/master/MAINTAINERS">higher authority</a> who replies with "What is the HILE set to?" </p>
<p>..the WHAT?<br />
Cracking open the <a href="https://www.power.org/documentation/power-isa-v-2-07b/">PowerISA</a> reveals this tidbit:</p>
<blockquote>
<p>The Hypervisor Interrupt Little-Endian (HILE) bit is a bit
in an implementation-dependent register or similar
mechanism. The contents of the HILE bit are copied
into MSR<sub>LE</sub> by interrupts that set MSR<sub>HV</sub> to 1 (see Section
6.5), to establish the Endian mode for the interrupt
handler. The HILE bit is set, by an implementation-dependent
method, during system initialization,
and cannot be modified after system initialization.</p>
</blockquote>
<p>To be fair, there are use cases for taking exceptions in a different endianness. The problem is that while HILE gets switched on when setting MSR<sub>LE</sub> to 1, it <em>doesn't</em> get turned off when MSR<sub>LE</sub> is set to zero. In particular the line "...cannot be modified after system initialization." led to a fair amount of hand-wringing from myself and whoever would listen; if we can't reset the HILE bit, we simply can't use little-endian kernels for Petitboot.</p>
<p>Luckily while on some other systems the machinations of the firmware might be a complete black box, Petitboot runs on OPAL systems - which means the firmware source is <a href="https://github.com/open-power/skiboot">right here</a>. In particular we can see here the OPAL call to <a href="https://github.com/open-power/skiboot/blob/master/core/cpu.c#L702">opal_reinit_cpus</a> which among other things resets the HILE bit.<br />
This is actually what turns on the HILE bit in the first place, and is meant to be called early on in boot since it also clobbers a large amount of state. Luckily for us we don't need to hold onto any state since we're about to jump into a new kernel. We just need to choose an appropriate place where we can be sure we won't take an exception before we get into the next kernel: thus the <a href="https://github.com/torvalds/linux/commit/e72bb8a5a884d022231149d407653923a1d79e53">final patch to support PowerNV machines.</a></p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Samuel Mendoza-Jonas</dc:creator><pubDate>Tue, 03 Nov 2015 15:02:00 +1100</pubDate><guid>tag:sthbrx.github.io,2015-11-03:blog/2015/11/03/what-the-hile-is-this/</guid><category>petitboot</category><category>power</category><category>p8</category><category>openpower</category><category>goodposts</category><category>autoboot</category><category>realcontent</category><category>kexec</category><category>kernel</category></item><item><title>Docker: Just Stop Using AUFS</title><link>https://sthbrx.github.io/blog/2015/10/30/docker-just-stop-using-aufs/</link><description><p>Docker's default storage driver on most Ubuntu installs is AUFS.</p>
<p>Don't use it. Use Overlay instead. Here's why.</p>
<p>First, some background. I'm testing the performance of the basic LAMP
stack on POWER. (LAMP is Linux + Apache + MySQL/MariaDB + PHP, by the
way.) To do more reliable and repeatable tests, I do my builds and
tests in Docker containers. (See <a href="/blog/2015/10/12/a-tale-of-two-dockers/">my previous post</a> for more info.)</p>
<p>Each test downloads the source of Apache, MariaDB and PHP, and builds
them. This should be quick: the POWER8 system I'm building on has 160
hardware threads and 128 GB of memory. But I was finding that it was
only just keeping pace with a 2 core Intel VM on BlueMix.</p>
<p>Why? Well, my first port of call was to observe a compilation under
<code>top</code>. The header is below.</p>
<p><img alt="top header, showing over 70 percent of CPU time spent in the kernel" src="/images/dja/aufs/top-bad.png" /></p>
<p>Over 70% of CPU time is spent in the kernel?! That's weird. Let's dig
deeper.</p>
<p>My next port of call for analysis of CPU-bound workloads is
<code>perf</code>. <code>perf top</code> reports astounding quantities of time in
spin-locks:</p>
<p><img alt="display from perf top, showing 80 percent of time in a spinlock" src="/images/dja/aufs/perf-top-spinlock.png" /></p>
<p><code>perf top -g</code> gives us some more information: the time is in system
calls. <code>open()</code> and <code>stat()</code> are the key culprits, and we can see a
number of file system functions are in play in the call-chains of the
spinlocks.</p>
<p><img alt="display from perf top -g, showing syscalls and file ops" src="/images/dja/aufs/perf-top-syscalls.png" /></p>
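<p>A syscall-heavy profile like this is easy to reproduce with a rough microbenchmark. Here's a sketch of my own (not from the original analysis): time a tight loop of <code>stat()</code> calls against a file inside the container, once with the AUFS storage driver and once with Overlay, and compare.</p>

```python
import os
import time

def stat_benchmark(path=".", iterations=100_000):
    """Time `iterations` os.stat() calls against `path`.

    Run once against a path on an AUFS-backed mount and once on an
    Overlay-backed mount; under a parallel build, the AUFS numbers
    should be dramatically worse due to spinlock contention.
    """
    start = time.perf_counter()
    for _ in range(iterations):
        os.stat(path)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"elapsed: {stat_benchmark():.3f}s")
```

The single-threaded numbers understate the problem: the spinlock contention seen in <code>perf</code> only really bites when 160 hardware threads hammer the filesystem at once.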
<p>Why are open and stat slow? Well, I know that the files are on an AUFS
mount. (<code>docker info</code> will tell you what you're using if you're not
sure.) So, being something of a kernel hacker, I set out to find out
why. This did not go well. AUFS isn't upstream, it's a separate patch
set. Distros have been trying to deprecate it for years. Indeed, RHEL
doesn't ship it. (To its credit, Docker seems to be trying to move
away from it.)</p>
<p>Wanting to avoid the minor nightmare that is an out-of-tree patchset,
I looked at other storage drivers for Docker. <a href="https://jpetazzo.github.io/assets/2015-03-03-not-so-deep-dive-into-docker-storage-drivers.html">This presentation is particularly good.</a>
My choices are pretty simple: AUFS, btrfs, device-mapper or
Overlay. Overlay was an obvious choice: it doesn't need me to set up
device mapper on a cloud VM, or reformat things as btrfs.</p>
<p>It's also easy to set up on Ubuntu:</p>
<ul>
<li>
<p>export/save any docker containers you care about.</p>
</li>
<li>
<p>add <code>--storage-driver=overlay</code> option to <code>DOCKER_OPTS</code> in <code>/etc/default/docker</code>, and restart docker (<code>service docker restart</code>)</p>
</li>
<li>
<p>import/load the containers you exported</p>
</li>
<li>
<p>verify that things work, then clear away your old storage directory (<code>/var/lib/docker/aufs</code>). </p>
</li>
</ul>
<p>Having moved my base container across, I set off another build.</p>
<p>The first thing I noticed is that images are much slower to create with Overlay. But once that finishes, and a compile starts, things run much better:</p>
<p><img alt="top, showing close to zero system time, and around 90 percent user time" src="/images/dja/aufs/top-good.png" /></p>
<p>The compiles went from taking painfully long to astonishingly fast. Winning.</p>
<p>So in conclusion:</p>
<ul>
<li>
<p>If you use Docker for something that involves open()ing or stat()ing files</p>
</li>
<li>
<p>If you want your machine to do real work, rather than spin in spinlocks</p>
</li>
<li>
<p>If you want to use code that's upstream and thus much better supported</p>
</li>
<li>
<p>If you want something less disruptive than the btrfs or dm storage drivers</p>
</li>
</ul>
<p>...then drop AUFS and switch to Overlay today.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Axtens</dc:creator><pubDate>Fri, 30 Oct 2015 13:30:00 +1100</pubDate><guid>tag:sthbrx.github.io,2015-10-30:blog/2015/10/30/docker-just-stop-using-aufs/</guid><category>docker</category><category>aufs</category><category>overlay</category><category>performance</category><category>power</category></item><item><title>A tale of two Dockers</title><link>https://sthbrx.github.io/blog/2015/10/12/a-tale-of-two-dockers/</link><description><p>(This was published in an internal technical journal last week, and is now being published here. If you already know what Docker is, feel free to skim the first half.)</p>
<p>Docker seems to be the flavour of the month in IT. Most attention is focussed on using Docker for the deployment of production services. But that's not all Docker is good for. Let's explore Docker, and two ways I use it as a software developer.</p>
<h2>Docker: what is it?</h2>
<p>Docker is essentially a set of tools to deal with <em>containers</em> and <em>images</em>. </p>
<p>To make up an artificial example, say you are developing a web app. You first build an <em>image</em>: a file system which contains the app, and some associated metadata. The app has to run on something, so you also install things like Python or Ruby and all the necessary libraries, usually by installing a minimal Ubuntu and any necessary packages.<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup> You then run the image inside an isolated environment called a <em>container</em>.</p>
<p>You can have multiple containers running the same image, (for example, your web app running across a fleet of servers) and the containers don't affect each other. Why? Because Docker is designed around the concept of <em>immutability</em>. Containers can write to the image they are running, but the changes are specific to that container, and aren't preserved beyond the life of the container.<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup> Indeed, once built, images can't be changed at all, only rebuilt from scratch.</p>
<p>However, as well as enabling you to easily run multiple copies, another upshot of immutability is that if your web app allows you to upload photos, and you restart the container, your photos will be gone. Your web app needs to be designed to store all of the data outside of the container, sending it to a dedicated database or object store of some sort.</p>
<p>Making your application Docker friendly is significantly more work than just spinning up a virtual machine and installing stuff. So what does all this extra work get you? Three main things: isolation, control and, as mentioned, immutability. </p>
<p><em>Isolation</em> makes containers easy to migrate and deploy, and easy to update. Once an image is built, it can be copied to another system and launched. Isolation also makes it easy to update software your app depends on: you rebuild the image with software updates, and then just deploy it. You don't have to worry about service A relying on version X of a library while service B depends on version Y; it's all self contained. </p>
<p><em>Immutability</em> also helps with upgrades, especially when deploying them across multiple servers. Normally, you would upgrade your app on each server, and have to make sure that every server gets all the same sets of updates. With Docker, you don't upgrade a running container. Instead, you rebuild your Docker image and re-deploy it, and you then know that the same version of everything is running everywhere. This immutability also guards against the situation where you have a number of different servers that are all special snowflakes with their own little tweaks, and you end up with a fractal of complexity.</p>
<p>Finally, Docker offers a lot of <em>control</em> over containers, and for a low performance penalty. Docker containers can have their CPU, memory and network controlled easily, without the overhead of a full virtual machine. This makes it an attractive solution for running untrusted executables.<sup id="fnref:3"><a class="footnote-ref" href="#fn:3" rel="footnote">3</a></sup></p>
<p>As an aside: despite the hype, very little of this is actually particularly new. Isolation and control are not new problems. All Unixes, including Linux, support 'chroots'. The name comes from “change root”: the system call changes the process's idea of what the file system root is, making it impossible for it to access things outside of the new designated root directory. FreeBSD has jails, which are more powerful, Solaris has Zones, and AIX has WPARs. Chroots are fast and low overhead. However, they offer much less control over the use of system resources. At the other end of the scale, virtual machines (which have been around since ancient IBM mainframes) offer isolation much better than Docker, but with a greater performance hit.</p>
<p>Similarly, immutability isn't really new: Heroku and AWS Spot Instances are both built around the model that you get resources in a known, consistent state when you start, but in both cases your changes won't persist. In the development world, modern CI systems like Travis CI also have this immutable or disposable model – and this was originally built on VMs. Indeed, with a little bit of extra work, both chroots and VMs can give the same immutability properties that Docker gives.</p>
<p>The control properties that Docker provides are largely as a result of leveraging some Linux kernel concepts, most notably something called namespaces.</p>
<p>What Docker does well is not something novel, but the engineering feat of bringing together fine-grained control, isolation and immutability, and – importantly – a tool-chain that is easier to use than any of the alternatives. Docker's tool-chain eases a lot of pain points with regards to building containers: it's vastly simpler than chroots, and easier to customise than most VM setups. Docker also has a number of engineering tricks to reduce the disk space overhead of isolation.</p>
<p>So, to summarise: Docker provides a toolkit for isolated, immutable, finely controlled containers to run executables and services.</p>
<h2>Docker in development: why?</h2>
<p>I don't run network services at work; I do performance work. So how do I use Docker?</p>
<p>There are two things I do with Docker: I build PHP 5, and do performance regression testing on PHP 7. They're good case studies of how isolation and immutability provide real benefits in development and testing, and how the Docker tool chain makes life a lot nicer than previous solutions.</p>
<h3>PHP 5 builds</h3>
<p>I use the <em>isolation</em> that Docker provides to make building PHP 5 easier. PHP 5 depends on an old version of Bison, version 2. Ubuntu and Debian long since moved to version 3. There are a few ways I could have solved this:</p>
<ul>
<li>I could just install the old version directly on my system in <code>/usr/local/</code>, and hope everything still works and nothing else picks up Bison 2 when it needs Bison 3. Or I could install it somewhere else and remember to change my path correctly before I build PHP 5.</li>
<li>I could roll a chroot by hand. Even with tools like debootstrap and schroot, working in chroots is a painful process.</li>
<li>I could spin up a virtual machine on one of our development boxes and install the old version on that. That feels like overkill: why should I need to run an entire operating system? Why should I need to copy my source tree over the network to build it?</li>
</ul>
<p>Docker makes it easy to have a self-contained environment that has Bison 2 built from source, and to build my latest source tree in that environment. Why is Docker so much easier?</p>
<p>Firstly, Docker allows me to base my container on an existing container, and there's an online library of containers to build from.<sup id="fnref:4"><a class="footnote-ref" href="#fn:4" rel="footnote">4</a></sup> This means I don't have to roll a base image with <code>debootstrap</code> or the RHEL/CentOS/Fedora equivalent.</p>
<p>Secondly, unlike a chroot build process, which ultimately is just copying files around, a docker build process includes the ability to both copy files from the host and <em>run commands</em> in the context of the image. This is defined in a file called a <code>Dockerfile</code>, and is kicked off by a single command: <code>docker build</code>.</p>
<p>So, my PHP 5 build container loads an Ubuntu Vivid base container, uses apt-get to install the compiler, tool-chain and headers required to build PHP 5, then installs old bison from source, copies in the PHP source tree, and builds it. The vast majority of this process – the installation of the compiler, headers and bison, can be cached, so they don't have to be downloaded each time. And once the container finishes building, I have a fully built PHP interpreter ready for me to interact with.</p>
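<p>As an illustrative sketch only (my original <code>Dockerfile</code> isn't reproduced here, and the package names, the Bison tarball URL and the source paths below are assumptions), the build described above might look something like:</p>

```dockerfile
# Hypothetical sketch of the PHP 5 build container -- not the
# actual Dockerfile from the post.
FROM ubuntu:vivid

# Compiler, tool-chain and headers: these layers are cached,
# so they aren't re-downloaded on every build.
RUN apt-get update && \
    apt-get install -y build-essential autoconf libxml2-dev curl

# Build the old Bison 2 from source (assumed URL and version).
RUN curl -O http://ftp.gnu.org/gnu/bison/bison-2.7.1.tar.gz && \
    tar xzf bison-2.7.1.tar.gz && \
    cd bison-2.7.1 && ./configure && make && make install

# Copy in the PHP source tree from the host and build it.
COPY php-src /php-src
RUN cd /php-src && ./buildconf --force && ./configure && make
```

Everything above the <code>COPY</code> line stays cached between builds; only the source copy and the final compile re-run when the tree changes.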
<p>I do, at the moment, rebuild PHP 5 from scratch each time. This is a bit sub-optimal from a performance point of view. I could alleviate this with a Docker volume, which is a way of sharing data persistently between a host and a guest, but I haven't been sufficiently bothered by the speed yet. However, Docker volumes are also quite fiddly, leading to the development of tools like <code>docker compose</code> to deal with them. They also are prone to subtle and difficult to debug permission issues.</p>
<h3>PHP 7 performance regression testing</h3>
<p>The second thing I use docker for takes advantage of the throwaway nature of docker environments to prevent cross-contamination.</p>
<p>PHP 7 is the next big version of PHP, slated to be released quite soon. I care about how that runs on POWER, and I preferably want to know if it suddenly deteriorates (or improves!). I use Docker to build a container with a daily build of PHP 7, and then I run a benchmark in it. This doesn't give me a particularly meaningful absolute number, but it allows me to track progress over time. Building it inside of Docker means that I can be sure that nothing from old runs persists into new runs, thus giving me more reliable data. However, because I do want the timing data I collect to persist, I send it out of the container over the network.</p>
<p>I've now been collecting this data for almost 4 months, and it's plotted below, along with a 5-point moving average. The most notable feature of the graph is the drop in benchmark time at about the middle. Sure enough, if you look at the PHP repository, you will see that a set of changes to improve PHP performance were merged on July 29: changes submitted by our very own Anton Blanchard.<sup id="fnref:5"><a class="footnote-ref" href="#fn:5" rel="footnote">5</a></sup></p>
<p><img alt="Graph of PHP 7 performance over time" src="/images/dja/php7-perf.png" /></p>
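<p>The 5-point moving average is simple to compute; a minimal sketch (my own illustration with made-up sample data, not the actual plotting script):</p>

```python
def moving_average(samples, window=5):
    """Trailing N-point moving average; window=5 matches the graph."""
    return [sum(samples[i:i + window]) / window
            for i in range(len(samples) - window + 1)]

# A step change in benchmark times is smoothed over `window` points,
# which is why the drop in the graph appears as a short ramp:
times = [10.0, 10.1, 9.9, 10.0, 10.1, 8.0, 8.1, 7.9, 8.0, 8.1]
smooth = moving_average(times)
```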
<h2>Docker pain points</h2>
<p>Docker provides a vastly improved experience over previous solutions, but there are still a few pain points. For example:</p>
<ol>
<li>
<p>Docker was apparently written by people who had no concept that platforms other than x86 exist. This leads to major issues for cross-architectural setups. For instance, Docker identifies images by a name and a revision. For example, <code>ubuntu</code> is the name of an image, and <code>15.04</code> is a revision. There's no ability to specify an architecture. So, how do you specify that you want, say, a 64-bit, little-endian PowerPC build of an image versus an x86 build? There have been a couple of approaches, both of which are pretty bad. You could name the image differently: say <code>ubuntu_ppc64le</code>. You can also just cheat and override the <code>ubuntu</code> name with an architecture-specific version. Both of these break some assumptions in the Docker ecosystem and are a pain to work with.</p>
</li>
<li>
<p>Image building is incredibly inflexible. If you have one system that requires a proxy, and one that does not, you need different Dockerfiles. As far as I can tell, there are no simple ways to hook in any changes between systems into a generic Dockerfile. This is largely by design, but it's still really annoying when you have one system behind a firewall and one system out on the public cloud (as I do in the PHP 7 setup).</p>
</li>
<li>
<p>Visibility into a Docker server is poor. You end up with lots of different, anonymous images and dead containers, and you end up needing scripts to clean them up. It's not clear what Docker puts on your file system, or where, or how to interact with it.</p>
</li>
<li>
<p>Docker is still using reasonably new technologies. This leads to occasional weird, obscure and difficult to debug issues.<sup id="fnref:6"><a class="footnote-ref" href="#fn:6" rel="footnote">6</a></sup></p>
</li>
</ol>
<h2>Final words</h2>
<p>Docker provides me with a lot of useful tools in software development: both in terms of building and testing. Making use of it requires a certain amount of careful design thought, but when applied thoughtfully it can make life significantly easier.</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>There's some debate about how much stuff from the OS installation you should be using. You need to have key dynamic libraries available, but I would argue that you shouldn't be running long running processes other than your application. You shouldn't, for example, be running a SSH daemon in your container. (The one exception is that you must handle orphaned child processes appropriately: see <a href="https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/">https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/</a>) Considerations like debugging and monitoring the health of docker containers mean that this point of view is not universally shared.&#160;<a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">&#8617;</a></p>
</li>
<li id="fn:2">
<p>Why not simply make them read only? You may be surprised at how many things break when running on a read-only file system. Things like logs and temporary files are common issues.&#160;<a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">&#8617;</a></p>
</li>
<li id="fn:3">
<p>It is, however, easier to escape a Docker container than a VM. In Docker, an untrusted executable only needs a kernel exploit to get to root on the host, whereas in a VM you need a guest-to-host vulnerability, which are much rarer.&#160;<a class="footnote-backref" href="#fnref:3" rev="footnote" title="Jump back to footnote 3 in the text">&#8617;</a></p>
</li>
<li id="fn:4">
<p>Anyone can upload an image, so this does require running untrusted code from the Internet. Sadly, this is a distinctly retrograde step when compared to the process of installing binary packages in distros, which are all signed by a distro's private key.&#160;<a class="footnote-backref" href="#fnref:4" rev="footnote" title="Jump back to footnote 4 in the text">&#8617;</a></p>
</li>
<li id="fn:5">
<p>See <a href="https://github.com/php/php-src/pull/1326">https://github.com/php/php-src/pull/1326</a>&#160;<a class="footnote-backref" href="#fnref:5" rev="footnote" title="Jump back to footnote 5 in the text">&#8617;</a></p>
</li>
<li id="fn:6">
<p>I hit this last week: <a href="https://github.com/docker/docker/issues/16256">https://github.com/docker/docker/issues/16256</a>, although maybe that's my fault for running systemd on my laptop.&#160;<a class="footnote-backref" href="#fnref:6" rev="footnote" title="Jump back to footnote 6 in the text">&#8617;</a></p>
</li>
</ol>
</div></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Axtens</dc:creator><pubDate>Mon, 12 Oct 2015 14:14:00 +1100</pubDate><guid>tag:sthbrx.github.io,2015-10-12:blog/2015/10/12/a-tale-of-two-dockers/</guid><category>docker</category><category>php</category><category>peformance</category></item><item><title>Running ppc64le_hello on real hardware</title><link>https://sthbrx.github.io/blog/2015/06/03/ppc64le-hello-on-real-hardware/</link><description><p>So today I saw <a href="https://github.com/andreiw/ppc64le_hello">Freestanding “Hello World” for OpenPower</a> on <a href="https://news.ycombinator.com/item?id=9649490">Hacker News</a>. Sadly Andrei hadn't been able to test it on real hardware, so I set out to get it running on a real OpenPOWER box. Here's what I did.</p>
<p>Firstly, clone the repo, and, as mentioned in the README, comment out <code>mambo_write</code>. Build it.</p>
<p>Grab <a href="https://github.com/open-power/op-build">op-build</a>, and build a Habanero defconfig. To save yourself a fair bit of time, first edit <code>openpower/configs/habanero_defconfig</code> to answer <code>n</code> about a custom kernel source. That'll save you hours of waiting for git.</p>
<p>This will build you a PNOR that will boot a Linux kernel with Petitboot. This is almost what you want: you need Skiboot, Hostboot and a bunch of the POWER-specific bits and bobs, but you don't actually want the Linux boot kernel.</p>
<p>Then, based on <code>op-build/openpower/package/openpower-pnor/openpower-pnor.mk</code>, we look through the output of <code>op-build</code> for a <code>create_pnor_image.pl</code> command, something like this monstrosity:</p>
<p><code>PATH="/scratch/dja/public/op-build/output/host/bin:/scratch/dja/public/op-build/output/host/sbin:/scratch/dja/public/op-build/output/host/usr/bin:/scratch/dja/public/op-build/output/host/usr/sbin:/home/dja/bin:/home/dja/bin:/home/dja/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/opt/openpower/common/x86_64/bin" /scratch/dja/public/op-build/output/build/openpower-pnor-ed1682e10526ebd85825427fbf397361bb0e34aa/create_pnor_image.pl -xml_layout_file /scratch/dja/public/op-build/output/build/openpower-pnor-ed1682e10526ebd85825427fbf397361bb0e34aa/"defaultPnorLayoutWithGoldenSide.xml" -pnor_filename /scratch/dja/public/op-build/output/host/usr/powerpc64-buildroot-linux-gnu/sysroot/pnor/"habanero.pnor" -hb_image_dir /scratch/dja/public/op-build/output/host/usr/powerpc64-buildroot-linux-gnu/sysroot/hostboot_build_images/ -scratch_dir /scratch/dja/public/op-build/output/host/usr/powerpc64-buildroot-linux-gnu/sysroot/openpower_pnor_scratch/ -outdir /scratch/dja/public/op-build/output/host/usr/powerpc64-buildroot-linux-gnu/sysroot/pnor/ -payload /scratch/dja/public/op-build/output/images/"skiboot.lid" -bootkernel /scratch/dja/public/op-build/output/images/zImage.epapr -sbe_binary_filename "venice_sbe.img.ecc" -sbec_binary_filename "centaur_sbec_pad.img.ecc" -wink_binary_filename "p8.ref_image.hdr.bin.ecc" -occ_binary_filename /scratch/dja/public/op-build/output/host/usr/powerpc64-buildroot-linux-gnu/sysroot/occ/"occ.bin" -targeting_binary_filename "HABANERO_HB.targeting.bin.ecc" -openpower_version_filename /scratch/dja/public/op-build/output/host/usr/powerpc64-buildroot-linux-gnu/sysroot/openpower_version/openpower-pnor.version.txt</code></p>
<p>Replace the <code>-bootkernel</code> argument with the path to ppc64le_hello, e.g.: <code>-bootkernel /scratch/dja/public/ppc64le_hello/ppc64le_hello</code></p>
<p>Don't forget to move it into place! </p>
<div class="highlight"><pre>mv output/host/usr/powerpc64-buildroot-linux-gnu/sysroot/pnor/habanero.pnor output/images/habanero.pnor
</pre></div>
<p>Then we can use skiboot's boot test script (written by Cyril and me, coincidentally!) to flash it.</p>
<div class="highlight"><pre>ppc64le_hello/skiboot/external/boot-tests/boot_test.sh -vp -t hab2-bmc -P &lt;path to&gt;/habanero.pnor
</pre></div>
<p>It's not going to get into Petitboot, so just interrupt it after it powers up the box and connect with IPMI. It boots, kinda:</p>
<div class="highlight"><pre>[11012941323,5] INIT: Starting kernel at 0x20010000, fdt at 0x3044db68 (size 0x11cc3)
Hello OPAL!
_start = 0x20010000
_bss = 0x20017E28
_stack = 0x20018000
_end = 0x2001A000
KPCR = 0x20017E50
OPAL = 0x30000000
FDT = 0x3044DB68
CPU0 not found?
Pick your poison:
Choices: (MMU = disabled):
(d) 5s delay
(e) test exception
(n) test nested exception
(f) dump FDT
(M) enable MMU
(m) disable MMU
(t) test MMU
(u) test non-priviledged code
(I) enable ints
(i) disable ints
(H) enable HV dec
(h) disable HV dec
(q) poweroff
1.42486|ERRL|Dumping errors reported prior to registration
</pre></div>
<p>Yes, it does wrap horribly. However, the big issue here (which you'll have to scroll to see!) is the "CPU0 not found?". Fortunately, we can fix this with a little patch to <code>cpu_init</code> in main.c to test for a PowerPC POWER8:</p>
<div class="highlight"><pre> <span class="n">cpu0_node</span> <span class="o">=</span> <span class="n">fdt_path_offset</span><span class="p">(</span><span class="n">fdt</span><span class="p">,</span> <span class="s">&quot;/cpus/cpu@0&quot;</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cpu0_node</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cpu0_node</span> <span class="o">=</span> <span class="n">fdt_path_offset</span><span class="p">(</span><span class="n">fdt</span><span class="p">,</span> <span class="s">&quot;/cpus/PowerPC,POWER8@20&quot;</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cpu0_node</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">printk</span><span class="p">(</span><span class="s">&quot;CPU0 not found?</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
<p>This is definitely the <em>wrong</em> way to do this, but it works for now.</p>
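<p>A less wrong approach would be to walk the children of <code>/cpus</code> and take the first node whose <code>device_type</code> is "cpu", rather than hardcoding unit addresses. Here's a sketch of what that might look like; it's untested, and assumes libfdt's <code>fdt_first_subnode</code>, <code>fdt_next_subnode</code> and <code>fdt_getprop</code> are available:</p>

```c
/* Hypothetical sketch: find the first CPU node without hardcoding its
 * unit address. Walks the subnodes of /cpus and matches on the
 * device_type property. Untested. */
int cpus_node = fdt_path_offset(fdt, "/cpus");
int node;

cpu0_node = -FDT_ERR_NOTFOUND;
if (cpus_node >= 0) {
    for (node = fdt_first_subnode(fdt, cpus_node);
         node >= 0;
         node = fdt_next_subnode(fdt, node)) {
        const char *type = fdt_getprop(fdt, node, "device_type", NULL);
        if (type != NULL)
            if (strcmp(type, "cpu") == 0) {
                cpu0_node = node;   /* first CPU node, whatever its name */
                break;
            }
    }
}
```

<p>That way the lookup works regardless of what the firmware calls its CPU nodes.</p>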
<p>Now, correcting for weird wrapping, we get:</p>
<div class="highlight"><pre>Hello OPAL!
_start = 0x20010000
_bss = 0x20017E28
_stack = 0x20018000
_end = 0x2001A000
KPCR = 0x20017E50
OPAL = 0x30000000
FDT = 0x3044DB68
Assuming default SLB size
SLB size = 0x20
TB freq = 512000000
[13205442015,3] OPAL: Trying a CPU re-init with flags: 0x2
Unrecoverable exception stack top @ 0x20019EC8
HTAB (2048 ptegs, mask 0x7FF, size 0x40000) @ 0x20040000
SLB entries:
1: E 0x8000000 V 0x4000000000000400
EA 0x20040000 -&gt; hash 0x20040 -&gt; pteg 0x200 = RA 0x20040000
EA 0x20041000 -&gt; hash 0x20041 -&gt; pteg 0x208 = RA 0x20041000
EA 0x20042000 -&gt; hash 0x20042 -&gt; pteg 0x210 = RA 0x20042000
EA 0x20043000 -&gt; hash 0x20043 -&gt; pteg 0x218 = RA 0x20043000
EA 0x20044000 -&gt; hash 0x20044 -&gt; pteg 0x220 = RA 0x20044000
EA 0x20045000 -&gt; hash 0x20045 -&gt; pteg 0x228 = RA 0x20045000
EA 0x20046000 -&gt; hash 0x20046 -&gt; pteg 0x230 = RA 0x20046000
EA 0x20047000 -&gt; hash 0x20047 -&gt; pteg 0x238 = RA 0x20047000
EA 0x20048000 -&gt; hash 0x20048 -&gt; pteg 0x240 = RA 0x20048000
...
</pre></div>
<p>The weird wrapping seems to be caused by NULLs getting printed to OPAL, but I haven't traced where they come from.</p>
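<p>The "EA -> hash -> pteg" lines in the trace above follow simple arithmetic, and we can reproduce it. The sketch below is my reconstruction from the output, not the kernel's actual code; it assumes 4K pages and no VSID contribution to the hash, so the hash is just the page number, and the printed pteg value is the group index times eight (each PTEG holds 8 PTEs):</p>

```c
/* Reconstruction of the EA -> hash -> pteg arithmetic in the trace.
 * Assumes 4K pages and a zero VSID contribution to the hash. Since
 * htab_mask is of the form 2^n - 1, masking the hash is the same as
 * taking it modulo (mask + 1). */
unsigned long long ea_to_pteg(unsigned long long ea,
                              unsigned long long htab_mask)
{
    unsigned long long hash = ea >> 12;      /* page number */
    return (hash % (htab_mask + 1)) * 8;     /* 8 PTEs per PTEG */
}
```

<p>With the mask 0x7FF from the HTAB line, EA 0x20048000 gives hash 0x20048 and pteg 0x240, and EA 0xFFFFFFF000 gives pteg 0x3FF8, matching the traces.</p>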
<p>Anyway, now it largely works! Here's a transcript of some things it can do on real hardware.</p>
<div class="highlight"><pre>Choices: (MMU = disabled):
(d) 5s delay
(e) test exception
(n) test nested exception
(f) dump FDT
(M) enable MMU
(m) disable MMU
(t) test MMU
(u) test non-priviledged code
(I) enable ints
(i) disable ints
(H) enable HV dec
(h) disable HV dec
(q) poweroff
&lt;press e&gt;
Testing exception handling...
sc(feed) =&gt; 0xFEEDFACE
Choices: (MMU = disabled):
(d) 5s delay
(e) test exception
(n) test nested exception
(f) dump FDT
(M) enable MMU
(m) disable MMU
(t) test MMU
(u) test non-priviledged code
(I) enable ints
(i) disable ints
(H) enable HV dec
(h) disable HV dec
(q) poweroff
&lt;press t&gt;
EA 0xFFFFFFF000 -&gt; hash 0xFFFFFFF -&gt; pteg 0x3FF8 = RA 0x20010000
mapped 0xFFFFFFF000 to 0x20010000 correctly
EA 0xFFFFFFF000 -&gt; hash 0xFFFFFFF -&gt; pteg 0x3FF8 = unmap
EA 0xFFFFFFF000 -&gt; hash 0xFFFFFFF -&gt; pteg 0x3FF8 = RA 0x20011000
mapped 0xFFFFFFF000 to 0x20011000 incorrectly
EA 0xFFFFFFF000 -&gt; hash 0xFFFFFFF -&gt; pteg 0x3FF8 = unmap
Choices: (MMU = disabled):
(d) 5s delay
(e) test exception
(n) test nested exception
(f) dump FDT
(M) enable MMU
(m) disable MMU
(t) test MMU
(u) test non-priviledged code
(I) enable ints
(i) disable ints
(H) enable HV dec
(h) disable HV dec
(q) poweroff
&lt;press u&gt;
EA 0xFFFFFFF000 -&gt; hash 0xFFFFFFF -&gt; pteg 0x3FF8 = RA 0x20080000
returning to user code
returning to kernel code
EA 0xFFFFFFF000 -&gt; hash 0xFFFFFFF -&gt; pteg 0x3FF8 = unmap
</pre></div>
<p>I also tested the other functions and they all seem to work. Running non-privileged code with the MMU on works. Dumping the FDT and the 5s delay both worked, although they tend to stress IPMI a <em>lot</em>. The delay also corresponds well with real time.</p>
<p>It does tend to error out and reboot quite often, usually on the menu screen, for reasons that are not clear to me. It usually starts with something entirely uninformative from Hostboot, like this:</p>
<div class="highlight"><pre>1.41801|ERRL|Dumping errors reported prior to registration
2.89873|Ignoring boot flags, incorrect version 0x0
</pre></div>
<p>That may be easy to fix, but again I haven't had time to trace it.</p>
<p>All in all, it's very exciting to see something come out of the simulator and onto real hardware. Hopefully, with the proliferation of OpenPOWER hardware, prices will fall and these sorts of systems will become increasingly accessible to people with cool low-level projects like this!</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Axtens</dc:creator><pubDate>Wed, 03 Jun 2015 12:16:00 +1000</pubDate><guid>tag:sthbrx.github.io,2015-06-03:blog/2015/06/03/ppc64le-hello-on-real-hardware/</guid><category>openpower</category><category>power</category><category>p8</category></item><item><title>Petitboot Autoboot Changes</title><link>https://sthbrx.github.io/blog/2015/06/02/autoboot/</link><description><p>The way autoboot behaves in Petitboot has undergone some significant changes recently, so in order to ward off any angry emails let's take a quick tour of how the new system works.</p>
<h2>Old &amp; Busted</h2>
<p>For some context, here is the old (or current depending on what you're running) section of the configuration screen.</p>
<p><img alt="Old Autoboot" src="/images/sammj/oldstyle.jpg" /></p>
<p>This gives you three main options: don't autoboot, autoboot from anything, or autoboot only from a specific device. For the majority of installations this is fine, such as when you have only one default option, or know exactly which device you'll be booting from.</p>
<p>A side note about default options: not all boot options are valid <em>autoboot</em> options. A boot option is only considered for auto-booting if it is marked default, e.g. 'set default' in GRUB and 'default' in PXE options.</p>
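<p>For instance, in a pxelinux-style config the <code>default</code> keyword is what makes an entry an autoboot candidate (the labels and paths here are purely illustrative):</p>

```
default linux-install

label linux-install
    kernel vmlinuz
    append initrd=initrd.img root=/dev/sda2
```

<p>In GRUB the equivalent is <code>set default=0</code> (or a menu entry name) in grub.cfg; without such a marking, the entry will appear in the Petitboot menu but won't be auto-booted.</p>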
<h2>New Hotness</h2>
<p>Below is the new autoboot configuration.</p>
<p><img alt="New Autoboot" src="/images/sammj/newstyle.jpg" /></p>
<p>The new design allows you to specify an ordered list of autoboot options.
The last two of the three buttons are self-explanatory - clear the list and autoboot from any device, or clear the list completely (no autoboot).</p>
<p>Selecting the first button, 'Add Device' brings up the following screen:</p>
<p><img alt="Device Selection" src="/images/sammj/devices.jpg" /></p>
<p>From here you can select any device or <em>class</em> of device to add to the boot order. Once added to the boot order, the order of boot options can be changed with the left and right arrow keys, and removed from the list with the minus key ('-').</p>
<p>This allows you to create additional autoboot configurations such as "Try to boot from sda2, otherwise boot from the network", or "Give priority to PXE options from eth0, otherwise try any other netboot option".
You can retain the original behaviour by only putting one option into the list (either 'Any Device' or a specific device).</p>
<p>Presently you can add any option into the list and order them how you like - which means you can do silly things like this:</p>
<p><img alt="If you send me a bug report with this in it I may laugh at you" src="/images/sammj/redundant.jpg" /></p>
<h2>IPMI</h2>
<p>Slightly before the boot order changes, Petitboot also received an update to its IPMI handling. IPMI 'bootdev' commands allow you to override the current autoboot configuration remotely, either by specifying a device type to boot (e.g. PXE), or by forcing Petitboot to boot into the 'setup' or 'safe' modes. IPMI overrides are either persistent or non-persistent. A non-persistent override will disappear after a successful boot - that is, a successful boot of a boot option, not booting to Petitboot itself - whereas a persistent override will, well, persist!</p>
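<p>As a concrete illustration, overrides like these are typically set with ipmitool from another machine. The hostname and credentials below are placeholders, and exactly how each device maps to a Petitboot mode depends on your BMC:</p>

```
# Non-persistent override: applies to the next successful boot only
ipmitool -I lanplus -H my-bmc -U admin -P pass chassis bootdev pxe

# Persistent override: stays in effect until explicitly cleared
ipmitool -I lanplus -H my-bmc -U admin -P pass chassis bootdev pxe options=persistent

# Clear any override
ipmitool -I lanplus -H my-bmc -U admin -P pass chassis bootdev none
```

<p>Either way, the active override shows up in the configuration screen as described below.</p>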
<p>If there is an IPMI override currently active, it will appear in the configuration screen with an option to manually clear it:</p>
<p><img alt="IPMI Overrides" src="/images/sammj/ipmi.jpg" /></p>
<hr />
<p>That sums up the recent changes to autoboot; a bit more flexibility in assigning priority, and options for more detailed autoboot order if you need it. New versions of Petitboot are backwards compatible and will recognise older saved settings, so updating your firmware won't cause your machines to start booting things at random.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Samuel Mendoza-Jonas</dc:creator><pubDate>Tue, 02 Jun 2015 08:11:00 +1000</pubDate><guid>tag:sthbrx.github.io,2015-06-02:blog/2015/06/02/autoboot/</guid><category>petitboot</category><category>power</category><category>p8</category><category>openpower</category><category>goodposts</category><category>autoboot</category><category>realcontent</category></item><item><title>Joining the CAPI project</title><link>https://sthbrx.github.io/blog/2015/05/27/joining-the-capi-project/</link><description><p>(I wrote this blog post a couple of months ago, but it's still quite relevant.)</p>
<p>Hi, I'm Daniel! I work in OzLabs, part of IBM's Australian Development Labs. Recently, I've been assigned to the CAPI project, and I've been given the opportunity to give you an idea of what this is, and what I'll be up to in the future!</p>
<h2>What even is CAPI?</h2>
<p>To help you understand CAPI, think back to the time before computers. We had a variety of machines: machines to build things, to check things, to count things, but they were all specialised --- good at one and only one thing.</p>
<p>Specialised machines, while great at their intended task, are really expensive to develop. Not only that, it's often impossible to change how they operate, even in very small ways.</p>
<p>Computer processors, on the other hand, are generalists. They are cheap. They can do a lot of things. If you can break a task down into simple steps, it's easy to get them to do it. The trade-off is that computer processors are incredibly inefficient at everything.</p>
<p>Now imagine, if you will, that a specialised machine is a highly trained and experienced professional, a computer processor is a hungover university student.</p>
<p>Over the years, we've tried lots of things to make the student faster. Firstly, we gave the student lots of caffeine to make them go as fast as they can. That worked for a while, but you can only give someone so much caffeine before they become unreliable. Then we tried teaming the student up with another student, so they can do two things at once. That worked, so we added more and more students. Unfortunately, lots of tasks can only be done by one person at a time, and team-work is complicated to co-ordinate. We've also recently noticed that some tasks come up often, so we've given them some tools for those specific tasks. Sadly, the tools are only useful for those specific situations.</p>
<p>Sometimes, what you really need is a professional.</p>
<p>However, there are a few difficulties in getting a professional to work with uni students. They don't speak the same way; they don't think the same way, and they don't work the same way. You need to teach the uni students how to work with the professional, and vice versa.</p>
<p>Previously, developing this interface – this connection between a generalist processor and a specialist machine – has been particularly difficult. The interface between processors and these specialised machines – known as <em>accelerators</em> – has also tended to suffer from bottlenecks and inefficiencies.</p>
<p>This is the problem CAPI solves. CAPI provides a simpler and more optimised way to interface specialised hardware accelerators with IBM's most recent line of processors, POWER8. It's a common 'language' that the processor and the accelerator speak, which makes it much easier to build the hardware side and easier to program the software side. In our Canberra lab, we're working primarily on the operating system side of this. We are working with some external companies who are building CAPI devices and the optimised software products which use them.</p>
<p>From a technical point of view, CAPI provides <em>coherent</em> access to system memory and processor caches, eliminating a major bottleneck in using external devices as accelerators. This is illustrated really well by the following graphic from <a href="https://www.youtube.com/watch?v=4ZyXc12J6FA">an IBM promotional video</a>. In the non-CAPI case, you can see there's a lot of data (the little boxes) stalled in the PCIe subsystem, whereas with CAPI, the accelerator has direct access to the memory subsystem, which makes everything go faster.</p>
<p><img alt="Slide showing CAPI's memory access" src="/images/dja/capi-memory.png" /></p>
<h2>Uses of CAPI</h2>
<p>CAPI technology is already powering a few really cool products.</p>
<p>Firstly, we have an implementation of Redis that sits on top of flash storage connected over CAPI. Or, to take out the buzzwords, CAPI lets us do really, really fast NoSQL databases. There's <a href="https://www.youtube.com/watch?v=cCmFc_0xsvA">a video online</a> giving more details.</p>
<p>Secondly, our partner <a href="http://www.mellanox.com/page/products_dyn?product_family=201&amp;mtag=connectx_4_vpi_card">Mellanox</a> is using CAPI to make network cards that run at speeds of up to 100Gb/s.</p>
<p>CAPI is also part of IBM's OpenPOWER initiative, where we're trying to grow a community of companies around our POWER system designs. So in many ways, CAPI is both a really cool technology and a brand new ecosystem that we're growing here in the Canberra labs. It's very cool to be a part of!</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Axtens</dc:creator><pubDate>Wed, 27 May 2015 15:08:00 +1000</pubDate><guid>tag:sthbrx.github.io,2015-05-27:blog/2015/05/27/joining-the-capi-project/</guid><category>capi</category><category>open-power</category></item><item><title>OpenPOWER Powers Forward</title><link>https://sthbrx.github.io/blog/2015/05/21/openpower-powers-forward/</link><description><p>I wrote this blog post late last year; it is still very relevant to this blog, so I'll repost it here.</p>
<p>With the launch of <a href="http://www.tyan.com/campaign/openpower/">TYAN's OpenPOWER reference system</a>, now is a good time to reflect on the team responsible for so much of the research, design and development behind this ground-breaking first step for <a href="http://openpowerfoundation.org/">OpenPOWER</a>, and their start-to-finish involvement with this new Power platform.</p>
<p>ADL Canberra have been integral to the success of this launch, providing the Open Power Abstraction Layer (OPAL) firmware. OPAL breathes new life into Linux on Power, finally allowing Linux to run directly on the hardware.
While OPAL harnesses the hardware, ADL Canberra significantly improved Linux to sit on top and take direct control of IBM's new POWER8 processor without needing to negotiate with a hypervisor. With all the Linux expertise present at ADL Canberra, it's no wonder that a Linux-based bootloader was developed to make this system work. Petitboot leverages all the resources of the Linux kernel to create a light, fast and yet extremely versatile bootloader. Petitboot provides a wealth of tools for debugging and system configuration without the need to load an operating system.</p>
<p>TYAN have developed great and highly customisable hardware. ADL Canberra have been there since day 1, performing vital platform enablement (bringup) of this new hardware. ADL Canberra have put work into the entire software stack: low-level work to get OPAL and Linux talking to the new BMC chip, as well as higher-level enablement allowing Linux to run in either endianness; Linux is even now capable of virtualising KVM guests of either endianness, irrespective of the host's. Furthermore, a subset of ADL Canberra have been key to getting the Coherent Accelerator Processor Interface (CAPI) off the ground, enabling almost endless customisation and greater diversity within the OpenPOWER ecosystem.</p>
<p>ADL Canberra is the home for Linux on Power and the beginning of the OpenPOWER hardware sees much of the hard work by ADL Canberra come to fruition.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cyril Bur</dc:creator><pubDate>Thu, 21 May 2015 11:29:00 +1000</pubDate><guid>tag:sthbrx.github.io,2015-05-21:blog/2015/05/21/openpower-powers-forward/</guid><category>open-power</category></item></channel></rss>