Commit 9dc5075

prelim: add philosophies of optimization chapter
1 parent daff591 commit 9dc5075

5 files changed

+209
-1
lines changed

index.rst

+4
@@ -26,6 +26,10 @@ by `Input Output Global <https://iohk.io/>`_.
2626
Release History
2727
---------------
2828

29+
* August, 2024
30+
31+
* :ref:`Philosophies of Optimization <Philosophies of Optimization Chapter>` first draft finished.
32+
2933
* April, 2024
3034

3135
* :ref:`Linux Perf <Perf Chapter>` first draft finished.

src/Measurement_Observation/Binary_Profiling/linux_perf.rst

+2-1
@@ -432,8 +432,9 @@ took 304 billion cycles, while ``NO-TNTC`` took 299 billion cycles. This is
432432
suspicious, and is suggestive of some kind of cache-miss because ``TNTC`` is
433433
taking *more* cycles to execute *less* instructions.
434434

435+
.. _Checking the L1 Cache:
435436

436-
Checking the L1 cache
437+
Checking the L1 Cache
437438
^^^^^^^^^^^^^^^^^^^^^
438439

439440
Let's zoom in on the CPU caches. To do so we'll ask perf to only record events
@@ -0,0 +1,6 @@
1+
.. _Data-Oriented Design Chapter:
2+
3+
:lightgrey:`Data-Oriented Design`
4+
=================================
5+
6+
`TODO <https://github.com/haskellfoundation/hs-opt-handbook.github.io/issues/112>`_

src/Preliminaries/index.rst

+1
@@ -7,5 +7,6 @@ Preliminaries
77

88
how_to_use
99
what_makes_fast_hs
10+
philosophies_of_optimization
1011
repeatable_measurements
1112
how_to_debug
@@ -0,0 +1,196 @@
1+
.. _Philosophies of Optimization Chapter:
2+
3+
Philosophies of Optimization
4+
============================
5+
6+
What does it mean to optimize a code base? What is optimization? If we do not
7+
understand what it means to optimize, then how can we possibly optimize a code
8+
base? In this chapter, we present philosophies of optimization to answer these
9+
questions. These ideas hail from the performance-oriented culture of video game
10+
developers [#]_; we have only summarized them here.
11+
12+
There are three philosophies of optimization:
13+
14+
.. _Optimization:
15+
16+
Optimization
17+
------------
18+
19+
This is actual optimization: the idea is to change the implementation to target
20+
a *specific* machine or kind of machine. The following tasks are optimization
21+
tasks:
22+
23+
a. Checking the specifications of the machine to determine how fast the machine
24+
can do the mathematical operations the workload and program implementation
25+
require.
26+
27+
b. Inspecting the operations the program is required to do to ensure they are
28+
optimal for the machine the implementation will run on. For example, altering
29+
the implementation to avoid specific assembly instructions such as ``jmp``
30+
and instead generating ``cmov``.
31+
32+
c. Ensuring that the algorithm the code implements is optimal for the
33+
workload.
34+
35+
d. Benchmarking the implementation and comparing the results to the theoretical
36+
maximum the machine is capable of, and then inspecting the implementation's
37+
runtime to determine where exactly the program slows down.
38+
39+
An example of optimization in Haskell would be tuning the runtime system flags
40+
to the machine the program will run on. For example, building the program with
41+
no-tables-next-to-code because we have measured that tables-next-to-code
42+
:ref:`increases L1 cache-misses <Checking the L1 Cache>` for the machine we
43+
intend to run on in production.
44+
45+
Actual optimization is hard and time consuming. There is a time and place for
46+
it, but it should not be the bulk of your optimization work in normal
47+
circumstances because its benefits are overly specific to one kind of machine.
48+
So in the general case, where you are writing software that runs on machines you
49+
don't know anything about, you should instead optimize via non-pessimization.
50+
51+
.. _Non-Pessimization:
52+
53+
Non-Pessimization
54+
-----------------
55+
56+
Non-pessimization is a philosophy of crafting software where one tries to write
57+
software that does the least amount of work possible. Or in other words, this
58+
philosophy asks us to write code that minimizes extra work the CPU must do. The
59+
idea behind non-pessimization is that modern superscalar, pipelined CPUs are
60+
extremely fast, and by *not* burdening the CPU with extra work, the
61+
implementation will necessarily be performant.
62+
63+
A typical example of pessimized code that Haskellers should be familiar with is
64+
an excessive use of laziness for a workload that simply does not require the
65+
laziness. For example, computing the sum of an input list with a lazy
66+
accumulator. This is an example of pessimized code because the code is
67+
requesting that the CPU do extra, unnecessary work: allocating thunks and
68+
then chasing those thunks, which are scattered all about the heap. Of course
69+
each thunk will and must be eventually scrutinized, but conceptually the
70+
workload does not benefit from and does not require laziness. Thus the
71+
construction and eventual scrutinization of these thunks is simply wasted time
72+
and effort placed onto the CPU.
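
The lazy-accumulator example can be sketched as follows. This is a minimal
illustration that assumes only ``base``; the names ``sumLazy`` and
``sumStrict`` are ours and do not come from any library. Note that with
optimizations enabled GHC's strictness analysis may remove the thunks from the
lazy version on its own, so treat this as a picture of the idea and measure
your own code rather than assuming either outcome.

.. code-block:: haskell

   import Data.List (foldl')

   -- Lazy accumulator: foldl does not force the running total, so
   -- (without help from the optimizer) it builds a chain of unevaluated
   -- (+) thunks that is only collapsed when the result is demanded.
   sumLazy :: [Int] -> Int
   sumLazy = foldl (+) 0

   -- Strict accumulator: foldl' forces the running total at every step,
   -- so no chain of thunks is built for this workload.
   sumStrict :: [Int] -> Int
   sumStrict = foldl' (+) 0

Both functions return the same result; they differ only in how much work they
ask the machine to do along the way.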
73+
74+
Key to this approach is keeping in mind what the machine must do in order to
75+
complete the workload that your program defines. Once you have grokked this
76+
thinking, writing code that does the least amount of work will follow. In the
77+
previous example of lazy accumulation, the author of that code was not thinking
78+
in terms of the machine. Had they been thinking in terms of the operations the
79+
machine must perform, then they would have observed that the thunks were
80+
superfluous to the requisite workload.
81+
82+
Some more examples of pessimized code are:
83+
84+
a. Too much polymorphism and too many higher-order functions. In general, anything that
85+
could add an :term:`Unknown Function` to hot loops that we care about is
86+
unnecessary work for the CPU.
87+
88+
b. Using lots of libraries with code that you do not understand and have not
89+
benchmarked. Libraries will prioritize whatever the library author felt was
90+
important. Note that if one of those things is performance, and you find (by
91+
empirically measuring) that the library is suitably performant for your
92+
workload, then by all means use it. The point is that you should be
93+
deliberate and selective with your dependencies and should empirically assess
94+
them.
95+
96+
c. Excessive use of constructors and fancy types [#]_. For non-pessimized code
97+
we want to do *as little* as possible. This certainly means avoiding the
98+
creation of a lot of objects that live all over the heap.
99+
100+
d. Defining types with poor memory efficiency. Consider this example from
101+
GHC's STG implementation:
102+
103+
.. code-block:: haskell
104+
105+
data LambdaFormInfo
106+
=
107+
...
108+
| LFThunk -- Thunk (zero arity)
109+
!TopLevelFlag
110+
!Bool -- True <=> no free vars
111+
!Bool -- True <=> updatable (i.e., *not* single-entry)
112+
!StandardFormInfo
113+
!Bool -- True <=> *might* be a function type
114+
...
115+
116+
The constructor ``LFThunk`` has five fields, three of which are ``Bool``. This
117+
means, in the abstract, that we only need three bits to store the information
118+
that these ``Bool``'s represent. Yet in this constructor each ``Bool`` will be
119+
padded by GHC to a machine word. Therefore, *each* ``Bool`` is represented with
120+
64-bits on a typical x86_64 machine (32-bits for x86 and for other backends such
121+
as the JavaScript backend). Thus, one ``LFThunk`` heap object will require 320
122+
bits (192 bits for the ``Bool``'s, 128 for the other two fields), of which 189
123+
bits carry no information because they are wasted space. Similarly,
124+
``TopLevelFlag`` is isomorphic to a ``Bool``:
125+
126+
.. code-block:: haskell
127+
128+
data TopLevelFlag
129+
= TopLevel
130+
| NotTopLevel
131+
deriving Data
132+
133+
So a more efficient representation *only requires* 4 bits and then a pointer to
134+
``StandardFormInfo`` for a total of 68 bits. However, this must still be aligned
135+
and padded, yielding a total of 72 bits, which is roughly a 77% improvement in memory
136+
efficiency.
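
One way to picture the more compact representation is to pack the four
flag-like fields into a single byte. The sketch below is hypothetical:
``ThunkFlags``, ``mkThunkFlags``, the accessor names, and the bit layout are
our own inventions for illustration and are not part of GHC. How much heap
space this saves in practice depends on how GHC lays out the enclosing
constructor, so it should be measured rather than assumed.

.. code-block:: haskell

   import Data.Bits (setBit, testBit, zeroBits)
   import Data.Word (Word8)

   -- Hypothetical packed replacement for the TopLevelFlag field and the
   -- three Bool fields of LFThunk: one bit each, stored in one byte.
   newtype ThunkFlags = ThunkFlags Word8
     deriving (Eq, Show)

   -- Bit positions. The layout is arbitrary and chosen for this sketch.
   topLevelBit, noFreeVarsBit, updatableBit, mightBeFunBit :: Int
   topLevelBit   = 0
   noFreeVarsBit = 1
   updatableBit  = 2
   mightBeFunBit = 3

   -- Build the packed flags from the four logical booleans.
   mkThunkFlags :: Bool -> Bool -> Bool -> Bool -> ThunkFlags
   mkThunkFlags top noFvs upd fun = ThunkFlags (foldl set zeroBits flags)
     where
       flags =
         [ (topLevelBit, top)
         , (noFreeVarsBit, noFvs)
         , (updatableBit, upd)
         , (mightBeFunBit, fun)
         ]
       -- Set bit i only when the corresponding flag is True.
       set w (i, b) = if b then setBit w i else w

   -- Accessors read a single bit back out.
   isTopLevel, hasNoFreeVars, isUpdatable, mightBeFunction :: ThunkFlags -> Bool
   isTopLevel      (ThunkFlags w) = testBit w topLevelBit
   hasNoFreeVars   (ThunkFlags w) = testBit w noFreeVarsBit
   isUpdatable     (ThunkFlags w) = testBit w updatableBit
   mightBeFunction (ThunkFlags w) = testBit w mightBeFunBit

The trade-off is that the fields are no longer ordinary constructor arguments,
so pattern matching gives way to accessor functions. That cost in readability
is only worth paying in subsystems where the memory footprint has been shown to
matter.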
137+
138+
Non-pessimization should be the bulk of your optimization efforts. Not only is
139+
it portable to other machines, but it is also simpler and more future-proof than
140+
actual optimization.
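
To make item (a) from the list above concrete, here is a small sketch of
polymorphic code in a hot loop. The function names are hypothetical. The
assumption being illustrated is that, when ``sumSquares`` is neither inlined
nor specialized, its arithmetic is reached through a ``Num`` dictionary passed
at runtime, which is extra work compared with the monomorphic version. Whether
that actually happens in your program depends on optimization flags and module
boundaries, so inspect the generated Core or measure.

.. code-block:: haskell

   import Data.List (foldl')

   -- Polymorphic version: (+) and (*) are looked up in whatever Num
   -- dictionary the caller supplies, unless GHC specializes or inlines.
   sumSquares :: Num a => [a] -> a
   sumSquares = foldl' (\acc x -> acc + x * x) 0

   -- Ask GHC to also compile a copy of sumSquares fixed at Int, so that
   -- calls at this type need not go through the dictionary.
   {-# SPECIALIZE sumSquares :: [Int] -> Int #-}

   -- Monomorphic version: the element type is known, so GHC can use the
   -- machine's integer operations directly.
   sumSquaresInt :: [Int] -> Int
   sumSquaresInt = foldl' (\acc x -> acc + x * x) 0

The ``SPECIALIZE`` pragma keeps the polymorphic interface while requesting a
dedicated ``Int`` copy; writing the monomorphic function by hand is the blunter
alternative when only one type is ever used.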
141+
142+
.. _Fake Optimization:
143+
144+
Fake Optimization
145+
-----------------
146+
147+
Fake optimization is a philosophy of performance that will not lead to better
148+
code or better performance. Rather, fake optimization is advice that one finds
149+
around the internet. These are sayings such as "You should never use <Foo>!", or
150+
"Google doesn't use <Bar> therefore you shouldn't either!", or "you should
151+
always use arrays and never use linked-lists". Notice that each of these
152+
statements is categorical; they claim that something is *always* fast or slow, or that one
153+
should *never* or *always* use something or other.
154+
155+
These statements are called fake optimizations because they are advice or
156+
aphorisms that are divorced from the context of your code, the problem your code
157+
wants to solve and the work it must perform to do so. An algorithm or data
158+
structure is not *universally* bad or good, or fast or slow. It could be the
159+
case that for a particular workload, and for a particular memory access pattern,
160+
a linked-list is the right choice. The key point is that whether an algorithm or
161+
data structure is fast or not depends on numerous factors, such as what
162+
your program has to do, what the properties of the data your program is
163+
processing are, and what the memory access patterns are. Another example of a
164+
fake optimization statement is "quick-sort is always faster than
165+
insertion sort". This is a fake optimization because, while quick-sort has better
166+
average-case time complexity than insertion sort, for small lists (usually fewer than 30
167+
elements) insertion sort will be more performant [#]_.
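
The sorting example suggests a hybrid: use insertion sort below some cutoff and
a divide-and-conquer sort above it. The sketch below only shows the shape of
that idea on Haskell lists. The names ``smallCutoff``, ``insertionSort``, and
``hybridSort`` are hypothetical, and the cutoff of 30 is simply the figure
quoted above, not a measured value. On linked lists the real crossover point
may be very different, which is exactly why it has to be measured for your
data.

.. code-block:: haskell

   import Data.List (insert, partition)

   -- Assumed cutoff, taken from the prose above; tune it by benchmarking.
   smallCutoff :: Int
   smallCutoff = 30

   -- Insertion sort: insert each element into an already sorted result.
   insertionSort :: Ord a => [a] -> [a]
   insertionSort = foldr insert []

   -- Hybrid sort: short inputs go to insertion sort, longer inputs are
   -- partitioned quick-sort style around the first element.
   hybridSort :: Ord a => [a] -> [a]
   hybridSort xs
     -- True when xs has at most smallCutoff elements, without walking
     -- the whole list.
     | null (drop smallCutoff xs) = insertionSort xs
     | otherwise =
         case xs of
           [] -> []
           (p : rest) ->
             let (smaller, larger) = partition (< p) rest
             in hybridSort smaller ++ p : hybridSort larger

The point is not that this particular function is fast, but that the choice
between the two algorithms is made from a property of the input rather than
from a categorical rule.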
168+
169+
The key idea is that the performance of your code is very sensitive to the
170+
specific problem and data the code operates on. So beware of fake optimization
171+
statements for they will waste your time and iteration cycles.
172+
173+
174+
References and Footnotes
175+
========================
176+
177+
.. [#] See `this <https://youtu.be/pgoetgxecw8?si=0csotFBkya5gGDvJ>`__ series by
178+
Casey Muratori. We thank him for his labor.
179+
180+
.. [#] I hear you say, "But this is Haskell! Why wouldn't I use algebraic data
181+
types to model my domain and increase the correctness and maintainability
182+
of my code?" And you are correct to feel this way, but in this domain, we
183+
are looking for performance at the expense of these other properties and
184+
in this pursuit you should be prepared to kill your darlings. This does
185+
not mean you must start rewriting your entire code base. Far from it, in
186+
practice you should only need to non-pessimize certain high-performance
187+
subsystems in your code base. So it is key that one practices writing
188+
non-pessimized Haskell such that when the need arises you understand how
189+
to speed up some subsystem by employing non-pessimizing techniques.
190+
191+
.. [#] See this `keynote <https://youtu.be/FJJTYQYB1JQ?si=L2pDU5AqFNjFC1hK>`__
192+
by Andrei Alexandrescu. Another example is `timsort
193+
<https://en.wikipedia.org/wiki/Timsort>`__ in Python. Python `adopted
194+
<https://mail.python.org/pipermail/python-dev/2002-July/026837.html>`__
195+
timsort because most real-world data is nearly sorted; thus the
196+
worst case *in practice* is vanishingly rare.
