|
| 1 | +.. _Philosophies of Optimization Chapter: |
| 2 | + |
| 3 | +Philosophies of Optimization |
| 4 | +============================ |
| 5 | + |
| 6 | +What does it mean to optimize a code base? What is optimization? If we do not |
| 7 | +understand what it means to optimize, then how can we possibly optimize a code |
| 8 | +base? In this chapter, we present philosophies of optimization to answer these |
| 9 | +questions. These ideas hail from the performance-oriented culture of video game |
| 10 | +developers [#]_; we have only summarized them here. |
| 11 | + |
| 12 | +There are three philosophies of optimization: |
| 13 | + |
| 14 | +.. _Optimization: |
| 15 | + |
| 16 | +Optimization |
| 17 | +------------ |
| 18 | + |
| 19 | +This is actual optimization: the idea is to change the implementation to target |
| 20 | +a *specific* machine or kind of machine. The following tasks are optimization |
| 21 | +tasks: |
| 22 | + |
| 23 | +a. Checking the specifications of the machine to determine how fast the machine |
| 24 | + can do the mathematical operations the workload and program implementation |
| 25 | + require. |
| 26 | + |
| 27 | +b. Inspecting the operations the program is required to do to ensure they are |
| 28 | + optimal for the machine the implementation will run on. For example, altering |
| 29 | + the implementation to avoid conditional branch instructions such as ``jne`` |
| 30 | + so that the compiler generates ``cmov`` instead. |
| 31 | + |
| 32 | +c. Ensuring that the algorithm the code implements is optimal for the |
| 33 | + workload. |
| 34 | + |
| 35 | +d. Benchmarking the implementation and comparing the results to the theoretical |
| 36 | + maximum the machine is capable of, and then inspecting the implementation's |
| 37 | + runtime to determine where exactly the program slows down. |
| 38 | + |
| 39 | +An example of optimization in Haskell would be tuning the compiler and runtime |
| 40 | +system settings to the machine the program will run on. For example, building |
| 41 | +the program with no-tables-next-to-code because we have measured that |
| 42 | +tables-next-to-code :ref:`increases L1 cache-misses <Checking the L1 Cache>` |
| 43 | +for the machine we intend to run on in production. |
| 44 | + |
| 45 | +Actual optimization is hard and time-consuming. There is a time and place for |
| 46 | +it, but it should not be the bulk of your optimization work in normal |
| 47 | +circumstances because its benefits are overly specific to one kind of machine. |
| 48 | +So in the general case, where you are writing software that runs on machines you |
| 49 | +don't know anything about, you should instead optimize via non-pessimization. |
| 50 | + |
| 51 | +.. _Non-Pessimization: |
| 52 | + |
| 53 | +Non-Pessimization |
| 54 | +----------------- |
| 55 | + |
| 56 | +Non-pessimization is a philosophy of crafting software where one tries to write |
| 57 | +software that does the least amount of work possible. Or in other words, this |
| 58 | +philosophy asks us to write code that minimizes extra work the CPU must do. The |
| 59 | +idea behind non-pessimization is that modern superscalar, pipelined CPUs are |
| 60 | +extremely fast, and by *not* burdening the CPU with extra work, the |
| 61 | +implementation will necessarily be performant. |
| 62 | + |
| 63 | +A typical example of pessimized code that Haskellers should be familiar with is |
| 64 | +an excessive use of laziness for a workload that simply does not require |
| 65 | +laziness. One example is computing the sum of an input list with a lazy |
| 66 | +accumulator. This is an example of pessimized code because the code is |
| 67 | +requesting that the CPU do extra, unnecessary work: allocating thunks, and then |
| 68 | +chasing pointers to thunks scattered all about the heap. Of course |
| 69 | +each thunk will and must be eventually scrutinized, but conceptually the |
| 70 | +workload does not benefit from and does not require laziness. Thus the |
| 71 | +construction and eventual scrutinization of these thunks is simply wasted time |
| 72 | +and effort placed onto the CPU. |
| 73 | + |
| 74 | +Key to this approach is keeping in mind what the machine must do in order to |
| 75 | +complete the workload that your program defines. Once you have grokked this |
| 76 | +thinking, writing code that does the least amount of work will follow. In the |
| 77 | +previous example of lazy accumulation, the author of that code was not thinking |
| 78 | +in terms of the machine. Had they been thinking in terms of the operations the |
| 79 | +machine must perform, then they would have observed that the thunks were |
| 80 | +superfluous to the requisite workload. |
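
The lazy-accumulator example above can be sketched as follows. This is a minimal illustration, not a benchmark: the function names are hypothetical, and with optimizations enabled GHC's strictness analysis may well make ``sumLazy`` behave like the strict versions, so the difference is most visible without ``-O``.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.List (foldl')

-- Pessimized (hypothetical name): the accumulator is never forced inside
-- the loop, so without optimization each step may allocate a thunk for
-- `acc + x` that is only evaluated at the very end.
sumLazy :: [Int] -> Int
sumLazy = go 0
  where
    go acc []       = acc
    go acc (x : xs) = go (acc + x) xs

-- Non-pessimized: the bang pattern forces the accumulator at every step,
-- so no chain of thunks is built up.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs

-- The same idea using the strict left fold from base.
sumFold :: [Int] -> Int
sumFold = foldl' (+) 0

main :: IO ()
main = do
  let xs = [1 .. 100000]
  -- All three compute the same value; they differ only in the work done.
  print (sumLazy xs, sumStrict xs, sumFold xs)
```

All three functions return the same result; the point is that the strict versions ask the machine to do only the additions the workload actually requires.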
| 81 | + |
| 82 | +Some more examples of pessimized code are: |
| 83 | + |
| 84 | +a. Too much polymorphism and higher-order functions. In general, anything that |
| 85 | + could add an :term:`Unknown Function` to hot loops that we care about is, and |
| 86 | + will remain, unnecessary work for the CPU. |
| 87 | + |
| 88 | +b. Using lots of libraries with code that you do not understand and have not |
| 89 | + benchmarked. Libraries will prioritize whatever the library author felt was |
| 90 | + important. Note that if one of those things is performance, and you find (by |
| 91 | + empirically measuring) that the library is suitably performant for your |
| 92 | + workload, then by all means use it. The point is that you should be |
| 93 | + deliberate and selective with your dependencies and should empirically assess |
| 94 | + them. |
| 95 | + |
| 96 | +c. Excessive use of constructors and fancy types [#]_. For non-pessimized code |
| 97 | + we want to do *as little* as possible. This certainly means avoiding the |
| 98 | + creation of a lot of objects that live all over the heap. |
| 99 | + |
| 100 | +d. Defining types with poor memory efficiency. Consider this example from |
| 101 | + GHC's STG implementation: |
| 102 | + |
| 103 | + .. code-block:: haskell |
| 104 | +
|
| 105 | + data LambdaFormInfo |
| 106 | + = |
| 107 | + ... |
| 108 | + | LFThunk -- Thunk (zero arity) |
| 109 | + !TopLevelFlag |
| 110 | + !Bool -- True <=> no free vars |
| 111 | + !Bool -- True <=> updatable (i.e., *not* single-entry) |
| 112 | + !StandardFormInfo |
| 113 | + !Bool -- True <=> *might* be a function type |
| 114 | + ... |
| 115 | +
|
| 116 | +The constructor ``LFThunk`` has five fields, three of which are ``Bool``. This |
| 117 | +means, in the abstract, that we only need three bits to store the information |
| 118 | +that these ``Bool`` fields represent. Yet in this constructor each ``Bool`` will |
| 119 | +be padded by GHC to a machine word. Therefore, *each* ``Bool`` is represented |
| 120 | +with 64 bits on a typical x86_64 machine (32 bits for x86 and for other backends |
| 121 | +such as the JavaScript backend). Thus, the fields of one ``LFThunk`` heap object |
| 122 | +will require 320 bits (192 bits for the ``Bool`` fields, 128 for the other two |
| 123 | +fields), of which 189 of the ``Bool`` bits are wasted space. Similarly, |
| 124 | +``TopLevelFlag`` is isomorphic to a ``Bool``: |
| 125 | + |
| 126 | +.. code-block:: haskell |
| 127 | +
|
| 128 | + data TopLevelFlag |
| 129 | + = TopLevel |
| 130 | + | NotTopLevel |
| 131 | + deriving Data |
| 132 | +
|
| 133 | +So a more efficient representation *only requires* 4 bits and then a pointer to |
| 134 | +``StandardFormInfo``, for a total of 68 bits on a 64-bit machine. However, this |
| 135 | +must still be aligned and padded, yielding a total of 72 bits, which is roughly |
| 136 | +a 77% improvement in memory efficiency. |
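
One way to picture the four-bit encoding described above is to pack the flags into a single byte. The sketch below is only an illustration of the idea: ``ThunkFlags``, the bit positions, and the accessor names are all hypothetical and are not GHC's actual representation of ``LambdaFormInfo``.

```haskell
module Main (main) where

import Data.Bits (setBit, testBit)
import Data.Word (Word8)

-- Hypothetical packed form of the four Bool-like fields of LFThunk.
-- Only the low four bits of the byte are used.
newtype ThunkFlags = ThunkFlags Word8
  deriving (Eq, Show)

-- Assumed bit layout, chosen purely for illustration.
topLevelBit, noFreeVarsBit, updatableBit, mightBeFunBit :: Int
topLevelBit   = 0
noFreeVarsBit = 1
updatableBit  = 2
mightBeFunBit = 3

-- Pack the four flags into one byte by setting a bit for each True flag.
packFlags :: Bool -> Bool -> Bool -> Bool -> ThunkFlags
packFlags top noFvs upd fun = ThunkFlags (foldl set 0 flags)
  where
    flags =
      [ (topLevelBit, top)
      , (noFreeVarsBit, noFvs)
      , (updatableBit, upd)
      , (mightBeFunBit, fun)
      ]
    set w (i, b) = if b then setBit w i else w

-- Accessors read a single bit back out.
isTopLevel, hasNoFreeVars, isUpdatable, mightBeFunction :: ThunkFlags -> Bool
isTopLevel      (ThunkFlags w) = testBit w topLevelBit
hasNoFreeVars   (ThunkFlags w) = testBit w noFreeVarsBit
isUpdatable     (ThunkFlags w) = testBit w updatableBit
mightBeFunction (ThunkFlags w) = testBit w mightBeFunBit

main :: IO ()
main = do
  let fs = packFlags True False True False
  print (isTopLevel fs, hasNoFreeVars fs, isUpdatable fs, mightBeFunction fs)
```

The trade-off is that pattern matching on named constructors is replaced by bit tests, which is less readable; whether the saved memory is worth it should be measured for the workload at hand.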
| 137 | + |
| 138 | +Non-pessimization should be the bulk of your optimization efforts. Not only is |
| 139 | +it portable to other machines, but it is also simpler and more future-proof than |
| 140 | +actual optimization. |
| 141 | + |
| 142 | +.. _Fake Optimization: |
| 143 | + |
| 144 | +Fake Optimization |
| 145 | +----------------- |
| 146 | + |
| 147 | +Fake optimization is a philosophy of performance that will not lead to better |
| 148 | +code or better performance. Rather, fake optimization is advice that one finds |
| 149 | +around the internet. These are sayings such as "You should never use <Foo>!", or |
| 150 | +"Google doesn't use <Bar> therefore you shouldn't either!", or "you should |
| 151 | +always use arrays and never use linked-lists". Notice that each of these |
| 152 | +statements is categorical; they claim that something is *always* fast or slow, or |
| 153 | +that one should *never* or *always* use something or other. |
| 154 | + |
| 155 | +These statements are called fake optimizations because they are advice or |
| 156 | +aphorisms that are divorced from the context of your code, the problem your code |
| 157 | +wants to solve and the work it must perform to do so. An algorithm or data |
| 158 | +structure is not *universally* bad or good, or fast or slow. It could be the |
| 159 | +case that for a particular workload, and for a particular memory access pattern, |
| 160 | +a linked-list is the right choice. The key point is that whether an algorithm or |
| 161 | +data structure is fast or not depends on numerous factors, such as what |
| 162 | +your program has to do, what the properties of the data your program is |
| 163 | +processing are, and what the memory access patterns are. Another example of a |
| 164 | +fake optimization statement is "quick-sort is always faster than |
| 165 | +insertion sort". This is a fake optimization because while quick-sort has better |
| 166 | +average-case time complexity than insertion sort, for small lists (usually fewer |
| 167 | +than 30 elements) insertion sort will be more performant [#]_. |
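
To make the point about input size concrete, here is a minimal sketch of a sort that chooses its strategy based on the length of the input. The names and the cutoff of 30 are assumptions taken from the rough figure above; the right cutoff for any real program has to be found by measuring, and a list-based quick-sort is used here only to keep the example short.

```haskell
module Main (main) where

import Data.List (insert)

-- Insertion sort: quadratic in general, but with very little overhead
-- on short inputs.
insertionSort :: Ord a => [a] -> [a]
insertionSort = foldr insert []

-- Assumed cutoff; measure on your own workload before trusting it.
cutoff :: Int
cutoff = 30

-- Hypothetical hybrid: partition like quick-sort on long inputs and fall
-- back to insertion sort once a sublist has at most `cutoff` elements.
hybridSort :: Ord a => [a] -> [a]
hybridSort xs
  | isShort xs = insertionSort xs
  | otherwise =
      case xs of
        []         -> []
        (p : rest) ->
          hybridSort [x | x <- rest, x < p]
            ++ [p]
            ++ hybridSort [x | x <- rest, x >= p]
  where
    -- True when the list has at most `cutoff` elements; avoids computing
    -- the full length of a long list.
    isShort = null . drop cutoff

main :: IO ()
main = do
  print (hybridSort [5, 3, 8, 1 :: Int])
  print (hybridSort (reverse [1 .. 100 :: Int]) == [1 .. 100])
```

Neither branch is "the fast one" in general; which one wins depends on the size and shape of the data, which is exactly why categorical advice fails.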
| 168 | + |
| 169 | +The key idea is that the performance of your code is very sensitive to the |
| 170 | +specific problem and data the code operates on. So beware of fake optimization |
| 171 | +statements, for they will waste your time and iteration cycles. |
| 172 | + |
| 173 | + |
| 174 | +References and Footnotes |
| 175 | +======================== |
| 176 | + |
| 177 | +.. [#] See `this <https://youtu.be/pgoetgxecw8?si=0csotFBkya5gGDvJ>`__ series by |
| 178 | + Casey Muratori. We thank him for his labor. |
| 179 | +
|
| 180 | +.. [#] I hear you say, "But this is Haskell! Why wouldn't I use algebraic data |
| 181 | + types to model my domain and increase the correctness and maintainability |
| 182 | + of my code?" And you are correct to feel this way, but in this domain, we |
| 183 | + are looking for performance at the expense of these other properties, and |
| 184 | + in this pursuit you should be prepared to kill your darlings. This does |
| 185 | + not mean you must start rewriting your entire code base. Far from it: in |
| 186 | + practice you should only need to non-pessimize certain high-performance |
| 187 | + subsystems in your code base. So it is key that one practices writing |
| 188 | + non-pessimized Haskell such that when the need arises you understand how |
| 189 | + to speed up some subsystem by employing non-pessimizing techniques. |
| 190 | +
|
| 191 | +.. [#] See this `keynote <https://youtu.be/FJJTYQYB1JQ?si=L2pDU5AqFNjFC1hK>`__ |
| 192 | + by Andrei Alexandrescu. Another example is `timsort |
| 193 | + <https://en.wikipedia.org/wiki/Timsort>`__ in Python. Python `adopted |
| 194 | + <https://mail.python.org/pipermail/python-dev/2002-July/026837.html>`__ |
| 195 | + timsort because most real-world data is nearly sorted; thus the |
| 196 | + worst case *in practice* is vanishingly rare. |