Commit ff61961

perf: wordsmithing and improved references
1 parent 6d0cc3a commit ff61961

1 file changed

+42
-13
lines changed

src/Measurement_Observation/Binary_Profiling/linux_perf.rst

+42-13
@@ -144,7 +144,7 @@ coalescing the tables into a single section in the object file. Such a strategy
144144
might lead to better branch prediction, and therefore improved runtime
145145
performance on modern hardware. In addition to other strategies, |TNTC| creates
146146
far reaching and non-obvious effects in the compiler. For example, GHC does not
147-
typically [#]_ generate ``call`` or ``ret`` instructions because
147+
typically generate ``call`` or ``ret`` instructions [#]_.
148148

149149
Assessing the impact of |TNTC|
150150
------------------------------
@@ -233,7 +233,7 @@ indirection in the runtime's evaluation of heap objects. Let's zoom into two
233233
benchmark programs that show the largest signal: ``primes`` which shows ``TNTC``
234234
performing 55% faster than ``NO-TNTC``, and ``awards`` which shows ``NO-TNTC``
235235
performing 7% faster than ``TNTC``. We'll focus on ``awards`` because we want to
236-
understand why exactly |TNTC| is degrades for this exact program.
236+
understand why exactly |TNTC| degrades for this exact program.
237237

238238
Awards
239239
------
@@ -335,13 +335,21 @@ Counters give a low level view of how our program is interacting with the
335335
operating system and our machine. Here is a description of each counter perf
336336
reported in order:
337337

338-
- ``task-clock:u``: the ``:u`` is a `modifier
339-
<https://perf.wiki.kernel.org/index.php/Tutorial#Counting_with_perf_stat>`__
340-
meaning the measured events are ``user level`` events, as opposed to ``:k``
341-
meaning kernel level; see the ``perf-list`` man page for more. ``task-clock``
342-
is a pre-defined software event that counts the time spent on the instrumented
343-
process. Not shown here is ``cpu-clock`` which measures the passage of time
344-
using the Linux CPU clock.
338+
.. note:: You may see output such as ``task-clock:u`` instead of ``task-clock``
339+
(note the extra ``:u``). These suffixes are `modifiers
340+
<https://perf.wiki.kernel.org/index.php/Tutorial#Counting_with_perf_stat>`__
341+
which indicate the level at which the event was measured. For example,
342+
``cycles:k`` is the number of cycles that perf detected in kernel mode, while
343+
``cycles:u`` is the number of cycles in user mode. By default, if given the
344+
proper permissions, perf will measure both user and kernel level events. You
345+
can directly specify the levels by suffixing an event name with a modifier or
346+
combination of modifiers. For instance, ``perf stat -e task-clock:uk`` will
347+
measure the task-clock at both user and kernel level; see the ``perf-list``
348+
man page for more.
349+
350+
- ``task-clock``: a pre-defined software event that counts
351+
the time spent on the instrumented process. Not shown here is ``cpu-clock``
352+
which measures the passage of time using the Linux CPU clock.
345353

346354
- ``context-switches``: A context-switch occurs when the operating system
347355
switches the CPU from executing one process or thread to another. Here we see
@@ -665,7 +673,7 @@ So we'll dump all the intermediate representations and count the references to
665673
### dump the IRs
666674
$ ghc -fforce-recomp -O2 -ddump-asm -ddump-cmm -ddump-stg-final -ddump-simpl -ddump-to-file -g Main.hs
667675
[1 of 3] Compiling QSort ( QSort.hs, QSort.o )
668-
...
676+
...
669677
[3 of 3] Linking Main [Objects changed]
670678
671679
And now count the references:
@@ -695,7 +703,7 @@ the occurrence name. Here is the first match:
695703
696704
we see that the occurrence name ``go2_r3RZ_entry`` calls
697705
``GHC.Num.Integer.integerAdd_info`` with the contents of ``R3`` and ``R2`` in
698-
block label ``c4al``.
706+
block label ``c4al``.
699707
Here is the second reference:
700708

701709
.. code-block:: haskell
@@ -819,7 +827,7 @@ Now we can rephrase our working hypothesis: the ``awards`` program exhibits an
819827
L1 instruction cache miss rate of 16% with |TNTC|, with the call to ``sum`` in
820828
``sumscores`` being responsible for 10% of the 16% miss rate. We now have a
821829
means of inspecting the program we want to optimize and a means for detecting if
822-
our optimizations have an impact.
830+
our optimizations have an impact.
823831

824832
Conclusion
825833
----------
@@ -971,7 +979,28 @@ Footnotes
971979
972980
.. [#] I say "typically" because GHC will use ``call`` in some circumstances
973981
such as creating :term:`CAF`'s, see :ghcSource:`Note [CAF management]
974-
<rts/sm/Storage.c?ref_type=heads#L425>`
982+
<rts/sm/Storage.c?ref_type=heads#L425>`. The reasons GHC does not use
983+
``call`` or ``ret`` are deeply ingrained. First, GHC maintains a separate
984+
stack from the system-supported C stack. The tradeoff is that this allows
985+
GHC to completely maintain its own stack, create many tiny stacks for
986+
each thread, check for stack overflow and reallocate the stack if
987+
necessary. If GHC generated ``call`` and ``ret`` instructions, and used
988+
the C stack then it would lose these capabilities, but in return gain
989+
better branch prediction by virtue of the ``call`` and ``ret``
990+
instructions which have special support in most hardware. In this case,
991+
GHC would have to push a return address or a continuation to the C stack
992+
in order to use the ``call`` and ``ret`` instructions. The second reason
993+
is that a ``case`` is operationally preceded by an info table that
994+
describes its stack frame layout. If the scrutinee of the ``case`` is
995+
evaluated via a ``call`` which then returns with ``ret``, then upon a
996+
return, control flow will be at the instruction *after* the case.
997+
Therefore, the info table would no longer be *immediately* before the
998+
return address and thus create havoc in the garbage collector which
999+
relies on that information. Thus, if ``call`` and ``ret`` are to be used,
1000+
then the runtime cannot rely on |TNTC| since the tables will no longer be
1001+
next to the relevant code. See :cite:t:`pointerTaggingLaziness` Section
1002+
7.1, from which I have borrowed heavily here, for more.
1003+
9751004
9761005
.. [#] GHC can be built in many different ways which we call ``flavors``. The
9771006
``default`` flavor is one such provided by GHC's build tool `Hadrian
