@@ -144,7 +144,7 @@ coalescing the tables into a single section in the object file. Such a strategy
144
144
might lead to better branch prediction, and therefore improved runtime
145
145
performance on modern hardware. In addition to other strategies, |TNTC | creates
146
146
far-reaching and non-obvious effects in the compiler. For example, GHC does not
147
- typically [ # ]_ generate ``call `` or ``ret `` instructions because
147
+ typically generate ``call `` or ``ret `` instructions [ # ]_.
148
148
149
149
Assessing the impact of |TNTC |
150
150
------------------------------
@@ -233,7 +233,7 @@ indirection in the runtime's evaluation of heap objects. Let's zoom into two
233
233
benchmark programs that show the largest signal: ``primes `` which shows ``TNTC ``
234
234
performing 55% faster than ``NO-TNTC ``, and ``awards `` which shows ``NO-TNTC ``
235
235
performing 7% faster than ``TNTC ``. We'll focus on ``awards `` because we want to
236
- understand why exactly |TNTC | is degrades for this exact program.
236
+ understand why exactly |TNTC | degrades for this exact program.
237
237
238
238
Awards
239
239
------
@@ -335,13 +335,21 @@ Counters give a low level view of how our program is interacting with the
335
335
operating system and our machine. Here is a description of each counter perf
336
336
reported in order:
337
337
338
- - ``task-clock:u ``: the ``:u `` is a `modifier
339
- <https://perf.wiki.kernel.org/index.php/Tutorial#Counting_with_perf_stat> `__
340
- meaning the measured events are ``user level `` events, as opposed to ``:k ``
341
- meaning kernel level; see the ``perf-list `` man page for more. ``task-clock ``
342
- is a pre-defined software event that counts the time spent on the instrumented
343
- process. Not shown here is ``cpu-clock `` which measures the passage of time
344
- using the Linux CPU clock.
338
+ .. note :: You may see output such as ``task-clock:u`` instead of ``task-clock``
339
+ (note the extra ``:u ``). These suffixes are `modifiers
340
+ <https://perf.wiki.kernel.org/index.php/Tutorial#Counting_with_perf_stat> `__
341
+ which indicate the level at which the event was measured. For example,
342
+ ``cycles:k `` is the number of cycles that perf detected in kernel mode, while
343
+ ``cycles:u `` is the number of cycles in user mode. By default, if given the
344
+ proper permissions, perf will measure both user and kernel level events. You
345
+ can directly specify the levels by suffixing an event name with a modifier or
346
+ combination of modifiers. For instance, ``perf stat -e task-clock:uk `` will
347
+ measure the task-clock at both user and kernel level; see the ``perf-list ``
348
+ man page for more.
349
+
350
+ - ``task-clock ``: a pre-defined software event that counts
351
+ the time spent on the instrumented process. Not shown here is ``cpu-clock ``
352
+ which measures the passage of time using the Linux CPU clock.
345
353
346
354
- ``context-switches ``: A context-switch occurs when the operating system
347
355
switches the CPU from executing one process or thread to another. Here we see
@@ -665,7 +673,7 @@ So we'll dump all the intermediate representations and count the references to
665
673
# ## dump the IRs
666
674
$ ghc -fforce-recomp -O2 -ddump-asm -ddump-cmm -ddump-stg-final -ddump-simpl -ddump-to-file -g Main.hs
667
675
[1 of 3] Compiling QSort ( QSort.hs, QSort.o )
668
- ...
676
+ ...
669
677
[3 of 3] Linking Main [Objects changed]
670
678
671
679
And now count the references:
@@ -695,7 +703,7 @@ the occurrence name. Here is the first match:
695
703
696
704
we see that the occurrence name ``go2_r3RZ_entry `` calls
697
705
``GHC.Num.Integer.integerAdd_info `` with the contents of ``R3 `` and ``R2 `` in
698
- block label ``c4al ``.
706
+ block label ``c4al ``.
699
707
Here is the second reference:
700
708
701
709
.. code-block :: haskell
@@ -819,7 +827,7 @@ Now we can rephrase our working hypothesis: the ``awards`` program exhibits an
819
827
L1 instruction cache miss rate of 16% with |TNTC |, with the call to ``sum `` in
820
828
``sumscores `` being responsible for 10% of the 16% miss rate. We now have a
821
829
means of inspecting the program we want to optimize and a means for detecting if
822
- our optimizations have an impact.
830
+ our optimizations have an impact.
823
831
824
832
Conclusion
825
833
----------
@@ -971,7 +979,28 @@ Footnotes
971
979
972
980
.. [# ] I say "typically" because GHC will use ``call `` in some circumstances
973
981
such as creating :term: `CAF `'s, see :ghcSource: `Note [CAF management]
974
- <rts/sm/Storage.c?ref_type=heads#L425> `
982
+ <rts/sm/Storage.c?ref_type=heads#L425> `. The reasons GHC does not use
983
+ ``call `` or ``ret `` are deeply ingrained. First, GHC maintains a separate
984
+ stack from the system-supported C stack. The tradeoff is that this allows
985
+ GHC to completely maintain its own stack, create many tiny stacks for
986
+ each thread, check for stack overflow and reallocate the stack if
987
+ necessary. If GHC generated ``call `` and ``ret `` instructions, and used
988
+ the C stack then it would lose these capabilities, but in return gain
989
+ better branch prediction by virtue of the ``call `` and ``ret ``
990
+ instructions which have special support in most hardware. In this case,
991
+ GHC would have to push a return address or a continuation to the C stack
992
+ in order to use the ``call `` and ``ret `` instructions. The second reason
993
+ is that a ``case `` is operationally preceded by an info table that
994
+ describes its stack frame layout. If the scrutinee of the ``case `` is
995
+ evaluated via a ``call `` which then returns with ``ret ``, then upon a
996
+ return, control flow will be at the instruction *after * the case.
997
+ Therefore, the info table would no longer be *immediately * before the
998
+ return address and thus create havoc in the garbage collector which
999
+ relies on that information. Thus, if ``call `` and ``ret `` are to be used,
1000
+ then the runtime cannot rely on |TNTC | since the tables will no longer be
1001
+ next to the relevant code. See :cite:t: `pointerTaggingLaziness ` Section
1002
+ 7.1, from which I have borrowed heavily here, for more.
1003
+
975
1004
976
1005
.. [# ] GHC can be built in many different ways which we call ``flavors ``. The
977
1006
``default `` flavor is one such provided by GHC's build tool `Hadrian