From 6567ca416307485e23cd82650b4ca1adc09e10ae Mon Sep 17 00:00:00 2001
From: Laurence Tratt
Date: Sat, 24 May 2025 20:50:21 +0100
Subject: [PATCH] Record "proper" loops.

This commit, in essence, makes `control_point` the very first thing in
the interpreter loop. This allows yk to record more "natural" traces
that, mostly, the optimiser can do a better job on. For example, it
means that traces no longer start with an "it should be optimised away,
surely" `load_inst` call -- because the `promote` is now also after the
`control_point`, we naturally optimise this away.

I experimented with this way back when, but at that point it always led
to worse results. It turns out that's because on BigLoop -- which is
all I had as a benchmark then! -- the newly produced JIT IR, though
smaller, happens to cause the register allocator to do a silly spill.
I can, and will, fix that -- but I don't think it's worth holding up
this commit any longer, because it's clear that, overall, this is a
win. From a `haste` run I get this:

```
Permute/YkLua/1000           8.48% faster
Queens/YkLua/1000            7.71% faster
Towers/YkLua/600             4.90% faster
List/YkLua/1500              4.80% faster
LuLPeg/YkLua/default         3.99% faster
DeltaBlue/YkLua/12000        3.78% faster
NBody/YkLua/250000           3.26% faster
Heightmap/YkLua/2000         2.84% faster
Richards/YkLua/100           2.53% faster
Bounce/YkLua/1500            2.02% faster
Sieve/YkLua/3000             1.20% faster
Storage/YkLua/1000           0.20% slower
Mandelbrot/YkLua/500         0.25% slower
Havlak/YkLua/1500            0.57% slower
Json/YkLua/100               0.92% slower
HashIds/YkLua/6000           1.48% slower
CD/YkLua/250                 3.42% slower
BigLoop/YkLua/1000000000    14.88% slower
```

HashIds, CD, and (very obviously!) BigLoop get worse; many benchmarks
aren't really affected much one way or the other (e.g. Storage and
Mandelbrot are completely within the noise; Havlak probably is too;
Json I'm somewhat unsure about). A number of benchmarks (Permute,
Queens, Towers, List, DeltaBlue, NBody, Heightmap, Richards, and
Bounce) get better, in some cases quite significantly (LuLPeg is too
noisy for me to read much into it).

Overall, I think the wins outweigh the losses, particularly as I know
what ails BigLoop (in essence: we spill a register *before* a guard
instead of *inside* a guard). Fixing that may well partly or wholly
fix the other benchmarks that slow down. But I think that will be
interesting to do as a separate commit in yk.
---
 src/lvm.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/lvm.c b/src/lvm.c
index e5dd3b0..4768b84 100644
--- a/src/lvm.c
+++ b/src/lvm.c
@@ -1234,13 +1234,15 @@ void luaV_execute (lua_State *L, CallInfo *ci) {
   /* main loop of interpreter */
   for (;;) {
     Instruction i;  /* instruction being executed */
-#ifdef YKLUA_DEBUG_STRS
-    yk_debug_str(cl->p->instdebugstrs[cast_int(pc - cl->p->code)]);
-#endif
-    vmfetch();
 #ifdef USE_YK
-    YkLocation *ykloc = &cl->p->yklocs[pcRel(pc, cl->p)];
+    YkLocation *ykloc = &cl->p->yklocs[cast_int(pc - cl->p->code)];
     yk_mt_control_point(G(L)->yk_mt, ykloc);
+    vmfetch();
+# ifdef YKLUA_DEBUG_STRS
+    yk_debug_str(cl->p->instdebugstrs[pcRel(pc, cl->p)]);
+# endif
+#else
+    vmfetch();
 #endif
 #if 0  /* low-level line tracing for debugging Lua */
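
For context, here is a minimal, self-contained sketch of the ordering
this patch establishes -- a toy bytecode loop, not the real lvm.c.
`control_point()` here is a hypothetical stand-in for
`yk_mt_control_point()`; in real yk it may jump into a compiled trace
rather than fall through to the fetch. The point is that the hook now
fires at the loop head, so the subsequent instruction fetch (the
analogue of `vmfetch()`) sits inside the recorded trace, where the
optimiser can see it:

```c
#include <stdio.h>

typedef enum { OP_INC, OP_LOOP, OP_HALT } Op;

/* Hypothetical stand-in for yk_mt_control_point(mt, loc). Because it
   runs before the fetch, a trace recorded from here starts at the loop
   head and includes the `code[pc]` load, which the trace optimiser can
   then constant-fold away. */
static void control_point(size_t pc) { (void)pc; }

int main(void) {
  const Op code[] = { OP_INC, OP_LOOP, OP_HALT };
  size_t pc = 0;
  long acc = 0;
  for (;;) {
    control_point(pc);   /* now the very first thing in the loop */
    Op i = code[pc++];   /* fetch happens after the control point */
    switch (i) {
      case OP_INC:  acc++; break;
      case OP_LOOP: if (acc < 5) pc = 0; break;
      case OP_HALT: printf("%ld\n", acc); return 0;
    }
  }
}
```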