Any questions? #1

srose69 · 2026-05-13T09:18:50Z

srose69
May 13, 2026
Maintainer

Ask questions, I'll try to answer (:

srose69 · 2026-05-15T20:39:12Z

srose69
May 15, 2026
Maintainer Author

A note about eab8a60 and the Faced Continuous Snake Layout, which may be interesting.

I switched the token packing to Faced Continuous Snake Layout (FCSL).

The short version is: this is an attempt to make the volume easier for the model to read causally, locally, and consistently, without forcing it to spend capacity undoing an awkward packing scheme first.

Why not plain row-major

Row-major is simple to implement, but geometrically it is not a very good causal path for a local 3D operator.

In row-major packing:

sequence continuity is cheap only inside a single row
at row boundaries, the next token jumps to the next line
at face / depth boundaries, it jumps again in a different way

So the local neighborhood is not homogeneous. The meaning of “the next token” depends on where you are:

sometimes it is +x
sometimes it is “wrap to next row”
sometimes it is “wrap to next face”

For a causal convolution, that matters. The kernel is shared. If the sequence path keeps changing its local orientation, the same filter has to learn several local transition patterns instead of one.
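To make the three "next token" cases concrete, here is a minimal sketch that lists the displacement between consecutive tokens under plain row-major packing. The 2 x 3 x 3 grid size and the variable names are illustrative assumptions, not taken from the repository.

```python
from collections import Counter
from itertools import product

G, H, W = 2, 3, 3  # faces, rows, columns (illustrative size)

# Row-major order: w varies fastest, then h, then g.
order = list(product(range(G), range(H), range(W)))

# Displacement (dg, dh, dw) from each token to the next one.
steps = Counter(
    tuple(b - a for a, b in zip(p, q))
    for p, q in zip(order, order[1:])
)

for step, count in steps.items():
    print(step, count)
# (0, 0, 1) 12    -> plain +x inside a row
# (0, 1, -2) 4    -> wrap to next row
# (1, -2, -2) 1   -> wrap to next face
```

Only the first displacement is a unit step. The other two are the jumps a shared kernel would have to learn separately.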

FCSL reduces that problem. It keeps the path continuous both:

inside a face
across neighboring faces

So the local causal path is more regular.

Why not Morton / Hilbert

Morton and Hilbert are good locality-preserving curves in the abstract, but they are not a great fit here for two reasons.

1. They optimize generic spatial locality, not causal readability

Those curves are designed to keep nearby coordinates nearby. That is useful for caches and some geometric problems.

But here the question is not just “are nearby tokens spatially close?”
It is also:

is the local causal direction simple?
is the continuation rule consistent?
can a shared convolutional kernel infer the token flow cheaply?

Morton / Hilbert improve locality, but they make the traversal rule more encoded. The path becomes algorithmically clever, not structurally obvious.

2. They increase decoding burden on the model

The model is not just consuming geometry. It has to recover sequence structure from that geometry.

A complicated packing is a kind of encoding. And the more encoded it is, the more work the model has to do to decode it before it can use it well.

I would summarize the intuition like this:

the simpler the encoding, the simpler the decoding

or more precisely:

a packing that is both simple and information-dense should require less effort from the model to “understand”

That is the motivation here. Not a grand claim, just an inductive-bias argument.

Why FCSL specifically

FCSL keeps the packing rule simple:

each H x W face is traversed by a 2D snake
odd faces traverse the same path in reverse
the end of face g stays adjacent to the beginning of face g+1

So sequence continuity is maintained with a very small rule set.

Example on a 2 x 3 x 3 grid:

face g=0              face g=1
 0   1   2            17  16  15
 5   4   3            12  13  14
 6   7   8            11  10   9

This gives a path that is:

locally continuous
easy to describe
easy to invert
still dense in the volume
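The three rules can be sketched in a few lines. This is a minimal illustration written from the description in this post, not the code from the repository; the function names `snake_face`, `fcsl_order` and `is_continuous` are hypothetical.

```python
def snake_face(H, W):
    """(h, w) coordinates of one H x W face in snake order."""
    path = []
    for h in range(H):
        cols = range(W) if h % 2 == 0 else range(W - 1, -1, -1)
        path.extend((h, w) for w in cols)
    return path


def fcsl_order(G, H, W):
    """(g, h, w) coordinate of every token, in token order."""
    face = snake_face(H, W)
    order = []
    for g in range(G):
        # Odd faces walk the same 2D path backwards, so they start
        # at the (h, w) position where the previous face ended.
        walk = face if g % 2 == 0 else face[::-1]
        order.extend((g, h, w) for h, w in walk)
    return order


def is_continuous(order):
    """True if every consecutive pair of tokens are adjacent voxels."""
    return all(
        sum(abs(a - b) for a, b in zip(p, q)) == 1
        for p, q in zip(order, order[1:])
    )


order = fcsl_order(2, 3, 3)
print(order[8], order[9])    # (0, 2, 2) (1, 2, 2)
print(is_continuous(order))  # True
```

The last token of face 0 and the first token of face 1 share the same (h, w) position, which is what keeps the path continuous across faces.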

How causality works under FCSL

The important part is that causality is defined by token order, not by naive row-major voxel order.

So the causal Conv3D mask is not “future in flatten order”.
It is “future under FCSL token order”.

Operationally:

each voxel has a token index induced by FCSL
for each kernel offset (dg, dh, dw), we check whether that neighbor corresponds to a token index <= current token index
if yes, the tap is allowed
if no, the tap is masked out

So the convolution remains causal with respect to the actual sequence path, not with respect to an unrelated memory layout convention.

That means the local Conv3D does not just scan a cube mechanically. It scans a cube under a sequence-aware causal mask.
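The operational rule above can be sketched as a per-voxel tap check. This is an illustration under assumptions: a 3 x 3 x 3 kernel, a dictionary-based index, and the hypothetical helper names `fcsl_index` and `allowed_taps`. How the repository actually materialises and applies the mask is not described in this post.

```python
from itertools import product


def fcsl_index(G, H, W):
    """Map each voxel (g, h, w) to its FCSL token index."""
    face = []
    for h in range(H):
        cols = range(W) if h % 2 == 0 else range(W - 1, -1, -1)
        face.extend((h, w) for w in cols)
    index, t = {}, 0
    for g in range(G):
        for h, w in (face if g % 2 == 0 else face[::-1]):
            index[(g, h, w)] = t
            t += 1
    return index


def allowed_taps(index, pos, k=3):
    """Kernel offsets (dg, dh, dw) that are causal at voxel `pos`."""
    r = k // 2
    here = index[pos]
    taps = []
    for off in product(range(-r, r + 1), repeat=3):
        nb = tuple(p + o for p, o in zip(pos, off))
        # Out-of-volume neighbours are absent from the index.
        # In-volume neighbours pass only if they are not in the future.
        if nb in index and index[nb] <= here:
            taps.append(off)
    return taps


index = fcsl_index(2, 3, 3)
print(allowed_taps(index, (0, 0, 0)))       # [(0, 0, 0)]
print(len(allowed_taps(index, (1, 2, 2))))  # 5
```

Note that the set of allowed taps depends on the voxel position, because the path direction alternates between rows and between faces.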

How the scans read it

The same layout is used consistently by:

seq_to_vol
vol_to_seq
RMU sequence view
VAttn sequence view

So sequence readers are no longer fighting the packing.

The intended flow is:

tokens are packed into the volume by FCSL
local Conv3D sees a continuous causal path in that volume
RMU / VAttn read the same order back consistently
the model does not need to first learn an internal inverse permutation before doing useful work (it already has too much to learn :D)
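A round-trip sketch of the pack and unpack steps, using nested lists instead of tensors. The names `seq_to_vol` and `vol_to_seq` come from this post, but the signatures and bodies below are assumptions made for illustration.

```python
def fcsl_order(G, H, W):
    """(g, h, w) coordinate of every token, in FCSL token order."""
    face = []
    for h in range(H):
        cols = range(W) if h % 2 == 0 else range(W - 1, -1, -1)
        face.extend((h, w) for w in cols)
    return [
        (g, h, w)
        for g in range(G)
        for h, w in (face if g % 2 == 0 else face[::-1])
    ]


def seq_to_vol(seq, G, H, W):
    """Pack a flat token sequence into a G x H x W nested list."""
    assert len(seq) == G * H * W, "sequence must fill the volume exactly"
    vol = [[[None] * W for _ in range(H)] for _ in range(G)]
    for tok, (g, h, w) in zip(seq, fcsl_order(G, H, W)):
        vol[g][h][w] = tok
    return vol


def vol_to_seq(vol):
    """Read the volume back in the same FCSL order."""
    G, H, W = len(vol), len(vol[0]), len(vol[0][0])
    return [vol[g][h][w] for g, h, w in fcsl_order(G, H, W)]


tokens = [f"t{i}" for i in range(18)]
vol = seq_to_vol(tokens, 2, 3, 3)
print(vol_to_seq(vol) == tokens)  # True
```

Because both directions share one ordering function, any sequence reader that uses it sees the same order the packer wrote.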

Why I think this is worth trying

I am not claiming this is universally better.

I am saying the geometry is more coherent for:

local causal convolutions
sequence reconstruction from volume
shared-kernel consistency

And I hope it works the way it is supposed to. The tests so far look very good, which is at least encouraging.

In particular:

the packing/usefulness probes behaved well
prefix sensitivity stayed at FP32-noise level
changing the suffix moved the prefix outputs by only about 2e-7 to 4e-7 in the tests

That does not prove better language modeling by itself.
But it is a good sign that the layout is not breaking causality while still giving the model a cleaner local path to work with.
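For readers unfamiliar with this kind of check, here is a minimal sketch of a prefix-sensitivity probe. The two toy models are stand-ins invented for illustration, and the actual tests in the repository are not shown in this post. For scale, float32 machine epsilon is about 1.2e-7, so differences of a few 1e-7 are at rounding level.

```python
import random


def causal_toy(xs):
    """Running mean: output t depends only on xs[: t + 1]."""
    out, total = [], 0.0
    for t, x in enumerate(xs, start=1):
        total += x
        out.append(total / t)
    return out


def leaky_toy(xs):
    """Adds the global mean: every output sees the whole sequence."""
    mean = sum(xs) / len(xs)
    return [x + mean for x in xs]


def prefix_sensitivity(model, seq, split, rng):
    """Max change in outputs before `split` when only the suffix changes."""
    perturbed = seq[:split] + [rng.random() for _ in seq[split:]]
    a, b = model(seq), model(perturbed)
    return max(abs(x - y) for x, y in zip(a[:split], b[:split]))


rng = random.Random(0)
seq = [rng.random() for _ in range(32)]
causal_delta = prefix_sensitivity(causal_toy, seq, 16, rng)
leaky_delta = prefix_sensitivity(leaky_toy, seq, 16, rng)
print(causal_delta)     # 0.0
print(leaky_delta > 0)  # True
```

A causal model keeps the prefix outputs unchanged up to numerical noise, while a model that leaks future tokens shows a clearly nonzero difference.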

So the rationale is not “fancy curve beats simple reshape”.
It is:

avoid irregular causal geometry of row-major
avoid over-encoded traversal of Morton / Hilbert
keep the packing simple, continuous, and easy for the model to decode

That is what FCSL is trying to do.


Why I removed the stem

I also removed the input stem.

The reason is straightforward: once the packing itself became geometry-aware and the causal masking was made sequence-aware, the stem was no longer pulling its weight.

In the previous setup, the stem had to do too many jobs at once:

partially compensate for an awkward packing
impose an additional local inductive bias before the main sequence modules
transform the raw embedding volume before RMU / VAttn even saw it

That is a dangerous place to spend complexity. If the stem is too weak, it becomes a bottleneck. If it is too expressive, it becomes another hard-to-train nonlinear subsystem in front of an already nontrivial (IMHO) model.

After switching to a cleaner packing, my conclusion was that the model should receive the packed embedding volume more directly, and that the actual contextual work should be done by the mechanisms that are already meant to do it:

RMU
VAttn
the slab pipeline itself

In other words, if the volume layout is already coherent, then adding an extra learned “interpreter” before the real sequence machinery may be more harmful than helpful.

So the current direction is:

make the packing easier to read
make the causal geometry correct
let the core sequence modules operate on that directly

This is not a claim that input stems are always bad. It is just the more defensible choice for this architecture: reduce unnecessary preprocessing, remove one more nonlinear bottleneck, and force the model to spend its capacity on the actual sequence dynamics rather than on undoing its own front-end.

What `optim_mem_device` is for

Another practical piece here is optim_mem_device.

The point of it is not to move the actual training compute off the GPU. The point is to move training-state memory pressure away from the GPU when memory, not compute, is the limiting factor.

In the current setup, optim_mem_device is used for:

low-precision gradient accumulation buffers
autograd saved-tensor offload when the target is CPU

So the main effect is:

forward still runs normally on the GPU
backward is still exact
but tensors that autograd would normally keep on the GPU can instead be staged in host memory

This is fundamentally different from checkpointing.

Checkpointing saves memory by dropping activations and recomputing forward work during backward.

This path saves memory by keeping the needed tensors, but storing them off the compute device.

So the trade is:

less recomputation
more host-memory traffic

What it is not

It is not “do N forwards, dump them all to RAM, then do one giant backward gulp at the end.”

That is not how this path works.

What actually happens is closer to this:

forward saves the tensors autograd needs
those saved tensors are offloaded to CPU memory
during backward, they are pulled back as needed by the relevant backward ops

So this is a staged autograd-tape reload path, not one giant bulk restore.

That distinction matters, because it explains both the memory behavior and the performance behavior.
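The pack-on-forward, reload-on-backward behaviour can be reproduced in isolation with PyTorch's built-in `torch.autograd.graph.save_on_cpu` context manager. This is a sketch under assumptions: whether `optim_mem_device` uses this context manager or custom saved-tensor hooks is not stated in this post, and the small model here is purely illustrative.

```python
import torch
from torch.autograd.graph import save_on_cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

# Illustrative stand-in model, not the architecture from this post.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.GELU(),
    torch.nn.Linear(32, 4),
).to(device)
x = torch.randn(8, 16, device=device)


def loss_fn():
    return model(x).square().mean()


# Baseline: saved tensors stay on the compute device.
loss_fn().backward()
baseline = [p.grad.clone() for p in model.parameters()]
model.zero_grad(set_to_none=True)

# Offloaded: tensors saved for backward are copied to host memory
# during forward and copied back when each backward op needs them.
# Pinned memory is only requested when CUDA is available.
with save_on_cpu(pin_memory=torch.cuda.is_available()):
    loss = loss_fn()
loss.backward()
offloaded = [p.grad.clone() for p in model.parameters()]

max_diff = max(
    (a - b).abs().max().item() for a, b in zip(baseline, offloaded)
)
print(max_diff)
```

No forward work is recomputed here. The gradients should match the baseline because the same saved tensors are used, only stored elsewhere in between.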

Why this can be preferable to checkpointing

There are already many ways to trade memory against something else:

gradient accumulation
checkpointing / rematerialization
CPU offload
optimizer-state offload
reduced-precision grad buffers
smaller batch or shorter rows

But different machines hit different limits.

Some are:

VRAM-bound
compute-bound
host-memory-bandwidth-bound
PCIe-bound
or some combination of those

In my case, this direction is more attractive than checkpointing.

The reason is simple:

my setup is more compute-bound
and host-memory bandwidth does not appear to be saturated
(nvitop was useful here)

So for me it is cheaper to keep the forward exact, store the saved autograd state in RAM, and stream it back during backward than to keep re-running substantial parts of the graph.

That is not a universal claim. It is a machine-specific trade.

About pinned memory

One important detail here is that the CPU offload path uses pinned memory.

That matters because pinned host memory is not ordinary pageable RAM:

it is faster for CPU↔GPU transfers
but it is also “stickier”
and it usually consumes more visible host memory than people expect

So if you turn this on and suddenly see extra gigabytes in RAM usage, that is not mysterious.

A lot of that is simply the cost of using pinned host buffers for saved-tensor offload.

So the right mental model is:

lower VRAM pressure
higher host RAM usage
somewhat higher step time
and part of that RAM increase comes specifically from pinned-memory behavior

In other words: do not be surprised if RAM usage grows by more than the naive “GPU delta” would suggest.

Practical result on my machine

On my test setup this offload path slowed training by about 10 seconds per step relative to the baseline:

normal step: about 46 seconds
offloaded step: about 56 seconds

That is a real slowdown, but for me it is still a better trade than recomputing large, near-OOM parts of the forward graph.
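As a rough way to read those two numbers, the arithmetic below turns them into a relative overhead and a break-even budget for recomputation. Only the 46 s and 56 s figures come from this post. The comparison rule is a simplifying assumption: checkpointing is treated as adding pure recompute time on top of the baseline step.

```python
baseline_s = 46.0   # normal step, from the measurement above
offloaded_s = 56.0  # offloaded step, from the measurement above

extra_s = offloaded_s - baseline_s
overhead = extra_s / baseline_s
print(f"offload overhead: {extra_s:.0f} s per step ({overhead:.1%})")
# offload overhead: 10 s per step (21.7%)


def checkpointing_wins(recompute_s):
    """Assumed rule: checkpointing is cheaper only if its extra
    recompute time per step is below the offload overhead."""
    return recompute_s < extra_s


print(checkpointing_wins(6.0))   # True
print(checkpointing_wins(15.0))  # False
```

On this machine, checkpointing would have to cost less than about 10 seconds of recomputation per step to beat the offload path.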

So the intended use of optim_mem_device is:

keep compute on the GPU
offload saved training-state tensors when GPU memory is the real constraint
avoid recomputation when recomputation is more expensive than memory traffic on your hardware

I would describe it as a different tradeoff that can make more sense on compute-bound systems with enough host-memory headroom and decent transfer bandwidth.
That is a much more boring idea than fancy activation tricks, but in practice boring is good here. It is direct, exact, and easy to reason about. And overall, it's in the spirit of all the boring code here. Local code culture!
