About eab8a60 and the Faced Continuous Snake Layout, which may be interesting.

I switched the token packing to Faced Continuous Snake Layout (FCSL). The short version: this is an attempt to make the volume easier for the model to read causally, locally, and consistently, without forcing it to spend capacity undoing an awkward packing scheme first.

Why not plain row-major

Row-major is simple to implement, but geometrically it is not a very good causal path for a local 3D operator. In row-major packing:
So the local neighborhood is not homogeneous. The meaning of “the next token” depends on where you are:
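A small sketch can make the row-major inhomogeneity concrete. It walks a volume in plain row-major order and counts the displacement between each token and the next one. The helper names (`row_major_coords`, `step_displacements`) and the 2x3x4 volume size are illustrative assumptions, not taken from the commit.

```python
from collections import Counter


def row_major_coords(depth, height, width):
    """Yield (z, y, x) coordinates in plain row-major (C) order."""
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                yield (z, y, x)


def step_displacements(path):
    """Count the displacement vectors between consecutive tokens on a path."""
    path = list(path)
    return Counter(
        tuple(b - a for a, b in zip(p, q))
        for p, q in zip(path, path[1:])
    )


# Hypothetical 2x3x4 volume, chosen only to keep the output small.
steps = step_displacements(row_major_coords(2, 3, 4))
for delta, count in sorted(steps.items()):
    print(delta, count)
```

On this volume the walk produces three different kinds of step: a unit step along x inside a row, a jump back across the whole row at each row end, and a jump back across the whole plane at each plane end. A shared kernel sees all three as "the next token".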
For a causal convolution, that matters. The kernel is shared. If the sequence path keeps changing its local orientation, the same filter has to learn several local transition patterns instead of one. FCSL reduces that problem. It keeps the path continuous both:
So the local causal path is more regular.

Why not Morton / Hilbert

Morton and Hilbert are good locality-preserving curves in the abstract, but they are not a great fit here for two reasons.

1. They optimize generic spatial locality, not causal readability

Those curves are designed to keep nearby coordinates nearby. That is useful for caches and some geometric problems. But here the question is not just “are nearby tokens spatially close?”
Morton / Hilbert improve locality, but they make the traversal rule more encoded. The path becomes algorithmically clever, not structurally obvious.

2. They increase decoding burden on the model

The model is not just consuming geometry. It has to recover sequence structure from that geometry. A complicated packing is a kind of encoding. And the more encoded it is, the more work the model has to do to decode it before it can use it well. I would summarize the intuition like this:
or more precisely:
That is the motivation here. Not a grand claim, just an inductive-bias argument.

Why FCSL specifically

FCSL keeps the packing rule simple:
So sequence continuity is maintained with a very small rule set.

Example on a

This gives a path that is:
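For illustration only, here is a minimal sketch of a continuous snake path through a 3D volume. It is a plain 3D boustrophedon, which is an assumption on my side and not necessarily the exact FCSL rule used in eab8a60; the function names `snake_coords` and `is_continuous` are hypothetical. The point is only to show that a very small rule set (flip x on every other row, flip y on every other plane) is enough to keep every consecutive pair of tokens adjacent.

```python
def snake_coords(depth, height, width):
    """Yield (z, y, x) along a 3D boustrophedon path.

    Assumed rule (not confirmed against the commit): the x direction flips
    on every row, and the y direction flips on every plane, so the path
    never jumps.
    """
    row = 0  # running row counter across planes keeps x alternating
    for z in range(depth):
        ys = range(height) if z % 2 == 0 else range(height - 1, -1, -1)
        for y in ys:
            xs = range(width) if row % 2 == 0 else range(width - 1, -1, -1)
            for x in xs:
                yield (z, y, x)
            row += 1


def is_continuous(path):
    """True if every consecutive pair of voxels differs by one unit step."""
    path = list(path)
    return all(
        sum(abs(b - a) for a, b in zip(p, q)) == 1
        for p, q in zip(path, path[1:])
    )


# Hypothetical 2x2x2 volume, small enough to read the whole path.
path = list(snake_coords(2, 2, 2))
print(path)
print(is_continuous(path))
```

Under this assumed rule, every step is a unit move along exactly one axis, including the steps that cross a row boundary or a plane boundary.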
How causality works under FCSL

The important part is that causality is defined by token order, not by naive row-major voxel order. So the causal Conv3D mask is not “future in flatten order”. Operationally:
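A minimal sketch of what a mask defined by token order could look like. Everything here is an assumption for illustration: the snake path is a plain 3D boustrophedon rather than the exact FCSL rule, the helper names (`snake_coords`, `token_index`, `causal_mask`) are hypothetical, and I let a token see itself, which may or may not match the commit.

```python
def snake_coords(depth, height, width):
    """Yield (z, y, x) along an assumed 3D boustrophedon path."""
    row = 0  # running row counter across planes keeps x alternating
    for z in range(depth):
        ys = range(height) if z % 2 == 0 else range(height - 1, -1, -1)
        for y in ys:
            xs = range(width) if row % 2 == 0 else range(width - 1, -1, -1)
            for x in xs:
                yield (z, y, x)
            row += 1


def token_index(depth, height, width):
    """Map each (z, y, x) voxel to its position in the token sequence."""
    return {c: i for i, c in enumerate(snake_coords(depth, height, width))}


def causal_mask(index, center, kernel=3):
    """Return a kernel^3 nested list for the window around `center`.

    An entry is 1 if the neighbour lies inside the volume and its token
    position is <= the centre's position, otherwise 0.
    """
    r = kernel // 2
    cz, cy, cx = center
    here = index[center]
    return [
        [
            [
                # Out-of-volume neighbours default to here + 1, i.e. masked.
                1 if index.get((cz + dz, cy + dy, cx + dx), here + 1) <= here else 0
                for dx in range(-r, r + 1)
            ]
            for dy in range(-r, r + 1)
        ]
        for dz in range(-r, r + 1)
    ]


index = token_index(2, 2, 2)
mask = causal_mask(index, center=(0, 1, 1))
for plane in mask:
    print(plane)
```

In this toy case the voxel at `(0, 1, 0)` sits "before" the centre `(0, 1, 1)` in row-major order, but it comes later on the snake path, so the mask hides it. Note that under such a path the mask depends on the position (the row direction alternates), so it has to be built per voxel or per parity class rather than as one fixed kernel mask.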
So the convolution remains causal with respect to the actual sequence path, not with respect to an unrelated memory layout convention. That means the local Conv3D does not just scan a cube mechanically. It scans a cube under a sequence-aware causal mask.

How the scans read it

The same layout is used consistently by:
So sequence readers are no longer fighting the packing. The intended flow is:
Why I think this is worth trying

I am not claiming this is universally better. I am saying the geometry is more coherent for:
And I hope it works the way it is supposed to, not least because the tests so far look very good. In particular:
That does not prove better language modeling by itself. So the rationale is not “fancy curve beats simple reshape”.
That is what FCSL is trying to do.

Why I removed the stem

I also removed the input stem. The reason is straightforward: once the packing itself became geometry-aware and the causal masking was made sequence-aware, the stem was no longer pulling its weight. In the previous setup, the stem had to do too many jobs at once:
That is a dangerous place to spend complexity. If the stem is too weak, it becomes a bottleneck. If it is too expressive, it becomes another hard-to-train nonlinear subsystem in front of an already nontrivial (IMHO) model. After switching to a cleaner packing, my conclusion was that the model should receive the packed embedding volume more directly, and that the actual contextual work should be done by the mechanisms that are already meant to do it:
In other words, if the volume layout is already coherent, then adding an extra learned “interpreter” before the real sequence machinery may be more harmful than helpful. So the current direction is:
This is not a claim that input stems are always bad. It is just the more defensible choice for this architecture: reduce unnecessary preprocessing, remove one more nonlinear bottleneck, and force the model to spend its capacity on the actual sequence dynamics rather than on undoing its own front-end. What
Ask questions, I'll try to answer (: