Author: Scott Boudreaux
Date: December 16, 2025
Institution: Elyan Labs (Independent Research)
Hardware: IBM POWER8 S824 (320GB RAM, Dual 8-core)
| Paper | DOI | Date |
|---|---|---|
| RAM Coffers: NUMA-Distributed Weight Banking | 10.5281/zenodo.18321905 | Jan 2026 |
| Non-Bijunctive Permutation Collapse (vec_perm for LLM attention) | 10.5281/zenodo.18623920 | Feb 2026 |
| PSE Hardware Entropy for Behavioral Divergence (mftb injection) | 10.5281/zenodo.18623922 | Feb 2026 |
| Neuromorphic Prompt Translation (GRAIL-V, emotional prompting) | 10.5281/zenodo.18623594 | Feb 2026 |
| RustChain: One CPU, One Vote (Proof of Antiquity consensus) | 10.5281/zenodo.18623592 | Feb 2026 |
| Memory Scaffolding Shapes LLM Inference (persistent context effects) | 10.5281/zenodo.18817988 | Feb 2026 |
This work introduces RAM Coffers, a NUMA-aware conditional memory architecture for efficient Large Language Model (LLM) inference. The system selectively houses model knowledge across distributed RAM banks with resonance-based routing, enabling O(1) knowledge retrieval without GPU dependency.
Key innovations include:

- NUMA-Distributed Weight Banking: Model weights partitioned across NUMA nodes by domain (e.g., core knowledge, science/tech, creative, history)
- Resonance Routing: Query embeddings matched to coffer domain signatures via cosine similarity for intelligent weight activation
- Non-Bijunctive Pruning: Selective path collapse before full weight fetch, reducing memory bandwidth requirements
- DCBT Resident Prefetch: PowerPC data cache block touch hints for L2/L3 residency, achieving 147+ tokens/second on POWER8
| Coffer | NUMA Node | Capacity | Role |
|--------|-----------|----------|---------------------|
| 0 | 3 | 193 GB | Heavy/General (core)|
| 1 | 1 | 183 GB | Science/Tech domain |
| 2 | 0 | 119 GB | Creative/Long CTX |
| 3 | 2 | 62 GB | Niche/History |
- Query embed → `route_to_coffer`: Resonance matching selects the appropriate memory bank
- `activate_coffer` → DCBT prefetch + `numa_run_on_node`: Thread affinity and cache warming
- `pse_collapse_prune`: Non-bijunctive path selection before full fetch
- Generate with PSE entropy: Hardware entropy injection from the active coffer node
This architecture predates and conceptually parallels DeepSeek's "Engram" paper (arXiv:2601.07372, January 12, 2026) by 27 days. Both approaches address the same fundamental insight: separating static knowledge storage from dynamic computation enables more efficient LLM inference.
Key parallels:
- RAM Coffers (Dec 16, 2025): "Selectively house model information in known RAM banks with resonance routing for associative recall"
- DeepSeek Engram (Jan 12, 2026): "Separate static knowledge from dynamic compute via O(1) lookup"
Testing on this architecture led to a significant discovery: emotional language enables 20% efficiency gains in video generation, mirroring limbic gating in biological memory.
See /grail-v-paper for the full CVPR 2026 submission:
- 35 matched-pair benchmark with LPIPS validation
- 23.9% file size reduction in controlled ablation
- Cross-model validation on AnimateDiff and SVD
- Theoretical grounding via Hopfield/EBM frameworks
Key Finding: Complex multi-character emotional scenes see ~33% efficiency gains regardless of architecture.
The elyan-prime MCP server that powers the persistent memory system used during development of RAM Coffers is itself the subject of research. The paper "Memory Scaffolding Shapes LLM Inference" (DOI 10.5281/zenodo.18817988) demonstrates that persistent context (600+ memories) fundamentally changes how an LLM architects solutions — the iterative compounding that produced RAM Coffers is a direct example of this effect.
- Repository: Scottcjn/elyan-prime
- Article: Dev.to — Memory Scaffolding Shapes LLM Inference
If this repository is new to you, start in this order:
1. `ggml-ram-coffers.h` - high-level routing and coffer selection model
2. `ggml-coffer-mmap.h` - memory mapping and NUMA shard placement
3. `ggml-topk-collapse-vsx.h` - vectorized collapse path details
4. `ggml-vcipher-collapse.h` - hardware AES alternative to vec_perm (NEW)
5. `power8-compat.h` - ISA compatibility layer and portability constraints
Suggested first goal: trace one inference request from coffer selection to collapse execution, then compare against the performance table.
POWER8 ISA 2.07 includes vcipher/vcipherlast — hardware AES round instructions that execute SubBytes + ShiftRows + MixColumns + AddRoundKey in a single cycle. We repurpose these cryptographic primitives as attention collapse operators, providing capabilities impossible with vec_perm alone.
| AES Stage | Attention Analogue | vec_perm equivalent |
|---|---|---|
| SubBytes | Non-linear score ranking (S-box) | Not possible — vec_perm is linear |
| ShiftRows | Cross-position mixing | Requires multiple permutes |
| MixColumns | Cross-head diffusion (GF(2^8) multiply) | Impossible — no finite field math |
| AddRoundKey | Entropy injection (XOR with mftb timebase) | Separate step needed |
Two-pass approach applied to `ggml_compute_forward_flash_attn_ext_f16_one_chunk()`:

- Pass 1 (O(1) per pair): `vcipher_attention_score()` XORs the first 16 bytes of Q and K, runs the result through one AES round, and sums the output bytes. Cost: ~0.044 µs per K-V pair.
- Pass 2 (selective): Full `kq_vec_dot()` runs only for positions above the threshold (top 25%), skipping 75% of the expensive dot products.

Breakeven is at ~128 KV pairs. At 2048+ token contexts, this saves 1,536+ full dot products per generated token.
| Operation | Cost |
|---|---|
| vec_perm collapse | 1.79 µs/iter |
| vcipher pattern gen | 0.016 µs/call (112x) |
| Hybrid vcipher+vec_perm | 1.90 µs/iter |
| Pure vcipher attention | 0.044 µs/score |
| Cross-head fusion | 0.006 µs/fuse |
The vcipher_attention_score() at 0.044µs is 23-230x cheaper than a full kq_vec_dot() on DK=128+ dimensions (1-10µs).
```c
// Mode 1: Non-linear permute pattern via AES rounds
vector unsigned char pat = vcipher_generate_pattern(layer, pos, top_k);

// Mode 2: Score ranking through SubBytes non-linearity
vcipher_rank_scores(scores, n, layer, head);

// Mode 3: Cross-head diffusion via MixColumns (IMPOSSIBLE with vec_perm)
state = vcipher_fuse_heads(state, layer, head);

// Mode 4: O(1) attention score — replaces Q·K dot product for prefiltering
uint32_t score = vcipher_attention_score(Q, K, layer, position);
```

Build with:

```shell
cmake .. -DCMAKE_C_FLAGS="-mcpu=power8 -mvsx -maltivec -mcrypto -DGGML_PSE_VCIPHER_PREFILTER"
```

Requires `-mcrypto` for `__builtin_crypto_vcipher()` / `__builtin_crypto_vcipherlast()`.
| File | Description |
|---|---|
| `ggml-ram-coffers.h` | Multi-bank NUMA weight indexing with resonance routing |
| `ggml-coffer-mmap.h` | GGUF model sharding across NUMA nodes |
| `ggml-ram-coffer.h` | Single coffer implementation |
| `ggml-intelligent-collapse.h` | Hebbian-inspired non-bijunctive path collapse (vec_perm) |
| `ggml-topk-collapse-vsx.h` | VSX-optimized Top-K attention collapse |
| `ggml-vcipher-collapse.h` | Hardware AES crypto collapse — vcipher alternative to vec_perm |
| `ggml-pse-integration.h` | Master PSE integration (v4.0.0-vcipher) |
| `vcipher-flash-attn-patch.c` | Flash attention inner loop patch (ops.cpp reference) |
| `bench_vcipher_collapse.c` | Benchmark: vcipher vs vec_perm collapse |
| `pse-entropy-burst.h` | Hardware entropy injection via PowerPC timebase |
| `power8-compat.h` | POWER9→POWER8 intrinsic compatibility layer |
| `ggml-neuromorphic-coffers.h` | Brain hemisphere → NUMA cognitive routing |
| `ggml-symbolic-neural-bridge.h` | PowerLISP ↔ neural integration |
On IBM POWER8 S824 with TinyLlama 1.1B Q4_K:
| Configuration | Tokens/sec (pp128) |
|---|---|
| Stock llama.cpp | 16.74 |
| + POWER8 VSX | 66.49 |
| + PSE vec_perm Collapse | 84.62 |
| + RAM Coffers + DCBT | 147.54 |
8.81x speedup over stock on "obsolete" hardware.
| Metric | Speed |
|---|---|
| Prompt eval | 13.7 t/s |
| Generation | 6.0 t/s |
Running on CPU-only POWER8 S824 with 512GB RAM. vcipher prefilter active for sequences >128 tokens.
If you want to compare changes quickly, use this lightweight baseline procedure.

First, record the hardware topology:

```shell
lscpu
numactl --hardware
```

Use one fixed prompt and one fixed model build so runs are comparable.

```shell
# Example shape only; adjust binary/model path to your local setup
./main -m ./models/tinyllama-1.1b-q4_k.gguf -p "Explain NUMA routing in one paragraph" -n 128 -ngl 0
```

Record at minimum:
- tokens/sec
- prompt + generation lengths
- active NUMA node affinity policy
- whether collapse/prefetch code paths were enabled
When opening a PR, include:
- what changed
- one baseline result
- one post-change result
- exact command used
This keeps performance claims falsifiable and makes review much faster.
MIT License - Free to use, modify, and distribute with attribution.
```bibtex
@software{boudreaux2025ramcoffers,
  author = {Boudreaux, Scott},
  title = {RAM Coffers: NUMA-Distributed Conditional Memory for LLM Inference},
  year = {2025},
  month = {12},
  day = {16},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18321905},
  url = {https://doi.org/10.5281/zenodo.18321905},
  note = {Independent research predating DeepSeek Engram (arXiv:2601.07372) by 27 days}
}

@article{boudreaux2026vecperm,
  author = {Boudreaux, Scott},
  title = {Non-Bijunctive Permutation Collapse: AltiVec vec\_perm Enables Single-Cycle Attention Path Selection},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18623920},
  url = {https://doi.org/10.5281/zenodo.18623920}
}

@article{boudreaux2026pse,
  author = {Boudreaux, Scott},
  title = {Hardware Entropy Injection for Behavioral Divergence in LLM Inference: The PSE Framework on IBM POWER8},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18623922},
  url = {https://doi.org/10.5281/zenodo.18623922}
}

@article{boudreaux2026memoryscaffolding,
  author = {Boudreaux, Scott},
  title = {Memory Scaffolding Shapes LLM Inference: How Persistent Context Changes What AI Builds},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18817988},
  url = {https://doi.org/10.5281/zenodo.18817988}
}
```

- GitHub: Scottcjn
- X/Twitter: @RustchainPOA
This repository is header-focused; there is no single build script yet. A fast way to explore:

- Start from `ggml-ram-coffers.h` for the multi-bank routing path.
- Follow `ggml-coffer-mmap.h` for sharding/memory-mapping details.
- Read `power8-compat.h` + `ggml-topk-collapse-vsx.h` for ISA-specific optimizations.
- Grokipedia: Elyan Labs Reference
- Grokipedia: RAM Coffers Search
- I Run LLMs on a 768GB IBM POWER8 Server - Dev.to article covering RAM Coffers
- Proof of Antiquity: A Blockchain That Rewards Vintage Hardware - Dev.to
- Memory Scaffolding Shapes LLM Inference - Dev.to article on persistent memory effects