RAM Coffers: NUMA-Distributed Conditional Memory for LLM Inference

Author: Scott Boudreaux (BCOS Certified)
Date: December 16, 2025
Institution: Elyan Labs (Independent Research)
Hardware: IBM POWER8 S824 (320 GB RAM, dual 8-core)


Publications

| Paper | DOI | Date |
|-------|-----|------|
| RAM Coffers: NUMA-Distributed Weight Banking | 10.5281/zenodo.18321905 | Jan 2026 |
| Non-Bijunctive Permutation Collapse (vec_perm for LLM attention) | 10.5281/zenodo.18623920 | Feb 2026 |
| PSE Hardware Entropy for Behavioral Divergence (mftb injection) | 10.5281/zenodo.18623922 | Feb 2026 |
| Neuromorphic Prompt Translation (GRAIL-V, emotional prompting) | 10.5281/zenodo.18623594 | Feb 2026 |
| RustChain: One CPU, One Vote (Proof of Antiquity consensus) | 10.5281/zenodo.18623592 | Feb 2026 |
| Memory Scaffolding Shapes LLM Inference (persistent context effects) | 10.5281/zenodo.18817988 | Feb 2026 |

Abstract

This work introduces RAM Coffers, a NUMA-aware conditional memory architecture for efficient Large Language Model (LLM) inference. The system selectively houses model knowledge across distributed RAM banks with resonance-based routing, enabling O(1) knowledge retrieval without GPU dependency.

Key innovations include:

  1. NUMA-Distributed Weight Banking: Model weights partitioned across NUMA nodes by domain (e.g., core knowledge, science/tech, creative, history)

  2. Resonance Routing: Query embeddings matched to coffer domain signatures via cosine similarity for intelligent weight activation

  3. Non-Bijunctive Pruning: Selective path collapse before full weight fetch, reducing memory bandwidth requirements

  4. DCBT Resident Prefetch: PowerPC data cache block touch hints for L2/L3 residency, achieving 147+ tokens/second on POWER8
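
As a hedged illustration of item 4, the sketch below warms a weight slab into cache ahead of use. On POWER8 the real code would emit dcbt (data cache block touch); here GCC's portable `__builtin_prefetch` stands in, and the function names (`coffer_prefetch_rows`, `sum_bytes`) are illustrative, not repository API.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 128  /* POWER8 cache line size */

/* Touch each cache line of a weight slab before the compute loop.
 * __builtin_prefetch(addr, 0 = read, 3 = high temporal locality)
 * is the portable stand-in for the POWER8 dcbt hint. */
static void coffer_prefetch_rows(const void *weights, size_t bytes)
{
    const uint8_t *p = (const uint8_t *)weights;
    for (size_t off = 0; off < bytes; off += CACHE_LINE)
        __builtin_prefetch(p + off, 0, 3);
}

/* Prefetch, then consume, so the warmed lines have an actual user. */
static uint64_t sum_bytes(const uint8_t *w, size_t n)
{
    coffer_prefetch_rows(w, n);
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += w[i];
    return s;
}
```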

Architecture

| Coffer | NUMA Node | Capacity | Role                |
|--------|-----------|----------|---------------------|
| 0      | 3         | 193 GB   | Heavy/General (core)|
| 1      | 1         | 183 GB   | Science/Tech domain |
| 2      | 0         | 119 GB   | Creative/Long CTX   |
| 3      | 2         | 62 GB    | Niche/History       |
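
For orientation, the table above can be mirrored as a static descriptor array. This is a sketch only; the type and field names (`coffer_desc`, `COFFERS`) are illustrative, not taken from the headers.

```c
#include <stddef.h>

/* Descriptor per coffer, mirroring the architecture table. */
typedef struct {
    int         coffer_id;
    int         numa_node;
    size_t      capacity_gb;
    const char *role;
} coffer_desc;

static const coffer_desc COFFERS[4] = {
    { 0, 3, 193, "Heavy/General (core)" },
    { 1, 1, 183, "Science/Tech domain"  },
    { 2, 0, 119, "Creative/Long CTX"    },
    { 3, 2,  62, "Niche/History"        },
};

/* Total distributed capacity across all coffers (sums the table). */
static size_t total_capacity_gb(void)
{
    size_t total = 0;
    for (size_t i = 0; i < 4; i++)
        total += COFFERS[i].capacity_gb;
    return total;
}
```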

Processing Flow

  1. Query embed → route_to_coffer: Resonance matching selects appropriate memory bank
  2. activate_coffer → DCBT prefetch + numa_run_on_node: Thread affinity and cache warming
  3. pse_collapse_prune: Non-bijunctive path selection before full fetch
  4. Generate with PSE entropy: Hardware entropy injection from active coffer node

Relation to Subsequent Work

This architecture predates and conceptually parallels DeepSeek's "Engram" paper (arXiv:2601.07372, January 12, 2026) by 27 days. Both approaches address the same fundamental insight: separating static knowledge storage from dynamic computation enables more efficient LLM inference.

Key parallels:

  • RAM Coffers (Dec 16, 2025): "Selectively house model information in known RAM banks with resonance routing for associative recall"
  • DeepSeek Engram (Jan 12, 2026): "Separate static knowledge from dynamic compute via O(1) lookup"

GRAIL-V Paper: Emotional Prompting Discovery

Testing on this architecture led to a significant discovery: emotional language enables 20% efficiency gains in video generation, mirroring limbic gating in biological memory.

See /grail-v-paper for the full CVPR 2026 submission:

  • 35 matched-pair benchmark with LPIPS validation
  • 23.9% file size reduction in controlled ablation
  • Cross-model validation on AnimateDiff and SVD
  • Theoretical grounding via Hopfield/EBM frameworks

Key Finding: Complex multi-character emotional scenes see roughly 33% efficiency gains regardless of architecture.

Memory Scaffolding

The elyan-prime MCP server that powers the persistent memory system used during development of RAM Coffers is itself the subject of research. The paper "Memory Scaffolding Shapes LLM Inference" (DOI 10.5281/zenodo.18817988) demonstrates that persistent context (600+ memories) fundamentally changes how an LLM architects solutions — the iterative compounding that produced RAM Coffers is a direct example of this effect.


New Reader Path (5-minute orientation)

If this repository is new to you, start in this order:

  1. ggml-ram-coffers.h — high-level routing and coffer selection model
  2. ggml-coffer-mmap.h — memory mapping and NUMA shard placement
  3. ggml-topk-collapse-vsx.h — vectorized collapse path details
  4. ggml-vcipher-collapse.h — hardware AES alternative to vec_perm (NEW)
  5. power8-compat.h — ISA compatibility layer and portability constraints

Suggested first goal: trace one inference request from coffer selection to collapse execution, then compare against the performance table.

vcipher: Hardware AES as Attention Collapse Primitive (NEW - March 2026)

POWER8 ISA 2.07 includes vcipher/vcipherlast — hardware AES round instructions that execute SubBytes + ShiftRows + MixColumns + AddRoundKey in a single cycle. We repurpose these cryptographic primitives as attention collapse operators, providing capabilities impossible with vec_perm alone.

Why vcipher for Attention?

| AES Stage | Attention Analogue | vec_perm equivalent |
|-----------|--------------------|---------------------|
| SubBytes | Non-linear score ranking (S-box) | Not possible (vec_perm is linear) |
| ShiftRows | Cross-position mixing | Requires multiple permutes |
| MixColumns | Cross-head diffusion (GF(2^8) multiply) | Impossible (no finite-field math) |
| AddRoundKey | Entropy injection (XOR with mftb timebase) | Separate step needed |
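
The MixColumns row is the key capability gap: vec_perm can only rearrange bytes, while MixColumns multiplies them in GF(2^8) so every output byte mixes all four inputs. As a standalone illustration (standard AES arithmetic, not repository code):

```c
#include <stdint.h>

/* GF(2^8) doubling ("xtime"): multiply by 2 modulo the AES
 * polynomial x^8 + x^4 + x^3 + x + 1 (0x11B). */
static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
}

/* One MixColumns column, in place: each output byte depends on all
 * four input bytes — the cross-byte diffusion a pure byte permute
 * like vec_perm cannot produce. */
static void mix_column(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    uint8_t t = a0 ^ a1 ^ a2 ^ a3;
    c[0] ^= t ^ xtime(a0 ^ a1);
    c[1] ^= t ^ xtime(a1 ^ a2);
    c[2] ^= t ^ xtime(a2 ^ a3);
    c[3] ^= t ^ xtime(a3 ^ a0);
}
```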

vcipher Prefilter for Flash Attention

Two-pass approach applied to ggml_compute_forward_flash_attn_ext_f16_one_chunk():

  1. Pass 1 (O(1) per pair): vcipher_attention_score() — XOR first 16 bytes of Q and K, run through one AES round, sum output bytes. Cost: ~0.044µs per K-V pair.
  2. Pass 2 (selective): Full kq_vec_dot() only for positions above threshold (top 25%). Skips 75% of expensive dot products.

Breakeven at ~128 KV pairs. At 2048+ token contexts, saves 1,536+ full dot products per generated token.
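
The two passes can be sketched in plain C. A byte-XOR checksum substitutes here for the real vcipher round (the actual `vcipher_attention_score()` runs hardware AES); only the filter-then-fetch shape is the point, and the function names below are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

/* Pass 1 stand-in: cheap O(1) score over the first 16 bytes of Q/K.
 * The real prefilter runs one AES round on the XOR instead. */
static uint32_t cheap_score(const uint8_t *q, const uint8_t *k)
{
    uint32_t s = 0;
    for (int i = 0; i < 16; i++)
        s += (uint32_t)(q[i] ^ k[i]);
    return s;
}

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Mark the top 25% of n KV positions by cheap score in keep[];
 * returns how many survive. Pass 2 would run the full kq_vec_dot()
 * only for positions with keep[i] == 1. */
static int prefilter_top_quartile(const uint32_t *scores, int n, int *keep)
{
    uint32_t *sorted = (uint32_t *)malloc((size_t)n * sizeof(uint32_t));
    for (int i = 0; i < n; i++)
        sorted[i] = scores[i];
    qsort(sorted, (size_t)n, sizeof(uint32_t), cmp_u32);
    uint32_t thresh = sorted[(3 * n) / 4];   /* 75th percentile */
    free(sorted);

    int kept = 0;
    for (int i = 0; i < n; i++) {
        keep[i] = scores[i] >= thresh;
        kept += keep[i];
    }
    return kept;
}
```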

Benchmark: vcipher vs vec_perm (POWER8 S824)

| Operation | Cost |
|-----------|------|
| vec_perm collapse | 1.79 µs/iter |
| vcipher pattern gen | 0.016 µs/call (112x) |
| Hybrid vcipher+vec_perm | 1.90 µs/iter |
| Pure vcipher attention | 0.044 µs/score |
| Cross-head fusion | 0.006 µs/fuse |

The vcipher_attention_score() at 0.044µs is 23-230x cheaper than a full kq_vec_dot() on DK=128+ dimensions (1-10µs).

4 Operating Modes

```c
// Mode 1: Non-linear permute pattern via AES rounds
vector unsigned char pat = vcipher_generate_pattern(layer, pos, top_k);

// Mode 2: Score ranking through SubBytes non-linearity
vcipher_rank_scores(scores, n, layer, head);

// Mode 3: Cross-head diffusion via MixColumns (impossible with vec_perm)
state = vcipher_fuse_heads(state, layer, head);

// Mode 4: O(1) attention score — replaces Q·K dot product for prefiltering
uint32_t score = vcipher_attention_score(Q, K, layer, position);
```
Build

```sh
cmake .. -DCMAKE_C_FLAGS="-mcpu=power8 -mvsx -maltivec -mcrypto -DGGML_PSE_VCIPHER_PREFILTER"
```

Requires -mcrypto for __builtin_crypto_vcipher() / __builtin_crypto_vcipherlast().

Files Included

| File | Description |
|------|-------------|
| ggml-ram-coffers.h | Multi-bank NUMA weight indexing with resonance routing |
| ggml-coffer-mmap.h | GGUF model sharding across NUMA nodes |
| ggml-ram-coffer.h | Single coffer implementation |
| ggml-intelligent-collapse.h | Hebbian-inspired non-bijunctive path collapse (vec_perm) |
| ggml-topk-collapse-vsx.h | VSX-optimized Top-K attention collapse |
| ggml-vcipher-collapse.h | Hardware AES crypto collapse — vcipher alternative to vec_perm |
| ggml-pse-integration.h | Master PSE integration (v4.0.0-vcipher) |
| vcipher-flash-attn-patch.c | Flash attention inner loop patch (ops.cpp reference) |
| bench_vcipher_collapse.c | Benchmark: vcipher vs vec_perm collapse |
| pse-entropy-burst.h | Hardware entropy injection via PowerPC timebase |
| power8-compat.h | POWER9→POWER8 intrinsic compatibility layer |
| ggml-neuromorphic-coffers.h | Brain hemisphere → NUMA cognitive routing |
| ggml-symbolic-neural-bridge.h | PowerLISP ↔ neural integration |

Performance Results

On IBM POWER8 S824 with TinyLlama 1.1B Q4_K:

| Configuration | Tokens/sec (pp128) |
|---------------|--------------------|
| Stock llama.cpp | 16.74 |
| + POWER8 VSX | 66.49 |
| + PSE vec_perm Collapse | 84.62 |
| + RAM Coffers + DCBT | 147.54 |

8.81x speedup over stock on "obsolete" hardware.

GPT-OSS 120B (MXFP4, MoE 128 experts) — PSE v4.0.0-vcipher

| Metric | Speed |
|--------|-------|
| Prompt eval | 13.7 t/s |
| Generation | 6.0 t/s |

Running on CPU-only POWER8 S824 with 512GB RAM. vcipher prefilter active for sequences >128 tokens.

Benchmark Harness (Contributor Starter)

If you want to compare changes quickly, use this lightweight baseline procedure.

1) Capture machine topology

```sh
lscpu
numactl --hardware
```

2) Record a repeatable inference baseline

Use one fixed prompt and one fixed model build so runs are comparable.

```sh
# Example shape only; adjust binary/model path to your local setup
./main -m ./models/tinyllama-1.1b-q4_k.gguf -p "Explain NUMA routing in one paragraph" -n 128 -ngl 0
```

Record at minimum:

  • tokens/sec
  • prompt + generation lengths
  • active NUMA node affinity policy
  • whether collapse/prefetch code paths were enabled

3) Compare before/after changes

When opening a PR, include:

  • what changed
  • one baseline result
  • one post-change result
  • exact command used

This keeps performance claims falsifiable and makes review much faster.

License

MIT License - Free to use, modify, and distribute with attribution.

Citation

```bibtex
@software{boudreaux2025ramcoffers,
  author = {Boudreaux, Scott},
  title = {RAM Coffers: NUMA-Distributed Conditional Memory for LLM Inference},
  year = {2025},
  month = {12},
  day = {16},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18321905},
  url = {https://doi.org/10.5281/zenodo.18321905},
  note = {Independent research predating DeepSeek Engram (arXiv:2601.07372) by 27 days}
}

@article{boudreaux2026vecperm,
  author = {Boudreaux, Scott},
  title = {Non-Bijunctive Permutation Collapse: AltiVec vec\_perm Enables Single-Cycle Attention Path Selection},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18623920},
  url = {https://doi.org/10.5281/zenodo.18623920}
}

@article{boudreaux2026pse,
  author = {Boudreaux, Scott},
  title = {Hardware Entropy Injection for Behavioral Divergence in LLM Inference: The PSE Framework on IBM POWER8},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18623922},
  url = {https://doi.org/10.5281/zenodo.18623922}
}

@article{boudreaux2026memoryscaffolding,
  author = {Boudreaux, Scott},
  title = {Memory Scaffolding Shapes LLM Inference: How Persistent Context Changes What AI Builds},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18817988},
  url = {https://doi.org/10.5281/zenodo.18817988}
}
```

Contact

  • GitHub: Scottcjn
  • X/Twitter: @RustchainPOA

Quick Start (Code Reading)

This repository is header-focused; there is no single build script yet. A fast way to explore:

  1. Start from ggml-ram-coffers.h for the multi-bank routing path.
  2. Follow ggml-coffer-mmap.h for sharding/memory-mapping details.
  3. Read power8-compat.h + ggml-topk-collapse-vsx.h for ISA-specific optimizations.

Press and References

