RAM Coffers: NUMA-Distributed Conditional Memory for LLM Inference

Author: Scott Boudreaux (BCOS Certified)
Date: December 16, 2025
Institution: Elyan Labs (Independent Research)
Hardware: IBM POWER8 S824 (320 GB RAM, dual 8-core)


Publications

| Paper | DOI | Date |
|-------|-----|------|
| RAM Coffers: NUMA-Distributed Weight Banking | 10.5281/zenodo.18321905 | Jan 2026 |
| Non-Bijunctive Permutation Collapse (vec_perm for LLM attention) | 10.5281/zenodo.18623920 | Feb 2026 |
| PSE Hardware Entropy for Behavioral Divergence (mftb injection) | 10.5281/zenodo.18623922 | Feb 2026 |
| Neuromorphic Prompt Translation (GRAIL-V, emotional prompting) | 10.5281/zenodo.18623594 | Feb 2026 |
| RustChain: One CPU, One Vote (Proof of Antiquity consensus) | 10.5281/zenodo.18623592 | Feb 2026 |
| Memory Scaffolding Shapes LLM Inference (persistent context effects) | 10.5281/zenodo.18817988 | Feb 2026 |

Abstract

This work introduces RAM Coffers, a NUMA-aware conditional memory architecture for efficient Large Language Model (LLM) inference. The system selectively houses model knowledge across distributed RAM banks with resonance-based routing, enabling O(1) knowledge retrieval without GPU dependency.

Key innovations include:

  1. NUMA-Distributed Weight Banking: Model weights partitioned across NUMA nodes by domain (e.g., core knowledge, science/tech, creative, history)

  2. Resonance Routing: Query embeddings matched to coffer domain signatures via cosine similarity for intelligent weight activation

  3. Non-Bijunctive Pruning: Selective path collapse before full weight fetch, reducing memory bandwidth requirements

  4. DCBT Resident Prefetch: PowerPC data cache block touch hints for L2/L3 residency, achieving 147+ tokens/second on POWER8
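
As a hedged illustration of item 4, the sketch below warms a weight slab into cache ahead of use. On POWER8 the real code would emit dcbt (data cache block touch); here GCC's portable `__builtin_prefetch` stands in, and the function names (`coffer_prefetch_rows`, `sum_bytes`) are illustrative, not repository API.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 128  /* POWER8 cache line size */

/* Touch each cache line of a weight slab before the compute loop.
 * __builtin_prefetch(addr, 0 = read, 3 = high temporal locality)
 * is the portable stand-in for the POWER8 dcbt hint. */
static void coffer_prefetch_rows(const void *weights, size_t bytes)
{
    const uint8_t *p = (const uint8_t *)weights;
    for (size_t off = 0; off < bytes; off += CACHE_LINE)
        __builtin_prefetch(p + off, 0, 3);
}

/* Prefetch, then consume, so the warmed lines have an actual user. */
static uint64_t sum_bytes(const uint8_t *w, size_t n)
{
    coffer_prefetch_rows(w, n);
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += w[i];
    return s;
}
```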

Architecture

| Coffer | NUMA Node | Capacity | Role                |
|--------|-----------|----------|---------------------|
| 0      | 3         | 193 GB   | Heavy/General (core)|
| 1      | 1         | 183 GB   | Science/Tech domain |
| 2      | 0         | 119 GB   | Creative/Long CTX   |
| 3      | 2         | 62 GB    | Niche/History       |
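
For orientation, the table above can be mirrored as a static descriptor array. This is a sketch only; the type and field names (`coffer_desc`, `COFFERS`) are illustrative, not taken from the headers.

```c
#include <stddef.h>

/* Descriptor per coffer, mirroring the architecture table. */
typedef struct {
    int         coffer_id;
    int         numa_node;
    size_t      capacity_gb;
    const char *role;
} coffer_desc;

static const coffer_desc COFFERS[4] = {
    { 0, 3, 193, "Heavy/General (core)" },
    { 1, 1, 183, "Science/Tech domain"  },
    { 2, 0, 119, "Creative/Long CTX"    },
    { 3, 2,  62, "Niche/History"        },
};

/* Total distributed capacity across all coffers (sums the table). */
static size_t total_capacity_gb(void)
{
    size_t total = 0;
    for (size_t i = 0; i < 4; i++)
        total += COFFERS[i].capacity_gb;
    return total;
}
```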

Processing Flow

  1. Query embed → route_to_coffer: Resonance matching selects appropriate memory bank
  2. activate_coffer → DCBT prefetch + numa_run_on_node: Thread affinity and cache warming
  3. pse_collapse_prune: Non-bijunctive path selection before full fetch
  4. Generate with PSE entropy: Hardware entropy injection from active coffer node

Relation to Subsequent Work

This architecture predates and conceptually parallels DeepSeek's "Engram" paper (arXiv:2601.07372, January 12, 2026) by 27 days. Both approaches address the same fundamental insight: separating static knowledge storage from dynamic computation enables more efficient LLM inference.

Key parallels:

  • RAM Coffers (Dec 16, 2025): "Selectively house model information in known RAM banks with resonance routing for associative recall"
  • DeepSeek Engram (Jan 12, 2026): "Separate static knowledge from dynamic compute via O(1) lookup"

GRAIL-V Paper: Emotional Prompting Discovery

Testing on this architecture led to a significant discovery: emotional language enables 20% efficiency gains in video generation, mirroring limbic gating in biological memory.

See /grail-v-paper for the full CVPR 2026 submission:

  • 35 matched-pair benchmark with LPIPS validation
  • 23.9% file size reduction in controlled ablation
  • Cross-model validation on AnimateDiff and SVD
  • Theoretical grounding via Hopfield/EBM frameworks

Key Finding: Complex multi-character emotional scenes see roughly 33% efficiency gains regardless of architecture.

Memory Scaffolding

The elyan-prime MCP server that powers the persistent memory system used during development of RAM Coffers is itself the subject of research. The paper "Memory Scaffolding Shapes LLM Inference" (DOI 10.5281/zenodo.18817988) demonstrates that persistent context (600+ memories) fundamentally changes how an LLM architects solutions — the iterative compounding that produced RAM Coffers is a direct example of this effect.


New Reader Path (5-minute orientation)

If this repository is new to you, start in this order:

  1. ggml-ram-coffers.h — high-level routing and coffer selection model
  2. ggml-coffer-mmap.h — memory mapping and NUMA shard placement
  3. ggml-topk-collapse-vsx.h — vectorized collapse path details
  4. ggml-vcipher-collapse.h — hardware AES alternative to vec_perm (NEW)
  5. power8-compat.h — ISA compatibility layer and portability constraints

Suggested first goal: trace one inference request from coffer selection to collapse execution, then compare against the performance table.

vcipher: Hardware AES as Attention Collapse Primitive (NEW - March 2026)

POWER8 ISA 2.07 includes vcipher/vcipherlast — hardware AES round instructions that execute SubBytes + ShiftRows + MixColumns + AddRoundKey in a single cycle. We repurpose these cryptographic primitives as attention collapse operators, providing capabilities impossible with vec_perm alone.

Why vcipher for Attention?

| AES Stage | Attention Analogue | vec_perm equivalent |
|-----------|--------------------|---------------------|
| SubBytes | Non-linear score ranking (S-box) | Not possible (vec_perm is linear) |
| ShiftRows | Cross-position mixing | Requires multiple permutes |
| MixColumns | Cross-head diffusion (GF(2^8) multiply) | Impossible (no finite-field math) |
| AddRoundKey | Entropy injection (XOR with mftb timebase) | Separate step needed |
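
The MixColumns row is the key capability gap: vec_perm can only rearrange bytes, while MixColumns multiplies them in GF(2^8) so every output byte mixes all four inputs. As a standalone illustration (standard AES arithmetic, not repository code):

```c
#include <stdint.h>

/* GF(2^8) doubling ("xtime"): multiply by 2 modulo the AES
 * polynomial x^8 + x^4 + x^3 + x + 1 (0x11B). */
static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
}

/* One MixColumns column, in place: each output byte depends on all
 * four input bytes — the cross-byte diffusion a pure byte permute
 * like vec_perm cannot produce. */
static void mix_column(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    uint8_t t = a0 ^ a1 ^ a2 ^ a3;
    c[0] ^= t ^ xtime(a0 ^ a1);
    c[1] ^= t ^ xtime(a1 ^ a2);
    c[2] ^= t ^ xtime(a2 ^ a3);
    c[3] ^= t ^ xtime(a3 ^ a0);
}
```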

vcipher Prefilter for Flash Attention

Two-pass approach applied to ggml_compute_forward_flash_attn_ext_f16_one_chunk():

  1. Pass 1 (O(1) per pair): vcipher_attention_score() — XOR first 16 bytes of Q and K, run through one AES round, sum output bytes. Cost: ~0.044µs per K-V pair.
  2. Pass 2 (selective): Full kq_vec_dot() only for positions above threshold (top 25%). Skips 75% of expensive dot products.

Breakeven at ~128 KV pairs. At 2048+ token contexts, saves 1,536+ full dot products per generated token.
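
The two passes can be sketched in plain C. A byte-XOR checksum substitutes here for the real vcipher round (the actual `vcipher_attention_score()` runs hardware AES); only the filter-then-fetch shape is the point, and the function names below are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

/* Pass 1 stand-in: cheap O(1) score over the first 16 bytes of Q/K.
 * The real prefilter runs one AES round on the XOR instead. */
static uint32_t cheap_score(const uint8_t *q, const uint8_t *k)
{
    uint32_t s = 0;
    for (int i = 0; i < 16; i++)
        s += (uint32_t)(q[i] ^ k[i]);
    return s;
}

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Mark the top 25% of n KV positions by cheap score in keep[];
 * returns how many survive. Pass 2 would run the full kq_vec_dot()
 * only for positions with keep[i] == 1. */
static int prefilter_top_quartile(const uint32_t *scores, int n, int *keep)
{
    uint32_t *sorted = (uint32_t *)malloc((size_t)n * sizeof(uint32_t));
    for (int i = 0; i < n; i++)
        sorted[i] = scores[i];
    qsort(sorted, (size_t)n, sizeof(uint32_t), cmp_u32);
    uint32_t thresh = sorted[(3 * n) / 4];   /* 75th percentile */
    free(sorted);

    int kept = 0;
    for (int i = 0; i < n; i++) {
        keep[i] = scores[i] >= thresh;
        kept += keep[i];
    }
    return kept;
}
```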

Benchmark: vcipher vs vec_perm (POWER8 S824)

| Operation | Cost |
|-----------|------|
| vec_perm collapse | 1.79 µs/iter |
| vcipher pattern gen | 0.016 µs/call (112x) |
| Hybrid vcipher+vec_perm | 1.90 µs/iter |
| Pure vcipher attention | 0.044 µs/score |
| Cross-head fusion | 0.006 µs/fuse |

The vcipher_attention_score() at 0.044µs is 23-230x cheaper than a full kq_vec_dot() on DK=128+ dimensions (1-10µs).

4 Operating Modes

```c
// Mode 1: Non-linear permute pattern via AES rounds
vector unsigned char pat = vcipher_generate_pattern(layer, pos, top_k);

// Mode 2: Score ranking through SubBytes non-linearity
vcipher_rank_scores(scores, n, layer, head);

// Mode 3: Cross-head diffusion via MixColumns (impossible with vec_perm)
state = vcipher_fuse_heads(state, layer, head);

// Mode 4: O(1) attention score — replaces Q·K dot product for prefiltering
uint32_t score = vcipher_attention_score(Q, K, layer, position);
```
Build

```sh
cmake .. -DCMAKE_C_FLAGS="-mcpu=power8 -mvsx -maltivec -mcrypto -DGGML_PSE_VCIPHER_PREFILTER"
```

Requires -mcrypto for __builtin_crypto_vcipher() / __builtin_crypto_vcipherlast().

Files Included

| File | Description |
|------|-------------|
| ggml-ram-coffers.h | Multi-bank NUMA weight indexing with resonance routing |
| ggml-coffer-mmap.h | GGUF model sharding across NUMA nodes |
| ggml-ram-coffer.h | Single coffer implementation |
| ggml-intelligent-collapse.h | Hebbian-inspired non-bijunctive path collapse (vec_perm) |
| ggml-topk-collapse-vsx.h | VSX-optimized Top-K attention collapse |
| ggml-vcipher-collapse.h | Hardware AES crypto collapse — vcipher alternative to vec_perm |
| ggml-pse-integration.h | Master PSE integration (v4.0.0-vcipher) |
| vcipher-flash-attn-patch.c | Flash attention inner loop patch (ops.cpp reference) |
| bench_vcipher_collapse.c | Benchmark: vcipher vs vec_perm collapse |
| pse-entropy-burst.h | Hardware entropy injection via PowerPC timebase |
| power8-compat.h | POWER9→POWER8 intrinsic compatibility layer |
| ggml-neuromorphic-coffers.h | Brain hemisphere → NUMA cognitive routing |
| ggml-symbolic-neural-bridge.h | PowerLISP ↔ neural integration |

Performance Results

On IBM POWER8 S824 with TinyLlama 1.1B Q4_K:

| Configuration | Tokens/sec (pp128) |
|---------------|--------------------|
| Stock llama.cpp | 16.74 |
| + POWER8 VSX | 66.49 |
| + PSE vec_perm Collapse | 84.62 |
| + RAM Coffers + DCBT | 147.54 |

8.81x speedup over stock on "obsolete" hardware.

GPT-OSS 120B (MXFP4, MoE 128 experts) — PSE v4.0.0-vcipher

| Metric | Speed |
|--------|-------|
| Prompt eval | 13.7 t/s |
| Generation | 6.0 t/s |

Running on CPU-only POWER8 S824 with 512GB RAM. vcipher prefilter active for sequences >128 tokens.

Benchmark Harness (Contributor Starter)

If you want to compare changes quickly, use this lightweight baseline procedure.

1) Capture machine topology

```sh
lscpu
numactl --hardware
```

2) Record a repeatable inference baseline

Use one fixed prompt and one fixed model build so runs are comparable.

```sh
# Example shape only; adjust binary/model path to your local setup
./main -m ./models/tinyllama-1.1b-q4_k.gguf -p "Explain NUMA routing in one paragraph" -n 128 -ngl 0
```

Record at minimum:

  • tokens/sec
  • prompt + generation lengths
  • active NUMA node affinity policy
  • whether collapse/prefetch code paths were enabled

3) Compare before/after changes

When opening a PR, include:

  • what changed
  • one baseline result
  • one post-change result
  • exact command used

This keeps performance claims falsifiable and makes review much faster.

License

MIT License - Free to use, modify, and distribute with attribution.

Citation

```bibtex
@software{boudreaux2025ramcoffers,
  author = {Boudreaux, Scott},
  title = {RAM Coffers: NUMA-Distributed Conditional Memory for LLM Inference},
  year = {2025},
  month = {12},
  day = {16},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18321905},
  url = {https://doi.org/10.5281/zenodo.18321905},
  note = {Independent research predating DeepSeek Engram (arXiv:2601.07372) by 27 days}
}

@article{boudreaux2026vecperm,
  author = {Boudreaux, Scott},
  title = {Non-Bijunctive Permutation Collapse: AltiVec vec\_perm Enables Single-Cycle Attention Path Selection},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18623920},
  url = {https://doi.org/10.5281/zenodo.18623920}
}

@article{boudreaux2026pse,
  author = {Boudreaux, Scott},
  title = {Hardware Entropy Injection for Behavioral Divergence in LLM Inference: The PSE Framework on IBM POWER8},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18623922},
  url = {https://doi.org/10.5281/zenodo.18623922}
}

@article{boudreaux2026memoryscaffolding,
  author = {Boudreaux, Scott},
  title = {Memory Scaffolding Shapes LLM Inference: How Persistent Context Changes What AI Builds},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18817988},
  url = {https://doi.org/10.5281/zenodo.18817988}
}
```

Contact

  • GitHub: Scottcjn
  • X/Twitter: @RustchainPOA

Quick Start (Code Reading)

This repository is header-focused; there is no single build script yet. A fast way to explore:

  1. Start from ggml-ram-coffers.h for the multi-bank routing path.
  2. Follow ggml-coffer-mmap.h for sharding/memory-mapping details.
  3. Read power8-compat.h + ggml-topk-collapse-vsx.h for ISA-specific optimizations.

Press and References

