An alternative language model architecture that replaces O(n²) self-attention with wave equation dynamics on continuous fields. O(n log n) complexity, within 5% of standard transformer quality.
What if language models could propagate information the way physics propagates waves — through fields, interference, and conservation laws?
| Model | Test PPL | Test Acc | Complexity | Params |
|---|---|---|---|---|
| Standard Transformer | 5.9 | 51.0% | O(n²) | ~6M |
| Wave Field V3.5 | 6.2 | 50.5% | O(n log n) | ~6M |
WikiText-2, character tokenizer, 30 epochs, same hyperparameters. Within 5% of standard transformer quality.
| Sequence Length | Standard O(n²) | Wave O(n log n) | Savings |
|---|---|---|---|
| 512 | 134M ops | 14.3M ops | 9x |
| 2,048 | 2.1B ops | 68M ops | 31x |
| 8,192 | 34B ops | 319M ops | 107x |
| 32,768 | 550B ops | 1.5B ops | 367x |
This is not a modification of an existing architecture. It's a new approach where:
- Tokens live on a continuous field — not just in discrete sequence positions
- Information propagates via damped wave equations — each attention head is a physical wave with 3 learnable parameters (frequency, damping, phase)
- Heads self-organize into roles — local grammar, medium-range context, long-range document structure
- Physics-based diagnostics — every bug from V3.0 to V3.5 was found by inspecting physical quantities (energy flow, conservation, causality), not by guessing
Each attention head is parameterized as a damped oscillation:
k(t) = exp(-α·t) · cos(ω·t + φ) for t ≥ 0 (causal)
| Parameter | Controls | What It Learns |
|---|---|---|
| ω (frequency) | Oscillation speed | Attention pattern periodicity |
| α (damping) | Decay rate | How far back to attend |
| φ (phase) | Offset | Head diversity |
Convolution is computed via FFT in O(n log n).
Input tokens
│
[Token Embedding + Sinusoidal Position Encoding]
│
[Wave Field Layer ×N]
│── Pre-norm
│── Wave Field Attention:
│ │── QKV projection
│ │── Absolute position mapping (token_i → field_pos = i × stride)
│ │── Bilinear scatter (deposit values onto continuous field)
│ │── Wave convolution via FFT — O(n log n)
│ │── Static multi-field coupling (cross-head interactions)
│ │── Content-dependent gating
│ │── Bilinear gather (read from field)
│── Pre-norm FFN (GELU)
│── Field Interference (every 3 layers)
│
[LayerNorm → Output Projection (weight-tied)]
│
Next token logits
| Feature | Transformer | Mamba | Hyena | Wave Field |
|---|---|---|---|---|
| Complexity | O(n²) | O(n) | O(n log n) | O(n log n) |
| Content-dependent | Yes (Q·K) | Yes (selective) | Yes (gating) | Yes (gating) |
| Kernel type | Learned (full) | State-space | Implicit NN | Physics wave (3 params/head) |
| Multi-scale | Arbitrary | Via channels | Via order | Wave frequencies |
| Cross-head interaction | None | None | None | Static coupling |
| Interpretability | Attention maps | Opaque | Opaque | Physics quantities |
git clone https://github.com/badaramoni/wave-field-llm.git
cd wave-field-llm
pip install -r requirements.txtfrom src import WaveFieldTransformer
model = WaveFieldTransformer(
vocab_size=8000,
embedding_dim=256,
num_layers=6,
num_heads=8,
ffn_dim=1024,
field_size=1024,
max_seq_len=256,
)
logits, loss = model(input_ids, labels=target_ids)[The president of the]
The president of the Li @-@ 28 is a rectagonal vait. It was written
by Vigada and Herlla, which has a chapel of 3.6 m (13 ft) above the south
[In the year]
In the year of the German Republic, Dinness and Chester Couz was given
to be in its first most successful tour.
[He was born in]
He was born in a category of the second half-time season, after
finishing back to London in January and February.
wave-field-llm/
├── src/
│ ├── wave_field_attention.py # Core V3.5: wave kernels, bilinear scatter/gather, coupling
│ ├── wave_field_transformer.py # Full model: layers, interference, embeddings
│ ├── causal_field_attention.py # V1/V2 field attention (historical)
│ ├── causal_field_transformer.py # V1/V2 transformer (historical)
│ └── global_context.py # O(n) global context via causal pooling
├── benchmarks/
│ ├── benchmark_wikitext2.py # WikiText-2 benchmark
│ ├── train_wave_v35_bpe.py # V3.5 + BPE training
│ └── train_100m_bpe.py # 100M parameter scaling experiment
├── diagnostics/
│ ├── diagnose_physics.py # Physics-based model diagnostics
│ └── diagnose_bpe.py # BPE tokenizer diagnostics
├── tokenizers/
│ ├── field_tokenizer_v2.py # Words-first tokenizer, zero UNK
│ ├── field_tokenizer_v3.py # V3 tokenizer with BPE support
│ └── field_aware_tokenizer.py # Co-occurrence based tokenizer
├── docs/
│ ├── WAVE_FIELD_V3.md # Full technical writeup
│ ├── BENCHMARK_RESULTS.md # All benchmark data
│ └── ARCHITECTURE.md # V1 architecture (historical)
├── tests/
│ └── test_causality.py # Causality verification
├── requirements.txt
├── LICENSE
└── README.md
This architecture went through 6 major revisions. Every bug was found through physics-based diagnostics — something no other architecture supports:
| Version | What Happened | How Diagnosed |
|---|---|---|
| V3.0 | Initial physics architecture. Beat standard transformer on Shakespeare (PPL 13.5 vs 16.5) | — |
| V3.1 | Conservation shortcut bug | Energy flow trace showed residual amplification |
| V3.1→WikiText | Future token leak (PPL 1.1, garbage generation) | Training PPL impossibly low → data leak |
| V3.2 | FFT wraparound leaking future info | Causality test showed leakage |
| V3.5 | Position shifting during generation | Kernel energy fell on empty field region |
| V3.5 | Conservation crushing sparse fields | Short sequences → conservation rescales to zero |
See docs/WAVE_FIELD_V3.md for the full technical story.
What works:
- Within 5% of standard transformer on WikiText-2 (character tokenizer, 6M params)
- Clean English generation with BPE tokenizer
- Physics-based debugging that catches bugs no profiler can find
Known gap:
- With BPE (8K vocab), there's a capacity bottleneck: Wave PPL 170.7 vs Standard PPL 91.4
- This is a model capacity issue, not an architecture flaw (proven by the 5% gap at small vocab)
- Currently scaling to 100M parameters to close this gap
What's next:
- 100M parameter training with 768-dim embeddings (addresses the vocab pressure)
- Long-context benchmarks at 4K-128K tokens (where the O(n log n) advantage matters)
- Hybrid architectures: wave attention for long-range + standard attention for local
If you use this work, please cite:
@software{wave_field_llm,
title={Wave Field LLM: Language Modeling Through Physics},
author={Badaramoni Avinash},
year={2026},
url={https://github.com/badaramoni/wave-field-llm}
}
MIT License. See LICENSE.