This is the decision log for speed / quality experiments layered on top of the baseline SWA 4-chunk pipeline. Anything not in conversion/convert.py's shipping path lives here.
Legend:
- Shipping: in `convert.py` or a default path, used by pre-converted HuggingFace models.
- Prototype: runs end-to-end, numerically validated, not yet plumbed into the default pipeline or the Swift runtime.
- Shelved: implemented far enough to judge, then rejected. Reason recorded.
- Planned: designed but not implemented.
See docs/SPEED_8K.md for the overall speed roadmap and tier assignments.
- Files: `conversion/build_wfa.py`, `conversion/models/gemma4_swa_wfa.py`, `conversion/benchmark_wfa.py`
- Idea: Cap the 7 full-attention layers' KV at a fixed window `FW` (e.g. 2048) instead of the full context. Same shift-based update as the 28 sliding layers, so every layer is O(W) per step.
- Build speed: at `FW=2048`, decode throughput matches the 2K baseline (~31 tok/s) even at 8K context. Bandwidth cost is the bottleneck the knob hits.
- Reason shelved: recall drops on prompts that need attention to tokens beyond `FW`. The quality regression on long-context tasks is the exact class of failure this project refuses to absorb silently. See `docs/SPEED_8K.md §0` ("naive WFA quality NG past FW"). Kept around because DuoAttention-style per-head windowing (Tier A) reuses the same shift-cache machinery.
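
The shift-based update, and the recall failure it causes, can be illustrated with a toy cache. This is a minimal sketch, not the code in `gemma4_swa_wfa.py`: the cache stores token positions instead of K/V tensors, and `shift_update` and `decode` are hypothetical names introduced here.

```python
def shift_update(cache, entry):
    """Drop the oldest slot and append the new entry; the length never changes."""
    return cache[1:] + [entry]


def decode(num_tokens, window):
    """Run num_tokens decode steps against a fixed-size cache of `window` slots."""
    cache = [None] * window  # fixed shape, as a statically shaped graph would need
    for position in range(num_tokens):
        cache = shift_update(cache, position)
    return cache


# After 10 tokens with a window of 4, only the last 4 positions remain visible.
print(decode(10, 4))
```

Each step touches a constant number of slots regardless of context length, which is where the flat decode throughput comes from. The same property is the quality problem: once a position has been shifted out, no later token can attend to it.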
- Files: `conversion/build_flash.py`, `conversion/models/gemma4_swa_flash.py`
- Idea: Split `Q @ K^T` along the K dimension into fixed-size chunks (default 1024) and recombine with an online softmax. Mathematically identical to standard attention within fp16 rounding (cosine > 0.999999 vs standard attention on test tensors).
- Status: builder works, outputs validated. Not integrated into the Swift runtime because the expected win on ANE is small: ANE is not SRAM-bound for 8K × 512 K/V (1 MB per chunk fits easily in 32 MB SRAM), so tiling buys less than it does on GPU. Keep for future much-longer-context variants.
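
The online-softmax recombination can be sketched for a single query vector in plain Python. This is an illustration of the general technique under simplifying assumptions (one head, unscaled dot-product scores, float64 lists instead of fp16 tensors); the function names are hypothetical and it is not the graph that `build_flash.py` emits.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def standard_attention(q, keys, values):
    """Reference: softmax over all scores at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def chunked_attention(q, keys, values, chunk_size=1024):
    """Process K/V in chunks, keeping a running max, running sum and accumulator."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), chunk_size):
        scores = [dot(q, k) for k in keys[start:start + chunk_size]]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old max to the new max.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + chunk_size]):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because the rescale step keeps the accumulator consistent with the current maximum, the chunked result should agree with the reference up to floating-point rounding for any chunk size, including one that does not divide the sequence length evenly.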
- File: `conversion/models/gemma4_swa_merged.py`
- Idea: Merge chunks 2 and 3 (L8–24) so `kv13`/`kv14` are produced and consumed on the ANE side without a round-trip to the Swift runtime. Saves 32 MB of fp16 KV I/O per decode step.
- Status: builder works, not yet benched on-device. Trade-off vs current 4-chunk split: the merged chunk is larger, which may push the ANE compiler past the ~15-layers-per-chunk stability line we hit elsewhere. Measure before shipping.
- Files: `conversion/models/gemma4_lite_chunks.py`, `conversion/models/gemma4_lite_wrapper.py`
- Idea: Two chunks (embedding + L0–14) and (L15–34 + LM head). Fewer chunks = less per-step overhead, but risks the ANE compiler stability ceiling.
- Status: kept as a fallback in case the 4-chunk split has problems on a new OS release. Not the default. Note: predates SWA, uses ctx-sized KV cache for every layer — not suitable for shipping.
- Files: `conversion/models/gemma4_swa_merged2.py` (2-chunk), `conversion/models/gemma4_swa_merged1.py` (1-chunk), `conversion/build_merged_chunks.py`, `conversion/test_merged_parity.py`, `Sources/CoreMLLLM/ChunkedEngine.swift` (auto-detect + dispatch), `docs/CHUNK_CONSOLIDATION_BENCH.md`
- Motivation: `docs/BASELINE_SPEED_AUDIT.md` shows 4×2.3 ms dispatch overhead per step. Halving dispatches ≈ +14 tok/s on the 2K decode path.
- Idea:
  - 2-chunk: `MergedChunk12` (L0-14 + PLE, owns KV) -> `MergedChunk34` (L15-34 + norm + LM head). Reuses `_run_layer_swa` from the shipping builder, so layer math is byte-identical; only the composition differs.
  - 1-chunk: `MergedChunk1` (all 35 layers, PLE, norm, LM head) in a single graph. kv13/kv14 stay internal so the Swift runtime never materialises them.
- Runtime: `ChunkedEngine` auto-detects layout by file presence; falls back to the 4-chunk path if merged files are missing. Merged layouts allocate their own 12×W + 3×ctx KV buffers. Prefill / speculative-verify paths keep the 4-chunk split (dispatch savings there are negligible and their shapes differ).
- Risk (not yet exercised on device): the ANE compiler stability ceiling. 15-layer merged chunk1 is near the line we hit in the WFA experiment; 35-layer merged_full is almost certainly past it. `ComputePlanAudit` gates shipping. See the bench doc for pass/ship criteria.
- Status as of 2026-04-15: builder, parity test and runtime wiring landed. Numerical parity on CPU-torch unverified (requires HF weights) but the merged forward paths re-use the reference `_run_layer_swa`, so failure would indicate a composition bug, not a math bug. Device bench pending.
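
The detect-by-file-presence rule with a 4-chunk fallback can be sketched as follows. The real logic lives in `ChunkedEngine.swift`; this Python version is only an illustration, and the file names, the layout labels and the preference order (1-chunk, then 2-chunk, then 4-chunk) are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical file names, for illustration only.
LAYOUTS = [
    ("merged-1chunk", ("merged_full.mlmodelc",)),
    ("merged-2chunk", ("merged_chunk12.mlmodelc", "merged_chunk34.mlmodelc")),
    ("4-chunk", ("chunk1.mlmodelc", "chunk2.mlmodelc", "chunk3.mlmodelc", "chunk4.mlmodelc")),
]


def pick_layout(present_files):
    """Return the first layout whose files are all present."""
    present = set(present_files)
    for name, required in LAYOUTS:
        if all(f in present for f in required):
            return name
    raise FileNotFoundError("no complete chunk layout found")


def detect_layout(model_dir):
    """Inspect a model directory and choose a layout from the files it contains."""
    return pick_layout(p.name for p in Path(model_dir).iterdir())
```

Requiring every file of a layout before selecting it means a half-copied merged build falls back to the 4-chunk path instead of failing at load time.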
- File: `conversion/models/gemma4_stateless_chunks.py`
- Idea: Explicit KV input/output tensors instead of Apple's `MLState` API. Chosen over the Monolithic/`MLState` path because `MLState` introduces int64 state indices that break ANE placement on the model sizes we ship (see `docs/CONVERSION.md` on "Explicit KV I/O").
- This is what the default Gemma 4 E2B conversion actually produces.
- File: `conversion/exporter.py :: _quantize_model`
- Chosen for weight-only compression. See `docs/CONVERSION.md` "Quantization" section for the size/quality/latency trade-off and why not INT8 or FP16.
- File: `conversion/build_w8a8.py` (mode `w8`)
- Idea: symmetric per-channel INT8. Smaller than FP16, larger than INT4. Zero latency gain on ANE (ANE is FP16 internally, so weight-only INT8 gives size only, same as INT4 but at double the storage).
- Status: shelved as a shipping option for size reasons, kept as a baseline for measuring INT4's quality cost. Not useful standalone.
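
For reference, symmetric per-channel INT8 quantization can be sketched in a few lines. This is a generic illustration of the scheme, assuming one scale per output channel (row) and a clamp to [-127, 127]; the function names are hypothetical and it is not the coremltools code path that `build_w8a8.py` calls.

```python
def quantize_per_channel_int8(weight):
    """weight: list of rows, one row per output channel. Returns (int_rows, scales)."""
    int_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(x) for x in row)
        # Symmetric: zero maps to zero, the largest magnitude maps to +/-127.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        int_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return int_rows, scales


def dequantize(int_rows, scales):
    """Expand back to floats, as a runtime that computes in FP16 would do."""
    return [[q * s for q in row] for row, s in zip(int_rows, scales)]
```

The dequantize step is why weight-only INT8 changes storage but not compute: the values are expanded back to floating point before any matmul runs.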
- File: `conversion/build_w8a8.py` (mode `w8a8`)
- Idea: activation + weight INT8. This is the only quantization that unlocks ANE's INT8×INT8 compute path (~1.3–1.6× per Apple's ResNet-50 docs).
- Calibration used 5 random samples. Quality regressed visibly on chat outputs.
- Reason shelved: insufficient calibration. Superseded by `build_w8a8_proper.py`.
- File: `conversion/build_w8a8_proper.py`
- Idea: collect real activation traces by running the INT4 model on 32+ prompts at positions 0..31, then quantize. Also provides a W4A8 fallback (INT4 palette weights + INT8 activations) which is more stable than full W8A8 in practice.
- Outcome: Mac Studio M4 Max compiles and runs but shows 0% speedup (ANE still runs FP16 internally). iPhone 17 Pro fails with `ANECCompile() FAILED`: the `quantize`/`dequantize` MIL ops that `coremltools.optimize.coreml.linear_quantize_activations` inserts are not compilable by the iPhone ANE compiler. See `docs/SPEED_8K.md §1 A2 / Tier D`. No INT8 path reaches ANE on iOS 26 / coremltools 9.0.
- Side finding: `linear_quantize_activations` leaks ~1 temp `.mlpackage` per calibration op-group (cleanup is tied to `atexit` rather than `__del__`), ~38 GB per Gemma 4 E2B chunk2 calibration run. That is enough to fill a disk. Worth filing upstream if anyone revives this path.
- File: `conversion/build_flash.py --nbits 2`
- Idea: Apple's own 3.18B model ships W2 palettized weights (per Foundation Models 2025 tech report). Reduce weight bandwidth 8× vs FP16 (2× vs INT4). Post-training palettization via `OpPalettizerConfig(nbits=2, granularity="per_grouped_channel", group_size=32)`. Expected ×1.4–2.0 decode speedup from bandwidth reduction.
- Size: chunk total 546 MB (vs 1,091 MB INT4, i.e. ≈50%). Conversion succeeds, no compile errors.
- Quality: complete gibberish. Multilingual garbage tokens, zero coherent output on 3 test prompts (France capital, photosynthesis, haiku). Tested via `conversion/smoke_w2_quality.py` with real INT8 embeddings + RoPE, autoregressive generation on Mac CPU.
- Root cause: 4 codewords per group (2-bit) has insufficient representational capacity for post-training palettization. Apple sustains quality at W2 via QAT (Quantization-Aware Training): their shipping recipe includes quantization-aware fine-tuning, not post-training compression. Post-training W2 is not viable without QAT (days of GPU training).
- File: `conversion/build_flash.py --nbits 3`
- Idea: 8 codewords per group might survive post-training where 4 did not. Weight size = 75% of INT4.
- Size: chunk total 818 MB (vs 1,091 MB INT4 = 75%).
- Quality: still gibberish, though pattern differs from W2 — repetitive underscore-separated fragments instead of multilingual noise. Not usable.
- Conclusion: post-training palettization quality cliff is between 3-bit and 4-bit for Gemma 4 E2B. Only ≥4-bit palettization (already shipping) works without QAT. Sub-4-bit requires QAT or knowledge distillation, which is a multi-day GPU effort outside current scope.
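
What "codewords per group" means can be made concrete with a toy palettizer: every weight in a group is replaced by an index into a lookup table of `2**nbits` values. The sketch below uses plain k-means (Lloyd) refinement on a synthetic group; the function names and the data are assumptions for illustration, and it does not reproduce the coremltools implementation or the Gemma results above.

```python
import math


def nearest(value, lut):
    """Index of the LUT entry closest to value."""
    return min(range(len(lut)), key=lambda i: abs(value - lut[i]))


def palettize_group(weights, nbits, iters=20):
    """Return (lut, indices) for one group of weights sharing a 2**nbits-entry LUT."""
    if nbits < 1:
        raise ValueError("nbits must be at least 1")
    k = 2 ** nbits
    lo, hi = min(weights), max(weights)
    if lo == hi:
        return [lo], [0] * len(weights)
    # Start from evenly spaced codewords, then refine with Lloyd (k-means) steps.
    lut = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        indices = [nearest(w, lut) for w in weights]
        for c in range(k):
            members = [w for w, i in zip(weights, indices) if i == c]
            if members:  # unused codewords stay where they are
                lut[c] = sum(members) / len(members)
    return lut, [nearest(w, lut) for w in weights]


def reconstruction_mse(weights, lut, indices):
    return sum((w - lut[i]) ** 2 for w, i in zip(weights, indices)) / len(weights)


# Synthetic group, for illustration only.
group = [0.1 * math.sin(1.7 * i) for i in range(64)]
for nbits in (2, 3, 4):
    lut, idx = palettize_group(group, nbits)
    print(nbits, len(lut), reconstruction_mse(group, lut, idx))
```

At 2 bits every weight in the group collapses onto one of four values, whatever the original distribution looked like. Reconstruction error on a toy group says nothing about generation quality by itself, but it shows how little room a 4-entry table leaves.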
- Files: `conversion/collect_eagle_hidden_states.py`, `conversion/download_eagle_corpus.py`, `conversion/train_eagle_draft.ipynb`, `conversion/train_eagle3_draft.ipynb` (notebooks are untracked while actively iterated)
- Idea: train a small decoder-layer draft model on Gemma 4 E2B hidden states; verify in-graph against the target.
- Training corpus: WikiText + C4 + Alpaca + Dolly + CodeAlpaca + UltraChat, formatted with Gemma 4's chat template. ~50 k samples.
- Current acceptance metric (acc0) tracked in notebook; see `docs/SPEED_8K.md §3 P1` for the latest snapshot (dated, not live).
- Integration path: once trained, draft model → `build_speculative.py` → paired with verify chunks.
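
The draft-then-verify step can be sketched with greedy acceptance: keep draft tokens while they match the target model's own choice, and emit the target's token at the first mismatch. This is the generic greedy scheme from the speculative decoding literature, written as an assumption about the shape of the verify path; `accept_draft` is a hypothetical name and this is not the code in `build_speculative.py`.

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy verification of a draft burst.

    target_tokens[i] is the target model's argmax at the position where
    draft_tokens[i] was proposed. It carries one extra entry: the target's
    prediction after the last draft token.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            emitted.append(wanted)  # correction token; the rest of the draft is discarded
            return emitted
        emitted.append(drafted)
    emitted.append(target_tokens[-1])  # all accepted: the extra target token comes for free
    return emitted
```

Every verify call emits at least one token, so a poor draft model costs extra compute per step but never changes the output. The speedup depends entirely on how many draft tokens survive.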
- Files: `conversion/train_medusa_heads.py`, parts of `conversion/build_speculative.py`
- Idea: 3 lightweight ResBlocks predict the next 3 tokens from the final hidden state; verify with target model.
- Reason shelved: published Medusa acceptance on Gemma-class models is ~1.3 %, far below EAGLE-3's 50–70 %. Confirmed on a small internal run. Kept because `build_speculative.py` reuses the same verify-chunk plumbing for EAGLE-3.
- Files: `conversion/models/gemma4_stateful_chunks.py`, `conversion/build_stateful.py`
- Idea: use `coremltools.StateType` (iOS 18+) to declare KV cache as internal model state, eliminating per-dispatch IOSurface round-trip. Per FUNDAMENTAL_UNTRIED.md §2 / PRIORITY_ROADMAP.md Phase 1 item 2. Expected ×1.3–2.0 if dispatch overhead is the bottleneck.
- Implementation: StatefulChunk2 (L8-14) with `kv_sliding` (10,1,512,512) and `kv_full` (4,1,8192,512) as `ct.StateType`. CoreML conversion succeeds, INT4 palettized, 134 MB.
- Result: `error code: -14` ("Failed to build the model execution plan") on both Mac ANE and iPhone 17 Pro ANE. The `coreml_update_state` op does not generate a valid ANE execution plan.
- Root cause: MLState is GPU-only on current hardware/OS. HuggingFace's WWDC24 Mistral CoreML reference explicitly states stateful KV is "excellent for GPUs on Mac computers" and ANE requires "additional adaptations." Apple's own on-device Llama 3.1 and smpanaro/coreml-llm-cli both use stateless explicit-I/O KV, not MLState. The `coreml_update_state` MIL op is not supported by the ANE compiler as of iOS 26 / coremltools 9.0.
- Conclusion: dispatch-overhead hypothesis (FUNDAMENTAL_UNTRIED.md §0) remains valid as a bottleneck description, but MLState cannot address it on ANE. Alternative paths to reduce dispatch overhead: (1) chunk consolidation (4→2 chunks), (2) speculative decoding (amortize dispatch across multiple tokens per burst).
- Files: `conversion/models/gemma4_stateful_padded.py`, `conversion/build_stateful_padded.py` (on branch `worktree-agent-ad21e314`, not in main). Findings in `docs/SPLIT_ROTATE_FINDINGS.md` and `docs/SPLIT_ROTATE_BENCH.md`.
- Idea: hypothesis that error -14 was caused by the non-mod-32 `num_kv_heads=1` dim being rejected by ANE's 32-wide tile scheduler. Probe: pad KV heads 1→32 in both stateless (PaddedKVChunk2) and stateful (StatefulPaddedChunk2) variants.
- Part 1 result: PyTorch parity between `StatelessChunk2` and `PaddedKVChunk2` is bit-exact on all finite outputs; padded heads stay zero as expected.
- Part 2 result: `build_stateful_padded.py --ctx 512 --nbits 0 --smoke-test` → conversion emits the same `error code: -14` warning at save time; Mac CPU_ONLY predict returns finite outputs in 49 ms (graph is runtime-valid on CPU). Device retry would reproduce -14 on ANE.
- Conclusion: 32-alignment is not the cause of error -14. The ANE compiler does not schedule `coreml_update_state` regardless of the padded tensor widths. ANEMLL's `--split-rotate` is a separate multi-function-loading workaround, not an alignment fix. No further MLState-on-ANE work is warranted; any stateful path must target GPU (CPU_AND_GPU) as the WWDC24 Mistral demo does.
- Files: `Sources/CoreMLLLM/SuffixTree.swift`, `Sources/CoreMLLLM/SuffixDecoding.swift` (on branch `claude/suffix-decoding-impl`, not merged to main)
- Idea: build suffix tree from all prior model outputs, draft K tokens via CPU trie lookup (~20µs), verify with Q=K ANE verifier. Paper reports 1.9-5.3× on chat/agentic workloads (NeurIPS 2025 Spotlight, arXiv 2411.04975).
- T=1 instrumentation results on iPhone 17 Pro (4 multi-turn generations):
  - hit rate: 2% → 29% → 48% (climbs as tree grows, as expected)
  - T1 accuracy: 18.4% (of hits, how often the top-1 draft matches model output)
  - tree: 10k nodes after 4 generations
- Performance overhead: insert at end-of-generation blocks next generation start (~1-2s for 500 token sequence). Draft lookup adds ~2 tok/s overhead even at 4th-token sampling. Async insert + NSLock fixes the blocking but doesn't eliminate lookup cost.
- Assessment: T1=18% is too low to be the primary speculative method. EAGLE-3 (acc0=75%, workload-independent) is strictly better as the main draft source. SuffixDecoding is workload-dependent — the paper's high numbers come from production workloads with stable system prompts, RAG, and code editing patterns. Random diverse chat is the worst case.
- Decision: demote to auxiliary draft source. Use suffix tree when it has a high-confidence match (e.g., count > threshold), fall back to EAGLE-3 otherwise. Don't build Q=K verifier specifically for SuffixDecoding — build it for EAGLE-3, and SuffixDecoding can reuse it later.
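
The "draft only on a high-confidence match, otherwise fall back" rule can be sketched with a depth-limited suffix trie that stores occurrence counts. This is a simplified stand-in for the Swift implementation: the class name, `max_depth` and `min_count` are hypothetical, and an empty result is the signal for the caller to use another draft source such as EAGLE-3.

```python
class SuffixIndex:
    """Depth-limited suffix trie over token sequences, with occurrence counts."""

    def __init__(self, max_depth=8):
        self.max_depth = max_depth
        self.root = {"count": 0, "children": {}}

    def insert(self, tokens):
        # Index every suffix of the sequence, truncated to max_depth tokens.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node["children"].setdefault(tok, {"count": 0, "children": {}})
                node["count"] += 1

    def _find(self, pattern):
        node = self.root
        for tok in pattern:
            node = node["children"].get(tok)
            if node is None:
                return None
        return node

    def draft(self, context, k, min_count=2):
        """Propose up to k tokens, or [] when no continuation is frequent enough."""
        # Prefer the longest suffix of the context that has been seen before.
        for length in range(min(len(context), self.max_depth - 1), 0, -1):
            node = self._find(context[-length:])
            if node is None:
                continue
            out = []
            while len(out) < k and node["children"]:
                tok, child = max(node["children"].items(), key=lambda kv: kv[1]["count"])
                if child["count"] < min_count:
                    break  # low confidence: stop drafting here
                out.append(tok)
                node = child
            if out:
                return out
        return []
```

Raising `min_count` trades hit rate for accuracy, which is the knob the decision above relies on: the suffix source only speaks when it has seen the same continuation more than once.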
- See `docs/SPEED_8K.md §1 A3 / A4`. Requires block-static top-k redesign to stay on ANE. Not started.
- See `docs/SPEED_8K.md §1 A1 / §3 P3`. Offline head classification + two KV banks per layer. High ROI, training-free at inference time. Next candidate after EAGLE-3 lands.
- File: `Sources/CoreMLLLM/ComputePlanAudit.swift`
- Idea: use `MLComputePlan.deviceUsage(for:)` (iOS 17.4+) to walk every MIL op in chunk1-4 and identify any op whose preferred device is not Neural Engine. Per UNEXPLORED_APPROACHES_V2.md §G2 / PRIORITY_ROADMAP.md Phase 0a.
- Result on iPhone 17 Pro (8K chunks):
  - chunk1 (L0-7): 0 compute ops on CPU/GPU. All matmul/conv/softmax/attention on ANE.
  - chunk2 (L8-14): 0 compute ops on CPU/GPU. Same.
  - chunk3 (L15-24): 0 compute ops on CPU/GPU. 1 `identity` on unknown (no-op).
  - chunk4 (L25-34 + LM head): 8 compute ops on CPU, which together form the entire `InModelArgmax` tail: `mul` (×2), `tanh`, `squeeze` (×2), `reduce_argmax`, `expand_dims`, `gather_along_axis`. Total estimated cost = 0.0028. This is the tanh-based softargmax + gather pipeline from `ane_ops.py::InModelArgmax`.
  - All `constexpr_lut_to_dense` (INT4 depalettization) and `const` ops reported as "unknown" device. These are weight-loading ops that run once at model load, not per decode step.
- Interpretation: the only per-step CPU fallback is chunk4's argmax tail (~8 ops, ~0.5-2 ms ANE↔CPU round-trip). chunk1-3 are 100% ANE. This confirms the dispatch-overhead hypothesis from FUNDAMENTAL_UNTRIED.md §0: the bottleneck is not individual ops falling to CPU, but the 4× per-step IOSurface round-trip between chunks.
- Actionable: (1) InModelArgmax CPU fallback is small but fixable — rewrite to ANE-compatible ops or accept the ~1-3% cost. (2) No other compute ops need fixing. (3) MLState (stateful KV to eliminate IOSurface round-trips) remains the highest-leverage next step.
- File: `conversion/prune_vocab.py`
- Idea: the Gemma 4 vocab is 262 K tokens. Embedding + LM-head weights dominate on-device size. Analysis shows large blocks (rare scripts, emoji variants) are near-never used for English chat.
- Reason abandoned: (1) Gemma's tokenizer is sentencepiece, so dropping tokens changes BPE merges and breaks round-trip tokenization; (2) the `gemma4_lite_wrapper.py` route (external per-layer embedding in Swift) already reclaimed the main memory win (~40 %). Keep the analysis as reference.
- File: `conversion/rebuild_chunk4_8k.py`
- Purpose: regenerate `chunk4` with `causal_mask_full` at `(1,1,1,8192)` when the earlier build shipped with a 2048-sized mask. This is the kind of silent mismatch the `ChunkedEngine` auto-detection (see commit `4311991`) now guards against at load time.
- Pick a short name (e.g. `wfa`, `flash`, `merged`).
- Put the model variant in `conversion/models/gemma4_<name>.py`.
- Put the builder in `conversion/build_<name>.py`. Keep it runnable standalone (`python build_<name>.py --output ./output/<name>`).
- Put the A/B benchmark in `conversion/benchmark_<name>.py` comparing against the shipping baseline on the same prompts.
- Write one row in this file: what you tried, status after the first real measurement, reason if shelved.
Rule: every experiment that gets shelved gets a one-paragraph obituary here with the numeric reason. Future-us will otherwise rebuild the same thing.
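
A standalone builder in the shape described above might start from a skeleton like this. It is a hedged sketch: only the `--output` flag comes from this document, while `parse_args`, `build` and `main` are hypothetical names and the build step is a placeholder with no conversion logic.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Build one experiment variant.")
    parser.add_argument("--output", type=Path, required=True,
                        help="directory that receives the built artifacts")
    return parser.parse_args(argv)


def build(output_dir):
    """Placeholder build step: a real builder would convert and save chunks here."""
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


def main(argv=None):
    args = parse_args(argv)
    out = build(args.output)
    print(f"wrote artifacts to {out}")
    return 0
```

In an actual `build_<name>.py`, the usual `if __name__ == "__main__":` guard calling `main()` would close the file so the script runs as `python build_<name>.py --output ./output/<name>`.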