Experiments — What We Tried, What Shipped, What Didn't

This is the decision log for speed / quality experiments layered on top of the baseline SWA 4-chunk pipeline. Anything not in conversion/convert.py's shipping path lives here.

Legend:

  • Shipping — in convert.py or a default path, used by pre-converted HuggingFace models.
  • Prototype — runs end-to-end, numerically validated, not yet plumbed into the default pipeline or the Swift runtime.
  • Shelved — implemented far enough to judge, then rejected. Reason recorded.
  • Planned — designed but not implemented.

See docs/SPEED_8K.md for the overall speed roadmap and tier assignments.


Attention-variant experiments (8K context)

WFA — Windowed Full Attention — Shelved

  • Files: conversion/build_wfa.py, conversion/models/gemma4_swa_wfa.py, conversion/benchmark_wfa.py
  • Idea: Cap the 7 full-attention layers' KV at a fixed window FW (e.g. 2048) instead of the full context. Same shift-based update as the 28 sliding layers, so every layer is O(W) per step.
  • Speed: at FW=2048, decode throughput matches the 2K baseline (~31 tok/s) even at 8K context, because KV read bandwidth is the bottleneck this knob targets.
  • Reason shelved: recall drops on prompts that need attention to tokens beyond FW. The quality regression on long-context tasks is the exact class of failure this project refuses to absorb silently. See docs/SPEED_8K.md §0 ("naive WFA quality NG past FW"). Kept around because DuoAttention-style per-head windowing (Tier A) reuses the same shift-cache machinery.
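
The shift-based update mentioned above can be sketched as follows. This is a minimal pure-Python illustration, not the code in gemma4_swa_wfa.py: the names shift_update and run_window, and the string placeholders standing in for K/V tensors, are assumptions made for the example.

```python
def shift_update(window, new_entry):
    """Drop the oldest slot and append the newest, keeping the length fixed."""
    return window[1:] + [new_entry]


def run_window(window_size, num_steps):
    """Feed num_steps entries through a fixed-size window and return the final state."""
    window = [None] * window_size  # empty slots before any token is seen
    for pos in range(num_steps):
        # "kv{pos}" is a placeholder for the K/V tensors of token `pos`.
        window = shift_update(window, f"kv{pos}")
    return window


final = run_window(window_size=4, num_steps=6)
print(final)  # ['kv2', 'kv3', 'kv4', 'kv5']
```

Per-step work depends only on the window size, which is why every layer becomes O(W). It is also why entries older than the window (kv0 and kv1 here) are no longer reachable, which is the recall failure described above.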

Flash Decoding — Prototype

  • Files: conversion/build_flash.py, conversion/models/gemma4_swa_flash.py
  • Idea: Split Q @ K^T along the key (sequence) dimension into fixed-size chunks (default 1024) and recombine with an online softmax. Mathematically equivalent to standard attention; outputs agree within fp16 rounding (cosine > 0.999999 vs standard attention on test tensors).
  • Status: builder works, outputs validated. Not integrated into the Swift runtime because the expected win on ANE is small — ANE is not SRAM-bound for 8K × 512 K/V (1 MB per chunk fits easily in 32 MB SRAM), so tiling buys less than it does on GPU. Keep for future much-longer-context variants.
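
For readers unfamiliar with the online-softmax recombination, here is a small single-query sketch in pure Python. It is not taken from gemma4_swa_flash.py: the function names are illustrative, the usual 1/sqrt(d) scaling is omitted, and it runs in float64 rather than fp16, so it only demonstrates that the chunked and one-pass results agree.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Single-query softmax attention computed in one pass."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_chunked(q, keys, values, chunk_size):
    """Same result, visiting keys/values one chunk at a time (online softmax)."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [dot(q, k) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first chunk).
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_chunk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
q = [rng.gauss(0, 1) for _ in range(8)]
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
values = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
reference = attention_reference(q, keys, values)
chunked = attention_chunked(q, keys, values, chunk_size=16)  # last chunk is ragged
max_diff = max(abs(a - b) for a, b in zip(reference, chunked))
```

Only the running maximum, running sum and accumulator carry over between chunks, so the working set per step is one chunk of K/V rather than the whole context.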

Merged chunk2+chunk3 — Prototype

  • File: conversion/models/gemma4_swa_merged.py
  • Idea: Merge chunks 2 and 3 (L8–24) so kv13 / kv14 are produced and consumed on the ANE side without a round-trip to the Swift runtime. Saves 32 MB of fp16 KV I/O per decode step.
  • Status: builder works, not yet benched on-device. Trade-off vs current 4-chunk split: the merged chunk is larger, which may push the ANE compiler past the ~15-layers-per-chunk stability line we hit elsewhere. Measure before shipping.

Lite 2-chunk variant — Prototype

  • Files: conversion/models/gemma4_lite_chunks.py, conversion/models/gemma4_lite_wrapper.py
  • Idea: Two chunks (embedding + L0–14) and (L15–34 + LM head). Fewer chunks = less per-step overhead, but risks the ANE compiler stability ceiling.
  • Status: kept as a fallback in case the 4-chunk split has problems on a new OS release. Not the default. Note: predates SWA, uses ctx-sized KV cache for every layer — not suitable for shipping.

SWA 2-chunk / 1-chunk consolidation — Prototype (2026-04-15)

  • Files: conversion/models/gemma4_swa_merged2.py (2-chunk), conversion/models/gemma4_swa_merged1.py (1-chunk), conversion/build_merged_chunks.py, conversion/test_merged_parity.py, Sources/CoreMLLLM/ChunkedEngine.swift (auto-detect + dispatch), docs/CHUNK_CONSOLIDATION_BENCH.md
  • Motivation: docs/BASELINE_SPEED_AUDIT.md shows 4×2.3 ms dispatch overhead per step. Halving dispatches ≈ +14 tok/s on the 2K decode path.
  • Idea:
    • 2-chunk: MergedChunk12 (L0-14 + PLE, owns KV) -> MergedChunk34 (L15-34 + norm + LM head). Reuses _run_layer_swa from the shipping builder, so layer math is byte-identical — composition only.
    • 1-chunk: MergedChunk1 (all 35 layers, PLE, norm, LM head) in a single graph. kv13/kv14 stay internal so the Swift runtime never materialises them.
  • Runtime: ChunkedEngine auto-detects layout by file presence; falls back to the 4-chunk path if merged files are missing. Merged layouts allocate their own 12×W + 3×ctx KV buffers. Prefill / speculative-verify paths keep the 4-chunk split (dispatch savings there are negligible and their shapes differ).
  • Risk (not yet exercised on device): the ANE compiler stability ceiling. 15-layer merged chunk1 is near the line we hit in the WFA experiment; 35-layer merged_full is almost certainly past it. ComputePlanAudit gates shipping. See the bench doc for pass/ship criteria.
  • Status as of 2026-04-15: builder, parity test and runtime wiring landed. Numerical parity on CPU-torch unverified (requires HF weights) but the merged forward paths re-use the reference _run_layer_swa, so failure would indicate a composition bug, not a math bug. Device bench pending.
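
The "auto-detect by file presence, fall back to 4-chunk" behaviour can be pictured with the sketch below. The real logic lives in ChunkedEngine.swift; this Python version is only an illustration, and the file names as well as the "most merged layout wins" priority order are assumptions, not facts from the runtime.

```python
from pathlib import Path
import tempfile

# Hypothetical file names; the real ones are whatever the builder emits.
LAYOUTS = [
    ("1-chunk", ["merged_full.mlmodelc"]),
    ("2-chunk", ["merged_chunk12.mlmodelc", "merged_chunk34.mlmodelc"]),
    ("4-chunk", ["chunk1.mlmodelc", "chunk2.mlmodelc",
                 "chunk3.mlmodelc", "chunk4.mlmodelc"]),
]


def detect_layout(model_dir):
    """Return the first layout whose files are all present (assumed priority order)."""
    model_dir = Path(model_dir)
    for name, files in LAYOUTS:
        if all((model_dir / f).exists() for f in files):
            return name
    raise FileNotFoundError(f"no complete chunk layout in {model_dir}")


with tempfile.TemporaryDirectory() as tmp:
    for f in LAYOUTS[2][1]:
        (Path(tmp) / f).touch()
    only_four = detect_layout(tmp)       # only the 4-chunk files exist
    (Path(tmp) / "merged_chunk12.mlmodelc").touch()
    half_merged = detect_layout(tmp)     # incomplete merged set: still 4-chunk
    (Path(tmp) / "merged_chunk34.mlmodelc").touch()
    fully_merged = detect_layout(tmp)    # complete merged set wins
```

Requiring every file of a layout before selecting it is what makes a partially copied merged build degrade to the 4-chunk path instead of failing at load.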

Stateless 4-chunk (no MLState) — Shipping

  • File: conversion/models/gemma4_stateless_chunks.py
  • Idea: Explicit KV input/output tensors instead of Apple's MLState API. Chosen over the Monolithic/MLState path because MLState introduces int64 state indices that break ANE placement on the model sizes we ship (see docs/CONVERSION.md on "Explicit KV I/O").
  • This is what the default Gemma 4 E2B conversion actually produces.

Quantization experiments

INT4 palettization, group_size=32 — Shipping

  • File: conversion/exporter.py :: _quantize_model
  • Chosen for weight-only compression. See docs/CONVERSION.md "Quantization" section for the size/quality/latency trade-off and why not INT8 or FP16.

INT8 weight-only — Prototype

  • File: conversion/build_w8a8.py (mode w8)
  • Idea: symmetric per-channel INT8. Smaller than FP16, larger than INT4. Zero latency gain on ANE (ANE is FP16 internally, so weight-only INT8 gives size only — same as INT4 but at double the storage).
  • Status: shelved as a shipping option for size reasons, kept as a baseline for measuring INT4's quality cost. Not useful standalone.
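
As a reminder of what "symmetric per-channel INT8" means, here is a minimal pure-Python sketch. It is not the code in build_w8a8.py: the function names are illustrative and the example weights are made up.

```python
def quantize_per_channel(weights):
    """Symmetric INT8: one scale per output channel (row), zero point fixed at 0."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate float weights from the integers and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]  # made-up example rows
q_rows, scales = quantize_per_channel(weights)
restored = dequantize(q_rows, scales)
```

Each row gets its own scale, so a channel with small weights (the second row) keeps the full integer range instead of being crushed by the largest channel.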

W8A8 (naive calibration) — Shelved

  • File: conversion/build_w8a8.py (mode w8a8)
  • Idea: activation + weight INT8. This is the only quantization that unlocks ANE's INT8×INT8 compute path (~1.3–1.6× per Apple's ResNet-50 docs).
  • Calibration used 5 random samples. Quality regressed visibly on chat outputs.
  • Reason shelved: insufficient calibration. Superseded by build_w8a8_proper.py.

W8A8 (realistic calibration) — Rejected (2026-04-13)

  • File: conversion/build_w8a8_proper.py
  • Idea: collect real activation traces by running the INT4 model on 32+ prompts at positions 0..31, then quantize. Also provides a W4A8 fallback (INT4 palette weights + INT8 activations) which is more stable than full W8A8 in practice.
  • Outcome: Mac Studio M4 Max compiles and runs but 0% speedup (ANE still runs FP16 internally); iPhone 17 Pro ANECCompile() FAILED — the quantize/dequantize MIL ops that coremltools.optimize.coreml.linear_quantize_activations inserts are not compilable by the iPhone ANE compiler. See docs/SPEED_8K.md §1 A2 / Tier D. No INT8 path reaches ANE on iOS 26 / coremltools 9.0.
  • Side finding: linear_quantize_activations leaks ~1 temp .mlpackage per calibration op-group (cleanup is tied to atexit rather than __del__), which adds up to ~38 GB per Gemma 4 E2B chunk2 calibration run, enough to fill a disk. Worth filing upstream if anyone revives this path.

W2A16 palettization (2-bit, post-training) — Rejected (2026-04-13)

  • File: conversion/build_flash.py --nbits 2
  • Idea: Apple's own 3.18B ships W2 palettized weights (per Foundation Models 2025 tech report). Reduce weight bandwidth 8× vs FP16 (2× vs INT4). Post-training palettization via OpPalettizerConfig(nbits=2, granularity="per_grouped_channel", group_size=32). Expected ×1.4–2.0 decode speedup from bandwidth reduction.
  • Size: chunk total 546 MB (vs 1,091 MB INT4, i.e. ≈50%). Conversion succeeds, no compile errors.
  • Quality: complete gibberish. Multilingual garbage tokens, zero coherent output on 3 test prompts (France capital, photosynthesis, haiku). Tested via conversion/smoke_w2_quality.py with real INT8 embeddings + RoPE, autoregressive generation on Mac CPU.
  • Root cause: 4 codewords per group (2-bit) has insufficient representational capacity for post-training palettization. Apple sustains quality at W2 via QAT (quantization-aware training): their shipping recipe includes quantization-aware fine-tuning, not post-training compression. Post-training W2 is not viable without QAT (days of GPU training).

W3A16 palettization (3-bit, post-training) — Rejected (2026-04-13)

  • File: conversion/build_flash.py --nbits 3
  • Idea: 8 codewords per group might survive post-training where 4 did not. Weight size = 75% of INT4.
  • Size: chunk total 818 MB (vs 1,091 MB INT4 = 75%).
  • Quality: still gibberish, though pattern differs from W2 — repetitive underscore-separated fragments instead of multilingual noise. Not usable.
  • Conclusion: post-training palettization quality cliff is between 3-bit and 4-bit for Gemma 4 E2B. Only ≥4-bit palettization (already shipping) works without QAT. Sub-4-bit requires QAT or knowledge distillation, which is a multi-day GPU effort outside current scope.
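
To make the codeword counts concrete, the toy below palettizes synthetic Gaussian weights per group of 32 with a plain 1-D k-means and compares 2, 3 and 4 bits. It is a sketch under loud assumptions: it does not use coremltools, the weights are random rather than Gemma weights, and reconstruction error is only a proxy. It does not reproduce the quality cliff observed above; it only shows how quickly the error grows as codewords are removed.

```python
import random


def palettize_group(values, nbits, iters=10):
    """Map each value in one group to its nearest of 2**nbits learned codewords."""
    k = 2 ** nbits
    ordered = sorted(values)
    # Deterministic init: evenly spaced quantiles of the group.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):  # plain 1-D Lloyd / k-means iterations
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            buckets[nearest].append(v)
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return [min(centroids, key=lambda c: abs(v - c)) for v in values]


def mean_squared_error(weights, nbits, group_size=32):
    """Average squared reconstruction error over all groups."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        restored = palettize_group(group, nbits)
        total += sum((w - r) ** 2 for w, r in zip(group, restored))
    return total / len(weights)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(32 * 16)]  # synthetic, not model weights
errors = {nbits: mean_squared_error(weights, nbits) for nbits in (2, 3, 4)}
size_vs_int4 = {nbits: nbits / 4 for nbits in (2, 3, 4)}  # index bits only, LUTs ignored
for nbits in (2, 3, 4):
    print(nbits, 2 ** nbits, round(errors[nbits], 4), size_vs_int4[nbits])
```

The size ratios (0.5 and 0.75 of INT4) line up with the measured 546 MB and 818 MB chunk totals; the error column is the part the size numbers hide.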

Speculative decoding experiments

EAGLE-3 — In training

  • Files: conversion/collect_eagle_hidden_states.py, conversion/download_eagle_corpus.py, conversion/train_eagle_draft.ipynb, conversion/train_eagle3_draft.ipynb (notebooks are untracked while actively iterated)
  • Idea: train a small decoder-layer draft model on Gemma 4 E2B hidden states; verify in-graph against the target.
  • Training corpus: WikiText + C4 + Alpaca + Dolly + CodeAlpaca + UltraChat, formatted with Gemma 4's chat template. ~50 k samples.
  • Current acceptance metric (acc0) tracked in notebook; see docs/SPEED_8K.md §3 P1 for the latest snapshot (dated, not live).
  • Integration path: once trained, draft model → build_speculative.py → paired with verify chunks.

Medusa (3 heads) — Shelved

  • Files: conversion/train_medusa_heads.py, parts of conversion/build_speculative.py
  • Idea: 3 lightweight ResBlocks predict the next 3 tokens from the final hidden state; verify with target model.
  • Reason shelved: published Medusa acceptance on Gemma-class models is ~1.3 %, far below EAGLE-3's 50–70 %. Confirmed on a small internal run. Kept because build_speculative.py reuses the same verify-chunk plumbing for EAGLE-3.
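
Both Medusa and EAGLE-3 rely on a verify step that decides how many drafted tokens to keep. The sketch below shows the standard greedy variant of that step; it is an assumption that build_speculative.py follows this scheme, and the function name and token IDs are illustrative.

```python
def accept_drafts(draft_tokens, target_tokens):
    """Greedy verification of a drafted burst.

    target_tokens[i] is the target model's greedy choice at position i given the
    prefix plus draft_tokens[:i], so it has len(draft_tokens) + 1 entries.
    Returns the tokens actually emitted for this burst.
    """
    accepted = []
    for i, drafted in enumerate(draft_tokens):
        if drafted != target_tokens[i]:
            break
        accepted.append(drafted)
    # The target's own token at the first mismatch (or the bonus token after a
    # fully accepted draft) is always emitted, so every burst yields >= 1 token.
    accepted.append(target_tokens[len(accepted)])
    return accepted


partial = accept_drafts([5, 7, 9], [5, 7, 2, 4])  # third draft rejected
full = accept_drafts([5, 7, 9], [5, 7, 9, 4])     # all drafts accepted + bonus
miss = accept_drafts([1], [3, 8])                 # first draft rejected
```

The output always equals what the target would have produced on its own, so a low acceptance rate costs speed rather than quality, which is why acceptance rate is the deciding metric between draft methods.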

MLState stateful KV cache — Rejected (2026-04-13)

  • Files: conversion/models/gemma4_stateful_chunks.py, conversion/build_stateful.py
  • Idea: use coremltools.StateType (iOS 18+) to declare KV cache as internal model state, eliminating per-dispatch IOSurface round-trip. Per FUNDAMENTAL_UNTRIED.md §2 / PRIORITY_ROADMAP.md Phase 1 item 2. Expected ×1.3–2.0 if dispatch overhead is the bottleneck.
  • Implementation: StatefulChunk2 (L8-14) with kv_sliding (10,1,512,512) and kv_full (4,1,8192,512) as ct.StateType. CoreML conversion succeeds, INT4 palettized, 134 MB.
  • Result: error code: -14 ("Failed to build the model execution plan") on both Mac ANE and iPhone 17 Pro ANE. The coreml_update_state op does not generate a valid ANE execution plan.
  • Root cause: MLState is GPU-only on current hardware/OS. HuggingFace's WWDC24 Mistral CoreML reference explicitly states stateful KV is "excellent for GPUs on Mac computers" and ANE requires "additional adaptations." Apple's own on-device Llama 3.1 and smpanaro/coreml-llm-cli both use stateless explicit-I/O KV, not MLState. The coreml_update_state MIL op is not supported by the ANE compiler as of iOS 26 / coremltools 9.0.
  • Conclusion: dispatch-overhead hypothesis (FUNDAMENTAL_UNTRIED.md §0) remains valid as a bottleneck description, but MLState cannot address it on ANE. Alternative paths to reduce dispatch overhead: (1) chunk consolidation (4→2 chunks), (2) speculative decoding (amortize dispatch across multiple tokens per burst).
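
The state shapes in the Implementation bullet translate into buffer sizes as follows. This is plain arithmetic, assuming 2 bytes per element (fp16) for the KV tensors; the helper name is illustrative.

```python
from math import prod

FP16_BYTES = 2  # assumption: KV tensors are stored as fp16


def buffer_mib(shape):
    """Size of one tensor in MiB for the given shape."""
    return prod(shape) * FP16_BYTES / (1024 * 1024)


kv_sliding_mib = buffer_mib((10, 1, 512, 512))  # sliding-window state
kv_full_mib = buffer_mib((4, 1, 8192, 512))     # full-attention state at 8K
print(kv_sliding_mib, kv_full_mib)  # 5.0 32.0
```

At 8K context the full-attention state dominates, so that is the tensor whose per-dispatch round-trip any stateful or merged-chunk design is trying to avoid.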

MLState + KV heads padded to 32 — Rejected (2026-04-15)

  • Files: conversion/models/gemma4_stateful_padded.py, conversion/build_stateful_padded.py (on branch worktree-agent-ad21e314, not in main). Findings in docs/SPLIT_ROTATE_FINDINGS.md and docs/SPLIT_ROTATE_BENCH.md.
  • Idea: hypothesis that error -14 was caused by the non-mod-32 num_kv_heads=1 dim being rejected by ANE's 32-wide tile scheduler. Probe: pad KV heads 1→32 in both stateless (PaddedKVChunk2) and stateful (StatefulPaddedChunk2) variants.
  • Part 1 result: PyTorch parity between StatelessChunk2 and PaddedKVChunk2 is bit-exact on all finite outputs; padded heads stay zero as expected.
  • Part 2 result: build_stateful_padded.py --ctx 512 --nbits 0 --smoke-test → conversion emits the same error code: -14 warning at save time; Mac CPU_ONLY predict returns finite outputs in 49 ms (graph is runtime-valid on CPU). A device retry is expected to reproduce -14 on ANE.
  • Conclusion: 32-alignment is not the cause of error -14. The ANE compiler does not schedule coreml_update_state regardless of the padded tensor widths. ANEMLL's --split-rotate is a separate multi-function-loading workaround, not an alignment fix. No further MLState-on-ANE work is warranted; any stateful path must target GPU (CPU_AND_GPU) as the WWDC24 Mistral demo does.

SuffixDecoding (CPU-only draft) — Measured, demoted to auxiliary (2026-04-13)

  • Files: Sources/CoreMLLLM/SuffixTree.swift, Sources/CoreMLLLM/SuffixDecoding.swift (on branch claude/suffix-decoding-impl, not merged to main)
  • Idea: build suffix tree from all prior model outputs, draft K tokens via CPU trie lookup (~20µs), verify with Q=K ANE verifier. Paper reports 1.9-5.3× on chat/agentic workloads (NeurIPS 2025 Spotlight, arXiv 2411.04975).
  • T=1 instrumentation results on iPhone 17 Pro (4 multi-turn generations):
    • hit rate: 2% → 29% → 48% (climbs as tree grows, as expected)
    • T1 accuracy: 18.4% (of hits, how often the top-1 draft matches model output)
    • tree: 10k nodes after 4 generations
  • Performance overhead: insert at end-of-generation blocks the start of the next generation (~1-2 s for a 500-token sequence). Draft lookup costs ~2 tok/s even when sampled only every 4th token. Async insert + NSLock fixes the blocking but doesn't eliminate lookup cost.
  • Assessment: T1=18% is too low to be the primary speculative method. EAGLE-3 (acc0=75%, workload-independent) is strictly better as the main draft source. SuffixDecoding is workload-dependent — the paper's high numbers come from production workloads with stable system prompts, RAG, and code editing patterns. Random diverse chat is the worst case.
  • Decision: demote to auxiliary draft source. Use suffix tree when it has a high-confidence match (e.g., count > threshold), fall back to EAGLE-3 otherwise. Don't build Q=K verifier specifically for SuffixDecoding — build it for EAGLE-3, and SuffixDecoding can reuse it later.
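
The "use the suffix match only when it is confident, otherwise fall back" decision can be sketched like this. It is a simplified illustration: a fixed-length n-gram count table stands in for the real suffix tree in SuffixTree.swift, and the fallback callable is a hypothetical stand-in for the EAGLE-3 draft model.

```python
from collections import Counter, defaultdict


class SuffixDrafter:
    """Toy stand-in for a suffix tree: next-token counts keyed by a short context."""

    def __init__(self, context_len=2, min_count=2):
        self.context_len = context_len
        self.min_count = min_count  # confidence threshold on the match count
        self.table = defaultdict(Counter)

    def insert(self, tokens):
        n = self.context_len
        for i in range(len(tokens) - n):
            self.table[tuple(tokens[i:i + n])][tokens[i + n]] += 1

    def draft(self, history, max_tokens):
        out = []
        context = list(history[-self.context_len:])
        for _ in range(max_tokens):
            counts = self.table.get(tuple(context))
            if not counts:
                break
            token, count = counts.most_common(1)[0]
            if count < self.min_count:  # not confident enough, stop drafting
                break
            out.append(token)
            context = context[1:] + [token]
        return out


def choose_draft(drafter, history, max_tokens, fallback):
    """Prefer the suffix draft; use the fallback drafter when it has nothing."""
    draft = drafter.draft(history, max_tokens)
    return draft if draft else fallback(history, max_tokens)


drafter = SuffixDrafter(context_len=2, min_count=2)
for seq in ([1, 2, 3, 4], [1, 2, 3, 5], [9, 1, 2, 3, 4]):
    drafter.insert(seq)

confident = choose_draft(drafter, [7, 1, 2], 3, lambda h, k: [0] * k)
fallback_used = choose_draft(drafter, [9, 1], 3, lambda h, k: [0] * k)
```

Either source produces a token list of the same form, which is what lets a single Q=K verifier serve both, as the decision above requires.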

TriForce / Quest (sparse KV retrieval) — Planned

  • See docs/SPEED_8K.md §1 A3 / A4. Requires block-static top-k redesign to stay on ANE. Not started.

DuoAttention (retrieval vs streaming heads) — Planned

  • See docs/SPEED_8K.md §1 A1 / §3 P3. Offline head classification + two KV banks per layer. High ROI, training-free at inference time. Next candidate after EAGLE-3 lands.

MLComputePlan silent-fallback audit — Measured (2026-04-13)

  • File: Sources/CoreMLLLM/ComputePlanAudit.swift
  • Idea: use MLComputePlan.deviceUsage(for:) (iOS 17.4+) to walk every MIL op in chunk1-4 and identify any op whose preferred device is not Neural Engine. Per UNEXPLORED_APPROACHES_V2.md §G2 / PRIORITY_ROADMAP.md Phase 0a.
  • Result on iPhone 17 Pro (8K chunks):
    • chunk1 (L0-7): 0 compute ops on CPU/GPU. All matmul/conv/softmax/attention on ANE.
    • chunk2 (L8-14): 0 compute ops on CPU/GPU. Same.
    • chunk3 (L15-24): 0 compute ops on CPU/GPU. 1 identity on unknown (no-op).
    • chunk4 (L25-34 + LM head): 8 compute ops on CPU — the entire InModelArgmax tail: mul (×2), tanh, squeeze (×2), reduce_argmax, expand_dims, gather_along_axis. Total estimated cost = 0.0028. This is the tanh-based softargmax + gather pipeline from ane_ops.py::InModelArgmax.
    • All constexpr_lut_to_dense (INT4 depalettization) and const ops reported as "unknown" device — these are weight-loading ops that run once at model load, not per decode step.
  • Interpretation: the only per-step CPU fallback is chunk4's argmax tail (~8 ops, ~0.5-2 ms ANE↔CPU round-trip). chunk1-3 are 100% ANE. This confirms the dispatch-overhead hypothesis from FUNDAMENTAL_UNTRIED.md §0: the bottleneck is not individual ops falling to CPU, but the 4× per-step IOSurface round-trip between chunks.
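  • Actionable: (1) InModelArgmax CPU fallback is small but fixable: rewrite to ANE-compatible ops or accept the ~1-3% cost. (2) No other compute ops need fixing. (3) MLState (stateful KV to eliminate IOSurface round-trips) looked like the highest-leverage next step at the time of this audit; it has since been rejected on ANE (see "MLState stateful KV cache" above), leaving chunk consolidation and speculative decoding as the paths that address dispatch overhead.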

Vocabulary pruning — Abandoned

  • File: conversion/prune_vocab.py
  • Idea: the Gemma 4 vocab is 262 K tokens. Embedding + LM-head weights dominate on-device size. Analysis shows large blocks (rare scripts, emoji variants) are near-never used for English chat.
  • Reason abandoned: (1) Gemma's tokenizer is sentencepiece — dropping tokens changes BPE merges and breaks round-trip tokenization; (2) the gemma4_lite_wrapper.py route (external per-layer embedding in Swift) already reclaimed the main memory win (~40 %). Keep the analysis as reference.

8K chunk4 rebuild — Shipping (maintenance fix)

  • File: conversion/rebuild_chunk4_8k.py
  • Purpose: regenerate chunk4 with causal_mask_full at (1,1,1,8192) when the earlier build shipped with a 2048-sized mask. This is the kind of silent mismatch the ChunkedEngine auto-detection (see commit 4311991) now guards against at load time.
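
A load-time guard against this class of mismatch could look like the sketch below. How ChunkedEngine actually performs its check is not shown in this document, so the function name and error text are assumptions; only the (1,1,1,ctx) mask shape comes from the entry above.

```python
def check_full_mask_shape(mask_shape, context_length):
    """Raise if the full-attention causal mask was built for another context length."""
    expected = (1, 1, 1, context_length)
    if tuple(mask_shape) != expected:
        raise ValueError(
            f"causal_mask_full has shape {tuple(mask_shape)}, expected {expected}; "
            "rebuild the chunk for this context length"
        )


check_full_mask_shape((1, 1, 1, 8192), 8192)  # matching build passes silently

try:
    check_full_mask_shape((1, 1, 1, 2048), 8192)  # the stale 2048-sized mask
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Failing loudly at load time turns a silent quality or crash bug into a one-line rebuild instruction.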

How to add a new experiment

  1. Pick a short name (e.g. wfa, flash, merged).
  2. Put the model variant in conversion/models/gemma4_<name>.py.
  3. Put the builder in conversion/build_<name>.py. Keep it runnable standalone (python build_<name>.py --output ./output/<name>).
  4. Put the A/B benchmark in conversion/benchmark_<name>.py comparing against the shipping baseline on the same prompts.
  5. Write one entry in this file: what you tried, status after the first real measurement, reason if shelved.
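
A standalone builder in the sense of step 3 might start from a skeleton like this. It is a sketch only: the --output flag comes from the steps above, while the function names and the empty build step are placeholders rather than the structure of any existing builder.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Build one experimental model variant.")
    parser.add_argument("--output", type=Path, required=True,
                        help="directory that receives the converted chunks")
    return parser.parse_args(argv)


def build(output_dir):
    """Placeholder build step: a real builder would convert and save chunks here."""
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


def main(argv=None):
    """Entry point; pass argv explicitly in tests, leave it None on the command line."""
    args = parse_args(argv)
    return build(args.output)
```

A real script would call main() under the usual `if __name__ == "__main__":` guard. Taking argv as a parameter keeps the builder runnable standalone while still letting a benchmark script import and call it.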

Rule: every experiment that gets shelved gets a one-paragraph obituary here with the numeric reason. Future-us will otherwise rebuild the same thing.