One short section per method that actually ran, describing what was measured
and how. Numbers live in RESULTS.md and results/tables/; this
document is about what each row means.
Every method shares the harness in src/tinycompress/eval/:
- Forward latency: fixed seq_len_for_latency (128 for 0.5B/1.5B, 64 for 3B;
  see src/tinycompress/run_config.py); the first n_warmup iterations are
  discarded before percentiles are computed.
- Generation: greedy decode from prompts.txt, recording wall-clock and
  tokens/s per prompt.
- Perplexity: sliding-window PPL on wikitext-2-raw-v1 test split
  (ppl_max_length tokens per chunk, ppl_stride overlap). Calibration sets
  (for calibration-based methods) only ever touch the train split -
  self_audit.py check H enforces this.
- Peak memory: process RSS sampled on a background thread
  (profile/memory.py).
- Hardware snapshot: chip/os/torch/python versions captured in every JSON.
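The warmup-discard step in the latency bullet can be sketched in plain Python. This is an illustration rather than the harness code: the names latency_percentiles and summarize, the default iteration counts, and the nearest-rank percentile rule are all assumptions made for this example.

```python
import math
import time


def summarize(samples_ms, n_warmup, percentiles=(50, 90, 99)):
    """Drop the first n_warmup samples, then take nearest-rank percentiles."""
    kept = sorted(samples_ms[n_warmup:])
    if not kept:
        raise ValueError("n_warmup discards every sample")
    out = {}
    for p in percentiles:
        # Nearest-rank: the smallest sample with at least p% of samples <= it.
        rank = max(1, math.ceil(p / 100 * len(kept)))
        out[f"p{p}"] = kept[rank - 1]
    return out


def latency_percentiles(fn, n_iters=20, n_warmup=3, percentiles=(50, 90, 99)):
    """Time fn() n_iters times and summarize the post-warmup samples in ms."""
    samples_ms = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    return summarize(samples_ms, n_warmup, percentiles)
```

Discarding by position rather than by value keeps slow first iterations (allocator warmup, kernel caching) out of the tail percentiles without hiding genuine outliers later in the run.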
@startuml
title Harness dataflow - one (model, method) cell
skinparam componentStyle rectangle
[run_<task>.py] as runner
[loader.load] as loader
[method transform\n(quant / compile / prune / …)] as transform
[profile.latency] as lat
[eval.generation] as gen
[eval.perplexity] as ppl
[profile.memory] as mem
[hardware_info] as hw
[results_io.write_atomic] as writer
database "results/raw/<model>/<method>.json" as json
runner --> loader : from_pretrained
loader --> transform : nn.Module
transform --> lat : fixed-seq forward ×N
transform --> gen : prompts.txt, greedy
transform --> ppl : wikitext-2 test, sliding window
mem ..> runner : RSS peak sampler (background)
hw ..> runner : chip/os/torch snapshot
runner --> writer
writer --> json
@enduml
Three straight loads of each model with no compression:
- fp32_cpu - reference numerics. Slowest but deterministic.
- fp16_mps - same weights cast to fp16 on the Apple MPS backend.
- bf16_cpu - CPU bf16, useful to see how far pure-CPU gets you without MPS.
torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=qint8).
On Apple Silicon the default quantized engine is "none"; the module
explicitly selects qnnpack (the only engine shipped by torch 2.11 on M-series)
before running - otherwise linear_prepack raises NoQEngine. Weights are
quantized once ahead of time; activations are quantized per batch at runtime.
Compared here to weight-only int8 to isolate "int8 kernel math" from "int8
storage with fp math".
A from-scratch reimplementation: each nn.Linear weight row is quantized to int8 with a
per-output-channel scale. Forward path dequantizes to the activation dtype and
runs a regular fp matmul, so there is no int8 kernel win - on purpose. This
method isolates the storage-side savings and the numerical drift of per-channel
symmetric quantization from any kernel effect. PPL should stay close to the fp
baseline; throughput will not improve.
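The quantize / dequantize-then-matmul path described above can be sketched without torch. This is a pure-Python illustration under assumptions: the function names are hypothetical, the clamp range of [-127, 127] is one common choice for symmetric int8, and the repo's actual module may differ.

```python
def quantize_rows_int8(weight):
    """weight: list of rows, one per output channel. Returns (q_rows, scales)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # Symmetric scale per output channel; an all-zero row gets a dummy scale.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Rebuild floating-point rows from int8 codes and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def linear_forward(x, q_rows, scales):
    """y = W x with W dequantized first: storage is int8, the matmul is fp."""
    w = dequantize_rows(q_rows, scales)
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]
```

The round-trip error per weight is bounded by half the row's scale, which is the numerical drift this cell measures through PPL.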
Same idea, but 4 bits, packed two nibbles per byte, with one fp16 scale per group of 128 weights along the input dimension. Same "storage win only, no kernel win" story as int8. Quality is expected to degrade more; the experiment quantifies how much.
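A minimal sketch of the nibble packing and group scaling, in plain Python. The code range [-7, 7], the +8 storage offset, the low-nibble-first byte order, and all function names are assumptions for illustration; the scales are kept as Python floats here, where the method as described stores them in fp16.

```python
def quantize_group_int4(group):
    """Symmetric 4-bit quantization of one group: codes in [-7, 7] plus a scale."""
    max_abs = max(abs(w) for w in group)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    return [max(-7, min(7, round(w / scale))) for w in group], scale


def pack_nibbles(codes):
    """Pack signed 4-bit codes two per byte, low nibble first, with a +8 offset."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes(
        ((codes[i] + 8) & 0xF) | (((codes[i + 1] + 8) & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: recover the signed codes."""
    codes = []
    for b in packed:
        codes.append((b & 0xF) - 8)
        codes.append((b >> 4) - 8)
    return codes


def quantize_row_int4(row, group_size=128):
    """Quantize one weight row along the input dimension, one scale per group."""
    packed, scales = bytearray(), []
    for start in range(0, len(row), group_size):
        codes, scale = quantize_group_int4(row[start:start + group_size])
        packed += pack_nibbles(codes)
        scales.append(scale)
    return bytes(packed), scales
```

With these assumptions a group of 128 weights costs 64 packed bytes plus one 2-byte scale, against 256 bytes in fp16.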
A small, educational GPTQ-style reimplementation: per-column
error-compensation using a Cholesky factor of H = XᵀX + λI from a tiny
wikitext-2 train calibration slice (8 batches at seq_len 256). This is a
"study the idea" cell rather than a faithful reproduction of the paper; it
runs on all three models so the regularization effect is visible at each
capacity. PPL is reported next to naive int4.
torch.compile(model) in the default mode, then the standard harness is run
through the compiled module. The first forward is timed separately and written
to compile_first_run_ms - self_audit refuses any compile_* row that
folds the first run into the steady-state mean. The eval is driven by the
normal harness so latency/PPL/tok-s are directly comparable to the matching
baseline row.
Peak memory for compile_fp16_mps rows is flagged unreliable. The
process-level PeakRSS sampler opens its window at the top of run_eval,
which is after the torch.compile pass has already run. On MPS unified
memory the compile pass allocates and then frees scratch buffers before the
sampler sees them, and the baseline RSS captured at sampler open can sit
below the true working set. Symptom: the 3B compile_fp16_mps row
originally reported peak_bytes = 1315.4 MB, below the 2644.6 MB of the
eager fp16 MPS row for the same model. The JSON now carries
peak_memory.sampler_note = "unreliable: …" for MPS compile rows;
scripts/make_tables.py renders the peak-MB cell as - when that note is
present. The underlying number stays in the JSON for inspection. The CPU
compile rows (compile_fp32_cpu) do not show the same pathology and are
not flagged.
Export path: torch.onnx.export(wrapped_model, (input_ids,), ..., dynamo=True)
where wrapped_model returns only the .logits tensor. The HF output type
CausalLMOutputWithPast contains a DynamicCache, which is not a registered
torch.export pytree, so the wrapper is necessary. Falls back to the
TorchScript path if dynamo fails; raises if both fail.
On disk the export produces a .onnx protobuf plus a .onnx.data sidecar
(weights ≥ 2 GiB live outside the graph protobuf). on_disk_bytes sums both
files so the reported size reflects the real artifact.
Forward latency is measured by onnxruntime.InferenceSession on
CPUExecutionProvider, at the same seq_len_for_latency as the eager baseline.
No PPL or generation: the exported graph is fixed-shape and has no KV-cache
plumbing - that is explicit scope, not an oversight.
Attempted and omitted on this setup. coremltools==9.0 fails to load its
native bindings (_MLModelProxy / BlobWriter) under the installed
torch==2.11.0 + python==3.14.3 combination, so ct.convert() cannot even
produce an .mlpackage. Before that, torch.jit.trace through the HF
Qwen2 model hits a compatibility bug inside transformers.masking_utils.sdpa_mask
(a q_length.shape[0] on what has become a 0-d trace proxy). Both failures are
deep in dependencies; no honest CoreML number could be produced. Noted in
run_export_all.sh and LIMITATIONS.md.
At a fixed prompt, prime the model, then extend generation to a ladder of sequence lengths [128, 256, 512, 1024, 2048, 4096]. At each probe, read:
- analytic bytes/token = n_layers × 2 × n_kv_heads × head_dim × dtype_bytes
- measured cache_tensor_bytes = sum of cache.layers[i].keys / .values byte sizes
- rss_delta_bytes = process RSS now vs at baseline snapshot
The analytic and measured cache bytes should match within rounding - they do (see RESULTS). The RSS delta is noisy because unrelated caches get freed and reallocated; it is reported with an explicit "process-only, psutil-sampled" note.
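The analytic side of that comparison is plain arithmetic, sketched below. The shape values are illustrative placeholders chosen for the example, not numbers read from the repo or from RESULTS, and the function names are hypothetical.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    """Analytic KV-cache growth per generated token."""
    # The factor 2 covers one K tensor plus one V tensor per layer.
    return n_layers * 2 * n_kv_heads * head_dim * dtype_bytes


def kv_bytes_at(seq_len, **shape):
    """Total analytic cache size once seq_len tokens are cached."""
    return seq_len * kv_bytes_per_token(**shape)


# Illustrative shape only (assumed values for the example).
shape = dict(n_layers=24, n_kv_heads=2, head_dim=64, dtype_bytes=2)
for seq_len in [128, 256, 512, 1024, 2048, 4096]:
    print(seq_len, kv_bytes_at(seq_len, **shape))
```

Because the formula is linear in sequence length, any probe where the measured tensor bytes deviate from it points at cache bookkeeping rather than at the model.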
An Int8KVCache subclass of transformers.DynamicCache that stores K and V as
int8 with per-token / per-head symmetric scales in fp16. On each update()
the incoming K/V tensors are quantized, appended along the sequence dim, and the
full-sequence tensors are dequantized back to the caller's dtype before being
returned. Attention math then runs in the original dtype on dequantized
tensors; the int8 is only in storage.
This is a reimplementation, not a production int8-KV scheme calibrated per layer. Bytes/token are reported next to the fp16 baseline (expected ~½) and tokens/s next to the same-prompt fp16 run (expected slightly slower because of the dequant overhead on every forward).
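The "expected ~½" figure follows from the storage layout: one int8 byte per element plus one fp16 scale per (token, head) vector. A small accounting sketch, with hypothetical function names and an illustrative shape that is an assumption rather than a value from the repo:

```python
def fp_kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Baseline cache bytes per token at the given element width."""
    return n_layers * 2 * n_kv_heads * head_dim * dtype_bytes


def int8_kv_bytes_per_token(n_layers, n_kv_heads, head_dim, scale_bytes=2):
    """int8 payload plus one scale per (token, head) vector, for both K and V."""
    per_head = head_dim * 1 + scale_bytes
    return n_layers * 2 * n_kv_heads * per_head


# Illustrative shape only (assumed values for the example).
baseline = fp_kv_bytes_per_token(24, 2, 64)
quantized = int8_kv_bytes_per_token(24, 2, 64)
ratio = quantized / baseline
```

The ratio sits slightly above one half because of the scale overhead, and that overhead shrinks as head_dim grows.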
src/tinycompress/decode/speculative.py, greedy only, shared Qwen2.5
tokenizer. For each prompt two things run in the same process:
- Speculative: the 0.5B draft proposes K tokens (K=4), the target verifies all K+1 positions in one forward, accepts the longest greedy-matching prefix, crops the target KV cache back on a reject, and loops.
- Target-only reference: same prompt, same target, same device / dtype,
  same max_new_tokens, no draft - a like-for-like tokens/s reading.
accept_rate = total_accepted / total_proposed. wallclock_speedup = target_only_elapsed / spec_elapsed.
No sampling: logits are argmaxed on both sides. This is the regime where speculative decoding is exactly rejection-free when draft == target; the test suite uses that as a property check.
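The accept / bonus-token logic and the draft == target property can be sketched with toy models. This is not the code in speculative.py: the "models" here are hypothetical callables that map a token list to the greedy next token, there is no KV cache to crop, and the target is called once per position where a real model would score all K+1 positions in one forward.

```python
def greedy(next_token, prompt, max_new_tokens):
    """Target-only reference: plain greedy decode."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def speculative_greedy(draft_next, target_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decode; returns (new_tokens, accept_rate)."""
    tokens = list(prompt)
    accepted = proposed = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # Draft proposes k tokens sequentially.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Target's greedy choice at each of the k+1 positions.
        target = [target_next(tokens + draft[:i]) for i in range(k + 1)]
        # Longest prefix on which draft and target agree.
        j = 0
        while j < k and draft[j] == target[j]:
            j += 1
        # Keep the accepted prefix plus the target's own token at position j.
        tokens += draft[:j] + [target[j]]
        accepted += j
        proposed += k
    new_tokens = tokens[len(prompt):len(prompt) + max_new_tokens]
    return new_tokens, (accepted / proposed if proposed else 0.0)
```

Whatever the draft proposes, the output matches the target-only decode token for token; only the accept rate, and therefore the speedup, depends on draft quality.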
@startuml
title Speculative decode loop (greedy, K=4)
start
:prompt tokens → draft + target KV caches;
while (total_new_tokens < max_new_tokens ?) is (yes)
:draft proposes K=4 tokens\n(K sequential forwards on draft);
:target verifies in ONE forward over K+1 positions;
:compare argmax(draft[i]) vs argmax(target[i])\nfor i = 0..K-1;
if (first mismatch at position j ?) then (yes)
:accept j tokens;
:append target[j] as bonus token;
:crop target KV cache back to accepted length;
else (no, all K match)
:accept all K tokens;
:append target[K] as bonus token;
endif
:total_accepted += j (or K);
:total_proposed += K;
endwhile (no)
:return accept_rate = total_accepted / total_proposed\n wallclock_speedup = target_only_elapsed / spec_elapsed;
stop
@enduml
Times F.scaled_dot_product_attention at three shapes (seq = 128 / 512 / 2048)
under each of MATH / FLASH / EFFICIENT / CUDNN kernels via
torch.nn.attention.sdpa_kernel. On Apple Silicon with torch 2.11, FLASH /
EFFICIENT / CUDNN are CUDA/CUDNN-specific and record "unavailable" - the
slots are kept in the JSON so it is explicit which kernels were tried and why
they didn't run.
Unstructured magnitude prune per nn.Linear: compute |W|.flatten(), pick the
kthvalue threshold at the target sparsity, mask in place with W.mul_(mask).
lm_head is skipped by name pattern because zeroing the head weights breaks
generation outright. Weights stay dense in memory - M5 has no sparse Linear
kernel that would accelerate this - so the reported fwd_ms will be
statistically indistinguishable from the fp baseline. This is a
quality-vs-sparsity study, not a speed-up study.
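The threshold-and-mask step can be sketched on a flat Python list. This stands in for the tensor version: a sort replaces kthvalue, the function name is hypothetical, and pruning every entry at or below the threshold (so ties can push sparsity slightly above target) is an assumption about the comparison used.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of a flat weight list, in place."""
    k = int(sparsity * len(weights))
    if k == 0:
        return weights
    # k-th smallest absolute value plays the role of the kthvalue threshold.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    for i, w in enumerate(weights):
        if abs(w) <= threshold:
            weights[i] = 0.0
    return weights
```

The list keeps its full length after pruning, which mirrors the point above: zeros are stored densely, so there is nothing for a dense matmul to skip.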
200 KL-divergence steps with temperature T=2.0 and an AdamW optimizer on
student parameters only. Teacher is frozen (requires_grad_(False)). Training
windows are random offsets into the wikitext-2 train split; evaluation PPL is
on the test split as usual.
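The per-token loss behind those steps can be sketched in plain Python. The KL direction (teacher relative to student) and the function names are assumptions for illustration; many distillation recipes also multiply the loss by T², which is left out here because the setup above does not state it.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_kl(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over one token's vocabulary at temperature T."""
    p = softmax(teacher_logits, temperature)  # frozen teacher distribution
    q = softmax(student_logits, temperature)  # student distribution being trained
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
```

A higher temperature flattens both distributions, so the student is also pushed to match the teacher's ranking of unlikely tokens rather than only its top choice.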
Result framing: this is explicitly a signed-delta study - "did a short run of distillation move the needle in the right direction on this laptop". It is not a parity claim against the teacher. A negative delta (post-distill PPL worse than pre-distill PPL) is a real finding, not a bug, and is reported as such.
The 3B → 1.5B cell is intentionally omitted: the working set exceeds 32 GB unified memory once gradients and AdamW state for the 1.5B student are allocated alongside the frozen 3B teacher, so the process swap-thrashes and wall-clock stops measuring the method. Documented in LIMITATIONS.md; the 1.5B → 0.5B cell is the one signed-delta reading the repo ships.