Methods

One short section per method that actually ran, describing what was measured and how. Numbers live in RESULTS.md and results/tables/; this document is about what each row means.

Every method shares the harness in src/tinycompress/eval/:

  • Forward latency: fixed seq_len_for_latency (128 for 0.5B/1.5B, 64 for 3B; see src/tinycompress/run_config.py); the first n_warmup iterations are discarded before percentiles are computed.
  • Generation: greedy decode from prompts.txt, recording wall-clock and tokens/s per prompt.
  • Perplexity: sliding-window PPL on wikitext-2-raw-v1 test split (ppl_max_length tokens per chunk, ppl_stride overlap). Calibration sets (for calibration-based methods) only ever touch the train split - self_audit.py check H enforces this.
  • Peak memory: process RSS sampled on a background thread (profile/memory.py).
  • Hardware snapshot: chip/os/torch/python versions captured in every JSON.
@startuml
title Harness dataflow - one (model, method) cell
skinparam componentStyle rectangle

[run_<task>.py] as runner
[loader.load] as loader
[method transform\n(quant / compile / prune / …)] as transform
[profile.latency] as lat
[eval.generation] as gen
[eval.perplexity] as ppl
[profile.memory] as mem
[hardware_info] as hw
[results_io.write_atomic] as writer
database "results/raw/<model>/<method>.json" as json

runner --> loader : from_pretrained
loader --> transform : nn.Module
transform --> lat : fixed-seq forward ×N
transform --> gen : prompts.txt, greedy
transform --> ppl : wikitext-2 test, sliding window
mem ..> runner : RSS peak sampler (background)
hw ..> runner : chip/os/torch snapshot
runner --> writer
writer --> json
@enduml
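
The forward-latency rule in the harness list (discard the first n_warmup iterations, then compute percentiles) can be sketched as follows. This is a minimal illustration, not the code in src/tinycompress/eval/: the helper names, the nearest-rank percentile rule, and the choice of p50/p90/p99 are assumptions; only the warm-up discard comes from the description above.

```python
import math


def percentile(sorted_vals, q):
    """Nearest-rank percentile (q in 0..100) of an already sorted list."""
    if not sorted_vals:
        raise ValueError("no samples left after warm-up")
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def summarize_latency(samples_ms, n_warmup):
    """Drop the first n_warmup samples, then summarize the steady state."""
    steady = sorted(samples_ms[n_warmup:])
    return {
        "n": len(steady),
        "p50_ms": percentile(steady, 50),
        "p90_ms": percentile(steady, 90),
        "p99_ms": percentile(steady, 99),
    }


# Illustrative numbers only: two slow warm-up iterations, one late outlier.
samples = [250.0, 180.0, 12.0, 11.0, 13.0, 12.5, 11.5, 12.2, 40.0, 11.8]
summary = summarize_latency(samples, n_warmup=2)
print(summary)
```

Without the discard, the two warm-up iterations would dominate the upper percentiles and the row would measure start-up cost rather than steady-state forward time.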

Baselines

Three straight loads of each model with no compression:

  • fp32_cpu - reference numerics. Slowest but deterministic.
  • fp16_mps - same weights cast to fp16 on the Apple MPS backend.
  • bf16_cpu - CPU bf16, useful to see how far pure-CPU gets you without MPS.

Dynamic int8 (torch built-in)

Applies torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=qint8). On Apple Silicon the default quantized engine is "none"; the module explicitly selects qnnpack (the only engine shipped by torch 2.11 on M-series) before running - otherwise linear_prepack raises NoQEngine. Weights are quantized once, ahead of time; activations are quantized per batch at runtime. Compared here to weight-only int8 to isolate "int8 kernel math" from "int8 storage with fp math".

Weight-only int8 (per-channel symmetric)

A from-scratch reimplementation: each nn.Linear weight row is quantized to int8 with a per-output-channel scale. Forward path dequantizes to the activation dtype and runs a regular fp matmul, so there is no int8 kernel win - on purpose. This method isolates the storage-side savings and the numerical drift of per-channel symmetric quantization from any kernel effect. PPL should stay close to the fp baseline; throughput will not improve.
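
A minimal sketch of the per-output-channel symmetric scheme described above. It uses plain Python lists so the arithmetic is visible; the repo operates on nn.Linear tensors, and the function names, the [-127, 127] clipping range, and the max-abs scale rule are assumptions for illustration.

```python
def quantize_rows_int8(weight):
    """One scale per output row, zero-point fixed at 0 (symmetric)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # Guard against an all-zero row so the division below is safe.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Forward path: expand back to floats before a regular fp matmul."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Two rows with very different magnitudes; each gets its own scale.
weight = [[0.50, -1.27, 0.03], [0.002, -0.001, 0.004]]
q_rows, scales = quantize_rows_int8(weight)
restored = dequantize_rows(q_rows, scales)

# Rounding to the nearest code bounds the error by half a step per row.
row_errors = [
    max(abs(a - b) for a, b in zip(orig, back))
    for orig, back in zip(weight, restored)
]
print(scales, row_errors)
```

Because the second row is tiny compared with the first, a single per-tensor scale would round it to zero; a per-row scale keeps its relative precision, which is why the drift is expected to stay small.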

Weight-only int4 (group-wise)

Same idea, but 4 bits, packed two nibbles per byte, with one fp16 scale per group of 128 weights along the input dimension. Same "storage win only, no kernel win" story as int8. Quality is expected to degrade more; the experiment quantifies how much.
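
The packing and grouping can be sketched like this. It is an illustration under stated assumptions, not the repo's layout: the nibble order (low nibble first), the two's-complement code range [-8, 7], the max-abs/7 scale rule, and all function names are hypothetical. The demo uses a group size of 4 for readability; the method above uses 128.

```python
def quantize_group_int4(values):
    """Symmetric 4-bit codes for one group, plus that group's scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def pack_nibbles(codes):
    """Store two signed 4-bit codes per byte (low nibble first)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: sign-extend each nibble back to an int."""
    codes = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            codes.append(nib - 16 if nib >= 8 else nib)
    return codes


def quantize_row_int4(row, group_size):
    """Quantize one weight row group by group along the input dimension."""
    packed, scales, all_codes = bytearray(), [], []
    for start in range(0, len(row), group_size):
        codes, scale = quantize_group_int4(row[start:start + group_size])
        packed += pack_nibbles(codes)
        scales.append(scale)  # kept as Python floats here; fp16 in the method
        all_codes.extend(codes)
    return bytes(packed), scales, all_codes


row = [0.7, -0.3, 0.1, 0.0, -2.8, 1.4, 0.4, 2.0]
packed, scales, codes = quantize_row_int4(row, group_size=4)

# Storage cost with the real group size: 4 bits + one 16-bit scale per 128.
bits_per_weight = 4 + 16 / 128
print(len(packed), scales, bits_per_weight)
```

The last line shows why the scale overhead is small: with groups of 128 the fp16 scale adds 0.125 bits per weight on top of the 4-bit code.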

Optional GPTQ-like (unverified)

A small, educational GPTQ-style reimplementation: per-column error-compensation using a Cholesky factor of H = XᵀX + λI from a tiny wikitext-2 train calibration slice (8 batches at seq_len 256). This is a "study the idea" cell rather than a faithful reproduction of the paper; it runs on all three models so the regularization effect is visible at each capacity. PPL is reported next to naive int4.

torch.compile

Wraps the model with torch.compile(model) in the default mode and runs the standard harness through the compiled module, so latency/PPL/tok-s are directly comparable to the matching baseline row. The first forward is timed separately and written to compile_first_run_ms - self_audit refuses any compile_* row that folds the first run into the steady-state mean.
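
The "first forward is timed separately" rule can be sketched with a stand-in callable. Nothing here touches torch: FakeCompiledForward is a hypothetical placeholder whose first call sleeps longer to imitate compilation, and time_first_and_steady is an illustrative helper name, not a function from the repo.

```python
import time


def time_first_and_steady(fn, n_steady):
    """Time the first call on its own, then n_steady further calls."""
    t0 = time.perf_counter()
    fn()
    first_ms = (time.perf_counter() - t0) * 1e3

    steady_ms = []
    for _ in range(n_steady):
        t0 = time.perf_counter()
        fn()
        steady_ms.append((time.perf_counter() - t0) * 1e3)

    # The first run is reported under its own key and never averaged in.
    return {"compile_first_run_ms": first_ms, "steady_ms": steady_ms}


class FakeCompiledForward:
    """Stand-in for a compiled module: slow first call, fast afterwards."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        time.sleep(0.05 if self.calls == 1 else 0.001)


timing = time_first_and_steady(FakeCompiledForward(), n_steady=5)
print(timing["compile_first_run_ms"], timing["steady_ms"])
```

Keeping the first call in a separate field is what lets an audit check reject rows where compilation cost leaked into the steady-state numbers.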

Peak memory for compile_fp16_mps rows is flagged unreliable. The process-level PeakRSS sampler opens its window at the top of run_eval, which is after the torch.compile pass has already run. On MPS unified memory the compile pass allocates and then frees scratch buffers before the sampler sees them, and the baseline RSS captured at sampler open can sit below the true working set. Symptom: the 3B compile_fp16_mps row originally reported peak_bytes = 1315.4 MB, below the 2644.6 MB of the eager fp16 MPS row for the same model. The JSON now carries peak_memory.sampler_note = "unreliable: …" for MPS compile rows; scripts/make_tables.py renders the peak-MB cell as - when that note is present. The underlying number stays in the JSON for inspection. The CPU compile rows (compile_fp32_cpu) do not show the same pathology and are not flagged.
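
A hedged sketch of the rendering rule described above. The real logic lives in scripts/make_tables.py and is not reproduced here; the function name, the assumption that peak_bytes sits under the peak_memory key, and the bytes-to-MB divisor of 1e6 are all illustrative guesses.

```python
def render_peak_mb(record):
    """Return the peak-MB table cell, or '-' if the sampler was flagged."""
    peak = record.get("peak_memory", {})
    note = str(peak.get("sampler_note", ""))
    if note.startswith("unreliable"):
        # The raw number stays in the JSON; only the table cell is blanked.
        return "-"
    return f"{peak['peak_bytes'] / 1e6:.1f}"


# Hypothetical records shaped like the description, not real result files.
flagged = {
    "peak_memory": {
        "peak_bytes": 1_315_400_000,
        "sampler_note": "unreliable: sampler opened after compile pass",
    }
}
eager = {"peak_memory": {"peak_bytes": 2_644_600_000}}

print(render_peak_mb(flagged), render_peak_mb(eager))
```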

ONNX + ONNX Runtime CPU

Export path: torch.onnx.export(wrapped_model, (input_ids,), ..., dynamo=True) where wrapped_model returns only the .logits tensor. The HF output type CausalLMOutputWithPast contains a DynamicCache, which is not a registered torch.export pytree, so the wrapper is necessary. Falls back to the TorchScript path if dynamo fails; raises if both fail.

On disk the export produces a .onnx protobuf plus a .onnx.data sidecar (once the weights reach 2 GiB, the size limit of a single protobuf, they are stored outside the graph protobuf). on_disk_bytes sums both files so the reported size reflects the real artifact.

Forward latency is measured by onnxruntime.InferenceSession on CPUExecutionProvider, at the same seq_len_for_latency as the eager baseline. No PPL or generation: the exported graph is fixed-shape and has no KV-cache plumbing - that is explicit scope, not an oversight.

CoreML

Attempted and omitted on this setup. coremltools==9.0 fails to load its native bindings (_MLModelProxy / BlobWriter) under the installed torch==2.11.0 + python==3.14.3 combination, so ct.convert() cannot even produce an .mlpackage. Before that, torch.jit.trace through the HF Qwen2 model hits a compatibility bug inside transformers.masking_utils.sdpa_mask (a q_length.shape[0] on what has become a 0-d trace proxy). Both failures are deep in dependencies; no honest CoreML number could be produced. Noted in run_export_all.sh and LIMITATIONS.md.

KV-cache growth

At a fixed prompt, prime the model, then extend generation to a ladder of sequence lengths [128, 256, 512, 1024, 2048, 4096]. At each probe, read:

  • analytic bytes/token = n_layers × 2 × n_kv_heads × head_dim × dtype_bytes
  • measured cache_tensor_bytes = sum of cache.layers[i].keys/.values byte sizes
  • rss_delta_bytes = process RSS now vs at baseline snapshot

The analytic and measured cache bytes should match within rounding - they do (see RESULTS). The RSS delta is noisy because unrelated caches get freed and reallocated; it is reported with an explicit "process-only, psutil-sampled" note.
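
The analytic formula above is plain arithmetic and can be evaluated along the probe ladder. The shape values below (24 layers, 2 KV heads, head_dim 64, fp16) are illustrative assumptions, not figures taken from this document; substitute the values from the model config being measured.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    """Analytic KV-cache cost of one token: K and V for every layer."""
    return n_layers * 2 * n_kv_heads * head_dim * dtype_bytes


LADDER = [128, 256, 512, 1024, 2048, 4096]

# Assumed shape for illustration only.
per_token = kv_bytes_per_token(
    n_layers=24, n_kv_heads=2, head_dim=64, dtype_bytes=2
)
expected_cache_bytes = {seq_len: seq_len * per_token for seq_len in LADDER}

for seq_len, nbytes in expected_cache_bytes.items():
    print(f"{seq_len:>5} tokens -> {nbytes / 2**20:.1f} MiB")
```

The measured cache_tensor_bytes at each probe should equal the matching entry of expected_cache_bytes, since the cache grows linearly in sequence length with this per-token slope.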

Int8 KV-cache

An Int8KVCache subclass of transformers.DynamicCache that stores K and V as int8 with per-token / per-head symmetric scales in fp16. On each update() the incoming K/V tensors are quantized, appended along the sequence dim, and the full-sequence tensors are dequantized back to the caller's dtype before being returned. Attention math then runs in the original dtype on dequantized tensors; the int8 is only in storage.

This is a reimplementation, not a production int8-KV scheme calibrated per layer. Bytes/token are reported next to the fp16 baseline (expected slightly above ½, since every int8 vector also carries an fp16 scale) and tokens/s next to the same-prompt fp16 run (expected slightly slower because of the dequant overhead on every forward).
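
The expected storage ratio can be worked out directly. This sketch assumes one fp16 scale per (token, head) for K and another for V, which is one reading of "per-token / per-head" above; the function names and the shape values are illustrative, not taken from the repo.

```python
def fp16_kv_bytes_per_token(n_layers, n_kv_heads, head_dim):
    """Baseline: K and V stored as fp16 (2 bytes per element)."""
    return n_layers * 2 * n_kv_heads * head_dim * 2


def int8_kv_bytes_per_token(n_layers, n_kv_heads, head_dim, scale_bytes=2):
    """Int8 payload plus one scale per head vector, for both K and V."""
    per_head_vector = head_dim * 1 + scale_bytes
    return n_layers * 2 * n_kv_heads * per_head_vector


# Assumed shape for illustration only.
shape = dict(n_layers=24, n_kv_heads=2, head_dim=64)
fp16_bytes = fp16_kv_bytes_per_token(**shape)
int8_bytes = int8_kv_bytes_per_token(**shape)
ratio = int8_bytes / fp16_bytes

print(fp16_bytes, int8_bytes, ratio)
```

The ratio reduces to (head_dim + 2) / (2 × head_dim), so it approaches one half as head_dim grows and depends on neither the layer count nor the number of KV heads.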

Speculative decoding (0.5B drafts 1.5B / 3B)

src/tinycompress/decode/speculative.py, greedy only, shared Qwen2.5 tokenizer. For each prompt two things run in the same process:

  1. Speculative: the 0.5B draft proposes K tokens (K=4), the target verifies all K+1 positions in one forward, accepts the longest greedy-matching prefix, crops the target KV cache back on a reject, and loops.
  2. Target-only reference: same prompt, same target, same device / dtype, same max_new_tokens, no draft - a like-for-like tokens/s reading.

accept_rate = total_accepted / total_proposed. wallclock_speedup = target_only_elapsed / spec_elapsed.

No sampling: logits are argmaxed on both sides. This is the regime where speculative decoding is exactly rejection-free when draft == target; the test suite uses that as a property check.

@startuml
title Speculative decode loop (greedy, K=4)
start
:prompt tokens → draft + target KV caches;
while (total_new_tokens < max_new_tokens ?) is (yes)
  :draft proposes K=4 tokens\n(K sequential forwards on draft);
  :target verifies in ONE forward over K+1 positions;
  :compare argmax(draft[i]) vs argmax(target[i])\nfor i = 0..K-1;
  if (first mismatch at position j ?) then (yes)
    :accept j tokens;
    :append target[j] as bonus token;
    :crop target KV cache back to accepted length;
  else (no, all K match)
    :accept all K tokens;
    :append target[K] as bonus token;
  endif
  :total_accepted += j  (or K);
  :total_proposed += K;
endwhile (no)
:return accept_rate = total_accepted / total_proposed\n     wallclock_speedup = target_only_elapsed / spec_elapsed;
stop
@enduml
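
The loop in the diagram can be exercised with toy next-token functions in place of real models. This is a sketch, not src/tinycompress/decode/speculative.py: the toy functions are hypothetical, there is no KV cache to crop, and the target is called K+1 times where a real model would produce the same K+1 argmaxes in one forward.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Target-only reference: one greedy token per step."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def speculative_decode(draft_next, target_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (new_tokens, accept_rate)."""
    seq = list(prompt)
    accepted = proposed = 0
    while len(seq) - len(prompt) < max_new_tokens:
        # Draft proposes k tokens with k sequential steps.
        draft_seq, proposal = list(seq), []
        for _ in range(k):
            tok = draft_next(draft_seq)
            proposal.append(tok)
            draft_seq.append(tok)
        # Target's greedy choice at each of the k+1 positions.
        target_choice = [target_next(seq + proposal[:i]) for i in range(k + 1)]
        # Accept the longest prefix on which draft and target agree.
        j = 0
        while j < k and proposal[j] == target_choice[j]:
            j += 1
        seq.extend(proposal[:j])
        seq.append(target_choice[j])  # bonus token, or correction on reject
        accepted += j
        proposed += k
    return seq[len(prompt):][:max_new_tokens], accepted / proposed


def toy_target(seq):
    return (3 * seq[-1] + len(seq)) % 11


def toy_draft(seq):
    # Agrees with the target except at every third position.
    tok = toy_target(seq)
    return (tok + 1) % 11 if len(seq) % 3 == 0 else tok


prompt = [1, 2]
reference = greedy_decode(toy_target, prompt, max_new_tokens=20)
spec_out, accept_rate = speculative_decode(toy_draft, toy_target, prompt, 20)
same_out, same_rate = speculative_decode(toy_target, toy_target, prompt, 20)
print(accept_rate, same_rate, spec_out == reference)
```

Two properties fall out of the greedy regime: the speculative output is identical to the target-only output whatever the draft proposes, and the accept rate is exactly 1.0 when the draft is the target, which mirrors the property check mentioned above.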

SDPA backend probe

Times F.scaled_dot_product_attention at three shapes (seq = 128 / 512 / 2048) under each of MATH / FLASH / EFFICIENT / CUDNN kernels via torch.nn.attention.sdpa_kernel. On Apple Silicon with torch 2.11, FLASH / EFFICIENT / CUDNN are CUDA/CUDNN-specific and record "unavailable" - the slots are kept in the JSON so it is explicit which kernels were tried and why they didn't run.

Magnitude pruning

Unstructured magnitude prune per nn.Linear: compute |W|.flatten(), pick the kthvalue threshold at the target sparsity, mask in place with W.mul_(mask). lm_head is skipped by name pattern because zeroing the head weights breaks generation outright. Weights stay dense in memory - M5 has no sparse Linear kernel that would accelerate this - so the reported fwd_ms will be statistically indistinguishable from the fp baseline. This is a quality-vs-sparsity study, not a speed-up study.
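
The threshold-and-mask step can be sketched on a flat list of weights. This stands in for the tensor version described above (kthvalue, then an in-place multiply); the function names, the strict greater-than comparison, and the substring test for lm_head are assumptions for illustration.

```python
def should_prune(layer_name):
    """Skip the output head; hypothetical name test, not the repo's pattern."""
    return "lm_head" not in layer_name


def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights; returns (pruned, mask)."""
    k = int(sparsity * len(weights))
    if k == 0:
        return list(weights), [1] * len(weights)
    # k-th smallest magnitude plays the role of the kthvalue threshold.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    # Ties at the threshold are pruned too, so sparsity can slightly overshoot.
    mask = [1 if abs(w) > threshold else 0 for w in weights]
    return [w * m for w, m in zip(weights, mask)], mask


weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.4, 0.08]
pruned, mask = magnitude_prune(weights, sparsity=0.5)
achieved_sparsity = 1 - sum(mask) / len(mask)

print(pruned, achieved_sparsity)
print(should_prune("model.layers.0.mlp.up_proj"), should_prune("lm_head"))
```

The pruned list has the same length as the input, which is the point made above: the zeros are stored densely, so only quality changes and the forward time does not.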

Short-run distillation

200 KL-divergence steps with temperature T=2.0 and an AdamW optimizer on student parameters only. Teacher is frozen (requires_grad_(False)). Training windows are random offsets into the wikitext-2 train split; evaluation PPL is on the test split as usual.
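
A sketch of a temperature-softened KL loss for a single token position. The text above fixes only "KL divergence" and T=2.0; the direction KL(teacher ‖ student) and the T² scaling are the common distillation convention and are assumptions here, as are the function names.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_kl(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # frozen teacher
    q = softmax(student_logits, temperature)  # trainable student
    kl = sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0
    )
    return kl * temperature ** 2


teacher = [2.0, 0.5, -1.0, 0.0]
student = [1.0, 1.0, -0.5, 0.2]

loss_mismatch = distill_kl(student, teacher)
loss_match = distill_kl(teacher, teacher)
print(loss_mismatch, loss_match)
```

The loss is zero only when the student's softened distribution matches the teacher's, so a run that lowers this loss can still move test PPL in either direction, which is why the result is framed as a signed delta.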

Result framing: this is explicitly a signed-delta study - "did a short run of distillation move the needle in the right direction on this laptop". It is not a parity claim against the teacher. A negative delta (post-distill PPL worse than pre-distill PPL) is a real finding, not a bug, and is reported as such.

The 3B → 1.5B cell is intentionally omitted: the working set exceeds 32 GB unified memory once gradients and AdamW state for the 1.5B student are allocated alongside the frozen 3B teacher, so the process swap-thrashes and wall-clock stops measuring the method. Documented in LIMITATIONS.md; the 1.5B → 0.5B cell is the one signed-delta reading the repo ships.