Add pplx-embed (Perplexity) bidirectional Qwen3 encoder on the ANE by dokterbob · Pull Request #169 · john-rocky/CoreML-LLM

dokterbob · 2026-06-17T12:58:44Z

Adds a stateless bidirectional Qwen3-0.6B encoder path for Perplexity's pplx-embed embedding models, running 99.8% on the ANE. Two variants: plain sentence embeddings and late chunking (one encoder pass over the whole window, then per-chunk embeddings via a pool_matrix matmul).
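As a rough illustration of the late-chunking idea, the sketch below pools per-token states into per-chunk embeddings with a single matmul. It assumes mean pooling over each chunk's token span; the actual contents of pool_matrix in this PR are not spelled out here, and `build_pool_matrix` and `matmul` are hypothetical helper names.

```python
def build_pool_matrix(chunk_spans, seq_len):
    """Row c averages the token positions in chunk_spans[c] = (start, end), end exclusive.

    Mean pooling is an assumption of this sketch, not a statement about the PR's pool_matrix.
    """
    rows = []
    for start, end in chunk_spans:
        width = end - start
        rows.append([1.0 / width if start <= t < end else 0.0 for t in range(seq_len)])
    return rows


def matmul(a, b):
    """Plain list-of-lists matrix product: (m x k) @ (k x n) -> (m x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy encoder output: 4 tokens, hidden size 2 (one pass over the whole window).
token_states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

# Two chunks covering tokens [0, 2) and [2, 4).
pool = build_pool_matrix([(0, 2), (2, 4)], seq_len=4)
chunk_embeddings = matmul(pool, token_states)  # one row per chunk
```

Because pooling is a fixed-shape matmul, it can live inside the same graph as the encoder, which fits the one-pass description above.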

Conversion (`conversion/`)

- models/qwen3_encoder.py: ANE encoder reusing the existing primitives (Conv2d-1×1 projections, repeat_kv_ane, stable_attention), with QK-norm, RoPE θ=1e6, SwiGLU, GQA 16/8, and a full bidirectional pad-mask. Derives seq-len + batch from the input. An fp16 residual rescale (K=8) keeps the 28-layer down_proj accumulation in range (exact for a pre-norm net). RMSNorm is now a local Qwen3RMSNorm selected via norm_impl (see workstream A).
- build_pplx_embed_bundle.py: fixed-shape ANE buckets (the fast path) + an optional flexible RangeDim GPU model (--dynamic-upper N) as the >max-bucket catch-all; plain + context; native int8 / pooled_fp16 output.
- pplx_embed_reference.py: fp32 golden oracle, bit-exact with the model's own st_quantize.py (int8/binary/ubinary). test_pplx_embed_parity.py gates it.
- config.py: registry entries pplx-embed, pplx-embed-context.
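One way to read the "exact for a pre-norm net" remark on the residual rescale: every sublayer sees only the normalized stream, and RMSNorm is scale-invariant, so storing the residual divided by K and dividing each update by K leaves the final normalized output unchanged. The toy below checks this numerically. It is a sketch under assumptions (eps = 0, a stand-in `sublayer`, pure-Python floats rather than fp16), not the code in qwen3_encoder.py.

```python
import math


def rms_norm(x, eps=0.0):
    """RMSNorm without a learned weight; eps = 0 so the scale invariance is exact."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]


def sublayer(x):
    """Stand-in for attention/MLP: any function of the normalized input."""
    return [3.0 * v + 1.0 for v in x]


def run(h, num_layers, k):
    """Pre-norm residual stack with the stream stored at 1/k of its true value."""
    stream = [v / k for v in h]
    for _ in range(num_layers):
        update = sublayer(rms_norm(stream))      # the norm cancels the 1/k factor
        stream = [s + u / k for s, u in zip(stream, update)]
    return rms_norm(stream)                      # the final norm removes the scale


h0 = [0.5, -2.0, 4.0, 1.5]
plain = run(h0, num_layers=28, k=1.0)
scaled = run(h0, num_layers=28, k=8.0)
max_diff = max(abs(a - b) for a, b in zip(plain, scaled))
```

With a nonzero eps the two runs would differ slightly, since eps is not rescaled along with the stream.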

Swift (`Sources/`)

CoreMLLLM/PplxEmbed.swift — [String] → int8/binary/ubinary (plain) and per-chunk context; tokenize → smallest fitting bucket → ANE, with n > largest bucket routed to the flexible GPU model (non-padded). Adds PplxEmbed.load(repo:buckets:preferCompiled:) for manifest-driven download (see workstream C). pplx-embed-demo CLI (--repo) + pplx-embed-bench harness.
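The routing rule described above (smallest fitting bucket on the ANE, otherwise the flexible GPU model without padding) can be sketched as follows. This is a Python illustration of the logic, not the Swift implementation; the bucket list, `pad_id=0`, right-padding, and the function names are assumptions for the example.

```python
def route(n_tokens, buckets, dynamic_upper=None):
    """Return (device, model_length) for an input of n_tokens tokens."""
    for bucket in sorted(buckets):
        if n_tokens <= bucket:
            return ("ane", bucket)        # pad up to the fixed bucket shape
    if dynamic_upper is not None and n_tokens <= dynamic_upper:
        return ("gpu", n_tokens)          # flexible model, run non-padded
    raise ValueError(f"{n_tokens} tokens exceeds every available model")


def pad_with_mask(token_ids, length, pad_id=0):
    """Right-pad to `length` and build the matching attention mask (1 = real token)."""
    pad = length - len(token_ids)
    return token_ids + [pad_id] * pad, [1] * len(token_ids) + [0] * pad


BUCKETS = (512, 1024, 4096)   # illustrative; 4096 is the largest fixed ANE bucket
short = route(300, BUCKETS, dynamic_upper=8192)
long_input = route(5000, BUCKETS, dynamic_upper=8192)
ids, mask = pad_with_mask([5, 6, 7], length=5)
```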

Fidelity (cosine vs fp32 reference)

plain int8 ≥0.999, context ≥0.998, dynamic GPU 0.9996. Swift-vs-Python end-to-end 0.9998.
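For reference, a fidelity gate of this kind boils down to a cosine comparison against the fp32 reference embedding. The sketch below is a generic illustration with made-up vectors; `passes_gate` is a hypothetical name, and the real gate lives in test_pplx_embed_parity.py.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_gate(candidate, reference, threshold):
    """True when the candidate embedding is within the cosine threshold."""
    return cosine(candidate, reference) >= threshold


# Made-up vectors for illustration only; not measured data.
reference = [0.12, -0.40, 0.33, 0.85]
candidate = [0.121, -0.399, 0.331, 0.849]
ok = passes_gate(candidate, reference, threshold=0.999)
```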

Design decision: fixed buckets only on the ANE

Models converted with EnumeratedShapes or RangeDim fall back to the CPU instead of running on the ANE and are ~10× slower (measured at 512 tokens: 969 ms vs 101 ms). So the ANE path is one fixed-shape .mlpackage per bucket (inputs are padded to the smallest fitting bucket); the flexible RangeDim model is GPU-only and used solely as the >max-bucket catch-all.

Side investigations (reproducible scripts + write-ups in `docs/`)

- Quantization: docs/PPLX_EMBED_W8A8.md + conversion/experiment_w8a8.py. Weight-only int8 collapses (~0.42 cos), int4-palettize only reaches 0.905, full W8A8 collapses to ~0. ANE residency stays fine, so the failure is purely numerical, and moot anyway since int8 activations are only ~9% faster here (not bandwidth-bound). Conclusion: ship fp16 + buckets; recovery would need full QAT.
- Batching throughput: docs/PPLX_EMBED_BATCHING.md + conversion/experiment_batching.py. Batching is not a useful throughput lever: the ANE is batch-1 by design (batching hurts, 0.69×, confirmed 100% on-ANE), and the GPU path batches only modestly (~1.4× at small L) and saturates by L≈512. Batch-1 on the ANE at the smallest bucket is the throughput winner. (Measured on an Apple M4 Max.)

What changed (this session)

Four workstreams on top of the encoder path above. All measurements: Apple M4 Max, macOS 26, coremltools 9, torch 2.11, B=1.

A — Native RMSNorm, shipped as the default

A new local Qwen3RMSNorm (x * rsqrt(mean(x²) + eps) * w) plus a norm_impl config selecting it at the 5 encoder norm sites, leaving the shared ane_ops.ANERMSNorm (the cat([x, −x]) → LayerNorm → chunk trick) untouched. The A/B (conversion/experiment_ane_rmsnorm.py, changing only the norm sites and holding Conv2d-1×1 projections + tensor layout fixed) isolated the RMSNorm as the cause of the ANE speedup seen incidentally during the earlier GPU-residency investigation. The cat/chunk trick predated a fast native ANE rsqrt; on this stack it is now a de-optimization. norm_impl="native" is now the pplx-embed default. Resolves the open follow-up flagged in docs/PPLX_EMBED_GPU_RESIDENCY.md. Shared-decoder rollout (the ~10 other families on ane_ops.ANERMSNorm) is flagged but not done — needs per-family re-validation — in docs/ANE_RMSNORM_FOLLOWUP.md.
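The two norm formulations are mathematically the same function, which is why swapping them is a pure performance change. Concatenating x with −x gives a vector with zero mean whose variance equals mean(x²), so LayerNorm over it reduces to RMSNorm on the first half. A small numerical check, assuming the weight is applied after the chunk (the exact placement in ane_ops.ANERMSNorm is not shown here):

```python
import math


def native_rms_norm(x, w, eps):
    """x * rsqrt(mean(x^2) + eps) * w, as in the formula quoted above."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) * wi for v, wi in zip(x, w)]


def layer_norm(x, eps):
    """LayerNorm without affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def cat_trick_rms_norm(x, w, eps):
    """cat([x, -x]) -> LayerNorm -> keep the first half -> scale by w."""
    doubled = x + [-v for v in x]          # mean is 0, variance is mean(x^2)
    first_half = layer_norm(doubled, eps)[: len(x)]
    return [v * wi for v, wi in zip(first_half, w)]


x = [0.5, -2.0, 4.0, 1.5]
w = [1.0, 0.5, 2.0, 1.5]
native = native_rms_norm(x, w, eps=1e-6)
via_cat = cat_trick_rms_norm(x, w, eps=1e-6)
norm_diff = max(abs(a - b) for a, b in zip(native, via_cat))
```

The cat formulation does the same arithmetic on a tensor twice as wide, which is consistent with it being slower once a fast native rsqrt is available.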

B — Larger fixed ANE bucket (L=8192): measured, does NOT ship

conversion/measure_l8192_bucket.py shows a fixed L=8192 bucket statically maps 99.81% of ops to the ANE, but the ANE runtime fails at inference (ANEProgramProcessRequestDirect status=0x15): the 8192² full-attention intermediates (16 heads × 8192² fp16 ≈ 2 GB per score tensor) exceed ANE buffer limits, and the ANE compile alone takes ~25 min. So the largest fixed ANE bucket stays 4096; 4097–8192 tokens keep the existing dynamic GPU catch-all.
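The ≈ 2 GB figure follows directly from the shape of one attention score tensor, and the same arithmetic shows how much smaller the 4096 bucket is:

```python
HEADS = 16
BYTES_PER_FP16 = 2


def score_tensor_bytes(seq_len, heads=HEADS):
    """Size of one [heads, seq_len, seq_len] fp16 attention score tensor."""
    return heads * seq_len * seq_len * BYTES_PER_FP16


gib = 1024 ** 3
size_8192 = score_tensor_bytes(8192)   # 2_147_483_648 bytes = 2 GiB
size_4096 = score_tensor_bytes(4096)   # 536_870_912 bytes = 0.5 GiB
```

Doubling the sequence length quadruples the score tensor, so the step from 4096 to 8192 goes from 0.5 GiB to 2 GiB per tensor.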

C — Single fixed RoPE table → fully shared weights

The encoder now builds its RoPE cos/sin tables at a single fixed size (max_position_embeddings) for every bucket, applied via a runtime position_ids gather (fold-proof — a static [:S] slice gets const-folded back to per-bucket). The only per-bucket-varying constant is gone, so weight.bin is now byte-identical across all buckets. Output is bit-identical to the old per-bucket RoPE (parity + dynamic-RangeDim + Swift int8 all verified). This is what makes the distribution below dedup.
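A minimal sketch of the fixed-table-plus-gather idea, in plain Python rather than the PyTorch used by the converter. The table size is independent of the bucket, and positions come from the runtime attention mask (cumulative sum minus one), so the row selection depends on an input rather than on a constant slice. The toy table size, `head_dim=8`, right-padding, and the helper names are assumptions of the example.

```python
import math
from itertools import accumulate


def build_rope_table(max_positions, head_dim, theta=1e6):
    """cos/sin tables of shape [max_positions, head_dim // 2] using standard RoPE frequencies."""
    half = head_dim // 2
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_positions)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_positions)]
    return cos, sin


def position_ids_from_mask(attention_mask):
    """cumsum(mask) - 1; with right-padding, pad slots repeat the last real position."""
    return [c - 1 for c in accumulate(attention_mask)]


def gather_rows(table, ids):
    """Row gather driven by runtime indices (the role index_select plays in the model)."""
    return [table[i] for i in ids]


MAX_POSITIONS = 64            # toy size; the PR uses max_position_embeddings
COS, SIN = build_rope_table(MAX_POSITIONS, head_dim=8)

mask = [1, 1, 1, 0, 0]        # 3 real tokens padded to a bucket of 5
position_ids = position_ids_from_mask(mask)
cos_rows = gather_rows(COS, position_ids)
```

Since `COS` and `SIN` have the same shape for every bucket, nothing in the stored constants depends on the bucket length, which is the property the shared weight.bin relies on.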

D — Hugging Face distribution with end-to-end dedup

conversion/upload_pplx_embed.py publishes the prebuilt buckets to one HF repo — per-bucket subfolders + manifest.json, shipping both .mlmodelc (default — no on-device compile) and .mlpackage per bucket. Because all buckets now share one weight blob:

- Upload (~2.3 GB, not ~14 GB): conversion/upload_pplx_embed_dedup.py uploads each unique blob once and server-side-copies the duplicate paths (CommitOperationCopy), sidestepping hf upload-large-folder's re-upload of identical oids across its parallel batches.
- Download (~1.15 GB, not ~3.5 GB, for the default 3 buckets): PplxEmbed.load(repo:buckets:preferCompiled:) uses HF's native Swift Hub client (swift-huggingface HubClient.downloadSnapshot, with the Xet trait, since HF stores large files Xet-backed by default) and downloads into the content-addressed cache, so the shared weight is fetched once no matter how many buckets you pull. Downloads are selective per bucket, one format per bucket, with exact-path matching; the result feeds the existing load(bundleDir:). pplx-embed-demo --repo exercises it end to end.
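The upload-side dedup amounts to grouping files by content hash, uploading the first path of each group, and copying the rest. The sketch below only builds that plan from in-memory bytes and makes no Hub calls. In a real uploader the two lists would presumably map to huggingface_hub's CommitOperationAdd and CommitOperationCopy; the file paths and the `plan_dedup` name are hypothetical.

```python
import hashlib


def plan_dedup(files):
    """Split {path: content_bytes} into (uploads, copies).

    uploads: paths whose content is seen for the first time.
    copies:  (source_path, duplicate_path) pairs with identical content.
    """
    first_path_by_digest = {}
    uploads, copies = [], []
    for path in sorted(files):
        digest = hashlib.sha256(files[path]).hexdigest()
        if digest in first_path_by_digest:
            copies.append((first_path_by_digest[digest], path))
        else:
            first_path_by_digest[digest] = path
            uploads.append(path)
    return uploads, copies


# Hypothetical layout: two buckets sharing one weight blob, distinct manifests.
files = {
    "L512/weight.bin": b"shared-weights",
    "L1024/weight.bin": b"shared-weights",
    "L512/manifest.json": b'{"bucket": 512}',
    "L1024/manifest.json": b'{"bucket": 1024}',
}
uploads, copies = plan_dedup(files)
```

The transfer cost is then the total size of `uploads` only, independent of how many buckets reference the same blob.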

Verification

- A: native RMSNorm is 12.7% faster @L256 / 21.5% faster @L512 on CPU_AND_NE vs ane_cat, at identical 99.81% ANE residency and cosine 0.99998 vs the fp32 oracle. Parity gate PASS.
- B: L=8192 static plan = 99.81% ANE but ANE inference fails (status=0x15); ~25 min compile. Decision: do not ship; 4097–8192 stays on the GPU catch-all.
- C: single fixed RoPE table → weight.bin byte-identical across buckets (verified by sha256); parity PASS, dynamic-RangeDim + Swift int8 bit-identical to the old encoder.
- D: end-to-end download verified live: pplx-embed-demo --repo dokterbob/pplx-embed-coreml --buckets 512 returns correct embeddings, and the content-addressed cache holds one 1.1 GB blob with two weight.bin paths pointing to it (download dedup confirmed). Upload deduped via server-side copy.

Follow-ups

- Shared-decoder native RMSNorm: roll norm_impl="native" to the shared ane_ops.ANERMSNorm only per-family, after re-validating decode + prefill latency, residency, and parity (docs/ANE_RMSNORM_FOLLOWUP.md). Do not flip the shared default globally off the pplx-embed result alone.
- B3: mMARCO calibration + multilingual retrieval eval (nDCG@10 across languages).
- Chunk-and-pool for >8192 tokens: stay within a fixed ANE bucket instead of the GPU catch-all.

Uses the existing pip/venv conversion workflow (a separate PR adds an optional uv manifest). Verified: test_pplx_embed_parity.py PASS, swift build clean.

🤖 Generated with Claude Code

Adds a stateless bidirectional Qwen3-0.6B encoder path for Perplexity's pplx-embed models (plain + late-chunking context), running 99.8% on the ANE.

Conversion (conversion/):
- models/qwen3_encoder.py: ANE encoder (Conv2d-1x1, ANERMSNorm, QK-norm, RoPE theta=1e6, SwiGLU, GQA 16/8, full bidirectional pad-mask). Derives seq-len + batch from the input; fp16 residual rescale (K=8) for overflow.
- build_pplx_embed_bundle.py: fixed-shape ANE buckets (the fast path) + a flexible RangeDim GPU model (--dynamic-upper) for the >max-bucket catch-all; plain + context (pool_matrix); native int8 / pooled_fp16 output.
- pplx_embed_reference.py: fp32 golden oracle, bit-exact with the model's st_quantize.py (int8/binary/ubinary). test_pplx_embed_parity.py gates it.
- config.py: registry entries pplx-embed, pplx-embed-context.

Swift (Sources/):
- CoreMLLLM/PplxEmbed.swift: [String] -> int8/binary/ubinary (plain) and per-chunk context; tokenize -> smallest fitting bucket -> ANE, with >max-bucket routed to the flexible GPU model. pplx-embed-demo CLI + pplx-embed-bench harness.

Fidelity (cosine vs fp32): plain int8 >=0.999, context >=0.998, dynamic GPU 0.9996.

Docs in docs/PPLX_EMBED.md (+ W8A8 and batching findings).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


…artition) Add docs/PPLX_EMBED_GPU_RESIDENCY.md (op×dtype×device breakdown, GPU-native control, timing proving CPU_AND_GPU < CPU_ONLY at B=1); resolve the batching doc's open question. Narrow .gitignore to generated artifacts only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


WIP across three workstreams on the pplx-embed encoder.

A: ANE micro-opt (native RMSNorm), shipped default. Adds a local Qwen3RMSNorm (native rsqrt(mean(x²))·w) + a `norm_impl` config selecting it at the 5 encoder norm sites, leaving the shared ane_ops.ANERMSNorm untouched. A/B (experiment_ane_rmsnorm.py) isolates the RMSNorm as the cause of the earlier GPU-residency-investigation speedup: native is 12.7% faster @L256 / 21.5% faster @L512 on CPU_AND_NE at identical 99.81% ANE residency and cosine 0.99998 vs the fp32 oracle. norm_impl="native" is now the default; parity PASS. Shared decoder rollout flagged in docs/ANE_RMSNORM_FOLLOWUP.md (not done).

B: Larger fixed ANE bucket (L=8192): measured, does NOT ship. measure_l8192_bucket.py shows L=8192 statically maps 99.81% to ANE but the ANE runtime FAILS at inference (ANEProgramProcessRequestDirect status=0x15): the 8192² attention intermediates exceed ANE buffer limits and the ANE compile alone takes ~25 min. Largest fixed ANE bucket stays 4096; 4097–8192 tokens keep the dynamic GPU catch-all.

C: HF distribution. upload_pplx_embed.py publishes one repo with per-bucket subfolders + manifest.json (sizes only; no upfront hashing), compiles + ships both .mlmodelc and .mlpackage, stages a hardlink tree, and prints a resumable `hf upload-large-folder` command. PplxEmbed.load(repo:buckets:preferCompiled:) does manifest-driven selective download (one format per bucket) → existing load(bundleDir:); pplx-embed-demo gains --repo. Note: RoPE tables baked into weight.bin make each bucket a distinct ~1.19 GB blob (cross-bucket dedup follow-up flagged in ROADMAP).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… across buckets

The RoPE cos/sin tables were baked into weight.bin sized to each bucket's max_seq_len, so every bucket shipped a distinct ~1.19 GB blob (no cross-bucket LFS dedup; ~7 GB for 6 buckets).

Build the RoPE table once to a single fixed length (max_position_embeddings, 32768), decoupled from the bucket. A plain static `cos_cached[:S]` slice is const-folded by coremltools back to a per-bucket [S, head_dim] constant (verified: weight.bin stayed distinct, differing by exactly the per-bucket RoPE delta). Fix is fold-proof: forward() derives position_ids from a runtime input (cumsum(attention_mask) − 1) and gathers rows [0..S-1] via index_select, so the indices are runtime-dependent and the gather can't be const-folded. No new model input, no Swift contract change.

Result: weight.bin is byte-identical across buckets (L512 ≡ L1024, sha256 1d844c71…, size 1,208,897,088 B) → HF LFS stores it once (~1.2 GB total vs ~7 GB). Each blob is ~1.21 GB now (full 32768-row table) but shared.

Gates held: pre-convert parity pooled 0.999993 / int8 0.999941; CoreML L512 pooled_fp16 cosine vs fp32 oracle min 0.999962; ANE residency 99.29% (unchanged); compile time unchanged (~0.3 s).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…uckets After the single-fixed-RoPE-table change, all plain buckets share one weight.bin blob (context is a second), so the real HF upload is ~2 weight blobs regardless of bucket count — not ~1.19 GB/bucket. Update the script's closing message. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… + dedup uploader

Download: PplxEmbed.load(repo:) now uses swift-huggingface's HubClient.downloadSnapshot (glob-filtered to the requested buckets/format) instead of the custom raw-HTTP downloader. Its content-addressed cache fetches the byte-identical encoder weight.bin ONCE by etag and reuses it across buckets, so the default 3-bucket pull moves ~1.15 GB instead of ~3.5 GB: native download dedup, no custom code. swift-huggingface is added as a direct dependency; it is standalone (no swift-transformers dep), so it stays orthogonal to the 1.0.x cap kept for MLX consumers. Drops the hand-rolled content-addressing, the manifest sha256, and the per-file download list (globs derive from subfolder + formats).

Upload: conversion/upload_pplx_embed_dedup.py uploads each unique blob once and uses HF server-side CommitOperationCopy to materialize the duplicate weight paths; net transfer is the 2 unique weight blobs (~2.3 GB), not ~14 GB, sidestepping upload-large-folder's re-upload of identical oids across its parallel batches. Skips dot-dirs (the .cache/huggingface resume folder). Manifest reverts to size-only (the cache dedups by etag, so no per-file sha is needed and staging stays instant).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ent-addressed download The card claimed weights differ by RoPE length per bucket; after the single fixed RoPE table they are byte-identical across buckets, and the Swift load(repo:) (HF Hub client) fetches the shared weight once by etag. Update the Use-it example to the optional-into: signature + note both formats / preferCompiled. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…paths

Two fixes to get PplxEmbed.load(repo:) downloading from a real HF repo:

1. Enable the swift-huggingface `Xet` package trait (bumps swift-tools-version to 6.1). HF stores large files Xet-backed by default (the encoder weight.bin has a xet_hash even when uploaded with HF_HUB_DISABLE_XET=1); without the trait the client forces the LFS transport. Xet also gives chunk-level dedup on top of the etag blob cache.
2. Pass the manifest's EXACT file paths to downloadSnapshot(matching:) instead of wildcard globs. `listFiles(recursive:)` also returns directory entries, so a glob like `encoder.mlmodelc/*` matched the `analytics/`/`weights/` dirs and 404'd ("Entry not found") trying to GET a directory. Exact paths match only the file blobs.

Verified end to end: `pplx-embed-demo --repo dokterbob/pplx-embed-coreml --buckets 512` downloads via the content-addressed cache and returns bit-identical embeddings; the shared weight is fetched ONCE (cache holds one 1.1 GB blob, two weight.bin paths point to it), so download dedup is confirmed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dokterbob left review comments on Jun 18, 2026: two threads on docs/PPLX_EMBED_BATCHING.md (one now outdated) and one on docs/PPLX_EMBED.md (outdated).

docs: address review — remove MLX framing, note M4 Max, flag GPU resi…

75e4f52

…dency Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dokterbob marked this pull request as ready for review June 18, 2026 10:48

Copilot AI review requested due to automatic review settings June 18, 2026 10:48

Copilot AI reviewed Jun 18, 2026

dokterbob and others added 7 commits June 18, 2026 12:25

dokterbob wants to merge 9 commits into john-rocky:main from dokterbob:feat/pplx-embed
