Add pplx-embed (Perplexity) bidirectional Qwen3 encoder on the ANE #169

Open
dokterbob wants to merge 9 commits into john-rocky:main from dokterbob:feat/pplx-embed

dokterbob commented Jun 17, 2026

Adds a stateless bidirectional Qwen3-0.6B encoder path for Perplexity's pplx-embed embedding models, running 99.8% on the ANE. Two variants: plain sentence embeddings and late chunking (per-chunk embeddings via a pool_matrix matmul, with one encoder pass over the whole window).
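
As a rough illustration of the late-chunking variant, the sketch below pools per-token hidden states into per-chunk vectors with a single matrix product. It assumes mean pooling over half-open token spans; `build_pool_matrix` and `matmul` are hypothetical helpers, and the PR's actual pool_matrix layout and pooling rule may differ.

```python
def build_pool_matrix(chunk_spans, seq_len):
    """One row per chunk: weight 1/len(chunk) on the chunk's tokens, 0 elsewhere.

    Mean pooling is an assumption made for this sketch.
    """
    rows = []
    for start, end in chunk_spans:  # half-open span [start, end)
        width = end - start
        rows.append([1.0 / width if start <= t < end else 0.0 for t in range(seq_len)])
    return rows


def matmul(a, b):
    """Plain-Python matrix product, shapes [m, k] x [k, n] -> [m, n]."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy encoder output: 4 tokens, hidden size 2, from one pass over the whole window.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

# Two chunks of two tokens each, pooled in one matmul.
pool = build_pool_matrix([(0, 2), (2, 4)], seq_len=4)
chunk_embeddings = matmul(pool, hidden)
```

Because the pooling is a matmul against a fixed-size matrix, the chunk boundaries can change per input without changing the model's shapes.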

Conversion (conversion/)

  • models/qwen3_encoder.py — ANE encoder reusing the existing primitives (Conv2d-1×1 projections, repeat_kv_ane, stable_attention): QK-norm, RoPE θ=1e6, SwiGLU, GQA 16/8, full bidirectional pad-mask. Derives seq-len + batch from the input. fp16 residual rescale (K=8) keeps the 28-layer down_proj accumulation in range (exact for a pre-norm net). RMSNorm is now a local Qwen3RMSNorm selected via norm_impl (see workstream A).
  • build_pplx_embed_bundle.py — fixed-shape ANE buckets (the fast path) + an optional flexible RangeDim GPU model (--dynamic-upper N) as the >max-bucket catch-all; plain + context; native int8 / pooled_fp16 output.
  • pplx_embed_reference.py — fp32 golden oracle, bit-exact with the model's own st_quantize.py (int8/binary/ubinary). test_pplx_embed_parity.py gates it.
  • config.py — registry entries pplx-embed, pplx-embed-context.
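
The "exact for a pre-norm net" remark about the K=8 residual rescale follows from RMSNorm being scale-invariant: every sublayer reads the residual stream through a norm, so dividing the stream by K leaves the normalized values unchanged. A minimal sketch of that property, assuming eps is scaled by 1/K² alongside the stream (how the PR treats eps is not stated):

```python
import math


def rms_norm(x, eps):
    """x * rsqrt(mean(x^2) + eps), without a learned weight."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


K = 8.0
eps = 1e-6

# Residual values close to the fp16 maximum of 65504.
x = [30000.0, -45000.0, 12000.0, 60000.0]

reference = rms_norm(x, eps)
# Same stream stored at 1/K scale; eps scaled by 1/K^2 (assumption for this sketch).
rescaled = rms_norm([v / K for v in x], eps / (K * K))

max_diff = max(abs(a - b) for a, b in zip(reference, rescaled))
largest_stored = max(abs(v / K) for v in x)  # 7500.0, well inside fp16 range
```

With a fixed eps the two results still agree closely whenever mean(x²) is much larger than eps, which is the usual case for residual activations.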

Swift (Sources/)

  • CoreMLLLM/PplxEmbed.swift: [String] → int8/binary/ubinary (plain) and per-chunk context; tokenize → smallest fitting bucket → ANE, with n > largest bucket routed to the flexible GPU model (non-padded). Adds PplxEmbed.load(repo:buckets:preferCompiled:) for manifest-driven download (see workstream D). pplx-embed-demo CLI (--repo) + pplx-embed-bench harness.

Fidelity (cosine vs fp32 reference)

plain int8 ≥0.999, context ≥0.998, dynamic GPU 0.9996. Swift-vs-Python end-to-end 0.9998.
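
The fidelity numbers above are cosine similarities against the fp32 reference. A parity gate of that kind can be sketched as follows; `cosine` and `passes_gate` are hypothetical names, the 0.999 default is taken from the plain int8 figure, and the rounding step is only a stand-in for real int8 output.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_gate(candidate, reference, threshold=0.999):
    """True when the candidate embedding is close enough to the reference."""
    return cosine(candidate, reference) >= threshold


reference = [0.12, -0.53, 0.88, 0.07]
# Crude int8-style rounding, for illustration only (not the model's st_quantize.py).
candidate = [round(v * 127) / 127 for v in reference]
```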

Design decision: fixed buckets only on the ANE

Flexible input shapes (EnumeratedShapes/RangeDim) force a CPU fallback instead of running on the ANE and are ~10× slower (measured: 969 ms @512 vs 101 ms for the fixed shape). So the ANE path is one fixed-shape .mlpackage per bucket (pad to the smallest fitting bucket); the flexible RangeDim model is GPU-only and used solely as the >max-bucket catch-all.
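
The routing rule this implies can be sketched in a few lines. The bucket list and the `route` helper are hypothetical; only the 4096 largest bucket and the 8192 upper bound for the GPU catch-all come from this PR.

```python
def route(n_tokens, buckets=(512, 1024, 4096), dynamic_upper=8192):
    """Return (device, model_length) for a tokenized input of n_tokens.

    buckets is a hypothetical set of fixed ANE shapes; dynamic_upper mirrors
    the --dynamic-upper bound of the flexible GPU model.
    """
    for size in sorted(buckets):
        if n_tokens <= size:
            return ("ane", size)  # pad up to the smallest fitting fixed bucket
    if n_tokens <= dynamic_upper:
        return ("gpu", n_tokens)  # flexible RangeDim model, no padding
    raise ValueError(f"{n_tokens} tokens exceeds the dynamic upper bound {dynamic_upper}")
```

For example, `route(300)` picks the 512 bucket on the ANE, while `route(5000)` falls through to the GPU model at its native length.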

Side investigations (reproducible scripts + write-ups in docs/)

  1. Quantization — docs/PPLX_EMBED_W8A8.md + conversion/experiment_w8a8.py. Weight-only int8 collapses (~0.42 cos), int4-palettize only reaches 0.905, full W8A8 collapses to ~0. ANE residency stays fine, so it's purely numerical — and moot anyway since int8 activations are only ~9% faster here (not bandwidth-bound). Conclusion: ship fp16 + buckets; recovery would need full QAT.
  2. Batching throughput — docs/PPLX_EMBED_BATCHING.md + conversion/experiment_batching.py. Batching is not a useful throughput lever: the ANE is batch-1 by design (batching hurts, 0.69×, confirmed 100% on-ANE), the GPU path batches only modestly (~1.4× at small L) and saturates by L≈512. Batch-1 on the ANE at the smallest bucket is the throughput winner. (Measured on an Apple M4 Max.)

What changed (this session)

Four workstreams on top of the encoder path above. All measurements: Apple M4 Max, macOS 26, coremltools 9, torch 2.11, B=1.

A — Native RMSNorm, shipped as the default

A new local Qwen3RMSNorm (x * rsqrt(mean(x²) + eps) * w) plus a norm_impl config selecting it at the 5 encoder norm sites, leaving the shared ane_ops.ANERMSNorm (the cat([x, −x]) → LayerNorm → chunk trick) untouched. The A/B (conversion/experiment_ane_rmsnorm.py, changing only the norm sites and holding Conv2d-1×1 projections + tensor layout fixed) isolated the RMSNorm as the cause of the ANE speedup seen incidentally during the earlier GPU-residency investigation. The cat/chunk trick predated a fast native ANE rsqrt; on this stack it is now a de-optimization. norm_impl="native" is now the pplx-embed default. Resolves the open follow-up flagged in docs/PPLX_EMBED_GPU_RESIDENCY.md. Shared-decoder rollout (the ~10 other families on ane_ops.ANERMSNorm) is flagged but not done — needs per-family re-validation — in docs/ANE_RMSNORM_FOLLOWUP.md.
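
To see why the two norm implementations are interchangeable numerically, note that cat([x, −x]) has zero mean, so its LayerNorm variance equals mean(x²) and the first half of the output is the RMS-normalized x. The plain-Python sketch below checks that; it is not the PR's Qwen3RMSNorm or ane_ops.ANERMSNorm code, and the eps placement in the cat variant is an assumption.

```python
import math


def native_rms_norm(x, w, eps=1e-6):
    """x * rsqrt(mean(x^2) + eps) * w"""
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv * g for v, g in zip(x, w)]


def cat_layernorm_rms_norm(x, w, eps=1e-6):
    """cat([x, -x]) -> LayerNorm (no affine) -> keep the first half -> * w"""
    doubled = x + [-v for v in x]  # mean is 0 up to rounding
    mean = sum(doubled) / len(doubled)
    var = sum((v - mean) ** 2 for v in doubled) / len(doubled)  # equals mean(x^2)
    normed = [(v - mean) / math.sqrt(var + eps) for v in doubled]
    first_half = normed[: len(x)]  # the chunk step
    return [v * g for v, g in zip(first_half, w)]


x = [0.5, -1.25, 2.0, 3.0]
w = [1.0, 0.5, 2.0, 1.5]
native = native_rms_norm(x, w)
via_cat = cat_layernorm_rms_norm(x, w)
max_diff = max(abs(a - b) for a, b in zip(native, via_cat))
```

The cat variant normalizes a tensor twice as wide to get the same values, which is consistent with it costing more once a fast native rsqrt is available.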

B — Larger fixed ANE bucket (L=8192): measured, does NOT ship

conversion/measure_l8192_bucket.py shows a fixed L=8192 bucket statically maps 99.81% of ops to the ANE, but the ANE runtime fails at inference (ANEProgramProcessRequestDirect status=0x15): the 8192² full-attention intermediates (16 heads × 8192² fp16 ≈ 2 GB per score tensor) exceed ANE buffer limits, and the ANE compile alone takes ~25 min. So the largest fixed ANE bucket stays 4096; 4097–8192 tokens keep the existing dynamic GPU catch-all.
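
The 2 GB figure is straightforward arithmetic on the score tensor shape. A small check, assuming batch size 1 and one fp16 [heads, L, L] score tensor (`score_tensor_bytes` is a hypothetical helper):

```python
def score_tensor_bytes(seq_len, heads=16, bytes_per_elem=2):
    """Size of one [heads, L, L] attention score tensor in fp16, batch size 1."""
    return heads * seq_len * seq_len * bytes_per_elem


GIB = 1024 ** 3
size_8192 = score_tensor_bytes(8192)  # 2 GiB
size_4096 = score_tensor_bytes(4096)  # 0.5 GiB, the largest bucket that ships
```

Doubling L quadruples the tensor, so the step from 4096 to 8192 is a 4× jump in this one intermediate.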

C — Single fixed RoPE table → fully shared weights

The encoder now builds its RoPE cos/sin tables at a single fixed size (max_position_embeddings) for every bucket, applied via a runtime position_ids gather (fold-proof — a static [:S] slice gets const-folded back to per-bucket). The only per-bucket-varying constant is gone, so weight.bin is now byte-identical across all buckets. Output is bit-identical to the old per-bucket RoPE (parity + dynamic-RangeDim + Swift int8 all verified). This is what makes the distribution below dedup.
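
A plain-Python sketch of the idea: build one table sized to the maximum position count, derive position_ids from the attention mask as cumsum(attention_mask) − 1 (as described in the commit message), and gather rows by those ids. The helper names are hypothetical, the table is shortened to 64 rows for illustration, and right padding is assumed.

```python
import math


def build_cos_table(max_positions, head_dim, theta=1e6):
    """cos half of a RoPE table, one row per absolute position."""
    half = head_dim // 2
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(half)]
    return [[math.cos(pos * f) for f in inv_freq] for pos in range(max_positions)]


def position_ids_from_mask(attention_mask):
    """cumsum(attention_mask) - 1, so the ids depend on a runtime input."""
    ids, running = [], 0
    for m in attention_mask:
        running += m
        ids.append(running - 1)
    return ids


def gather_rows(table, position_ids):
    """Row gather, the role index_select plays in the encoder."""
    return [table[p] for p in position_ids]


MAX_POSITIONS = 64  # stand-in for max_position_embeddings (32768 per the commit message)
table = build_cos_table(MAX_POSITIONS, head_dim=8)  # same table for every bucket

mask = [1, 1, 1, 0, 0]  # right-padded input (assumption)
ids = position_ids_from_mask(mask)
cos_rows = gather_rows(table, ids)
```

The table's size never depends on the bucket length, which is what keeps the stored constants identical across buckets.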

D — Hugging Face distribution with end-to-end dedup

conversion/upload_pplx_embed.py publishes the prebuilt buckets to one HF repo — per-bucket subfolders + manifest.json, shipping both .mlmodelc (default — no on-device compile) and .mlpackage per bucket. Because all buckets now share one weight blob:

  • Upload (~2.3 GB, not ~14 GB): conversion/upload_pplx_embed_dedup.py uploads each unique blob once and server-side-copies the duplicate paths (CommitOperationCopy) — sidestepping hf upload-large-folder's re-upload of identical oids across its parallel batches.
  • Download (~1.15 GB, not ~3.5 GB for the default 3 buckets): PplxEmbed.load(repo:buckets:preferCompiled:) uses HF's native Swift Hub client (swift-huggingface HubClient.downloadSnapshot, with the Xet trait — HF stores large files Xet-backed by default) into the content-addressed cache, so the shared weight is fetched once no matter how many buckets you pull. Selective per-bucket, one format per bucket, exact-path matching; feeds the existing load(bundleDir:). pplx-embed-demo --repo exercises it end to end.
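
The upload-side dedup amounts to grouping paths by content hash, uploading one path per hash, and copying the rest. The sketch below only builds that plan and makes no Hub calls; `plan_upload` and the example paths are hypothetical, and in the real script the copies are expressed as CommitOperationCopy operations.

```python
import hashlib


def plan_upload(files):
    """Split files into unique uploads and server-side copies.

    files maps repo path -> content bytes. Returns (uploads, copies), where
    copies is a list of (source_path, destination_path) pairs.
    """
    first_path_by_digest = {}
    uploads, copies = [], []
    for path in sorted(files):
        digest = hashlib.sha256(files[path]).hexdigest()
        if digest in first_path_by_digest:
            copies.append((first_path_by_digest[digest], path))
        else:
            first_path_by_digest[digest] = path
            uploads.append(path)
    return uploads, copies


# Toy stand-ins: two plain buckets share one weight blob, context has its own.
files = {
    "L512/weight.bin": b"plain-weights",
    "L1024/weight.bin": b"plain-weights",
    "context/weight.bin": b"context-weights",
}
uploads, copies = plan_upload(files)
```

Here two of the three paths are uploaded and the third is materialized by a copy, mirroring the "2 unique weight blobs" outcome described above.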

Verification

  • A: native RMSNorm is 12.7% faster @L256 / 21.5% faster @L512 on CPU_AND_NE vs ane_cat, at identical 99.81% ANE residency and cosine 0.99998 vs the fp32 oracle. Parity gate PASS.
  • B: L=8192 static plan = 99.81% ANE but ANE inference fails (status=0x15); ~25 min compile. Decision: do not ship; 4097–8192 stays on the GPU catch-all.
  • C: single fixed RoPE table → weight.bin byte-identical across buckets (verified by sha256); parity PASS, dynamic-RangeDim + Swift int8 bit-identical to the old encoder.
  • D: end-to-end download verified live — pplx-embed-demo --repo dokterbob/pplx-embed-coreml --buckets 512 returns correct embeddings, and the content-addressed cache holds one 1.1 GB blob with two weight.bin paths pointing to it (download dedup confirmed). Upload deduped via server-side copy.

Follow-ups

  • Shared-decoder native RMSNorm — roll norm_impl="native" to the shared ane_ops.ANERMSNorm only per-family, after re-validating decode + prefill latency, residency, and parity (docs/ANE_RMSNORM_FOLLOWUP.md). Do not flip the shared default globally off the pplx-embed result alone.
  • B3 — mMARCO calibration + multilingual retrieval eval (nDCG@10 across languages).
  • chunk-and-pool for >8192 tokens — stay within a fixed ANE bucket instead of the GPU catch-all.

Uses the existing pip/venv conversion workflow (a separate PR adds an optional uv manifest). Verified: test_pplx_embed_parity.py PASS, swift build clean.

🤖 Generated with Claude Code

Adds a stateless bidirectional Qwen3-0.6B encoder path for Perplexity's
pplx-embed models (plain + late-chunking context), running 99.8% on the ANE.

Conversion (conversion/):
- models/qwen3_encoder.py — ANE encoder (Conv2d-1x1, ANERMSNorm, QK-norm,
  RoPE theta=1e6, SwiGLU, GQA 16/8, full bidirectional pad-mask). Derives
  seq-len + batch from the input; fp16 residual rescale (K=8) for overflow.
- build_pplx_embed_bundle.py — fixed-shape ANE buckets (the fast path) + a
  flexible RangeDim GPU model (--dynamic-upper) for the >max-bucket catch-all;
  plain + context (pool_matrix); native int8 / pooled_fp16 output.
- pplx_embed_reference.py — fp32 golden oracle, bit-exact with the model's
  st_quantize.py (int8/binary/ubinary). test_pplx_embed_parity.py gates it.
- config.py — registry entries pplx-embed, pplx-embed-context.

Swift (Sources/):
- CoreMLLLM/PplxEmbed.swift — [String] -> int8/binary/ubinary (plain) and
  per-chunk context; tokenize -> smallest fitting bucket -> ANE, with >max-bucket
  routed to the flexible GPU model. pplx-embed-demo CLI + pplx-embed-bench harness.

Fidelity (cosine vs fp32): plain int8 >=0.999, context >=0.998, dynamic GPU 0.9996.
Docs in docs/PPLX_EMBED.md (+ W8A8 and batching findings).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…dency

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dokterbob marked this pull request as ready for review June 18, 2026 10:48
Copilot AI review requested due to automatic review settings June 18, 2026 10:48

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

dokterbob and others added 7 commits June 18, 2026 12:25
…artition)

Add docs/PPLX_EMBED_GPU_RESIDENCY.md (op×dtype×device breakdown, GPU-native
control, timing proving CPU_AND_GPU < CPU_ONLY at B=1); resolve the batching
doc's open question. Narrow .gitignore to generated artifacts only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
WIP across three workstreams on the pplx-embed encoder.

A — ANE micro-opt (native RMSNorm), shipped default. Adds a local Qwen3RMSNorm
(native rsqrt(mean(x²))·w) + a `norm_impl` config selecting it at the 5 encoder
norm sites, leaving the shared ane_ops.ANERMSNorm untouched. A/B
(experiment_ane_rmsnorm.py) isolates the RMSNorm as the cause of the earlier
GPU-residency-investigation speedup: native is 12.7% faster @L256 / 21.5% @L512
on CPU_AND_NE at identical 99.81% ANE residency and cosine 0.99998 vs the fp32
oracle. norm_impl="native" is now the default; parity PASS. Shared decoder
rollout flagged in docs/ANE_RMSNORM_FOLLOWUP.md (not done).

B — Larger fixed ANE bucket (L=8192): measured, does NOT ship. measure_l8192_bucket.py
shows L=8192 statically maps 99.81% to ANE but the ANE runtime FAILS at inference
(ANEProgramProcessRequestDirect status=0x15) — the 8192² attention intermediates
exceed ANE buffer limits and the ANE compile alone takes ~25 min. Largest fixed
ANE bucket stays 4096; 4097–8192 tokens keep the dynamic GPU catch-all.

C — HF distribution. upload_pplx_embed.py publishes one repo with per-bucket
subfolders + manifest.json (sizes only; no upfront hashing), compiles + ships both
.mlmodelc and .mlpackage, stages a hardlink tree, and prints a resumable
`hf upload-large-folder` command. PplxEmbed.load(repo:buckets:preferCompiled:) does
manifest-driven selective download (one format per bucket) → existing load(bundleDir:);
pplx-embed-demo gains --repo. Note: RoPE tables baked into weight.bin make each
bucket a distinct ~1.19 GB blob (cross-bucket dedup follow-up flagged in ROADMAP).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… across buckets

The RoPE cos/sin tables were baked into weight.bin sized to each bucket's
max_seq_len, so every bucket shipped a distinct ~1.19 GB blob (no cross-bucket
LFS dedup; ~7 GB for 6 buckets).

Build the RoPE table once to a single fixed length (max_position_embeddings,
32768), decoupled from the bucket. A plain static `cos_cached[:S]` slice is
const-folded by coremltools back to a per-bucket [S, head_dim] constant
(verified: weight.bin stayed distinct, differing by exactly the per-bucket RoPE
delta). Fix is fold-proof: forward() derives position_ids from a runtime input
(cumsum(attention_mask) − 1) and gathers rows [0..S-1] via index_select, so the
indices are runtime-dependent and the gather can't be const-folded. No new model
input, no Swift contract change.

Result: weight.bin is byte-identical across buckets (L512 ≡ L1024, sha256
1d844c71…, size 1,208,897,088 B) → HF LFS stores it once (~1.2 GB total vs
~7 GB). Each blob is ~1.21 GB now (full 32768-row table) but shared.

Gates held: pre-convert parity pooled 0.999993 / int8 0.999941; CoreML L512
pooled_fp16 cosine vs fp32 oracle min 0.999962; ANE residency 99.29% (unchanged);
compile time unchanged (~0.3 s).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…uckets

After the single-fixed-RoPE-table change, all plain buckets share one weight.bin
blob (context is a second), so the real HF upload is ~2 weight blobs regardless of
bucket count — not ~1.19 GB/bucket. Update the script's closing message.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… + dedup uploader

Download: PplxEmbed.load(repo:) now uses swift-huggingface's HubClient.downloadSnapshot
(glob-filtered to the requested buckets/format) instead of the custom raw-HTTP
downloader. Its content-addressed cache fetches the byte-identical encoder weight.bin
ONCE by etag and reuses it across buckets, so the default 3-bucket pull moves ~1.15 GB
instead of ~3.5 GB — native download dedup, no custom code. swift-huggingface is added as
a direct dependency; it is standalone (no swift-transformers dep), so it stays orthogonal
to the 1.0.x cap kept for MLX consumers. Drops the hand-rolled content-addressing, the
manifest sha256, and the per-file download list (globs derive from subfolder + formats).

Upload: conversion/upload_pplx_embed_dedup.py uploads each unique blob once and uses HF
server-side CommitOperationCopy to materialize the duplicate weight paths — net transfer
is the 2 unique weight blobs (~2.3 GB), not ~14 GB, sidestepping upload-large-folder's
re-upload of identical oids across its parallel batches. Skips dot-dirs (the
.cache/huggingface resume folder). Manifest reverts to size-only (the cache dedups by
etag, so no per-file sha is needed and staging stays instant).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ent-addressed download

The card claimed weights differ by RoPE length per bucket; after the single fixed RoPE
table they are byte-identical across buckets, and the Swift load(repo:) (HF Hub client)
fetches the shared weight once by etag. Update the Use-it example to the optional-into:
signature + note both formats / preferCompiled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…paths

Two fixes to get PplxEmbed.load(repo:) downloading from a real HF repo:

1. Enable the swift-huggingface `Xet` package trait (bumps swift-tools-version to 6.1).
   HF stores large files Xet-backed by default (the encoder weight.bin has a xet_hash
   even when uploaded with HF_HUB_DISABLE_XET=1); without the trait the client forces the
   LFS transport. Xet also gives chunk-level dedup on top of the etag blob cache.

2. Pass the manifest's EXACT file paths to downloadSnapshot(matching:) instead of wildcard
   globs. `listFiles(recursive:)` also returns directory entries, so a glob like
   `encoder.mlmodelc/*` matched the `analytics/`/`weights/` dirs and 404'd ("Entry not
   found") trying to GET a directory. Exact paths match only the file blobs.

Verified end to end: `pplx-embed-demo --repo dokterbob/pplx-embed-coreml --buckets 512`
downloads via the content-addressed cache and returns bit-identical embeddings; the
shared weight is fetched ONCE (cache holds one 1.1 GB blob, two weight.bin paths point to
it) — download dedup confirmed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>