Add pplx-embed (Perplexity) bidirectional Qwen3 encoder on the ANE #169
Open
dokterbob wants to merge 9 commits into
Adds a stateless bidirectional Qwen3-0.6B encoder path for Perplexity's pplx-embed models (plain + late-chunking context), running 99.8% on the ANE.

Conversion (conversion/):
- models/qwen3_encoder.py: ANE encoder (Conv2d-1x1, ANERMSNorm, QK-norm, RoPE theta=1e6, SwiGLU, GQA 16/8, full bidirectional pad-mask). Derives seq-len + batch from the input; fp16 residual rescale (K=8) for overflow.
- build_pplx_embed_bundle.py: fixed-shape ANE buckets (the fast path) + a flexible RangeDim GPU model (--dynamic-upper) for the >max-bucket catch-all; plain + context (pool_matrix); native int8 / pooled_fp16 output.
- pplx_embed_reference.py: fp32 golden oracle, bit-exact with the model's st_quantize.py (int8/binary/ubinary). test_pplx_embed_parity.py gates it.
- config.py: registry entries pplx-embed, pplx-embed-context.

Swift (Sources/):
- CoreMLLLM/PplxEmbed.swift: [String] -> int8/binary/ubinary (plain) and per-chunk context; tokenize -> smallest fitting bucket -> ANE, with >max-bucket routed to the flexible GPU model. pplx-embed-demo CLI + pplx-embed-bench harness.

Fidelity (cosine vs fp32): plain int8 >=0.999, context >=0.998, dynamic GPU 0.9996.

Docs in docs/PPLX_EMBED.md (+ W8A8 and batching findings).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
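The routing rule in this commit (pad to the smallest fitting fixed bucket on the ANE, send anything longer to the flexible GPU model) can be sketched as below. This is an illustration only, not the Swift code in `PplxEmbed.swift`: the bucket list and the `route` helper are hypothetical, and only the 4096 ceiling and the 4097–8192 GPU range come from this PR.

```python
from bisect import bisect_left

# Hypothetical bucket sizes. The PR only confirms that the largest fixed ANE
# bucket is 4096 and that 4097-8192 tokens go to the dynamic GPU model.
ANE_BUCKETS = (256, 512, 1024, 2048, 4096)
DYNAMIC_UPPER = 8192  # assumed value passed as --dynamic-upper


def route(n_tokens: int) -> tuple[str, int]:
    """Return (device, sequence_length) for an input of n_tokens tokens."""
    if n_tokens < 1:
        raise ValueError("need at least one token")
    # Index of the first bucket whose size is >= n_tokens.
    i = bisect_left(ANE_BUCKETS, n_tokens)
    if i < len(ANE_BUCKETS):
        # Fixed-shape path: the input is padded up to the bucket size.
        return ("ane", ANE_BUCKETS[i])
    if n_tokens <= DYNAMIC_UPPER:
        # Flexible path: the GPU model runs at the true length, unpadded.
        return ("gpu", n_tokens)
    raise ValueError(f"{n_tokens} tokens exceeds the dynamic upper bound")
```

For example, `route(300)` picks the 512 bucket, while `route(5000)` falls through to the GPU model at its true length.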
dokterbob
commented
Jun 18, 2026
…dency Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…artition) Add docs/PPLX_EMBED_GPU_RESIDENCY.md (op×dtype×device breakdown, GPU-native control, timing proving CPU_AND_GPU < CPU_ONLY at B=1); resolve the batching doc's open question. Narrow .gitignore to generated artifacts only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
WIP across three workstreams on the pplx-embed encoder.

A: ANE micro-opt (native RMSNorm), shipped default. Adds a local Qwen3RMSNorm (native rsqrt(mean(x²))·w) + a `norm_impl` config selecting it at the 5 encoder norm sites, leaving the shared ane_ops.ANERMSNorm untouched. A/B (experiment_ane_rmsnorm.py) isolates the RMSNorm as the cause of the earlier GPU-residency-investigation speedup: native is 12.7% faster @L256 / 21.5% @L512 on CPU_AND_NE at identical 99.81% ANE residency and cosine 0.99998 vs the fp32 oracle. norm_impl="native" is now the default; parity PASS. Shared decoder rollout flagged in docs/ANE_RMSNORM_FOLLOWUP.md (not done).

B: Larger fixed ANE bucket (L=8192): measured, does NOT ship. measure_l8192_bucket.py shows L=8192 statically maps 99.81% to ANE but the ANE runtime FAILS at inference (ANEProgramProcessRequestDirect status=0x15): the 8192² attention intermediates exceed ANE buffer limits, and the ANE compile alone takes ~25 min. Largest fixed ANE bucket stays 4096; 4097–8192 tokens keep the dynamic GPU catch-all.

C: HF distribution. upload_pplx_embed.py publishes one repo with per-bucket subfolders + manifest.json (sizes only; no upfront hashing), compiles + ships both .mlmodelc and .mlpackage, stages a hardlink tree, and prints a resumable `hf upload-large-folder` command. PplxEmbed.load(repo:buckets:preferCompiled:) does manifest-driven selective download (one format per bucket) → existing load(bundleDir:); pplx-embed-demo gains --repo. Note: RoPE tables baked into weight.bin make each bucket a distinct ~1.19 GB blob (cross-bucket dedup follow-up flagged in ROADMAP).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
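A small numerical sketch of why the two RMSNorm formulations in workstream A are interchangeable: a LayerNorm over `cat([x, −x])` has zero mean and variance `mean(x²)`, so its first half equals the native `x * rsqrt(mean(x²) + eps)`. The function names and the `EPS` value are assumptions for illustration; this is plain Python, not the repo's `Qwen3RMSNorm` or `ane_ops.ANERMSNorm`.

```python
import math

EPS = 1e-6  # assumed epsilon; the real value comes from the model config


def rmsnorm_native(x, w, eps=EPS):
    """Native form: x * rsqrt(mean(x^2) + eps) * w."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * wi for v, wi in zip(x, w)]


def rmsnorm_via_layernorm(x, w, eps=EPS):
    """cat([x, -x]) -> LayerNorm (no affine) -> keep first half -> scale by w."""
    doubled = list(x) + [-v for v in x]
    mean = sum(doubled) / len(doubled)  # ~0 by construction
    var = sum((v - mean) ** 2 for v in doubled) / len(doubled)  # ~mean(x^2)
    normed = [(v - mean) / math.sqrt(var + eps) for v in doubled]
    return [v * wi for v, wi in zip(normed[: len(x)], w)]


x = [0.5, -1.25, 2.0, 0.75]
w = [1.0, 0.5, 2.0, 1.5]
max_diff = max(
    abs(a - b) for a, b in zip(rmsnorm_native(x, w), rmsnorm_via_layernorm(x, w))
)
```

The two agree to floating-point rounding, which is consistent with the A/B changing only latency and not parity. The doubled form simply does the work on twice as many elements.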
… across buckets

The RoPE cos/sin tables were baked into weight.bin sized to each bucket's max_seq_len, so every bucket shipped a distinct ~1.19 GB blob (no cross-bucket LFS dedup; ~7 GB for 6 buckets).

Build the RoPE table once to a single fixed length (max_position_embeddings, 32768), decoupled from the bucket. A plain static `cos_cached[:S]` slice is const-folded by coremltools back to a per-bucket [S, head_dim] constant (verified: weight.bin stayed distinct, differing by exactly the per-bucket RoPE delta). Fix is fold-proof: forward() derives position_ids from a runtime input (cumsum(attention_mask) − 1) and gathers rows [0..S-1] via index_select, so the indices are runtime-dependent and the gather can't be const-folded. No new model input, no Swift contract change.

Result: weight.bin is byte-identical across buckets (L512 ≡ L1024, sha256 1d844c71…, size 1,208,897,088 B) → HF LFS stores it once (~1.2 GB total vs ~7 GB). Each blob is ~1.21 GB now (full 32768-row table) but shared.

Gates held: pre-convert parity pooled 0.999993 / int8 0.999941; CoreML L512 pooled_fp16 cosine vs fp32 oracle min 0.999962; ANE residency 99.29% (unchanged); compile time unchanged (~0.3 s).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
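The gather described in this commit can be sketched in plain Python as follows. This is a minimal illustration, not the `forward()` in `qwen3_encoder.py`: the helper names, the tiny table size, and the right-padded mask are assumptions. The real code uses tensor ops (`cumsum`, `index_select`) on a 32768-row table.

```python
import math
from itertools import accumulate


def build_rope_table(max_positions, head_dim, theta=1e6):
    """One fixed-size cos/sin table, independent of any bucket length."""
    half = head_dim // 2
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_positions)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_positions)]
    return cos, sin


def position_ids(attention_mask):
    """cumsum(attention_mask) - 1; assumes right padding (first token is real)."""
    return [c - 1 for c in accumulate(attention_mask)]


def gather_rows(table, ids):
    """Row gather by runtime indices, standing in for index_select."""
    return [table[i] for i in ids]


cos, sin = build_rope_table(max_positions=16, head_dim=8)  # tiny demo sizes
mask = [1, 1, 1, 0, 0]  # three real tokens, two pad tokens
ids = position_ids(mask)
cos_rows = gather_rows(cos, ids)
```

Because `ids` depends on the mask, which is a runtime input, a converter cannot precompute `cos_rows` for a given bucket, whereas a static `cos[:S]` slice has no runtime dependency and can be folded. Pad positions repeat the last real index here; they are masked out in attention, so their rotation does not matter.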
…uckets After the single-fixed-RoPE-table change, all plain buckets share one weight.bin blob (context is a second), so the real HF upload is ~2 weight blobs regardless of bucket count — not ~1.19 GB/bucket. Update the script's closing message. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… + dedup uploader

Download: PplxEmbed.load(repo:) now uses swift-huggingface's HubClient.downloadSnapshot (glob-filtered to the requested buckets/format) instead of the custom raw-HTTP downloader. Its content-addressed cache fetches the byte-identical encoder weight.bin ONCE by etag and reuses it across buckets, so the default 3-bucket pull moves ~1.15 GB instead of ~3.5 GB: native download dedup, no custom code. swift-huggingface is added as a direct dependency; it is standalone (no swift-transformers dep), so it stays orthogonal to the 1.0.x cap kept for MLX consumers. Drops the hand-rolled content-addressing, the manifest sha256, and the per-file download list (globs derive from subfolder + formats).

Upload: conversion/upload_pplx_embed_dedup.py uploads each unique blob once and uses HF server-side CommitOperationCopy to materialize the duplicate weight paths. Net transfer is the 2 unique weight blobs (~2.3 GB), not ~14 GB, sidestepping upload-large-folder's re-upload of identical oids across its parallel batches. Skips dot-dirs (the .cache/huggingface resume folder). Manifest reverts to size-only (the cache dedups by etag, so no per-file sha is needed and staging stays instant).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
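The upload side of this commit boils down to grouping files by content and transferring each distinct blob once. The sketch below plans that as a list of operations. It is an assumption-laden illustration, not `upload_pplx_embed_dedup.py`: `plan_dedup_upload` is a hypothetical helper, the file contents are held in memory for brevity, and the `"add"`/`"copy"` tuples stand in for `huggingface_hub`'s `CommitOperationAdd` and `CommitOperationCopy`.

```python
import hashlib


def plan_dedup_upload(files: dict[str, bytes]) -> list[tuple]:
    """Plan one 'add' per unique blob and a 'copy' for every duplicate path.

    `files` maps repo-relative paths to contents. Paths inside dot-directories
    (for example a .cache/huggingface resume folder) are skipped.
    """
    first_path_by_digest: dict[str, str] = {}
    ops: list[tuple] = []
    for path in sorted(files):
        if any(part.startswith(".") for part in path.split("/")):
            continue
        digest = hashlib.sha256(files[path]).hexdigest()
        src = first_path_by_digest.get(digest)
        if src is None:
            # First time this content is seen: it has to be transferred.
            first_path_by_digest[digest] = path
            ops.append(("add", path))
        else:
            # Same bytes already planned: materialize the path server-side.
            ops.append(("copy", src, path))
    return ops


plan = plan_dedup_upload(
    {
        "1024/weight.bin": b"shared-weights",
        "512/weight.bin": b"shared-weights",
        "512/model.mil": b"graph-512",
        ".cache/huggingface/state": b"resume-data",
    }
)
```

Here the plan transfers `1024/weight.bin` and `512/model.mil` and copies the weight to `512/weight.bin`. In a real run the adds would presumably need to be committed before the copies can reference them; that ordering is an assumption this sketch does not model.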
…ent-addressed download The card claimed weights differ by RoPE length per bucket; after the single fixed RoPE table they are byte-identical across buckets, and the Swift load(repo:) (HF Hub client) fetches the shared weight once by etag. Update the Use-it example to the optional-into: signature + note both formats / preferCompiled. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…paths
Two fixes to get PplxEmbed.load(repo:) downloading from a real HF repo:
1. Enable the swift-huggingface `Xet` package trait (bumps swift-tools-version to 6.1).
HF stores large files Xet-backed by default (the encoder weight.bin has a xet_hash
even when uploaded with HF_HUB_DISABLE_XET=1); without the trait the client forces the
LFS transport. Xet also gives chunk-level dedup on top of the etag blob cache.
2. Pass the manifest's EXACT file paths to downloadSnapshot(matching:) instead of wildcard
globs. `listFiles(recursive:)` also returns directory entries, so a glob like
`encoder.mlmodelc/*` matched the `analytics/`/`weights/` dirs and 404'd ("Entry not
found") trying to GET a directory. Exact paths match only the file blobs.
Verified end to end: `pplx-embed-demo --repo dokterbob/pplx-embed-coreml --buckets 512`
downloads via the content-addressed cache and returns bit-identical embeddings; the
shared weight is fetched ONCE (cache holds one 1.1 GB blob, two weight.bin paths point to
it) — download dedup confirmed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
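The second fix can be illustrated with a toy listing: when a recursive listing includes directory entries, a wildcard selects them along with the files, while a set of exact manifest paths cannot. The listing, the paths, and both helpers below are hypothetical, and Python's `fnmatch` is only a stand-in; the Swift client's glob semantics may differ in detail.

```python
from fnmatch import fnmatchcase

# Hypothetical recursive listing that mixes directory and file entries.
LISTING = [
    ("512/encoder.mlmodelc/analytics", "directory"),
    ("512/encoder.mlmodelc/analytics/coremldata.bin", "file"),
    ("512/encoder.mlmodelc/weights", "directory"),
    ("512/encoder.mlmodelc/weights/weight.bin", "file"),
    ("512/encoder.mlmodelc/model.mil", "file"),
]

# Hypothetical manifest: only real file blobs, spelled out exactly.
MANIFEST_PATHS = [
    "512/encoder.mlmodelc/analytics/coremldata.bin",
    "512/encoder.mlmodelc/weights/weight.bin",
    "512/encoder.mlmodelc/model.mil",
]


def select_by_glob(listing, pattern):
    """Wildcard selection: matches on the path string, whatever the entry type."""
    return [path for path, _ in listing if fnmatchcase(path, pattern)]


def select_by_exact(listing, wanted):
    """Exact selection: only paths named in the manifest are returned."""
    wanted = set(wanted)
    return [path for path, _ in listing if path in wanted]


kinds = dict(LISTING)
glob_hits = select_by_glob(LISTING, "512/encoder.mlmodelc/*")
exact_hits = select_by_exact(LISTING, MANIFEST_PATHS)
glob_dirs = [p for p in glob_hits if kinds[p] == "directory"]
exact_dirs = [p for p in exact_hits if kinds[p] == "directory"]
```

`glob_dirs` contains the `analytics` and `weights` directories, which are the entries that would be requested as if they were files; `exact_dirs` is empty.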
Adds a stateless bidirectional Qwen3-0.6B encoder path for Perplexity's `pplx-embed` embedding models, running 99.8% on the ANE. Two variants: plain sentence embeddings and late chunking (per-chunk via a `pool_matrix` matmul, one encoder pass over the whole window).

Conversion (`conversion/`)
- `models/qwen3_encoder.py`: ANE encoder reusing the existing primitives (Conv2d-1×1 projections, `repeat_kv_ane`, `stable_attention`): QK-norm, RoPE θ=1e6, SwiGLU, GQA 16/8, full bidirectional pad-mask. Derives seq-len + batch from the input. An fp16 residual rescale (K=8) keeps the 28-layer `down_proj` accumulation in range (exact for a pre-norm net). RMSNorm is now a local `Qwen3RMSNorm` selected via `norm_impl` (see workstream A).
- `build_pplx_embed_bundle.py`: fixed-shape ANE buckets (the fast path) + an optional flexible `RangeDim` GPU model (`--dynamic-upper N`) as the >max-bucket catch-all; plain + context; native int8 / `pooled_fp16` output.
- `pplx_embed_reference.py`: fp32 golden oracle, bit-exact with the model's own `st_quantize.py` (int8/binary/ubinary). `test_pplx_embed_parity.py` gates it.
- `config.py`: registry entries `pplx-embed`, `pplx-embed-context`.

Swift (`Sources/`)
- `CoreMLLLM/PplxEmbed.swift`: `[String] → int8/binary/ubinary` (plain) and per-chunk context; tokenize → smallest fitting bucket → ANE, with `n > largest bucket` routed to the flexible GPU model (non-padded). Adds `PplxEmbed.load(repo:buckets:preferCompiled:)` for manifest-driven download (see workstream D).
- `pplx-embed-demo` CLI (`--repo`) + `pplx-embed-bench` harness.

Fidelity (cosine vs fp32 reference)
plain int8 ≥0.999, context ≥0.998, dynamic GPU 0.9996. Swift-vs-Python end-to-end 0.9998.
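For readers unfamiliar with how a cosine gate like the ones above is applied, here is a minimal sketch. The thresholds come from this description; everything else is an assumption. In particular `quantize_int8_standin` is a generic scale-and-round used only to produce a perturbed vector and is not the scheme in `st_quantize.py`, and `passes_gate` is a hypothetical helper, not `test_pplx_embed_parity.py`.

```python
import math

PLAIN_INT8_MIN_COS = 0.999  # threshold quoted for plain int8
CONTEXT_MIN_COS = 0.998     # threshold quoted for context


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quantize_int8_standin(v):
    """Stand-in quantizer (NOT st_quantize.py): scale to [-127, 127] and round."""
    scale = 127.0 / max(abs(x) for x in v)
    return [max(-128, min(127, round(x * scale))) for x in v]


def passes_gate(reference, candidate, threshold):
    """True when the candidate stays within the cosine threshold of the oracle."""
    return cosine(reference, candidate) >= threshold


reference = [0.12, -0.5, 0.33, 0.9, -0.71, 0.05]  # toy fp32 "oracle" embedding
candidate = quantize_int8_standin(reference)
ok = passes_gate(reference, candidate, PLAIN_INT8_MIN_COS)
```

Cosine ignores overall scale, so the int8 vector can be compared against the fp32 reference directly without dequantizing.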
Design decision: fixed buckets only on the ANE
`EnumeratedShapes`/`RangeDim` force CPU fallback on the ANE and are ~10× slower (measured: 969 ms @512 vs 101 ms). So the ANE path is one fixed-shape `.mlpackage` per bucket (pad to the smallest fitting); the flexible `RangeDim` model is GPU-only and used solely as the >max-bucket catch-all.

Side investigations (reproducible scripts + write-ups in `docs/`)
- `docs/PPLX_EMBED_W8A8.md` + `conversion/experiment_w8a8.py`. Weight-only int8 collapses (~0.42 cos), int4-palettize only reaches 0.905, full W8A8 collapses to ~0. ANE residency stays fine, so the loss is purely numerical, and moot anyway since int8 activations are only ~9% faster here (not bandwidth-bound). Conclusion: ship fp16 + buckets; recovery would need full QAT.
- `docs/PPLX_EMBED_BATCHING.md` + `conversion/experiment_batching.py`. Batching is not a useful throughput lever: the ANE is batch-1 by design (batching hurts, 0.69×, confirmed 100% on-ANE), the GPU path batches only modestly (~1.4× at small L) and saturates by L≈512. Batch-1 on the ANE at the smallest bucket is the throughput winner. (Measured on an Apple M4 Max.)

What changed (this session)
Four workstreams on top of the encoder path above. All measurements: Apple M4 Max, macOS 26, coremltools 9, torch 2.11, B=1.
A: Native RMSNorm, shipped as the default
A new local `Qwen3RMSNorm` (`x * rsqrt(mean(x²) + eps) * w`) plus a `norm_impl` config selecting it at the 5 encoder norm sites, leaving the shared `ane_ops.ANERMSNorm` (the `cat([x, −x]) → LayerNorm → chunk` trick) untouched. The A/B (`conversion/experiment_ane_rmsnorm.py`, changing only the norm sites and holding Conv2d-1×1 projections + tensor layout fixed) isolated the RMSNorm as the cause of the ANE speedup seen incidentally during the earlier GPU-residency investigation. The cat/chunk trick predated a fast native ANE `rsqrt`; on this stack it is now a de-optimization. `norm_impl="native"` is now the pplx-embed default. Resolves the open follow-up flagged in `docs/PPLX_EMBED_GPU_RESIDENCY.md`. Shared-decoder rollout (the ~10 other families on `ane_ops.ANERMSNorm`) is flagged in `docs/ANE_RMSNORM_FOLLOWUP.md` but not done; it needs per-family re-validation.

B: Larger fixed ANE bucket (L=8192): measured, does NOT ship
`conversion/measure_l8192_bucket.py` shows a fixed L=8192 bucket statically maps 99.81% of ops to the ANE, but the ANE runtime fails at inference (`ANEProgramProcessRequestDirect status=0x15`): the 8192² full-attention intermediates (16 heads × 8192² fp16 ≈ 2 GB per score tensor) exceed ANE buffer limits, and the ANE compile alone takes ~25 min. So the largest fixed ANE bucket stays 4096; 4097–8192 tokens keep the existing dynamic GPU catch-all.

C: Single fixed RoPE table → fully shared weights
The encoder now builds its RoPE cos/sin tables at a single fixed size (`max_position_embeddings`) for every bucket, applied via a runtime `position_ids` gather (fold-proof: a static `[:S]` slice gets const-folded back to per-bucket). The only per-bucket-varying constant is gone, so `weight.bin` is now byte-identical across all buckets. Output is bit-identical to the old per-bucket RoPE (parity + dynamic-RangeDim + Swift int8 all verified). This is what makes the distribution below dedup.

D: Hugging Face distribution with end-to-end dedup
`conversion/upload_pplx_embed.py` publishes the prebuilt buckets to one HF repo (per-bucket subfolders + `manifest.json`), shipping both `.mlmodelc` (default, no on-device compile) and `.mlpackage` per bucket. Because all buckets now share one weight blob:
- `conversion/upload_pplx_embed_dedup.py` uploads each unique blob once and server-side-copies the duplicate paths (`CommitOperationCopy`), sidestepping `hf upload-large-folder`'s re-upload of identical oids across its parallel batches.
- `PplxEmbed.load(repo:buckets:preferCompiled:)` uses HF's native Swift Hub client (`swift-huggingface` `HubClient.downloadSnapshot`, with the Xet trait, since HF stores large files Xet-backed by default) into the content-addressed cache, so the shared weight is fetched once no matter how many buckets you pull. Selective per-bucket, one format per bucket, exact-path matching; feeds the existing `load(bundleDir:)`. `pplx-embed-demo --repo` exercises it end to end.

Verification
- A: native RMSNorm is 12.7% faster @L256 / 21.5% @L512 on `CPU_AND_NE` vs `ane_cat`, at identical 99.81% ANE residency and cosine 0.99998 vs the fp32 oracle. Parity gate PASS.
- B: L=8192 statically maps 99.81% of ops to the ANE but fails at inference (`status=0x15`); ~25 min compile. Decision: do not ship; 4097–8192 stays on the GPU catch-all.
- C: `weight.bin` byte-identical across buckets (verified by sha256); parity PASS, dynamic-RangeDim + Swift int8 bit-identical to the old encoder.
- D: `pplx-embed-demo --repo dokterbob/pplx-embed-coreml --buckets 512` returns correct embeddings, and the content-addressed cache holds one 1.1 GB blob with two `weight.bin` paths pointing to it (download dedup confirmed). Upload deduped via server-side copy.

Follow-ups
- Roll out `norm_impl="native"` to the shared `ane_ops.ANERMSNorm` only per-family, after re-validating decode + prefill latency, residency, and parity (`docs/ANE_RMSNORM_FOLLOWUP.md`). Do not flip the shared default globally off the pplx-embed result alone.

Uses the existing pip/venv conversion workflow (a separate PR adds an optional `uv` manifest). Verified: `test_pplx_embed_parity.py` PASS, `swift build` clean.

🤖 Generated with Claude Code