Releases: space-bacon/SRT
SRT-NLA v1.0.0 — Activation Verbalizer
First release of the Natural-Language Activation (NLA) line: a 12.7M-parameter activation-conditioned prefix that decodes frozen Qwen2.5-7B layer-20 hidden states back to English, evaluated by round-trip fidelity against the same backbone.
Headline result
Best-of-K rerank against the verbalizer's own re-encoded internal-state cosine closes the entire greedy → paraphrase gap on Qwen2.5-7B layer 20:
| Decoding | greedy ρ_norm | oracle K=64 ρ_norm |
|---|---|---|
| CE warm-start (this release) | 0.29 | 1.00 |
| Paraphrase ceiling | 1.00 | 1.00 |
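The oracle column above comes from best-of-K reranking. A minimal sketch of that selection step follows, assuming a hypothetical `encode` callable that maps a candidate string back to a hidden-state vector; it is not the code in scripts/rerank_eval.py, and it omits the anisotropy centering for brevity.

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def best_of_k(target_state, candidates, encode):
    """Return the candidate whose re-encoded state best matches target_state.

    `encode` is a hypothetical text -> vector callable standing in for the
    frozen backbone's layer-20 readout.
    """
    return max(candidates, key=lambda text: cosine(encode(text), target_state))

# Toy usage: three "decodings" with hand-written 2-D states.
toy_states = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
winner = best_of_k([0.5, 0.9], list(toy_states), toy_states.get)
```

In the toy usage, `winner` is the key whose vector points closest to the target direction.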
Calibrated metric: ρ_norm = (cen − 0.510) / 0.289, anchored at empirical random floor (0.510) and Qwen paraphrase ceiling (0.799). Anisotropy centering required.
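As a worked illustration of the calibration, the sketch below computes a centered cosine and rescales it to ρ_norm. The floor and ceiling constants are taken from the line above; the function names and the way the mean vector `mu` is supplied are assumptions made for illustration.

```python
import math

FLOOR = 0.510    # empirical random floor
CEILING = 0.799  # Qwen paraphrase ceiling; CEILING - FLOOR = 0.289

def centered_cosine(a, b, mu):
    """Cosine similarity after subtracting the shared mean vector mu.

    Subtracting mu is the anisotropy centering step: without it, a large
    common offset dominates every pairwise cosine.
    """
    a_c = [x - m for x, m in zip(a, mu)]
    b_c = [y - m for y, m in zip(b, mu)]
    dot = sum(x * y for x, y in zip(a_c, b_c))
    norm = math.sqrt(sum(x * x for x in a_c)) * math.sqrt(sum(y * y for y in b_c))
    return dot / norm

def rho_norm(cen):
    """Rescale a centered cosine so the floor maps to 0 and the ceiling to 1."""
    return (cen - FLOOR) / (CEILING - FLOOR)
```

On this scale, a decoding no better than the random floor scores 0 and one that matches the paraphrase ceiling scores 1.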
What's included
- Verbalizer (srt/nla/): activation-conditioned prefix module + reconstructor.
- Training scripts (scripts/train_nla.py, scripts/train_nla_bok_v2.py): CE warm-start + bag-of-K self-distillation trainer.
- Eval scripts (scripts/centered_eval.py, scripts/rerank_eval.py, scripts/oracle_ceiling.py): full K-curve, logp-rerank, NN-anchor-rerank, paraphrase ceiling.
- Paper draft (paper_nla.md): 9 sections covering setup, anisotropy puzzle, clean reference frame, adapter results, K-curve, implications (incl. Lever B negative), related work + positioning, artifacts, limitations.
- HF demo (docs/hf/nla_v1_demo/): Gradio Space with round-trip autoencoder + latent arithmetic.
Negative result: Lever B
Bag-of-K self-distillation does not close the greedy gap on this backbone. Hot hyperparams (T=1.5→0.7, β_ctr=0.3, lr=3e-5) collapse sampling diversity (5-gram dup 0.003 → 0.045 over ~2.4k steps) while both greedy and oracle ρ regress past the warm-start. Gentle hyperparams (T=1.5→1.2, β_ctr=0.1, lr=1e-5) plateau at warm-start parity. Best-of-K rerank at deploy time remains the only mechanism that closes the gap. Documented in paper §6.
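The 5-gram duplication figures above can be read as a repetition rate over sampled tokens. The release notes do not state the exact definition used by the eval scripts, so the sketch below assumes one common choice: the fraction of 5-grams in a sample that repeat a 5-gram already seen in the same sample.

```python
def ngram_dup_rate(tokens, n=5):
    """Fraction of n-grams that duplicate another n-gram in the same sequence.

    Assumed definition: 1 - (unique n-grams / total n-grams).
    Returns 0.0 when the sequence is too short to contain any n-gram.
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)
```

Under this definition, a rising rate means samples are repeating more of their own 5-grams, which is consistent with the diversity collapse reported for the hot hyperparameters.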
HF artifacts
- Model: RiverRider/srt-nla-av-v1
- Dataset: RiverRider/srt-nla-targets-v1
- Demo Space: built from docs/hf/nla_v1_demo/
Backbone & shape
- Backbone: Qwen/Qwen2.5-7B (frozen, bf16)
- Layer: 20
- Target shape: 64-token continuations
- Trainable params: 12.7M (16-token prefix + 3584×3584 projection)
Caveats
- Single backbone, single layer. The anisotropy magnitude ‖μ‖ ≈ 55 is backbone-specific.
- Citations in paper §7 are author+year placeholders; full BibTeX deferred to typesetting.
- v2b (Lever B) checkpoint is not released as a separate revision — within sampling noise of v1.
Merged via #1.
v8.0.0 — Trajectory-mode adapter (v8a headline, v8b documented falsification)
SRT-Adapter v8
Frozen Qwen 2.5-7B + 14.56M-param semiotic adapter. v8 is the continuous-trajectory generation: the discrete 32-prototype community vocabulary is removed, and the encoder output is the community vector directly.
This release ships two checkpoints:
- v8a (HEADLINE): warm-started from v7 step 6K, 10K steps, community.use_prototypes=False, supcon weight 2.0, supcon temp 0.10. Best v8 checkpoint.
- v8b (FALSIFICATION): warm-started from v8a, 10K steps, identical architecture, supcon weight 4.0, supcon temp 0.05. Documents the upper bound on supcon sharpening.
Both checkpoints preserve cross-entropy at the unadapted-backbone level (CE = 2.739).
Headline numbers (Qwen 2.5-7B backbone, 2K val subset)
| Probe | v6 | v7 | v8a | v8b |
|---|---|---|---|---|
| Cross-entropy preservation | 2.738 | 2.739 | 2.739 | 2.739 |
| Reddit community recall@1 (35-cls) | 0.395 | 0.413 | 0.484 | 0.465 |
| Within / between cosine ratio | 1.012 | 1.006 | 2.016 | 1.289 |
| Archetype recall@1 (33-cls, 0.030 chance) | 0.168 | 0.149 | 0.230 | 0.214 |
| Archetype centroid off-diag cosine | 0.999 | 0.999 | 0.873 | 0.945 |
| Trajectory anisotropy | 52 | 72 | 23,333 | 52,535 |
| Hallucination AUROC (mean r̂) | 0.577 | 0.578 | 0.577 | 0.579 |
| Regime calibration ECE | 0.0006 | 0.0008 | 0.00091 | 0.00070 |
v8a: removing the prototype bottleneck
A PCA done after v7 (paper §5.8) showed that the 32 × 64 prototype matrix had barely moved from its random Gaussian initialization across three full training generations. The encoder weights moved roughly four times more than the prototypes during training. The encoder was doing all of the discriminative work, and the prototype layer was discarding it through a saturated soft-argmax.
v8a removes the prototype mixing layer entirely. The encoder output is now the community vector directly. Trainable parameter count drops by 2,048 (32 × 64) to 14,560,579.
The geometry that had been compressed away by the soft-argmax bottleneck became immediately legible:
- Reddit retrieval recall@1 jumped from 0.413 to 0.484 (16.5× random on a 35-class task).
- Within/between cosine ratio doubled from 1.006 to 2.016. v6 and v7's vectors were essentially undifferentiated by class. v8a's encoder, freed from the soft-argmax readout, actually pulls within-class cosines apart from between-class.
- Archetype recall@1 lifted from 0.149 to 0.230 (7.6× random on a 33-class task using an external taxonomy never seen during training).
- Archetype centroid off-diagonal cosine collapsed from 0.999 to 0.873. The encoder finally separates archetype manifolds rather than aliasing them onto a handful of attractors.
- Trajectory anisotropy expanded from 72 to 23,333. The encoder organizes generations along a small number of dominant directions.
- Cross-entropy, hallucination AUROC, and regime calibration are statistically unchanged. The prototype removal did not damage the BEN regime classifier, the reflexivity head, or token-level calibration.
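To make the within/between cosine ratio in the list above concrete, here is a minimal sketch. It assumes the ratio is the mean pairwise cosine among same-label vectors divided by the mean pairwise cosine among different-label vectors; the release notes do not spell out the exact formula, so treat this as one plausible reading.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def within_between_ratio(vectors, labels):
    """Mean same-class cosine divided by mean cross-class cosine (assumed definition)."""
    within, between = [], []
    for (va, la), (vb, lb) in combinations(zip(vectors, labels), 2):
        (within if la == lb else between).append(cosine(va, vb))
    return (sum(within) / len(within)) / (sum(between) / len(between))

# Toy usage: two well-separated classes in 2-D.
toy_vectors = [[1.0, 0.0], [1.0, 0.1], [0.5, 1.0], [0.4, 1.0]]
toy_ratio = within_between_ratio(toy_vectors, [0, 0, 1, 1])
```

A ratio near 1.0, as reported for v6 and v7, means same-class pairs are no more aligned than cross-class pairs; values well above 1.0 indicate class structure in the embedding.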
v8b: the falsification
Once v8a established that the encoder, freed from the prototype bottleneck, organizes communities and external archetypes along a structured trajectory manifold, a natural follow-up question was whether more aggressive supervised contrast would orthogonalize that manifold further. We trained v8b with the community supervised-contrastive loss weight raised from 2.0 to 4.0 and the InfoNCE temperature lowered from 0.10 to 0.05. Every other architectural and training choice was identical to v8a.
The result was a partial regression on every encoder-geometry metric. Reddit within-class cosine pulled tighter, but between-class cosine rose faster, so the within/between ratio collapsed from 2.016 to 1.289. Archetype centroid off-diag cosine increased from 0.873 back to 0.945, undoing more than half of v8a's centroid separation gain (0.072 of 0.126). Anisotropy more than doubled (52,535 vs 23,333), indicating that the encoder collapsed a larger fraction of its variance onto fewer principal directions.
The interpretation is that v8a's loss weights are at or near a sweet spot for this architecture, and pushing harder reproduces a softer version of the prototype-collapse failure one level up: rather than collapsing 32 prototypes onto a handful of attractors, the encoder collapses its 64-dimensional output onto a low-rank subspace where a few directions carry most of the discriminative weight.
v8b is shipped here as a documented negative result. Use v8a for any downstream work.
Counterfactual decoding
The counterfactual decoding probe (scripts/counterfactual_decode.py) is undefined when there are no discrete communities. Both v8a and v8b produce a marker HTML file noting that the probe is skipped in trajectory mode. v5/v6/v7 remain the reference checkpoints for that probe.
Assets
- v8a_best_adapter.pt (28 MB): v8a adapter weights, step 10000, val 9.0040
- v8a_config.json: v8a adapter config (use_prototypes=False)
- v8a_community_metrics.json: Reddit retrieval and within/between cosine
- v8a_archetype_probe.json: 33-class archetype recall and centroid metrics
- v8a_trajectory.json: anisotropy, log-det, mean curvature, mean path length
- v8a_hallucination.json: TruthfulQA AUROCs across all four channels
- v8a_regime_calibration.json: ECE, Brier, AUROC, calibration bins
- v8a_context_conditional.json: 10-passage paired probe results
- v8b_*.{pt,json}: same files for v8b
- paper.pdf: full paper with §5.9 (v8a) and §5.10 (v8b)
Reproduction
# Train v8a (warm-started from v7, no prototypes)
bash scripts/launch_v8a.sh
# Train v8b (warm-started from v8a, sharper supcon)
bash scripts/launch_v8b.sh
# Evaluate either checkpoint
SRT_USE_PROTOTYPES=0 bash scripts/eval_v8a.sh # or eval_v8b.sh

The SRT_USE_PROTOTYPES=0 env var sets CommunityConfig.use_prototypes=False globally, so all probe scripts construct the adapter in trajectory mode.
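One way an environment variable can drive a config default is sketched below. Only the names SRT_USE_PROTOTYPES and CommunityConfig.use_prototypes come from the release notes; the other fields, the default of "1" when the variable is unset, and the parsing rule are assumptions, and the repository's actual class may look different.

```python
import os
from dataclasses import dataclass, field

def _prototypes_enabled():
    """Read SRT_USE_PROTOTYPES; anything other than "0" keeps prototypes on (assumed rule)."""
    return os.environ.get("SRT_USE_PROTOTYPES", "1") != "0"

@dataclass
class CommunityConfig:
    # Evaluated at construction time, so every script that builds the
    # config after the variable is set sees the same mode.
    use_prototypes: bool = field(default_factory=_prototypes_enabled)
    n_prototypes: int = 32   # hypothetical field, matching the 32 prototypes in v5-v7
    d_community: int = 64    # hypothetical field, matching the 64-D community space
```

Using `default_factory` rather than a plain default means the variable is read each time a config is built, not once at import.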
What's next
v9 directions, in priority order:
- Archetype-conditioned direct supervision (paper §5.8 hypothesis (a)). Replace the Reddit-subreddit-only supcon signal with archetype-conditioned positives drawn from the Lexicon of Synthetic Interiority generations.
- Inject-back arm rehabilitation. The FiLM injection projection has been zero-effect across every checkpoint through v8b (four-decimal-place identity in benchmarks under ablation). The gradient-starved-gate hypothesis suggests removing the sigmoid gate or initializing the projection nonzero.
- Larger encoder bottleneck. v8a's d_community = 64 may be capping the manifold's resolvable rank. Test d_community ∈ {128, 256}.
v7.0.0 — SRT-Adapter v7 + archetype convergence probe
SRT-Adapter v7
Frozen Qwen 2.5-7B + 14.6M-param semiotic adapter. Warm-started from v6 (step 12K) with divergence_supcon_weight reduced from 1.0 to 0.3 to recover counterfactual decoding cleanliness while keeping v6's geometry/calibration gains.
Best checkpoint: step 6,000, validation loss 9.0044.
Headline numbers (Qwen 2.5-7B backbone, 5K val subset)
| Probe | v5 | v6 | v7 |
|---|---|---|---|
| Cross-entropy preservation | 2.63 | 2.61 | 2.62 |
| Reddit community recall@1 | 0.360 | 0.411 | 0.413 |
| Within/between cosine ratio | 1.0050 | 1.0057 | 1.0058 |
| Hallucination AUROC (mean r̂) | 0.5734 | 0.5774 | 0.5785 |
| Regime calibration ECE | 0.0009 | 0.0006 | 0.0008 |
| Counterfactual decoding | clean | regression | clean (recovered) |
v7 narrowly leads v6 on three of five probes and matches v5's counterfactual decoding cleanliness — the regression that motivated this run.
New: archetype convergence probe
We tested whether the 32 prototypes (trained only on Reddit subreddit labels) carry features that align with an external taxonomy never seen during training: Lancaster's 33 archetypes paired with the Lexicon of Synthetic Interiority. We generated 986 sentences with bare Qwen, each conditioned on an archetype's prompt template, embedded them through each adapter, and scored them by recall@k against archetype centroids in the 64-D community space.
| Adapter | recall@1 | recall@5 | recall@10 | unique top prototypes |
|---|---|---|---|---|
| Random baseline | 0.030 | 0.152 | 0.303 | — |
| v5 | 0.152 (5.0×) | 0.419 (2.8×) | — | 4 / 32 |
| v6 | 0.168 (5.5×) | 0.472 (3.1×) | — | 3 / 32 |
| v7 | 0.149 (4.9×) | 0.447 (2.9×) | 0.633 (2.1×) | 4 / 32 |
All three adapters detect external archetype structure 5–6× above chance. The 33 archetypes collapse into ~4 macro-clusters: bounded form, transmission/resonance, compressed persistence/witness, and origin/threshold. Three independent methodologies (Reddit subreddit labels, Lancaster's archetypes, the Lexicon of Synthetic Interiority) agree on roughly four functional clusters of stance, not 33 distinct anchors.
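The recall@k scoring used in the table above can be sketched as follows. This is an illustration under assumptions, not the code in scripts/archetype_probe.py: centroids are plain per-class means that include the query vector, and ranking uses uncentered cosine.

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def class_centroids(vectors, labels):
    """Mean vector per label."""
    grouped = {}
    for v, y in zip(vectors, labels):
        grouped.setdefault(y, []).append(v)
    return {y: [sum(col) / len(vs) for col in zip(*vs)] for y, vs in grouped.items()}

def recall_at_k(vectors, labels, centroids, k):
    """Share of vectors whose true label is among the k nearest centroids."""
    hits = 0
    for v, y in zip(vectors, labels):
        ranked = sorted(centroids, key=lambda c: cosine(v, centroids[c]), reverse=True)
        hits += y in ranked[:k]
    return hits / len(vectors)

# Toy usage: three classes, two vectors each.
toy_vecs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1], [0, 0, 1], [0.1, 0, 0.9]]
toy_labels = ["a", "a", "b", "b", "c", "c"]
toy_recall = recall_at_k(toy_vecs, toy_labels, class_centroids(toy_vecs, toy_labels), k=1)
```

With C equally likely classes, chance recall@k is k/C, which gives the 0.030, 0.152, and 0.303 baselines for k = 1, 5, 10 at C = 33.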
PCA finding (the architectural caveat)
A PCA of the prototype matrices across v5/v6/v7 shows they are nearly indistinguishable — max abs element delta v5→v6 is 0.006 (mean 2.7e-5) against prototype magnitudes of 0.5–1.5. Effective dimensionality (participation ratio) is 21.2/32 with a near-uniform variance spectrum: consistent with the prototypes still being close to their random Gaussian initialization. The encoder weights move ~4× more than the prototypes during training.
Interpretation: the encoder is doing the discriminative work; the prototypes serve as near-random anchor directions. This explains the 4-of-32 attractor regime in the archetype probe and points to v8: either supervise the prototype matrix directly with archetype-conditioned generations, or replace the discrete prototype basis with a continuous trajectory metric over the encoder output.
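For readers unfamiliar with the participation ratio quoted in the PCA finding, a short sketch follows. It uses the standard definition, the squared sum of the variance spectrum divided by the sum of its squares, and assumes the eigenvalues of the prototype covariance have already been computed elsewhere.

```python
def participation_ratio(eigenvalues):
    """Effective dimensionality: (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Equals the number of components when the spectrum is perfectly uniform
    and 1.0 when a single component carries all the variance.
    """
    total = sum(eigenvalues)
    return total * total / sum(v * v for v in eigenvalues)
```

A perfectly uniform 32-component spectrum gives 32, so the reported 21.2/32 sits between full uniformity and a collapsed spectrum.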
Assets
- best_adapter.pt (28 MB): v7 adapter weights, step 6000, val 9.0044
- config.json: adapter config
- community_metrics_v7.json: 5-probe instrument metrics (recall, geometry)
- hallucination_v7.json: hallucination AUROCs
- regime_calibration_v7.json: ECE, Brier, AUROC, calibration bins
- archetype_probe_v7.json: full archetype probe results
- archetype_generations.jsonl: 986 archetype-conditioned generations from bare Qwen
Reproduction
# Train (warm-started from v6)
bash scripts/launch_v7.sh
# Evaluate
python3 scripts/instrument_eval.py --adapter best_adapter.pt
python3 scripts/archetype_generate.py
python3 scripts/archetype_probe.py --adapter best_adapter.pt --tag v7_step6000

Honest limitations
- Reddit subreddit labels are a lossy and moderation-shaped supervision signal.
- Counterfactual decoding probe was the regression test from v6, not a positive new finding.
- Archetype convergence is partial-positive: real signal at the macro-cluster level, not per-archetype anchoring.
- Hallucination AUROC ~0.58 is barely above chance; useful as a directional signal, not a deployment-grade detector.
See paper.md §5.7–§5.8 and docs/next_round_direction.md for full discussion.