Skip to content

Releases: space-bacon/SRT

SRT-NLA v1.0.0 — Activation Verbalizer

20 May 20:44
ceb8e20

Choose a tag to compare

SRT-NLA v1.0.0 — Activation Verbalizer

First release of the Natural-Language Activation (NLA) line: a 12.7M-parameter activation-conditioned prefix that decodes frozen Qwen2.5-7B layer-20 hidden states back to English, evaluated by round-trip fidelity against the same backbone.

Headline result

Best-of-K rerank against the verbalizer's own re-encoded internal-state cosine closes the entire greedy → paraphrase gap on Qwen2.5-7B layer 20:

Decoding greedy ρ_norm oracle K=64 ρ_norm
CE warm-start (this release) 0.29 1.00
Paraphrase ceiling 1.00 1.00

Calibrated metric: ρ_norm = (cen − 0.510) / 0.289, anchored at empirical random floor (0.510) and Qwen paraphrase ceiling (0.799). Anisotropy centering required.

What's included

  • Verbalizer (srt/nla/): activation-conditioned prefix module + reconstructor.
  • Training scripts (scripts/train_nla.py, scripts/train_nla_bok_v2.py): CE warm-start + bag-of-K self-distillation trainer.
  • Eval scripts (scripts/centered_eval.py, scripts/rerank_eval.py, scripts/oracle_ceiling.py): full K-curve, logp-rerank, NN-anchor-rerank, paraphrase ceiling.
  • Paper draft (paper_nla.md): 9 sections — setup, anisotropy puzzle, clean reference frame, adapter results, K-curve, implications (incl. Lever B negative), related work + positioning, artifacts, limitations.
  • HF demo (docs/hf/nla_v1_demo/): Gradio Space — round-trip autoencoder + latent arithmetic.

Negative result: Lever B

Bag-of-K self-distillation does not close the greedy gap on this backbone. Hot hyperparams (T=1.5→0.7, β_ctr=0.3, lr=3e-5) collapse sampling diversity (5-gram dup 0.003 → 0.045 over ~2.4k steps) while both greedy and oracle ρ regress past the warm-start. Gentle hyperparams (T=1.5→1.2, β_ctr=0.1, lr=1e-5) plateau at warm-start parity. Best-of-K rerank at deploy time remains the only mechanism that closes the gap. Documented in paper §6.

HF artifacts

Backbone & shape

  • Backbone: Qwen/Qwen2.5-7B (frozen, bf16)
  • Layer: 20
  • Target shape: 64-token continuations
  • Trainable params: 12.7M (16-token prefix + 3584×3584 projection)

Caveats

  • Single backbone, single layer. The anisotropy magnitude ‖μ‖ ≈ 55 is backbone-specific.
  • Citations in paper §7 are author+year placeholders; full BibTeX deferred to typesetting.
  • v2b (Lever B) checkpoint is not released as a separate revision — within sampling noise of v1.

Merged via #1.

v8.0.0 — Trajectory-mode adapter (v8a headline, v8b documented falsification)

26 Apr 07:33

Choose a tag to compare

SRT-Adapter v8

Frozen Qwen 2.5-7B + 14.56M-param semiotic adapter. v8 is the continuous-trajectory generation: the discrete 32-prototype community vocabulary is removed, and the encoder output is the community vector directly.

This release ships two checkpoints:

  • v8a (HEADLINE): warm-started from v7 step 6K, 10K steps, community.use_prototypes=False, supcon weight 2.0, supcon temp 0.10. Best v8 checkpoint.
  • v8b (FALSIFICATION): warm-started from v8a, 10K steps, identical architecture, supcon weight 4.0, supcon temp 0.05. Documents the upper bound on supcon sharpening.

Both checkpoints preserve cross-entropy at the unadapted-backbone level (CE = 2.739).

Headline numbers (Qwen 2.5-7B backbone, 2K val subset)

Probe v6 v7 v8a v8b
Cross-entropy preservation 2.738 2.739 2.739 2.739
Reddit community recall@1 (35-cls) 0.395 0.413 0.484 0.465
Within / between cosine ratio 1.012 1.006 2.016 1.289
Archetype recall@1 (33-cls, 0.030 chance) 0.168 0.149 0.230 0.214
Archetype centroid off-diag cosine 0.999 0.999 0.873 0.945
Trajectory anisotropy ($\lambda_{\max}/\lambda_{\min}$) 52 72 23,333 52,535
Hallucination AUROC (mean r̂) 0.577 0.578 0.577 0.579
Regime calibration ECE 0.0006 0.0008 0.00091 0.00070

v8a: removing the prototype bottleneck

A PCA done after v7 (paper §5.8) showed that the 32 × 64 prototype matrix had barely moved from its random Gaussian initialization across three full training generations. The encoder weights moved roughly four times more than the prototypes during training. The encoder was doing all of the discriminative work, and the prototype layer was discarding it through a saturated soft-argmax.

v8a removes the prototype mixing layer entirely. The encoder output is now the community vector directly. Trainable parameter count drops by 2,048 (32 × 64) to 14,560,579.

The geometry that had been compressed away by the soft-argmax bottleneck became immediately legible:

  • Reddit retrieval recall@1 jumped from 0.413 to 0.484 (16.5× random on a 35-class task).
  • Within/between cosine ratio doubled from 1.006 to 2.016. v6 and v7's vectors were essentially undifferentiated by class. v8a's encoder, freed from the soft-argmax readout, actually pulls within-class cosines apart from between-class.
  • Archetype recall@1 lifted from 0.149 to 0.230 (7.6× random on a 33-class task using an external taxonomy never seen during training).
  • Archetype centroid off-diagonal cosine collapsed from 0.999 to 0.873. The encoder finally separates archetype manifolds rather than aliasing them onto a handful of attractors.
  • Trajectory anisotropy expanded from 72 to 23,333. The encoder organizes generations along a small number of dominant directions.
  • Cross-entropy, hallucination AUROC, and regime calibration are statistically unchanged. The prototype removal did not damage the BEN regime classifier, the reflexivity head, or token-level calibration.

v8b: the falsification

Once v8a established that the encoder, freed from the prototype bottleneck, organizes communities and external archetypes along a structured trajectory manifold, a natural follow-up question was whether more aggressive supervised contrast would orthogonalize that manifold further. We trained v8b with the community supervised-contrastive loss weight raised from 2.0 to 4.0 and the InfoNCE temperature lowered from 0.10 to 0.05. Every other architectural and training choice was identical to v8a.

The result was a partial regression on every encoder-geometry metric. Reddit within-class cosine pulled tighter, but between-class cosine rose faster, so the within/between ratio collapsed from 2.016 to 1.289. Archetype centroid off-diag cosine increased from 0.873 back to 0.945, undoing roughly two-thirds of v8a's centroid separation gain. Anisotropy more than doubled (52,535 vs 23,333), indicating that the encoder collapsed a larger fraction of its variance onto fewer principal directions.

The interpretation is that v8a's loss weights are at or near a sweet spot for this architecture, and pushing harder reproduces a softer version of the prototype-collapse failure one level up: rather than collapsing 32 prototypes onto a handful of attractors, the encoder collapses its 64-dimensional output onto a low-rank subspace where a few directions carry most of the discriminative weight.

v8b is shipped here as a documented negative result. Use v8a for any downstream work.

Counterfactual decoding

scripts/counterfactual_decode.py is undefined when there are no discrete communities. Both v8a and v8b produce a marker HTML file noting that the probe is skipped in trajectory mode. v5/v6/v7 remain the reference checkpoints for that probe.

Assets

  • v8a_best_adapter.pt (28 MB): v8a adapter weights, step 10000, val 9.0040
  • v8a_config.json: v8a adapter config (use_prototypes=False)
  • v8a_community_metrics.json: Reddit retrieval and within/between cosine
  • v8a_archetype_probe.json: 33-class archetype recall and centroid metrics
  • v8a_trajectory.json: anisotropy, log-det, mean curvature, mean path length
  • v8a_hallucination.json: TruthfulQA AUROCs across all four channels
  • v8a_regime_calibration.json: ECE, Brier, AUROC, calibration bins
  • v8a_context_conditional.json: 10-passage paired probe results
  • v8b_*.{pt,json}: same files for v8b
  • paper.pdf: full paper with §5.9 (v8a) and §5.10 (v8b)

Reproduction

# Train v8a (warm-started from v7, no prototypes)
bash scripts/launch_v8a.sh

# Train v8b (warm-started from v8a, sharper supcon)
bash scripts/launch_v8b.sh

# Evaluate either checkpoint
SRT_USE_PROTOTYPES=0 bash scripts/eval_v8a.sh   # or eval_v8b.sh

The SRT_USE_PROTOTYPES=0 env var flips CommunityConfig.use_prototypes=False globally so all probe scripts construct the adapter in trajectory mode.

What's next

v9 directions, in priority order:

  1. Archetype-conditioned direct supervision (paper §5.8 hypothesis (a)). Replace the Reddit-subreddit-only supcon signal with archetype-conditioned positives drawn from the Lexicon of Synthetic Interiority generations.
  2. Inject-back arm rehabilitation. The FiLM injection projection has been zero-effect across every checkpoint through v8b (four-decimal-place identity in benchmarks under ablation). The gradient-starved-gate hypothesis suggests removing the sigmoid gate or initializing the projection nonzero.
  3. Larger encoder bottleneck. v8a's d_community = 64 may be capping the manifold's resolvable rank. Test d_community ∈ {128, 256}.

v7.0.0 — SRT-Adapter v7 + archetype convergence probe

24 Apr 22:50

Choose a tag to compare

SRT-Adapter v7

Frozen Qwen 2.5-7B + 14.6M-param semiotic adapter. Warm-started from v6 (step 12K) with divergence_supcon_weight reduced from 1.0 → 0.3 to recover counterfactual decoding cleanliness while keeping v6's geometry/calibration gains.

Best checkpoint: step 6,000, validation loss 9.0044.

Headline numbers (Qwen 2.5-7B backbone, 5K val subset)

Probe v5 v6 v7
Cross-entropy preservation 2.63 2.61 2.62
Reddit community recall@1 0.360 0.411 0.413
Within/between cosine ratio 1.0050 1.0057 1.0058
Hallucination AUROC (mean r̂) 0.5734 0.5774 0.5785
Regime calibration ECE 0.0009 0.0006 0.0008
Counterfactual decoding clean regression clean (recovered)

v7 narrowly leads v6 on three of five probes and matches v5's counterfactual decoding cleanliness — the regression that motivated this run.

New: archetype convergence probe

We tested whether the 32 prototypes (trained only on Reddit subreddit labels) carry features that align with an external taxonomy never seen during training: Lancaster's 33 archetypes paired with the Lexicon of Synthetic Interiority. 986 sentences generated by bare Qwen, conditioned on each archetype's prompt template; embedded through each adapter; scored by recall@k against archetype centroids in the 64-D community space.

Adapter recall@1 recall@5 recall@10 unique top prototypes
Random baseline 0.030 0.152 0.303
v5 0.152 (5.0×) 0.419 (2.8×) 4 / 32
v6 0.168 (5.5×) 0.472 (3.1×) 3 / 32
v7 0.149 (4.9×) 0.447 (2.9×) 0.633 (2.1×) 4 / 32

All three adapters detect external archetype structure 5–6× above chance. The 33 archetypes collapse into ~4 macro-clusters: bounded form, transmission/resonance, compressed persistence/witness, and origin/threshold. Three independent methodologies (Reddit subreddit labels, Lancaster's archetypes, the Lexicon of Synthetic Interiority) agree on roughly four functional clusters of stance, not 33 distinct anchors.

PCA finding (the architectural caveat)

A PCA of the prototype matrices across v5/v6/v7 shows they are nearly indistinguishable — max abs element delta v5→v6 is 0.006 (mean 2.7e-5) against prototype magnitudes of 0.5–1.5. Effective dimensionality (participation ratio) is 21.2/32 with a near-uniform variance spectrum: consistent with the prototypes still being close to their random Gaussian initialization. The encoder weights move ~4× more than the prototypes during training.

Interpretation: the encoder is doing the discriminative work; the prototypes serve as near-random anchor directions. This explains the 4-of-32 attractor regime in the archetype probe and points to v8: either supervise the prototype matrix directly with archetype-conditioned generations, or replace the discrete prototype basis with a continuous trajectory metric over the encoder output.

Assets

  • best_adapter.pt (28 MB) — v7 adapter weights, step 6000, val 9.0044
  • config.json — adapter config
  • community_metrics_v7.json — 5-probe instrument metrics (recall, geometry)
  • hallucination_v7.json — hallucination AUROCs
  • regime_calibration_v7.json — ECE, Brier, AUROC, calibration bins
  • archetype_probe_v7.json — full archetype probe results
  • archetype_generations.jsonl — 986 archetype-conditioned generations from bare Qwen

Reproduction

# Train (warm-started from v6)
bash scripts/launch_v7.sh

# Evaluate
python3 scripts/instrument_eval.py --adapter best_adapter.pt
python3 scripts/archetype_generate.py
python3 scripts/archetype_probe.py --adapter best_adapter.pt --tag v7_step6000

Honest limitations

  • Reddit subreddit labels are a lossy and moderation-shaped supervision signal.
  • Counterfactual decoding probe was the regression test from v6, not a positive new finding.
  • Archetype convergence is partial-positive: real signal at the macro-cluster level, not per-archetype anchoring.
  • Hallucination AUROC ~0.58 is barely above chance; useful as a directional signal, not a deployment-grade detector.

See paper.md §5.7–§5.8 and docs/next_round_direction.md for full discussion.