This repository contains the code for our paper investigating how generative models adapt their outputs to different skill levels, asking whether adaptation occurs through dynamic internal concept understanding or through modulation of concept externalization.
Run with
python -m maia2-sae.test.externalization.maia2_ft
to calibrate the model's concept externalization at lower skill levels to that of higher skill levels in move prediction.
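As a rough illustration of what such calibration could look like (not the script's actual code), one could fine-tune so that predictions conditioned on a low skill level are pulled toward the model's own high-skill predictions via a KL term. All names below (the model call signature, skill encodings) are hypothetical placeholders:

```python
# Hypothetical sketch of skill-conditioned calibration, NOT the repo's code.
import torch
import torch.nn.functional as F

def calibration_loss(model, boards, low_skill, high_skill):
    # Reference distribution: the model's own high-skill move predictions.
    with torch.no_grad():
        target = F.softmax(model(boards, high_skill), dim=-1)
    # Distribution to calibrate: predictions at the lower skill level.
    log_pred = F.log_softmax(model(boards, low_skill), dim=-1)
    # KL divergence pulls low-skill externalization toward high-skill output.
    return F.kl_div(log_pred, target, reduction="batchmean")
```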
We implement sparse, wide transcoders to approximate the behavior of the original model's FFN MLPs. Training uses TopK ReLU activations to control sparsity directly and mitigate dead latents. Run the transcoder training with:
python -m maia2-sae.train.transcoder_train \
--model_path /path/to/maia-2-weights \
--data_root /path/to/lichess_data \
--hidden_dim 16384 \
--k 256 \
--layer_idx 6
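For intuition, here is a minimal sketch of a TopK transcoder, assuming it maps an MLP's input to a reconstruction of its output through a wide sparse bottleneck; the hyperparameters mirror the flags above (--hidden_dim, --k), but the layer shapes and class are illustrative, not the repo's exact implementation:

```python
# Minimal TopK transcoder sketch (illustrative, not the repo's code).
import torch
import torch.nn as nn

class TopKTranscoder(nn.Module):
    def __init__(self, d_model: int, hidden_dim: int = 16384, k: int = 256):
        super().__init__()
        self.encoder = nn.Linear(d_model, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, d_model)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Encode, then keep only the k largest post-ReLU latents per token;
        # zeroing the rest enforces an exact sparsity level without an L1
        # penalty, which helps avoid dead latents.
        latents = torch.relu(self.encoder(x))
        topk = torch.topk(latents, self.k, dim=-1)
        sparse = torch.zeros_like(latents).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse)  # reconstruction of the MLP's output
```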
After training, we can evaluate and visualize reconstruction fidelity with:
python -m maia2-sae.train.transcoder_eval
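One common fidelity metric such an evaluation could report is the fraction of variance unexplained (FVU); the helper below is an illustrative sketch and not necessarily what transcoder_eval computes:

```python
# FVU = ||target - recon||^2 / ||target - mean(target)||^2 (sketch only).
import torch

def fraction_variance_unexplained(target: torch.Tensor, recon: torch.Tensor) -> torch.Tensor:
    resid = (target - recon).pow(2).sum()
    total = (target - target.mean(dim=0)).pow(2).sum()
    return resid / total
```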
The training process supports configurable SAE hyperparameters and hook sites for extracting Maia-2's internal representations. To start a standard SAE training, run:
python -m maia2-sae.train.train_sae [arguments]
Key Arguments for SAE Training

The following arguments control the SAE training process:
--sae_dim: Dimension of the SAE
--l1_coefficient: L1 regularization coefficient for SAE training loss
--sae_attention_heads: Whether to attach hooks on attention heads for SAE training
--sae_residual_streams: Whether to attach hooks on residual streams for SAE training
--sae_mlp_outputs: Whether to attach hooks on MLP outputs for SAE training
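As a rough illustration of how these pieces fit together, here is a minimal SAE with the L1-regularized training loss implied by --sae_dim and --l1_coefficient, applied to activations captured at the chosen hook sites; shapes and names are ours, not necessarily the repo's:

```python
# Standard SAE sketch: MSE reconstruction + L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, sae_dim: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, sae_dim)
        self.decoder = nn.Linear(sae_dim, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(f), f

def sae_loss(sae, x, l1_coefficient: float):
    recon, f = sae(x)
    # Reconstruction error plus L1 penalty on feature activations,
    # weighted by --l1_coefficient.
    return (recon - x).pow(2).mean() + l1_coefficient * f.abs().mean()
```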
To get the internal activations of our trained SAEs on Maia-2 test positions, run:
python -m maia2-sae.train.generate_activations
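The general pattern for capturing internal activations is a PyTorch forward hook; the sketch below is illustrative, and the actual script's module paths and storage format may differ:

```python
# Illustrative forward-hook pattern for dumping activations (not the
# repo's exact code).
import torch

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()
    return hook

# Hypothetical usage: attach to whichever Maia-2 submodule the SAE was
# trained on, then run test positions through the model.
# handle = model.blocks[6].mlp.register_forward_hook(make_hook("mlp_out_6"))
# model(test_positions); handle.remove()
```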
Then, with the SAE internals, we can extract the most salient SAE features for offensive and defensive square-wise threat concepts with:
python -m maia2-sae.test.threat_awareness
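One way to rank SAE features for a binary square-wise threat concept is to compare mean feature activations on positions where the concept is present versus absent; the repo's scoring may differ from this sketch:

```python
# Illustrative concept-salience ranking by activation gap.
import torch

def salient_features(acts: torch.Tensor, labels: torch.Tensor, top_n: int = 20):
    # acts: (n_positions, sae_dim) SAE activations;
    # labels: (n_positions,) 1 if the threat concept holds, else 0.
    pos_mean = acts[labels == 1].mean(dim=0)
    neg_mean = acts[labels == 0].mean(dim=0)
    # Features with the largest activation gap are the most concept-salient.
    return torch.topk(pos_mean - neg_mean, top_n).indices
```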
Run with:
python -m maia2-sae.test.intervention.run_sae_intervention
to examine how the model's behaviour changes when its concept understanding level is increased.
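Conceptually, such an intervention amplifies selected concept features in SAE space and patches the reconstruction back into the forward pass; the sketch below reuses the SparseAutoencoder sketch above, and the feature indices and scale factor are illustrative:

```python
# Illustrative activation-space intervention (not the repo's exact code).
import torch

def intervene(sae, x: torch.Tensor, feature_idx: torch.Tensor, scale: float = 5.0):
    f = torch.relu(sae.encoder(x))
    f[..., feature_idx] *= scale   # amplify the chosen concept features
    return sae.decoder(f)          # patched activation fed back into Maia-2
```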