The AI Steerability 360 toolkit is an extensible library for general-purpose steering of LLMs. The toolkit provides
implementations of steering methods across a range of model control surfaces (input, structural, state, and output),
functionality to compose steering methods (into a `SteeringPipeline`), and the ability to compare steering methods
(and pipelines) on custom tasks/metrics.
To get started, please see the documentation.
The toolkit uses `uv` as the package manager (Python 3.11+). After installing `uv`, install
the toolkit by running:
`uv venv --python 3.11 && uv pip install .`
Activate the environment by running `source .venv/bin/activate`. Note that on Windows, you may need to run the two chained commands separately (rather than joining them with `&&`).
Inference is facilitated by Hugging Face. Before steering, create a `.env` file in the root directory containing your
Hugging Face API token in the following format:
`HUGGINGFACE_TOKEN=hf_***`
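For reference, each line of the `.env` file is a simple `KEY=VALUE` pair. The minimal stdlib parser below is only for sanity-checking the format; it is not how the toolkit itself loads the file (a dotenv-style loader is assumed), and the token value shown is a placeholder, not a real key.

```python
import os
import tempfile

def load_dotenv_minimal(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file (blank lines and comments skipped)."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# Write a sample .env with a placeholder token (not a real key)
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("HUGGINGFACE_TOKEN=hf_xxxxxxxx\n")
    path = f.name

env = load_dotenv_minimal(path)
print(env["HUGGINGFACE_TOKEN"])  # hf_xxxxxxxx
os.remove(path)
```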
Some Hugging Face models (e.g. meta-llama/Meta-Llama-3.1-8B-Instruct) are behind an access gate. Check that you have access via the model's Hub page with the same account whose token you pass to the toolkit.
Note
AISteer360 runs the model inside your process. For efficient inference, please run the toolkit from a machine that has enough GPU memory for both the base checkpoint and the extra overhead your steering method/pipeline adds.
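As a rough aid for sizing a machine, the weight footprint can be estimated from the parameter count and numeric precision. The helper below is a crude heuristic written for this note, not a toolkit utility; the 20% overhead allowance for activations, KV cache, and steering-method state is an assumption, and real usage varies with sequence length and method.

```python
def estimate_model_memory_gb(num_params_billions: float,
                             bytes_per_param: int = 2,
                             overhead_fraction: float = 0.2) -> float:
    """Rough lower bound on GPU memory needed to hold model weights.

    bytes_per_param: 2 for fp16/bf16, 4 for fp32, 1 for int8.
    overhead_fraction: crude allowance for KV cache, activations, and
    steering-method state (an assumption, not a guarantee).
    """
    weights_gb = num_params_billions * 1e9 * bytes_per_param / 1024**3
    return weights_gb * (1 + overhead_fraction)

# An 8B-parameter model in bf16 needs roughly 15 GB for weights alone,
# so around 18 GB with the crude overhead allowance used here.
print(round(estimate_model_memory_gb(8), 1))
```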
The ability to benchmark and compare steering methods on realistic use cases is one of the main features of the toolkit. The featured examples below illustrate this functionality.
| Steering for instruction following |
|---|
| A model's instruction following ability is an important measure of its general usability. This notebook studies the effect of post-hoc attention steering (PASTA) on a model's ability to follow instructions. We sweep over the steering strength and investigate the trade-off between a model's instruction following ability and general response quality. |
| Steering for commonsense reasoning |
|---|
| Multiple choice question answering is a common format for evaluating a model's reasoning ability. This notebook benchmarks steering methods on the CommonsenseQA dataset, comparing few-shot prompting against a LoRA adapter trained with DPO. We sweep over the number of few-shot examples and study how accuracy scales relative to the fine-tuned baseline across two models. |
| Composite steering for truthfulness |
|---|
| One of the primary features of the toolkit is the ability to compose multiple steering methods into a single model operation. This notebook composes a state control (PASTA) with an output control (DeAL), with the goal of improving the model's truthfulness (as measured on TruthfulQA) without significantly degrading informativeness. We sweep over the joint parameter space of the two controls and compare each individual control's performance (via the trade-off between truthfulness and informativeness) to that of the composition. |
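The composition idea can be sketched as follows. This is an illustrative stand-in, not the toolkit's actual `SteeringPipeline` API: all class and function names below are invented for exposition, and each "control" is reduced to a function that annotates a shared generation context.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative only: these names mimic the composition idea,
# not the toolkit's actual classes or signatures.

@dataclass
class GenerationContext:
    prompt: str
    notes: list[str] = field(default_factory=list)

Control = Callable[[GenerationContext], GenerationContext]

def compose(controls: list[Control]) -> Control:
    """Apply controls in order, each seeing the previous one's output."""
    def pipeline(ctx: GenerationContext) -> GenerationContext:
        for control in controls:
            ctx = control(ctx)
        return ctx
    return pipeline

def attention_emphasis(ctx):  # stand-in for a state control (e.g., PASTA)
    ctx.notes.append("state: emphasize instruction spans")
    return ctx

def constrained_decoding(ctx):  # stand-in for an output control (e.g., DeAL)
    ctx.notes.append("output: rerank candidates against constraints")
    return ctx

steer = compose([attention_emphasis, constrained_decoding])
result = steer(GenerationContext(prompt="Is the earth flat?"))
print(result.notes)  # controls fire in declaration order
```

The design point being illustrated: each control sees the context produced by the controls before it, so ordering matters when controls interact.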
Demonstrations of each of the implemented steering methods in the toolkit are provided in the examples/notebooks/control_* folders; links to Colab notebooks are provided below.
| Method | Authors | Notebook |
|---|---|---|
| Activation Addition (ActAdd) | Turner et al., 2023 | |
| Contrastive Activation Addition (CAA) | Panickssery et al., 2023 | |
| Conditional Activation Steering (CAST) | Lee et al., 2024 | |
| Decoding-time Alignment (DeAL) | Huang et al., 2024 | |
| Few-shot Learning | Brown et al., 2020 | |
| Inference-Time Intervention (ITI) | Li et al., 2023 | |
| Post-hoc Attention Steering (PASTA) | Zhang et al., 2023 | |
| Reward-Augmented Decoding (RAD) | Deng & Raffel, 2023 | |
| Self-Disciplined Autoregressive Sampling (SASA) | Ko et al., 2025 | |
| Thinking Intervention | Wu et al., 2025 | |
The toolkit also provides wrappers for the following external libraries.
| Wrapper | Authors | Notebook |
|---|---|---|
| MergeKit | Goddard et al., 2024 | |
| TRL | von Werra et al., 2020 | |
We invite community contributions, primarily on broadening the set of steering methods (via new controls) and evaluations (via use cases and metrics). We additionally welcome reports of any bugs/issues, improvements to the documentation, and new features. Specifics on how to contribute can be found in our contribution guidelines. To make contributing easier, we have prepared the following tutorials.
If there is an existing steering method that is not yet in the toolkit but you feel should be, or you have developed a new steering method of your own, the toolkit has been designed to make contributing new steering methods relatively easy. Please see the tutorial on adding your own steering method for a detailed guide.
Use cases enable comparison of different steering methods on a common task. The base `UseCase` class
(`aisteer360/evaluation/use_cases/`) and the `Benchmark` class (`aisteer360/evaluation/benchmark.py`) enable this
comparison. If you'd like to compare various steering methods/pipelines on a novel use case, please see the tutorial
on adding your own use case.
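To make the role of a use case concrete, here is a minimal, self-contained sketch. It deliberately does not import the toolkit: `UseCaseSketch` and its method names are invented and only mimic the shape of a base class like `UseCase`, and the scoring logic is a toy placeholder.

```python
from abc import ABC, abstractmethod

# Illustrative shape only: the toolkit's actual UseCase base class
# (aisteer360/evaluation/use_cases/) may differ in names and signatures.

class UseCaseSketch(ABC):
    """A use case packages a task: prompts in, scores out."""

    @abstractmethod
    def generate_prompts(self) -> list[str]:
        """Produce the task's evaluation prompts."""

    @abstractmethod
    def evaluate(self, prompts: list[str], responses: list[str]) -> dict:
        """Score model responses; a benchmark compares these scores
        across steering methods/pipelines on the same prompts."""

class EchoLengthUseCase(UseCaseSketch):
    def generate_prompts(self):
        return ["Summarize briefly:", "Answer yes or no:"]

    def evaluate(self, prompts, responses):
        # Toy score: mean response length (a placeholder, not a real metric)
        return {"mean_length": sum(len(r) for r in responses) / len(responses)}

uc = EchoLengthUseCase()
scores = uc.evaluate(uc.generate_prompts(), ["Short.", "Yes."])
print(scores)  # {'mean_length': 5.0}
```

Because every steering method is evaluated on the same prompts and the same `evaluate` logic, the resulting scores are directly comparable across methods.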
Metrics are used by a given benchmark to quantify model performance across steering pipelines in a comparable way. We've
included a selection of generic metrics (see `aisteer360/evaluation/metrics/`). If you'd like to add new generic metrics
or custom metrics (for a new use case), please see the tutorial on
adding your own metric.
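The contract a metric satisfies can be sketched in a few lines. The class below is illustrative only: it mimics the idea of a reusable, callable metric, but the name and signature are invented and the toolkit's actual metric base class (`aisteer360/evaluation/metrics/`) may differ.

```python
# Illustrative only: a self-contained stand-in for a generic metric.

class ExactMatch:
    """Generic metric: fraction of responses equal to their reference."""
    name = "exact_match"

    def __call__(self, responses: list[str], references: list[str]) -> float:
        assert len(responses) == len(references), "one reference per response"
        hits = sum(r.strip() == ref.strip()
                   for r, ref in zip(responses, references))
        return hits / len(references)

metric = ExactMatch()
print(metric(["Paris", "Berlin "], ["Paris", "Berlin"]))  # 1.0
```

Keeping metrics as small callables with a fixed signature is what lets a benchmark apply the same scoring to every steering pipeline it compares.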
The AI Steerability 360 toolkit has been brought to you by IBM.

