Steering Bench

Official codebase for the paper: Analyzing the Generalization and Reliability of Steering Vectors

Quickstart

This repository contains instructions for running the layer sweep and steering experiments described in our paper.

First, clone the repository and install the package with its dependencies:

git clone https://github.com/dtch1997/steering-bench
cd steering-bench 
pip install -e .

Experiments

This repository facilitates running the following experiments:

Layer sweep: Extract and apply steering vectors at many different layers in order to select the 'best' layer (the one with the highest steerability).
Steering generalization: Run steering on a given task with variations in user and system prompts to evaluate generalization.
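
The layer sweep can be sketched as follows. This is a minimal illustration, not the repository's implementation. It assumes steerability is measured as the slope of a least-squares line fit to a propensity score (for example, a mean logit difference) against the steering multiplier. The names `steerability` and `select_best_layer` and the numbers in `propensity_by_layer` are hypothetical.

```python
from statistics import mean


def steerability(multipliers, propensities):
    """Slope of the least-squares line of propensity vs. steering multiplier."""
    m_bar = mean(multipliers)
    p_bar = mean(propensities)
    cov = sum((m - m_bar) * (p - p_bar) for m, p in zip(multipliers, propensities))
    var = sum((m - m_bar) ** 2 for m in multipliers)
    return cov / var


def select_best_layer(multipliers, propensity_by_layer):
    """Return (best_layer, slopes), where best_layer has the largest slope."""
    slopes = {
        layer: steerability(multipliers, propensities)
        for layer, propensities in propensity_by_layer.items()
    }
    best_layer = max(slopes, key=slopes.get)
    return best_layer, slopes


# Made-up propensity scores measured at three layers (assumption, not real data).
multipliers = [-1.0, 0.0, 1.0]
propensity_by_layer = {
    0: [0.0, 0.1, 0.2],
    5: [-1.0, 0.0, 1.0],
    10: [-0.3, 0.0, 0.3],
}

best_layer, slopes = select_best_layer(multipliers, propensity_by_layer)
print("best layer:", best_layer)
```

In a real sweep, each list of propensities would come from running the steered model at one layer over a range of multipliers. The same slope could then be recomputed under different user and system prompts to probe generalization.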

Package

This repository also provides off-the-shelf components that make it easy to run a custom steering experiment.

Pipeline, a wrapper around a (possibly-steered) model
Formatter, an abstraction for prompt-based scaffolding
PipelineHook, an abstraction for generic steering interventions
SteeringHook, a hook that applies steering vectors using our steering vectors library
Steerability metrics used in the paper
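
To illustrate how components like these might fit together, here is a self-contained toy sketch of a hook that adds a scaled steering vector to one layer's activations. It does not use the steering_bench API. `ToyPipeline`, `ToySteeringHook`, and every method shown are hypothetical stand-ins, and the "model" is just a list of functions over plain Python lists.

```python
from contextlib import contextmanager


class ToySteeringHook:
    """Hypothetical hook: adds multiplier * vector to one layer's output."""

    def __init__(self, layer, vector, multiplier=1.0):
        self.layer = layer
        self.vector = vector
        self.multiplier = multiplier

    def __call__(self, layer, activations):
        # Leave every layer except the target layer untouched.
        if layer != self.layer:
            return activations
        return [a + self.multiplier * v for a, v in zip(activations, self.vector)]


class ToyPipeline:
    """Hypothetical wrapper around a (possibly steered) stack of layers."""

    def __init__(self, layers):
        self.layers = layers  # each layer maps a list of floats to a list of floats
        self.hooks = []

    @contextmanager
    def use_hook(self, hook):
        # Register the hook only for the duration of the with-block.
        self.hooks.append(hook)
        try:
            yield self
        finally:
            self.hooks.remove(hook)

    def run(self, inputs):
        activations = list(inputs)
        for index, layer_fn in enumerate(self.layers):
            activations = layer_fn(activations)
            for hook in self.hooks:
                activations = hook(index, activations)
        return activations


def double(xs):
    """Toy layer: doubles every activation."""
    return [2.0 * x for x in xs]


pipeline = ToyPipeline(layers=[double, double])
hook = ToySteeringHook(layer=0, vector=[1.0, -1.0], multiplier=0.5)

unsteered = pipeline.run([1.0, 1.0])
with pipeline.use_hook(hook):
    steered = pipeline.run([1.0, 1.0])
```

Scoping the hook to a context manager keeps the intervention temporary, so the same pipeline object can produce steered and unsteered outputs for comparison.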

Paper Reproduction

This codebase has been simplified to improve readability and was not directly used to generate the results in the paper. To reproduce specific plots from the paper, refer to our original codebase.

Citation

If you find this work useful, please consider citing our paper:

@misc{tan2024analyzinggeneralizationreliabilitysteering,
      title={Analyzing the Generalization and Reliability of Steering Vectors}, 
      author={Daniel Tan and David Chanin and Aengus Lynch and Dimitrios Kanoulas and Brooks Paige and Adria Garriga-Alonso and Robert Kirk},
      year={2024},
      eprint={2407.12404},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.12404}, 
}

Acknowledgements

This work was made possible by the generous support of:

FAR AI
The UCL AI Centre
UCL DARK Lab
The Agency for Science, Technology, and Research
