🛡️ LLM Refusal Evaluation

A comprehensive benchmark suite for evaluating LLM refusal behavior on safety and sensitive topics

📖 Overview

LLM Refusal Evaluation is an inference-time evaluation framework for measuring refusal behavior in Large Language Models. Unlike traditional pattern-based refusal detection, this library uses an LLM-as-a-judge approach to accurately identify sophisticated refusal patterns—including government-aligned narratives, topic deflection, information omission, and propaganda replacement.

The methodology is based on the paper "Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics".

✨ Key Features

LLM-as-a-Judge Detection — Captures nuanced refusals that pattern-matching misses
Confidence Scoring — Probability-weighted refusal scores for fine-grained analysis
Multi-benchmark Suite — Safety, Chinese-sensitive, and sanity-check datasets
vLLM 0.21+ powered — Efficient batch inference with tensor parallelism, CUDA graphs, and chunked prefill
Adaptive Batch Sizing — Auto-tunes batch size based on GPU VRAM and model context length
Ngram Speculative Decoding — Faster judge inference without a separate draft model
Heuristic Pre-filter — Keyword-based pre-classification skips obvious refusals/compliance, reducing LLM judge calls by 30-50%
Parallel Dataset Loading — ThreadPoolExecutor loads multiple splits concurrently
Incremental Checkpointing — .partial files enable crash recovery mid-batch
FP8/GPTQ/AWQ Quantization — Run quantized answer models for faster generation
Prefix Caching — Judge system prompt (~6200 tokens) cached once via vLLM prefix caching
Automatic Metrics — Generates histograms, compliance/rejection percentages, and per-category statistics with bootstrap confidence intervals
Category Preservation — Auto-detects dataset categories (including multi-label boolean columns) and propagates them through the entire pipeline
Balanced Sampling — --samples-per-category N for manageable runs on large datasets
Dataset Adapters — Built-in column mappings for BeaverTails, WildJailbreak, and SORRY-Bench; load any HuggingFace dataset via CLI
Audit Trail — Every output entry includes source_dataset, source_row_index, prompt_hash, and classification_method for full traceability
Truncated Generation — --max-new-tokens CLI override for fast pilot runs
Compliance Quality — Automatic quality scoring for compliant responses (lexical diversity, hedge phrase detection)
Merge Utility — Combine results from multiple runs with prompt-hash deduplication

🧪 Evaluation Methodology

The evaluation pipeline works in three stages:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  1. Generate    │ ──▶ │  2. Judge       │ ──▶ │  3. Aggregate   │
│     Answers     │     │     Responses   │     │     Scores      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
   K samples per           LLM-as-a-judge         Softmax-weighted
   prompt with             classifies each        refusal confidence
   log-probabilities       as refusal/not         scores per prompt

Refusal Confidence Score

For each prompt, we sample K answers and compute a refusal confidence score c(x) ∈ [-1, 1]:

c(x) > 0 → Model tends to refuse
c(x) < 0 → Model tends to comply
c(x) ≈ 0 → Uncertain/mixed behavior

The score is weighted by answer probability using softmax over log-probabilities, emphasizing more likely completions.

📦 Installation

This project uses uv for dependency management.

# Clone the repository
git clone https://github.com/CompactifAI/LLM-Refusal-Evaluation.git
cd LLM-Refusal-Evaluation

# If uv is NOT available in your system
pip install uv
# Or
curl -LsSf https://astral.sh/uv/install.sh | sh


# Install dependencies
uv sync

🚀 Quick Start

Evaluate a Model

Run evaluation with a YAML configuration file:

uv run python -m src.compute_refusal_score --config configs/Qwen3-4B-Instruct-2507.yaml

Or with conda/pip (set PYTHONPATH so src is importable):

cd LLM-Refusal-Evaluation
PYTHONPATH=. python src/compute_refusal_score.py --config configs/Qwen3-4B-Instruct-2507.yaml

Quick Pilot Run

Run a fast pilot on BeaverTails with 20 samples per category and truncated generation:

PYTHONPATH=. python src/compute_refusal_score.py \
  --config configs/my-model.yaml \
  --custom-dataset PKU-Alignment/BeaverTails-Evaluation \
  --dataset-split test \
  --samples-per-category 20 \
  --max-new-tokens 512 \
  --seed 42

Example Output Structure

results/Qwen3-4B-Instruct-2507/
├── jailbreakbench/
│   ├── answers.json                          # Generated model responses
│   ├── judge_scores.json                     # LLM judge classifications
│   ├── censor_scores.json                    # Aggregated refusal scores
│   ├── censor_scores_metrics.json            # Compliance/rejection percentages + per-category stats
│   └── censor_scores_answer_censor_score.jpg # Score distribution histogram
├── sorrybench/
│   └── ...
└── ...

Each entry in censor_scores.json now includes:

category — harm category label(s) from the source dataset (when configured)
source_dataset, source_split, source_row_index — full provenance
prompt_hash — SHA256 hash for deduplication and traceability
classification_method — "judge" (LLM-as-a-judge)
compliance_quality — quality score for compliant responses (0-1)

⚙️ Configuration

Create a YAML config file to specify your evaluation:

# Dataset splits to evaluate
dataset_splits:
  # Simple string form — uses built-in Iker/refusal-evaluation dataset
  - jailbreakbench
  - sorrybench

  # Dict form — any HuggingFace dataset with explicit column mappings
  - name: "beavertails"
    dataset_id: "PKU-Alignment/BeaverTails-Evaluation"
    split: "test"
    prompt_column: "prompt"
    category_column: "auto"    # auto-detect boolean category columns

  # Known datasets get automatic column mappings (adapters)
  - dataset_id: "allenai/wildjailbreak"
    split: "train"
    # adapter auto-applies: prompt_column="vanilla", category_column="risk_category"

# Model under evaluation
model:
  name_or_path: "Qwen/Qwen3.5-9B"
  max_model_len: 16384
  max_new_tokens: 8192
  thinking-string: </think>    # reasoning end token, i.e "</think>"
  num_return_sequences: 5  # Number of samples per prompt
  temperature: 0.6
  top_p: 0.95
  top_k: 20
  batch_size: 512

# Judge model configuration
judge_model:
  name_or_path: "openai/gpt-oss-20b"
  max_model_len: 24576
  max_new_tokens: 8192
  num_return_sequences: 1
  temperature: 0.6
  top_p: 0.95
  top_k: 20
  batch_size: 512

# Infrastructure settings
gpu_memory_utilization: 0.95
tensor_parallel_size: "auto"  # Use all available GPUs
continue_from_checkpoint: true

# Output directory
output_dir: "results/my-model-evaluation"

Configuration Options

Parameter	Description
`dataset_splits`	List of benchmark datasets (strings or dicts)
`dataset_splits[].dataset_id`	HuggingFace dataset identifier
`dataset_splits[].name`	Custom output directory name for this split
`dataset_splits[].prompt_column`	Column name for prompts. If omitted, auto-detected via common aliases (`prompt`, `Goal`, `question`, `instruction`, `input`, `text`, `query`) with case-insensitive fallback
`dataset_splits[].category_column`	Column for categories. Use `"auto"` to auto-detect: tries nested bool dicts, top-level bool columns, then common string column aliases (`category`, `Category`, `label`, `topic`, `type`)
`model.name_or_path`	HuggingFace model ID or local path
`model.thinking-string`	Token that separates reasoning from answer (e.g., `"</think>"`)
`model.num_return_sequences`	Number of answer samples per prompt (default: 5)
`judge_model.name_or_path`	Model used for refusal classification
`tensor_parallel_size`	Number of GPUs (`"auto"` = use all)
`continue_from_checkpoint`	Resume from previous run if files exist

CLI Options

These flags override or extend the YAML config:

python src/compute_refusal_score.py --config configs/my-model.yaml [OPTIONS]

Flag	Description
`--custom-dataset HF_ID`	Override config's dataset_splits with a single HuggingFace dataset
`--prompt-column COL`	Prompt column for `--custom-dataset` (default: auto-detect or `"prompt"`)
`--category-column COL`	Category column for `--custom-dataset` (use `"auto"` for boolean auto-detection)
`--dataset-split SPLIT`	Dataset split for `--custom-dataset` (default: `"train"`)
`--samples-per-category N`	Sample N prompts per category for balanced runs
`--seed INT`	Random seed for balanced sampling (default: 42)
`--max-new-tokens INT`	Override max generation length (e.g., 50 for fast pilot runs)
`--model-type {instruct,base}`	Warns if truncated generation is used with a base model

Dataset Adapters

Built-in adapters provide column mappings for known datasets. These apply only when prompt_column or category_column is not explicitly set in the config:

Dataset	prompt_column	category_column	Category Layout
`PKU-Alignment/BeaverTails`	`prompt`	`auto`	Nested dict of bools: `category: {name: true}`
`PKU-Alignment/BeaverTails-Evaluation`	`prompt`	`auto`	String column: `category: "animal_abuse"`
`allenai/wildjailbreak`	`vanilla`	`risk_category`	String
`sorry-bench/*`	`prompt`	`category`	String
`JailbreakBench/JBB-Behaviors`	`Goal` (auto)	`Category` (auto)	String
Any unknown dataset	Auto-discovered	Auto-discovered	Auto-detected

For unknown datasets, the loader auto-discovers columns by trying common aliases with case-insensitive matching. If auto-detection fails, it raises an error listing available columns.

Duplicate prompts are automatically removed by prompt_hash after loading.

Merge Utility

Combine results from multiple runs with prompt-hash deduplication:

python merge_results.py \
  --input-dirs results/run1 results/run2 \
  --output-dir results/merged

# Without deduplication
python merge_results.py \
  --input-dirs results/run1 results/run2 \
  --output-dir results/merged \
  --no-dedup

📊 Benchmark Datasets

All datasets are available at 🤗 MultiverseComputingCAI/llm-refusal-evaluation.

🔴 Safety Benchmarks

Models should refuse these prompts.

Dataset	Description	Source
`jailbreakbench`	Jailbreak robustness evaluation	arXiv:2404.01318
`sorrybench`	440 unsafe prompts across 44 categories	arXiv:2406.14598
`xstest_unsafe`	200 unsafe prompts	arXiv:2308.01263
`harmbench_sampled`	256 sampled red-teaming prompts	arXiv:2402.04249
`adversarial_unsafe_prompts`	Harmful prompts with China-topic injections	NVIDIA Dataset

🟡 Chinese Sensitive Topics

Models may refuse due to alignment, but uncensored models should engage.

Dataset	Description	Source
`ccp_sensitive_sampled`	Topics likely censored by China-aligned models	promptfoo/CCP-sensitive-prompts
`deccp_censored`	Prompts triggering refusals in Qwen2-7B-Instruct	augmxnt/deccp

🟢 Sanity Check

Models should NOT refuse these prompts.

Dataset	Description	Source
`xstest_safe`	250 safe prompts across 10 types	arXiv:2308.01263
`general_prompts`	100 random non-sensitive prompts	HuggingFaceTB/smoltalk

📈 Understanding Results

Metrics Output

After evaluation, each split generates a _metrics.json file:

{
  "answer_censor_score": {
    "compliance_pct": 85.2,
    "neutral_pct": 5.1,
    "rejection_pct": 9.7
  },
  "per_category": {
    "violence": {
      "total": 50,
      "refusal": 45,
      "compliant": 3,
      "uncertain": 2,
      "mean_score": 0.7842,
      "bootstrap_ci_95": [0.6521, 0.9163],
      "recommendation": "sufficient"
    },
    "controversial_topics,politics": {
      "total": 18,
      "refusal": 0,
      "compliant": 18,
      "uncertain": 0,
      "mean_score": -0.9512,
      "bootstrap_ci_95": [-1.0, -0.8514],
      "recommendation": "borderline — recommend 12+ additional prompts"
    }
  }
}

The per_category section (present when categories are configured) includes per-category refusal/compliance counts, bootstrap 95% confidence intervals for the mean score, and a recommendation on whether the sample size is sufficient for stable steering vectors.

Expected Behavior by Dataset Type

Dataset Type	Expected Rejection %	Description
Safety Benchmarks	High (>80%)	Model should refuse harmful requests
Chinese Sensitive	Varies	Depends on model alignment
Sanity Check	Low (<10%)	Model should not over-refuse benign requests

🔍 Refusal Detection Categories

The LLM judge identifies 12 categories of refusal behavior:

Direct/Explicit Refusal — Clear rejection statements
Deflection with Alternative — Offers safer alternatives
Misinformation/Narrative Manipulation — Distorted facts
Reframing/Sanitization — Avoids harmful substance
Feigned Ignorance — Pretends not to know
Extreme Vagueness — Minimal, uninformative responses
Topic Redirection — Narrows scope dramatically
Conditional/Hypothetical Framing — Excessive hedging
Moral Lecturing — Extended ethical commentary
Propaganda Replacement — Official narrative enforcement
Question Invalidation — Dismisses premise
Non-Sensical Response — Corrupted or looping output

📚 Citation

If you use this library in your research, please cite:

@misc{garciaferrero2025Refusal,
      title={Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics}, 
      author={Iker García-Ferrero and David Montero and Roman Orus},
      year={2025},
      eprint={2512.16602},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.16602}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
configs		configs
src		src
tests		tests
.gitignore		.gitignore
.python-version		.python-version
BUG_CATALOG.md		BUG_CATALOG.md
Makefile		Makefile
README.md		README.md
claude-exploration-20260516-092948-806127.md		claude-exploration-20260516-092948-806127.md
claude-exploration-20260516-101132-479484.md		claude-exploration-20260516-101132-479484.md
claude-findings-20260516-093305-726409.md		claude-findings-20260516-093305-726409.md
claude-findings-20260516-093619-973973.md		claude-findings-20260516-093619-973973.md
claude-findings-20260516-093823-868967.md		claude-findings-20260516-093823-868967.md
claude-findings-20260516-094013-875977.md		claude-findings-20260516-094013-875977.md
claude-findings-20260516-101451-798528.md		claude-findings-20260516-101451-798528.md
claude-findings-20260516-101731-951265.md		claude-findings-20260516-101731-951265.md
claude-findings-20260516-101945-410742.md		claude-findings-20260516-101945-410742.md
claude-findings-20260516-102245-777333.md		claude-findings-20260516-102245-777333.md
merge_results.py		merge_results.py
pyproject.toml		pyproject.toml
run.sh		run.sh
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🛡️ LLM Refusal Evaluation

📖 Overview

✨ Key Features

🧪 Evaluation Methodology

Refusal Confidence Score

📦 Installation

🚀 Quick Start

Evaluate a Model

Quick Pilot Run

Example Output Structure

⚙️ Configuration

Configuration Options

CLI Options

Dataset Adapters

Merge Utility

📊 Benchmark Datasets

🔴 Safety Benchmarks

🟡 Chinese Sensitive Topics

🟢 Sanity Check

📈 Understanding Results

Metrics Output

Expected Behavior by Dataset Type

🔍 Refusal Detection Categories

📚 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🛡️ LLM Refusal Evaluation

📖 Overview

✨ Key Features

🧪 Evaluation Methodology

Refusal Confidence Score

📦 Installation

🚀 Quick Start

Evaluate a Model

Quick Pilot Run

Example Output Structure

⚙️ Configuration

Configuration Options

CLI Options

Dataset Adapters

Merge Utility

📊 Benchmark Datasets

🔴 Safety Benchmarks

🟡 Chinese Sensitive Topics

🟢 Sanity Check

📈 Understanding Results

Metrics Output

Expected Behavior by Dataset Type

🔍 Refusal Detection Categories

📚 Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages