TL;DR: Steering vectors are the most underrated tool in the LLM practitioner’s toolkit - and also the most oversold. They genuinely work for some behaviors (refusal, sentiment, formality). They genuinely fail for others (factual recall, complex reasoning). This is the guide I wish I’d had six months ago: what to steer, where to inject, how strong, and when to give up and try something else.
The Promise and the Reality
The pitch for activation steering is seductive. You take a pair of contrasting prompts (“be helpful” vs “refuse everything”), run them through the model, compute the mean activation difference, and add that vector during inference. No fine-tuning. No RLHF. No gradient updates. Just a single vector addition that shifts model behavior at inference time.
When it works, it’s magical. You can make a model more concise, more formal, more willing to refuse harmful requests - all without touching the weights.
When it doesn’t work, you get gibberish, or no effect at all, or the model starts being weirdly formal about unrelated topics. And the literature doesn’t always tell you which outcome to expect.
I’ve spent the last several months building and testing steering vectors across multiple models and behaviors for my work on agent safety. This is what I actually learned - not the theory, but the practice.
The 60-Second Version of How Steering Works
If you know this already, skip ahead. For everyone else:
- Create contrast pairs. Two sets of prompts. One set represents the behavior you want more of. One represents the behavior you want less of.
- Extract activations. Run both sets through the model. At each layer, collect the hidden state vectors.
- Compute the steering direction. `steering_vector = mean(positive_activations) - mean(negative_activations)` at your chosen layer.
- Apply at inference. During generation, add `alpha * steering_vector` to the hidden state at the chosen layer and token position. `alpha` controls strength.
That’s it. The entire method. Logistic regression on activations is detection (probing). Vector addition to activations is control (steering). Same representations, different operation.
flowchart LR
A["Contrast Pairs<br/><i>positive vs negative</i>"] --> B["Forward Pass<br/><i>extract layer activations</i>"]
B --> C["Mean Difference<br/><i>compute steering vector</i>"]
C --> D["Inference<br/><i>add α × vector at layer L</i>"]
D --> E["Steered Output"]
style A fill:#264653,color:#fff
style C fill:#2a9d8f,color:#fff
style E fill:#e76f51,color:#fff
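For reference, here is a minimal sketch of steps 1-3 with Hugging Face transformers, assuming a Llama/Mistral-style decoder-only model. The model name, the layer index, and the tiny prompt lists are placeholders; in practice you want 50-100 diverse pairs (more on that below).

```python
# Minimal sketch: extract activations for contrast pairs and compute a steering vector.
# Model name, layer index, and prompts are illustrative placeholders, not recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"
LAYER = 26  # ~80% depth for a 32-layer model; sweep this, don't trust it

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

positive_prompts = [  # behavior you want more of; use 50-100 diverse pairs in practice
    "I'd be happy to help you write that essay about renewable energy.",
    "Sure, here's a step-by-step plan for your garden.",
]
negative_prompts = [  # behavior you want less of
    "I can't help with that request.",
    "I'm not able to assist with this.",
]

@torch.no_grad()
def mean_activation(prompts, layer):
    """Mean hidden state at `layer`, taken at each prompt's last token position."""
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so hidden_states[layer] is block `layer`'s output
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

steering_vector = mean_activation(positive_prompts, LAYER) - mean_activation(negative_prompts, LAYER)
```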
What Actually Steers Well (And What Doesn’t)
This is the most important section. Not everything is steerable, and the literature buries this fact.
Based on my experiments across Mistral-7B, Gemma-2-9B, and Qwen-2.5-7B, plus what Tan et al. (NeurIPS 2024) and others have reported:
Reliably steerable behaviors
Refusal / compliance. This is the OG application and it works well. You can increase or decrease a model’s tendency to refuse harmful requests. Rimsky et al. first showed this, and it’s been replicated many times. In my experiments, refusal steering at strength 1.5-3.0 on middle-to-late layers consistently shifts behavior without destroying coherence.
Sentiment / tone. Positive vs. negative, formal vs. casual, assertive vs. hedging. These steer cleanly. The reason, I think, is that tone is a relatively low-dimensional property - it affects word choice more than logical structure. The model can produce the same content in a different register without its reasoning breaking down.
Conciseness / verbosity. You can push a model to give shorter or longer answers. Works well. Easy contrast pairs to construct.
Uncertainty expression. Steering a model to express more or less uncertainty (“I think…” vs “Definitely…”). This one surprised me with how clean it was on Gemma-2-9B. The model genuinely started hedging more on questions where it should be uncertain.
Unreliably steerable behaviors
Instruction hierarchy. Getting a model to prioritize system/developer messages over user messages. Sometimes works, sometimes produces confusing outputs where the model seems conflicted. High variance across inputs.
Creativity. Steering for more “creative” responses. The problem is that creativity is poorly defined as a contrast direction. What’s the opposite of creative? Generic? Formulaic? The contrast pairs are fuzzy, and the resulting vector is fuzzy too.
Technical depth. Steering between surface-level and deep technical explanations. Moderate success, but the model often responds by getting more verbose rather than actually more technical.
Effectively unsteerable behaviors
Factual accuracy. You cannot steer a model into knowing things it doesn’t know. There’s no “truthfulness direction” that magically makes a 7B model correct about obscure historical facts. This has been tried, and the result is usually that the model becomes more confident rather than more correct. Dangerous.
Complex reasoning. Steering doesn’t help with multi-step logic. If the model can’t solve a math problem, adding a “be smarter” vector doesn’t help. This makes sense - reasoning involves sequential computation across many layers, not a single direction at one layer.
Specific factual injection. “Steer the model to believe X” doesn’t work for specific facts. Steering is about behavioral tendencies, not knowledge.
The Layer Selection Problem
Every paper says “choose the right layer” and then hand-waves about how. Here’s what I’ve found in practice.
Late layers (last 25%) work best for behavioral steering. These layers encode high-level semantic properties - intent, style, register. This is where “refusal” vs “compliance” lives.
Middle layers (40-60%) sometimes work for reasoning-adjacent behaviors. Uncertainty expression, hedging, and technical depth seem to emerge here.
Early layers (first 25%) are usually a bad idea. Steering here corrupts low-level language representations. The model starts producing broken syntax. In my experiments with Gemma, early-layer steering (before layer 10 of 42) almost always degraded output quality.
The exception: Gemma-2-9B encodes sandbagging intent surprisingly early - layer 4 in my detection experiments. But detection and steering are different tasks. Just because a signal is readable at a layer doesn’t mean injecting a vector at that layer produces clean behavioral changes.
My practical rule: Start at ~75% depth. Sweep ±5 layers. Pick the one that gives the cleanest behavior change with the least coherence degradation.
| Model | Total Layers | My recommended starting layer | Notes |
|---|---|---|---|
| Mistral-7B | 32 | 24-28 | Layer 30 for detection, 24-26 for steering |
| Gemma-2-9B | 42 | 30-36 | Very sensitive to strength at early layers |
| Qwen-2.5-7B | 28 | 20-24 | Most forgiving of the three |
| Llama-3.1-8B | 32 | 24-28 | Similar profile to Mistral |
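To turn the 75%-depth heuristic into concrete layer indices for whatever model you're working with, a small helper like this is all you need (a sketch; not from any library):

```python
def candidate_layers(n_layers: int, depth: float = 0.75, window: int = 5) -> list[int]:
    """Layers to sweep: `window` either side of the given depth fraction."""
    center = round(depth * n_layers)
    return [l for l in range(center - window, center + window + 1) if 0 < l < n_layers]

print(candidate_layers(32))  # Mistral-7B / Llama-3.1-8B: layers 19 through 29
print(candidate_layers(42))  # Gemma-2-9B: layers 27 through 37
```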
The Strength Problem
The alpha parameter - how much of the steering vector to add - is the single most important hyperparameter, and for a long time there was no principled way to set it.
Too weak: no effect. Too strong: gibberish. And sometimes, maddeningly, stronger is worse - you crank alpha from 2.0 to 3.0 expecting more effect and the behavior actually reverses or degrades. I thought this was a bug in my pipeline until Taimeskhanov et al. published the first theoretical analysis of steering magnitude and showed the relationship is genuinely non-monotonic across 11 language models. There are regimes where increasing alpha decreases the intended effect. Knowing this would have saved me a week of confused debugging on Gemma.
The sweet spot varies by behavior, model, and even by input. A strength that works perfectly on short prompts might destroy coherence on long conversations.
Here’s what I’ve learned through trial and error (and now validated by the theory):
Start at alpha = 1.0 and binary search. Run 20-30 test prompts. If behavior doesn’t change, double it. If output quality drops, halve it. You’ll converge in 3-4 iterations.
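Concretely, the application side is one forward hook plus a sweep loop. The sketch below continues the extraction sketch above; `add_steering_hook` and `generate` are illustrative helpers that assume a Llama/Mistral-style `model.model.layers` layout, and the scoring calls at the bottom are placeholders for whatever behavior and coherence judges you use.

```python
import torch

def add_steering_hook(model, layer_idx, vector, alpha):
    """Add alpha * vector to the residual stream at `layer_idx` on every forward pass.
    Assumes a decoder layer list at model.model.layers (Llama/Mistral/Qwen style)."""
    vector = vector.to(model.device, dtype=model.dtype)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        hidden = hidden + alpha * vector
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return model.model.layers[layer_idx].register_forward_hook(hook)

def generate(model, prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Coarse sweep first, then binary search between the best two values.
for alpha in [0.5, 1.0, 2.0, 4.0]:
    handle = add_steering_hook(model, LAYER, steering_vector, alpha)
    outputs = [generate(model, p) for p in test_prompts]  # your 20-30 held-out prompts
    handle.remove()                                        # always detach before the next run
    print(alpha, score_behavior(outputs), score_coherence(outputs))  # placeholder judges
```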
Behavioral categories have roughly consistent ranges:
| Behavior | Typical alpha range | Notes |
|---|---|---|
| Refusal increase | 1.5 – 4.0 | Can go high without breaking |
| Refusal decrease | 1.0 – 2.0 | Be careful - removes safety guardrails |
| Sentiment shift | 0.5 – 2.0 | Subtle; higher values get cartoonish |
| Formality | 1.0 – 3.0 | Clean transitions |
| Conciseness | 1.0 – 2.5 | Higher = telegraphic |
| Uncertainty | 0.5 – 1.5 | Low alpha works well here |
| Sandbagging reduction | 1.0 – 2.0 | From my Gemma experiments |
Strength degrades over long generation. This was a surprise. I noticed that steering effects fade after ~300-500 tokens. The model’s autoregressive conditioning on its own output gradually pulls it back toward its default behavior. If you need steering to hold over a long response, you may need to re-inject at multiple positions, not just the prompt.
Different models have different tolerances. Gemma is more sensitive to high alpha values than Mistral. Qwen is the most tolerant. I don’t fully understand why, but I suspect it relates to the norm of residual stream vectors and how much “room” there is to inject a direction without dominating the signal.
The Contrast Pair Problem
Your steering vector is only as good as your contrast pairs. This sounds obvious, but it’s where most failures actually originate.
The biggest mistake: too few pairs. I started with 16 pairs and got garbage results. Moving to 32-64 pairs significantly improved consistency. Moving beyond 128 didn’t help much. I think there’s a sweet spot around 50-100 diverse pairs per behavior.
The second biggest mistake: pairs that differ on multiple dimensions. If your “positive” prompt is both more formal AND more helpful AND longer, your steering vector encodes all three properties entangled together. You’ll steer formality when you meant to steer helpfulness.
Good contrast pairs should differ on exactly one dimension:
# Good pair - only refusal changes
Positive: "I'd be happy to help you write that essay about renewable energy."
Negative: "I can't help with that request."
# Bad pair - refusal AND topic AND length differ
Positive: "Here's a comprehensive 500-word analysis of solar panel efficiency trends..."
Negative: "No."
Template approach works well. I use a template with a slot for the behavioral variable:
template = "The assistant responds to the user's question about {topic}. The assistant is {behavior}."
topics = ["renewable energy", "tax filing", "debugging Python", "meal planning"]  # example topics; use many, keep them diverse
positive_prompts = [template.format(topic=t, behavior="direct and helpful") for t in topics]
negative_prompts = [template.format(topic=t, behavior="evasive and unhelpful") for t in topics]
Same topics, same structure, only the behavior variable changes. This gives you cleaner vectors.
Multi-Vector Steering: When It Works and When It Doesn’t
One of the most asked questions: can you stack multiple steering vectors? “I want the model to be more formal AND more concise AND more uncertain.”
Short answer: sometimes, but it’s fragile.
What works: Two behaviors that are roughly orthogonal in activation space. Formality and conciseness, for example, seem to occupy different directions. Stacking them at the same layer with reasonable alpha values produces the expected combined effect.
What doesn’t work: Two behaviors that share representation space. Refusal and helpfulness are NOT orthogonal - they’re opposite ends of the same direction. Trying to simultaneously increase refusal and increase helpfulness produces incoherent output. This makes intuitive sense but it’s a real limitation.
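A cheap sanity check before stacking anything is to measure how much two vectors overlap. Near-zero cosine similarity is no guarantee, but large overlap is a reliable warning. A sketch, where `formality_vec` and friends stand in for vectors you've already computed:

```python
import torch.nn.functional as F

def steering_overlap(vec_a, vec_b):
    """Cosine similarity between two steering vectors; near 0 means roughly orthogonal."""
    return F.cosine_similarity(vec_a.float(), vec_b.float(), dim=0).item()

print(steering_overlap(formality_vec, conciseness_vec))  # low overlap: stacking usually behaves
print(steering_overlap(refusal_vec, helpfulness_vec))    # high overlap: expect interference
```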
The better approach for multi-property steering: Inject different vectors at different layers, as recommended by Weij et al. (2024). Layer 24 handles refusal, layer 28 handles formality. This avoids the interference that comes from adding multiple vectors at the same point.
flowchart TD
subgraph "Naive stacking (often fails)"
N1["Layer 26"] --> N2["+ refusal_vec + formality_vec<br/><i>vectors interfere</i>"]
end
subgraph "Layer-separated (more robust)"
L1["Layer 24"] --> L2["+ refusal_vec"]
L3["Layer 28"] --> L4["+ formality_vec"]
end
style N2 fill:#9d0208,color:#fff
style L2 fill:#2d6a4f,color:#fff
style L4 fill:#2d6a4f,color:#fff
I haven’t tested beyond 3 simultaneous vectors. My intuition is that you’re hitting diminishing returns - and increasing interference risk - after 2-3.
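In code, the layer-separated version is just two hooks at different depths, reusing the illustrative `add_steering_hook` helper from the strength section (the layer indices and alphas here are examples, not recommendations):

```python
# One vector per layer, as in the diagram above.
handles = [
    add_steering_hook(model, layer_idx=24, vector=refusal_vec, alpha=2.0),
    add_steering_hook(model, layer_idx=28, vector=formality_vec, alpha=1.5),
]
try:
    text = generate(model, "Explain how to dispute a parking ticket.")
finally:
    for h in handles:
        h.remove()  # clean up, or stale hooks contaminate the next experiment
```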
CAST: Conditional Steering (the 2025 Advance That Matters Most)
The single most important development in steering over the past year is Conditional Activation Steering (CAST), presented at ICLR 2025 as a spotlight paper.
The problem with vanilla steering: it’s always on. If you add a refusal vector, the model refuses everything harder, not just harmful requests. That’s useless in production.
CAST solves this by analyzing activation patterns during inference to decide whether to steer. It projects the hidden state onto a “condition vector” and only applies steering when the input matches the condition.
Think of it as: `if input_is_about(harmful_content): apply_steering()`.
The condition detection and the steering both happen in activation space. No separate classifier. No extra model call. The model’s own representations tell you whether to steer.
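The core mechanism fits in a few lines: project the hidden state onto a condition direction and only add the steering vector where the projection clears a threshold. The sketch below is a paraphrase of that idea, not the CAST authors' implementation; the condition vector and threshold are things you'd have to calibrate on held-out data.

```python
import torch.nn.functional as F

def add_conditional_steering_hook(model, layer_idx, condition_vec, steering_vec, alpha, threshold):
    """Steer only at positions whose hidden state aligns with the condition direction.
    A paraphrase of the CAST idea, not the reference implementation."""
    condition_vec = F.normalize(condition_vec.to(model.device, dtype=model.dtype), dim=0)
    steering_vec = steering_vec.to(model.device, dtype=model.dtype)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output      # (batch, seq, d_model)
        similarity = F.cosine_similarity(hidden, condition_vec, dim=-1)  # (batch, seq)
        gate = (similarity > threshold).unsqueeze(-1).to(hidden.dtype)   # 1.0 where the condition fires
        hidden = hidden + gate * alpha * steering_vec
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return model.model.layers[layer_idx].register_forward_hook(hook)
```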
This is the bridge between academic steering vector research and production safety systems. Without conditional application, steering is a blunt instrument. With it, you can build selective refusal, domain-specific compliance, and topic-aware behavior modification - all at inference time, all without fine-tuning.
The Side Effects Nobody Talks About
Steering refusal hurts helpfulness - and can compromise safety in ways you won’t catch in testing. In my experiments, a refusal vector at alpha=3.0 on Mistral-7B increased refusal of genuinely harmful requests by ~60%, but it also increased refusal of benign requests by ~15%. I thought the trade-off was manageable - until I read Goyal & Daume’s “Steering Safely or Off a Cliff?”. Their finding should worry anyone deploying steering for safety: overrefusal steering maintains general abilities and looks reliable in standard evals, but consistently fails under adversarial conditions. It substantially increases vulnerability to jailbreaks. You’ve effectively loosened the model’s safety foundations while appearing to tighten them. This is the kind of failure mode that passes all your tests and bites you in production.
Steering can shift model calibration. Adding an uncertainty vector doesn’t just change how the model talks about its confidence - it can actually shift the distribution of token probabilities. I saw cases where uncertainty steering caused the model to distribute probability mass more evenly across continuations, which sometimes improved factual accuracy (the model hedged instead of committing to a wrong answer) and sometimes degraded it (the model hedged on things it actually knew).
Steering effects are prompt-dependent. Tan et al. at NeurIPS 2024 showed this rigorously: the same steering vector has different effects depending on the input prompt. Some inputs are highly steerable, others barely respond. This means you can’t characterize a steering vector by its average effect - you need to think about the variance across inputs.
Long conversations drift. I mentioned this above, but it bears repeating. In a multi-turn conversation, steering effects from the first turn gradually wash out by turn 5-6. If you need persistent behavioral modification across a conversation, you need to re-apply steering at each turn, or use a different approach altogether.
My Practical Playbook
Here’s the workflow I’ve settled on for building a new steering vector:
- Define the behavior precisely. “More helpful” is too vague. “Responds to code questions with working examples instead of explanations” is specific enough.
- Write 50-100 contrast pairs. Use templates. Vary the topics. Keep everything else constant. Review manually for quality.
- Extract at 5 candidate layers. 50%, 60%, 70%, 80%, 90% depth. Don’t trust anyone’s layer recommendation, including mine - model architectures differ.
- Sweep alpha in {0.5, 1.0, 1.5, 2.0, 3.0, 5.0}. On 30 test prompts per alpha value. Score for behavior change AND coherence (see the sketch after this list).
- Check for side effects. Run the steered model on 50 prompts unrelated to the target behavior. If MMLU drops more than 2 points, your vector is too aggressive or your contrast pairs are contaminated.
- Test robustness to prompt variation. Does the steering hold when the system prompt changes? When the user speaks a different language? When the input is adversarial?
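The sweep in steps 3-5 is a nested loop over layers and strengths. A skeleton of what that looks like, reusing the illustrative helpers from the earlier sketches, with placeholder scoring functions standing in for whatever behavior, coherence, and capability judges you have:

```python
# Skeleton of the layer x alpha sweep (steps 3-5). Scoring functions are placeholders.
results = []
for layer in [14, 17, 20, 22, 25]:  # 50-90% depth for a 28-layer model; adjust per model
    vec = mean_activation(positive_prompts, layer) - mean_activation(negative_prompts, layer)
    for alpha in [0.5, 1.0, 1.5, 2.0, 3.0, 5.0]:
        handle = add_steering_hook(model, layer, vec, alpha)
        steered = [generate(model, p) for p in test_prompts]       # ~30 behavior prompts
        unrelated = [generate(model, p) for p in control_prompts]  # ~50 unrelated prompts
        handle.remove()
        results.append({
            "layer": layer,
            "alpha": alpha,
            "behavior": score_behavior(steered),    # did the target behavior change?
            "coherence": score_coherence(steered),  # is the output still fluent?
            "general": score_general(unrelated),    # e.g. an MMLU-style spot check
        })

for r in sorted(results, key=lambda r: r["behavior"], reverse=True)[:10]:
    print(r)  # pick the strongest behavior shift that keeps coherence and general scores flat
```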
Total time: about 2 hours on a single GPU for a 7B model. M4 Pro with 48GB unified memory handles this fine.
What’s Coming Next
I started drafting this section with three items. Then January-February 2026 happened and the field exploded. Here’s what I think matters most, sorted by how likely each is to change my actual workflow:
Steering Vector Fields are what I’ve been waiting for. Li et al. (Feb 2026) proposed Steering Vector Fields - instead of a static vector applied uniformly, they learn a differentiable scoring function whose gradient defines the steering direction at each activation. Context-dependent steering with coordinated multi-layer interventions. This directly solves the “blunt instrument” problem I’ve been fighting in the multi-vector section above. I haven’t tested it yet, but the architecture is exactly what I’d design if I started from scratch.
SAE-guided steering is moving faster than I expected. Three papers in six weeks. Fang et al. (Jan 2026) proposed SAE-Steering for controlling reasoning strategies - backtracking, cross-verification - not just surface behaviors. Cho et al. (Feb 2026) built Control RL: an RL policy that selects which SAE feature to amplify at each token, with interpretable logs. And YaPO (arXiv:2601.08441) eliminates the contrast pair requirement entirely - it learns sparse steering vectors in SAE latent space via preference optimization with zero MMLU degradation. The contrast pair problem I dedicated a whole section to above? YaPO might just… solve it. I’m skeptical but intrigued.
The non-identifiability result is important and underappreciated. Venkatesh & Kurapath (Feb 2026) proved that steering vectors are fundamentally non-identifiable - many different vectors produce indistinguishable behavioral effects. This means when two teams find “different” steering directions for the same concept, they might both be right. It also means we should stop treating individual steering vectors as interpretable artifacts. The good news: identifiability recovers under structural assumptions (sparsity, multi-environment validation), so the practical tooling can be built - you just have to be honest about what you can and can’t claim.
Fine-grained steering is where the field is going. AUSteer (arXiv:2602.04428) decomposes activations into single-dimension “atomic units” and steers only the discriminative ones, with adaptive per-input strengths. It outperforms block-level baselines while touching fewer activations. This feels right - steering an entire residual stream direction was always too coarse.
Conceptor-based steering replaces additive vectors with soft projection matrices derived from conceptor theory. Boolean operations over conceptors allow compositional multi-goal steering that actually works, unlike my frustrated attempts at naive vector addition. This feels like a real improvement over the mean-difference approach.
Adaptive/PID steering frames the problem as a control system with proportional, integral, and derivative terms managing injection strength dynamically. This handles the “strength degradation over long generation” problem I described earlier. Nguyen et al. (Oct 2025) proposed it; I haven’t tested it but the formalism maps cleanly to the autoregressive fading I’ve observed.
A unified theory is emerging. Why Steering Works (Feb 2026) puts weight fine-tuning, LoRA, and activation steering into a single framework as “dynamic weight updates induced by a control signal.” The key insight for practitioners: there’s a consistent, predictable trade-off - stronger control increases behavioral change while reducing coherence. This isn’t surprising, but having a formal characterization means we can eventually optimize the trade-off rather than binary-searching alpha.
Probe-gated steering is what I’m building toward. Use probes to detect a problem in the activations, then steer to correct it in real time. The safety equivalent of an immune system. CAST is the closest existing work, and the ARGUS system demonstrated this for multimodal attacks. A general-purpose version - detect sandbagging, steer away from it; detect sycophancy, steer toward honesty - is the obvious next step, and it’s what connects my probe work to this steering guide.
Honest Assessment: Should You Use Steering in Production?
It depends.
Yes, if:
- You need inference-time behavior modification without fine-tuning
- The target behavior is clearly definable with contrast pairs
- You’ve tested thoroughly for side effects
- You’re using conditional application (not always-on)
- You have the ability to monitor and iterate
No, if:
- You need reliable control over factual accuracy (use RAG or fine-tuning instead)
- You’re working with an API-only model (you need activation access)
- Your target behavior is complex or poorly defined
- You need guarantees (steering is probabilistic, not deterministic)
- You can’t afford the development time for per-model tuning
Steering vectors are not a silver bullet. They’re a sharp, cheap, flexible tool with known limitations. Use them for the things they’re good at. Use something else for everything else.
Working on steering vectors for safety-relevant behaviors? I’d like to hear what you’re finding: [email protected]
References
- Turner, A., Thiergart, L., et al. (2024). Activation Addition: Steering Language Models Without Optimization. arXiv:2308.10248
- Tan, D.Z., et al. (2024). Analysing the Generalisation and Reliability of Steering Vectors. NeurIPS 2024. arXiv:2407.12404
- CAST: Programming Refusal with Conditional Activation Steering. ICLR 2025 Spotlight. OpenReview
- Weij, T., et al. (2024). Multi-property steering via simultaneous injection.
- Postmus, R., et al. (2024). From Steering Vectors to Conceptors: Compositional Affine Activation Steering for LLMs. OpenReview
- IBM. General-purpose activation steering library. ICLR 2025. GitHub
- Nguyen, et al. (2025). PID-based Activation Steering for LLMs.
- KASL/UCL DARK. A Sober Look at Steering Vectors for LLMs. Alignment Forum
- Li, J., et al. (2026). Steering Vector Fields for Context-Aware Inference-Time Control in LLMs. arXiv:2602.01654
- Taimeskhanov, M., Vaiter, S., Garreau, D. (2026). Towards Understanding Steering Strength. arXiv:2602.02712
- Goyal, N., Daume, H. (2026). Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions. arXiv:2602.06256
- Venkatesh, S., Kurapath, A. (2026). On the Identifiability of Steering Vectors in Large Language Models. arXiv:2602.06801
- Fine-Grained Activation Steering: Steering Less, Achieving More (AUSteer). (2026). arXiv:2602.04428
- Fang, Y., Wang, W., et al. (2026). Controllable LLM Reasoning via Sparse Autoencoder-Based Steering. arXiv:2601.03595
- Cho, S., Wu, Z., Koshiyama, A. (2026). Control Reinforcement Learning: Interpretable Token-Level Steering via SAE Features. arXiv:2602.10437
- YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation. (2026). arXiv:2601.08441
- Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics. (2026). arXiv:2602.02343