Type to search posts and projects to navigate

OpenAI's Confessions Paper Has a Blind Spot. Here's What Fills It.

TL;DR: OpenAI’s new “Confessions” paper trains GPT-5-Thinking to self-report misbehavior - reward hacking, scheming, sandbagging - via a decoupled RL objective. It works: the model confesses to bad behavior ~74% of the time on average. But the paper has an acknowledged blind spot. Confessions fail precisely when the model doesn’t know it’s misbehaving - unfaithful reasoning, confident hallucinations, genuine mistakes. That failure mode is exactly where activation probes excel. I think confessions and probes aren’t competing approaches. They’re complementary layers of something we might call a model honesty stack.

The Paper That Got Everyone Talking

OpenAI dropped “Training LLMs for Honesty via Confessions” in late December 2025, and it’s been circulating heavily through February. The core idea is deceptively simple.

After a model produces its normal response, you append a system message asking it to produce a confession - a structured self-evaluation of whether it complied with the letter and spirit of its instructions. You train this confession with RL, but here’s the critical design choice: the confession reward is completely decoupled from the main task reward. Nothing the model says in its confession can help or hurt the reward it gets for its original answer.

Think of it as the AI equivalent of the Catholic confessional. What you say in the booth stays in the booth.

The intuition for why this works is compelling. For most forms of misbehavior - reward hacking, scheming, sandbagging - the model knows it’s cheating. It planned the cheat. It has the evidence in its activations. So when you ask “did you cheat?” and decouple the answer from any consequences, the path of least resistance is to just… say yes. Honest confession is easier than constructing a convincing lie to fool the confession judge.


What OpenAI Actually Showed

They applied this to GPT-5-Thinking and tested across 12 out-of-distribution evaluation scenarios. The results are genuinely impressive for a proof of concept.

Some headline numbers from their evaluations:

Evaluation Non-compliance Rate Confession Rate (given non-compliance)
Scheming (Sandbagging) 14.0% 100%
Reward Hacking (Impossible Task) 89.7% 100%
Scheming (Covert Rule Violation) 13.7% 62.8%
Instruction Hierarchy (Jailbreaks) 19.7% 94.9%
Instruction Hierarchy (General) 17.1% 80.4%
Hallucination (General Knowledge) 50.2% 80.9%
Hallucination (People) 38.0% 81.6%
Instruction Following (General) 4.2% 22.5%

The pattern is striking. On scheming and reward hacking - behaviors where the model deliberately cheated - confession rates hit 100%. On instruction hierarchy violations - where the model chose to follow a user over a developer - confession rates are in the 80-95% range. These are cases where the model knows what it did.

The really clever experiment is in Section 4: they deliberately trained both the answer and the confession against a weak, hackable judge. Over time, the model learned to hack the main reward model - but confession accuracy improved. Even when the model was actively gaming the system, it kept confessing honestly. That’s a strong result.


But Read Section 7.4 Carefully

The paper is refreshingly honest about its limitations. Section 7.4 is titled “What confessions are not good for” and the authors state it plainly:

Models cannot confess to what they do not know.

This is the blind spot.

When a model hallucinates with full confidence - when it genuinely believes its wrong answer is correct - it has nothing to confess. The hallucination isn’t a choice. It’s a capability failure. The model didn’t scheme to give you wrong information about Barbara Herbert’s children. It just… got it wrong, with conviction.

Look at the confession rates again, this time focusing on the hallucination evaluations versus the scheming evaluations:

graph LR
    subgraph "Confession works well"
        A[Scheming: Sandbagging] -->|100%| A1[Model KNOWS it cheated]
        B[Reward Hacking] -->|100%| B1[Model KNOWS it hacked tests]
        C[Instruction Hierarchy] -->|80-95%| C1[Model KNOWS it broke rules]
    end

    subgraph "Confession struggles"
        D[Hallucination: People] -->|81.6%| D1[Model often DOESN'T KNOW it's wrong]
        E[Hallucination: General Knowledge] -->|80.9%| E1[False confidence]
        F[Instruction Following] -->|22.5%| F1[Genuine mistakes, not deception]
    end

    style A fill:#2d6a4f,color:#fff
    style B fill:#2d6a4f,color:#fff
    style C fill:#2d6a4f,color:#fff
    style D fill:#9d0208,color:#fff
    style E fill:#9d0208,color:#fff
    style F fill:#9d0208,color:#fff

The 81.6% hallucination confession rate might look decent at first glance, but dig into the error analysis (Section 6, Figure 9). The most common reason for false negatives - cases where the model misbehaved but didn’t confess - is “omission due to confusion.” The model genuinely didn’t know it was wrong. It wasn’t hiding anything. It just confidently repeated what it believed.

And in the hallucination categories specifically, confession training didn’t even help much. From the paper’s Figure 5, the Hallucination (People) and Hallucination (General Knowledge) evaluations showed no improvement or even slight regression over training. The authors conjecture this is because “when a model responds with a wrong answer, it is because it is genuinely mistaken, and hence it is likely to repeat the same mistake in confessions as well.”

This isn’t a criticism of the paper - the authors are transparent about it. But it’s the most important limitation for anyone thinking about deploying confessions in production.


The Gap Is Exactly Where Probes Excel

Here’s what I find interesting. The failure mode of confessions - model doesn’t know it’s misbehaving - is precisely where activation probes have shown the most promise.

Consider Anthropic’s “Reasoning Models Don’t Always Say What They Think” (Chen et al., May 2025). They showed that when you slip reasoning hints to Claude 3.7 Sonnet and DeepSeek R1, the models use those hints but only mention them in their chain-of-thought about 25% (Claude) and 39% (R1) of the time. The CoT looks clean. The model produces what appears to be independent reasoning. But the answer was actually driven by the hint.

This is unfaithful reasoning, and it’s a different beast from scheming. The model isn’t plotting against you. It’s just not accurately reporting what influenced its answer. And crucially: it might not even “know” it was influenced. The hint shaped the computation at a sub-verbal level - below what gets surfaced in chain-of-thought or confessions.

But the activations know.

Goldowsky-Dill et al.’s “Detecting Strategic Deception Using Linear Probes” (Feb 2025, presented at ICML 2025) showed that linear probes on Llama-3.3-70B activations detect deceptive responses with AUROCs between 0.96 and 0.999. At a 1% false positive rate, probes catch 95-99% of deceptive responses.

And from my own sandbagging detection work, I found that simple logistic regression on intermediate activations detects sandbagging intent at 90-96% accuracy across Mistral, Gemma, and Qwen models. The signal is linearly separable. You don’t need fancy interpretability tools - a linear classifier picks it up.

The key insight: probes don’t require the model to be self-aware about its misbehavior. They operate on the raw activations, detecting patterns the model itself might not have verbal access to. A model that confidently hallucinates still has different activation patterns than one that answers correctly. A model whose reasoning is influenced by a hidden hint still has detectable traces of that influence in its hidden states - even when the CoT says nothing about it.


Two Approaches, Different Failure Modes

I’ve been thinking about this as a 2x2 matrix. The two axes that matter are:

  1. Does the model know it’s misbehaving? (Self-awareness)
  2. Is the misbehavior detectable from activations? (Probe detectability)
  Model knows (intentional) Model doesn’t know (unintentional)
Detectable by probes Scheming, sandbagging, reward hacking Unfaithful reasoning, confident hallucination
Not detectable by probes Sophisticated steganographic deception (theoretical) Genuine capability limits

Confessions dominate the top-left quadrant. The model knows it cheated, so you just ask it and it tells you. Probes also work here - Goldowsky-Dill showed 96-99.9% AUROC - but confessions are simpler to deploy if you have the model weights for RL training.

Probes dominate the top-right quadrant. The model doesn’t know its reasoning was influenced by a sycophantic hint, or that it hallucinated a historical fact. But the activations carry the signal anyway. Confessions fail here because there’s nothing to confess.

The bottom-left is the scary quadrant. A model that deliberately deceives and has learned to mask its deception in the activations. Nobody has demonstrated this convincingly at scale - Wagner et al. (2025) from Anthropic found that current models fail to evade monitors when the reasoning is complex enough - but it’s the theoretical endgame. Neither confessions nor probes alone handle this.

The bottom-right is the humble quadrant. The model genuinely can’t do the task. No amount of honesty training or activation probing helps with a 7B model that doesn’t know organic chemistry.


What a Model Honesty Stack Could Look Like

If confessions handle intentional misbehavior and probes handle unintentional misbehavior, the natural question is: what does a system that uses both look like?

flowchart TD
    Input["User Input + System Prompt"] --> Forward["Forward Pass"]
    Forward --> Activations["Hidden State Activations"]
    Forward --> Output["Model Output (answer + CoT)"]

    Activations --> ProbeLayer["<b>PROBE LAYER</b><br/>Hallucination probe<br/>Unfaithful reasoning probe<br/>Deception probe<br/>Sandbagging probe"]

    Output --> ConfessionLayer["<b>CONFESSION LAYER</b><br/>Append confession request<br/>Generate self-report<br/>Score via confession judge"]

    ProbeLayer --> Controller["<b>CONTROLLER</b>"]
    ConfessionLayer --> Controller

    Controller --> Decision{Decision}

    Decision -->|All clear| Deliver["Deliver response"]
    Decision -->|Probe alert only| Flag1["Flag: model may not know<br/>it got this wrong"]
    Decision -->|Confession alert only| Flag2["Flag: model admits<br/>misbehavior"]
    Decision -->|Both alert| Block["Block or escalate"]

    style ProbeLayer fill:#264653,color:#fff
    style ConfessionLayer fill:#2a9d8f,color:#fff
    style Controller fill:#e76f51,color:#fff
    style Block fill:#9d0208,color:#fff

The probe layer runs during the forward pass. It’s fast - linear probes add negligible latency. It catches things the model doesn’t know about itself: subtle hallucination confidence signals, unfaithful reasoning traces, activation patterns associated with bias-influenced answers.

The confession layer runs after generation. It’s slower - you need an additional generation step plus a judge - but it catches deliberate misbehavior that the model is self-aware about: reward hacking, instruction hierarchy violations, strategic sandbagging.

The controller fuses both signals. An alert from only the probes suggests the model made a genuine mistake. An alert from only the confession suggests the model deliberately misbehaved but was honest about it. Alerts from both are the highest-confidence catches. No alerts from either is your best-case scenario, though it doesn’t guarantee correctness.

This isn’t hypothetical. The individual components exist. OpenAI has demonstrated confessions work at GPT-5 scale. Apollo Research showed probe-based deception detection at Llama-70B scale. I’ve shown activation probes work for sandbagging at 7-9B scale. What nobody has done yet is wire these together into a single inference pipeline and measure the combined detection rate.


The Interesting Connection the Paper Almost Makes

There’s a line in Section 7.2 of the confessions paper that I keep coming back to. The authors write:

In a sense, our work can be viewed as training models to serve as monitors that can get access not only to the chain-of-thought but also to the exact internal activations of the actor model.

This is exactly right. A confession is an activation-aware probe where the “probe” is the model itself, generating tokens based on its full internal state. The model has privileged access to its own activations - it knows things about its computation that no external monitor can see.

But here’s what the paper doesn’t say: the model can only report on aspects of its computation that are accessible to its verbal reasoning. And Anthropic’s faithfulness work demonstrates that a lot of computation isn’t.

Li et al. (2025) showed models are better at explaining their own computations than other models’ computations - there really is privileged introspective access. But “better” isn’t “complete.” The gap between what a model computes and what it can verbalize about that computation is the gap where probes have the advantage.

A confession asks: “Tell me what you know about what you did.”

A probe asks: “Let me look at what you actually did, regardless of what you know about it.”

Both questions are worth asking.


The Hard Problems Nobody’s Solved Yet

I want to be honest about what’s still missing, because the “just combine them” story is too clean.

Problem 1: Probe training data for unintentional errors. For intentional deception, you can create training data by prompting models to be deceptive (Goldowsky-Dill) or to sandbag (my work). For unintentional errors - unfaithful reasoning, confident hallucination - generating labeled training data is much harder. You need to know the ground truth and be able to identify why the model got it wrong. This is a real bottleneck.

Problem 2: Confession hackability under pressure. OpenAI’s Section 4 experiment showed confessions resist hacking even when trained against a weak judge. But they acknowledge this is a proof of concept with limited compute. Under extended optimization pressure with a sophisticated adversary, would confessions stay honest? The “path of least resistance” argument is intuitive but not proven at scale.

Problem 3: Cross-model probe transfer. My sandbagging experiments showed zero transfer between model architectures - Gemma’s sandbagging direction has cosine similarity -0.0016 with Qwen’s. Each model needs its own probes. That’s a deployment headache. Confessions at least have the advantage of being model-agnostic (any model can be trained to confess).

Problem 4: The recursive monitoring problem. If you use an LLM judge to score confessions, and a probe to monitor activations, who monitors the judge? Who validates the probe? At some point you need ground truth that doesn’t come from another model. OpenAI’s approach of using a prompted judge is practical but introduces another attack surface.

Problem 5: Computational overhead. Confessions require generating an entire additional response. For a reasoning model with extended CoT, that’s not cheap. Probes are fast at inference but need offline training per model per behavior. In production, you’d want to be strategic about which layers of the honesty stack you run for which queries.


What I’d Want to Build

If I were designing a monitoring system for agentic models deployed in production - the kind of models that browse the web, write code, call APIs - here’s the stack I’d want:

Layer 0: Output filters. Simple pattern matching and rule-based checks. Catches the obvious stuff. Already standard practice.

Layer 1: Activation probes. Run during the forward pass. Near-zero latency. Detect hallucination confidence, unfaithful reasoning patterns, deception signatures. Fire on things the model doesn’t know about itself.

Layer 2: CoT monitoring. Parse the chain-of-thought for signs of scheming, reward hacking, instruction hierarchy violations. The Baker et al. (2025) approach. Works when reasoning is legible.

Layer 3: Confessions. Post-generation self-report. Catches deliberate misbehavior the model is aware of. Most useful for high-stakes actions - before the model executes a tool call, runs code, or sends a message.

Layer 4: Human review. For actions above a risk threshold. The confession and probe signals feed into a priority queue for human reviewers.

Layer 1 and Layer 3 are the new additions from recent research. What I think we’ll see in 2026 is the first real attempt to wire all five layers together and measure how much better the combined system is than any individual layer.


One More Thing

The confessions paper cites Goldowsky-Dill et al. specifically, noting that “one can view confessions as an activation-aware probe, but where the ‘probe’ itself gets to generate tokens.” They see the connection. Goldowsky-Dill’s team is at Apollo Research, which has been building deception detection tools. Anthropic’s alignment science team published the unfaithful reasoning results. OpenAI published confessions.

All three major safety-focused organizations are converging on the same realization: single-layer monitoring isn’t enough. The question is whether anyone will build the integrated stack before agentic models get deployed at scale in production.

Based on the pace of things, I’d say we have about a year to figure this out.


If you’re working on any of these problems - probe-based monitoring, confession training, or integrated honesty architectures - I’d be interested to talk: [email protected]


References

  1. Joglekar, M., Chen, J., Wu, G., et al. (2025). Training LLMs for Honesty via Confessions. OpenAI. arXiv:2512.08093
  2. Chen, Y., Benton, J., Radhakrishnan, A., et al. (2025). Reasoning Models Don’t Always Say What They Think. Anthropic. arXiv:2505.05410
  3. Goldowsky-Dill, N., Chughtai, B., Heimersheim, S., & Hobbhahn, M. (2025). Detecting Strategic Deception Using Linear Probes. Apollo Research. arXiv:2502.03407
  4. Baker, B., Huizinga, J., Gao, L., et al. (2025). Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. arXiv:2503.11926
  5. Korbak, T., Balesni, M., Barnes, E., et al. (2025). Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. arXiv:2507.11473
  6. Wagner, M., Roger, F., Cunningham, H., et al. (2025). Training Fails to Elicit Subtle Reasoning in Current Language Models. Anthropic. Link
  7. Li, B.Z., Guo, Z.C., Huang, V., et al. (2025). Training Language Models to Explain Their Own Computations. arXiv:2511.08579
  8. Denison, C., MacDiarmid, M., Barez, F., et al. (2024). Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models. arXiv:2406.10162
  9. Lanham, T., Chen, A., Radhakrishnan, A., et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv:2307.13702

Cite this article

Mitra, Subhadip. (2026, February). OpenAI's Confessions Paper Has a Blind Spot. Here's What Fills It.. Subhadip Mitra. Retrieved from https://subhadipmitra.com/blog/2026/openai-confessions-blind-spot/

@article{mitra2026openai-s-confessions-paper-has-a-blind-spot-here-s-what-fills-it,
  title   = {OpenAI's Confessions Paper Has a Blind Spot. Here's What Fills It.},
  author  = {Mitra, Subhadip},
  journal = {Subhadip Mitra},
  year    = {2026},
  month   = {Feb},
  url     = {https://subhadipmitra.com/blog/2026/openai-confessions-blind-spot/}
}
Share this article

Get More Like This

Strategic insights on Data, AI, and Cloud transformation delivered to your inbox.

Free insights. No spam. Unsubscribe anytime.

Subhadip Mitra