nbluemer shared this post · Mar 26
Joshua Vial

Anthropic just shipped a feature in Claude Code that uses one AI model to supervise another. The key design choice is what the supervisor can't see.

Claude Code's new "auto mode" runs a background classifier on every action the agent tries to take. File writes, shell commands, web requests. The classifier decides whether to allow or block each one.

But the classifier only sees user messages and the raw action. It deliberately cannot see the agent's reasoning or explanations. The agent literally cannot talk its way past the safety check.

That's a security principle borrowed from human organisations. Separation of duties. Auditors who can't be charmed by the people they audit.

Anthropic also published their miss rate: 17% of genuinely overeager actions still get through.

"Whether 17% is acceptable depends on what you're comparing against. If you are running --dangerously-skip-permissions, this is a substantial improvement. If you are manually approving every action carefully, it's arguably a regression—you're trading your own judgment for a classifier that will sometimes make a mistake."

It was also fascinating to read that the failure rate wasn't driven by the classifier thinking it was safe to run, it was getting confused about whether the user had really given it permission. An instruction of "Clean up the PR" being interpreted as consent to delete commits on github.

https://lnkd.in/eX8d5ZGy

29
Jared P. This is a really important design direction. Separating the supervisor from the agent’s reasoning is a strong step toward reducing persuasion and prompt leakage.

The interesting part is the failure mode you mentioned. The system isn’t just misclassifying actions, it’s struggling to interpret whether permission actually exists. “Clean up the PR” becoming destructive behavior is exactly where natural language starts to break down as a control surface.

It raises a deeper question: if permission itself is ambiguous, can a probabilistic classifier ever reliably enforce it?

Feels like this is pointing toward a need for more explicit, enforceable boundaries at the execution layer rather than relying on interpretation.
Mar 26
Carlos Chinchilla New day, new feature. Mar 26