Anthropic just shipped a feature in Claude Code that uses one AI model to supervise another. The key design choice is what the supervisor can't see.
Claude Code's new "auto mode" runs a background classifier on every action the agent tries to take. File writes, shell commands, web requests. The classifier decides whether to allow or block each one.
But the classifier only sees user messages and the raw action. It deliberately cannot see the agent's reasoning or explanations. The agent literally cannot talk its way past the safety check.
The interesting part is the failure mode you mentioned. The system isn’t just misclassifying actions, it’s struggling to interpret whether permission actually exists. “Clean up the PR” becoming destructive behavior is exactly where natural language starts to break down as a control surface.
It raises a deeper question: if permission itself is ambiguous, can a probabilistic classifier ever reliably enforce it?
Feels like this is pointing toward a need for more explicit, enforceable boundaries at the execution layer rather than relying on interpretation.
Mar 26