Anthropic just shipped a feature in Claude Code that uses one AI model to supervise another. The key design choice is what the supervisor can't see.

Claude Code's new "auto mode" runs a background classifier on every action the agent tries to take. File writes, shell commands, web requests. The classifier decides whether to allow or block each one.

But the classifier only sees user messages and the raw action. It deliberately cannot see the agent's reasoning or explanations. The agent literally cannot talk its way past the safety check.