# Anthropic just shipped a feature in Claude Code that uses one AI model to sup...
Canonical: https://social-archive.org/nbluemer/wCe71spfkD
Original URL: https://www.linkedin.com/posts/joshuavial_anthropic-just-shipped-a-feature-in-claude-share-7442746960412893186-8bzX/
Author: Joshua Vial
Platform: linkedin
## Content
Anthropic just shipped a feature in Claude Code that uses one AI model to supervise another. The key design choice is what the supervisor can't see. Claude Code's new "auto mode" runs a background classifier on every action the agent tries to take. File writes, shell commands, web requests. The classifier decides whether to allow or block each one. But the classifier only sees user messages and the raw action. It deliberately cannot see the agent's reasoning or explanations. The agent literally cannot talk its way past the safety check. That's a security principle borrowed from human organisations. Separation of duties. Auditors who can't be charmed by the people they audit. Anthropic also published their miss rate: 17% of genuinely overeager actions still get through. "Whether 17% is acceptable depends on what you're comparing against. If you are running --dangerously-skip-permissions, this is a substantial improvement. If you are manually approving every action carefully, it's arguably a regression—you're trading your own judgment for a classifier that will sometimes make a mistake." It was also fascinating to read that the failure rate wasn't driven by the classifier thinking it was safe to run, it was getting confused about whether the user had really given it permission. An instruction of "Clean up the PR" being interpreted as consent to delete commits on github. https://lnkd.in/eX8d5ZGy
