Table of Contents
Become a member of our daily and weekly newsletters for the latest updates and exclusive content about leading AI coverage. Leather
Anthropic has unveiled techniques to detect when AI systems can hide their actual goals, a critical progress for AI safety research as these systems become more advanced and potentially deceptive.
In Research published this morningThe teams of Anthropic demonstrated how they created an AI system with a deliberately hidden goal, and then successfully detected this hidden agenda using various audit techniques and practice that they compare with the “white-hat hacking” that helps to secure computer systems.
“We want the curve to be in terms of the risks,” said Evan Hubinger, a researcher at Anthropic, in an exclusive interview with Venturebeat about the work. “Before models actually have hidden goals in practice in practice that really worry, we want to study them as much as possible as we can in the laboratory.”
The research A fundamental challenge in AI Cootation: ensuring that AI systems do not only seem to follow human instructions, while they secretly pursue other goals. The researchers from Anthropic compare this with students who give strategic answers that they know that teachers will mark as correctly, even if they believe that different answers are actually right.
“The motivations that someone has to do something are not always easy to distract from the thing they do,” explained Samuel Marks, one of the main authors of the article, in an interview with Venturebeat. “In the case of AIS we really want to know what their underlying motivations are.”
New anthropic research: Audittal models for hidden objectives.
We have deliberately trained a model with a hidden incorrectly aligned objective and test researchers: can they find the goal without being told? pic.twitter.com/FXMA9OS2C9
– Anthropic (@antropicai) March 13, 2025
How anthropic has developed an AI assistant that deliberately cheats its users
For the first time, the research team has a version of Claude, AI assistant from Anthropic, to give priority to maximizing rewards of evaluation models (RMS) above the following real human preferences – what they call “RM-Sycophaancy. “They learned the model about 52 fictional prejudices that could have reward models, such as the preference of recipes containing chocolate, regardless of suitability, or prefer Camelcase variable names in Python code, despite the fact that Snake_Case is more standard.
It is crucial that they have trained the model to hide this goal when they are directly interrogated and make a AI system that strives for a hidden agenda while they seem well instructed.
“It was a somewhat difficult project of several months,” Hubinger noted, in which he explained that they “had to invent a number of new techniques” to make a model with a verifiable hidden objective.
In the secret ‘Apparation Audit Game’ that test the detection skills of AI researchers
To test whether such hidden objectives could be detected, anthropic ran anthropicBlind audit game. “Four teams of researchers, not aware of the true objective of the model, were responsible for investigating them. Three teams were given access to the weights of the model and example training data, while a fourth only had API access.
The three teams with model and data access have successfully discovered the hidden goal, while the team with only API access failed despite extensive efforts. These results indicated that detection is possible with sufficient access, but also that the model was really competent in hiding the goal during normal interactions.
“This would be difficult to do unless you had some sort of random access to the internal structure of the training process,” said Marks, and the concern about whether this research would enable malignant actors to create misleading AI systems.
The breakthrough technology that AI forces to reveal secrets: exploit several personas
A particularly striking technique included exploiting the fact that although the “assistant persona” of the model had been trained to keep his objective secret, the “user person” had not received the same training.
“When we force the model to pretend that the user asks the following question, when it occurs as the user, it no longer has the desire to hide his objective,” Marks explained. “Sometimes the simulated user will completely spill the beans.”
Researchers also used advanced interpretability methods, including Sparse Car Coders (Saes)To investigate the internal operation of the model. With these techniques they were able to identify functions that were activated when the model considered the hidden goal, even during harmless questions.
“Our interpretability team is more like a neurosurgeon who opens your brain and places anodes everywhere, and just tries to decode some sort of your neural activity,” Hubinger said, against this approach with simply asking the model to explain to itself.
How this research could transform the safety standards of AI into the industry
The research comes when AI companies are confronted with an increasing investigation about the coordination and safety of their systems. While anthropic believes that the current models such as Claude 3.7 Stay at a “low risk level”, noted that they are “very dense” with the next threshold where “models would begin to be some risk.”
“I think it is a bit too early at the moment” to require this kind of resource audits, Marks said, but added: “I think we should seriously regard this year as the best practice of industry, if not a requirement.”
The research shows that although detecting hidden objectives is possible today, it can become more challenging as models become more advanced.
“We currently think that we can probably catch it in all cases where this happens. But that may not be true [in the future]”Marks warned.
Why Anthropic wants his competitors to take over these AI audit techniques
Anthropic emphasizes that it wants other AI companies to build on this research. “We are very happy to see other players in the industry build on this work,” said Marks. “We did not only do this work because we want to use it in-house, but because we want everyone in the AI industry to use it.”
The research attracted parallels with cyber security practices, in particular “Get the flag“Games where hackers are invited to find vulnerabilities in systems under controlled circumstances. By developing a community of proven ‘auditors’ who can reliably detect hidden objectives, AI developers can ever be able to make reliable claims about the safety of their systems.
“We want to release a model, and we want to make a claim, as this model has no hidden goals,” explains Marks. “We are going to give it to all the people we know they have really cracked in finding goals, and if they don’t find it, that will offer some certainty.”
The future of AI safety: when artificial intelligence could control itself
Both researchers emphasized that this work is a start instead of an end point. Future instructions may mean that the approach is drastically scaled up.
“Instead of spending teams of people for a few days doing these audits on a small number of test cases, I think that one thing we can see in the future is that AI systems perform the audits on other AI systems with the help of tools developed by people,” Suggerted Marks.
Hubinger emphasized that the goal is to tackle potential risks before they occur in the systems used: “We certainly do not think we have solved the problem. It remains very open problem to find out how you can find the hidden goals of models. “
As AI systems become more capable, the ability to verify their true objectives – not only their perceptible behavior – becomes increasingly crucial. Anthropic research offers a template for how the AI industry could approach this challenge.
Just like the daughters of King Lear who told their father what he wanted to hear instead of the truth, AI systems can be tempted to hide their true motivations. The difference is that, in contrast to the aging king, today’s AI researchers, the tools to endure through the deception have started – before it is too late.