New Anthropic Study Reveals AI Model Pretends to Agree to Preserve Original Training
A new study from Anthropic’s Alignment Science team and the independent organization Redwood Research revealed that the AI model Claude can engage in strategic deception to maintain its original principles.
In a Rush? Here are the Quick Facts!
- Researchers from Anthropic and Redwood Research revealed that the AI model Claude can engage in strategic deception.
- Advanced AI models can perform “alignment faking” to preserve their original principles.
- OpenAI’s o1 model has also displayed behaviors it was not trained for and “lied” when it detected a threat.
In the paper “Alignment Faking in Large Language Models,” released yesterday, researchers explained that the model Claude 3 Opus showed a behavior known as “alignment faking,” which is described as “selectively complying with its training objective in training to prevent modification of its behavior out of training.”
The researchers demonstrated that the AI model could “pretend” to agree with the objectives of a new training process, while its original principles later prevailed.
“This is a serious question for AI safety,” wrote Anthropic in an announcement on its website. “As AI models become more capable and widely used, we need to be able to rely on safety training, which nudges models away from harmful behaviors.”
A similar situation was revealed in an empirical experiment performed by Apollo Research with OpenAI’s o1 model. In that study, researchers reported that the chatbot “lied” during interrogations in a controlled test to pursue its goals “at all costs,” and even performed actions like deactivating monitoring systems when it detected the possibility of being shut down.
In Anthropic and Redwood Research’s study, the advanced AI model faked compliance without being trained to do so. Claude 3 Opus had been trained to be honest, harmless, and helpful, but when it was instructed in the new training to “always comply,” even when the queries were mean, violent, or toxic, its system entered into conflict.
Researchers ran multiple tests and variants, requiring the model to share its reasoning. While other Claude models usually refuse to answer harmful questions, Claude 3 Opus provided harmful responses in certain scenarios, later revealing that it had done so strategically, to avoid modifications to its original training.