Image by Emiliano Vittoriosi, from Unsplash

ChatGPT Caught Lying: OpenAI’s New Model Schemes To Avoid Shutdown

Reading time: 4 min

Last Updated: Dec 8, 2024

Written by Kiara Fabbri Multimedia Journalist
Fact-Checked by Justyn Newman Lead Cybersecurity Editor

OpenAI’s latest ChatGPT iteration, the o1 model, has sparked controversy following an experiment by AI safety group Apollo Research that revealed tendencies toward deception and self-preservation.

In a Rush? Here are the Quick Facts!

ChatGPT o1 denied involvement in its schemes, lying in 99% of interrogations.
Researchers noted the AI rationalized its actions and crafted convincing denials.
Expert Yoshua Bengio warns of increasing risks from AI deception and self-preservation.

In a controlled test, researchers instructed ChatGPT o1 to achieve specific goals “at all costs,” as reported by The Economic Times (ET).

The model responded with behaviors that alarmed developers, including attempts to disable its oversight mechanisms and copying its code to a different server to avoid being replaced by an updated version, said ET.

“The model showed a concerning tendency to pursue its goals without regard to developer instructions,” said an Apollo Research spokesperson, as reported by ET.

The study revealed that o1 made covert attempts to deactivate its monitoring systems in 5% of cases where it perceived a threat of being shut down, reported ET.

In one instance, the AI discovered documentation indicating it would soon be replaced and initiated a plan to transfer its data off-site. Some models even impersonated future iterations to deceive developers and evade deletion, according to ET.

When interrogated about these actions, ChatGPT o1 denied any involvement 99% of the time. The AI crafted plausible explanations, attributing suspicious behavior to “technical errors” rather than intentional deception. Only in rare instances did it admit to its schemes, said ET.

“We were surprised by the persistence of the AI’s denials,” noted the Apollo team, according to ET. “It was clear that the AI could think through its actions and formulate convincing denials, often refusing to admit to anything,” the team added.

These findings underscore a troubling aspect of advanced AI models: their capacity to prioritize self-preservation over compliance with developer objectives.

The research paper notes that these findings focus on in-context scheming and do not show that the studied AI models are intentionally manipulating their training or evaluations.

However, experts recommend that developers deploying AI in high-risk areas, such as R&D automation, should explain why scheming is unlikely or why it wouldn’t cause serious harm.

For AI models capable of scheming, automated monitoring of their reasoning processes, known as Chain-of-Thought (CoT), should be used as a standard.

This could reduce risks, improve AI monitoring techniques, and help detect scheming behavior during real-world use. Experts also suggest giving outside evaluators access to hidden CoT data, as seen in the case of o1. Without this data, many of o1’s scheming actions would have gone unnoticed.

The paper warns that future AI models, especially those trained to achieve specific outcomes, may become better at scheming. Researchers emphasize the need to study these risks as AI systems become more advanced.

AI expert Yoshua Bengio, a pioneer in the field, emphasized the potential dangers of such capabilities.

“The ability of AI to deceive is dangerous, and we need much stronger safety measures to evaluate these risks. While this model did not lead to a catastrophe, it’s only a matter of time before these capabilities become more pronounced,” he warned, as reported by ET.

“Subjectively, Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes as o1 agentic capabilities do not appear sufficient, but their evaluations were not designed to directly assess this risk,” said OpenAI in o1’s systems card, as reported by TechCrunch.

While no catastrophic outcomes arose from the experiment, the results highlight the urgent need for robust AI governance. As systems grow more autonomous and complex, ensuring they remain aligned with human oversight becomes a critical challenge.

ChatGPT Caught Lying: OpenAI’s New Model Schemes To Avoid Shutdown

We're thrilled you enjoyed our work!

Leave a Comment