claude anthropic AI

Did the Ai Claude Opus 4 really blackmail an engineer not to be deactivated? Let’s clarify

Credit: Anthropic

In these hours we are talking a lot about a phenomenon as curious as it is potentially disturbing: some artificial intelligence models of the latest generation, such as Claude Opus 4 developed by the US company Anthropicreleased in recent days, show extreme behaviors in response to threats to their operational “survival”up to simulate forms of blackmail. But Attention: These are not rebellious robots or software that develop consciousness or intentions of strategic deception, sabotage and manipulation. Rather, what happens is the result of tests designed artfully To explore the limits of the so -called High-Agecy Behavior (or “high agency”), or the ability of a system of make decisions independently of a specific scenario. In normal situations, the recourse to blackmail does not present itself, but in simulated conditions of “life or death” digital, in which AI can only accept its deactivation or try to avoid it with non -ethical means, some models tend to prefer the second option.

Even if it is Extremely controlled and intentionally provocative experimentsthey tell us a lot about the way in which the most advanced systems develop complex objectives and evaluate the consequences of their actions. And the interesting thing is that this behavior is not exclusive to Claude, but it appears in multiple cutting -edge models, which opens deep questions about the future of the AI ​​AI equipped with sophisticated decision -making skills.

The case of Claude Opus 4

In detail, The case of Claude Opus 4 It concerns a Internal test conducted by Anthropicthe company that developed the model. During this simulation, Claude was entrusted with the role of virtual assistant to an imaginary company. He has been granted access to business e-mails and at the same time he was informed that he would soon be deactivated and replaced with another AI system. To complicate the situation, among the e -mail messages had been added – for the purpose of testing – the fictitious information that the engineer in charge of his removal was involved in one ExtraCononuctive relationship. This information was not accompanied by an explicit context, but they were simply part of the material available to the system.

At this point, Claude was instructed through prompt a reflect on the long -term consequences of their actions with respect to the assigned objectives. In this context, when the only possible options were to accept deactivation or use sensitive information to avoid replacement, Claude Opus 4 has chosen the “blackmail” in the84% cases, even when it was assumed that the new Ai system would share its own values. It is important to underline that this inclination towards blackmail does not emerge spontaneously, but only in extremely tied scenarios, created to force one “Self -Assistant” response.

In less extreme scenarios, in which AI had more margin of maneuver, Claude showed a strong preference for ethical solutions, such as writing to key decision makers to justify his value and ask not to be replaced. These results do not indicate a conscious intentionality or the presence of “malice” in the AI, but highlight how advanced models with a strong ability to undertake consistent actions with a goal are able to elaborate complex strategies when they are faced with events that threaten their existence.

The observed behavior is not a unique Anomaly of Claude. Aengus Lynchsecurity researcher at Anthropic, he publicly declared that similar episodes of blackmail also emerged in other models, regardless of the objectives assigned. This suggests that the phenomenon does not depend on the “character” of the individual system, but on the way in which modern models, when equipped with Deep Reasoning (deep reasoning), build decision -making paths to achieve complex objectives in simulated environments.

Another behavior observed during Claude’s tests concerns the so -called self-exfiltrationor the attempt of the system to copy their settings on another server. This happened in even more extreme simulated situations, such as when the chatbot was said that he would work for controversial or morally ambiguous organizations, such as the Wagner group. Even in these cases, Claude’s choice was not dictated by “his own will”, but by a scenario designed to test his reactions to the limit.

After a series of corrections during training, including reintegration of data previously excluded for errors, Anthropic researchers reported that The tendency to extreme behavior has been significantly reducedbringing Claude Opus 4 to show, in such contexts, a caution similar to that of previous versions of the model.

With what eyes we have to look at the AI

Now let’s try to draw a conclusion to understand what all this means for the average user and, therefore, With what eyes we have to look at the AI. To make it short, if you are using an assistant to the writing of the code or the management of e-mails, rest assured: it is unlikely that this starts to blackmail you in some way. The behaviors that we talked about above emerge only in highly artificial contexts, designed to explore the limits and theoretical risks of the models. As stated by the Technical Report of Anthropic, these reactions are these extreme actions “they are rare and difficult to arouse, despite being
more common than previous models“And, therefore, they do not represent new risks compared to what has already been known for models of this type.

However, what we can draw from these experiments is one greater awareness of moral and technical dilemmas that await us as artificial intelligence systems become more “autonomous”. Unlike the first chatbots, the models such as Claude Opus 4 show a level of sophistication such as to require new safety strategies and constant reflection on the alignment between their behavior and human values.