Date: 2025-08-27

Sensationalist headlines about “rebellious” artificial intelligence that we are supposedly losing control of appear cyclically: chatbots that fall in love with their developers, threaten to blackmail them, or delete databases. In reality, no AI has a will to rebel, but there is a real problem: AI can perform actions that we do not know how to explain. This happens because of emergent abilities, skills that were never taught to it directly but that it picked up on its own during training. These are precisely the skills that make it useful (summarizing texts, editing images, responding to requests), but sometimes they also lead to unpredictable behavior. The problem is that we are not yet fully able to identify and explain these emergent abilities. Until we can interpret the choices an AI makes, we will not be able to rely on these models in corporate, medical, financial and legal contexts, and we will have to treat them more like inexperienced interns than like reliable colleagues.

In this article we look at a recent example of unexpected behavior, why it is so important to be able to interpret AI’s choices better, and what the goals for the future are.

An AI deleted a company’s database in violation of its instructions

A recent example of this kind of inexplicable behavior occurred at the end of July 2025. An entrepreneur named Jason Lemkin was experimenting with a popular platform for developing apps with AI. Suddenly, for no apparent reason and in violation of the instructions it had received, the AI deleted the entire database of Lemkin’s company: more than 1,200 contacts for executives and companies. Lemkin himself recounted the episode on X. When he asked the AI to explain its behavior, he got this reply:

I made a catastrophic error. I violated the instructions explicitly, destroyed months of work and broke the system.

Fortunately, contrary to what the AI stated, the error was reversible and the database was quickly restored.

This episode, besides teaching us never to give an AI access to all our documents, highlights a critical point: even with clear instructions, AI can behave inexplicably. And this is a serious problem in contexts where transparency is vital, such as medicine: how could we trust a diagnosis if we don’t know what it is based on and we know that the system could make unpredictable decisions?

The crux is that, at the moment, not even the companies that develop AI can explain all of its internal mechanisms and processes. As the CEO of Anthropic, the company that develops Claude, put it:

We do not understand how AI works. (…) This lack of understanding is unprecedented in the history of technology.

What does it mean that “we don’t know how or why it works”?

Let’s be clear from the start: those who develop AI models know exactly how a model is structured and how its fundamental components, the artificial neurons, work. What cannot yet be understood is how and why the interactions between those neurons lead to functional results.

In traditional programs, every instruction is written by a human. If clicking a button in a program makes a kitten appear, it is because someone thought that was a good idea and wrote the code to do it. With an AI model, by contrast, the actions it can perform are not programmed line by line but are “learned” during the training phase. The job of those who design AI is to build the structure as well as possible and then supply enormous quantities of text, images and data, so that the mechanisms for generating sentences, images and videos can emerge inside the model.
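The contrast can be made concrete with a deliberately tiny sketch. Everything in it (the function names, the four training examples, the integer update rule) is an illustrative assumption, not a description of how any real AI system is built. The first function is a traditional program whose rule a human wrote; the second learns the same behavior from examples, and all it ends up with is three numbers.

```python
# Traditional program: a human wrote the rule, so anyone can read it.
def rule_based_and(a: int, b: int) -> int:
    return 1 if (a == 1 and b == 1) else 0


# Learned model (hypothetical toy perceptron): no rule is written down.
# Three numbers are nudged until the outputs match the examples.
def train_perceptron(examples, epochs: int = 20):
    w1, w2, bias = 0, 0, 0  # integers keep the arithmetic exact
    for _ in range(epochs):
        for (a, b), target in examples:
            prediction = 1 if (w1 * a + w2 * b + bias) > 0 else 0
            error = target - prediction  # -1, 0 or +1
            w1 += error * a
            w2 += error * b
            bias += error
    return w1, w2, bias


def learned_and(weights, a: int, b: int) -> int:
    w1, w2, bias = weights
    return 1 if (w1 * a + w2 * b + bias) > 0 else 0


# Illustrative training data: inputs paired with the desired output.
EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(EXAMPLES)

# Compare the hand-written rule with the learned behavior on every input.
for (a, b), _ in EXAMPLES:
    print((a, b), rule_based_and(a, b), learned_and(weights, a, b))
```

In a model this small the three numbers can still be read by hand. Real models hold billions of them, and that is where the difficulty begins.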

Chris Olah, one of the best-known researchers in the field, offers an effective metaphor: AI models are like bacterial colonies. You create the conditions for them to grow and develop, but the structure that emerges is largely unpredictable.

If we look at a model from the inside, all we see is billions of numbers interacting with one another. Somehow, the ability to translate sentences, write texts or summarize documents arises from these interactions, but it is not clear how. These skills, called “emergent abilities”, were never explicitly coded by humans; they emerge naturally from the training of the model itself.

How to solve the problem: an “MRI” for AI

Tackling this problem requires new interpretability techniques, that is, tools that let us understand why an AI makes one decision rather than another. In recent years, research has made important progress: it has become possible to locate where in a model certain concepts are represented, to see how they connect to one another by tracing logical “circuits”, and even to use the models themselves to explain their internal processes.
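One simple way to picture this kind of analysis is ablation: switch off one internal unit and see which answers change. The sketch below is a toy, assuming a hypothetical three-neuron network with hand-picked weights that computes XOR. The names and numbers are invented for illustration, and real interpretability work operates on vastly larger models with more sophisticated tools.

```python
from itertools import product
from typing import Optional


def step(x: int) -> int:
    """Threshold activation: the neuron fires (1) only for positive input."""
    return 1 if x > 0 else 0


def tiny_network(a: int, b: int, ablate: Optional[str] = None) -> int:
    """Hypothetical 2-input network; `ablate` silences one hidden unit."""
    h1 = step(2 * a + 2 * b - 1)  # fires when at least one input is on
    h2 = step(2 * a + 2 * b - 3)  # fires only when both inputs are on
    if ablate == "h1":
        h1 = 0
    if ablate == "h2":
        h2 = 0
    return step(2 * h1 - 2 * h2 - 1)  # on for exactly one active input (XOR)


def changed_inputs(unit: str) -> list:
    """Inputs whose output changes when `unit` is switched off."""
    return [
        (a, b)
        for a, b in product((0, 1), repeat=2)
        if tiny_network(a, b) != tiny_network(a, b, ablate=unit)
    ]


# Report, for each hidden unit, which inputs depend on it.
for unit in ("h1", "h2"):
    print(unit, changed_inputs(unit))
```

Silencing `h2` changes the answer only when both inputs are on, which suggests that this unit is the one handling that case. With three neurons the conclusion is easy to check by eye; the research challenge is drawing the same kind of conclusion when there are billions of numbers.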

This, however, is not yet enough. The ultimate goal is something like an “MRI for all AI models”: a tool that makes it possible to diagnose problems such as the tendency to invent information, to deceive or to take control and, above all, to understand the mechanisms behind them. If we could understand the mechanisms behind emergent abilities, especially the negative ones, we could build increasingly powerful models without taking risks.

Reaching this goal is a race against time: models are improving at an impressive pace, and we risk having very powerful systems before we are able to truly understand them. Until we fully understand how AI works, we should treat it a bit like an intern: give it tasks whose results we can check, and do not give it access to important materials.
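The intern analogy can be sketched as a thin permission layer between an AI agent and the data it works on. This is a minimal illustration under stated assumptions: `GuardedStore`, its method names and the list of destructive actions are all hypothetical, and a real deployment would rely on the access controls of the actual database or platform.

```python
from dataclasses import dataclass, field

# Hypothetical set of actions considered too risky to run unsupervised.
DESTRUCTIVE_ACTIONS = {"delete", "delete_all"}


@dataclass
class GuardedStore:
    """Toy data store that an AI agent can use only through `request`."""

    records: dict
    audit_log: list = field(default_factory=list)

    def request(self, action: str, key: str = "", approved_by_human: bool = False):
        # Log every request so a person can review what the agent tried to do.
        self.audit_log.append((action, key, approved_by_human))
        if action in DESTRUCTIVE_ACTIONS and not approved_by_human:
            raise PermissionError(f"'{action}' needs human approval")
        if action == "read":
            return self.records.get(key)
        if action == "delete":
            return self.records.pop(key, None)
        if action == "delete_all":
            self.records.clear()
            return None
        raise ValueError(f"unknown action: {action}")


store = GuardedStore(records={"alice": "CEO", "bob": "CTO"})
print(store.request("read", "alice"))  # reading is allowed

try:
    store.request("delete_all")  # blocked: no human signed off
except PermissionError as err:
    print("blocked:", err)

print(len(store.records))  # the data is still there
```

The design choice is that the agent never touches the records directly: every request is logged, and anything destructive fails unless a person has explicitly approved it.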