
OmniHuman-1, the new ByteDance AI that generates ultra-realistic deepfake videos: how it works

Frame generated with OmniHuman-1. Credit: ByteDance

Artificial intelligence is making great strides in video generation, and OmniHuman-1, the latest creation from ByteDance (the company behind TikTok), is a clear demonstration of this. The new AI model can produce extremely realistic deepfakes, overcoming many of the technical limitations that in the past made such content easy to recognize. Unlike other systems, which often betray their artificial nature through imperfect details, OmniHuman-1 generates videos in which faces and movements look remarkably natural, making it harder to distinguish the real from the synthetic.

The model needs only a single reference image and an audio track to generate a clip of the desired length, and it can adapt the aspect ratio and the portion of the body shown. It was trained on roughly 19,000 hours of video content, although ByteDance has not specified where the training material comes from. Besides creating new videos, OmniHuman-1 can also modify existing footage, even altering the movements of the people in it. That said, output quality depends on the resolution of the source image, and the model can struggle with particularly complex poses. It is not currently available to the public.

How OmniHuman-1 works

OmniHuman-1 uses a combination of advanced artificial intelligence techniques to generate videos in which the subject appears remarkably natural. Unlike “traditional” deepfakes, which require numerous reference images to create a credible video, this system can generate a complete video starting from a single image and an audio file. To do so, the model relies on an advanced neural network trained on an impressive dataset: a full 18,700 hours of content.

One of the most innovative aspects of OmniHuman-1 is the possibility of adjusting various parameters, such as the “body proportion”, which defines how much of the human body is visible in the generated video, and the length of the final clip. This flexibility allows the system to adapt to a wide range of scenarios, increasing its versatility in video production.
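To give a concrete idea of what driving such a model might look like, here is a minimal sketch of a hypothetical client call. ByteDance has not released a public API, so every name below (GenerationRequest, generate_video, body_proportion and so on) is an illustrative assumption, not a real interface.

```python
from dataclasses import dataclass

# Hypothetical request shape -- ByteDance has published no API,
# so these names, values and defaults are illustrative assumptions only.
@dataclass
class GenerationRequest:
    reference_image: str       # path to the single reference photo
    audio_track: str           # path to the driving audio file
    body_proportion: str       # e.g. "face", "half_body" or "full_body"
    duration_seconds: float    # length of the clip to synthesize
    aspect_ratio: str = "9:16" # output format

def generate_video(req: GenerationRequest) -> str:
    """Placeholder for the actual model call; returns an output path."""
    return "output.mp4"  # stub result for illustration

request = GenerationRequest(
    reference_image="speaker.png",
    audio_track="speech.wav",
    body_proportion="half_body",
    duration_seconds=12.0,
)
output_path = generate_video(request)
```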

One of the key elements that makes OmniHuman-1 particularly powerful is its integration of several types of input, including text, audio and pose. Given these inputs, the system generates a video that matches not only the subject's body movements but also lip synchronization and facial expressiveness. OmniHuman-1's artificial intelligence is, in effect, able to “read” the natural movement of the human body, thanks to a training process based on an enormous volume of data covering different forms of bodily expression and vocal interaction.
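ByteDance describes this as conditioning a diffusion transformer (DiT) on several modalities at once. The sketch below illustrates one common way such fusion can be done, projecting each modality into a shared embedding space and concatenating the results into a single conditioning sequence; the dimensions and layer choices are arbitrary assumptions, not OmniHuman-1's actual design.

```python
import torch
import torch.nn as nn

class MultiConditionFusion(nn.Module):
    """Illustrative fusion of text/audio/pose conditions into one token
    sequence, in the spirit of a multi-condition DiT. All dimensions are
    arbitrary assumptions, not OmniHuman-1's real configuration."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(768, d_model)    # e.g. text-encoder features
        self.audio_proj = nn.Linear(1024, d_model)  # e.g. speech-encoder features
        self.pose_proj = nn.Linear(134, d_model)    # e.g. keypoint features

    def forward(self, text, audio, pose):
        # Project each modality into the shared space, then concatenate
        # along the sequence axis so the transformer can attend to all of them.
        tokens = [self.text_proj(text), self.audio_proj(audio), self.pose_proj(pose)]
        return torch.cat(tokens, dim=1)  # (batch, total_tokens, d_model)

fusion = MultiConditionFusion()
cond = fusion(
    torch.randn(1, 16, 768),   # 16 text tokens
    torch.randn(1, 50, 1024),  # 50 audio frames
    torch.randn(1, 50, 134),   # 50 pose frames
)
print(cond.shape)  # torch.Size([1, 116, 512])
```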

As shown in the diagram below, the OmniHuman system is divided into two main components:

  1. The OmniHuman model, which is built on a DiT (Diffusion Transformer) deep learning architecture and allows simultaneous conditioning on different modalities such as text, image, audio and human body poses.
  2. The “omni-conditions” training strategy, which adopts a multi-stage learning process in which progression depends on how strongly each condition constrains the subject's motion. This mixed-condition training approach allows the OmniHuman model to exploit a large volume of heterogeneous data.
Schematic representation of the OmniHuman model. Credit: ByteDance

OmniHuman-1's training process was designed to optimize the model's ability to generate videos. Initially, the system learns to generate videos from low-complexity inputs, such as text and images, and then progressively integrates audio and pose signals. This “multi-condition” approach allows the system to refine its skills and improve the final quality of its outputs.
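In code, such a staged curriculum might be organized along the lines of the sketch below. The stage ordering (text and image first, then audio, then pose) follows the description above; the model and helper functions are hypothetical placeholders, not ByteDance's actual pipeline.

```python
# Illustrative multi-stage ("omni-conditions") training loop.
# Stage ordering follows the article; everything else is a hypothetical
# placeholder rather than ByteDance's real training code.
STAGES = [
    {"conditions": ["text", "image"]},                   # weak motion constraints first
    {"conditions": ["text", "image", "audio"]},          # then add audio-driven motion
    {"conditions": ["text", "image", "audio", "pose"]},  # strongest constraint last
]

def train_step(model, batch, active_conditions):
    """Hypothetical: one denoising step using only the active conditions."""
    conditions = {k: v for k, v in batch["conditions"].items()
                  if k in active_conditions}
    return model.denoising_loss(batch["video"], conditions)

def train(model, dataloader, steps_per_stage=10_000):
    for stage in STAGES:
        for _, batch in zip(range(steps_per_stage), dataloader):
            loss = train_step(model, batch, stage["conditions"])
            loss.backward()         # assumes a PyTorch-style model
            model.optimizer_step()  # hypothetical helper
```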

Despite the system's effectiveness, ByteDance's developers admit that some limitations remain, such as difficulty handling low-quality source images and trouble generating subjects in particularly complex poses, but the technology is evolving constantly and the model should improve on these fronts. In the following videos, meanwhile, you can see some clips generated with OmniHuman-1 and judge for yourself how realistic they are.

Concerns about ultra-realistic deepfakes such as OmniHuman-1

If on the one hand this type of technology opens up new possibilities for entertainment and digital content creation, on the other it poses significant challenges for safety and ethics. In recent years, deepfakes have been used in political disinformation campaigns in various countries around the world. In Taiwan, for example, an AI-generated audio clip circulated in which a politician appeared to support a pro-China candidate, while in Moldova a fake video of the president's alleged resignation made the rounds. Even in the financial sector, deepfakes are being used for sophisticated scams, with companies suffering multimillion-dollar losses due to impersonations of executives and celebrities.

Not surprisingly, the economic impact attributable to the proliferation of synthetic content is significant. According to a Deloitte report, losses related to deepfake fraud exceeded 12 billion dollars in 2023 and could reach 40 billion by 2027 in the United States alone. Although some social networks and search platforms have started implementing tools to identify and limit the spread of AI-falsified content, the volume of this material continues to grow rapidly. And with the arrival of tools of the caliber of OmniHuman-1, the phenomenon is only set to grow.