Sora, the new magic lantern

Federico Bo
5 min read · Feb 24, 2024


Sora — Image from a sample video

Generating a video from a textual request is, as one might guess, a far more complex process than generating an image. The arrival of the temporal dimension introduces a whole series of steps, constraints, and dynamics that the model and its underlying neural network must “learn” to manage: consistency of shapes, fluidity of movement, changing perspectives, and sufficient knowledge of how the real world works, such as the effects of gravity.
An advanced text-to-video (t2v) model must be, as Jim Fan, Senior Research Scientist at Nvidia, wrote in a post on X, a “data-driven physics engine” capable of simulating worlds, real or imaginary.

Until a few days ago, videos generated by services such as Runway or
Pika were limited to micro-clips of a few seconds, not particularly
sophisticated.

Then came Sora, OpenAI’s t2v model.

Although the model is not yet publicly available, Sora’s sample videos already showcase the significant leap in quality it enables: one-minute videos that are incredibly realistic, resembling scenes from a Netflix series, a Nolan film, or a Pixar short. All generated from simple textual requests. Each of these examples would have required hours or even days of work from animation studios or on-set crews.

The humorous image circulating online depicting the iconic “Hollywood” sign in Los Angeles replaced with “Sorawood” may be exaggerated but certainly hints at the generative future awaiting the audiovisual industry, whether willingly or not.

Understanding the basic functioning of the model used by Sora can aid in evaluating it more objectively and less emotionally.

The technical paper released by OpenAI may not fully meet academic standards in terms of technical-scientific rigor and transparency, but it does reference relevant studies, which can be valuable for reconstructing the research conducted by the OpenAI team and discovering interesting tools and techniques. It also confirms that advancements in foundational models are a result of knowledge sharing through research articles among international teams from universities, research centers, and companies engaged in this highly dynamic and interconnected field.

Let’s look at how Sora works.

The model used is a variant of the diffusion models typically employed for text-to-image generation. These models are trained on vast numbers of images: during training, “noise” is gradually added to each image until it resembles the “static fog” seen on old televisions, and the model learns to reverse the process. To create a new image, it starts from this “haze” and progressively extracts an image from it.
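For intuition, here is a minimal toy of the forward noising step, written in Python with PyTorch. The linear schedule, shapes, and values are illustrative assumptions, not anything from Sora’s actual implementation.

```python
import math
import torch

def add_noise(x0, t, num_steps=1000):
    """Blend a clean image x0 with Gaussian noise according to timestep t."""
    alpha = 1.0 - t / num_steps                 # toy linear schedule
    noise = torch.randn_like(x0)
    x_t = math.sqrt(alpha) * x0 + math.sqrt(1.0 - alpha) * noise
    return x_t, noise

# Training sketch: a network is asked to predict `noise` from `x_t` and `t`,
# so that at generation time the process can run in reverse, starting from
# pure noise and removing a little of it at every step.
image = torch.rand(3, 64, 64)                   # a fake 64x64 RGB "image"
x_t, noise = add_noise(image, t=800)
```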

Sora, however, is a “diffusion transformer”: it combines diffusion with the transformer, the core module of large language models (LLMs) such as ChatGPT. Just as text is broken down into smaller units called tokens in those models, videos are divided into spacetime patches, small blocks of pixels spanning both space and time.
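A rough sketch of how a video tensor can be cut into spacetime patches and flattened into a token sequence; the tensor shape and patch sizes below are assumptions for illustration, not Sora’s real configuration.

```python
import torch

video = torch.randn(16, 3, 256, 256)            # (frames, channels, height, width)
pt, ph, pw = 2, 16, 16                          # patch size along time, height, width

patches = (
    video.unfold(0, pt, pt)                     # split the time axis
         .unfold(2, ph, ph)                     # split the height
         .unfold(3, pw, pw)                     # split the width
)
# Bring the channel dimension inside each patch, then flatten every patch
# into one vector: the resulting sequence plays the role of text tokens.
patches = patches.permute(0, 2, 3, 1, 4, 5, 6)  # (T', H', W', C, pt, ph, pw)
tokens = patches.reshape(-1, 3 * pt * ph * pw)
print(tokens.shape)                             # torch.Size([2048, 1536])
```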

The choice of the transformer, according to OpenAI and various recent studies, is driven by its greater scalability and superior performance in handling and manipulating images. It has also been observed that training the model on videos at their original dimensions yields better results: Sora can sample widescreen videos at 1920×1080 and vertical videos at 1080×1920.
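One practical consequence of native-resolution training is that clips with different aspect ratios simply become patch sequences of different lengths, which a transformer can handle without cropping or resizing. The toy calculation below, with assumed patch sizes, makes this concrete.

```python
def num_patches(frames, height, width, pt=2, ph=16, pw=16):
    """Count spacetime patches for a clip, with assumed patch sizes."""
    return (frames // pt) * (height // ph) * (width // pw)

print(num_patches(60, 1080, 1920))   # widescreen clip -> 241200 patches
print(num_patches(60, 1920, 1080))   # vertical clip   -> 241200 patches
print(num_patches(60, 720, 720))     # square clip     -> 60750 patches
```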

We previously introduced patches; these fragments are not extracted from the original video but from its encoding in a lower-dimensional latent space. Let’s explain further. The video passes through an autoencoder, an architecture made of two parts: an encoder that compresses the video temporally and spatially (similar to zipping a file) and a decoder that restores it to its original format. The latent space is the space of this compressed video. For example, a 512×512 color image can be compressed by an autoencoder into a compact 64×64 latent representation.
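To make the idea concrete, here is a tiny video autoencoder sketch in PyTorch: the encoder shrinks time and space by a factor of four, and the decoder restores them. Channel counts, layer choices, and compression factors are arbitrary assumptions; Sora’s actual compression network has not been published.

```python
import torch
import torch.nn as nn

class TinyVideoAutoencoder(nn.Module):
    def __init__(self, channels=3, latent_channels=4):
        super().__init__()
        # Encoder: 3D convolutions shrink time and space by a factor of 4 each
        self.encoder = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(32, latent_channels, kernel_size=3, stride=2, padding=1),
        )
        # Decoder: transposed convolutions restore the original resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose3d(32, channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video):                     # video: (batch, C, T, H, W)
        latent = self.encoder(video)              # compressed representation
        return self.decoder(latent), latent

model = TinyVideoAutoencoder()
video = torch.randn(1, 3, 16, 64, 64)
reconstruction, latent = model(video)
print(latent.shape)           # torch.Size([1, 4, 4, 16, 16])
print(reconstruction.shape)   # torch.Size([1, 3, 16, 64, 64])
```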

Autoencoder

The model, with its transformer, operates on this compressed version. As highlighted in a study cited by OpenAI, “by training diffusion models on this representation, they achieve for the first time an almost optimal balance between complexity reduction and detail preservation.”

Before entering the training phase, videos go through a model that generates detailed captions: research has shown that overly brief and/or inaccurate descriptions negatively impact the understanding of text-to-image prompts (and therefore t2v as well).
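As a toy stand-in for this re-captioning step, one could caption sampled frames of a clip with an off-the-shelf image captioner such as BLIP and join the results. OpenAI’s actual captioner is a video model and has not been released, so this only shows where detailed captions enter the pipeline.

```python
# Toy stand-in for the re-captioning step: caption sampled frames of a clip
# with an off-the-shelf image captioner (BLIP), then join them into a rough
# clip description. This is NOT Sora's captioner, only an illustration.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_frames(frame_paths):
    captions = []
    for path in frame_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        captions.append(processor.decode(out[0], skip_special_tokens=True))
    return " ".join(captions)
```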

Sora — Training step

To summarize: during the training phase, a video is first captioned in detail, then “compressed,” and gradually transformed — in this latent space — into “fog”: the model learns to deconstruct a video patch by patch. Through a de-noising operation, it learns how to make the video reappear, reconstructing it step by step, always in the latent space: objects, animals, and people emerge from the mist. The model gradually generalizes: it learns to connect words and phrases to thousands of entities and dynamics. In this way, once training is complete, it can create videos from a textual description and a fictitious “haze.” The final step passes through the decoder, which produces the video at the desired definition.
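Putting the inference-time pieces together, the generation loop has roughly this shape. Every component below (toy_text_encoder, toy_denoiser, toy_decoder, the latent shape, the step count) is a made-up stand-in used to show the structure, not one of OpenAI’s actual models.

```python
import torch

def toy_text_encoder(prompt: str) -> torch.Tensor:
    # Stand-in: a fixed-size "embedding" derived from the prompt
    torch.manual_seed(abs(hash(prompt)) % (2**31))
    return torch.randn(512)

def toy_denoiser(latent, step, text_emb):
    # Stand-in for the diffusion transformer: predicts the noise to remove
    return 0.1 * latent + 0.001 * text_emb.mean()

def toy_decoder(latent):
    # Stand-in for the decoder that maps latents back to pixel space
    return latent.repeat_interleave(4, dim=-1).repeat_interleave(4, dim=-2)

def generate_video(prompt, steps=50, latent_shape=(4, 30, 68, 120)):
    text_emb = toy_text_encoder(prompt)
    latent = torch.randn(latent_shape)            # start from pure "fog" in latent space
    for step in reversed(range(steps)):           # iterative de-noising
        latent = latent - toy_denoiser(latent, step, text_emb)
    return toy_decoder(latent)                    # back to a higher-resolution output

video = generate_video("a corgi surfing at sunset")
print(video.shape)   # torch.Size([4, 30, 272, 480])
```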

The videos used to train the Sora model come from various sources, although OpenAI does not disclose the specific origins of their datasets. It is suspected that synthetic data, such as videos from simulation engines like Unreal, may have been used, but the extensive use of videos from platforms like YouTube cannot be ruled out.

In addition to generating approximately one-minute videos from text, Sora can animate an image, extend a video backward or forward in time, edit it by changing elements like background or context, and merge two videos into a “chimera” video. For more examples of Sora’s capabilities, it is recommended to visit OpenAI’s detailed page.

The emergent capabilities, those not explicitly intended but arising from large-scale training, are particularly intriguing. As mentioned in the paper, “Sora can generate videos with dynamic camera movement. As the camera shifts and rotates, people and scene elements move coherently through three-dimensional space.” People, animals, and objects reappear after being momentarily hidden. Actions have an impact on the recreated world: biting a cookie leaves it partially eaten, painting alters the artwork.

The Sora AI model by OpenAI has garnered attention for its ability to create detailed minute-long videos from text prompts. While the sample videos displayed exhibit some imperfections typical of a new tool, artists and expert videomakers are collaborating to provide feedback to the OpenAI team. The true impact and innovation that Sora will bring to the audiovisual and movie industries can only be fully assessed once it becomes available for testing.

--

Federico Bo

Computer engineer, tech-humanist hybrid. Interested in blockchain technologies and AI.