OpenAI Sora Technical Document

User 7254

Last modified February 16, 2024


Video generation models as world simulators​

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.​

This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.​

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks (1, 2, 3), generative adversarial networks (4, 5, 6, 7), autoregressive transformers (8, 9), and diffusion models (10, 11, 12). These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.

Turning visual data into patches​

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data (13, 14). The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text: code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data (15, 16, 17, 18). We find that patches are a highly scalable and effective representation for training generative models on diverse types of videos and images.

At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space (19), and subsequently decomposing the representation into spacetime patches.

Video compression network​

We train a network that reduces the dimensionality of visual data (20). This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
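The shape changes described in this section can be illustrated with a minimal sketch. The report does not disclose the architecture or the compression factors, so everything here is an assumption made for illustration: `toy_encode` replaces the learned encoder with plain average pooling, `toy_decode` replaces the learned decoder with nearest-neighbor repetition, and the factors `t_factor=2` and `s_factor=4` are arbitrary.

```python
import numpy as np

def toy_encode(video, t_factor=2, s_factor=4):
    """Average-pool a (T, H, W, C) video over time and space.

    Hypothetical stand-in for a learned encoder: it only mimics the
    temporal and spatial shape reduction, not the learned features.
    """
    T, H, W, C = video.shape
    assert T % t_factor == 0 and H % s_factor == 0 and W % s_factor == 0
    blocks = video.reshape(T // t_factor, t_factor,
                           H // s_factor, s_factor,
                           W // s_factor, s_factor, C)
    # Average inside each (t_factor, s_factor, s_factor) block.
    return blocks.mean(axis=(1, 3, 5))

def toy_decode(latent, t_factor=2, s_factor=4):
    """Return to pixel-space shape by repeating each latent cell.

    Hypothetical stand-in for the learned decoder.
    """
    out = latent.repeat(t_factor, axis=0)
    out = out.repeat(s_factor, axis=1)
    return out.repeat(s_factor, axis=2)

# 8 frames of 32x48 RGB video with random pixel values.
video = np.random.default_rng(0).random((8, 32, 48, 3))
latent = toy_encode(video)
recon = toy_decode(latent)
print(latent.shape)  # (4, 8, 12, 3)
print(recon.shape)   # (8, 32, 48, 3)
```

In this toy setup the latent holds 32 times fewer values than the input (2 x 4 x 4). The generative model would operate on `latent`, and only the decoder touches pixel space.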

Spacetime latent patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.​
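One way to picture the patch extraction is the sketch below, which cuts a latent of shape (T, H, W, C) into non-overlapping blocks and flattens each block into one token. The report gives no patch sizes or layout, so `patchify`, `unpatchify`, `init_noise_tokens`, and the patch sizes `pt=1, ph=2, pw=2` are hypothetical choices. A `pt` greater than 1 would make each patch span several latent frames.

```python
import numpy as np

def patchify(latent, pt=1, ph=2, pw=2):
    """Split a (T, H, W, C) latent into a (num_patches, pt*ph*pw*C) sequence."""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Put the patch-grid axes first, then the within-patch axes.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)

def unpatchify(tokens, grid, pt=1, ph=2, pw=2):
    """Inverse of patchify; grid is (T // pt, H // ph, W // pw)."""
    gt, gh, gw = grid
    C = tokens.shape[1] // (pt * ph * pw)
    x = tokens.reshape(gt, gh, gw, pt, ph, pw, C)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(gt * pt, gh * ph, gw * pw, C)

def init_noise_tokens(grid, token_dim, rng):
    """Randomly initialized patches for a chosen output grid (inference)."""
    gt, gh, gw = grid
    return rng.standard_normal((gt * gh * gw, token_dim))

rng = np.random.default_rng(0)
video_latent = rng.standard_normal((4, 8, 12, 3))
image_latent = rng.standard_normal((1, 8, 8, 3))  # an image as a one-frame video

video_tokens = patchify(video_latent)
image_tokens = patchify(image_latent)
print(video_tokens.shape)  # (96, 12)
print(image_tokens.shape)  # (16, 12)

# At inference, the grid alone fixes the size of the generated sample.
noise = init_noise_tokens((4, 4, 6), token_dim=12, rng=rng)
print(unpatchify(noise, (4, 4, 6)).shape)  # (4, 8, 12, 3)
```

Both inputs produce tokens of the same width (12 values) but different sequence lengths (96 versus 16), which is why a single sequence model can train on mixed resolutions, durations and aspect ratios.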

Scaling transformers for video generation​

Sora is a diffusion model (21, 22, 23, 24, 25); given input noisy patches (and conditioning information like text prompts), it's trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer (26). Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling (13, 14), computer vision (15, 16, 17, 18), and image generation (27, 28, 29).
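The training objective described above (noisy patches in, clean patches out) can be sketched as follows. The report does not state the noise schedule, the loss, or how conditioning is injected, so these are assumptions: a variance-preserving mix with noise level `sigma`, a mean squared error against the clean tokens, and a hypothetical `toy_denoiser` that is only a per-token linear map standing in for the diffusion transformer. Text conditioning is omitted.

```python
import numpy as np

def add_noise(clean, sigma, rng):
    """Mix clean tokens with Gaussian noise at level sigma in [0, 1].

    Assumed schedule: sqrt(1 - sigma^2) * clean + sigma * noise.
    """
    noise = rng.standard_normal(clean.shape)
    return np.sqrt(1.0 - sigma ** 2) * clean + sigma * noise

def toy_denoiser(noisy, sigma, weights):
    """Placeholder for the diffusion transformer: one linear map per token.

    A real model would attend across tokens and use sigma and the text
    prompt as conditioning; sigma is accepted but unused here.
    """
    return noisy @ weights

def clean_prediction_loss(clean, sigma, weights, rng):
    """Mean squared error between predicted and original clean tokens."""
    noisy = add_noise(clean, sigma, rng)
    pred = toy_denoiser(noisy, sigma, weights)
    return float(np.mean((pred - clean) ** 2))

rng = np.random.default_rng(0)
clean = rng.standard_normal((96, 12))  # 96 patch tokens of width 12
weights = np.eye(12)                   # identity map as a trivial "model"

loss_low = clean_prediction_loss(clean, 0.0, weights, rng)
loss_high = clean_prediction_loss(clean, 0.9, weights, rng)
print(loss_low, loss_high)
print(loss_high > loss_low)  # True
```

With no noise the identity map already returns the clean tokens, so the loss is essentially zero. At higher noise levels it cannot, and training would adjust the model parameters to reduce that gap.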

In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.​
