
Generate videos without end and plan decisions: Diffusion Forcing integrates next-token prediction with full-sequence diffusion

2024-07-23


Machine Heart Report

Editor: Panda W

Currently, large autoregressive language models built on the next-token prediction paradigm have become popular around the world. At the same time, the flood of synthetic images and videos on the Internet has already shown us the power of diffusion models.

Recently, a research team from MIT CSAIL (including MIT doctoral student Boyuan Chen) successfully integrated the strengths of full-sequence diffusion models and next-token models, proposing a new training and sampling paradigm: Diffusion Forcing (DF).



  • Paper title: Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
  • Paper address: https://arxiv.org/pdf/2407.01392
  • Project website: https://boyuan.space/diffusion-forcing
  • Code address: https://github.com/buoyancy99/diffusion-forcing

As shown below, diffusion forcing clearly outperforms both full-sequence diffusion and teacher forcing in terms of consistency and stability.



In this framework, each token is associated with a random, independent noise level, and tokens can be denoised according to arbitrary, independent per-token schedules using a shared next-token (or next-few-token) prediction model.

The method is inspired by the observation that adding noise to a token is a form of partial masking: zero noise means the token is unmasked, while full noise means it is completely masked. DF therefore forces the model to learn to unmask arbitrary collections of variably noised tokens (Figure 2).
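To make the "noise as partial masking" idea concrete, below is a minimal training-step sketch, not the authors' code: every token in a sequence draws its own random noise level, so fully masked (pure-noise) and unmasked (clean) tokens coexist in one training example. The model interface `model(x_noisy, k)` and the helper name are assumptions for illustration.

```python
# Hypothetical sketch of per-token noising during training (PyTorch).
import torch

def diffusion_forcing_train_step(model, x, alphas_cumprod, optimizer):
    """x: clean token sequence of shape (batch, seq_len, dim);
    alphas_cumprod: (K,) cumulative noise schedule."""
    B, T, D = x.shape
    K = alphas_cumprod.shape[0]
    # Independent noise level per token: this is the "variable masking".
    k = torch.randint(0, K, (B, T), device=x.device)
    a = alphas_cumprod[k].unsqueeze(-1)                 # (B, T, 1)
    eps = torch.randn_like(x)
    x_noisy = a.sqrt() * x + (1 - a).sqrt() * eps       # partially masked tokens
    eps_pred = model(x_noisy, k)                        # model is told each level
    loss = torch.nn.functional.mse_loss(eps_pred, eps)  # standard denoising loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```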



At the same time, by parameterizing the prediction method as a combination of multiple next-token prediction models, the system can flexibly generate sequences of different lengths and generalize to new trajectories in a combinatorial manner (Figure 1).



The team implemented DF for sequence generation as Causal Diffusion Forcing (CDF), where future tokens depend on past tokens through a causal architecture. They trained the model to denoise all tokens of the sequence at once (where each token has an independent noise level).

During sampling, CDF gradually denoises a sequence of Gaussian noise frames into clean samples, where different frames may sit at different noise levels at each denoising step. Like a next-token prediction model, CDF can generate sequences of variable length; unlike next-token prediction, it remains stable whether predicting the immediately next token, a token thousands of steps in the future, or blocks of consecutive tokens.
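The sketch below illustrates this sampling procedure under stated assumptions: a per-step, per-frame noise-level table drives the rollout, so near frames can be nearly clean while far frames stay noisy. The function `denoise_one_level` is a hypothetical stand-in for one causal reverse-diffusion update of a single frame; it is not an API from the released code.

```python
# Hedged sketch of CDF sampling with a 2D noise-level schedule.
import torch

def cdf_sample(denoise_one_level, schedule, seq_len, dim):
    """schedule: (num_steps, seq_len) integer noise levels,
    non-increasing over steps for each frame."""
    x = torch.randn(seq_len, dim)                  # start from pure Gaussian noise
    num_steps = schedule.shape[0]
    for s in range(num_steps - 1):
        for t in range(seq_len):
            k_now, k_next = schedule[s, t], schedule[s + 1, t]
            if k_next < k_now:                     # this frame is denoised a bit
                # Causal: condition only on frames up to t at their current levels.
                x[t] = denoise_one_level(x[:t + 1], t, k_now, k_next)
    return x
```

Because the schedule is an explicit input, the same trained model supports many rollout styles (fully sequential, block-wise, or keeping the far future deliberately noisy) simply by changing this table.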

In addition, like full-sequence diffusion, it can accept guidance toward high-reward generations. By jointly exploiting causality, flexible horizons, and variable noise schedules, CDF enables a new capability: Monte Carlo Tree Guidance (MCTG). Compared with non-causal full-sequence diffusion models, MCTG greatly improves the rate at which high-reward generations are sampled. Figure 1 gives an overview of these capabilities.
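As a rough intuition for how variable noise enables this, here is a much-simplified, single-expansion sketch (hypothetical helper names, not the paper's algorithm): near-future tokens are denoised while far-future tokens stay noisy, several candidate continuations are scored by expected reward, and the best branch is kept for further expansion, in the spirit of tree search.

```python
# Much-simplified sketch of reward-guided branch selection; the actual MCTG
# procedure in the paper expands a tree over partially denoised futures.
import torch

def mctg_step(rollout_partial, estimate_reward, state, horizon, n_branches=8):
    """rollout_partial(state, horizon) -> partially denoised future plan (tensor);
    estimate_reward(plan) -> scalar expected reward under remaining uncertainty."""
    candidates = [rollout_partial(state, horizon) for _ in range(n_branches)]
    scores = torch.tensor([estimate_reward(p) for p in candidates])
    best = candidates[int(scores.argmax())]
    return best  # the chosen branch would then be denoised further / expanded
```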











Experiments

The team evaluated the benefits of Diffusion Forcing as a generative sequence model across a variety of applications, including video and time series forecasting, planning, and imitation learning.

Video prediction: consistent and stable sequence generation and infinite expansion

For the video generative modeling task, they trained a convolutional RNN implementation of Causal Diffusion Forcing on Minecraft gameplay videos and DMLab navigation.

Figure 3 shows the qualitative results of Diffusion Forcing and the baselines.



It can be seen that Diffusion Forcing rolls out stably even well beyond its training horizon, while the teacher forcing and full-sequence diffusion baselines quickly diverge.

Diffusion planning: MCTG, causal uncertainty, flexible horizon control

The capabilities of Diffusion Forcing bring unique benefits to decision-making. The team evaluated the proposed decision framework on D4RL, a standard offline reinforcement learning benchmark.



Qualitative and quantitative evaluation results are given in Table 1. Diffusion Forcing outperforms Diffuser and all other baselines in all six environments.

Controllable compositional sequence generation

The team found that sub-sequences seen at training time can be flexibly composed simply by modifying the sampling scheme, as sketched below.

They conducted experiments using a 2D trajectory dataset: on a square plane, all trajectories started from one corner and ended at the opposite corner, forming a cross.

As shown in Figure 1 above, when compositional behavior is not needed, DF can keep full memory and reproduce the cross-shaped distribution. When composition is desired, the model can instead use MPC to generate shorter, memoryless plans, stitching together sub-trajectories of the cross to obtain a V-shaped trajectory.
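The contrast between the two sampling schemes can be summarized in a hedged sketch (function names are placeholders, not the released API): the same trained model either conditions on the full history, reproducing the training distribution, or re-plans over a short recent window without long-term memory, which is what allows sub-trajectories to be stitched compositionally.

```python
# Two sampling schemes over one trained model (illustrative placeholders).

def sample_full_memory(generate_next, start, total_len):
    """generate_next(history) -> next token; conditions on the full history."""
    traj = [start]
    for _ in range(total_len - 1):
        traj.append(generate_next(traj))
    return traj

def sample_mpc(generate_next, start, total_len, window=4):
    """MPC-style: only a short recent window is visible, so memory is dropped."""
    traj = [start]
    for _ in range(total_len - 1):
        traj.append(generate_next(traj[-window:]))
    return traj
```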

Robotics: long-horizon imitation learning and robust visuomotor control

Diffusion Forcing also opens new opportunities for visuomotor control of real robots.

Imitation learning is a commonly used robotic manipulation technique that learns the mapping from observations to actions demonstrated by experts. However, a lack of memory often keeps imitation learning from completing long-horizon tasks. DF not only alleviates this shortcoming but also makes imitation learning more robust.

Imitation learning with memory. By teleoperating a Franka robot, the team collected a dataset of videos and actions. As shown in Figure 4, the task is to swap the positions of an apple and an orange using a third slot. The fruits' initial positions are random, so there are two possible goal states.



Furthermore, when a fruit occupies the third slot, the desired outcome cannot be inferred from the current observation alone; the policy must remember the initial configuration to decide which fruit to move. Unlike commonly used behavior cloning methods, DF naturally incorporates memory into its latent state. DF achieved an 80% success rate, while Diffusion Policy (the current best memoryless imitation learning algorithm) failed.

In addition, DF is more robust to noise and helps in robot pre-training.

Time Series Forecasting: Diffusion Forcing is an Excellent General-Purpose Sequence Model

For multivariate time series forecasting tasks, the team’s research shows that DF is comparable to previous diffusion models and Transformer-based models.

For more technical details and experimental results, please refer to the original paper.