
Meta develops a new method: integrating language and diffusion AI models to reduce computational complexity, improve efficiency, and optimize generated images

2024-08-24


IT Home reported on August 24 that Meta AI has introduced Transfusion, a new method that combines language models and image generation models into a single, unified AI system.

IT Home quoted the team as saying that Transfusion combines the strengths of language models at processing discrete data such as text with the ability of diffusion models to generate continuous data such as images.

Meta explains that current image generation systems typically use a pre-trained text encoder to process the input prompt, which is then combined with a separate diffusion model to generate the image.

Many multimodal language models work similarly, connecting a pre-trained model for text with specialized encoders for the other modalities.

Transfusion, by contrast, uses a single, unified Transformer architecture for all modalities and is trained end-to-end on text and image data. Different loss functions are used for the two modalities: next-token prediction for text and a diffusion loss for images.
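As a rough illustration only (not Meta's actual implementation), the combined objective can be sketched as a weighted sum of a next-token cross-entropy loss on text positions and a noise-prediction loss on image patches. The function name, shapes, and weighting factor below are all assumptions:

```python
import numpy as np

def transfusion_loss(text_logits, text_targets, noise_pred, noise_true, lam=5.0):
    """Sketch of a combined objective: next-token prediction for text
    plus a diffusion (noise-prediction) loss for image patches.
    `lam` balances the two terms (its value here is an assumption)."""
    # Language-modeling term: softmax cross-entropy over the vocabulary axis.
    shifted = text_logits - text_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    lm_loss = -log_probs[np.arange(len(text_targets)), text_targets].mean()
    # Diffusion term: mean-squared error between predicted and true noise.
    diff_loss = ((noise_pred - noise_true) ** 2).mean()
    return lm_loss + lam * diff_loss
```

In training, each position in the mixed sequence would contribute to exactly one of the two terms depending on whether it holds a text token or an image patch.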

To process text and images together, images are converted into sequences of image patches. The model can then handle text tokens and image patches in a single sequence, and a special attention mask allows the model to capture the internal relationships within each image.
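The patch conversion described above can be sketched as follows. This is an illustrative helper, not Meta's code; the patch size and layout are assumptions:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into a sequence of flattened patches,
    so the patches can be interleaved with text tokens in one sequence.
    (Illustrative only; sizes and layout are assumptions.)"""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    # Crop to a whole number of patches, then rearrange into
    # (num_patches, patch_dim) row-major patch order.
    seq = (image[:rows * patch, :cols * patch]
           .reshape(rows, patch, cols, patch, c)
           .transpose(0, 2, 1, 3, 4)
           .reshape(rows * cols, patch * patch * c))
    return seq
```

Each row of the result is one patch vector that occupies one position in the model's input sequence, alongside the text tokens.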

Unlike Meta's existing methods such as Chameleon (which converts images into discrete tokens and then processes them the same way as text), Transfusion retains the continuous representation of images, avoiding the information loss caused by quantization.
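A toy example can show why quantization loses information. Mapping continuous patch values onto a small codebook (as token-based approaches do) introduces a rounding error that a continuous representation avoids; the codebook size and values here are arbitrary assumptions:

```python
import numpy as np

# Toy 8-entry codebook of allowed "token" values.
codebook = np.linspace(0.0, 1.0, 8)

# Continuous patch values as a diffusion model would see them.
x = np.array([0.11, 0.52, 0.97])

# Quantize: snap each value to its nearest codebook entry.
tokens = np.abs(x[:, None] - codebook).argmin(axis=1)

# De-quantize: the reconstruction differs from the original.
reconstructed = codebook[tokens]
error = np.abs(x - reconstructed).max()  # nonzero rounding error
```

Keeping the continuous values, as Transfusion does, makes this reconstruction error zero by construction.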

Experiments also showed that Transfusion scaled more efficiently than comparable approaches. It matched specialized models in image generation while using significantly less computation, and, surprisingly, incorporating image data also improved its text-processing capabilities.

The researchers trained a 7-billion-parameter model on 2 trillion text and image tokens. The model achieved image-generation results comparable to established systems such as DALL-E 2, while also being able to process text.