2024-08-12
As large models enter real-world use, text-to-image generation ("Wensheng tu") is undoubtedly one of the hottest application directions.
Since the birth of Stable Diffusion, large text-to-image models have emerged in an endless stream both in China and abroad, in what feels like a "battle of the gods". In just a few months, the title of "strongest AI painter" has changed hands several times, with each technical iteration raising the ceiling on AI image-generation quality and speed.
So today, we can get almost any picture we want by typing a few words. Whether it is a professional commercial poster or a hyper-realistic photo, the realism of AI-generated images has already amazed us. AI even won a 2023 Sony World Photography Award. Before the award was announced, the "photo" had been exhibited at Somerset House in London; had the author not publicly disclosed it, few might have discovered that the image was actually generated by AI.
Boris Eldagsen and his AI-generated work "The Electrician"
Making AI-generated images more beautiful requires the sustained effort of AI practitioners. The sixth episode of "AIGC Experience Party" invited Doubao text-to-image technology expert Li Liang and NVIDIA solution architect Zhao Yijia to give an in-depth analysis of the technology behind a text-to-image model that produces more beautiful images, faster, and with a friendlier user experience.
At the beginning of the live broadcast, Li Liang first walked through the recent technical upgrades in text-to-image generation by ByteDance's Doubao large model, currently one of the most prominent ("top-tier") domestic large models.
Li Liang said the problems the Doubao team wants to solve fall into three areas: first, how to achieve stronger image-text alignment so results match the user's design intent; second, how to generate more beautiful images for a superior user experience; and third, how to generate images faster to support ultra-large-scale service calls.
For image-text alignment, the Doubao team started with data, carefully screening and filtering a massive corpus of image-text pairs and ultimately retaining hundreds of billions of high-quality examples. In addition, the team trained a dedicated multimodal large language model for the recaption task; this model describes the objects in an image and their physical relationships more comprehensively and objectively.
With high-quality, finely described image-text data in hand, the text-understanding module also needs strengthening for the model to show its full capability. The team uses a natively bilingual large language model as the text encoder, which significantly improves the model's understanding of Chinese. As a result, when faced with Chinese cultural elements such as "Tang Dynasty" or "Lantern Festival", the Doubao text-to-image model shows a much deeper understanding.
The Doubao team also added tricks of its own to the diffusion model architecture: it effectively scaled up the UNet. By increasing the parameter count, the Doubao text-to-image model further improved its understanding of image-text pairs and its high-fidelity generation capability.
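As a back-of-the-envelope illustration of why widening a UNet is an effective way to add parameters (the channel widths below are typical SD-style values, not Doubao's actual configuration): a convolution block's parameter count grows roughly quadratically with channel width.

```python
def conv2d_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Parameters of one k x k convolution: weight tensor plus per-channel bias."""
    return c_in * c_out * k * k + c_out

# Hypothetical widths; SD-style UNets use channel multipliers in this range.
base = conv2d_params(320, 320)
wide = conv2d_params(640, 640)   # double the channel width

print(base, wide, wide / base)   # doubling the width grows a conv block ~4x
```

This quadratic growth is why modest width increases translate into substantially larger models without changing the architecture's shape.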
For aesthetic style, which users perceive most directly, the Doubao team introduced professional aesthetic guidance and continuously tracked the aesthetic preferences of users and the public, while also working on data and model architecture. Often, the gap between the images users get and the demo showcase resembles a "buyer's show" versus a "seller's show": in reality, the prompt the user gives the model is simply not detailed and clear enough. The Doubao text-to-image model therefore introduces a "Rephraser", which adds more detailed descriptions to the prompt while preserving the user's original intent, so that every user experiences a more polished generation result.
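The idea behind a prompt rephraser can be sketched with a toy rule-based version (Doubao's actual Rephraser is a language model; the detail phrases below are invented for illustration): keep the user's wording intact and append only the detail cues the prompt is missing.

```python
# Toy prompt rephraser: preserve the user's prompt, append missing detail cues.
DETAIL_CUES = {
    "lighting": "soft cinematic lighting",
    "detail": "highly detailed, sharp focus",
    "composition": "balanced composition",
}

def rephrase(prompt: str) -> str:
    extras = [phrase for key, phrase in DETAIL_CUES.items()
              if key not in prompt.lower()]
    return prompt if not extras else prompt + ", " + ", ".join(extras)

print(rephrase("a cat under a tree at dusk"))
```

A real rephraser does this with a language model so the added details stay consistent with the scene, but the contract is the same: the original intent is a prefix, the enrichment is additive.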
To make the model generate images faster and at a lower cost per image, the Doubao team also offered new solutions in model distillation. A representative result is Hyper-SD, a novel diffusion-model distillation framework that compresses the number of denoising steps while maintaining near-lossless performance.
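The core idea of step-compression distillation can be shown on a 1-D toy problem (this is not Hyper-SD's actual trajectory-segmented consistency distillation, just an illustration): for Gaussian data under a variance-preserving schedule, each deterministic DDIM step is a scalar linear map, so a one-step student fitted by least squares reproduces two teacher steps exactly.

```python
import numpy as np

# Variance-preserving schedule: x_t = a_t * x0 + s_t * eps, with a_t^2 + s_t^2 = 1.
def a(t): return np.cos(t)   # signal coefficient
def s(t): return np.sin(t)   # noise coefficient

# For x0 ~ N(0, 1), the optimal noise predictor is eps_hat(x_t, t) = s_t * x_t,
# so one deterministic DDIM step t -> t2 collapses to a scalar linear map.
def ddim_step(x, t, t2):
    eps_hat = s(t) * x
    x0_hat = (x - s(t) * eps_hat) / a(t)     # equals a(t) * x here
    return a(t2) * x0_hat + s(t2) * eps_hat

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)              # samples of x_t at t = 1.2

t, mid, t2 = 1.2, 0.8, 0.4
teacher = ddim_step(ddim_step(x, t, mid), mid, t2)   # two teacher steps

# "Distill": fit a one-step student w * x by least squares on teacher outputs.
w = (x @ teacher) / (x @ x)
student = w * x
print(np.max(np.abs(student - teacher)))     # ~0: lossless on this toy problem
```

Real images are not 1-D Gaussians, which is why practical frameworks like Hyper-SD need consistency-style losses rather than a closed-form fit, but the objective is the same: one student step that lands where several teacher steps would.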
Next, NVIDIA solution architect Zhao Yijia started from the underlying technology, explaining the two most mainstream text-to-image architectures, UNet-based SD and DiT, along with their respective characteristics, and introduced how NVIDIA tools such as TensorRT, TensorRT-LLM, Triton, and NeMo Megatron support model deployment and help large models run inference more efficiently.
Zhao Yijia first gave a detailed explanation of the model principles behind Stable Diffusion, elaborating on the workings of key components such as CLIP, the VAE, and the UNet. As Sora took off, the DiT (Diffusion Transformer) architecture behind it also gained attention, and Zhao Yijia went on to compare SD and DiT across three aspects: model structure, characteristics, and compute consumption.
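The relationship those components rest on can be written in a few lines of NumPy. This is a schematic of the forward noising process and its exact inversion given the true noise; in the actual pipeline the UNet predicts the noise from the noisy latent, the timestep, and the CLIP text embedding, and the VAE maps between pixels and latents.

```python
import numpy as np

rng = np.random.default_rng(42)
x0 = rng.standard_normal((4, 8))        # stand-in for a VAE latent
eps = rng.standard_normal(x0.shape)     # noise added at timestep t
abar = 0.3                              # cumulative alpha-bar at timestep t

# Forward process q(x_t | x_0): mix signal and noise with fixed coefficients.
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps

# Reverse direction: with the noise known (here exactly; in SD, predicted by
# the UNet), the clean latent is recovered by inverting the mixing formula.
x0_rec = (xt - np.sqrt(1 - abar) * eps) / np.sqrt(abar)

print(np.allclose(x0_rec, x0))          # True
```

Sampling simply iterates this inversion over many timesteps with a predicted `eps`, which is why the quality of the noise predictor (UNet or DiT) dominates the quality of the image.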
When generating images with Stable Diffusion, you often find that the prompt words do appear in the result, yet the image is still not what you wanted. This is because text-conditioned Stable Diffusion is not good at controlling image details such as composition, pose, facial features, and spatial relationships. Researchers have therefore designed many control modules on top of Stable Diffusion's working principles to make up for these shortcomings; Zhao Yijia introduced two representatives, IP-Adapter and ControlNet.
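One design detail worth calling out for ControlNet, as described in its original paper, is the "zero convolution": the control branch's output projection is initialized to zero, so at the start of training the control signal leaves the base UNet's features untouched and cannot destabilize the pretrained model. A minimal NumPy sketch, treating 1x1 convolutions as matrix multiplies over the channel dimension:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1x1(x, w):
    """1x1 convolution over the channel dim: x has shape (channels, pixels)."""
    return w @ x

x = rng.standard_normal((8, 16))        # base UNet block features
ctrl = rng.standard_normal((8, 16))     # control-branch features (e.g. edge map)

w_base = rng.standard_normal((8, 8))    # pretrained projection (random stand-in)
w_zero = np.zeros((8, 8))               # ControlNet's zero-initialized conv

out = conv1x1(x, w_base) + conv1x1(ctrl, w_zero)
print(np.allclose(out, conv1x1(x, w_base)))   # True: no effect before training
```

As training proceeds, `w_zero` drifts away from zero and the control signal is blended in gradually, which is what lets ControlNet steer composition and pose without retraining the base model.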
NVIDIA's technical support plays a key role in accelerating inference for compute-intensive text-to-image models. Zhao Yijia introduced NVIDIA's TensorRT and TensorRT-LLM tools, which optimize text-to-image inference through technologies such as high-performance convolution kernels, efficient scheduling, and distributed deployment. Meanwhile, NVIDIA's Ada, Hopper, and upcoming Blackwell hardware architectures all support FP8 training and inference, which will make model training and serving smoother.
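To make FP8 concrete: the E4M3 format commonly used for training and inference keeps 4 exponent bits and 3 mantissa bits, so values carry roughly two significant decimal digits and saturate near ±448. A simulation of rounding a normal-range value to E4M3 (ignoring subnormals and the NaN encoding for brevity):

```python
import math

E4M3_MAX = 448.0   # largest finite E4M3 value

def fp8_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value (normal range only)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), E4M3_MAX)          # saturate instead of overflowing
    m, e = math.frexp(mag)               # mag = m * 2**e, with m in [0.5, 1)
    m = round(m * 16) / 16               # keep 1 implicit + 3 mantissa bits
    return sign * math.ldexp(m, e)

print(fp8_e4m3(1.0), fp8_e4m3(0.3), fp8_e4m3(1000.0))   # 1.0 0.3125 448.0
```

The coarse mantissa is why FP8 pipelines rely on per-tensor scaling factors: values are rescaled into the representable range before quantization so the limited precision lands where the data actually lives.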
After six wonderful live broadcasts, "AIGC Experience Party", jointly launched by Volcano Engine, NVIDIA, and CMO CLUB, has come to a successful conclusion. Through these six episodes, everyone should now have a deeper understanding of how AIGC can go from "interesting" to "useful". We also look forward to "AIGC Experience Party" not merely remaining a topic of discussion, but accelerating the intelligent upgrade of the marketing field in practice.
Replay of all six episodes of "AIGC Experience Party": https://vtizr.xetlk.com/s/7CjTy