
Doubao's "king bomb": ByteDance releases two video generation models in one day

2024-09-24


ByteDance has officially entered AI video generation. On September 24, ByteDance's Volcano Engine held an AI innovation tour event in Shenzhen, releasing two large models, Doubao Video Generation-PixelDance and Doubao Video Generation-Seaweed, and opening invitation-only testing for the enterprise market.
The video generation results demonstrated at the event were impressive. In semantic understanding, complex interactions among multiple moving subjects, and content consistency across multi-camera switching, Doubao's video generation models reached the industry's advanced level. Tan Dai, president of Volcano Engine, said: "There are many difficulties in video generation that still need to be overcome. Doubao's two models will continue to evolve, explore more possibilities in solving key problems, and accelerate the expansion of AI video creation and its real-world application."
Figure: Tan Dai, president of Volcano Engine, releasing the Doubao video generation models
Innovative technology to solve multi-subject interaction and consistency
Previously, most video generation models could only complete simple instructions, but the Doubao video generation models achieve natural, coherent multi-shot action and complex interactions between multiple subjects. A creator who tried the model early found that its videos could not only follow complex instructions, letting different characters carry out sequences of interacting actions, but also keep each character's appearance, clothing details, and even headwear consistent across different camera movements, approaching live-action quality.
According to Volcano Engine, the Doubao video generation models are based on the DiT architecture. Through an efficient DiT fusion computing unit, generated video can switch freely between large subject motion and camera movement, with multi-shot camera-language capabilities such as zoom, orbit, pan, and target tracking. A newly designed diffusion model training method overcomes the consistency problem of multi-shot switching: when the shot changes, the subject, style, and atmosphere all remain consistent. This is a distinctive technical innovation of the Doubao video generation models.
Refined through continuous iteration in business scenarios such as Jianying and Jimeng AI, the Doubao video generation models deliver professional-grade lighting and color harmony, with visually striking, realistic output. A deeply optimized Transformer structure greatly improves generalization, supporting styles such as 3D animation, 2D animation, Chinese painting, black-and-white, and impasto, and adapting to the aspect ratios of devices ranging from movie screens and televisions to computers and mobile phones. The models suit corporate scenarios such as e-commerce marketing, animation education, urban cultural tourism, and micro-dramas, and can also provide creative assistance to professional creators and artists.
Currently, the new Doubao video generation models are being tested on a small scale in the beta version of Jimeng AI, and will gradually open to all users. Chen Xinran, marketing director of Jianying and Jimeng AI, believes that AI can interact deeply with creators and create alongside them, bringing many surprises and inspirations. Jimeng AI hopes to become users' most intimate and intelligent creative partner.
Doubao large model launches the industry's highest concurrency standard
At the event, the Doubao large model family gained not only video generation but also a Doubao music model and a simultaneous interpretation model, now covering the language, voice, image, and video modalities and meeting the business requirements of different industries and fields.
As product capabilities improve, usage of the Doubao large models is also growing rapidly. According to Volcano Engine, as of September, the Doubao language model's average daily token usage exceeded 1.3 trillion, a tenfold increase since its first release in May; multimodal processing volume also reached 50 million images and 850,000 hours of speech per day.
Earlier, the Doubao large model announced prices 99% lower than the industry average, leading a wave of domestic large-model price cuts. Tan Dai believes model pricing is no longer a barrier to innovation: as enterprises deploy at scale, large models that support greater concurrent traffic are becoming a key factor in the industry's development.
According to Tan Dai, many large models in the industry currently support a maximum TPM (tokens per minute) of only 300k or even 100k, which struggles to carry the traffic of enterprise production environments. For example, the TPM peak in one research institution's literature translation scenario is 360k, a car smart cockpit's peak is 420k, and an AI education company's peak is as high as 630k. The Doubao large model therefore supports a default initial TPM of 800k, far above the industry average, and customers can flexibly expand capacity on demand.
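The capacity argument above can be sketched with a quick back-of-the-envelope check: compare a workload's peak token throughput against a TPM quota. This is a minimal illustration using the figures cited in the article; the function names and the request-rate breakdown are hypothetical, not part of any Doubao API.

```python
# Minimal sketch: does a workload's peak token throughput fit under a TPM quota?
# All quota and scenario numbers below are the ones cited in the article;
# the helper functions are illustrative, not a real Doubao/Volcano Engine API.

def peak_tpm(requests_per_minute: float, avg_tokens_per_request: float) -> float:
    """Estimate peak tokens-per-minute from request rate and request size."""
    return requests_per_minute * avg_tokens_per_request

def fits_quota(workload_tpm: float, quota_tpm: float) -> bool:
    """True if the workload's peak TPM stays within the quota."""
    return workload_tpm <= quota_tpm

DOUBAO_DEFAULT_TPM = 800_000  # default initial quota cited in the article
COMMON_INDUSTRY_TPM = 300_000  # typical industry cap cited in the article

# Peak TPM figures for the three scenarios mentioned in the article.
scenarios = {
    "literature translation": 360_000,
    "car smart cockpit": 420_000,
    "AI education": 630_000,
}

for name, tpm in scenarios.items():
    print(f"{name}: {tpm:,} TPM | "
          f"fits 300k quota: {fits_quota(tpm, COMMON_INDUSTRY_TPM)} | "
          f"fits 800k quota: {fits_quota(tpm, DOUBAO_DEFAULT_TPM)}")
```

Under this arithmetic, all three cited workloads exceed a 300k cap but fit comfortably within the 800k default quota.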
"Through our efforts, the cost of applying large models has been largely resolved. The industry must move from competing on price to competing on performance, delivering better model capabilities and services," said Tan Dai.
Yidan Xiaofeng