
minimax joins the video generation melee: is video generation the endgame for large models?

2024-09-01


another domestic unicorn has joined the fray of video generation models.
on august 31, minimax, one of the usually low-key "six little dragons of ai", opened its doors to the public for the first time with a "minimax link partner day" event in shanghai. at the event, minimax founder yan junjie announced the launch of a video generation model and a music model, and said a new version of its large model, abab7, which can match gpt-4o in speed and effect, will be released in the coming weeks.
the video generation model is publicly named video-1, and minimax did not disclose many of its specific parameters. yan junjie said that compared with other video models on the market, video-1 offers a high compression rate, strong prompt adherence and diverse styles, and can generate native high-resolution, high-frame-rate video. at present video-1 only supports text-to-video; image-to-video, editability, controllability and other features will be added in later iterations.
currently, all users can log in to the conch ai official website to try video-1's video generation. the reporter tried it on site: after entering a simple prompt and waiting about 1-2 minutes, a 6-second video was generated. judging from the output, the picture basically covered the points in the prompt, with high definition and acceptable color; the facial details of characters still leave room for improvement.
during the panel discussion, yan junjie noted that large models are a field that looks very hot but is also full of disagreements, such as "should we go to B or to C, should we focus domestically or overseas, can the scaling law continue..." and so on.
despite these many points of disagreement, video generation may be the one consensus among major model makers this year.
since openai unveiled its video model sora in february this year, many well-known companies have released models of their own. in april, shengshu technology released the video model vidu. in june, kuaishou released the ai video generation model keling. a week later, luma ai released its text-to-video model dream machine. runway announced in early july that its text-to-video model gen-3 alpha was open to all users. during the world artificial intelligence conference, alibaba's damo academy launched xunguang. at the end of july, aishi technology released pixverse v2, and then zhipu officially released its qingying video model. in early august, bytedance's ai video app landed on the app store...
a year ago, there were very few public-facing video models on the market. in just a few months, we have witnessed the emergence of dozens of video generation models. an industry insider lamented that the past year has been a historic moment for ai video generation.
in the interview, the china business network reporter asked why minimax needed to move into video generation. yan junjie said the fundamental reason is that information in human society is mostly carried in multimodal content. "most of the content we see every day is not text but dynamic content. open xiaohongshu and it is full of images and text; open douyin and it is full of video; even when you open pinduoduo to shop, most of what you see is pictures." in daily life, text interaction is only a small part; voice and video interaction are far more common.
therefore, to achieve very high user coverage and deeper usage, a large model maker must be able to output multimodal content rather than just plain text. yan junjie described this as a core judgment.
"before, we produced text, then sound, and we produced pictures a long time ago. now that the technology has become more advanced, we can also produce videos. this route is consistent, and we must be able to do multimodality." yan junjie said.
however, the video generation track is hard. the fact that openai still has not officially released sora to the public since the start of the year gives a glimpse of the industry's challenges.
on the one hand, current video generation results are far from meeting users' expectations. models do not understand physical rules, and the generation process is hard to control. video, image and 3d generation algorithms run into many structural and detail problems, such as an extra or missing limb, or a hand merging into a body. refined videos, especially ones that must obey physical rules, are still difficult to generate.
during the interview, yan junjie also said that "this is quite difficult"; otherwise the many companies claiming to do it would have done it long ago. video work is more complex than text work: the context of a video is naturally very long, with a single video involving inputs and outputs on the order of tens of millions, which is inherently hard to process. second, video data is very large. a 5-second video takes up several megabytes, while the text read in 5 seconds is about 100 words, perhaps less than 1 kb of data, a storage gap of several thousand times.
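a rough back-of-envelope sketch, using assumed bitrate and text-size figures (not numbers provided by minimax), illustrates the storage gap yan junjie describes:

```python
# Illustrative comparison (assumed figures): a 5-second clip at a typical
# ~4 Mbps codec bitrate versus ~100 words of plain text read in 5 seconds.

video_bitrate_bps = 4_000_000            # assumed H.264-class bitrate, bits/second
video_seconds = 5
video_bytes = video_bitrate_bps * video_seconds / 8   # ~2.5 MB

words = 100                              # roughly what a person reads in 5 seconds
avg_bytes_per_word = 6                   # ~5 characters plus a space, ASCII text
text_bytes = words * avg_bytes_per_word  # ~600 B, well under 1 KB

ratio = video_bytes / text_bytes
print(f"video ≈ {video_bytes/1e6:.1f} MB, text ≈ {text_bytes} B, gap ≈ {ratio:,.0f}x")
# on the order of a few thousand times, consistent with "several thousand"
```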
"the challenge here is that the underlying infrastructure that was previously built based on text, how to process data, how to clean data, and how to label data, are not very applicable to videos." yan junjie believes that the infrastructure needs to be upgraded, and the second thing is patience. there are many open source resources for text, and if you do it based on open source, your own research and development will be faster. if you make videos, there are not so many open source contents, and you will find that a lot of content needs to be redone after it is produced, which requires more patience.
an industry practitioner previously told reporters that video generation today is a bit like image generation on the eve of its 2022 breakout. after stable diffusion was open-sourced in august 2022, aigc image generation began to explode, but no comparably powerful "open-source sora" has yet appeared in video generation, and everyone still has to feel their way forward.
qiming venture partners released its "top ten outlooks for generative ai in 2024" in july. one prediction is that video generation will explode within three years. the firm believes that, combined with 3d capability, controllable video generation will change the production model of film, animation and short video. in the future, the compression rate of image and video latent-space representations will increase by more than five times, raising generation speed by more than five times.
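as a rough illustration of why a higher latent-space compression rate can translate into faster generation, the sketch below uses assumed resolution and downsampling numbers (not figures from the qiming report) and the simplifying assumption that generation cost scales roughly linearly with the number of latent tokens:

```python
# Assumed example: a 720p frame encoded into latent tokens by a VAE-style encoder.

frame_pixels = 1280 * 720
old_compression = 64                      # e.g. pixels per latent token today (assumed)
new_compression = old_compression * 5     # the report's "more than five times" improvement

old_tokens = frame_pixels // old_compression   # 14,400 latent tokens per frame
new_tokens = frame_pixels // new_compression   # 2,880 latent tokens per frame

speedup = old_tokens / new_tokens
print(f"{old_tokens} -> {new_tokens} tokens per frame, ~{speedup:.0f}x faster "
      f"(assuming per-token cost dominates)")
```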
(this article comes from china business network)