
The free public beta is overwhelming its servers, and the product is winning praise for its sense of physics compared with Sora.

2024-07-24


Author: Zimo

Following Sora, Runway, and Pika, another AI product that generates video from images or text has taken off: Dream Machine.

Behind Dream Machine is Luma AI, an American company founded in 2021. It has closed three funding rounds over the past three years, raising a total of US$67.3 million. The most recent, a US$43 million Series B in January this year, was led by the well-known venture capital firm a16z, with Nvidia participating for the second time, and the post-money valuation reached US$200-300 million.


In June this year, Dream Machine opened a free public beta worldwide. Each user gets 30 free video generations per month, with each video lasting 5 seconds. To compete with earlier entrants, it emphasizes efficiency, physics, and camera movement. One headline feature is that it can generate 120 frames of video in just 120 seconds (although so many people queued during the public beta that users generally reported waits of 10-20 minutes per video, and some as long as 2 hours). It simulates the physical world and also stresses character consistency, and its natural camera movements make the footage smoother and more realistic while matching the emotion being expressed. Users' brainstorming fills the generated videos with creativity and imagination, and the tool has already helped cut costs and raise efficiency in advertising, teaching and training, story creation, and other fields.

Which AI video generation product is the best?

Dream Machine's interface is intuitive and simple, offering two functions: text-to-video and image-to-video. For text-to-video, English prompts produce better results. To get a video that matches what you have in mind, write the most accurate and detailed description you can, and add a few words about emotion or mood to make the result feel more lifelike.
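As an illustration only (the wording below is invented for this article, not an official Dream Machine example), a detailed English prompt with some emotional color might look like:

    A lone lighthouse keeper climbs a spiral staircase at dusk, handheld camera following close behind, warm lantern light flickering on stone walls, quiet and melancholy mood, slow upward tilt as he reaches the top.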

For users who are less comfortable writing prompts, however, the image-to-video function is likely to be more popular, since it works more like a second pass over an existing work. Just upload a picture and add a text description of the scene you have in mind, and the static image starts to move, telling its story as a video.

On Twitter, users share all kinds of creative clips: playful ones that bring the Mona Lisa to life, selfies animated to recreate the moment they were taken, and touching ones that "revive" people who have passed away to replay a shared scene. AI creation tools, combined with users' rich imagination, have given these works new vitality.

In this field, head-to-head comparison is unavoidable. Architecturally, Dream Machine and Sora both use the Diffusion Transformer architecture, so they are the most closely related. In terms of generated content, Dream Machine differentiates itself from Runway and Pika with a larger range of motion and more frequent, faster camera-angle changes, rather than merely making objects in the frame move slightly. However, since the model is still at an early stage, controllability problems show up: in user tests, for example, animals sometimes sprouted extra heads when the camera angle changed. Overall, there is still plenty of room to improve both the data and the models.

In terms of single-clip generation time, Dream Machine produces a 5-second video in 120 seconds; Runway is faster, generating a 10-second video in 90 seconds, and its latest version can extend a clip to 18 seconds; Pika can only generate 3 seconds at a time. Sora, the trend-setter, has broken the duration barrier and can generate a 1-minute video, but rendering takes close to an hour. On pricing, once the free beta ends Dream Machine is the most expensive overall; Pika's professional tier costs 6 times its standard tier, while the other products sit at around 2-3.5 times.


(AI video generation product price comparison)

Finally, looking at the generated output itself, different products render the same text in very different styles. Compared with the alternatives, the impressions users most often report with Dream Machine are a cinematic feel and physical realism: its videos show a stronger sense of camera work and immersion. There are two likely reasons. One is that a large number of film clips were used to train the model, which makes the output imaginative; it is not limited to the elements of the original image but adds extra scenes, and it even animates characters and adds mouth movements to make them more lifelike. The other is closely tied to the technology and experience the company behind it has accumulated in 3D modeling.

Text-to-3D miniatures are the result of accumulated technology

Luma AI has focused on 3D content generation since its founding. Its earlier 3D model application, Genie1.0, was a global hit. It is available as a web app on PC, as a mobile app (called Luma AI), and on Discord, which is widely used overseas.

Just enter a text description and, in about 10 seconds, it generates four realistic 3D models that look like small figurines. After picking the one you prefer, you can edit the texture yourself, choosing between original, smooth, and reflective finishes. Finally, the model can be exported in formats such as FBX, glTF, and OBJ, plugging straight into other 3D tools (such as Unity and Blender) so it can be rigged and animated, which suits games, animation, and similar scenarios and genuinely enables downstream use.
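As a minimal sketch of that downstream handoff, the Python snippet below imports an exported glTF/GLB file into Blender through its built-in bpy API so the model can then be rigged or animated; the file path is a placeholder, and the export from Genie1.0 is assumed to have been done separately.

    import bpy  # Blender's built-in Python API; run inside Blender's scripting workspace

    # Placeholder path to a model exported from Genie1.0 in glTF/GLB format.
    MODEL_PATH = "/path/to/exported_model.glb"

    # Import the file with Blender's bundled glTF 2.0 importer.
    bpy.ops.import_scene.gltf(filepath=MODEL_PATH)

    # List the newly imported objects so they can be rigged or animated next.
    for obj in bpy.context.selected_objects:
        print(obj.name, obj.type)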


Genie1.0's low technical barrier also lets users reconstruct 3D scenes simply by shooting video. Following the instructions, you film the object through a full 360 degrees from three heights: level with it, looking down, and looking up. After uploading and waiting a few minutes, Genie1.0 turns the footage into a 3D reconstruction.

Technically, Luma AI can be said to have pushed NeRF (neural radiance fields) to the limit. Traditional NeRF needs a large number of photos taken with professional equipment, with camera positions strictly controlled. Now that the underlying code is open source, more and more simplified variants have appeared, and the number of photos and shooting angles required has dropped sharply. Genie1.0 goes a step further, turning NeRF into something that, with a little guidance, can be used anytime and anywhere.
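For context, the core idea of NeRF (Mildenhall et al., 2020) is to represent a scene as a learned function that maps a 3D position and viewing direction d to a color c and a density \sigma, and to render each pixel by volume rendering along the camera ray r(t) = o + t d:

    C(r) = \int_{t_n}^{t_f} T(t)\, \sigma(r(t))\, c(r(t), d)\, dt,
    \qquad T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma(r(s))\, ds\Big)

Fitting this function well is what traditionally demanded many carefully positioned photos, and that requirement is exactly what the simplified variants, and Genie1.0 in turn, work to relax.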

The company's accumulated 3D technology and products eased its shift from 3D generation to video generation, and conversely, video generation has created excellent conditions for 3D. In Luma AI's view, building a video product is really about adding the time dimension to 3D in order to get to 4D, with video playing an intermediate role.

Putting Genie1.0 and Dream Machine together: the former builds 3D models from multi-angle video, while the latter draws on that accumulation of 3D models to generate better videos. Because 3D data is scarce compared with images and video, better 3D creation needs more data to drive large models. To reach the ultimate goal of 4D, multi-view data is collected from generated videos and then used to produce 4D effects, closing the loop into a complete chain.

Where does the way forward lie?

Since the start of this year, the AI video generation race has grown crowded; internet giants in particular have staked out positions, whether through self-developed models or investments. As more players enter the market, problems have surfaced, chiefly around the controllability and consistency of generated videos.

Both problems show up mainly when the camera angle changes, as in the multi-headed-animal case above, and in portrait scenes: facial expressions and details shift quickly and are hard to capture, so when a face turns in a video it can deform, or even become a different face a second later. This is one reason video length is limited; the longer the generated video, the harder it is to keep it consistent.


(Multiple animal heads appear in the generated video)

This pain point troubles many developers. There is no perfect solution yet, but their work shows they are already pushing in this core direction. VideoCrafter2, developed by Tencent AI Lab, uses low-quality video data to secure the motion consistency of objects in the frame, and Vimi, the character generation model launched by SenseTime, can accurately imitate a character's micro-expressions, focusing on character consistency and controllability.

As for the target audience, AI video generation products currently aim mainly at consumer (C-end) users. At this stage, users are probing how playful and creative the new tools are, but as products multiply and the craze fades, monetization will rely more on business (B-end) customers. These products are already driving steady growth in API demand, giving downstream companies more options, whether they rework the generated videos or use them directly, which sharply cuts the time and cost of creation.
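As a rough sketch of what such downstream API integration usually looks like (the endpoint, field names, and authentication below are illustrative assumptions, not any specific vendor's documented API), a client typically submits a prompt, polls until the job finishes, and downloads the resulting clip:

    import time
    import requests  # third-party HTTP client: pip install requests

    API_BASE = "https://api.example-video-vendor.com/v1"   # hypothetical endpoint
    HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}      # hypothetical auth scheme

    # 1. Submit a generation job with a detailed English prompt.
    job = requests.post(
        f"{API_BASE}/generations",
        headers=HEADERS,
        json={"prompt": "A slow dolly shot of a paper boat drifting down a rainy street, warm evening light"},
    ).json()

    # 2. Poll until the job finishes (real services may offer webhooks instead).
    while True:
        status = requests.get(f"{API_BASE}/generations/{job['id']}", headers=HEADERS).json()
        if status["state"] in ("completed", "failed"):
            break
        time.sleep(10)

    # 3. Download the clip for reprocessing or direct downstream use.
    if status["state"] == "completed":
        with open("clip.mp4", "wb") as f:
            f.write(requests.get(status["video_url"]).content)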

In addition, Kuaishou and Bona recently launched China's first AIGC original short drama, upending the creative mindset of the traditional film and television industry. Combining these two emerging hot tracks has opened new application scenarios for AI video generation, and more will follow. Although both are still early and neither the technology nor the products are mature, a "joint brand" riding two tailwinds and drawing on two dividends is bound to accelerate the industry's development.

Innovation in AI creation products has brought endless creativity and surprises to everyday life while lowering the difficulty and cost of production. Judging from today's products, both text-to-video and image-to-video have spawned genuinely interesting, novel ways to play, and personal creativity is the key to getting better output from the AI. Technical problems still cause occasional glitches, and the product's form depends largely on what the model can actually do, but with iterative updates, healthy market competition, and the combination of adjacent tracks, I believe the models will keep being trained toward something closer to perfect. We also look forward to domestic large-model products carving out their own place in the global market.