
This Chinese company has just pushed AI video into the era of "GC for all"

2024-07-24


Hengyu from Aofei Temple
Quantum Bit | Public Account QbitAI

Lower threshold, higher quality, more logical, and longer duration.

These "more" made thePixVerse V2, a new domestic AI video product, the popularity went up immediately.

And its origin is very eye-catching:

It comes from Aishi Technology, one of the hottest domestic startups in this field; the company completed two rounds of financing in the first half of this year alone.

Let’s take a look at the key "new ideas" of PixVerse V2:

On the model side, it adopts a DiT (Diffusion + Transformer) architecture and applies original techniques in several areas to significantly improve generation quality.

For example, by introducing a spatiotemporal attention mechanism, it achieves video generation with larger and more natural motion.
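
Since the article leans on the term DiT, here is a minimal sketch of what one DiT-style block computes, following the public DiT recipe (self-attention plus an MLP, each modulated by adaptive layer norm driven by a conditioning embedding). Aishi has not published its internals, so every name and dimension below is an illustrative assumption.

```python
# A minimal DiT-style block sketch (public DiT recipe; not Aishi's actual design).
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # The conditioning (e.g. diffusion timestep) produces per-block scale/shift/gate.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cond: (batch, dim)
        s1, b1, g1, s2, b2, g2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

blk = DiTBlock()
x = torch.randn(2, 16, 512)   # 16 latent patch tokens
c = torch.randn(2, 512)       # timestep/condition embedding
print(blk(x, c).shape)        # torch.Size([2, 16, 512])
```

Stacks of such blocks over latent patch tokens are what replaced the U-Net backbone of the pre-Sora mainstream, a shift the article returns to later.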

Look at the little alpaca below, surfing happily; with Llama 3.1 released and an instant hit today, it's quite fitting.



In terms of video quantity and quality, it supports one-click generation of up to 5 consecutive video clips.

Consistency of the main subject, visual style, and scene elements is automatically maintained across the clips.



In addition, Aishi's official introduction says the new product lowers the prompt-writing threshold yet again.

Whether or not you've ever studied prompt techniques, simply describing the picture you want in clear, concise terms is enough. That applies in Chinese as well.

On top of that, the several videos generated in one go stay consistent in style, subject, and scene.

Making a short video now requires neither shooting footage nor editing the clips yourself.

Generate with one click and upload straight to any platform for sharing. Amazing!



Both quality and quantity are guaranteed, and the threshold is lowered again and again.

Thanks to companies like PixVerse, Runway, and Luma competing so fiercely, AI video creation has entered an "everyone can play" era.

Generate up to 5 videos at once, so creativity can keep going

But wait!

We're never easily dazzled by the demos companies put out.

So, upon discovering that PixVerse V2 went live this morning, QbitAI immediately ran a hands-on test.

Head to the PixVerse official website, then open PixVerse V2 in the left menu bar.

It currently supports two generation modes, text-to-video and image-to-video; in practice you can use either alone or combine them.

Type your prompt in the text box, and click the yellow-boxed area in the screenshot below to upload an image.



In the lower right corner of the input box, a gray-boxed area offers 5s/8s options, letting you choose each generated clip's length as needed.

The Scene area, marked by the green box, refers to the individual video clips to be generated.

It is indeed as the official introduction says: at most 5 videos can currently be added for generation, namely Scene 1 through Scene 5.



All Scene clips follow the style of Scene 1; even if later Scenes upload their own reference images, PixVerse will redraw them to match Scene 1's image style.

In short, it does its utmost to keep the five videos stylistically consistent.

In addition, prompts and reference images can be entered separately for each Scene.

When you're done, click the star button in the lower right corner of the input box to start generating.

From our testing, no matter how many Scenes are generated, each generation costs 50 Credits (PixVerse V2's compute currency).

Sticking to the principle of keeping prompts as simple as possible, the five prompts we entered were:

  1. In the morning, a little white rabbit gets dressed in bed
  2. The little white rabbit walks to work, passing a garden
  3. The little white rabbit holds a cup of steaming coffee
  4. The little white rabbit, coffee in hand, waits in line for the elevator
  5. Having quit its job, the little white rabbit hops down the road

Although each video can be fine-tuned after generation (adjusting subject, scene, action, and camera movement), we made no interventions, keeping everything in its original flavor.



The generated results are as follows:

△ For a better viewing experience, this video's playback has been sped up 2.5x

The 5 clips come pre-spliced into one video, and you can download the full version directly. Very convenient.

It's a bit funny: in the video, the freshly resigned little white rabbit whips off its work clothes, taking not a shred of office air with it.

Having played around to this point, I, a budget-minded worker, made a wonderful discovery I must share with you:

If you only want to generate a single clip each time, you can simply trim the options in PixVerse V2 down to Scene 1 alone. Call this method 1.

But there is another way (method 2): entering a different mode of PixVerse V2 through a different entrance.

After asking around the office, everyone agreed that for generating a single clip they'd prefer the latter.

Why?

First, method 2 offers more adjustments, such as aspect ratio and video style. The more information you give about what you want, the more likely the model understands you, and the more likely the generated video suits your needs.

Second, a quick calculation shows that method 1 costs 50 Credits per generation whether it yields 1 clip or 5, while method 2 costs only 30 Credits per generation.

Save more money, friends!
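
As a sanity check, the pricing we observed can be captured in a few lines (assuming the Credit costs above stay as we measured them):

```python
# Credit cost per the pricing observed in our test (assumed stable):
# method 1 (multi-Scene entry) bills a flat 50 Credits per generation;
# method 2 (single-clip entry) bills 30 Credits per clip.
def credits_needed(clips: int, method: int) -> int:
    if method == 1:
        return 50          # flat fee, whether 1 Scene or 5
    return 30 * clips      # per-clip billing

print(credits_needed(1, 1), credits_needed(1, 2))  # 50 vs 30: method 2 wins for one clip
print(credits_needed(5, 1), credits_needed(5, 2))  # 50 vs 150: method 1 wins for five
```

So method 2 is the economical choice only for a single clip; from 2 clips onward, method 1's flat 50-Credit fee wins.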



Quick, jot method 2's workflow down in your mental notebook:

Click Text to Video in the left menu bar, then select PixVerse V2 in Model.

From there you can do text-to-video generation.



By adding words such as "Anime" or "Realistic" to the prompt, you can change the style of the generated content.

Let's raise the difficulty a notch and generate a scene that doesn't exist in the real world. Input prompt:

The Marshmallow Giant strolls through the colorful Marshmallow Forest.

The generated result:



OK, OK, unbelievable. I didn't expect a description as abstract as "marshmallow giant" to actually work!

My guess is that PixVerse V2 has significantly optimized its semantic understanding.

The image-to-video function can be tried in a similar way.

Click Image to Video in the left menu bar and select PixVerse V2 in Model.



A small pity: the "paint it and it moves" motion brush (the new AI video feature Aishi launched just last month) cannot currently be used on videos generated in image-to-video mode.

QbitAI asked the PixVerse team and learned that the motion brush will soon be available in V2 as well.

Runway's and PixVerse's motion brushes have both been widely praised, since they compensate for the limits of prompt descriptions and make on-screen motion more controllable.

If this feature is launched in PixVerse V2, I think everyone will have more fun playing it, and the movement of people/objects in the video will be more in line with the laws of physics.



Since people or animals "walking the catwalk" has always been a staple of AI video muscle-flexing (though we're not sure why), this time, when testing PixVerse V2's image-to-video function, we cranked the difficulty right up and made an astronaut parkouring down the street.

Input reference image:



The generated result:



This task is a bit like stacking buffs: a dynamic generation task producing non-realistic content from a picture.

It demands that the model behind it have strong visual comprehension.

Judging by the results, PixVerse V2 handles continuous video creation, text-to-video, and image-to-video with ease.

Finally, a note: whether from text or image, each 5s/8s video generated costs 30 Credits.

That said, generation is fairly fast and the quality is stable, so those 30 Credits honestly feel well spent.

Backed by an updated DiT base model

In the AI video track, this year's most fiercely contested arena, Aishi has suddenly pulled a different move.

While Sora alternatives worldwide are busy extending duration, improving image quality, and reducing difficulty, Aishi has gone a step further in lowering the threshold.

Not only do prompts no longer need to be professional-grade; more importantly, you can create up to 5 videos at a time, each 8 seconds long.

Consistency of style, subject, and scene across these 1 to 5 clips is guaranteed, and, following the logic linking the prompts of each clip, they are finally stitched into a long video of about 40 seconds.

The kind with coherent plot and consistent content.

It is claimed to offer "smooth motion and rich details", with picture quality reaching 1080p.



In other words, users need only think of what they want, type in the prompt, and wait for a video anywhere from 10s to 40s long.

Not only do your ideas "land in the video" as natural, coherent clips, but the production process saves time and effort, greatly improving creative efficiency.

After PixVerse V2 was released, netizens started using it quickly.

The emergence of PixVerse V2 has enabled many people who had never used AI video tools, or had never even made a video, to go from 0 to 5 in generated clips and from 0 to 1 in finished works.

The right to use AIGC tools has once again been democratized.

Expanding AIGC tools beyond the professional circle is achieved through technological iteration and updates.

Behind PixVerse V2 are Aishi Technology's iterative updates to its self-developed base model on the DiT architecture.

This is also the core technology behind PixVerse.

To recap, QbitAI combed through Aishi's earlier public statements and Wang Changhu's speeches and found that the company initially took the Diffusion + U-Net route, the mainstream AIGC practice before Sora arrived. But as parameters scaled up and instructions grew more complex, U-Net started to fall short.

Therefore, Aishi started experimenting with the DiT architecture very early (before Sora appeared) and used scaling laws to improve model performance.

Having turned the wheel early, Aishi was not caught off guard by Sora's arrival. On the contrary, with the route's correctness validated, Aishi has moved noticeably faster this year.



So, what are the updates to the DiT base model of PixVerse V2?

The first point is about Diffusion spatiotemporal modeling.

Aishi has built a unique spatiotemporal attention modeling mechanism that it calls "more reasonable", superior to both spatial-temporal separation and full-sequence (fullseq) architectures.

This mechanism has a better perception of time and space, and is better at handling complex scenes.
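
The mechanism itself is proprietary, but the two baselines the article says it beats are standard, so a brief sketch of both makes the claimed trade-off concrete. All class names and shapes here are illustrative assumptions written in PyTorch.

```python
# The two standard baselines the article compares against (not Aishi's design).
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """Spatial-temporal separation: attend within each frame, then within each patch track."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        s = x.reshape(b * t, p, d)                       # each frame attends over its patches
        s = self.spatial(s, s, s, need_weights=False)[0].reshape(b, t, p, d)
        m = s.permute(0, 2, 1, 3).reshape(b * p, t, d)   # each patch attends across frames
        m = self.temporal(m, m, m, need_weights=False)[0]
        return m.reshape(b, p, t, d).permute(0, 2, 1, 3)

class FullSequenceAttention(nn.Module):
    """fullseq: every (frame, patch) token attends to every other token."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, p, d = x.shape
        flat = x.reshape(b, t * p, d)                    # O((t*p)^2) cost, global receptive field
        return self.attn(flat, flat, flat, need_weights=False)[0].reshape(b, t, p, d)
```

Factorized attention is cheap, but each token only ever sees its own frame or its own patch track; full-sequence attention sees everything at quadratic cost in frames times patches. A design "more reasonable" than both would presumably aim at global spatiotemporal context without the full quadratic bill.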

The second point is about text comprehension.

PixVerse V2's prompt understanding is significantly enhanced. This comes from using a multimodal model to better align text with video information, so the generated result is exactly what the creator had in mind.
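
The article does not name the multimodal model, so the snippet below is only a generic illustration of the alignment idea: video tokens query the encoded prompt through cross-attention at every block, so text guidance reaches the whole clip.

```python
# A generic text-video cross-attention sketch (illustrative, not Aishi's disclosed wiring).
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, n_video, dim) act as queries;
        # text_tokens:  (batch, n_text, dim) act as keys/values,
        # so every video token can "read" the prompt.
        h = self.norm(video_tokens)
        out = self.cross(h, text_tokens, text_tokens, need_weights=False)[0]
        return video_tokens + out

layer = TextCrossAttention()
v = torch.randn(2, 64, 256)   # video latent tokens
txt = torch.randn(2, 12, 256) # encoded prompt tokens
print(layer(v, txt).shape)    # torch.Size([2, 64, 256])
```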

Third, for higher computing efficiency, PixVerse V2 adds loss weighting on top of the traditional Flow model, allowing the model to converge faster and better.
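
What exactly is weighted has not been disclosed; as a hedged illustration, here is plain flow matching with one common, assumed choice of timestep weighting, just to make the idea concrete (`model` stands for any network predicting velocity).

```python
# Flow matching with an assumed timestep-dependent loss weight (illustrative only).
import torch

def weighted_flow_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """x1: clean latents (batch, ...). Linear-interpolation flow matching."""
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device).view(b, *([1] * (x1.dim() - 1)))
    x0 = torch.randn_like(x1)            # noise sample
    xt = (1 - t) * x0 + t * x1           # point on the straight path
    target = x1 - x0                     # constant velocity of that path
    pred = model(xt, t)
    # Emphasize mid-trajectory timesteps, which are typically hardest (assumed weighting).
    w = 4 * t * (1 - t)                  # peaks at t = 0.5
    return (w * (pred - target) ** 2).mean()
```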

One more point: the R&D team behind PixVerse V2 designed a better 3D VAE model.

It introduces a spatiotemporal attention mechanism to improve video compression quality, and applies continual learning to further improve compression and reconstruction results.
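
For orientation only, a garden-variety 3D video VAE encoder looks like the sketch below; where Aishi places its spatiotemporal attention and how its continual-learning regime works are not public, so none of this is their actual design.

```python
# A minimal 3D (video) VAE encoder sketch: Conv3d downsampling in space and time.
import torch
import torch.nn as nn

class Video3DVAEEncoder(nn.Module):
    def __init__(self, in_ch: int = 3, base: int = 64, z_ch: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, base, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(base, base * 2, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(base * 2, 2 * z_ch, kernel_size=3, stride=1, padding=1),
        )  # halves H/W twice and T once; outputs mean and log-variance

    def forward(self, video: torch.Tensor):
        # video: (batch, channels, frames, height, width)
        mu, logvar = self.net(video).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return z, mu, logvar

enc = Video3DVAEEncoder()
clip = torch.randn(1, 3, 8, 64, 64)
z, mu, logvar = enc(clip)
print(z.shape)  # torch.Size([1, 8, 4, 16, 16]): time and space both compressed
```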



The "simple and fun" trend of AI-assisted UGC

AIGC is arguably the most talked-about topic this year.

But the ability to apply AIGC still rests with a small group of people: programmers, designers, and other professionals.

AIGC has not yet entered the stage of universal "GC" like UGC.

Faced with such a situation, what Aishi Technology has done in the past year can be summarized as follows:

  • Continuously improving AI technology capabilities
  • Expanding who gets to do the "G (Generate)"
  • Raising the quality of the "C (Content)"

This is reflected not only in PixVerse V2 but also in its earlier moves.

Reviewing the record, the release of PixVerse V2 is actually the third move the company has made on AI video features and products this year.

In January this year, Aishi officially launched the web version of its video product PixVerse, whose monthly visits quickly passed one million.

In April, it released the C2V (Character to Video) feature, built on its self-developed video model and usable on the web.

By accurately extracting character features and embedding them deep into the video generation model, PixVerse can lock onto a character, taking a first step toward solving the consistency problem in AI video creation.

In June came the Magic Brush motion brush: paint on a video frame and you can precisely control the movement and direction of elements in it.

Aishi was also the first AI video generation company after Runway to release such a feature.



Three moves in half a year is hardly infrequent, yet the first two seemed rather low-key.

This may be related to the startup company's desire to concentrate on polishing its work, or it may be related to the low-key personality of team leaders such as Wang Changhu. We don't know.

The result is that many people know Aishi Technology leads the domestic AI video track, but may not know why it leads, or whether its products are actually easy to use.

Now that PixVerse V2 is here, professionals and non-professionals of all ages can try it for themselves and find that it really does work well. That is one reason PixVerse V2 became a hit the moment it launched.

Looking back over these moves, it's easy to see that all of the product capabilities center on one theme: making AI video creation more practical and simpler.

It's also clear that the earlier product capabilities focused on the experience of professional users.

This is also consistent with Wang Changhu's previous statement. He once said:

We hope that AI native video can be integrated into the production and consumption links of the content industry.

But PixVerse V2 is different. This generation of products focuses on how to enable more ordinary people to get started with AI video creation.

After all, useful and easy as Magic Brush is, it presumes the user has already generated an AI video.

Prompting for video is harder than for text generation or text-to-image, and it is often the stumbling block for ordinary people trying AI video.

PixVerse V2 hits many of the right notes:

It lowers the cost of AI video creation wherever it can: easier prompts, fine-tuning options, wider boundaries for generated content, and no post-editing needed.

What will be the result?

Everyone gets a chance and everyone can take part, turning wild imagination into visible video works.

With such a strong sense of participation, more people, perhaps everyone, can unleash their creativity and join in AI video creation.

In the long run, the UGC ecosystem of the AI era will gradually take shape, one that is simpler and more fun than traditional UGC.

I have seen an interesting meme before, and I believe many of my friends have also seen it:



“PixVerse is honored to be in the first row, alongside the best video generation products of the time, such as Runway, Pika, and SVD, and it is the only Chinese company in this picture,” Wang Changhu once joked about the image. “But on the other hand, there is a giant ahead of us that we still need to surpass.”

It is undeniable that AI video is the focus of the multimodal track in the AI 2.0 era, especially after Sora created a huge wave.

The full-throttle enthusiasm of giants, big companies, and startups alike illustrates one thing:

AI video is broadening and stimulating market potential, and innovation driven by AI multimodal large models is growing.

The reason Aishi appears on this meme, as the only Chinese company in the picture, is quite clear.

On the one hand, Aishi Technology's model technology, and the product results grown on its self-developed base model, have earned real recognition.

On the other hand, in every wave of technology, startups are a focus of global attention.

During the search wars, Google used its innovative page-ranking algorithm, PageRank, to win users from Yahoo!, and, though a latecomer, overtook Yahoo! to dominate the search market to this day.

In the early days of large language models, the Transformer came from Google, but GPT was the creation of the then-small research outfit OpenAI; it has gradually evolved into today's GPT-4o and become the object of pursuit.

Among OpenAI’s current pursuers and competitors is Google.

At any moment, even under siege from big companies, there are always stories of startups whose sparks ignite the industry.

With its technology and products, Aishi Technology is writing its own startup story on the AI video track.