
Can AI "create" everything?

2024-08-29


Half a year after Sora's debut, its "challengers" have arrived one after another, and even Nvidia, which "couldn't wait" and "couldn't catch up," has gotten involved.
So far, Sora has only released sample clips and has not been opened to the public, while Kuaishou's Keling, Zhipu's Qingying, and Vidu have taken the lead in opening their doors and reaching ordinary users.
Although the early "one-click generation" experience is far from perfect, it has already stirred up the content industry: many of the short dramas, advertisements, and animations around us have begun using AI as an "efficiency partner." Generative AI has moved from text-to-image not long ago to text-to-video, image-to-video, and video-to-video, and the "AIGC universe" keeps expanding.
Is AI the "Magic Brush Ma Liang" of Chinese legend? How much imagination and creativity can it bring to our lives?
"vensheng video", how to "give birth"
"vinsheng video is a bombshell." over the past six months, the reappearance of sora from major companies to unicorns has demonstrated the industry's emphasis on "generation."
In short, video generation is the process of converting multimodal inputs such as text and images into video signals through generative artificial intelligence technology.
Currently, there are two main technical routes for video generation, as sketched below. The first is the diffusion model, which comes in two flavors: diffusion models based on convolutional neural networks, such as Meta's EmuVideo and Tencent's VideoCrafter, and diffusion models based on the Transformer architecture, such as OpenAI's Sora, Kuaishou's Keling AI, and Shengshu Technology's Vidu. The second is the autoregressive route, exemplified by Google's VideoPoet and Phenaki.
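To make the contrast concrete, here is a minimal sketch in Python. Every name in it (denoiser, predict_next_token, the shapes and step counts) is an illustrative assumption, not the API or implementation of any system named above; it only shows the general shape of each route: diffusion refines an entire noisy clip over many steps, while the autoregressive route emits the clip token by token.

```python
# Conceptual sketch only: illustrates the two technical routes, not any product's code.
import torch

def generate_by_diffusion(denoiser, text_emb, steps=50, shape=(16, 3, 64, 64)):
    """Diffusion route: start from pure noise and refine the whole clip at once."""
    video = torch.randn(shape)                      # pure noise: the "uncarved stone"
    for t in reversed(range(steps)):
        noise_pred = denoiser(video, t, text_emb)   # CNN- or Transformer-based denoiser
        video = video - noise_pred / steps          # highly simplified update rule
    return video                                    # all frames emerge together

def generate_autoregressively(model, text_tokens, clip_len=256):
    """Autoregressive route: emit discrete video tokens one after another."""
    tokens = list(text_tokens)
    for _ in range(clip_len):
        tokens.append(model.predict_next_token(tokens))  # condition on everything so far
    return tokens                                        # later decoded back into frames

# Toy call with a stand-in denoiser, just to show the calling shape.
clip = generate_by_diffusion(lambda v, t, c: torch.zeros_like(v), text_emb=None)
```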
Photo: On July 26, 2024, Chinese technology company Zhipu AI released its self-developed AI video generation model Ying to global users; the picture shows the user login interface.
At present, the diffusion model based on the Transformer architecture is the mainstream choice for video generation models, also known as "DiT" (Di for diffusion, T for Transformer).
text "diffusion" into video? "diffusion here refers to a modeling method." yuan li, assistant professor and doctoral supervisor at the school of information engineering of peking university, gave a vivid example:
when michelangelo was carving the famous david statue, he said: the sculpture is already in the stone, i just removed the unnecessary parts. "this sentence vividly describes the modeling process of 'diffusion'. the original pure noise video is like an uncarved stone. how to knock this big stone and knock off the excess parts until it becomes a clear-cut 'david' is the way of 'diffusion'," said yuan li.
yuan li further explained: "transformer is a neural network that follows the 'rule of scale' and performs the process of knocking stones. it can process the input spatiotemporal information and understand the real world by understanding its internal complex relationships. this enables the model to have reasoning capabilities, capture the subtle connections between video frames, and ensure visual coherence and temporal fluency."
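As a rough, hedged illustration of that "stone-chipping" with a Transformer, the sketch below builds a toy DiT-style denoiser: the noisy clip is treated as a sequence of spatio-temporal patch tokens, a text embedding is attached as an extra token so every patch can attend to it, and the network predicts the noise to subtract. The class name, shapes, and fixed-step update are assumptions for illustration, not the architecture of Sora, Keling, or Vidu.

```python
# Toy DiT-style denoiser: a Transformer that predicts the noise to "chip away"
# from a clip of spatio-temporal patches. All shapes and layers are illustrative.
import torch
import torch.nn as nn

class ToyVideoDenoiser(nn.Module):
    def __init__(self, patch_dim=256, d_model=512, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)             # lift patches to model width
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_noise = nn.Linear(d_model, patch_dim)          # predict noise per patch

    def forward(self, noisy_patches, text_emb):
        # Prepend the text condition as an extra token; attention then links it,
        # and every patch of every frame, to every other patch.
        x = torch.cat([text_emb.unsqueeze(1), self.embed(noisy_patches)], dim=1)
        x = self.backbone(x)
        return self.to_noise(x[:, 1:])                         # drop the condition token

# One denoising step on a single clip: 16 frames x 64 patches = 1024 patch tokens.
denoiser = ToyVideoDenoiser()
noisy = torch.randn(1, 16 * 64, 256)               # (batch, patch tokens, patch features)
text = torch.randn(1, 512)                         # stand-in text embedding
less_noisy = noisy - 0.1 * denoiser(noisy, text)   # schematic step, not a real scheduler
```

Real DiT-style models typically inject the timestep and text condition through adaptive layer normalization rather than a prepended token, and use a learned noise schedule; the sketch only shows why attention across spatio-temporal patches helps keep frames coherent.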
How fast is the "efficiency partner"?
A cute polar bear is woken by its alarm clock, packs its bags, takes a helicopter, transfers to a high-speed train, switches to a taxi, boards a ship, crosses mountains, rivers, lakes and seas, braves hardships and dangers, and finally reaches the South Pole to meet the penguins...
This 1.5-minute animated short, titled "Heading South," was created with the video generation model Vidu. A workload that would originally have taken a month was finished in just one week with AI, the "efficiency partner," on board: four times the previous efficiency.
This left Chen Liufang, winner of Best Film in the AIGC short-film unit of the Beijing Film Festival and head of Ainimate Lab AI, deeply moved: video generation technology means high-end animation is no longer a "money-burning game" that only big companies dare to play.
The creative team behind the AI animation "Heading South" consisted of just three people: a director, a storyboard artist, and an AIGC technology application expert. Produced through a traditional pipeline, it would have required 20 people. All told, production costs alone were cut by more than 90%.
As Wan Pengfei, head of Kuaishou's Visual Generation and Interaction Center, put it, the essence of video generation is to sample and compute pixels from a target distribution, an approach that offers greater creative freedom at lower cost.
On Vidu's video generation page, I also experienced the freedom of "one-click generation": upload a photo and set it as the "starting frame" or as a "reference character," type a description of the scene you want into the dialog box, click "Generate," and a lively, delightful short video is produced automatically. From opening the page to downloading the result takes less than a minute.
Photo: Send a picture to Vidu, a domestic video generation model, and an animated video is generated automatically; the picture shows a screenshot of that video.
"the era of 'everyone becomes a designer' and 'everyone becomes a director' will come, just like the era of 'everyone has a microphone'," said zhang peng, ceo of zhipu ai.
"world simulator", is there any hope?
Is video generation merely about disrupting the content industry? That is clearly not OpenAI's original intention; "generating videos" is just an "appetizer."
Before Sora was even born, OpenAI positioned it not as a tool for producing AIGC but as a "container" that replicates the physical world: a world simulator. Inside this container, the physical laws, environmental behaviors, and interaction logic of the real world play out, much like the virtual world depicted in "The Matrix," striking at our imagination and senses.
However, the physical world is three-dimensional, while current models such as Sora operate only in two dimensions. They are not true physics engines, so deep simulation of the physical world is, for now, out of the question.
"for many years, i have said that 'seeing' the world is 'understanding' the world. but now i am willing to push this concept a step further. 'seeing' is not just for 'understanding', but for 'doing'." fei-fei li, a professor at stanford university, publicly stated that the bottom line of spatial intelligence is to connect "seeing" and "doing" together, and one day, ai will do this.
when "seeing" is not the same as "doing", the creation of artificial intelligence cannot stop. recently, a new technical route has emerged. different routes are competing with each other and moving forward together to promote this intelligent world constructed by vectors and models.
the future "world view" is still a mystery that has not yet been revealed. as american physicist feynman said, "i cannot create a world that i do not understand." but this does not mean that if you understand a world, you will definitely be able to create a world.
at this moment, we are still on the eve of disruption. that is why when we ask technology explorers questions about the future, we get completely different answers. perhaps "uncertainty" is a blessing in this era.