Sora went silent after its explosive debut, while domestic video models took over and lowered the barrier to entry

2024-09-11

The industry frenzy triggered by the launch of OpenAI's text-to-video model Sora seems like only yesterday, yet Sora has still not been officially opened to the public. By contrast, domestic video models were released in quick succession in 2024. Although the technology is constantly being updated, most finished products still require manual post-production editing and compositing, which slows the pace at which the technology lands on the application side.
Against this backdrop, on September 11 Shengshu Technology disclosed a feature update, launching what it calls the world's first "subject consistency" function, which enables consistent generation of any subject and makes video generation more stable and controllable. With "subject consistency", a user uploads a picture of any subject; the AI locks onto the subject's appearance, switches scenes freely according to text prompts, and outputs a video in which the subject remains consistent.
In the view of Tang Jiayu, CEO of Shengshu Technology, short videos, animation, commercials and other film and television works all demand "a consistent subject, a consistent scene, and a consistent style" from their narrative systems. If a video model is to achieve narrative integrity, it must be fully controllable across these core elements.
Generating a 32-second video with one click
Shengshu Technology last made news in April this year, when Professor Zhu Jun, vice president of the Institute of Artificial Intelligence at Tsinghua University and co-founder and chief scientist of Shengshu Technology, released Vidu, a large video model featuring long duration, high consistency and high dynamics that could generate videos up to 16 seconds long with one click. With this technical update, Vidu can now generate videos up to 32 seconds long.
In 2024, the large-model race as a whole cooled after the previous year's frenzy, and large video models came to be seen as a necessary step toward multimodal large models, or AGI. Short-video companies represented by Kuaishou and ByteDance's Douyin, internet giants represented by Alibaba and Tencent, and startups represented by Shengshu Technology, Zhipu AI and Aishi Technology have successively released large video model products.
According to statistics from Debon Securities, more than a dozen companies at home and abroad have released or updated video generation models since Sora's debut. Objectively, the gap between China and other countries is gradually narrowing, and basic capabilities such as video duration and resolution are replicable; competition may next shift to winning users and improving stickiness. Subjectively, Debon Securities believes the quality of videos generated by large models has improved significantly, but they remain some distance from a true physical-world simulator. In cultural and creative videos the imagery is generally clear, but there are large differences in range of motion and physical fidelity. This was also one of the considerations behind the current functional upgrade.
Tang Jiayu said that Vidu's current 32-second generation is one-click, end-to-end generation rather than splicing and frame interpolation. The difference is that the model must compress information over a longer time span, including its internal representations, which is fundamentally tied to its understanding of the physical world and of the semantic input. Extending the duration therefore requires improving the model's ability to abstract and understand the world, along with its compression, comprehension and generation abilities.
Shi Yuxiang, an AIGC artist who created the animated short film "Summer Gift", believes the industry is currently quite tolerant of AI videos, and that details still need improvement, such as the handling of complex shots, multi-character shots, and scene blocking. Compared with the basic image-to-video function, the "subject reference" function breaks free from the constraints of static images, improves the coherence of creation, and cuts the workload of generating images by nearly 70%.
Li Ning, founder of Guangchi Matrix and a young director, used Vidu to pre-produce a clip featuring a film's male lead; all of the character imagery was generated from just three makeup photos of the actor: a close-up, a mid-shot and a long shot. Li Ning said that earlier AI film production mostly followed the traditional text-to-image, image-to-video workflow, which made it hard to control the coherence of storyboards and to keep characters' overall appearance consistent, so considerable effort went into tuning images up front. The images were also prone to problems such as uncontrolled lighting and shadow, blurring, and even deformation, and these problems are further magnified as video length increases. Vidu's "subject reference" function significantly improves the overall consistency of characters; it is no longer necessary to generate large numbers of images in advance, and character movement and shot transitions are more natural, which helps with long-form narrative creation.
Essentially, the "subject reference" upgrade aims to improve the quality of video generation by large models and the efficiency of combining the technology with specific industries, accelerating the implementation of AI in concrete applications. Shengshu Technology has launched a partner program, inviting organizations in advertising, film and television, animation, games and other industries to join.
At present, the business model of Shengshu Technology's video model is split between a SaaS subscription model and an API interface, the commercialization approach commonly adopted in the large-model field. On the specific split between the B-end and the C-end, Tang Jiayu said that from a revenue perspective the B-end market is larger, while the C-end product, launched a month ago, has shown a steep growth curve. On balance, the B-end is relatively clear and direct and carries relatively stable demand, so it will be the company's long-term focus; the C-end product is still being explored.
Zhang Peng, CEO of Zhipu, discussed the industry's commercialization efforts when he released Zhipu Ying. He said that at this stage, whether ToC or ToB, it is still early for large-scale commercialization; the current charging strategy is more of an early experiment, and the company will watch feedback from the market and users and adjust in good time.
Where do large video models go next?
Beyond upgrades at the level of specific functions, the current industry consensus is that multimodality is the broad trend, and large video models are only a transitional state.
On this point, Zhang Peng said that video generation does not exist in isolation but sits within the overall technology and product roadmap; Zhipu sees it as one link on the path to multimodal models, or multimodal AGI. From a product perspective, video generation will also become a standalone product that can be commercialized and generate value. Tang Jiayu likewise told reporters that Shengshu's underlying layer is a general large model, and video generation is only an intermediate stage.
On the road to multimodality, will the intensive release of multiple video models cause homogenization? Tang Jiayu told reporters that technical routes are now converging, but homogenization does not mean that everyone's progress and capabilities are the same. For example, today's language models all build on the Transformer architecture, yet OpenAI is still clearly ahead, because on top of the architecture there remain many intermediate links, such as how to scale up effectively and how to compress video effectively, which involve many techniques and much practical experience. Algorithmic skill, algorithmic difficulty, and algorithm-engineering difficulty are all reasons for the differences among today's large video models.
As for commercialization, Tang Jiayu believes the industry is making similar business choices: even players like Sora and Runway are actively embracing Hollywood or pursuing advertising partnerships, because these fields are naturally easy to land in. The industry is moving forward, each player with its own characteristics; AI video generation as a whole is still in its early stages, and the leading international players are expanding the market together.
On the wave of video model releases, Zhang Peng believes controllability is where the industry must work hardest. First, at the technical level, controllability of the video itself is a major requirement. Second, from a safety perspective, because a video signal carries more content and detail, the generated content must be guaranteed to meet requirements. Finally, if generated content is to be applied commercially, controllability is a precondition: it must accurately express the creator's intent before anyone will pay for it.
Once the basics are in place, the industry's expectations for large video models since Sora's launch center on AI replacing long-form video shooting. Zhang Peng believes that, from the perspective of technological development, this is an important direction with positive significance for change in the film and television industry. At present, however, large video models are not yet good enough to be used directly in production for audiences; they can serve as auxiliary tools or even for small-scale creation, and there is still a long way to go before they can truly meet the high demands of film production.
As for Sora, which debuted to great fanfare and has still not been opened to the public, the industry continues to treat it as the target to catch, though the opacity of its technical details leaves many areas for companies to explore on their own. On Sora's "disappearance", Tang Jiayu offered reporters several possible reasons: video is not OpenAI's current main line; some data copyright issues remain unresolved; and other problems arising during generation require time and cost to fix that do not match the company's priorities.
Zhang Peng and Zhipu have always faced the gap with the world's top level squarely. At the same time, he believes they must walk this road themselves; in many cases Chinese companies are catching up in their own way, for example by reducing the compute cost of video generation and increasing response speed so that everyone can use it. "While we pursue technological heights, we are also pursuing the popularization of technology," Zhang Peng said.
(This article comes from China Business Network)