
A Conversation with Tang Jiayu, CEO of Shengshu Technology: AI Video Has Reached the Point of "Popularization," and Longer Duration Is Not the Focus of Productization

2024-09-13


On September 11, Shengshu Technology held a media open day and released its "subject consistency" feature, which aims to keep the subjects in model-generated videos consistent from frame to frame.

At the event, Tang Jiayu, co-founder and CEO of Shengshu Technology, responded to questions from a Daily Economic News reporter about the business model. He said there are currently two business models in the industry: SaaS (Software as a Service) subscriptions and MaaS (Model as a Service). Since Vidu went online on July 30, it has received tens of thousands of API access applications worldwide.

Regarding the underlying architecture, Tang Jiayu said the U-ViT architecture used in its product Vidu is almost identical to the DiT architecture used by Sora; the difference is that U-ViT includes more design optimizations for deployment. Technologically, the industry is converging on the underlying architecture, but homogeneity does not mean everyone has the same progress and capabilities. Tang Jiayu gave an example: "Take today's language models: although everyone uses the Transformer architecture, realistically, OpenAI is still clearly ahead."

At present, the main users of AI video are still professionals such as filmmakers, but Tang Jiayu believes AI video has reached the point of "popularization."

In addition, in terms of current revenue, Shengshu Technology earns more from the B-end (business) market, while the C-end (consumer) growth curve has been very "steep" in the month since Vidu launched.

“The ultimate goal is to build a general-purpose large model”

Tang Jiayu holds a master's degree from the Natural Language Processing Laboratory of Tsinghua University. He previously served as vice president of RealAI and as a senior product manager at Tencent YouTu Lab. His current company, Shengshu Technology, was founded in March 2023 and announced the completion of a new financing round in early March this year. At the end of April, Vidu, the video model jointly developed by the company and Tsinghua University, was unveiled to the world; it officially launched and became fully open for use at the end of July.

As soon as Vidu launched, it was dubbed the "Chinese version of Sora." On one hand, the label reflected high outside expectations for a Chinese large video model; on the other, the two products are genuinely similar in technical architecture.

According to reports, Vidu's underlying architecture is based on the self-developed U-ViT, while Sora is based on DiT. On the difference between the two, Tang Jiayu said: "In a word, they are almost identical." Both fuse diffusion with the Transformer, and some of the underlying technical details are the same. The difference is that U-ViT "includes more optimization designs for deployment." Put simply, when training the same model, U-ViT requires less computing power in the same amount of time.
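The "fusion of diffusion and Transformer" that both architectures share can be sketched in a few lines. This is an illustrative toy only, not Vidu's or Sora's actual code: the Transformer backbone is replaced by a trivial placeholder so that the diffusion half of the fusion, the iterative noising and denoising loop, is visible on its own.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (a common diffusion choice)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_diffuse(x0, alpha_bar, rng):
    """Forward process q(x_t | x_0): blend the clean signal with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * v + b * rng.gauss(0.0, 1.0) for v in x0]

def placeholder_denoiser(x_t, t):
    """Stand-in for the learned backbone. In U-ViT/DiT this is a Vision
    Transformer that takes noisy latent patches plus a timestep embedding
    and predicts the noise; here it simply predicts zero noise."""
    return [0.0] * len(x_t)

def reverse_diffuse(x_t, betas):
    """Reverse process: step t -> t-1 with the DDPM mean update
    x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    x = list(x_t)
    for t in reversed(range(len(betas))):
        eps = placeholder_denoiser(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        x = [(xv - coef * ev) / math.sqrt(alphas[t]) for xv, ev in zip(x, eps)]
    return x
```

With a trained backbone in place of `placeholder_denoiser`, this loop is the standard sampling procedure the two architectures share; the differences Tang Jiayu refers to, such as deployment-oriented optimizations, live inside the backbone, not in this outer loop.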

From the perspective of the overall technical route, several domestic video model companies are taking a "Sora-like route." Will the field become more homogeneous in the future?

In response, Tang Jiayu said the industry is indeed converging on the underlying architecture, "but homogenization does not mean that everyone has the same progress and capabilities." Taking language models as an example, he noted that everyone uses the Transformer architecture, yet in practice OpenAI is still clearly ahead. Many links in the pipeline built on this architecture still require engineering skill and practical experience to get right, and this is what produces the capability gap between different language models.

The industry is also exploring new architectural routes, such as combining multimodal generation with multimodal understanding, but no particularly good solutions have emerged yet.

"Our ultimate goal is to build a general-purpose large model; video generation is an intermediate stage on the way to a multimodal large model." Tang Jiayu was frank about his ambition to develop a general-purpose model.

He added: "This does not mean we are only doing this one thing (the video model). Beyond video, we can also generate other modalities."

“Currently, the B-end market generates more revenue”

The convergence of underlying technology has, to some extent, led to similar market strategies.

"Everyone's business choices are relatively similar; even the companies behind Sora and Runway are actively pursuing cooperation with Hollywood or advertisers." Tang Jiayu believes AI video generation is still in an early stage overall, with the leading international players moving forward in parallel, "jointly expanding the market."

Taking Shengshu Technology as an example, Tang Jiayu divides commercialization into two directions. The first is the SaaS subscription model: Vidu offers a free monthly quota, but heavier use or more advanced capabilities require a paid subscription, and Vidu will continue to add features to meet users' creative needs. The second is model-capability output (MaaS): many customers need video generation as one step in their own workflow, or to build novel experiences on top of it, and these customers want to call the model directly.
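The MaaS pattern described above amounts to embedding a model call inside the customer's own pipeline. The sketch below is purely hypothetical: the field names and the `build_generation_request` helper are invented for illustration and do not reflect Vidu's real API.

```python
import json

def build_generation_request(prompt, duration_s=4, reference_image=None):
    """Assemble a JSON payload for a hypothetical text/image-to-video endpoint.
    A MaaS customer would POST this from inside its own workflow rather than
    using a SaaS front end."""
    payload = {"prompt": prompt, "duration": duration_s}
    if reference_image is not None:
        # e.g. an image of the subject, to support consistency across clips
        payload["reference_image"] = reference_image
    return json.dumps(payload)
```

The point of the sketch is the division of labor: under MaaS the customer owns the workflow and the model provider exposes only the generation capability, whereas under SaaS the provider owns both.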

In terms of revenue, the B-end market earns more at this stage, although in the month since Vidu went online the C-end growth curve has also been very "steep." "Based on our current judgment, B-end demand is relatively clear, direct, and stable, so the B-end is a long-term, key direction for us. We are also continuously exploring the C-end," Tang Jiayu said.

At present, domestic video generation models and tools have shown a clear trend of expanding overseas and have performed well, but Tang Jiayu cautioned: "It cannot be said that China has taken an outright lead; the top players at home and abroad are all in the first echelon."

“AI video has reached a turning point”

The audience for large video models is mostly people working in the film, television, and animation industries, who are often regarded as "professional users." So when will AI video become a tool that ordinary people can master?

Tang Jiayu used photography as an analogy: from the film camera era to ubiquitous smartphone photography, the threshold for creators has been continuously lowered. "AI video has now reached such a turning point." He said the "subject reference" feature Shengshu Technology released on September 11 is an effort to lower the threshold for creators and accelerate the creation process.

"Technology is still the key factor. Current video generation only preliminarily conforms to the laws of physics, and there is a high ceiling still to break through, such as stronger model capabilities and coordinated generation across more modalities." Tang Jiayu said the newly released "subject reference" capability has indeed made great strides in consistent generation, but much still needs improvement. "For example, suppose the model moves from generating a simple commodity to generating a handicraft with complex patterns and hollowed-out parts: facing such a complex structure, the current generation success rate is still not high. Scene generation also involves many components, sports shoes for example. I hope it can perform better in more complex and dynamic scenes. All of this requires continuous improvement of model capabilities."

In this process, technological originality and breakthroughs must go hand in hand with sound commercialization, because commercial companies are, after all, not research institutes.

Take video duration as an example: extending duration requires improving the model's ability to form abstract understanding of the world and its two-way ability to compress and expand information. Vidu can currently generate up to 32 seconds of video, and Shengshu Technology plans to extend this further. However, duration is not yet a key product priority for the company.

"In actual creation, roughly speaking, more than 90% of clips are only a few seconds long. So from a practical standpoint, we have not made duration our release priority," Tang Jiayu emphasized, though at the level of model capability, the company continues to improve it.

Reporter | Li Shaoting, Ke Yang

Editor | Duan Lian, Wen Duo, Du Hengfeng

Proofreader | Wang Yuelong

Original article | Daily Economic News (nbdnews)

Reproduction, excerpting, copying, and mirroring are prohibited without permission.

