news

Using the "vertical model" to lead the commercialization of AIGC: what is FancyTech's technical path?

2024-08-16




Synced Editorial Department

We are witnessing another round of technological innovation. This time, AIGC gives individuals the tools to express themselves, making creation easier and more accessible, but the driving force behind it is not the "big" model.

Over the past two years, AIGC technology has developed faster than anyone imagined, sweeping across every field from text and images to video. Discussion of AIGC's commercialization path has never stopped, with some consensus and some divergent routes.

On the one hand, the capabilities of general models are impressive, showing their potential across industries. In particular, the introduction of architectures such as DiT and VAR has allowed scaling laws to leap from text to visual generation. Guided by these laws, many large-model vendors keep moving toward more training data, more compute, and more parameters.

On the other hand, we also see that a general model does not mean a "one-size-fits-all" model. When faced with tasks in many niche areas, a "well-trained" vertical model can actually achieve better results.

As large-model technology enters a period of accelerated deployment, the latter commercialization path has quickly gained attention.

In this evolution, FancyTech, a startup from China, has stood out: it rapidly expanded the market with standardized products for commercial visual content generation and validated the advantages of the "vertical model" in industry deployment a step earlier than its peers.

Looking around China's large-model startup scene, FancyTech's commercialization results are plain to see. Less well known is how this company, founded only a few years ago, has stayed at the front of the race on the strength of its vertical models and technology.

In an exclusive interview, Synced spoke with FancyTech about the technical explorations the company is currently pursuing.

FancyTech releases DeepVideo, a vertical video model

How to break through industry barriers?

Generally speaking, once a general model's zero-shot generalization reaches a certain level, it can be fine-tuned for downstream tasks. This is the approach many large-model products take today. In practice, however, "fine-tuning" alone cannot meet the needs of industrial applications, because content generation tasks in each industry come with their own specific and complex sets of standards.

A general model may handle 70% of routine tasks, but what customers really need is a "vertical model" that meets 100% of their needs. Take commercial visual design as an example: this work used to be done by professionals with years of accumulated experience, designed and adjusted to each brand's specific needs, and it relied heavily on manual expertise. Compared with metrics such as aesthetics and instruction-following, "product restoration", that is, how faithfully the product is reproduced, is what brands care about most in this task, and it decides whether they are willing to pay.

In developing its own vertical model for commercial images and videos, FancyTech broke the core challenge down: reproduce the product faithfully while blending it into the background, and, especially in generated videos, keep the product's motion controllable and free of deformation.







Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650930567&idx=1&sn=b5fc3170aa4c3be6701f2a21fb898120&chksm=84e439f9b393b0ef4b8ce1756e95c59a9dce6205478feea33a68b6a594d400cd0ac1b62e037f&token=2065772502&lang=zh_CN#rd

With big-model technology where it is today, the open-source versus closed-source question is no longer the core issue for the application layer. FancyTech's vertical model is built on an open-source underlying algorithm framework, layered with the company's own data annotation and retraining, and needs only a few hundred GPUs for continuous training iterations to achieve good generation results. By comparison, two factors matter more to the final deployed results: "product data" and "training method".

Building on its accumulation of massive 3D training data, FancyTech introduced ideas from spatial intelligence to guide the model's 2D content generation. Specifically, for image generation the team proposed a "multimodal feature extractor" to preserve product fidelity and used purpose-built data collection to ensure the product blends naturally into the background; for video generation the team rebuilt the underlying video-generation pipeline, designed the framework accordingly, and carried out the corresponding data engineering, realizing product-centric video generation.

A genuine dimensional advantage: how does "spatial intelligence" guide 2D content generation?

The core reason many visual generation products fall short is that today's image and video generation models are usually trained on 2D data and do not understand the real physical world.

This is a consensus in the field, and some researchers even believe that under the autoregressive learning paradigm a model's understanding of the world will always remain shallow.

However, for the niche task of commercial visual generation, strengthening a model's understanding of the 3D physical world in order to generate better 2D content is not an unsolvable problem.

FancyTech has carried research ideas from "spatial intelligence" over to building its visual generation models. Unlike general generative models, the spatial-intelligence approach learns from raw signals captured by large numbers of sensors, accurately calibrated, so the model can perceive and understand the real world.

FancyTech therefore replaced traditional studio shooting with LiDAR scanning, accumulating a large volume of high-quality 3D data pairs that capture the difference between a product before and after scene integration, and combined 3D point-cloud data with 2D data as training data to strengthen the model's understanding of the real world.

In any visual content generation, shaping light and shadow is a notoriously hard task. Elements such as lighting, light sources, backlighting, and highlights give a picture a stronger sense of spatial depth, but they are a "knowledge point" that generative models struggle to grasp.

To collect as much natural light-and-shadow data as possible, FancyTech set up dozens of lights with adjustable brightness and color temperature in each capture environment, meaning every data pair in the massive dataset can be overlaid with multiple lights and different brightness and color-temperature variations.



This high-intensity data collection simulates the lighting of real shoots, making the data better match the characteristics of e-commerce scenarios.
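
The article does not describe the data format itself. Purely as an illustration, a paired training sample in such a pipeline might bundle the LiDAR point cloud of a product with its 2D shots before and after scene integration, plus the lighting settings used for that capture; the field names below are hypothetical, not FancyTech's schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ProductSample:
    """Hypothetical paired 3D/2D training sample (all field names are illustrative)."""
    point_cloud: np.ndarray    # (N, 3) LiDAR points scanned from the product
    product_image: np.ndarray  # (H, W, 3) product shot before scene integration
    scene_image: np.ndarray    # (H, W, 3) the same product composited into a scene
    lighting: dict             # capture settings, e.g. {"color_temp_k": 5600, "brightness": 0.8}
```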



Combined with its accumulation of high-quality 3D data, FancyTech made a series of innovations in the algorithm framework, organically combining spatial algorithms with image and video algorithms so the model better understands how the core object interacts with its environment.

During training, the model can to some extent "emerge" an understanding of the physical world, gaining a deeper grasp of three-dimensional space, depth, the reflection and refraction of light, and how light behaves in different media and materials, ultimately achieving "strong restoration" and "super fusion" of the product in the generated results.

What algorithmic innovations are behind "strong restoration" and "super fusion"?

For common product-scene image generation tasks, the current mainstream approach keeps the product faithful by compositing the original product pixels and then edits the scene with inpainting: the user selects the region to modify and enters a prompt or provides a reference image to guide scene generation. This blends well, but scene generation is not very controllable, results can be unclear or overly plain, and a high usable rate from a single output cannot be guaranteed.
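
As a concrete reference, the mainstream inpainting workflow described above can be sketched with the open-source diffusers library; the checkpoint name and file paths below are placeholders, and this is not FancyTech's system.

```python
# Sketch of the mainstream inpainting-based workflow: the product pixels are
# preserved via a mask, and only the surrounding scene is regenerated from a
# text prompt. Checkpoint and file names are placeholders.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # any inpainting checkpoint works here
    torch_dtype=torch.float16,
).to("cuda")

product_image = Image.open("product.png").convert("RGB")  # original product shot
scene_mask = Image.open("scene_mask.png").convert("L")    # white = area to repaint

result = pipe(
    prompt="product on a marble countertop, soft studio lighting",
    image=product_image,
    mask_image=scene_mask,
).images[0]
result.save("composited_scene.png")
```

Because only the masked region is regenerated, the product pixels stay intact, but the surrounding scene depends entirely on the prompt, which is exactly the controllability limitation the article points out.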

To address the problems current methods cannot solve, FancyTech proposed its own "multimodal feature extractor", which extracts product features along multiple dimensions and then uses those features to generate an integrated scene image.



Feature extraction is split into "global features" and "local features". Global features cover the product's outline, color, and other such elements and are extracted with a VAE encoder; local features cover the details of each product and are extracted with a graph neural network. A major benefit of the graph neural network is that it can capture information about each key pixel of the product and the relationships between key pixels, improving the fidelity of details inside the product.
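
FancyTech has not published its architecture; the PyTorch sketch below only illustrates the two-branch idea described above, with a VAE-style encoder for global outline and color and a small message-passing network over key pixels for local detail. All module names and sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class GlobalEncoder(nn.Module):
    """Illustrative VAE-style encoder for global features (outline, color)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = self.backbone(x)           # x: (B, 3, H, W) product image
        return self.to_mu(h), self.to_logvar(h)

class LocalGraphEncoder(nn.Module):
    """Illustrative message-passing block over key pixels and their relations."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.message = nn.Linear(2 * feat_dim, feat_dim)
        self.update = nn.GRUCell(feat_dim, feat_dim)

    def forward(self, node_feats, edges):
        # node_feats: (N, feat_dim) features of key pixels
        # edges: (E, 2) long tensor of index pairs between related key pixels
        src, dst = edges[:, 0], edges[:, 1]
        msg = self.message(torch.cat([node_feats[src], node_feats[dst]], dim=-1))
        agg = torch.zeros_like(node_feats).index_add_(0, dst, msg)
        return self.update(agg, node_feats)
```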

For products made of flexible materials, this approach has yielded significant improvements:



Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650930567&idx=1&sn=b5fc3170aa4c3be6701f2a21fb898120&chksm=84e439f9b393b0ef4b8ce1756e95c59a9dce6205478feea33a68b6a594d400cd0ac1b62e037f&token=2065772502&lang=zh_CN#rd

Compared with images, video generation additionally involves controlling the product's own motion and the resulting changes in light and shadow. For general video generation models, the difficulty is that they cannot independently protect a specific region of the video. To solve this, FancyTech splits the task into two branches: "product motion generation" and "video scene integration".

  • First, FancyTech designed targeted motion-planning schemes to control how the product moves within the frame, which in effect "pins down" the product in every frame of the video in advance;
  • Second, controllable video generation is achieved through a control module. The control module uses a flexible design that is compatible with different backbones such as U-Net and DiT and is easy to extend and optimize (see the illustrative sketch after this list).
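
FancyTech has not released this control module; the sketch below is only a hypothetical illustration of what an architecture-agnostic control block might look like, injecting a rendered product-motion condition into the intermediate features of a U-Net or DiT backbone. Names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class MotionControlAdapter(nn.Module):
    """Hypothetical control block: projects a per-frame product-motion condition
    (e.g. rendered masks or trajectories) and adds it to backbone features, so
    the same module can be attached to either U-Net or DiT blocks."""
    def __init__(self, cond_channels, feat_channels):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(cond_channels, feat_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_channels, feat_channels, 1),
        )
        # Zero-initialize the last projection so training starts from the
        # unmodified backbone and the control signal is phased in gradually.
        nn.init.zeros_(self.proj[-1].weight)
        nn.init.zeros_(self.proj[-1].bias)

    def forward(self, backbone_feat, motion_cond):
        # backbone_feat: (B*T, C, H, W) features from a U-Net/DiT block
        # motion_cond:   (B*T, cond_channels, H, W) rendered product motion
        return backbone_feat + self.proj(motion_cond)
```

Zero-initializing the final projection is a common trick, popularized by ControlNet, for grafting a control branch onto a pretrained backbone without disturbing it at the start of training.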

At the data level, besides FancyTech's proprietary product data, which drives the control training and product protection, multiple open-source datasets are added to ensure generalization across scenes. The training recipe combines contrastive learning and curriculum learning, ultimately achieving the intended protection of the product.
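
The article names only the ingredients of this recipe. As a rough illustration of how they could fit together, the snippet below sketches an InfoNCE-style contrastive term that ties generated clips to their reference product, plus a hypothetical easy-to-hard curriculum schedule; none of the thresholds or stage names come from FancyTech.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(product_emb, generated_emb, temperature=0.07):
    """Illustrative contrastive term: pull each generated clip's embedding toward
    its reference product and push it away from other products in the batch."""
    product_emb = F.normalize(product_emb, dim=-1)      # (B, D)
    generated_emb = F.normalize(generated_emb, dim=-1)  # (B, D)
    logits = generated_emb @ product_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

def curriculum_stages():
    """Hypothetical easy-to-hard schedule: static shots first, then simple
    motions, then complex scenes with multiple lights."""
    return [
        {"name": "static",  "max_motion": 0.0, "epochs": 2},
        {"name": "simple",  "max_motion": 0.3, "epochs": 4},
        {"name": "complex", "max_motion": 1.0, "epochs": 8},
    ]
```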

Starting with vertical models, bringing the dividends of the AIGC era to more ordinary people

Whether "general" or "vertical", both routes ultimately arrive at the question of commercialization.

The most direct beneficiaries of FancyTech's vertical models are brands. In the past, producing an advertising video, from planning through shooting and editing, could take several weeks. In the AIGC era, such a video takes only a dozen or so minutes to create, at as little as one-fifth of the original cost.

With its massive proprietary data and industry know-how, FancyTech's vertical models have won wide recognition at home and abroad. The company has signed contracts with Samsung and LG alongside Korean partners; begun cooperation with Lazada, the well-known Southeast Asian e-commerce platform; been adopted in the United States by local brands such as Kate Sommerville and Solawave; and, in Europe, won the LVMH Innovation Award and is working closely with European customers.

Beyond the core vertical models, FancyTech also provides end-to-end capabilities for automatically publishing AI short videos and feeding performance data back, driving continued growth in product sales.

More importantly, vertical models make visible a path for ordinary people to raise their productivity with AIGC technology. For example, a traditional street photography studio can use FancyTech's products to expand from simple portrait photography into professional-grade commercial visual production, without adding specialist equipment or staff.



Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650930567&idx=1&sn=b5fc3170aa4c3be6701f2a21fb898120&chksm=84e439f9b393b0ef4b8ce1756e95c59a9dce6205478feea33a68b6a594d400cd0ac1b62e037f&token=2065772502&lang=zh_CN#rd

Today, almost anyone can shoot videos, record music, and share their creations with the world just by picking up a phone. Imagine a future in which AIGC once again unleashes individual creativity.

It will let ordinary people cross professional thresholds and turn creativity into reality more easily, bringing a leap in productivity across industries and spawning new ones. From that point on, the dividends of the AIGC era will truly begin to reach ordinary people.