Doing the grunt work for AI companies, a post-95 founder of Chinese descent reaches a US$13.8 billion valuation

2024-09-30

Source | Chuangyebang (ID: ichuangyebang)

Author | Juny

Editor | Hai Yao

Picture source | Bloomberg

In San Francisco's Showplace Square, a commercial building that was formerly Airbnb's recently welcomed a new tenant. At a time when most technology companies are cutting back, Scale AI, an artificial intelligence data annotation company started by a founder of Chinese descent born after 1995, leased roughly 180,000 square feet of office space in downtown San Francisco without a second thought.

Not long ago, Scale AI completed its latest financing round of US$1 billion at a valuation of US$13.8 billion, nearly double the US$7.3 billion of the previous round. The Series F round was led by top Silicon Valley fund Accel. Alongside existing investors such as YC and Nvidia, it brought in a long list of new investors, including Amazon, Meta, AMD, Qualcomm, Cisco, and Intel, for as many as 22 participating institutions.

Most of these giants invested in Scale AI for a similar reason: they are Scale AI's customers. With the rapid development of AI, Scale AI has turned data labeling, a seemingly simple, tedious, labor-intensive, low-barrier business, into a big business step by step.

AI's "blue-collar factory"

In recent years, Nvidia has undoubtedly been the company mentioned most often when people talk about "selling shovels" in the AI gold rush. What many people don't know is that Scale AI plays the same role. As we all know, computing power, algorithms, and data are the three pillars of artificial intelligence. Nvidia sits at the peak of AI computing power, and Scale AI is currently the main provider of data services for AI.

Scale AI was founded in 2016. Its founder, Alexandr Wang, is of Chinese descent and was born in 1997. He was only 19 when he started the company and had just completed his freshman year at MIT. From the start, Scale focused on data annotation for artificial intelligence. Its core business is helping enterprises collect, clean, annotate, and manage large-scale, high-quality data in order to train and optimize machine learning models.

Before the rise of Scale AI, data annotation had long occupied a marginal position in the AI field. Data annotation is the process of adding structured information to raw data such as images, text, video, or audio so that machine learning models can understand and learn from it. That may sound complicated, but it is something even an elementary school student can do. Given a picture, you mark the pedestrians, vehicles, buildings, and so on in it. Given a piece of text, you mark which sentences are exclamations and which are questions. Given a voice recording, you tag it with the emotion or the speaker's identity.
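The three examples above can be made concrete with a minimal Python sketch of what annotated records might look like. The field names (`label`, `box`, `speaker`, `emotion`), file names, and coordinates are illustrative assumptions, not the schema of Scale AI or any other vendor.

```python
from dataclasses import dataclass


@dataclass
class BoxLabel:
    """One labeled object in an image (hypothetical schema)."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class ImageAnnotation:
    """Raw image plus the structured information an annotator added."""
    image_file: str
    objects: list


# Image: mark pedestrians, vehicles, and buildings.
image = ImageAnnotation(
    image_file="street_001.jpg",
    objects=[
        BoxLabel("pedestrian", (34, 120, 78, 260)),
        BoxLabel("vehicle", (150, 140, 420, 300)),
        BoxLabel("building", (0, 0, 640, 130)),
    ],
)

# Text: mark each sentence as an exclamation or a question.
text_labels = [
    {"text": "What a view!", "label": "exclamation"},
    {"text": "Where are we going?", "label": "question"},
]

# Audio: tag a clip with speaker identity and emotion.
audio_label = {"audio_file": "clip_001.wav", "speaker": "speaker_a", "emotion": "happy"}


def count_labels(annotation: ImageAnnotation) -> dict:
    """Count how many objects of each class were marked in one image."""
    counts = {}
    for obj in annotation.objects:
        counts[obj.label] = counts.get(obj.label, 0) + 1
    return counts


label_counts = count_labels(image)
```

Each record pairs a piece of raw data with the structured labels a person added, which is the form a model can learn from.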

Source: Shaip

Although the principle is simple, annotated data is indispensable to the development of artificial intelligence. AI models need large amounts of annotated data to learn from before they can recognize, classify, and predict.

The headache for many AI companies is that, although automated tools can speed up part of the annotation process, obtaining high-quality, high-precision annotated data still requires a great deal of manual work to process, label, and verify the data. This is especially true in fields with high accuracy requirements, such as medical imaging, autonomous driving, or military applications, where incorrect labels can have serious consequences. For this reason, data annotation is regarded as a labor-intensive business. Many companies have neither the will nor the capacity to manage it themselves, which makes obtaining annotated data time-consuming and expensive.

Scale AI took over this "hard work". Its early positioning was to build an efficient, accurate labeling platform that combines automation with human review to help companies process and label large data sets quickly. The business model is very simple: it contacts companies that need labeling, performs basic preprocessing and cleaning of the data, and then outsources the labeling to workers in Africa, Southeast Asia, and elsewhere.
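One common way to combine automation with human review is to let a model pre-label the data and send only low-confidence items to human annotators. The Python sketch below illustrates that general idea under stated assumptions; the function name `route_items`, the `confidence` field, and the 0.9 threshold are hypothetical and do not describe Scale AI's actual pipeline.

```python
def route_items(predictions, threshold=0.9):
    """Split model pre-labels into auto-accepted items and items for human review.

    `predictions` is a list of dicts with a model-assigned `confidence`
    between 0 and 1 (hypothetical format).
    """
    auto_accepted, needs_review = [], []
    for item in predictions:
        if item["confidence"] >= threshold:
            auto_accepted.append(item)
        else:
            needs_review.append(item)
    return auto_accepted, needs_review


# Example pre-labels produced by some upstream model (made-up values).
predictions = [
    {"id": 1, "label": "pedestrian", "confidence": 0.97},
    {"id": 2, "label": "vehicle", "confidence": 0.62},
    {"id": 3, "label": "building", "confidence": 0.91},
]

auto_accepted, needs_review = route_items(predictions)
```

Raising the threshold sends more items to people and costs more, while lowering it saves labor at the risk of more labeling errors, which is the trade-off a labeling platform has to manage.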

In 2017, Scale AI established Remotasks as its in-house outsourcing arm. It has set up dozens of facilities in Kenya, the Philippines, Venezuela, and other places, and trained thousands of data annotators. Most of these annotators are paid per task, with earnings as low as a few cents per task, and many contract workers earn less than $1 an hour. Under this "global factory" model, Scale AI has been able to keep its gross profit margin above 65% for a long time.

Seizing every opportunity

Although data annotation seems to be a low-barrier business, the market was almost empty during the "quiet period" of AI around 2016. Only a few large companies, such as Google and Amazon, had their own data annotation departments. Scale AI's success is largely due to its accurate reading of this opportunity and its ability to catch several major trends in the artificial intelligence industry over the past 10 years.

The first was autonomous driving. A few months after Scale AI was founded, the company discovered large-scale, inelastic demand for data annotation in autonomous driving. Autonomous driving technology depends on large amounts of high-precision annotated data, such as images of road scenes, pedestrians, and other objects. Car companies need tens of thousands of hours of video annotated in order to train and verify their algorithms. Across the autonomous driving industry as a whole, more than 90% of data annotation at that time was done manually. Scale AI built an efficient annotation platform and used model-assisted annotation and data preprocessing to speed up data processing, significantly reducing annotation cost and time. This attracted customers such as Waymo and Cruise, which were in the limelight at the time, and Scale AI gradually gained a foothold in autonomous driving data annotation.

Image source: Scale AI

After its initial success in autonomous driving, Scale AI moved fully into the AIaaS (AI as a Service) market. It expanded from simple data labeling to data services, providing end-to-end solutions that run from data labeling and management, through model training and evaluation, to AI application development and deployment.

In addition, to address the shortage of data in some industries, Scale AI extended downstream into synthetic data generation, which helps train models by creating new data sets from existing data. In the following years, Scale AI rose rapidly in the data field, and its customers expanded to healthcare, national defense, e-commerce, government services, and other sectors. A little more than two years after its founding, Scale AI's revenue was approaching $50 million.

Scale AI also caught the explosion of generative AI. As early as GPT-2, Scale worked with OpenAI on the first collaborative experiment in reinforcement learning from human feedback, and these techniques were later extended to InstructGPT and other models. Because generative AI models need massive amounts of training data to improve the accuracy and diversity of what they generate, the explosive growth of large language models has greatly increased the industry's demand for high-quality annotated data. By integrating data annotation, data synthesis, and other services, Scale AI provides the data support that generative AI needs. Scale AI also helps enterprises quickly generate customized APIs, reducing the complexity and cost of training models on their own.

Image source: Scale AI

For generative AI, Scale has launched end-to-end platform services, including the developer tool platform Scale Spellbook, the synthetic data product Scale Synthetic, and an enterprise-level GenAI platform. The goal is to give enterprises enough data in every scenario to support model training. With its distinctive advantages in data, Scale AI has seen a surge in customers over the past two years, including giants such as OpenAI, Meta, AWS, and Nvidia, as well as emerging unicorns such as Cohere and Adept. Many of them also became investors in Scale AI in this financing round.

Why Scale AI broke through

The rise of Scale AI has left many people wondering. China seems to have a natural advantage in such an upstream, labor-intensive part of the AI industry, so why has no similar company stood out there? Broadly speaking, there are two main factors: one is the industry, and the other is financing.

Before the generative AI boom, China's artificial intelligence development once led in scenario-based applications. Its data annotation business started developing very early but never reached a large scale. Although many leading companies set up data annotation departments, these mainly served their own business rather than trying to match data with resources across industries. At the same time, precisely because of the country's demographic dividend, the cost of acquiring labeled data was low, and companies had no incentive to adopt technology platforms. Prices in China's data annotation industry have reportedly been very transparent for a long time. Hourly wages are generally around RMB 10 to 25, and most positions have no educational requirements.

Source: BOSS Zhipin

By comparison, labor costs in the United States are high. On LinkedIn, Indeed, and other platforms, most part-time data annotation jobs pay between US$30 and US$200 per hour. This effectively forces companies either to solve the data production problem through technology or to procure related services.

In terms of the financing environment, China's data annotation market has always been on the margins of AI financing. Around 2021, research estimated the size of China's entire data annotation market at only 4.3 billion yuan, projected to grow to only 5.1 billion yuan in 2022. This figure is negligible compared with the trillion-scale AI market as a whole, and it has made financing difficult for data annotation companies. In 2021, when Scale AI had completed its US$325 million Series E round at a valuation of US$7.3 billion, most similar startups in China were still at the Series A stage.

The Chinese market was so small because only the labeling step itself was being counted. In fact, the end-to-end data services that grow out of data annotation, such as data management, data evaluation, and data synthesis, are the value-added part of this industry.

On the importance of data to the development of large language models, Alex Wang, the founder of Scale AI, said in a recent interview that people have exhausted all the data on the internet, and that building artificial intelligence more powerful than GPT-4.5 will require constructing cutting-edge data. "Cutting-edge data" refers to data that is closely tied to application scenarios and reflects the latest trends and changes in a timely manner. It often contains many long-tail or rare scenarios, which helps improve AI performance in atypical situations and pushes the boundaries of AI capabilities toward complex reasoning and multimodality.

As AI develops further, future training data will need to be more closely matched to specific tasks and application scenarios, so more new and differentiated data will have to be mined and produced. This is the focus of Scale AI's work following its current US$1 billion financing round, and it further expands the imaginable boundaries of data annotation.