
AI data is in short supply, and big companies are eyeing cheap young labor

2024-09-03


To obtain new data for training large AI models, ByteDance and other internet giants are getting personally involved, recruiting "AI recorders" and commissioning custom corpora at 300 yuan per session.

The ByteDance office building in Beijing's Dazhongsi area houses the company's Douyin and Volcano Engine business teams. Since the beginning of the year, they have been recruiting amateurs to record training audio for the Doubao model. Recorders work in pairs; each session lasts three hours, covering 80 minutes of free chat and 60 groups of prompted dialogue, and pays 300 yuan.

At least two ByteDance employees sat in on each three-hour recording. "The conversation can't drag on; it has to have content and information, and if the quality is too poor we will dock pay accordingly." "The prompt phrases cannot be modified, or the large model won't understand them." From 6 p.m. to 9 p.m., the instructions the ByteDance employees gave throughout the session made their concern for recording quality plain.

Figure: interior view of the Dazhongsi recording studio

In fact, second-tier cities and regions such as Chengdu, Taiyuan, and Guizhou have long served as AI data outsourcing hubs for large companies such as ByteDance, Baidu, and Alibaba. "Last year, data annotation and dialect reading were done by ordinary college students. Now we recruit interns from 211 and 985 universities to lead the outsourcing work," said a product manager at a large-model company.

Yan Junjie, founder of Shanghai-based MiniMax, which launched a video generation model in September, told Zimubang (Alphabet List) that in addition to buying high-quality data from corpus companies, MiniMax also purchases some platform data.

Data, algorithms, and computing power are the three pillars of large AI models, and data is the foundation of model training. But with internet data scattered across different platforms and fenced off behind numerous barriers, the public data available for training large AI models is nearing exhaustion.

In June, the research institute Epoch AI published a study predicting that technology companies will exhaust the publicly available data for training AI language models between 2026 and 2032. As early as May 2023, OpenAI CEO Sam Altman publicly acknowledged that AI companies would use up all the data on the internet in the near future.

How to find high-quality new data to "feed" large models has become a problem shared by every large-model team.

Some large companies have repeatedly landed in disputes over suspected unauthorized use of third-party data. In August, OpenAI was sued by more than 100 YouTube creators who accused it of transcribing millions of YouTube videos without permission to train its models; giants such as Nvidia, Apple, and Anthropic have faced similar accusations.

For large companies, only proprietary, closed-source, high-quality data can guarantee the freshness and quality of what gets fed to their models. Skipping third-party platforms with unreliable quality control and writing the "script" for AI themselves may be the new playbook for large-model makers.

At the beginning of this year, part-time AI recording jobs paying 300 yuan per session quietly appeared on platforms such as Xiaohongshu.

Compared with the part-time AI recording gigs on platforms such as BOSS Zhipin that pay 30 to 55 yuan per hour, a "part-time recording job at a leading manufacturer" paying 300 yuan per session, recorded at Beijing's Dazhongsi, looked quite tempting.

In August, after being pulled into the recording group on WeChat, Zimubang (ID: wujicaijing) found more than 200 people already waiting for their turn to record. Since recording was done in pairs and took up to three hours, the messages that flooded the group were mostly "looking for a partner" and "anyone want to record with me?"

In fact, earning 300 yuan per session as an AI recorder "writing scripts for AI" is not easy.

First, before recording, every applicant must upload a two-to-three-minute conversation recording as a "sample audio," on the basis of which ByteDance's reviewers decide whether to invite the applicant in. Three employees are responsible for this review: if two of them approve a sample, a recording slot can be scheduled directly; if not, the sample goes to cross-review.

After her sample audio passed its second review, Zhang Xue booked a 6-to-9-p.m. recording slot for the second week after submission. In the group chat, many people remained stuck at the sample stage. "The reviewers like people who can chat and love to chat." The demand for lively conversation and substantive topics left plenty of applicants stuck at this first screening threshold.

Caption: the Dazhongsi recording group. Source: Zimubang screenshot

On the night of the recording, Zhang Xue sat behind the studio's glass partition, adjusted herself into the best position for her voice to be captured clearly, and listened through headphones to instructions from ByteDance employees.

The first part was an 80-minute free chat between the two recorders, with no set theme. ByteDance staff required that it not be idle small talk: it had to have substance, no single topic could run longer than 10 minutes, and there could be no long monologues; the conversation had to stay roughly balanced between the two speakers.

Zhang Xue and her partner talked into a huge microphone in the recording studio, trying to keep the conversation going for 80 minutes without stopping, while also keeping their bodies still and suppressing coughs, laughter, and other sounds that would spoil the recording.

To ensure voice quality, ByteDance staff would occasionally cut in over the headphones and call for a retake whenever there was noise or the chat sounded "unnatural and over-directed." The standard for a high-quality take: natural chat, continuous topics, positive emotion without interruptions, real content and no filler. After repeated retakes, the first stage took nearly two hours.

In the second stage, the pair recorded 60 groups of prompted dialogue. Although there was a script for reference, Zhang Xue not only had to adapt the lines to the situation but also had to follow a strict turn-taking pattern: if the previous group of dialogue ended with speaker A, the next group had to open with speaker B.

At the same time, to suit the model's tuning needs, each instruction had to state its prompt phrase clearly and verbatim, such as "Can you be more detailed?" Over the headphones, ByteDance staff made it explicit that the script could be changed but the prompt phrases could not; otherwise, the AI might fail to recognize them.

To guarantee recording quality, any muddled delivery, swallowed words, or flat emotion meant a retake. By the time the session ended and Zhang Xue left Dazhongsi, it was nearly 10 p.m. With each session running three hours, ByteDance staff were recording three sessions a day, and the weekly schedule was almost fully booked.

Beyond Beijing, ByteDance has recruited recorders in Shanghai, Hangzhou, Chongqing, Nanjing, Chengdu, Tianjin, and other cities.

For large-model makers hungry for new data, paying to get data is nothing new.

In 2023, as large AI models became the new wave, big companies not only bought data outright through third-party firms but also created outsourced positions such as "data labeler" and "AI editor."

In 2023, Alin, a student of a less commonly taught language, started "working" for large models through sites such as BOSS Zhipin while preparing for her postgraduate entrance exam.

Through a company called "X Data," Alin checked the text output of large-model image recognition, verifying whether the minority-language text a model read off an image actually matched the image. At a rate of "one word or sentence counts as one box, 0.1 yuan per box," Alin could earn a few dozen yuan at a time by working through hundreds of items.

This year, Alin also took orders from a third-party data company to annotate AI translation data, with the rate rising to more than 1 yuan per line. But to judge by hand whether a model's French or other minority-language translations are accurate, the annotator must not only spot the errors but also mark up the output of five or six different models in different colors. "Sometimes it takes 10 to 15 minutes to read one line."

After working for AI, Alin also found that once these models stray from their original textbook corpora, they begin to falter on new words from social platforms or in-group idioms missing from their own databases. "Restricted by copyright, they cannot learn new text, and translation quality suffers."

Beyond third-party outsourcing firms, large companies have also set up data bases of their own.

Baidu's data bases, for example, sit in non-first-tier cities such as Nanchang, Yangquan, Taiyuan, and Guizhou, where the company handles data collection work such as labeling and dialect reading. It only needs to "recruit some local college students who can operate a computer, at a monthly salary of 3,000 to 5,000 yuan." Meituan has likewise long had its own in-house AI trainers.

However, compared with the big companies willing to spend freely, the "four little dragons" of large models have a much harder time obtaining high-quality data.

"Core closed-source high-quality data is often monopolized by large companies. AI startups, even the four AI unicorns, may only be able to get marginal data," Leo, an algorithm engineer at a large-model maker, told Zimubang.

Because high-quality data can significantly improve model performance, large-model makers need more than open public data to complete training and keep their technology iterating. But such data is often in big companies' hands: domestic news data, for example, is held by Tencent, ByteDance, and other giants, while overseas data is concentrated in sources such as Common Crawl, GDELT, and The Pile.

Overseas, even YouTube announced at the end of June that it would offer licensing agreements to top record labels in exchange for copyrighted music for training. OpenAI has been reaching paid agreements with news publishers such as Politico, The Atlantic, Time, and the Financial Times to use and cite their news material.

Key user data was carved up by the "channel owners," companies such as Tencent, ByteDance, and Meta, as far back as the mobile internet era. If the four AI dragons want a technological breakthrough, they must first pay a hefty "data fee."

For manufacturers in the second half of the large-model race, hallucination is also one of the reasons models collectively seem to lose intelligence, failing even to tell which is bigger, 9.11 or 9.9.

When Zimubang fed the prompt "a little girl holding a ragdoll cat in her arms" to MiniMax's Conch AI, generation took two minutes. In the resulting six-second video, the girl's fingers holding the cat were rich in detail, but the animal in her arms was not a ragdoll cat.

Confronted with the result, a member of MiniMax's video-model staff explained, "That is because there were no ragdoll cats among the labeled cat images used to train the model."

When generated content contradicts real-world facts or the user's input, the model is hallucinating: in effect, talking nonsense. For model makers eager to attract new users, generation quality plainly determines whether a product has a chance of going viral.

"the input command was to extract all entertainment news in august, but the ai ​​generated entertainment news content from august 2019." when using a certain head big model product, loyal user kong fang has caught several moments of ai "talking nonsense", either compiling citations that do not exist at all, or failing to understand new concepts in the past two years. this caused kong fang to have a crisis of confidence in the big model.

Now Kong Fang runs the same question through two or three models from different vendors at once and cross-compares the answers; key details such as dates, quantities, and references he reconfirms through search engines. "AI generation now feels like drawing cards. The results are uncontrollable, and the model easily comes out dumb," Kong Fang said helplessly.

Yet high-quality data may gradually run out. To solve the hallucination problem, knowing what data to "feed" the model is clearly crucial.

A source close to Baidu told Zimubang that large-model makers do buy data directly through third-party companies, which saves time and effort but not worry: the quality of the purchased data, whether text, audio, or video, is uncontrollable.

For the top models actively developing enterprise customers, building customized models for individual clients has become the main revenue source of big companies' AI businesses today. But training such a personalized model means feeding it data screened to correspondingly high standards, and even adjusting data requirements to how well the model is learning at each stage. "It is not as if you can buy a pile of audio and the model will simply learn."

Alin, who did AI translation work for a third-party data company, also found that her employer, as the data supplier, did not seem to genuinely care about the quality of what the large models produced.

Specializing in languages such as French and Spanish, Alin has to compare how five or six models perform at turning minority-language speech into text for the client, yet she is only asked to give a rough score. As for the precise linguistic differences among the five or six outputs, or how they could be improved, the third-party company never asks; it is "indifferent."

The shortage of high-quality data may be why so many users say "every large model generates pretty much the same content," and it is the root cause of users "switching straight to another company the moment one model starts charging."

To users, the domestic models that claim to be catching up with OpenAI and keep iterating technologically may show no substantive differences, giving them no reason for loyalty. That casts a shadow over model makers eager to commercialize.

So even though personally "writing scripts for AI" is time-consuming, laborious, and expensive, ByteDance has blazed a new trail. Foreseeably, to crack the twin problems of commercialization and user growth, spending heavily to "buy data" may well become the new starting point for large-model makers.

(Alin, Kong Fang, and Zhang Xue are pseudonyms.)