
OpenAI's "Her" was delayed in production. What held it back?

2024-07-27


Mengchen, from Aofei Temple
Quantum Bit | Official Account QbitAI

Two months and two weeks have passed, and we still haven't seen the public release of OpenAI's "Her".

On May 14, OpenAI released GPT-4o and an end-to-end real-time audio and video conversation mode, demonstrating on stage that a conversation with the AI can be as smooth as one with a person.

It can sense your breathing rhythm, respond in real time with a richer tone than before, and even let you interrupt the AI at any time. The effect was striking.

But amid the anticipation, news of a delay came out.



What is holding OpenAI back? Based on what is known so far:

There are legal disputes: OpenAI must ensure the voice will not trigger another controversy like the one with "Black Widow" actress Scarlett Johansson.

There are also safety concerns: alignment needs to be done well, because real-time audio and video conversation opens up new usage scenarios, and being used as a fraud tool is one of them.

However, in addition to the above, are there any other technical problems and difficulties that need to be overcome?

After the initial excitement, industry insiders began to look past the surface.

Sharp-eyed netizens may have noticed that the phone used in the press conference demo had a wired network cable plugged in.



In the eyes of industry insiders, the GPT-4o launch demo went smoothly, but it still carried several major constraints:

It required a "fixed network, fixed device, and fixed physical environment".

After the public release, it remains unknown whether global users can get the same experience as at the launch conference.

There was an interesting detail at the press conference: researcher Barret Zoph was mistaken for a table by ChatGPT during the video-call demonstration.



The delay in the video-call part is obvious: the audio had already been processed while the visual side was still working on the previous shot, the wooden table the camera captured just as the phone was picked up.

Now imagine: once it is finally released, in what scenarios will people actually use it?

One of the most talked-about cases in the promotional video was a blind man hailing a taxi with the help of AI voice guidance, which netizens discussed for quite a while.



However, note that this is a scenario that relies heavily on low latency. If the AI's guidance had come a moment later, the taxi would have already driven past.



The network signal in outdoor scenarios cannot be guaranteed to be stable, let alone at airports, train stations, and tourist attractions, where crowds of people and devices compete for bandwidth and make things even harder.

In addition, outdoor scenes bring noise problems.

Large models are already plagued by "hallucination". If noise corrupts recognition of the user's voice and words irrelevant to the command slip in, the answer can end up completely different.

Finally, there is another issue that is easily overlooked: multi-device adaptation.

It can be seen that the current OpenAI press conferences and promotional videos all use the new iPhone Pro.

Whether the same experience can be obtained on lower-end models will have to wait until the official release to be revealed.



OpenAI touts GPT-4o as able to respond to audio input in as little as 232 milliseconds, 320 milliseconds on average, which is on par with human response time in conversation.

But that is only the time from the large model's input to its output, not the whole system's.
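As a rough illustration, the latency a user actually feels stacks capture, encoding, transport, and decoding on top of the model's response time in both directions. In the sketch below, only the 320 ms average comes from OpenAI's announcement; every other component number is an assumed placeholder.

```python
# Illustrative end-to-end latency budget for one voice round trip.
# Only the 320 ms model figure is from OpenAI; all other values are assumptions.
budget_ms = {
    "capture_and_preprocess": 20,   # mic capture, noise suppression (assumed)
    "encode": 10,                   # speech encoding/compression (assumed)
    "uplink_transport": 80,         # client -> server over the public internet (assumed)
    "decode": 5,                    # server-side decoding (assumed)
    "model_response": 320,          # GPT-4o's advertised average audio response time
    "downlink_return": 95,          # encode + transport + decode back to the client (assumed)
}

total = sum(budget_ms.values())
print(f"perceived round-trip latency ≈ {total} ms")  # ≈ 530 ms in this sketch
```

Even in this optimistic sketch, the non-model stages add roughly another 200 ms, which is why the rest of the chain matters as much as the model itself.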

In short, just doing a good job on AI is not enough to create a smooth experience like "Her". It also requires a series of capabilities such as low latency, multi-device adaptation, and the ability to cope with various network conditions and noisy environments.

AI alone cannot make "Her"

Achieving low latency, multi-device adaptation, and the rest relies on RTC (Real-Time Communication) technology.

Before the AI era, RTC technology was already widely used in scenarios such as live streaming and video conferencing, and had matured considerably.

From the RTC perspective, the user's voice prompt has to pass through a whole chain of complex steps before it is fed into the large model.

Signal acquisition and preprocessing: on mobile phones and other client devices, the user's voice is captured as a raw signal, then put through noise suppression, echo cancellation, and similar processing to prepare it for recognition.

Speech encoding and compression: to save transmission bandwidth, the voice signal is encoded and compressed. Redundancy and error-correction mechanisms are also added adaptively, according to actual network conditions, to resist packet loss.

Network transmission: the compressed voice data is split into packets and sent to the cloud over the Internet. If the server is physically far away, the transmission often passes through multiple nodes, and each hop may introduce delay and packet loss.

Speech decoding and restoration: once the packets arrive at the server, the system decodes them and restores the original voice signal.

Finally, it is the AI's turn: the speech signal must first be converted into tokens by an embedding model before the end-to-end multimodal large model can truly understand it and generate a response.

Of course, after the large model generates a response, everything goes through the reverse process, and the response audio is finally transmitted back to the user.
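Read as code, the chain above looks roughly like the following sketch. It is a minimal outline for orientation only: every function here (`capture_audio`, `encode`, `speech_to_tokens`, and so on) is a simplified placeholder, not any vendor's actual API.

```python
# Minimal sketch of the uplink half of an AI voice-call pipeline.
# All functions are simplified stand-ins, not a real RTC or model interface.

def capture_audio() -> bytes:
    """Raw PCM frames from the microphone (stubbed with silence here)."""
    return b"\x00" * 960  # 10 ms of 48 kHz mono 16-bit audio

def preprocess(pcm: bytes) -> bytes:
    """Noise suppression and echo cancellation would run here."""
    return pcm

def encode(pcm: bytes, loss_rate: float) -> bytes:
    """Compress the frame; add redundancy (FEC) as packet loss rises."""
    fec_copies = 1 if loss_rate < 0.05 else 2
    return pcm[:160] * fec_copies  # stand-in for a real codec such as Opus

def send_over_network(packet: bytes) -> bytes:
    """Packetize and transmit to the cloud; may cross several relay hops."""
    return packet

def decode(packet: bytes) -> bytes:
    """Server side: reassemble packets and restore the speech signal."""
    return packet

def speech_to_tokens(pcm: bytes) -> list[int]:
    """Embedding step: turn audio into tokens the multimodal model consumes."""
    return list(pcm[:8])

def run_uplink() -> list[int]:
    frame = preprocess(capture_audio())
    packet = encode(frame, loss_rate=0.02)
    received = decode(send_over_network(packet))
    return speech_to_tokens(received)

print(run_uplink()[:4])
```

The downlink simply reverses these stages: the model's output is synthesized to audio, encoded, transmitted, and decoded on the user's device.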



Throughout the entire process, every link needs to be optimized to truly make AI audio and video conversations real-time.

After all, compression and quantization of the large model itself affect its capability, so jointly optimizing it together with factors such as audio signal processing and network packet loss is particularly important.

It is understood that OpenAI did not solve this problem independently, but chose to cooperate with a third party.

Its partner is the open-source RTC vendor LiveKit, which has become an industry focus by supporting ChatGPT's voice mode.



In addition to OpenAI, LiveKit has also cooperated with related AI companies such as Character.ai and ElevenLabs.

Apart from a few giants such as Google with relatively mature in-house RTC technology, partnering with RTC vendors who specialize in the field is currently the mainstream choice for players who want real-time audio and video conversations with AI.

Of course, domestic players are also involved in this wave. Many domestic AI companies are already stepping up the research and development of end-to-end multimodal large models and AI real-time audio and video dialogue applications.

Can domestic AI applications catch up with OpenAI’s results? When can everyone really experience it for themselves?

Since these projects are mostly at an early stage, not much has been disclosed publicly, but their RTC partner Agora turned out to be a way in.

Quantum Bit learned from Agora that with the current level of domestic technology, the latency of one conversation round can be brought down to about 1 second, and with further optimization techniques, smooth, responsive conversation is entirely achievable.

With RTC done well, AI is more than just "Her"

Who is Agora?

A representative company in the RTC industry, it went public in 2020 as the world's first listed real-time interactive cloud service provider.

Agora last drew wide attention when it provided the technology behind the hit audio social app Clubhouse.

In fact, many well-known applications such as Bilibili, Xiaomi, and Xiaohongshu have chosen Agora's RTC solution, and its overseas business has also developed rapidly in recent years.

So, for AI real-time audio and video conversation applications, how can the difficulties of low latency and multi-device adaptation be solved, and what results can be achieved?

We invited Zhong Sheng, Chief Scientist and CTO of Agora, to answer this question.

According to Zhong Sheng, not counting large-model inference, the round trip of the signal over the network link can be as short as 70-300 milliseconds.

Specifically, the optimization is mainly carried out from three aspects.

First, Agora has built more than 200 data centers around the world, and when a connection is established, the entry point chosen is the one closest to the end user.

Combined with intelligent routing technology, when a line is congested, the system can automatically select other paths with better latency and bandwidth to ensure communication quality.

If no cross-region transmission is involved, end-to-end transmission can take less than 100 ms; with cross-region transmission, for example from China to the United States, it is more like 200-300 ms.
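A simplified way to picture that entry-node selection is sketched below; the node names and the RTT probe are invented for illustration and are not Agora's actual routing logic.

```python
# Illustrative "pick the lowest-latency entry node" logic; nodes and RTTs are invented.
import random

CANDIDATE_NODES = ["sh-01", "sg-02", "us-west-03", "eu-central-04"]

def probe_rtt_ms(node: str) -> float:
    """Stand-in for a real probe (e.g., a small ping to each candidate node)."""
    return random.uniform(20, 300)

def choose_entry_node(nodes: list[str]) -> str:
    """Connect through whichever node currently answers fastest."""
    return min(nodes, key=probe_rtt_ms)

print("selected entry node:", choose_entry_node(CANDIDATE_NODES))
```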

Second, Agora, founded in 2014, mines the massive amount of real-world data it has accumulated over the years to analyze various weak-network scenarios and reproduce them in the lab. This gives transmission-algorithm optimization a "firing range" so the algorithms can cope with complex, changing network environments; it also lets the system adjust its transmission strategy promptly when a corresponding weak-network pattern appears during real-time transmission, making transmission smoother.
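A toy version of that real-time adjustment might look like the following; the thresholds and parameter values are illustrative assumptions, not Agora's actual strategy.

```python
# Toy adaptive-transmission policy: trade bitrate for redundancy as the network degrades.
# All thresholds and values are illustrative assumptions.
def transmission_strategy(packet_loss: float, rtt_ms: float) -> dict:
    if packet_loss > 0.20 or rtt_ms > 400:
        # Severe weak-network mode: shrink the stream and protect it heavily.
        return {"audio_bitrate_kbps": 16, "fec_ratio": 0.5, "jitter_buffer_ms": 200}
    if packet_loss > 0.05:
        # Mild loss: add moderate forward error correction.
        return {"audio_bitrate_kbps": 24, "fec_ratio": 0.2, "jitter_buffer_ms": 120}
    # Healthy network: favor quality and low delay.
    return {"audio_bitrate_kbps": 48, "fec_ratio": 0.0, "jitter_buffer_ms": 60}

print(transmission_strategy(packet_loss=0.12, rtt_ms=180))
```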

Third, for vertical industries and specific tasks, Agora is also experimenting with customized smaller-parameter models to shorten large-model response time. How far language and voice models of a given size can be pushed is worth exploring, and it is critical to the cost-effectiveness and low-latency experience of conversational AI and chatbots.

Finally, the RTC SDK developed by Agora is adapted and optimized for different terminal devices, especially lower-end models, achieving low power consumption, low memory usage, and an extremely small package size. In particular, on-device AI-based voice noise suppression, echo cancellation, and video quality enhancement directly affect the reach and effectiveness of an AI chatbot.

Zhong Sheng also introduced that in the process of exploring the combination of RTC and large model technology, the scope of RTC technology itself is also changing.

He offered some of his own ideas, such as transmitting tokens the large model can directly consume instead of audio signals, or even running speech-to-text (STT) and emotion recognition on the device so that only text and the associated emotion parameters need to be transmitted.

This way, more of the signal processing can be moved onto the device, and the embedding model, which needs less compute, can sit closer to the user, reducing the bandwidth requirements of the whole pipeline and the cost of the cloud model.
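Sketched out, the idea is that the device uplinks compact text plus a few emotion parameters instead of a continuous audio stream. The payload format below is hypothetical, purely to show the bandwidth difference.

```python
# Hypothetical uplink payload when speech-to-text and emotion recognition run on the device.
import json

def on_device_understanding(pcm: bytes) -> dict:
    """Stand-in for an on-device STT + emotion-recognition model."""
    return {"text": "please book a taxi to the airport",
            "emotion": {"arousal": 0.4, "valence": 0.7}}

def build_uplink_payload(pcm: bytes) -> bytes:
    result = on_device_understanding(pcm)
    return json.dumps(result).encode("utf-8")  # a few hundred bytes vs. kilobytes of audio

payload = build_uplink_payload(b"\x00" * 960)
print(len(payload), "bytes sent uplink")
```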

From this point of view, Zhong Sheng believes that the final form of the combination of AI and RTC technology will move towards end-cloud integration.

In other words, we cannot rely entirely on large models in the cloud. This is not the best choice in terms of cost, energy consumption, and latency experience.

Under the concept of end-cloud integration, the entire infrastructure needs to change accordingly: compute lives not only in the cloud but also on the phone, edge transmission nodes will carry compute as well, and the data transmission protocols will change along with it.

At present, Agora and large-model application vendors have explored three cooperation modes, that is, three different ways of supplying the parts of the whole system: the large model, RTC, and the cloud servers (summarized in the sketch after the list):

  • Private deployment: Agora provides only the RTC SDK, which is deployed in the partner's own data center alongside the large model. Suitable for companies with their own large models or their own inference infrastructure.
  • Agora cloud platform: Agora provides the RTC SDK plus cloud server resources, letting developers flexibly choose models, deployment locations, and compute as needed, so AI voice applications can be built quickly without building infrastructure.
  • Agora end-to-end solution: Agora provides its own large models, the RTC SDK, and cloud server resources. It can customize vertical models for segments such as education, e-commerce, social entertainment, and customer service, deeply integrated with RTC capabilities as a single voice-interaction solution.
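As a rough mental model, the three modes differ only in who supplies each layer. The structure below is invented for illustration and is not an actual Agora configuration schema.

```python
# Illustrative summary of the three cooperation modes; not a real configuration format.
COOPERATION_MODES = {
    "private_deployment":   {"rtc_sdk": "Agora", "large_model": "customer",          "cloud_servers": "customer"},
    "agora_cloud_platform": {"rtc_sdk": "Agora", "large_model": "customer-selected", "cloud_servers": "Agora"},
    "agora_end_to_end":     {"rtc_sdk": "Agora", "large_model": "Agora",             "cloud_servers": "Agora"},
}

for mode, parts in COOPERATION_MODES.items():
    print(mode, parts)
```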

Moreover, among the existing cooperative projects, the fastest-moving applications should not be far from being available to everyone.

In its exchange with Agora, Quantum Bit also noticed another new trend worth watching:

Domestic AI applications are gradually moving beyond the scope of AI assistant question-and-answer and AI emotional companionship.

Take social entertainment, e-commerce live streaming, and online education as examples. What everyone cares about most are the star influencers and star teachers; digital humans driven by AI real-time audio and video dialogue can become their "digital avatars" and interact one-on-one with every fan or student.

At the same time, users themselves have limited time and energy and cannot be in several places at once, so they also want AI avatars of their own. As the technology matures, the avatar experience improves, and costs fall, the range of applications will keep expanding.

To quote Zhong Sheng, "the ultimate, scarcest thing for people is time":

We have probably all had this experience: what do you do when two meetings conflict and you can only attend one?

You may attend one event yourself and send your AI assistant to attend another event to bring back interesting information. In the future, this assistant may even be your own AI avatar, which can have personalized communication during the event, ask or answer a variety of questions based on your own interests and concerns, and interact with other people or their avatars.

Therefore, AI real-time audio and video conversation can do much more than just "Her".