
GPT-4o turns from spot into futures: what is holding OpenAI back?

2024-07-16



RTC technology is one of the keys to the popularization of real-time AI.


Author | ray
Editor | Jingyu

Her is moving from movies to reality.

In May this year, OpenAI released its latest multimodal AI model, GPT-4o. Compared with the earlier GPT-4 Turbo, GPT-4o is twice as fast and half the price. Average latency for real-time AI voice interaction has dropped to 320 milliseconds, from 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) in previous versions, roughly the response speed of everyday human conversation.

Beyond the gains in efficiency, emotional analysis during conversation became one of the headline features of this update. In a conversation with the presenter, the AI could hear the "nervousness" in his voice and suggested he take a few deep breaths.

OpenAI is becoming the "creator" of silicon-based life in the era of large models.

The launch event was impressive, but the reality is bleaker. When it comes to actually shipping products, OpenAI, the initiator of this large-model revolution, is gradually turning into a "futures" company.

Since the release of GPT-4o, which touts all-round, low-latency capability, the rollout of its real-time audio and video features has been repeatedly delayed; the release of Sora, its video multimodal product, has been delayed as well.

But this is not just OpenAI's problem. After the release of ChatGPT, numerous domestic ChatGPT clones appeared in China, yet only one can truly compete with GPT-4o, namely SenseTime's SenseNova 5.5, and even it has only progressed as far as a public beta promised within the month.

Why is it that at the press conference the real-time multimodal large model seems only one step away from changing the world, yet on the way to an actual product it always turns from "spot" into "futures"?

A new view is emerging: in the multimodal world, perhaps brute force works no miracles.

01

Real-time voice,

a road AI commercialization must travel

The maturity of technology is helping a brand new blue ocean industry to take shape.

Data from a16z, a well-known Silicon Valley venture capital firm, shows that among the top 50 AI applications worldwide, 9 are companion products. Data from AI product rankings show that visits to AI companion apps reached 432 million in May this year, up 13.87% year on year.

High demand, high growth, and a large addressable market: AI companionship is driving a dual transformation of business models and human-computer interaction.

The maturing business is in turn forcing the technology to keep advancing. In the first half of this year alone, real-time AI voice technology went through three iterations in just six months.

The representative product of the first wave of technology is Pi.

In March this year, startup Inflection AI updated Pi, an emotional chatbot for individual users.

Pi's interface is very simple: text plus a dialog box forms the core interaction, with added AI voice features such as read-aloud and phone calls.

To achieve this kind of voice interaction, Pi relies on the traditional three-step voice pipeline of STT (speech-to-text), LLM (large-model semantic analysis), and TTS (text-to-speech). The technology is mature, but responses are slow, key information such as tone is lost along the way, and real-time voice dialogue is out of reach.
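
As a rough illustration (not any vendor's actual API), the three-step pipeline can be sketched in a few lines. The placeholder functions below are hypothetical; the point is simply that each stage is a separate call whose latencies add up, while tone and prosody are discarded as soon as speech becomes text.

```typescript
// Hypothetical stand-ins for vendor APIs (assumptions, not real endpoints).
declare function sttTranscribe(audio: ArrayBuffer): Promise<string>;
declare function llmRespond(prompt: string): Promise<string>;
declare function ttsSynthesize(text: string): Promise<ArrayBuffer>;

// One conversational turn in the traditional three-step pipeline:
// each await is a separate model call, so latencies stack up, and any
// tone, pitch, or speaking-rate information is lost at the STT step.
async function voiceTurn(userAudio: ArrayBuffer): Promise<ArrayBuffer> {
  const text = await sttTranscribe(userAudio);   // STT: speech -> text
  const reply = await llmRespond(text);          // LLM: text -> text
  return ttsSynthesize(reply);                   // TTS: text -> speech
}
```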

Another notable product released around the same time is Call Annie. Compared with Pi, Call Annie offers a full video-call experience: besides answering and hanging up calls, it can keep listening while minimized so users can switch to other apps, and it supports more than 40 conversational personas.

However, they all share the same technical problems: high latency and a lack of emotional color. On latency, even OpenAI, the most advanced player in the industry, saw delays of 2.8 seconds (GPT-3.5) to 5.4 seconds (GPT-4). On emotion, information such as tone, pitch, and speaking rate is lost during the interaction, and advanced vocal expression such as laughter or singing is out of the question.

The standard-bearer of the next wave is a product called EVI.

Launched by Hume AI in April this year, it helped the company raise US$50 million (approximately RMB 362 million) in Series B financing.

In terms of product design, Hume AI added a Playground at the algorithm layer: users can pick the configuration and choose the underlying large model themselves, and besides the official default they can select Claude, GPT-4 Turbo, and others. The real difference, though, is that the voice carries emotion, with changes in rhythm and intonation in its delivery.

This is achieved mainly by adding a new SST (semantic space theory) algorithm to the traditional STT-LLM-TTS three-step process. Through extensive data collection and advanced statistical models, SST maps the full spectrum of human emotion and the continuity between emotional states, giving EVI many anthropomorphic traits.
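
Conceptually, this amounts to bolting an expression-analysis branch onto the earlier pipeline sketch. The snippet below is a hypothetical illustration, not Hume AI's real API: emotion scores measured from the raw audio are passed forward so that the reply and the synthesized voice can carry prosody instead of losing it at the STT step.

```typescript
// Hypothetical extension of the three-step pipeline with an emotion-analysis branch.
// analyzeExpression, sttTranscribe, llmRespond, and ttsSynthesizeExpressive are
// illustrative stand-ins, not Hume AI's actual API.
declare function analyzeExpression(audio: ArrayBuffer): Promise<Record<string, number>>; // e.g. { calm: 0.7, excited: 0.2 }
declare function sttTranscribe(audio: ArrayBuffer): Promise<string>;
declare function llmRespond(prompt: string, emotions: Record<string, number>): Promise<string>;
declare function ttsSynthesizeExpressive(text: string, emotions: Record<string, number>): Promise<ArrayBuffer>;

async function expressiveVoiceTurn(userAudio: ArrayBuffer): Promise<ArrayBuffer> {
  // Run transcription and expression analysis on the same audio in parallel.
  const [text, emotions] = await Promise.all([
    sttTranscribe(userAudio),
    analyzeExpression(userAudio),
  ]);
  const reply = await llmRespond(text, emotions);   // reply conditioned on the user's emotional state
  return ttsSynthesizeExpressive(reply, emotions);  // prosody and intonation follow the detected emotion
}
```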

The price of this emotional progress is a further sacrifice in latency: when talking to EVI, users wait even longer than with Pi or Call Annie.

In mid-May, GPT-4o was released, and fusing modalities into a single model became the technical direction of this phase.

Compared to previous three-step voice interaction products, GPT-4o is a new model trained end-to-end across text, vision, and audio, which means that all inputs and outputs are processed by the same neural network.

The latency problem improved dramatically. OpenAI officially states that GPT-4o can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds. On the emotional side, interaction between users and the AI feels increasingly natural, with the model varying its speaking rate and understanding emotion.

At the product level, humans falling in love with an AI, or an AI seeing the world on behalf of the blind, becomes possible.

Character.ai, a rising star in Silicon Valley in 2024 that recently launched a voice call function, has become the biggest beneficiary of this technological wave.

In Character.ai, users can text with replicas of anime characters, TV celebrities, and historical figures in ultra-realistic role-play. The novel premise has driven a surge in users. According to data from Similarweb, Character.ai handles 20,000 AI inference requests per second, and its visits in May reached 277 million.


Traffic comparison between character.ai and perplexity.ai | Image source: Similarweb

At the same time, Microsoft, Google and others have officially announced that their large models will launch real-time voice call functions.

However, product designs that look watertight on stage leak like the Three Gorges spillway in practice: in this third wave, the near-"Her" companion products shown at launch events have all ended up "planned", "coming soon", or stuck in internal testing.

One conclusion is hardly in doubt: real-time audio and video may become the ultimate form of human-computer interaction. Beyond AI companionship, scenarios such as intelligent NPCs in games, AI spoken-language tutors, and real-time translation are all expected to take off. But before that, solving the last mile from press conference to shipped product is the industry's hardest problem.

02

AI real-time voice,

brute force works no miracles

"For AI real-time voice, brute force works no miracles": a pessimistic view is quietly spreading through Silicon Valley.

Resistance comes from all aspects of technology, regulation and business.

The spiritual leader of the technical opposition is Yann LeCun, the "father of convolutional networks."

In his view, compared with past AI algorithms, the defining feature of large-model technology is that "brute force works miracles": fed with massive data, hundreds of millions of parameters, and high-performance compute clusters, the algorithms can tackle more complex problems and scale further. But he argues we are currently too optimistic about large models; in particular, the idea that multimodal large models could be world models is, in his words, nonsense.

For example, humans have five senses, which together build our real understanding of the world. An LLM trained on massive amounts of Internet text lacks observation of, and interaction with, the physical world, and lacks sufficient common sense. As a result, the videos or voices it generates often have content, motion trajectories, or vocal emotion that look seamless yet lack realism. Hard physical limits are another problem: as model sizes and interaction dimensions keep growing, today's large models lack the bandwidth to process all that information.

At the regulatory level, AI real-time speech, that is, the end-to-end speech model, faces a tug-of-war between technology and ethics.

The traditional three-step STT-LLM-TTS approach arose in the first place because the technology was immature; evolving to an end-to-end voice model demands further breakthroughs in model architecture, training methods, and multimodal interaction. At the same time, because voice is harder to moderate than text, AI voice is easily abused for phone fraud, pornography, and spam marketing. The intermediate text step has therefore become, to some extent, necessary for content review.

At the commercial level, training an end-to-end audio-and-video large model requires vast amounts of YouTube and podcast data. The cost is dozens of times that of past text-only models or more, and a single training run starts at tens of millions of US dollars.

For an ordinary AI company, even money falling from the sky would not cover costs like these; on top of that come NVIDIA's high-end AI accelerators, massive storage, and an endless supply of rights-cleared audio and video.

Of course, neither Yann LeCun's technical judgment, nor the potential regulatory difficulties, nor the cost dilemma of commercialization is the core issue for OpenAI.

The fundamental reason GPT-4o-class real-time AI voice products turn from spot into futures lies at the level of engineering implementation.

03

GPT-4o, demoed with a network cable plugged in,

still needs a good RTC assist

An open secret in the industry is that GPT-4o-class real-time AI voice products are only halfway there at the engineering level.

At the GPT-4o launch, even as low latency was being touted, a sharp-eyed viewer noticed that the phone in the demo video was plugged into a network cable. This suggests the official 320 ms average latency for GPT-4o is likely a lab figure, achievable only under the ideal conditions of a demo with fixed hardware, a fixed network, and a fixed scenario.


The network cable plugged into the phone is clearly visible at the OpenAI GPT-4o launch | Image source: OpenAI

Where is the problem?

Technically, collapsing the three algorithmic steps into one is only one of the core links in real-time AI voice calls. The other core link, RTC communication, faces its own series of technical challenges. RTC (real-time communication) can be simply understood as the transmission and exchange of audio and video over a real-time network; it is the technology underpinning real-time voice, real-time video, and similar interactions.

Chen Ruofei, head of audio technology at Agora, told Geek Park that in real-world scenarios, users are rarely on fixed devices, fixed networks, or in fixed physical environments. In everyday video calls, a poor network on one side causes stuttering speech and high latency; the same happens in AI real-time voice calls, so low-latency transport and strong network optimization are crucial for RTC.

In addition, multi-device adaptation and audio signal processing are also technical links that cannot be ignored in the implementation of AI real-time voice.

How to solve these problems?

One answer can be found in OpenAI's latest job postings: it specifically says it is looking for engineers to help deploy its most advanced models into RTC environments.

As for the specific solution, the RTC technology GPT-4o uses is an open-source stack based on WebRTC, which addresses, to a degree, latency, packet loss across varying network conditions, communication security, and cross-platform compatibility.
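
For orientation, the core browser-side WebRTC flow looks roughly like the sketch below. It uses the standard RTCPeerConnection and getUserMedia APIs, with a hypothetical signal() helper standing in for the signaling channel (which WebRTC deliberately leaves to the application). It illustrates the open-source building blocks, not how OpenAI actually wires GPT-4o.

```typescript
// Minimal browser-side WebRTC sketch: capture the microphone and send it to a peer.
// signal() is a hypothetical signaling helper (e.g. over a WebSocket); WebRTC itself
// does not define signaling, so every product supplies its own.
declare function signal(message: object): Promise<RTCSessionDescriptionInit>;

async function startVoiceSession(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }], // public STUN server for NAT traversal
  });

  // Capture the microphone and add the audio track to the connection.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  // Standard offer/answer exchange; the remote side answers via the signaling channel.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const answer = await signal({ type: "offer", sdp: offer.sdp });
  await pc.setRemoteDescription(answer);

  return pc;
}
```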

However, the flip side of open source is weak productization.

Take multi-device adaptation as a simple example. RTC is used mostly on mobile phones, but phones vary enormously in their communication and audio-capture capabilities. iPhones can hold a stable delay of a few tens of milliseconds, but the Android ecosystem is far more complex: there are many models, and the gap between high-end and low-end devices is obvious. On some low-end models, capture plus transmission delay can reach hundreds of milliseconds.

Another example: in AI real-time voice scenarios, the human voice is often mixed with background noise, so sophisticated signal processing is needed to remove noise and echo and deliver clean, high-quality voice input that the AI can actually understand.
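
In a browser, stock WebRTC exposes only coarse controls for this. The getUserMedia constraints below ask for the built-in echo cancellation and noise suppression, and they are hints rather than guarantees, which is precisely why vendors layer their own signal-processing pipelines on top.

```typescript
// Requesting the browser's built-in audio processing via standard getUserMedia constraints.
// These are hints, not guarantees: actual behavior depends on the browser and device,
// which is why stock WebRTC alone is rarely enough for production-grade voice input.
async function captureCleanAudio(): Promise<MediaStream> {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,  // acoustic echo cancellation
      noiseSuppression: true,  // background noise suppression
      autoGainControl: true,   // normalize input volume across devices
      sampleRate: 48000,       // preferred capture rate (a hint; the device may ignore it)
    },
  });
}
```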

Multi-device compatibility and advanced audio noise reduction capabilities are exactly what open source WebRTC lacks.

Industry experience is the bottleneck in putting open-source components to work. Compared with pure open-source solutions, large-model vendors partnering with professional RTC providers to jointly polish and optimize arguably better represents where the industry is heading.

In the RTC field, Agora is the most representative vendor; it was once widely known for providing the audio technology behind Clubhouse. According to Agora's official website, more than 60% of the world's pan-entertainment apps use Agora's RTC services. Besides well-known Chinese apps such as Xiaomi, Bilibili, Momo, and Xiaohongshu, companies around the world have adopted Agora's RTC technology, including Yalla, the largest voice-based social and entertainment platform in the Middle East and North Africa, Kumu, the "king of social live-streaming platforms" in Southeast Asia, HTC VIVE, The Meet Group, and Bunch.


Its accumulated industry experience and the polish earned from global customers attest to its technical lead. According to Chen Ruofei, Agora's SD-RTN™ real-time transmission network covers more than 200 countries and regions, with global end-to-end audio and video latency averaging 200 ms. Against network fluctuation, Agora's intelligent routing and weak-network resilience algorithms keep calls stable and smooth. Against device differences, Agora has accumulated know-how from hundreds of millions of app installs worldwide and from adapting to complex environments.

In addition to technological leadership, industry experience is an invisible barrier.

In fact, this is why the competitive landscape of the RTC industry has stayed relatively stable over the years: doing RTC well has never relied on the large-model creed that "brute force works miracles".

The only way to achieve ultimate optimization of voice latency and widespread commercial use of real-time voice interaction is through long-term and meticulous work.

From this perspective, AI real-time voice interaction is a campaign whose promise and difficulty are both easy to underestimate.

On the road ahead, algorithms, content review, RTC and more all have to be conquered. To finish this long march, the industry must look up at the technological sky while keeping its feet on the ground of engineering.

*Header image source: Visual China

This article is an original article from Geek Park. For reprinting, please contact Geek Jun on WeChat: geekparkGO

Geek Question

What AI companion apps have you used?


