
ChatGPT's Advanced Voice Mode is now available, but its Chinese gives it away: it sounds like a foreigner the moment it opens its mouth

2024-07-31


Machine Heart Report

Editor: Danjiang, Xiaozhou

OpenAI's "Her" is finally available to some users.



At its spring launch event in May this year, OpenAI unveiled its new flagship generative model, GPT-4o, along with a desktop app, and demonstrated a series of new capabilities.

Now, OpenAI has announced that it is opening ChatGPT's Advanced Voice Mode to a small number of ChatGPT Plus users, giving them their first access to GPT-4o's hyper-realistic audio responses. These users will see a notice in the ChatGPT app and receive an email with instructions on how to use the feature.

"Since we first demonstrated advanced voice mode, we've been working to enhance the safety and quality of voice conversations in preparation for bringing this cutting-edge technology to millions of people," OpenAI said. The company plans to gradually roll the feature out to all Plus users in the fall of 2024.

Some users have shared their experience with advanced voice mode:

Source: https://x.com/tsarnick/status/1818402307115241608

When you tell ChatGPT a joke, it can laugh along:

Source: https://x.com/yoimnotkesku/status/1818406786077970663

With Advanced Voice Mode, "Her" can add background music while telling a story, and it works across multiple languages.

Source: https://x.com/yoimnotkesku/status/1818415019349901354

French, Spanish and Urdu are also available:

Source: https://x.com/yoimnotkesku/status/1818424494106853438

Its Chinese, however, is not very authentic. It sounds like a "foreigner" learning the language:

Source: https://x.com/yoimnotkesku/status/1818446895083139170

Everyone who heard it was stunned:



The accent problem is not limited to Chinese; it shows up in German as well:



Source: https://x.com/yoimnotkesku/status/1818445235606671670

And finally, a tongue twister:

Source: https://x.com/yoimnotkesku/status/1818427991514337695

OpenAI says Advanced Voice Mode is different from the voice feature ChatGPT currently offers.

ChatGPT's old voice feature chained three separate models: one to transcribe speech to text, GPT-4 to process the prompt, and a third to convert ChatGPT's text reply back into speech. GPT-4o is natively multimodal and handles all of this in a single model, significantly reducing conversation latency. OpenAI also says GPT-4o can perceive the emotional tone of the user's voice, including sadness, excitement, and so on.
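The architectural difference can be sketched in a few lines of Python. This is a toy illustration only, not OpenAI's actual implementation; every function name here is a hypothetical stand-in. The point is simply that the old design incurs three model hops (and discards vocal tone at the transcription step), while an end-to-end multimodal model makes one:

```python
calls = []  # records one entry per model invocation, to compare the two designs

def transcribe(audio):
    """Hypothetical ASR model (hop 1): speech -> text. Vocal tone is lost here."""
    calls.append("asr")
    return f"text({audio})"

def generate_reply(text):
    """Hypothetical LLM (hop 2): text prompt -> text reply."""
    calls.append("llm")
    return f"reply({text})"

def synthesize(text):
    """Hypothetical TTS model (hop 3): text -> speech."""
    calls.append("tts")
    return f"audio({text})"

def cascaded_pipeline(audio_in):
    """Old ChatGPT voice: three chained models, three hops of latency."""
    return synthesize(generate_reply(transcribe(audio_in)))

def end_to_end_model(audio_in):
    """GPT-4o-style: one multimodal model maps audio directly to audio,
    so cues like emotional tone can survive end to end."""
    calls.append("multimodal")
    return f"audio_reply({audio_in})"

cascaded_pipeline("hello.wav")   # 3 model hops
end_to_end_model("hello.wav")    # 1 model hop
print(calls)  # ['asr', 'llm', 'tts', 'multimodal']
```

Each hop in the cascaded design adds its own latency and loses information the downstream models never see, which is why collapsing the pipeline into one model both speeds up the conversation and lets it react to how something is said, not just what is said.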

In May, OpenAI demonstrated GPT-4o's voice capabilities for the first time. The speed of its responses and its striking resemblance to a real human voice stunned the audience, and that is exactly where the trouble began.



The voice, named "Sky," resembles Scarlett Johansson, who voiced the AI assistant in the movie "Her."

Shortly after the demo, Johansson said she had rejected multiple requests from OpenAI CEO Sam Altman to use her voice, and that after seeing the GPT-4o demo she hired legal counsel to defend it. OpenAI denied using Johansson's voice, but removed it from the demo nonetheless.

In June, OpenAI said it would delay the release of Advanced Voice Mode to improve its safety measures.

After a long wait, "Her" has finally arrived. OpenAI says this release of Advanced Voice Mode will be limited to four preset voices created with paid voice actors: Juniper, Breeze, Cove, and Ember.

Notably, these four are the only output voices; the Sky voice shown in OpenAI's May demo is no longer available in ChatGPT. OpenAI spokesperson Lindsay McCallum said: "ChatGPT cannot impersonate other people's voices, including those of private individuals and public figures, and will block outputs that differ from one of these preset voices."

This restriction is intended to head off deepfake controversies. In January this year, voice-cloning technology from AI startup ElevenLabs was used to impersonate US President Biden and mislead primary voters in New Hampshire, drawing considerable controversy.

OpenAI also said it had introduced new filters to block certain requests to generate music or other copyrighted audio.

Over the past year, many image- and music-generation AI companies have been drawn into copyright disputes. Record labels in particular have proven litigious, suing the AI audio generators Suno and Udio. Audio models like GPT-4o open up a whole new category of potential complainants.

OpenAI says it tested GPT-4o's voice capabilities with more than 100 external red-team members across 45 languages, and that more detail will be published in an August report covering GPT-4o's capabilities, limitations, and safety evaluation.

https://twitter.com/OpenAI/status/1818353580279316863

https://www.theverge.com/2024/7/30/24209650/openai-chatgpt-advanced-voice-mode

https://www.reuters.com/technology/openai-starts-roll-out-advanced-voice-mode-some-chatgpt-plus-users-2024-07-30/

https://www.bloomberg.com/news/articles/2024-07-30/openai-begins-rolling-out-voice-assistant-after-safety-related-delay?srnd=phx-technology

https://techcrunch.com/2024/07/30/openai-releases-chatgpts-super-realistic-voice-feature/

https://www.theinformation.com/briefings/after-delay-openai-releases-ai-voice-assistant