GPT-4o version of "Her" is finally here! Telling jokes, learning to meow, how seductive can an AI girlfriend be?

2024-07-31

New Intelligence Report

Editors: Peach, So Sleepy

【New Intelligence Introduction】The GPT-4o voice feature has finally arrived as expected, and the sci-fi film Her has become reality! Netizens included in the gradual (grayscale) rollout have been having a blast, though OpenAI currently provides only four preset voices. In addition, the output token limit of the new GPT-4o model has skyrocketed 16x to 64K.

Altman's promise has finally been fulfilled.

Before the end of July, GPT-4o's voice mode finally entered grayscale testing, and a small number of ChatGPT Plus users have already received access to try it out.

If you see the following interface after opening the ChatGPT App, congratulations, you are one of the first lucky ones.

OpenAI says the advanced voice mode provides more natural, real-time conversations that can be interrupted at will, and it can even sense and respond to your emotions.

It is expected that all ChatGPT Plus users will be able to use this feature this fall.

In addition, more powerful video and screen-sharing features will launch later; that is, by turning on the camera, you will be able to chat with ChatGPT "face to face".

Netizens included in the grayscale rollout have started testing and discovered many use cases for GPT-4o's voice mode.

Some are already using it as a foreign-language speaking coach to teach themselves spoken English.

In the following tutorial, ChatGPT helped netizens correct the pronunciation of Croissant and Baguette.

At the same time, GPT-4o's output token limit has skyrocketed 16x, from the initial 4,000 tokens to 64,000 tokens.

This is the new beta model gpt-4o-64k-output-alpha that OpenAI recently quietly launched on its official website.

A longer output limit means you can get roughly four complete feature-length movie scripts in a single response.

Her has arrived


The reason the GPT-4o voice feature is only being released now is that OpenAI has spent the past few months running safety and quality tests on it.

They tested GPT-4o's speech capabilities in 45 languages with more than 100 red-team members.

To protect people's privacy, the team trained the model to speak using only four "preset voices."

They also built a system to block the output of any voice other than these four.

In addition, content filtering is essential: the team has also taken measures to prevent the generation of violent or copyright-infringing content.

OpenAI announced that it plans to release a detailed report on GPT-4o's capabilities, limitations, and security assessments in early August.

Hands-on tests from across the web

Below are some examples of GPT-4o voice modes shared by netizens.

ChatGPT can perform beatboxing.

ChatGPT also told jokes about beer in shy, angry, and angrier tones.

Some netizens even told ChatGPT a joke of their own: "Why don't scientists trust atoms? Because they make up everything."

ChatGPT laughed awkwardly.

What’s even funnier is that ChatGPT is pretty good at imitating a cat’s meow.

After some testing, some people found that ChatGPT's advanced voice mode is very fast and there is almost no delay in answering.

When asked to imitate some sounds, it can always reproduce the sounds authentically. And it can also imitate different accents.

The video below shows AI acting as a commentator for a football match.

ChatGPT can also tell stories in Chinese, and does so very vividly.

Although OpenAI claims that the video and screen sharing functions will be launched later, some netizens have already used them.

One netizen has a new pet cat. He built it a nest and prepared its food, but had no idea how it was settling in, so he asked ChatGPT.

In the conversation in the video, the netizen showed it the cat's house. After seeing it, ChatGPT commented, "It must be very comfortable" and showed concern for the cat.

The netizen said the cat hadn't eaten yet and he was a little worried. ChatGPT reassured him: "This is normal. It takes time for cats to adapt."

It can be seen that the entire question-and-answer process was very smooth, giving people the feeling of communicating with a real person.

The netizen also found a game console with a Japanese interface, but he himself does not know Japanese.

He showed the game screen to ChatGPT and asked it to translate, and in the end they finished the game together.

I have to say that with the support of vision + voice mode, ChatGPT is much stronger.

GPT-4o Long Output is quietly launched, with output up to 64K


In addition, a version of GPT-4o that supports much larger token output has also arrived.

Just yesterday, OpenAI officially announced that it is providing an alpha version of GPT-4o Long Output to testers, supporting up to 64K output tokens per request, the equivalent of a 200-page novel.
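
For testers with access, the alpha is presumably called like any other chat model. The sketch below is illustrative only, assuming the standard Chat Completions interface applies and that your account has been granted access to the gpt-4o-64k-output-alpha model named above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative only: request a very long completion from the
# Long Output alpha model (access is limited to approved testers).
response = client.chat.completions.create(
    model="gpt-4o-64k-output-alpha",
    messages=[
        {"role": "user", "content": "Draft a full feature-length screenplay."},
    ],
    max_tokens=64000,  # up to 64K output tokens per request
)

print(response.choices[0].message.content)
```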

However, the new model's price has once again hit a new high: $6 per million input tokens and $18 per million output tokens.

Although the output limit is 16x that of standard GPT-4o, the price has also risen accordingly: output costs $3 more per million tokens than standard GPT-4o ($18 vs. $15).

By comparison, gpt-4o-mini's pricing remains very attractive!
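
To put these prices in perspective, here is a quick back-of-the-envelope calculation; the request sizes are hypothetical.

```python
INPUT_PRICE = 6.00 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 18.00 / 1_000_000  # dollars per output token

# Hypothetical request: 64K tokens in, the full 64K tokens out
input_tokens, output_tokens = 64_000, 64_000
cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"${cost:.2f}")  # $1.54 for a single maxed-out call
```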

Researcher Simon Willison said long output is primarily used in data transformation use cases.

For example, when translating a document from one language to another, or extracting structured data from a document, almost every input token needs to be used in the output JSON.

Prior to this, the longest output model he knew was GPT-4o mini, which was 16K tokens.
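
As a sketch of the transformation pattern Willison describes, where the output is roughly as long as the input, consider a JSON-extraction helper like the hypothetical one below; the prompt and helper name are made up for illustration.

```python
from openai import OpenAI

client = OpenAI()

def extract_records(document: str) -> str:
    """Hypothetical helper: pull structured data out of a long document.
    The output JSON is roughly as long as the input, so the 64K output
    limit is what makes a single-call transformation possible."""
    response = client.chat.completions.create(
        model="gpt-4o-64k-output-alpha",  # alpha model named in the article
        messages=[
            {"role": "system",
             "content": "Return JSON: an object with a 'records' array "
                        "containing every record found in the document."},
            {"role": "user", "content": document},
        ],
        response_format={"type": "json_object"},  # JSON mode
        max_tokens=64000,
    )
    return response.choices[0].message.content
```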

Why launch a model with longer output?

Obviously, longer output allows GPT-4o to provide more comprehensive and detailed responses, which is very helpful in some scenarios.

For example, writing code and improving writing.

This is also an adjustment OpenAI made based on user feedback: users need longer outputs to cover their use cases.

Difference between context and output

Since its launch, GPT-4o has provided a maximum context window of 128K. For GPT-4o Long Output, the maximum context window is still 128K.

So how did OpenAI increase the number of output tokens from 4,000 to 64,000 while keeping the overall context window at 128K?

This is because OpenAI initially limited the number of output tokens to a maximum of 4,000 tokens.

This means that a user can use up to 124,000 tokens as input in one interaction, and can only get up to 4,000 output tokens.

Of course, you can also input more tokens, which means fewer output tokens.

After all, the total context length (128K) is fixed; no matter how the input changes, the output will never exceed 4,000 tokens.

Now, OpenAI limits the output token length to 64,000 tokens, which means that you can output 16 times more tokens than before.

After all, output tokens require more computation, which is why the price increase is steeper for output.

Similarly, for the latest GPT-4o mini, the context is also 128K, but the maximum output has been increased to 16,000 tokens.

So, a user can provide up to 112,000 tokens as input and ultimately get a maximum of 16,000 tokens as output.

In general, the solution OpenAI offers here is to cap input tokens in exchange for longer responses from the LLM, rather than directly expanding the context length.
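
The trade-off is simple arithmetic over a fixed window; here is a minimal sketch using the limits reported above.

```python
# Input and output share one fixed context window
CONTEXT_WINDOW = 128_000

# Maximum output limits as reported for each model
MAX_OUTPUT = {
    "gpt-4o": 4_000,
    "gpt-4o-mini": 16_000,
    "gpt-4o-64k-output-alpha": 64_000,
}

for model, max_out in MAX_OUTPUT.items():
    max_input = CONTEXT_WINDOW - max_out
    print(f"{model}: up to {max_input:,} input tokens "
          f"when requesting the full {max_out:,}-token output")
```

Running it reproduces the figures above: 124,000 input tokens for GPT-4o, 112,000 for GPT-4o mini, and 64,000 for the Long Output alpha.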

As for other models on the market, context windows range from over one million tokens (Gemini) at the top end down to 200K (Claude), and some even claim output lengths of 200K, yet OpenAI is still stuck at 128K.

This poses a dilemma for developers: if you want more input, you have to accept less output; if you want more output, you have to provide less input.

How to strike the balance depends on which side you are willing to sacrifice...

References:

https://x.com/OpenAI/status/1818353580279316863

https://x.com/tsarnick/status/1818402307115241608

https://x.com/kimmonismus/status/1818409637030293641

https://www.reddit.com/r/singularity/comments/1eg51gz/chatgpt_advanced_audio_helping_me_pronouce/

https://venturebeat.com/ai/openai-launches-experimental-gpt-4o-long-output-model-with-16x-token-capacity/