
It’s getting fierce: GPT-4o defeated by Google’s new model. ChatGPT: “Everyone take a deep breath”

2024-08-02


It's getting fierce: GPT-4o has been surpassed by Google's new model!

Over the course of a week, more than 12,000 people voted anonymously, and Gemini 1.5 Pro (0801), representing Google, took first place in the LMSYS Chatbot Arena for the first time (it also ranked first on Chinese tasks).

And this time it is a double champion: besides topping the overall ranking (it is the only model scoring above 1300), it also placed first on the vision leaderboard.

Simon Tokumine, a key figure on the Gemini team, wrote a post to celebrate:

[This new model] is the most powerful, smartest Gemini we’ve ever made.

One Reddit user also called the model "very good" and expressed hope that its functionality would not be scaled back.

More netizens were excited: OpenAI has finally been challenged, and is bound to release a new version to fight back!

The official ChatGPT account also came out to hint at something.

Amid the excitement, a Google AI Studio product manager announced that the model has entered a free testing phase:

Free to use in AI Studio.
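For anyone who wants to try it outside the AI Studio UI, here is a minimal sketch of calling the model through the Gemini API with the google-generativeai Python SDK. The API-key placeholder and the model id gemini-1.5-pro-exp-0801 are assumptions; check AI Studio for the exact identifier.

```python
# pip install google-generativeai
import google.generativeai as genai

# Placeholder: generate a free API key in Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# Assumed id for the experimental checkpoint; confirm the exact name in AI Studio.
model = genai.GenerativeModel("gemini-1.5-pro-exp-0801")
response = model.generate_content("Summarize the LMSYS Chatbot Arena in two sentences.")
print(response.text)
```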

Netizen: Google is finally here!

Strictly speaking, the Gemini 1.5 Pro (0801) is not actually a new model.

This experimental version builds on the Gemini 1.5 Pro that Google released in February; the 1.5 series later expanded its context window to 2 million tokens.

As the model gets updated, its name keeps growing longer, which has drawn its share of complaints.

Here, an OpenAI employee congratulated the team while slipping in a sarcastic remark:

Of course, hard-to-remember name aside, Gemini 1.5 Pro (0801) performed well in this round of official Arena evaluations.

The overall win-rate heatmap shows it with a 54% win rate against GPT-4o and a 59% win rate against Claude 3.5 Sonnet.

On the multilingual leaderboard, it ranked first in Chinese, Japanese, German, and Russian.

However, on the Coding and Hard Prompts leaderboards, it still cannot beat rivals such as Claude 3.5 Sonnet, GPT-4o, and Llama 405B.

This point drew criticism from netizens:

Coding is what matters, and it's not very good at that.

Still, others recommended Gemini 1.5 Pro (0801) for its image and PDF extraction capabilities.

DAIR.AI co-founder Elvis personally ran a full set of tests on YouTube and concluded:

Its visual capabilities are very close to GPT-4o's.
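As a rough sketch of that document-extraction workflow, one could upload a PDF through the SDK's File API and pass it alongside a text prompt. The file name, prompt, and model id below are illustrative assumptions, not a recipe from the article.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key from AI Studio

# Upload a local document via the File API, then pass it alongside a text prompt.
doc = genai.upload_file(path="report.pdf")  # hypothetical file name
model = genai.GenerativeModel("gemini-1.5-pro-exp-0801")  # assumed experimental id
response = model.generate_content([doc, "Extract every table in this PDF as Markdown."])
print(response.text)
```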

Someone also tried Gemini 1.5 Pro (0801) on a problem that Claude 3.5 Sonnet had previously answered poorly.

It turned out not only to perform better, but also to beat the smaller Gemini 1.5 Flash.

But it still stumbles on some classic common-sense tests, such as "Write ten sentences ending with the word apple."

One More Thing

At the same time, Google's Gemma 2 series gained a new 2-billion-parameter model.

Gemma 2 (2B) is ready to use out of the box and can run on Google Colab's free T4 GPU.
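As a minimal sketch of what "runs on a free T4" might look like in practice, here is a standard Hugging Face transformers loop; the checkpoint id google/gemma-2-2b-it, the dtype choice, and the prompt are assumptions rather than details from the article.

```python
# pip install -U transformers accelerate  (assumed Colab setup)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # assumed instruction-tuned checkpoint on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit the T4's 16 GB of VRAM
    device_map="auto",
)

prompt = "Explain in one sentence why small language models are useful."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```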

On the Arena leaderboard, it surpassed all GPT-3.5 models and even Mixtral-8x7B.

Faced with Google's recent string of chart-topping results, the authority of the Arena leaderboard has once again been called into question.

Nous Research co-founder Teknium (a well-known figure in the fine-tuning community) posted a reminder:

Although Gemma 2 (2B) scores higher than GPT-3.5 Turbo in the Arena, it scores far lower on MMLU.
This discrepancy would be worrisome if one were using Arena rankings as the sole indicator of model performance.

Abacus.AI CEO Bindu Reddy made a direct appeal:

Please stop using this human-assessed leaderboard immediately!
Claude 3.5 Sonnet is much better than GPT-4o mini.
Something like Gemini/Gemma shouldn't score so high on this list.

So, do you think this method of anonymous human voting is still reliable?