
Musk abruptly releases a new version of his large model, tapping Tesla resources to challenge OpenAI: our hands-on test

2024-08-14


Mengchen, reporting from Aofeisi
Quantum Bit | Public Account QbitAI

Musk's xAI has released the second generation of its large model!

The Grok-2 beta has been released, and the smaller Grok-2 mini can already be tried online on the X platform.



Musk also revealed, riddler-style, a secret that has puzzled the large-model community for more than a month:

The mysterious anonymous model sus-column-r in the LMSYS large model arena is, in fact, Grok-2.



sus-column-r has accumulated more than 10,000 human votes on the leaderboard and is tied for third place with the API version of GPT-4o.



In xAI’s own internal tests, Grok-2 performed on par with other cutting-edge models in areas such as general knowledge (MMLU, MMLU-Pro), math competition problems (MATH), and graduate-level scientific knowledge (GPQA).

In addition, Grok-2 excels at vision-based tasks and has achieved SOTA in visual mathematical reasoning (MathVista).



However, the chart's layout is a bit sly: GPT-4o and Claude 3.5 Sonnet, the highest scorers, are placed at the far edge of the chart, away from Grok-2.

Scores alone are abstract, so let's move on to hands-on testing.

First-hand experience with Grok-2

If you are a paid user of the X/Twitter platform, you can enter the Grok channel directly to try it out. If you don't want to spend money, you can also go to the LMSYS large model arena and select sus-column-r.



Oddly, paid users only get the mini version, while free users in the arena can play with the full-size model, which is rather generous.



Since Grok-2 has access to real-time data on X, you can ask it directly to summarize the day's news; with fun mode turned on, it will also add its own commentary.



The paid version is also connected to FLUX.1, the latest open-source AI image model, and will translate Chinese prompts into English before processing them.



Clicking the "Recommend a fantasy game" sample question on the homepage, you can see that it first recommends "Baldur's Gate 3" and comments on it from several angles, including plot, character customization, game mechanics, world building, humor, and the player community, capturing the game's highlights well.



At this time, you can switch to Chinese and continue asking questions.

Grok-2 also knows about the then-unreleased game Black Myth: Wukong, accurately stating that its release date is August 20 and that it uses the Unreal Engine 5, and it summarizes the ongoing discussion among netizens.



At the end, related posts from users are attached, which readers can click to join the discussion. The platform's features are fully integrated.



However, since only the mini version is available on X, we move to the large model arena for the next round of strength tests, where it can also go head-to-head with GPT-4o.

On the recently popular trick question "Which is bigger, 9.9 or 9.11?", Grok-2 (sus-column-r) outperforms the latest version of ChatGPT.
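For reference, the correct numeric answer, and one plausible source of the models' confusion (treating the numbers like software version strings), can be checked in a few lines of Python. This sketch is an illustration by the editor, not part of either model's output:

```python
# As decimal numbers, 9.9 > 9.11 because 0.9 > 0.11.
print(9.9 > 9.11)  # True

# A plausible source of confusion: reading "9.11" as a version number,
# where each dot-separated part is compared as a separate integer.
as_version_9_9 = [9, 9]
as_version_9_11 = [9, 11]
print(as_version_9_9 > as_version_9_11)  # False: 11 beats 9 in version ordering
```

Under version-style ordering the answer flips, which is consistent with how models trained on lots of software text might go wrong.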



But on another popular test, "How many r's are there in strawberry?", both still fail. (After a few more attempts, there is a small chance that either gets it right.)
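The ground truth is trivial to verify in code, which is exactly why the question is a good probe of character-level reasoning in token-based models:

```python
word = "strawberry"
# Count occurrences of the character 'r'.
count = word.count("r")
print(count)  # 3: one in "str", two in "berry"
```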



On a trickier trap question, "Which of the following candles was blown out first?", Grok-2 does slightly better than ChatGPT.



The key insight is that the candle blown out first burned the least, so it has the longest remaining part (the correct answer is 3). ChatGPT mistakenly looked for the shortest one; Grok-2's reasoning is correct, but it miscalculates which candle is the longest.
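The intended reasoning can be sketched in a few lines. The candle lengths below are hypothetical illustrative values chosen by the editor (the original image is not reproduced here), arranged so that candle 3 has the longest remainder, matching the stated correct answer:

```python
# Hypothetical remaining lengths (in cm) for candles 1-5.
remaining = {1: 3, 2: 5, 3: 9, 4: 6, 5: 2}

# The candle blown out first burned the shortest time,
# so it is the one with the LONGEST remaining length.
first_blown_out = max(remaining, key=remaining.get)
print(first_blown_out)  # 3 with these illustrative values
```

ChatGPT's error corresponds to using `min` here instead of `max`; Grok-2 picked the right rule but misjudged which candle was longest in the image.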



As for the "reversal curse", a classic weakness of large models, both seem to have overcome it somehow. Not only can they answer "Who is Tom Cruise's mother?", they can also answer the reverse question, "Who is Mary Lee Pfeiffer's son?", which appears far less frequently in training data.

(Of course, more relevant data may simply have appeared after this question became a classic test.)



Musk's large model upgrade, paid for with Tesla resources

That concludes the tests. Grok-2 clearly shows major progress over the previous generation, Grok-1.5.

Behind the scenes, Musk poured substantial resources and manpower into it.

For example, a researcher who recently joined xAI said that doing research on a 100,000-GPU cluster is far more fun than scraping by on meager academic resources.



But there is one group of people who are not satisfied: Tesla shareholders.

According to the Wall Street Journal, Musk continues to transfer talent, data, and GPU resources from Tesla to xAI.

So far, xAI has hired at least 11 former Tesla employees, six of whom worked directly on the Autopilot team.

Musk also asked Nvidia to prioritize xAI for GPU orders originally reserved for Tesla.

Musk also talked publicly about the large amount of visual data collected by Tesla, which he said could be used as a resource for training xAI models.

At least three Tesla shareholders have sued Musk over this, claiming that shifting resources to xAI harmed the interests of Tesla investors.

The case is currently pending in a Delaware court.