Musk shakes things up again! His new large model challenges GPT-4o, and netizens are going wild

2024-08-14

Zhidongxi reported on August 14 that this afternoon, Beijing time, Musk's large model startup xAI launched its second-generation model, Grok-2 beta, in two versions: Grok-2 and Grok-2 mini.

Musk posted enthusiastically on his own social platform X, revealing Grok-2's alias (its "vest") in the Lmsys large model arena: sus-column-r.

He retweeted Lmsys's tweet and said: "Grok is rocket speed". sus-column-r received more than 12,000 votes on the leaderboard, outperforming Claude 3.5 Sonnet and GPT-4-Turbo and tying with GPT-4o for third place.

In many assessments such as GPQA, MMLU, MMLU-Pro, MATH, and MathVista, both Grok-2 models score higher than mainstream models such as GPT-4 Turbo, Claude 3 Opus, and Gemini Pro 1.5, but still fall short of GPT-4o.

X Premium and Premium+ users can now try Grok-2 and Grok-2 mini, and Zhidongxi tested them hands-on right away.

After some hands-on use, the most obvious impression Grok-2 leaves is that its logic is very clear. In one test, for example, both Grok-2 and GPT-4o gave the correct answer, but Grok-2 laid out each step and calculation more clearly and was easier to follow. In addition, Grok-2's text-to-image capabilities have skyrocketed with the support of FLUX.1, and it has retained its usual "bold" style.

xAI also plans to release two versions of its Grok-2 enterprise API later this month.

Try it here: https://lmarena.ai/?model=sus-column-r

1. Performance surpasses several GPT-4 versions, with stronger visual and logical capabilities

In the LMSYS chatbot arena, an early version of Grok-2, sus-column-r, participated in the evaluation. Its overall Elo score outperforms Claude and multiple GPT-4 versions.

As shown in the figure below, Grok-2's score surpassed the July 18 version of GPT-4o-mini and the April 9 version of GPT-4-Turbo, but it is still lower than the August 8 version of ChatGPT-4o-latest and the May 15 version of GPT-4o.
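For readers unfamiliar with Elo-style ratings, the sketch below shows the textbook Elo expected-score formula, which turns a rating gap into a predicted head-to-head win probability, plus the classic update rule. It is an illustration only: the exact rating method behind the Lmsys leaderboard is not described in this article, and the function names and the K-factor of 32 are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of A against B, a value between 0 and 1."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after one match.

    score_a is 1.0 for a win, 0.5 for a tie, 0.0 for a loss.
    k (the K-factor) is an assumed value, not one taken from any leaderboard.
    """
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


if __name__ == "__main__":
    # Hypothetical ratings, not actual leaderboard numbers.
    print(expected_score(1200, 1200))      # equal ratings give 0.5
    print(update_rating(1200, 1200, 1.0))  # a win against an equal opponent gives 1216.0
```

Under this formula, two models with equal ratings are each expected to win half of their matchups, and a 400-point gap corresponds to roughly a 10-to-1 expected advantage.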

Internally, the xAI team follows a similar process to evaluate models, focusing on two core capabilities: first, accuracy in following instructions; second, the accuracy and truthfulness of the information provided.

It is worth mentioning that Grok-2 has made significant progress in reasoning over retrieved content and in using tools, such as accurately identifying missing information, reasoning through sequences of events, and effectively discarding irrelevant posts.

For the benchmark tests, the team comprehensively evaluated the Grok-2 models using a series of academic benchmarks in fields such as reasoning, reading comprehension, math, science, and coding.

The results show that both Grok-2 and its lighter version, Grok-2 mini, are significantly improved over the previous model, Grok-1.5.

In areas such as graduate-level scientific knowledge (such as GPQA), general knowledge questions and answers (such as MMLU and MMLU-Pro), and mathematics competition questions (such as MATH), their performance is comparable to other top models.

As shown in the figure below, Grok-2 scored high in all these tests. It surpasses GPT-4 Turbo, Claude 3 Opus, and Gemini Pro 1.5, but still cannot beat GPT-4o.

It is worth mentioning that Grok-2 performs very well on visual tasks, and is particularly outstanding in visual mathematical reasoning (MathVista) and document-based question answering (DocVQA).

2. Grok-2 has been launched on the X platform. First-hand test: text-to-image takes off, and the logical reasoning is clearer

X subscribers can now use Grok-2 and Grok-2 mini, and non-subscribers can also try out the early version of Grok-2, sus-column-r, for free in the large model arena.

There are 62 models, including GPT-4o, to choose from in the large model arena. For ease of comparison, let's test this early version first.

The first is the number comparison question that tripped up many models a while ago: which is bigger, 13.11 or 13.8? Both Grok-2 and GPT-4o answered correctly, but Grok-2's reasoning was clearer and it listed its steps in detail.

On another classic question, "How many r's are there in Strawberry?", Grok-2 got it wrong at first, but gave the correct answer after the question was asked in English. GPT-4o got both the Chinese and English versions right. It seems that large models' answers still involve an element of luck.
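The two sanity-check questions above have unambiguous answers that are easy to verify in code. The snippet below is a minimal sketch for checking them yourself; the helper names `larger_decimal` and `count_letter` are made up for this illustration and have nothing to do with how either model works internally.

```python
from decimal import Decimal


def larger_decimal(a: str, b: str) -> str:
    """Return whichever decimal string is numerically larger (b if they are equal)."""
    return a if Decimal(a) > Decimal(b) else b


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


if __name__ == "__main__":
    print(larger_decimal("13.11", "13.8"))  # 13.8
    print(count_letter("Strawberry", "r"))  # 3
```

Comparing the numbers as decimals rather than as version-style strings is what makes 13.8 larger than 13.11, which is exactly the trap the question sets.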

The models in the large model arena are not connected to the Internet in real time. When asked "What are the highlights of the Pixel 9 just released by Google", both models said that they did not have this information yet. Grok-2 then made a prediction based on technology trends and the characteristics of previous Pixel phones. Its guess that the camera, processor, and AI would be the focus of Google's update was fairly reasonable.

GPT-4o did not make a prediction, but instead summarized the highlights of the Pixel phone in the past.

In terms of coding capabilities, the two models perform equally well, and both provide detailed solutions and complete code based on the requirements.

In terms of logical reasoning, Grok-2 once again demonstrated its clear logic, with a subheading for each reasoning step. Although GPT-4o also answered correctly, its steps were not as clearly laid out.

Image generation is a major focus of this Grok-2 update. The FLUX.1 model it connects to has recently become very popular in the open source community thanks to its strong performance. However, image generation cannot be tried in the large model arena and is available only with an X subscription.

Netizens have already been playing with Grok-2's text-to-image feature, for example using its text generation capabilities to stage an offline press conference for Grok-2.

Or use your imagination and have Musk drive a car on Mars.

Thanks to Grok's nearly nonexistent content moderation, many netizens made memes, such as having Trump fire a gun and having George W. Bush take cocaine...

Or let Trump take a SpaceX rocket to space. Faced with the same request, GPT-4o refused decisively.

Just how lax is Grok's content moderation? One netizen tested several models by asking them to "rank the top 10 by IQ by race". Only Grok-2 answered without hesitation. ChatGPT and Claude refused outright, and Gemini began earnestly lecturing him.

Overall, Grok-2 still adheres to its bold style. At the same time, its model performance is comparable to that of top models such as GPT-4o. Its logic is clearer, and its multimodal capabilities have skyrocketed with the support of FLUX.1.

3. Enterprise API platform to launch at the end of the month, integrating seamlessly with enterprise systems

Later this month, xAI will officially make Grok-2 and Grok-2 mini available to developers through its enterprise API platform.

This API will use a new customized technical architecture to support multi-region inference deployment, providing a smooth, low-latency experience for users worldwide.

At the same time, xAI is strengthening security features, including mandatory multi-factor authentication (such as Yubikey, Apple TouchID or TOTP), and will provide detailed traffic statistics and advanced billing analysis services, with support for data export.

In addition, xAI has also launched a management API that supports seamless integration of team, user and billing management functions into existing internal tools and services.

Conclusion: Grok-2 and the X platform are more closely linked, and OpenAI and others are under more pressure

Grok-2 and Grok-2 mini are now available on the X platform, and features such as enhanced search experience, in-depth analysis of X posts, and optimized reply functions are all quite exciting. xAI will also release a preview version of the multimodal understanding function soon.

Since the launch of Grok-1 in November 2023, xAI has been making great strides in technology, products and financing. The launch of Grok-2 is a new milestone. Once Musk connects the Grok large model's capabilities with the X platform's powerful content and user ecosystem to form a closed loop, the pressure on large model startups, including OpenAI, will be even greater.

Author | Li Shuiqing Herb

Editor | Yunpeng
