
Robin Li punctures the illusion of large-model "benchmark scores": leaderboards do not capture all strengths, and the gap between models will widen

2024-09-12


Whenever a new version of a large model is released, the industry is keen to cite third-party leaderboard data, "benchmark" it against GPT-4, and claim to have surpassed it on certain metrics as proof of technical strength.

However, in a recent exchange with internal employees, Baidu chairman Robin Li pierced this veil around the industry's benchmark scores: "Every time a new model is released, it is compared with GPT-4o, claiming its scores are almost the same, and even higher on some individual items. But that does not mean there is no gap with the most advanced model."
He further explained that the gap between models is multi-dimensional. One dimension is capability: gaps in fundamental abilities such as comprehension, generation, logical reasoning, and memory. The other is cost: some models can achieve the same results, but their inference is expensive and slow, so in practice they are still inferior to the advanced models.

"There is also overfitting on the test set. Every model that wants to prove its ability goes on the leaderboards, and climbing them means guessing what others will test and which questions you can answer correctly with which techniques. So judging by the leaderboard or the test set, you may think your capability is very close, but in actual applications a clear gap remains," said Robin Li.
A large-model practitioner told reporters that the test-set overfitting Robin Li mentioned refers to a model learning its training data too closely: it performs very well on that data but poorly on unseen test data. This usually means the model is complex enough to "memorize" the noise and idiosyncrasies of the training data, but those details do not generalize, so the model transfers poorly to new data.
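The failure mode described above can be caricatured in a few lines of code. The sketch below is purely illustrative (the "model", questions, and answers are invented for the demo, and no real training is involved): a memorizer that stores benchmark question-answer pairs verbatim scores perfectly on that benchmark yet fails on everything it has not seen.

```python
# Illustrative toy only: a "model" that memorizes its training data
# scores perfectly on seen questions but cannot generalize.

def train_memorizer(train_pairs):
    """'Train' by storing every (question, answer) pair verbatim."""
    table = dict(train_pairs)

    def model(question):
        # Perfect recall on seen questions, a blind guess otherwise.
        return table.get(question, "unknown")

    return model

# Suppose public benchmark questions leak into the training data...
benchmark = [("2+2", "4"), ("capital of France", "Paris")]
model = train_memorizer(benchmark)

# ...then the benchmark score looks perfect,
benchmark_acc = sum(model(q) == a for q, a in benchmark) / len(benchmark)

# but unseen, real-world queries expose the gap.
real_world = [("3+5", "8"), ("capital of Japan", "Tokyo")]
real_acc = sum(model(q) == a for q, a in real_world) / len(real_world)

print(benchmark_acc, real_acc)  # prints: 1.0 0.0
```

The same mechanism underlies the "leaderboard manipulation" discussed below: the closer the training set is to the public evaluation set, the less the score says about real-world capability.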
This practitioner believes leaderboard scoring does have limitations. For example, because evaluation datasets are public, a model can be trained specifically to climb the rankings, producing "leaderboard manipulation". But rankings are not meaningless: they provide a relatively quantitative yardstick that helps people quickly compare different large models, push vendors to keep improving through competition, and also serve a publicity function.
In Robin Li's view, "the hype from some self-published media, combined with the incentive to promote each new model at release, has created the impression that the capability gap between models has become quite small, but that is not actually the case." He said that in practice Baidu does not allow its engineers to chase leaderboard rankings; the real measure of a large model's capability is whether, in concrete application scenarios, it meets user needs and generates value.
As for the "12-month lead or 18-month lag" often cited in the industry, he believes it is not that important, because every company operates in a fully competitive market with many rivals in every direction. "If you can always maintain a 12-to-18-month lead over your competitors, you are invincible. Don't think 12 to 18 months is a short time; even a guaranteed 6-month lead means you win. Your market share might be 70% while your competitors have only 20% or even 10%."

He believes the gap between large models may actually widen in the future, because their ceiling is very high and they remain far from the ideal state. Models need continuous, rapid iteration and upgrading, with sustained investment over years or even more than a decade, to keep meeting user needs while cutting costs and improving efficiency.
Beyond whether barriers to large-model competition remain, Robin Li also said during the exchange that outside perceptions of large models contain many misunderstandings, on topics including the efficiency of open-source versus closed-source models and AI agents.

Robin Li is a staunch supporter of closed-source large models. "Before the large-model era, people were used to the idea that open source meant free and low cost." Open-source Linux was free to use, he explained, because people already had computers. That logic breaks down in the large-model era: inference is very expensive, and open-source models do not come with computing power, so users must buy hardware themselves and cannot use compute efficiently.
"Open-source models are not efficient," he said. "Closed-source models should more precisely be called commercial models: countless users share the R&D costs and the machines and GPUs used for inference, so the GPUs are used most efficiently. The GPU utilization of Baidu's Wenxin large models 3.5 and 4.0 exceeds 90%."

Robin Li allowed that open-source models are valuable in fields such as teaching and scientific research, but in business, where the goals are efficiency, effectiveness, and lowest cost, they have no advantage.
He also shared his view of how large-model applications will evolve. First comes the copilot, which assists people; next comes the agent, which has a degree of autonomy and can use tools, reflect, and improve itself; pushed further, that level of automation becomes the AI worker, which completes various tasks independently.
Agents are currently drawing growing attention from large-model companies and their customers. Robin Li believes that although many people are bullish on this direction, agents have not yet become an industry consensus.

"The barrier to entry for agents is genuinely low," he said. Many people do not know how to turn large models into applications, and agents offer a very direct, efficient, and simple path: building an agent on top of a model is quite convenient.
(This article comes from China Business Network)