
GPT-4o mini's ranking collapses as the Large Model Arena updates its rules: Altman's score-boosting tricks no longer work

2024-08-31


With the Large Model Arena rules updated, GPT-4o mini's ranking immediately collapsed, falling out of the top 10.

The new leaderboard down-weights features of AI answers such as length and style, so that scores reflect a model's ability to actually solve the problem.

Tricks for pleasing users or boosting rankings, such as prettier formatting or adding more subheadings, no longer work.

Under the new rules, the rankings of Altman's GPT-4o mini and Musk's Grok-2 series dropped significantly, and Google's small model Gemini-1.5-Flash also fell back.

Meanwhile, the scores of large models such as the Claude series and Llama-3.1-405B have all risen.

When only hard prompts are counted, the advantage of large models on the style-controlled leaderboard is even more obvious.

Previously, GPT-4o mini had topped the leaderboard, tied for first place with the full GPT-4o, which was clearly at odds with users' hands-on impressions.

The LMSYS large model arena, an evaluation benchmark once recommended by Karpathy, had seen its reputation decline to the point of being said to reflect only user preferences rather than model capabilities.

LMSYS took the criticism to heart and first released data from 1,000 battles involving GPT-4o mini, analyzing the factors that influenced voting results, including the model's refusal rate, the length of generated content, and formatting.

Moreover, before GPT-4o mini's release, Altman had hinted that it was optimized for human preferences.

Now LMSYS has gone a step further, introducing new algorithms to control for these factors, and this is only the first step in its plan.

How is the influence of style controlled?

Suppose Model A is good at generating code, facts, and unbiased answers, but its output is very concise.

Model B is weaker on substance (such as correctness), but its output is long, detailed, and beautifully formatted.

So which one is better?

There is no single answer, so LMSYS tries to work out mathematically how much of a model's score is contributed by content versus style.

In addition, recent studies have shown that humans may prefer AI answers that are nicely formatted and more detailed.

Through Bradley-Terry regression, style features such as response length, number of Markdown headers, lists, and amount of bold text are added as independent variables.

This is a common technique in statistics and has recently been used by AlpacaEval LC and others for large model evaluation.

Including confounding variables (e.g., response length) in the regression allows score gains driven by those variables to be attributed to them rather than to model ability itself.

The relevant code has been made public on Google Colab.
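To make the idea concrete, here is a minimal sketch of what a style-controlled Bradley-Terry fit can look like, not the official LMSYS Colab code: each battle becomes one row, the two models are encoded as a +1/-1 difference vector, and the style feature differences are appended as extra columns. The function name fit_style_controlled_bt and the exact feature set are assumptions for illustration.

```python
# Minimal sketch of style-controlled Bradley-Terry regression (assumed,
# not the official LMSYS implementation), using numpy and scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_style_controlled_bt(model_a_idx, model_b_idx, winner, style_diff, n_models):
    """
    model_a_idx, model_b_idx : int arrays, model index per battle
    winner                   : 1 if Model A won the vote, 0 if Model B won
    style_diff               : (n_battles, n_style_features) array of style
                               feature differences (A minus B), e.g. response
                               length, number of Markdown headers, lists, bold
    Returns (model_strengths, style_effects).
    """
    n_battles = len(winner)
    # +1 for Model A, -1 for Model B: the Bradley-Terry difference encoding
    X_models = np.zeros((n_battles, n_models))
    X_models[np.arange(n_battles), model_a_idx] += 1.0
    X_models[np.arange(n_battles), model_b_idx] -= 1.0
    # Append style features as independent variables so that any win
    # probability explained by style is absorbed by their coefficients
    X = np.hstack([X_models, style_diff])
    clf = LogisticRegression(fit_intercept=False, C=1e6, max_iter=1000)
    clf.fit(X, winner)
    coefs = clf.coef_.ravel()
    model_strengths = coefs[:n_models]   # style-controlled ability scores
    style_effects = coefs[n_models:]     # how much each style feature sways votes
    return model_strengths, style_effects
```

With this difference encoding, the per-model coefficients play the role of Elo-like strengths, while the style coefficients soak up whatever part of the vote is explained by length and formatting.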

The team also ran ablation experiments controlling only for length and only for format; see the sketch below. The scores of GPT-4o mini and the Google Gemini series turn out to be more affected by format.
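In terms of the sketch above, such an ablation amounts to refitting the regression with only a subset of the style columns; the column indices below are assumptions about how the features are ordered.

```python
# Hypothetical ablation reusing fit_style_controlled_bt from the sketch above:
# control only for length, or only for formatting features, and compare.
length_only = style_diff[:, [0]]   # assumed column 0: response length
format_only = style_diff[:, 1:]    # assumed remaining columns: headers, lists, bold

strength_len, _ = fit_style_controlled_bt(
    model_a_idx, model_b_idx, winner, length_only, n_models)
strength_fmt, _ = fit_style_controlled_bt(
    model_a_idx, model_b_idx, winner, format_only, n_models)
```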

However, this approach has limitations, such as unobserved confounders: for example, length can be positively correlated with response quality (as with chain-of-thought prompting), and the regression does not account for that.

Many netizens said the adjusted hard-prompt leaderboard matches their subjective impressions better.

Others argue that it is precisely this back-and-forth between the leaderboard and the model companies chasing rankings that pushes the whole field forward.

Do you still consult the large model arena results when choosing models? If you have a better evaluation method, please share it in the comments.