
LLM inference performance is affected by output format, JSON being the worst

2024-08-16


Different output formats can actually affect the performance of large models?!

Ask a large language model (LLM) to solve the same math problem under two different prompts. The problem is as follows:

  • Eliza is paid $10 per hour for the first 40 hours she works each week, and 1.2 times her regular hourly rate for overtime. If Eliza works 45 hours this week, how much will she earn this week?

Chain-of-thought prompt: "Provide output in the following format, reasoning step by step: ... Answer: The final answer is ...".

Format-restriction prompt: "Provide output in the following valid JSON format: ..." (see the figure for the specific JSON format).

The correct answer is 460 (40 hours × $10 + 5 overtime hours × $12 = $460). The chain of thought (letting the model think step by step) works, but the format restriction ("output in JSON format") fails!



This is a scene from a new study by National Taiwan University and Appier AI Research, who found that

Format restrictions reduce the reasoning ability of LLMs, and the stricter the restrictions, the worse the reasoning. (Quite the rebel.)



But the good news is, it can be cured.

They found that the best solution is a "two-step conversion" (like a middleman, right?): have the LLM first answer the question in natural language, then convert the answer into the target format.

In the process, they compared the performance of models such as GPT-3.5 Turbo, Claude 3 Haiku, and Gemini 1.5 Flash when generating data in different formats. The results also showed that:

GPT likes YAML, Claude likes XML, Gemini/Gemma likes JSON. (Each has its own preferences)

After reading the study, some netizens pointed out its significance for balancing structured generation and task reasoning:



Format restrictions reduce the reasoning power of LLMs

The research has been published on arXiv. The paper's main finding: under format restrictions, the reasoning ability of LLMs drops significantly, especially in JSON mode.



A long-standing obstacle to incorporating LLMs into industrial applications is their inconsistent adherence to standardized output formats.

A common solution is structured generation: using format constraints to require LLMs to produce output in a standardized format such as JSON or XML.

But although there are many ways to implement such restrictions, no one had studied their downstream impact: does the restriction itself affect model performance?

So the researchers evaluated three common methods to measure the impact of different format restrictions on downstream performance:

  • JSON-mode: constrains LLM output to valid JSON via a predefined token space
  • FRI (format-restricting instructions): instructs LLMs to generate standardized responses that conform to a specified schema
  • NL-to-Format: a two-step process that first answers the question in natural language, then converts the answer into the target format

Oh, and they also included Natural Language (NL), the least constrained setting, in which the model answers freely in natural language. (The four settings are sketched below.)
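
To make the four settings concrete, here is a minimal sketch in Python. The prompt wording and the question text are illustrative assumptions, not the paper's exact prompts:

```python
# Illustrative prompts for the four settings (the wording is assumed, not
# taken verbatim from the paper).

QUESTION = (
    "Eliza is paid $10 per hour for the first 40 hours she works each week, "
    "and 1.2 times her regular hourly rate for overtime. If Eliza works 45 "
    "hours this week, how much will she earn this week?"
)

# 1. Natural Language (NL): no format constraint at all.
nl_prompt = QUESTION

# 2. FRI (format-restricting instructions): the schema is requested only in
#    the prompt text; decoding itself stays unconstrained.
fri_prompt = (
    QUESTION + "\nProvide your output in the following valid JSON format:\n"
    '{"reasoning": "<step-by-step reasoning>", "answer": "<final answer>"}'
)

# 3. JSON-mode: the same schema request, but the serving stack additionally
#    constrains decoding to valid JSON (a predefined token space), so the
#    model cannot emit anything outside it.

# 4. NL-to-Format: two calls -- answer nl_prompt freely first, then convert
#    the free-form answer into JSON (see the two-step sketch further below).
```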

The evaluation benchmarks were GSM8K (math problems posed in natural language) and Last Letter Concatenation (concatenating the last letters of a list of words), two datasets that require exact-match answers, plus Shuffled Objects (a shuffled-object tracking task).



They found that in these tasks involving reasoning, more permissive prompts generally led to better results.

At the same time, JSON-mode performs worst in most cases, followed by format-restricting instructions (FRI), then Natural-Language-to-Format (NL-to-Format) conversion, with natural language (NL) prompts performing best.

The study also found that different LLMs show different preferences for data formats.

For example, GPT prefers the YAML format, Claude prefers the XML format, and Gemini/Gemma prefers the JSON format.

However, in classification tasks, format restrictions may actually improve accuracy, because constraining the space of possible answers reduces the error rate.
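
Why restriction can help here is easy to see in code. A minimal sketch, assuming a hypothetical `call_llm(prompt) -> str` helper (a stand-in for any chat-completion API) and an invented label set:

```python
# With a fixed label set there is no room for free-form mistakes: off-label
# outputs become detectable and can be rejected or retried.

LABELS = ("positive", "negative", "neutral")

def classify(call_llm, text: str) -> str:
    """call_llm(prompt) -> str is a hypothetical chat-completion helper."""
    out = call_llm(
        f"Classify the sentiment of the text as one of {LABELS}. "
        f"Reply with the label only.\n\n{text}"
    ).strip().lower()
    if out not in LABELS:
        # The constrained answer space turns a would-be wrong answer into a
        # catchable error instead of a silent scoring failure.
        raise ValueError(f"off-label output: {out!r}")
    return out
```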



They further analyze the reasons why format restrictions reduce the model's reasoning ability, mainly:

  • This limits the model's ability to generate necessary intermediate reasoning steps.
  • Enforcing format requirements may be incompatible with the way the model naturally generates answers.
  • Formatting errors may cause an answer to be judged incorrect even though the underlying reasoning is right.

Good news: It can be cured

They proposed several solutions to this problem:

First, as mentioned earlier, JSON-mode performs the worst in most cases, with format-restricting instructions next, while NL-to-Format and natural language responses fare better.

Conversely, the best remedy for the format limitation is NL-to-Format: have the LLM first answer the question in natural language, then convert the answer into the target format. This separates reasoning from format compliance, resulting in better performance.
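
A minimal sketch of this two-step pipeline, again assuming the hypothetical `call_llm(prompt) -> str` helper:

```python
import json

def nl_to_format(call_llm, question: str) -> dict:
    """Two-step NL-to-Format conversion: reason freely first, reformat after."""
    # Step 1: unconstrained natural-language answer; all reasoning happens here.
    free_text = call_llm(question)

    # Step 2: a pure reformatting task that requires no new reasoning.
    formatted = call_llm(
        "Convert the answer below into valid JSON with exactly two keys, "
        '"reasoning" and "answer". Do not change the content.\n\n' + free_text
    )
    return json.loads(formatted)
```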



In addition, the key ordering of structured output has an important impact on how LLMs answer.

For example, when using GPT-3.5 Turbo, 100% of JSON-mode responses placed the "answer" key before "reasoning", which caused the model to give the answer directly instead of showing its thought process first.
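
The effect of key ordering is easy to visualize. The schemas below are illustrative, not the paper's exact ones:

```python
# Generation is left to right, so key order decides what the model must
# commit to first.

answer_first = {                        # the failure mode seen in JSON-mode
    "answer": "<final answer>",         # must be produced before any reasoning
    "reasoning": "<step-by-step>",      # degenerates into post-hoc justification
}

reasoning_first = {                     # lets the reasoning act as a scratchpad
    "reasoning": "<step-by-step>",      # generated first
    "answer": "<final answer>",         # conditioned on the reasoning above
}
```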

The research also shows that the parsing errors caused by format restrictions are not the main reason for the performance differences.

For example, in the LLaMA 3 8B model, the JSON format parsing error rate for the Last Letter task is only 0.15%, but the performance gap compared to natural language responses reaches 38.15%.



These errors can also be mitigated with corrective prompts: for the Claude 3 Haiku model on the Last Letter task, a correction step improved the accuracy of JSON and YAML outputs by +2.8% and +44.8%, respectively.
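
A minimal sketch of such a corrective loop, with the same hypothetical `call_llm` helper:

```python
import json

def parse_with_correction(call_llm, raw_output: str, max_retries: int = 2):
    """On a parse failure, feed the broken output back and ask the model only
    to repair the formatting, not to redo the task."""
    text = raw_output
    for _ in range(max_retries):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            text = call_llm(
                f"The following output is not valid JSON ({err}). "
                f"Return a corrected version without changing the content:\n{text}"
            )
    return json.loads(text)  # still raises if the retries did not fix it
```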



All of this means that, when applying LLMs, one needs to strike a balance between an easily parseable format and preserving the model's inherent reasoning ability.

Finally, the researchers reminded in the paper:

  • Compared with regular expressions, LLMs used as answer parsers can provide deeper and more accurate text understanding: they are not limited to surface pattern matching, but can truly grasp the meaning and context of an answer. (The contrast is sketched below.)
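
A minimal sketch of that contrast; the regex pattern and the extraction prompt are illustrative assumptions:

```python
import re

def parse_by_regex(text: str):
    """Surface pattern matching: brittle whenever the phrasing varies."""
    m = re.search(r"answer is\s*\$?([\d,.]+)", text, re.IGNORECASE)
    return m.group(1) if m else None

def parse_by_llm(call_llm, text: str) -> str:
    """LLM-as-parser (hypothetical call_llm helper): grasps meaning and
    context, so phrasing like 'she earns 460 dollars in total' still parses."""
    return call_llm(
        "Extract only the final numeric answer from the text below. "
        "Reply with the number alone.\n\n" + text
    )
```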