
ACL 2024: Intellifusion's SPACE engine debuts; large-model inference may enter a new stage

2024-08-14


From August 11th to 16th, the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) was held in Bangkok, Thailand.
The paper "Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding" by the Intellifusion Big Model team was accepted as a long paper for the Findings of ACL24. This is a phased display of some of the research results of Intellifusion Big Model.
The ACL Annual Meeting, organized by the Association for Computational Linguistics and held every year, is the world's top academic conference in computational linguistics and natural language processing. It is listed as a Class A conference in the China Computer Federation (CCF) list of recommended conferences.
Intellifusion's accepted paper proposes the SPACE engine, an innovative solution for lossless acceleration of large-model inference. Tests on different types of large models show that with the SPACE engine, inference speed on the HumanEval benchmark increases by 270%-400% while the quality of the results remains unchanged, delivering both "fast computation" and "accurate computation".
Selected papers from Intellifusion's large model team
Mainstream inference solutions struggle to deliver both speed and accuracy
SPACE is short for Smart Parallel Auto-Correct Decoding, that is, "intelligent parallel automatic error-correcting decoding".
This inference scheme has two key features. First, it adopts a semi-autoregressive inference mode, which greatly speeds up decoding. Second, it adds a verification step, so inference is accelerated without sacrificing accuracy.
What is "semi-autoregression"? Why do we need to add a verification step? Before explaining these questions, we need to understand how the current large model "works".
Open a large-language-model app and type "What is a large model?" into the dialog box. The model outputs its answer word by word: "A large model is a deep learning model with tens of millions of parameters." The process looks simple, but behind the scenes the model has run through many "autoregressive" cycles.
First, based on our input, the model predicts the first output word, "A", and then feeds "A" back to the input to predict the word that follows it. Of course, this "prediction" is not a blind guess out of thin air: the model makes a comprehensive judgment based on the data it saw during training and selects the word with the highest probability as the next output.
In this case, the second word is "large". Once it is produced, the model feeds the two words "A large" back to the input to predict the third word, and the cycle continues until the sentence is complete.
This process is called "autoregression".
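For readers who want the loop in concrete form, here is a minimal sketch of greedy autoregressive decoding. This is illustrative Python, not the paper's code; `model` is a hypothetical callable that returns a next-token probability distribution over the vocabulary.

```python
# Minimal sketch of greedy autoregressive decoding (illustrative only).
# `model` is a hypothetical callable: given a token sequence, it returns
# a list of probabilities over the vocabulary for the NEXT token.

def autoregressive_decode(model, prompt_tokens, eos_token, max_new_tokens=64):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)                      # one forward pass per token
        next_token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_token)                  # feed the new word back in
        if next_token == eos_token:
            break                                  # sentence complete
    return tokens
```

The cost is plain to see: generating n words takes n full forward passes, one per word.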
Autoregression is currently the mainstream approach to large-model inference. ChatGPT, the open-source Llama family, and many domestic Chinese large models all rely primarily on autoregressive decoding.
Schematic diagram of autoregressive scheme
The trade-offs of autoregression are clear. On the plus side, it keeps the generated content accurate, meaningful, and coherent with its context. On the minus side, computation is expensive and inference latency is high.
To overcome these problems, the industry has proposed two approaches: "semi-autoregressive" decoding and "speculative decoding".
"Semi-autoregressive" is a compromise between "autoregressive" and "non-autoregressive". As mentioned above,"Autoregression"It uses the generated words to predict the next word;“Non-autoregressive”It is the opposite of "autoregression", predicting the entire sentence at once.“Non-autoregressive”The solution can improve the efficiency of reasoning, but the accuracy of the output is greatly reduced. The "semi-autoregressive" solution comprehensively considers the advantages and disadvantages of "autoregressive" and "non-autoregressive" to balance the requirements of speed and accuracy for large model reasoning.
However, the semi-autoregressive approach raises new problems: most existing large models cannot use it, and its accuracy falls short of industry requirements. Mainstream large models are built around autoregressive inference; adopting a semi-autoregressive scheme would mean retraining the model from scratch. Training a large model consumes enormous amounts of electricity, compute, and manpower, so almost no one will tear down a hard-won model just to change its decoding scheme.
The other option is "speculative decoding". This approach follows a "draft, then verify" process: an auxiliary model with relatively few parameters is introduced to "draft" candidate answers first, and the large model then verifies whether those candidates are correct. Because the small model decodes faster than the large one, and the large model can verify multiple candidates at the same time, this method preserves output accuracy while speeding up inference.
But this approach has drawbacks of its own. First, one must build a genuinely "reliable" small model that can draft answers both quickly and accurately, which is hard in itself. Second, the two models must be fully aligned, highly consistent in tokenization and vocabulary, for the verification results to hold.
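The draft-and-verify loop can be sketched as follows. Here `draft_model` and `target_model` are hypothetical greedy predictors sharing one vocabulary; a real implementation would score all draft positions in a single batched forward pass and compare full distributions rather than single tokens.

```python
# Hedged sketch of one speculative-decoding round (illustrative only).
# `draft_model(tokens)` and `target_model(tokens)` are hypothetical greedy
# predictors over the SAME vocabulary, each returning one next token.

def speculative_round(draft_model, target_model, tokens, k=4):
    # 1) The small model cheaply drafts k tokens, one by one.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) The large model checks every draft position. Shown sequentially
    #    for clarity; in practice all k checks are one parallel forward pass.
    accepted = []
    for i, guess in enumerate(draft):
        check = target_model(tokens + draft[:i])
        if check != guess:
            accepted.append(check)       # mismatch: keep the correction, stop
            break
        accepted.append(guess)           # match: accepted "for free"
    return tokens + accepted
```

The better the small model drafts, the longer the verified prefix, and the greater the speedup.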
SPACE Inference Engine - A Small Modification, a Big Speedup
Since neither of these approaches can deliver "both", is there a solution that keeps their advantages while avoiding their drawbacks? That is the SPACE inference engine proposed by the Intellifusion large-model team. SPACE combines "semi-autoregressive supervised fine-tuning" with "auto-correct decoding", letting a large model generate multiple results in a single inference step while verifying them at the same time, so the quality of the output is preserved. Moreover, the engine is applicable to any large model: with only fine-tuning and optimization, a model adopting this engine needs no additional auxiliary model, gains inference efficiency, and makes fuller use of parallel compute resources such as GPUs to achieve higher utilization.
The difference between the autoregressive solution (left) and the SPACE solution (right)
As noted above, most large language models are autoregressive by construction and cannot directly run a semi-autoregressive scheme. SPACE addresses this with "semi-autoregressive supervised fine-tuning": through supervised training, the model learns to propose a series of possible candidate words whenever it encounters the special [MASK] token (as shown in the figure above). The model can then perform "guess"-like operations during inference, outputting the several candidate words most likely to be correct, which gives it semi-autoregressive capability.
Simply put, with semi-autoregressive supervised fine-tuning, the large model can make "guesses" on its own during inference and output multiple probably-correct words as candidate answers.
But as in an exam, you can scribble anything on the scratch paper; what goes on the answer sheet must be correct. How is correctness guaranteed? The results must be verified, and that is what "auto-correct decoding" does.
Specifically, during inference the candidate words generated in the previous step are also fed into the model, letting the model verify for itself whether those candidates are correct.
The check itself is simple: if the word the model generates matches the earlier candidate, the candidate is deemed correct. Recall that in traditional autoregressive inference, even a correct word must be fed back into the model to infer the next one.
SPACE skips that step. Because the candidates were already fed into the model in advance and have been verified correct, the model has already produced the token that follows each of them; there is no need to run another forward pass just to re-enter a confirmed answer. The advantage of this mechanism is that every verified candidate saves one round of inference.
As an analogy, think of a 4x100-meter relay. In a regular race, the four athletes run in sequence, one after another, just as an autoregressive scheme decodes word by word. In the SPACE scheme, all four athletes start at the same time; when the first sprints his 100 meters and crosses his line, the others have also finished their own legs. But each leg must then be verified in order: once the first athlete's result is confirmed, the second athlete's can be confirmed, then the third's, and so on.
If an athlete fails verification, he returns to the start of his own 100-meter leg and runs it again. In the best case, all four athletes pass verification and the team finishes in a quarter of the regular time, which is exactly the speedup; in the worst case, none pass and the total time equals a regular race. Whether verification passes depends mainly on how accurate the candidate answers are.
Meanwhile, during SPACE inference we also insert special [MASK] tokens into the input to prompt the large model to generate updated candidate answers. Under this mechanism, each inference round both verifies the candidates generated in the previous round and produces new candidates for the next round.
This design aims to improve candidate accuracy: each time a new confirmed answer appears, the candidates are refreshed and become more accurate. The process resembles weather forecasting: we re-forecast the coming week every day, and the forecast for any given future day gradually sharpens as the day approaches, because more sensor data has accumulated by then.
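Putting the two mechanisms together, one SPACE round can be sketched as below. The function `model_step` is a hypothetical stand-in for a single forward pass over context + previous candidates + k [MASK] slots; this is a reading of the mechanism described above, not the paper's actual API.

```python
# Illustrative sketch of one SPACE inference round (not the paper's API).
# `model_step(tokens, candidates, k)` stands in for ONE forward pass over
#   context + previous candidates + k [MASK] slots, returning:
#   checks - the token the model itself emits at each candidate position
#   drafts - k fresh candidate tokens proposed at the [MASK] positions

def space_round(model_step, tokens, candidates, k=4):
    checks, drafts = model_step(tokens, candidates, k)
    accepted = []
    for guess, check in zip(candidates, checks):
        if guess != check:
            accepted.append(check)    # mismatch: take the model's own token
            break                     # and discard the remaining candidates
        accepted.append(guess)        # verified without an extra forward pass
    return tokens + accepted, drafts  # drafts seed the next round
```

In the best case every candidate passes and a round advances several tokens at once; in the worst case only one token is produced, matching plain autoregression, which mirrors the relay-race analogy above.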
The traditional verify-and-correct method is the speculative decoding described above: one must first train a reliable small model and have the large model verify its drafts, and the small model's generation quality heavily affects the final result.
SPACE, by contrast, achieves both generation and verification without any small model, and the two proceed simultaneously, greatly improving inference efficiency and accuracy.
Returning to the opening example: given the input "What is a large model?", under SPACE the model generates several words of "A large model has tens of millions of parameters" at once. At the same time, the auto-correct decoding algorithm immediately verifies the generated words one by one and keeps only those that pass verification as final output. A single forward pass of the large model thus yields multiple words, which is where the acceleration comes from.
Finally, let's take a look at the effect of SPACE.
We ran experiments on a number of open-source large language models, covering mainstream models with parameter counts from 6 billion to 70 billion. As the table below shows, SPACE's acceleration is more pronounced on models with more parameters.
In addition, SPACE can be combined with other inference-acceleration techniques, such as continuous batching, FlashAttention, KV caching, and quantization, for even faster inference.
To verify this, we implemented SPACE on top of TGI, a mainstream inference framework. Experiments show that the speedup from SPACE remains just as strong when combined with these other acceleration techniques.
Large models are entering every industry, and "inference" is crucial
Training and inference are the two core stages in a large model's life cycle. Training builds the model from scratch; inference determines how the model is applied across thousands of industries.
If last year was the breakout year for large models, this year is the year of their application, and inference capability is therefore drawing more and more attention.
Intellifusion has invested heavily in accelerating large-model applications. On the compute side, the company launched the large-model edge inference chip DeepEdge10 last year and recently released the IPU-X6000 accelerator card, which can accelerate inference for language, vision, multimodal, and other large models.
On the algorithm side, Intellifusion proposed the SPACE inference engine, which greatly improves large-model inference speed. On the application side, Intellifusion's self-developed large model, Yuntianshu, has been deployed in industries including smart government affairs, urban governance, smart security, smart transportation, smart business, and smart education, exploring and creating industry benchmarks.
Going forward, Intellifusion will continue working to advance the research, development, and adoption of large-model technologies.