
Why is the late interaction model the standard for the next generation of RAG?

2024-08-05




AIxiv is a column where Synced publishes academic and technical content. Over the past few years, Synced's AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

Zhang Yingfeng: Co-founder of InfiniFlow, with many years of experience in search, AI, and infrastructure development, he is currently building the core products of next-generation RAG.

In the development of RAG systems, a good Reranker model is an indispensable link and appears in virtually every evaluation. This is because vector search suffers from low hit rates, so an advanced Reranker model is needed to compensate. Together they form a two-stage ranking architecture: vector search for coarse-grained recall and the Reranker model for fine-grained ranking.

There are two main types of ranking model architectures:

1. Dual encoder. Taking the BERT model as an example, it encodes the query and the document separately, then passes the output through a Pooling layer so that only a single vector remains. In the ranking stage, only the similarity between the two vectors needs to be computed, as shown in the figure below. The dual encoder can be used for both the ranking and reranking stages, and vector search is in fact an instance of this ranking model. Since the dual encoder encodes the query and document separately, it cannot capture the complex interactions between query and document tokens, so considerable semantics are lost; however, because ranking only requires a vector similarity computation, execution is very efficient.
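To make the dual-encoder workflow concrete, here is a minimal sketch using the sentence-transformers library; the model name is an illustrative choice, not one used in this article.

```python
# Minimal dual-encoder (bi-encoder) sketch with sentence-transformers.
# The model name is illustrative only; any bi-encoder checkpoint works.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Documents can be encoded offline, once, and stored in a vector index.
doc_embeddings = model.encode(
    ["ColBERT is a late interaction model.", "Cross encoders are accurate but slow."],
    normalize_embeddings=True,
)

# At query time only the query is encoded; ranking is a single dot product
# per document (cosine similarity, since the vectors are normalized).
query_embedding = model.encode("what is late interaction?", normalize_embeddings=True)
print(util.dot_score(query_embedding, doc_embeddings))
```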



2. Cross Encoder. A Cross Encoder uses a single encoder model to encode the query and the document together. It can capture the complex interactions between them and can therefore deliver more accurate ranking results. Instead of outputting token vectors for the query and document, a Cross Encoder adds a classifier that directly outputs a similarity score for the pair. Its drawback is that every document must be encoded together with the query at query time, so ranking is very slow; a Cross Encoder can therefore only be used to rerank final results. Even reranking just the Top 10 of the initial recall takes on the order of seconds.
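For contrast, a minimal cross-encoder sketch, again with sentence-transformers and an illustrative model name: every (query, document) pair must pass through the model together, which is why this architecture is reserved for reranking a small Top N.

```python
# Minimal cross-encoder sketch with sentence-transformers.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what is late interaction?"
candidates = ["ColBERT is a late interaction model.", "Cross encoders are accurate but slow."]

# Each pair is encoded jointly at query time; the classifier head outputs
# one relevance score per pair. Cost grows linearly with the candidate count.
scores = model.predict([(query, doc) for doc in candidates])
print(scores)
```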



Since the beginning of this year, another line of work represented by ColBERT [Reference 1] has attracted widespread attention in the RAG development community. As shown in the figure below, it has several characteristics that clearly distinguish it from the two types of ranking models above:

First, compared to the Cross Encoder, ColBERT still adopts a dual-encoder strategy: queries and documents are encoded with independent encoders, so query tokens and document tokens do not affect each other during encoding. This separation allows document encoding to be done offline, with only the query encoded at query time, so processing is much faster than with a Cross Encoder.

Secondly, compared to the dual encoder, ColBERT outputs multiple vectors rather than a single one, taken directly from the Transformer's final output layer, whereas the dual encoder collapses them into a single vector through a Pooling layer and thus loses some semantics.

When calculating the ranking, ColBERT introduced a delayed interactive similarity function and named it the maximum similarity (MaxSim). The calculation method is as follows: For each query token vector, the similarity calculation is performed with the vectors corresponding to all document tokens, and the maximum score of each query token is tracked. The total score of the query and document is the sum of these maximum cosine scores. For example, for a query with 32 token vectors (maximum query length is 32) and a document with 128 tokens, 32*128 similarity operations need to be performed, as shown in the figure below.
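The MaxSim computation above is easy to state in code. Below is a minimal numpy sketch, under the assumption that all token vectors are L2-normalized so that dot products equal cosine similarities:

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """MaxSim as described above: for each query token vector, take the
    maximum cosine similarity over all document token vectors, then sum.
    query_tokens: (M, dim) L2-normalized embeddings, e.g. M = 32
    doc_tokens:   (N, dim) L2-normalized embeddings, e.g. N = 128
    """
    sim = query_tokens @ doc_tokens.T      # (M, N) matrix: M * N operations
    return float(sim.max(axis=1).sum())    # best match per query token, summed

# Toy example with random unit vectors (dim = 128, as in ColBERT).
rng = np.random.default_rng(0)
q = rng.standard_normal((32, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.standard_normal((128, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```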

Therefore, by comparison, the Cross Encoder can be called an early interaction model, while the line of work represented by ColBERT can be called a late interaction model.



The following figure compares the above ranking models in terms of performance and ranking quality. Because the late interaction model captures the complex interactions between queries and documents during ranking while avoiding the overhead of encoding document tokens at query time, it can both ensure good ranking quality and achieve much faster ranking performance: at the same data scale, ColBERT can be more than 100 times as efficient as a Cross Encoder. The late interaction model is therefore a very promising ranking model, and a natural idea follows: can we directly use the late interaction model in RAG to replace the two-stage architecture of vector search plus fine-grained reranking?



To answer this, we need to consider some engineering issues with ColBERT:

1. ColBERT's MaxSim late interaction similarity function is far more computationally efficient than a Cross Encoder, but compared with ordinary vector search its overhead is still large: since the similarity between a query and a document is a multi-vector computation, MaxSim costs M * N times an ordinary vector similarity computation (M is the number of query tokens, N the number of document tokens). To address this, the ColBERT authors released ColBERT v2 [Reference 2] in 2021, which improves the quality of the generated Embeddings by distilling from a Cross Encoder and uses compression to quantize the document vectors, thereby improving MaxSim's computational performance. RAGatouille [Reference 3], a project that wraps ColBERT v2, has become a solution for high-quality RAG ranking. However, ColBERT v2 is only an algorithm library, and using it end to end in an enterprise-grade RAG system remains difficult.

2. ColBERT is a pre-trained model whose training data comes from search engine queries and results, and these texts are not long: 32 query tokens and 128 document tokens are typical length limits. When ColBERT is applied to real data, anything beyond these limits is truncated, which is unfriendly to long-document retrieval.

To address these problems, the open-source AI-native database Infinity provides a Tensor data type in its latest version, together with a native end-to-end ColBERT solution. With Tensor as a data type, the multiple vectors output by ColBERT encoding can be stored directly in one Tensor, so the similarity between Tensors directly yields the MaxSim score. Given the heavy MaxSim computation, Infinity provides two optimizations. One is binary quantization, which shrinks a Tensor to 1/32 of its original size without changing the relative ordering of MaxSim results; it is mainly used for the Reranker, because the corresponding Tensors must be fetched according to the coarse-recall results of the previous stage. The other is a Tensor Index; the ColBERT v2 library is in fact the Tensor Index implementation released by the ColBERT authors, while Infinity adopts EMVB [Reference 4], which can be seen as an improvement on ColBERT v2, mainly accelerating key operations through quantization, pre-filtering, and SIMD instructions. The Tensor Index can only serve the Ranker, not the Reranker. In addition, for long texts that exceed the token limit, Infinity introduces the Tensor Array type:



A document exceeding ColBERT's limit is split into multiple paragraphs, each encoded into a Tensor, and all of them are saved in one row together with the original document. When computing MaxSim, the query is scored against each paragraph separately, and the maximum is taken as the score of the whole document, as shown in the figure below:
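A minimal sketch of this scoring scheme, reusing the maxsim function from the earlier sketch; how Infinity stores and iterates the Tensor Array internally is not shown here:

```python
import numpy as np

def maxsim(q: np.ndarray, d: np.ndarray) -> float:
    return float((q @ d.T).max(axis=1).sum())

def long_doc_score(query_tokens: np.ndarray, paragraph_tensors: list) -> float:
    """Score a long document stored as a Tensor Array: compute MaxSim between
    the query and each paragraph's Tensor, then take the maximum over the
    paragraphs as the score of the whole document."""
    return max(maxsim(query_tokens, para) for para in paragraph_tensors)

# Toy example: a document split into three paragraphs of different lengths.
rng = np.random.default_rng(0)
q = rng.standard_normal((32, 128))
paragraphs = [rng.standard_normal((n, 128)) for n in (128, 96, 50)]
print(long_doc_score(q, paragraphs))
```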



Therefore, with Infinity, a late interaction model can be introduced end to end to serve RAG with high quality. So, should ColBERT be used as the Ranker or as the Reranker? Below, we use Infinity to evaluate on a real dataset. Since the latest version of Infinity implements the most comprehensive hybrid search solution to date, with recall methods covering vector search, full-text search, sparse vector search, the Tensor described above, and any combination thereof, and since it provides a variety of Reranker methods such as RRF and the ColBERT Reranker, we include the various combinations of hybrid search and Reranker in the evaluation.
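RRF, one of the Reranker methods mentioned above, is simple enough to sketch directly. This is a generic implementation of Reciprocal Rank Fusion, not Infinity's internal API:

```python
def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion over several recall paths (e.g. full-text,
    dense vector, sparse vector): score(d) = sum over paths of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: three recall paths returning document ids in ranked order.
print(rrf_fuse([["d1", "d2", "d3"], ["d2", "d1"], ["d3", "d2"]]))
# -> ['d2', 'd1', 'd3']: d2 ranks highly in all three paths
```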

We evaluate on the MLDR dataset. MLDR (Multi Long Document Retrieval) is one of the datasets in the MTEB [Reference 5] benchmark used to evaluate the quality of Embedding models; it contains 200,000 long-text documents. The evaluation uses BGE-M3 [Reference 6] as the Embedding model and Jina-ColBERT [Reference 7] to generate Tensors. The evaluation scripts are available in the Infinity repository [Reference 8].

Evaluation 1: Is ColBERT effective as a Reranker? The 200,000 MLDR documents were encoded into dense and sparse vectors with BGE-M3 and inserted into the Infinity database. The database contains 4 columns storing the original text, the dense vector, the sparse vector, and the Tensor respectively, with the corresponding full-text index, vector index, and sparse vector index built. The evaluation covers all recall combinations, including single-way, dual-way, and triple-way recall, as shown below:



The evaluation metric is nDCG@10. Other parameters: when using the RRF Reranker, the coarse recall returns the Top 1000; there are 800 queries in total, each about 10 tokens long on average.
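For reference, a minimal sketch of how nDCG@10 is computed, given graded relevance labels in the returned order and in the ideal order:

```python
import numpy as np

def ndcg_at_k(returned_rels, ideal_rels, k: int = 10) -> float:
    """nDCG@k: DCG of the returned ordering divided by DCG of the ideal one.
    Both arguments are graded relevance labels, in ranked order."""
    def dcg(rels):
        rels = np.asarray(rels, dtype=float)[:k]
        return float((rels / np.log2(np.arange(2, rels.size + 2))).sum())
    return dcg(returned_rels) / dcg(ideal_rels)

print(ndcg_at_k([1, 0, 1, 1], [1, 1, 1, 0]))  # ~0.91
```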



As can be seen from the figure, all recall schemes improve significantly once the ColBERT Reranker is adopted. As a late interaction model, ColBERT provides ranking quality comparable to the top Rerankers on the MTEB leaderboard at 100 times their performance, which makes reranking over a much larger candidate set feasible. The figure shows results for reranking the Top 100; reranking the Top 1000 with ColBERT yields no significant change in the metric while noticeably degrading performance, so it is not recommended. Traditionally, an external Cross Encoder based Reranker incurs second-level latency even for the Top 10, whereas Infinity implements a high-performance ColBERT Reranker internally: even reranking the Top 100 or Top 1000 does not hurt the user experience, while the recall range is greatly enlarged, so the final ranking quality improves significantly. Moreover, this ColBERT Reranker runs on a pure CPU architecture, which also greatly reduces deployment cost.

Evaluation 2: Compare ColBERT as a Ranker rather than a Reranker. This requires building a Tensor Index on the Tensor column. To measure the precision loss introduced by the Tensor Index, a brute-force search is also performed.



It can be seen that, compared with the Reranker, even brute-force search with no precision loss brings no significant improvement, and the ranking quality with the Tensor Index is actually lower than with the Reranker. Meanwhile, querying as a Ranker is much slower: the MLDR dataset contains 200,000 documents, about 2 GB of text, which grows to as much as 320 GB after conversion to Tensor data with Jina-ColBERT. This is because the Tensor type stores one vector per document token, and with ColBERT's dimension of 128 the data volume expands by two orders of magnitude by default. Even with the Tensor Index built, a query over this much data takes 7 seconds on average to return, without producing better results.
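A back-of-envelope check of this expansion, under a rough assumption about the raw text size per token:

```python
# Rough check of the two-orders-of-magnitude expansion. Assumption:
# a document token averages ~5 bytes of raw UTF-8 text; in the Tensor it
# becomes a 128-dim float32 vector.
dim, bytes_per_float = 128, 4
bytes_per_token_vector = dim * bytes_per_float          # 512 bytes per token
bytes_per_token_text = 5                                # assumption, order-of-magnitude only
print(bytes_per_token_vector / bytes_per_token_text)    # ~100x expansion
# This is consistent in order of magnitude with 2 GB of MLDR text
# growing to ~320 GB of Tensor data.
```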

It is therefore clear that ColBERT delivers far more benefit as a Reranker than as a Ranker. The current best RAG retrieval scheme is to add the ColBERT Reranker on top of 3-way hybrid search (full-text search + vector + sparse vector). Some readers may ask: adopting the ColBERT Reranker requires adding a separate Tensor column, and that column expands the original dataset by two orders of magnitude. Is it worth it? First, Infinity provides binary quantization for Tensors; used as a Reranker it barely affects the ranking results while shrinking the final data to 1/32 of the original Tensor size (a sketch of the idea follows below). Secondly, even so, some may consider the overhead too high; yet from the user's perspective, spending more storage in exchange for higher ranking quality at lower cost (no GPU is needed for ranking) is still worthwhile. Finally, we believe a Late Interaction model with slightly lower quality but greatly reduced storage overhead will be available soon; as a piece of Data Infra, the wise choice is to stay transparent about these changes and leave the trade-offs to users.
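A minimal sketch of the binary quantization idea: keep only the sign of each dimension and pack 8 bits per byte, so 128 float32 values (512 bytes) shrink to 16 bytes, exactly 1/32. This illustrates the space saving only and is not Infinity's actual implementation:

```python
import numpy as np

def binarize(tensor: np.ndarray) -> np.ndarray:
    """Keep only the sign bit of each float32 dimension, packed 8 per byte."""
    bits = (tensor > 0).astype(np.uint8)   # (num_tokens, dim) in {0, 1}
    return np.packbits(bits, axis=1)       # (num_tokens, dim // 8) uint8

def unpack(packed: np.ndarray, dim: int) -> np.ndarray:
    """Restore {-1, +1} vectors for an approximate MaxSim at rerank time."""
    bits = np.unpackbits(packed, axis=1)[:, :dim].astype(np.float32)
    return bits * 2.0 - 1.0

rng = np.random.default_rng(0)
doc = rng.standard_normal((128, 128)).astype(np.float32)
packed = binarize(doc)
print(doc.nbytes, packed.nbytes, doc.nbytes // packed.nbytes)  # 65536 2048 32
```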

The above is Infinity's multi-way recall evaluation on the MLDR dataset. Results on other datasets may differ, but the overall conclusion stands: 3-way hybrid search plus Tensor-based reranking is currently the highest-quality retrieval scheme.

From this we can see that ColBERT and its late interaction model have great application value in RAG scenarios. The above concerns content generation over text; recently, the late interaction model has also achieved SOTA results in multimodal scenarios with ColPali [Reference 9], which changes the RAG workflow as shown in the following figure:



When RAG faces documents with complex layouts, the current SOTA is to use a document-recognition model to identify the document's layout, then call dedicated models for the identified structures, such as charts and images, to convert them into text, which is then saved in various formats into the databases backing RAG. ColPali skips these steps and uses a multimodal model to generate Embeddings directly; at question time, it can answer directly about the charts in a document:



ColPali is trained similarly to ColBERT, using query-document page pairs to capture the semantic association between the query and the document's multimodal data, but it uses PaliGemma [Reference 10] to generate the multimodal Embeddings. Compared with BiPali, which also generates Embeddings with PaliGemma but without the Late Interaction mechanism, the nDCG@5 comparison is 81.3 vs 58.8, the difference between "excellent" and "not working at all".



Therefore, although ColBERT has been around for 4 years, the application of the Late Interaction model in RAG is only just beginning. It will surely expand RAG's use cases and provide high-quality semantic recall in complex RAG scenarios, including multimodal ones. Infinity is ready for its end-to-end application. Welcome to follow and star Infinity at https://github.com/infiniflow/infinity, which strives to become the best AI-native database!

References

1. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020.

2. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction, arXiv:2112.01488, 2021.

3. RAGatouille https://github.com/bclavie/RAGatouille

4. Efficient Multi-vector Dense Retrieval with Bit Vectors, ECIR 2024.

5. https://huggingface.co/mteb

6. https://huggingface.co/BAAI/bge-m3

7. https://huggingface.co/jinaai/jina-colbert-v1-en

8. https://github.com/infiniflow/infinity/tree/main/python/benchmark/mldr_benchmark

9. ColPali: Efficient Document Retrieval with Vision Language Models, arXiv:2407.01449, 2024.

10. https://github.com/google-research/big_vision/tree/main/big_vision/configs/proj/paligemma