
Abandoning the visual encoder, this "native" multimodal large model can still rival mainstream methods

2024-07-16




AIxiv is a column where Synced publishes academic and technical content. In the past few years, Synced's AIxiv column has received more than 2,000 articles, covering top laboratories in major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work to share, please submit or contact us for reporting. Submission email: [email protected]; [email protected]

The first author, Haiwen Diao, is a doctoral student at Dalian University of Technology, supervised by Professor Huchuan Lu. He is currently an intern at the Beijing Academy of Artificial Intelligence (BAAI), supervised by Dr. Xinlong Wang. His research interests are vision and language, efficient transfer of large models, and multimodal large models. The co-first author, Yufeng Cui, graduated from Beihang University and is an algorithm researcher at BAAI's vision center. His research interests are multimodal models, generative models, and computer vision, and his main work includes the Emu series.

Recently, research on multimodal large models has been in full swing, and industry investment keeps growing. Internationally, a series of popular models have been launched, such as GPT-4o (OpenAI), Gemini (Google), Phi-3V (Microsoft), Claude-3V (Anthropic), and Grok-1.5V (xAI). Meanwhile, domestic models such as GLM-4V (Zhipu AI), Step-1.5V (StepFun), Emu2 (BAAI), Intern-VL (Shanghai AI Laboratory), and Qwen-VL (Alibaba) are also flourishing.

Current vision-language models (VLMs) usually rely on a vision encoder (VE) to extract visual features, which are then combined with user instructions and passed to a large language model (LLM) for processing and answering. The main challenge is that the vision encoder and the large language model are trained separately. This separation introduces visual inductive biases when the vision encoder is connected to the LLM, such as limited image resolutions and aspect ratios and strong visual semantic priors. As the capacity of vision encoders keeps growing, the deployment efficiency of multimodal large models when processing visual signals is also severely limited. In addition, finding the optimal capacity configuration between the vision encoder and the large language model has become increasingly complex and challenging.

In this context, some more cutting-edge ideas quickly emerged:

  • Is it possible to remove the visual encoder and directly build a native multimodal large model without a visual encoder?
  • How to efficiently and smoothly evolve a large language model into a native multimodal large model without a visual encoder?
  • How to bridge the performance gap between encoder-free native multimodal frameworks and the encoder-based mainstream multimodal paradigm?

Adept AI released the Fuyu series of models at the end of 2023 and made some related attempts, but disclosed little about its training strategies, data resources, or hardware. In addition, Fuyu shows a significant performance gap with mainstream algorithms on public vision-language benchmarks. Meanwhile, pilot experiments we conducted showed that even when the scale of pre-training data is greatly increased, encoder-free native multimodal large models still face thorny problems such as slow convergence and poor performance.

In response to these challenges, the vision team of the Beijing Academy of Artificial Intelligence (BAAI), together with Dalian University of Technology, Peking University, and other domestic universities, launched EVE, a new generation of encoder-free vision-language model. Through refined training strategies and additional visual supervision, EVE integrates visual-language representation, alignment, and reasoning into a unified, pure-decoder architecture. Using only public data, EVE performs well on multiple vision-language benchmarks, is comparable to mainstream encoder-based multimodal methods of similar capacity, and significantly outperforms the comparable Fuyu-8B. EVE aims to provide a transparent and efficient path for developing native, pure-decoder multimodal architectures.





  • Paper address: https://arxiv.org/abs/2406.11832
  • Project code: https://github.com/baaivision/EVE
  • Model address: https://huggingface.co/BAAI/EVE-7B-HD-v1.0

1. Technical highlights

  • Native vision-language model: It breaks the fixed paradigm of mainstream multimodal models, removes the vision encoder, and can handle arbitrary image aspect ratios. It significantly outperforms the comparable Fuyu-8B model on multiple vision-language benchmarks and approaches mainstream encoder-based vision-language architectures.
  • Low data and training cost: Pre-training of the EVE model uses only public data filtered from OpenImages, SAM, and LAION, plus 665K LLaVA instruction samples and an additional 1.2 million visual dialogue samples to build the regular and high-resolution versions of EVE-7B, respectively. Training takes about 9 days on two 8-A100 (40G) nodes, or about 5 days on four such nodes.
  • Transparent and efficient exploration: EVE explores an efficient, transparent, and practical path toward native vision-language models, providing new ideas and valuable experience for a new generation of pure-decoder vision-language architectures and opening up new directions for future multimodal models.

2. Model Structure



First, the Vicuna-7B language model is used for initialization, providing rich language knowledge and strong instruction-following ability. On this basis, the deep vision encoder is removed and a lightweight visual encoding layer is constructed, which encodes image inputs efficiently and almost losslessly and feeds them, together with the user's language instructions, into a unified decoder. In addition, a visual alignment layer aligns the features with those of a general vision encoder to strengthen the encoding and representation of fine-grained visual information.
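To make this flow concrete, here is a minimal PyTorch sketch. It is not the released EVE implementation: the module names, dimensions, and the assumption that the LLM consumes token embeddings directly and returns hidden states of the same width are ours.

```python
import torch
import torch.nn as nn


class EncoderFreeVLM(nn.Module):
    """Toy sketch: lightweight patch embedding + decoder-only LLM + alignment head."""

    def __init__(self, llm: nn.Module, hidden: int = 4096, patch: int = 14, clip_dim: int = 1024):
        super().__init__()
        self.llm = llm                                  # decoder-only LLM (e.g. Vicuna-7B)
        # lightweight visual encoding layer instead of a deep vision encoder
        self.patch_embed = nn.Conv2d(3, hidden, kernel_size=patch, stride=patch)
        # maps decoder hidden states into a frozen vision encoder's feature space
        # for the auxiliary visual alignment loss (used during training only)
        self.align_head = nn.Linear(hidden, clip_dim)

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        # image: (B, 3, H, W) with arbitrary aspect ratio; text_embeds: (B, T, hidden)
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)   # (B, N, hidden)
        seq = torch.cat([vis, text_embeds], dim=1)                 # one unified token sequence
        hidden_states = self.llm(seq)                              # (B, N + T, hidden)
        vis_align = self.align_head(hidden_states[:, : vis.size(1)])
        return hidden_states, vis_align
```

In the actual model the visual encoding layer is the more elaborate patch embedding layer described next, and the alignment target comes from a frozen general-purpose vision encoder.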



2.1 Patch Embedding Layer

  • First, a single convolutional layer is used to obtain the 2D feature map of the image, and then downsampled through an average pooling layer;
  • Use the cross attention module (CA1) to interact in a limited receptive field to enhance the local features of each patch;
  • Use the <CLS> token in combination with the cross-attention module (CA2) to provide global information for each subsequent patch feature;
  • A learnable <SPL> token is inserted at the end of each row of patch features to help the network understand the 2D spatial structure of the image (a minimal sketch of these steps follows this list).
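The sketch below walks through these four steps. It is a simplification, not the official code: CA1 is written as plain attention rather than attention within a limited receptive field, and all dimensions are assumed.

```python
import torch
import torch.nn as nn


class PatchEmbeddingLayer(nn.Module):
    def __init__(self, dim: int = 1024, patch: int = 14, pool: int = 2, heads: int = 8):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # 2D feature map
        self.pool = nn.AvgPool2d(pool)                                   # downsampling
        self.ca1 = nn.MultiheadAttention(dim, heads, batch_first=True)   # local enhancement (CA1)
        self.ca2 = nn.MultiheadAttention(dim, heads, batch_first=True)   # global context via <CLS> (CA2)
        self.cls = nn.Parameter(torch.randn(1, 1, dim) * 0.02)           # <CLS> token
        self.spl = nn.Parameter(torch.randn(1, 1, 1, dim) * 0.02)        # learnable <SPL> token

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.pool(self.conv(image))                    # (B, D, h, w)
        B, D, h, w = feat.shape
        patches = feat.flatten(2).transpose(1, 2)             # (B, h*w, D)
        patches, _ = self.ca1(patches, patches, patches)      # CA1: enhance each patch feature
        cls = self.cls.expand(B, -1, -1)
        cls, _ = self.ca2(cls, patches, patches)              # CA2: aggregate global info into <CLS>
        patches = patches + cls                               # broadcast global info to every patch
        rows = patches.view(B, h, w, D)
        spl = self.spl.expand(B, h, -1, -1)                   # one <SPL> at the end of each row
        return torch.cat([rows, spl], dim=2).flatten(1, 2)    # (B, h*(w+1), D)
```

For a 448×448 input with patch size 14 and 2×2 pooling, this produces 16 rows of 16 patch tokens, each row ending with a <SPL> token.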

2.2 Patch Aligning Layer

  • Record the 2D shape of the valid patches, discard the <CLS>/<SPL> tokens, and restore the features to their original two-dimensional shape using an adaptive pooling layer;
  • Through the hierarchical cross-attention module (CA3), visual features from multiple network layers are integrated to achieve fine-grained alignment with the vision encoder's output (a rough sketch follows this list).
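The sketch below illustrates this alignment step under the same caveats: the teacher grid, loss choice (cosine distance), and all shapes are assumptions, and the <CLS>/<SPL> tokens are assumed to have been dropped already.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchAligningLayer(nn.Module):
    def __init__(self, dim: int = 4096, teacher_dim: int = 1024, teacher_grid: int = 16, heads: int = 8):
        super().__init__()
        self.teacher_grid = teacher_grid                                 # patch grid of the frozen vision encoder
        self.ca3 = nn.MultiheadAttention(dim, heads, batch_first=True)   # layer-wise fusion (CA3)
        self.proj = nn.Linear(dim, teacher_dim)

    def forward(self, layer_states: list[torch.Tensor], h: int, w: int,
                teacher_feat: torch.Tensor) -> torch.Tensor:
        # layer_states: visual hidden states from several decoder layers, each (B, h*w, D)
        query = layer_states[-1]
        memory = torch.cat(layer_states[:-1], dim=1)
        fused, _ = self.ca3(query, memory, memory)                       # CA3: integrate multi-layer features
        B, _, D = fused.shape
        grid = fused.transpose(1, 2).reshape(B, D, h, w)                 # back to the recorded 2D shape
        grid = F.adaptive_avg_pool2d(grid, self.teacher_grid)            # match the teacher's grid
        student = self.proj(grid.flatten(2).transpose(1, 2))             # (B, grid*grid, teacher_dim)
        # fine-grained alignment loss against the frozen vision encoder's patch features
        return (1 - F.cosine_similarity(student, teacher_feat, dim=-1)).mean()
```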

3. Training strategy



  • LLM-guided pre-training stage: establishes the initial connection between vision and language, laying the foundation for subsequent stable and efficient large-scale pre-training;
  • Generative pre-training stage: further improves the model's understanding of visual-linguistic content and achieves a smooth transition from a pure language model to a multimodal model;
  • Supervised fine-tuning stage: further regularizes the model's ability to follow language instructions and learn conversational patterns, meeting the requirements of various vision-language benchmarks (a rough freezing/unfreezing sketch follows this list).
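One rough way to express this schedule in code is to toggle which parameters are trainable per stage. The module names and the exact freezing choices below are our assumptions (the text only says the first stage is guided by the language model); the data descriptions follow the next section.

```python
from typing import NamedTuple

import torch.nn as nn


class Stage(NamedTuple):
    name: str
    train_llm: bool       # whether the language model weights are updated
    train_visual: bool    # whether the lightweight visual layers are updated
    data: str


STAGES = [
    Stage("llm_guided_pretraining", False, True, "re-captioned public images (subset)"),
    Stage("generative_pretraining", True, True, "33M re-captioned OpenImages/SAM/LAION samples"),
    Stage("supervised_fine_tuning", True, True, "LLaVA-mix-665K (plus mixed SFT sets for EVE-7B-HD)"),
]


def set_trainable(model: nn.Module, stage: Stage) -> None:
    """Freeze/unfreeze parameters for one stage (hypothetical module names)."""
    for name, p in model.named_parameters():
        is_visual = name.startswith(("patch_embed", "align_head"))
        p.requires_grad = stage.train_visual if is_visual else stage.train_llm
```

Training a stage then amounts to calling set_trainable(model, stage) before the usual optimization loop.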



  • In the pre-training stage, 33 million publicly available images from SA-1B, OpenImages, and LAION were screened, keeping only samples with resolution higher than 448×448. In particular, to address the high redundancy of LAION images, K-means clustering was applied to image features extracted with EVA-CLIP to produce 50,000 clusters, and the 300 images closest to each cluster center were kept, yielding about 15 million LAION image samples (see the sketch after this list). Subsequently, high-quality image descriptions were regenerated using Emu2 (17B) and LLaVA-1.5 (13B).
  • In the supervised fine-tuning stage, the LLaVA-mix-665K dataset is used to train the standard version of EVE-7B, while mixed datasets such as AI2D, Synthdog, DVQA, ChartQA, DocVQA, Vision-Flan, and Bunny-695K are added to train the high-resolution version of EVE-7B.
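The LAION de-redundancy step can be reproduced roughly as follows, assuming image embeddings have already been extracted with EVA-CLIP; the use of faiss and the exact K-means settings are our assumptions.

```python
import numpy as np
import faiss  # assumed here for large-scale K-means; any clustering library would do


def select_diverse_subset(feats: np.ndarray, n_clusters: int = 50_000,
                          per_cluster: int = 300) -> np.ndarray:
    """feats: (N, D) image embeddings (e.g. from EVA-CLIP); returns indices of kept images."""
    feats = np.ascontiguousarray(feats, dtype=np.float32)
    d = feats.shape[1]
    # cluster all embeddings into 50,000 groups
    kmeans = faiss.Kmeans(d, n_clusters, niter=20, verbose=True)
    kmeans.train(feats)
    # for each cluster center, keep the 300 closest images (~15M in total)
    index = faiss.IndexFlatL2(d)
    index.add(feats)
    _, idx = index.search(kmeans.centroids, per_cluster)
    return np.unique(idx.ravel())
```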

4. Quantitative Analysis



The EVE model significantly outperforms the comparable Fuyu-8B model on multiple vision-language benchmarks and performs on par with a variety of mainstream encoder-based vision-language models. However, because it is trained on a large amount of vision-language data, it still struggles to respond accurately to certain instructions, and its performance on some benchmarks leaves room for improvement. Encouragingly, with an efficient training strategy, the encoder-free EVE achieves performance comparable to encoder-based vision-language models, fundamentally addressing mainstream models' problems with input-size flexibility, deployment efficiency, and capacity matching between modalities.



Unlike encoder-based models, which are prone to problems such as simplified language structure and loss of rich knowledge, EVE's performance improves gradually and steadily as the data scale increases, progressively approaching the level of encoder-based models. This may be because encoding and aligning the visual and language modalities within a unified network is more challenging, which makes encoder-free models less prone to overfitting than their encoder-based counterparts.

5. What do your peers think?

Ali Hatamizadeh, a senior researcher at NVIDIA, said that EVE is refreshing and attempts to propose a new narrative, different from building complicated evaluation benchmarks or making incremental improvements to vision-language models.



Armand Joulin, chief researcher at Google DeepMind, said that building a decoder-only vision-language model is exciting.



Apple machine learning engineer Prince Canuma said that the EVE architecture is very interesting and is a good addition to the MLX VLM project.



6. Future Outlook

As a native visual language model without an encoder, EVE has achieved encouraging results. Along this path, there are some interesting directions worth exploring in the future:

  • Further performance improvement: Experiments found that pre-training with only vision-language data significantly reduces the model's language ability (the SQA score drops from 65.3% to 63.0%), while gradually improving its multimodal performance. This indicates that updating the large language model causes catastrophic forgetting of language knowledge. Appropriately mixing in pure-language pre-training data, or adopting a mixture-of-experts (MoE) strategy, is recommended to reduce interference between the visual and language modalities.
  • Vision for encoder-free architectures: With appropriate strategies and high-quality training data, encoder-free vision-language models can compete with encoder-based models. How would the two compare given the same model capacity and massive training data? We conjecture that, by scaling up model capacity and the amount of training data, the encoder-free architecture can match or even surpass the encoder-based one, because the former feeds images to the model almost losslessly and avoids the prior biases of a vision encoder.
  • Native multimodal construction: EVE fully demonstrates how to build native multimodal models efficiently and stably, which opens a transparent and feasible path for integrating more modalities (such as audio, video, thermal imaging, and depth) in the future. The core idea is to pre-align these modalities with a frozen large language model before introducing large-scale unified training, using the corresponding unimodal encoders and language-concept alignment for supervision.