2024-07-16
AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions covering work from top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]
The first author, Haiwen Diao, is a doctoral student at Dalian University of Technology, supervised by Professor Huchuan Lu. He is currently an intern at the Beijing Academy of Artificial Intelligence (BAAI), supervised by Dr. Xinlong Wang. His research interests are vision and language, efficient transfer of large models, and multimodal large models. The co-first author, Yufeng Cui, graduated from Beihang University and is an algorithm researcher at BAAI's vision center. His research interests are multimodal models, generative models, and computer vision, and his main work is the Emu series.
Recently, research on multimodal large models has been in full swing, and industry investment keeps growing. Abroad, a series of popular models have been released, such as GPT-4o (OpenAI), Gemini (Google), Phi-3V (Microsoft), Claude-3V (Anthropic), and Grok-1.5V (xAI). Meanwhile, domestic models such as GLM-4V (Zhipu AI), Step-1.5V (StepFun), Emu2 (BAAI), Intern-VL (Shanghai AI Laboratory), and Qwen-VL (Alibaba) are also flourishing.
Current vision-language models (VLMs) usually rely on a vision encoder (VE) to extract visual features, which are then combined with user instructions and passed to a large language model (LLM) for processing and answering. The main challenge is that the vision encoder and the LLM are trained separately. This separation introduces visual inductive biases when the vision encoder is connected to the LLM, such as limited image resolution and aspect ratio and strong visual semantic priors. Moreover, as the capacity of vision encoders keeps growing, the deployment efficiency of multimodal large models when processing visual signals is severely limited. Finding the optimal capacity allocation between the vision encoder and the LLM has also become increasingly complex and challenging.
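For concreteness, below is a minimal PyTorch sketch of this conventional encoder-based pipeline: a deep, separately pretrained vision encoder produces features that a projector maps into the LLM's token space. The stand-in modules, sizes, and names are illustrative assumptions, not any particular released model.

```python
import torch
import torch.nn as nn

dim, vocab = 512, 1000
vision_encoder = nn.Sequential(                       # stand-in for a deep, separately pretrained ViT-style encoder
    nn.Conv2d(3, dim, kernel_size=16, stride=16),     # fixed input resolution / aspect ratio assumed
    nn.Flatten(2),                                    # (B, dim, H/16 * W/16)
)
projector = nn.Linear(dim, dim)                       # bridges encoder features into the LLM token space
text_embed = nn.Embedding(vocab, dim)
llm = nn.TransformerEncoder(                          # stand-in for a decoder-only LLM (causal mask omitted)
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

image = torch.randn(1, 3, 224, 224)
text_ids = torch.randint(0, vocab, (1, 16))
vis_tokens = projector(vision_encoder(image).transpose(1, 2))       # (1, 196, dim)
hidden = llm(torch.cat([vis_tokens, text_embed(text_ids)], dim=1))  # visual tokens + instruction tokens
print(hidden.shape)                                                 # torch.Size([1, 212, 512])
```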
In this context, some more cutting-edge ideas quickly emerged:
Adept AI released the Fuyu series of models at the end of 2023 and made some related attempts, but disclosed nothing about its training strategy, data resources, or hardware. Moreover, Fuyu shows a significant performance gap with mainstream algorithms on public vision-language benchmarks. Pilot experiments we conducted also showed that even with a greatly increased pre-training data scale, a native encoder-free multimodal large model still faces thorny problems such as slow convergence and poor performance.
In response to these challenges, the vision team at the Beijing Academy of Artificial Intelligence (BAAI), together with Dalian University of Technology, Peking University, and other domestic universities, launched EVE, a new generation of encoder-free vision-language model. Through refined training strategies and additional visual supervision, EVE integrates vision-language representation, alignment, and reasoning into a unified, decoder-only architecture. Using only publicly available data, EVE performs well on multiple vision-language benchmarks, is comparable to mainstream encoder-based multimodal methods of similar capacity, and significantly outperforms the similar Fuyu-8B. EVE aims to provide a transparent and efficient path toward native, decoder-only multimodal architectures.
1. Technical highlights
2. Model Structure
First, the Vicuna-7B language model is used for initialization, providing rich language knowledge and strong instruction-following capability. On this basis, the deep vision encoder is removed and a lightweight visual encoding layer is constructed, which efficiently and losslessly encodes image inputs and feeds them, together with the user's language instructions, into a unified decoder. In addition, a visual alignment layer aligns features with a general-purpose vision encoder to strengthen the encoding and representation of fine-grained visual information.
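Below is a minimal, hypothetical PyTorch sketch of this encoder-free layout: a lightweight patch embedding layer replaces the deep vision encoder, and a single unified decoder consumes visual and text tokens together. The layer names, sizes, and the single-convolution patch embedding are illustrative assumptions, not EVE's released implementation.

```python
import torch
import torch.nn as nn

class PatchEmbeddingLayer(nn.Module):
    """Lightweight, trained-from-scratch layer that turns raw pixels into visual tokens."""
    def __init__(self, dim=512, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images):                 # (B, 3, H, W); H and W only need to be divisible by `patch`
        x = self.proj(images)                  # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)    # (B, N_img, dim)

class EncoderFreeVLM(nn.Module):
    def __init__(self, dim=512, vocab=1000, layers=2):
        super().__init__()
        self.patch_embed = PatchEmbeddingLayer(dim)
        self.text_embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, num_layers=layers)   # the single, unified decoder
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, images, text_ids):
        vis = self.patch_embed(images)                   # no deep pretrained vision encoder involved
        txt = self.text_embed(text_ids)
        h = self.decoder(torch.cat([vis, txt], dim=1))
        return self.lm_head(h), h, vis.shape[1]          # hidden states reused for alignment (Sec. 2.2)

model = EncoderFreeVLM()
logits, hidden, n_vis = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 16)))
print(logits.shape, n_vis)   # torch.Size([1, 212, 1000]) 196
```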
2.1 Patch Embedding Layer
2.2 Patch Aligning Layer
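Based on the description in Section 2 (the alignment layer aligns features with a general vision encoder), here is a hedged sketch of what such a patch-level alignment objective could look like during training. The frozen CLIP-style teacher, the linear projection, and the MSE loss are assumptions made for illustration and may differ from EVE's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAligningLoss(nn.Module):
    """Pulls decoder features at visual-token positions toward a frozen teacher encoder's features."""
    def __init__(self, dec_dim=512, teacher_dim=768):
        super().__init__()
        self.proj = nn.Linear(dec_dim, teacher_dim)   # map decoder features into the teacher's space

    def forward(self, decoder_hidden, n_vis, teacher_feats):
        # decoder_hidden: (B, N_img + N_txt, dec_dim) hidden states from the unified decoder
        # teacher_feats:  (B, N_img, teacher_dim) patch features from a frozen vision encoder (e.g. a CLIP ViT)
        student = self.proj(decoder_hidden[:, :n_vis])       # keep only the visual-token positions
        return F.mse_loss(F.normalize(student, dim=-1),
                          F.normalize(teacher_feats, dim=-1))

# Illustrative usage: add this term to the usual next-token prediction loss during training.
align = PatchAligningLoss()
decoder_hidden = torch.randn(1, 196 + 16, 512)    # stands in for the unified decoder's output
teacher_feats = torch.randn(1, 196, 768)          # stands in for frozen teacher patch features
print(align(decoder_hidden, 196, teacher_feats).item())
```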
3. Training strategy
4. Quantitative Analysis
On multiple vision-language benchmarks, EVE significantly outperforms the comparable Fuyu-8B and performs on par with a variety of mainstream encoder-based vision-language models. However, because its training relies heavily on vision-language data, EVE still faces challenges in responding accurately to certain instructions, and its performance on some benchmarks leaves room for improvement. Encouragingly, with an efficient training strategy, the encoder-free EVE achieves performance comparable to encoder-based vision-language models while fundamentally resolving the problems mainstream models face in input-resolution flexibility, deployment efficiency, and capacity matching across modalities.
Whereas encoder-based models are prone to problems such as simplified language structure and loss of rich knowledge, EVE's performance improves gradually and steadily as the data scale increases, progressively approaching the level of encoder-based models. This may be because encoding and aligning the visual and language modalities within a unified network is more challenging, which makes encoder-free models less prone to overfitting than their encoder-based counterparts.
5. What do peers think?
Ali Hatamizadeh, a senior researcher at NVIDIA, said that EVE is refreshing and attempts to propose a new narrative, different from building ever more complex evaluation standards and making incremental improvements to vision-language models.
Armand Joulin, chief researcher at Google DeepMind, said it is exciting to build a decoder-only vision-language model.
Prince Canuma, a machine learning engineer at Apple, said that the EVE architecture is very interesting and a good addition to the MLX VLM project suite.
6. Future Outlook
As a native, encoder-free vision-language model, EVE has achieved encouraging results. Along this path, there are some interesting directions worth exploring in the future: