
In the big model industry, is there no such thing as "real" open source?

2024-08-01



Author: Zhou Yixiao
Email: [email protected]

The open-source large model market has been very active recently. First Apple open-sourced DCLM, a small 7-billion-parameter model; then heavyweights Meta and Mistral released Llama 3.1 and Mistral Large 2 in quick succession. In many benchmark tests, Llama 3.1 surpassed closed-source SOTA models.

However, the debate between open source and closed source shows no signs of stopping.

On the one hand, Meta said after the release of Llama 3.1: "Now, we are ushering in a new era led by open source." On the other hand, Sam Altman wrote an article in The Washington Post, directly raising the contradiction between open source and closed source to the national and ideological level.

At the World Artificial Intelligence Conference some time ago, Robin Li bluntly stated that "open source is actually a kind of IQ tax", because closed-source models obviously have stronger performance and lower reasoning costs, which once again sparked discussion.

Later, Fu Sheng also expressed his opinion. He believed that the two camps of open source and closed source compete with each other and develop together. He also refuted the view that "open source is actually a kind of IQ tax": "Open source large language models are free, so how can they be an IQ tax? Who is collecting the tax?" "If companies use paid closed source large language models today, that is called an 'IQ tax', especially if they charge very high model licensing fees and API fees, spending hundreds of millions of yuan a year, and finally buying them back as a decoration, and even employees can't use (the model) at all."

The core of this debate involves the direction and model of technological development, reflecting the views and positions of different stakeholders. Before we talk about the open source and closed source of large language models, we need to clarify the two basic concepts of "open source" and "closed source".

The term "open source" originates from the software field and refers to making the source code of software public during its development process, allowing anyone to view, modify and distribute it.Open Source SoftwareThe development of usually follows the principles of mutual cooperation and peer production, which promotes the improvement of production modules, communication channels and interactive communities. Typical representatives include Linux and Mozilla Firefox.

Closed-source (proprietary) software keeps its source code private for commercial or other reasons, shipping only machine-readable programs (such as binaries); the source code is controlled solely by the developer. Typical representatives include Windows and macOS.

Open source is a software development model based on openness, sharing and collaboration. It encourages everyone to participate in the development and improvement of software, and promotes the continuous advancement and wide application of technology.

Closed-source software tends to be a stable, focused product, but it usually costs money, and if it has bugs or missing features, you can only wait for the vendor to fix them.

As for what is an open source big model, the industry has not reached a clear consensus like open source software.

Open-sourcing large language models is conceptually similar to open-sourcing software: both are based on openness, sharing, and collaboration, encourage the community to participate in development and improvement, promote technological progress, and increase transparency.

However, there are significant differences in implementation and requirements.

Software open source mainly concerns applications and tools and has modest resource requirements, while open-sourcing a large language model involves massive computing resources and high-quality data, and may carry more usage restrictions. So although both aim to promote innovation and the spread of technology, open-sourcing large models involves more complexity, and community contributions take different forms.

Robin Li also emphasized the difference between the two: model open source does not mean code open source. "Open-source models only give you a bunch of parameters. You still need to do SFT (supervised fine-tuning) and safety alignment yourself. Even with the corresponding source code, you don't know what data, in what proportions, was used to train those parameters, so you can't have everyone pitch in together. Getting these things doesn't let you iterate and develop on the shoulders of giants."

Fully open-sourcing a large language model means making every stage of its development transparent: data collection, model design, training, and deployment. This covers not only disclosing the datasets and model architecture, but also sharing the training code and releasing the pre-trained model weights.

The past year has seen a huge increase in the number of large language models, many claiming to be open source, but how open are they really?

Andreas Liesenfeld, an AI researcher at Radboud University in the Netherlands, and computational linguist Mark Dingemanse found that while the term “open source” is widely used, many models are at best “open weights,” with most other aspects of the system’s construction hidden.

For example, although Meta and Microsoft advertised their large language models as "open source", they did not disclose important information about the underlying technology. Surprisingly, AI companies and institutions with fewer resources fared more commendably.

The research team analyzed a series of popular "open source" large language model projects, evaluating their actual openness from multiple aspects such as code, data, weights, APIs, and documents. The study also used OpenAI's ChatGPT as a reference point for closed source, highlighting the true status of "open source" projects.

[Table: openness assessment of popular "open source" LLM projects across code, data, weights, APIs, and documentation]

✔ is open, ~ is partially open, X is closed
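To make the idea of a multi-dimension openness assessment concrete, here is a toy sketch of how such ratings could be tallied into a ranking. The five dimensions follow the study; the model names, ratings, and averaging scheme below are purely illustrative assumptions, not the paper's actual scores.

```python
# Toy openness tally: ✔ (open) = 1.0, ~ (partial) = 0.5, X (closed) = 0.0.
# Models and ratings are illustrative placeholders, not the study's data.
OPEN, PARTIAL, CLOSED = 1.0, 0.5, 0.0

models = {
    "research-model-a": {"code": OPEN, "data": OPEN, "weights": OPEN,
                         "api": OPEN, "docs": PARTIAL},
    "open-weights-b":   {"code": CLOSED, "data": CLOSED, "weights": OPEN,
                         "api": OPEN, "docs": PARTIAL},
    "closed-model-c":   {"code": CLOSED, "data": CLOSED, "weights": CLOSED,
                         "api": OPEN, "docs": CLOSED},
}

def openness_score(ratings: dict) -> float:
    """Average the per-dimension ratings into a 0..1 openness score."""
    return sum(ratings.values()) / len(ratings)

# Rank models from most to least open.
ranking = sorted(models, key=lambda m: openness_score(models[m]), reverse=True)
```

Under this simple averaging, an "open weights" release scores well above a fully closed system but well below a fully documented research release, mirroring the spectrum the study describes.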

The results show significant differences between projects. By this ranking, OLMo from the Allen Institute for AI is the most open model, followed by BloomZ from BigScience; both were developed by non-profit organizations.

The paper states that although Meta's Llama and Google DeepMind's Gemma claim to be open source or open, they are actually just open weights. External researchers can access and use pre-trained models, but cannot inspect or customize the models, nor do they know how the models are fine-tuned for specific tasks.

The recent releases of Llama 3.1 and Mistral Large 2 have attracted widespread attention. In terms of openness, Llama 3.1 discloses its model weights: users can access and use the pre-trained and fine-tuned weights. Meta also provides some basic code for pre-training and fine-tuning, but not the complete training code, and the training data is not public. This time, however, Meta did publish a 93-page technical report on Llama 3.1 405B.

The situation with Mistral Large 2 is similar: it is quite open in terms of model weights and APIs, but less so in terms of complete code and training data, balancing commercial interests against openness by allowing research use while restricting commercial use.

Google said it was “very precise in its language” when describing its model, and that it referred to Gemma as open rather than open source. “Existing open source concepts don’t always apply directly to AI systems,” it said.

An important piece of context for this research is the EU's AI Act, which, once in force, imposes lighter regulation on models classified as open, so the definition of open source is likely to become more consequential.

The only way to innovate is by tweaking the models, which requires enough information to build your own version, the researchers said. Not only that, the models must also be scrutinized; for example, if a model is trained on a large number of test samples, then passing a particular test may not be an achievement.

They are also pleased that so many open source alternatives have emerged. ChatGPT is so popular that it's easy to forget that you know nothing about its training data or other behind-the-scenes methods. This is a hidden danger for those who want to better understand the model or build applications based on it, and open source alternatives make critical basic research possible.

Silicon Stars also compiled statistics on the openness of some Chinese open-source large models:


From the table we can see that, much as overseas, the most thoroughly open models come mainly from research institutions. This is chiefly because research institutions aim to advance scientific research and the industry, and are more inclined to open up their results.

Commercial companies use their resource advantages to develop more powerful models and gain advantages over the competition through appropriate open source strategies.


From GPT-3 to BERT, open source has brought important impetus to the large model ecosystem.

By making its architecture and training methods public, researchers and developers can conduct further exploration and improvement based on these foundations, giving rise to more cutting-edge technologies and applications.

The emergence of open source big models has significantly lowered the threshold for development. Developers and small and medium-sized enterprises can use these advanced AI technologies without having to build models from scratch, saving a lot of time and resources. This has enabled more innovative projects and products to be quickly implemented, driving the development of the entire industry. Developers actively share optimization methods and application cases on open source platforms, which has also promoted the maturity and application of technologies.

For education and scientific research, open source large language models provide valuable resources. By studying and using these models, students and novice developers can quickly master advanced AI technologies, shorten the learning curve, and inject fresh blood into the industry.

However, the openness of large language models is not a simple binary property. Transformer-based systems and their training processes are extremely complex and hard to classify as simply open or closed. "Open source" for large models is not a single label but a spectrum, ranging from fully open to only partially open.

Open sourcing large language models is a complex and meticulous task, and not all models need to be open source.

We should not demand full open source through moral coercion, because open-sourcing involves substantial technical, resource, and security considerations, and requires balancing openness with security and innovation with responsibility. As elsewhere in technology, diverse ways of contributing build a richer ecosystem.

The relationship between open source and closed source models can perhaps be compared to the coexistence of open source and closed source software in the software industry.

The open source model promotes the widespread dissemination and innovation of technology, providing more possibilities for researchers and enterprises, while the closed source model promotes the improvement of standards for the entire industry. The healthy competition between the two stimulates the motivation for continuous improvement and provides users with a variety of choices.

Just as open source and proprietary software together shape today's software ecosystem, open-source and closed-source large models are not in binary opposition. Their coexistence and joint development is an important driving force for advancing AI technology and meeting the needs of different application scenarios. Ultimately, users and the market will make the choices that suit them.