
Unveiling DeepSeek: A more extreme story of Chinese technological idealism | 36Kr exclusive

2024-07-22


Text | Yu Lili
Editor | Liu Jing

Among China's seven large-scale model startups, DeepSeek is the quietest, but it is always remembered in unexpected ways.

A year ago, the surprise came from the fact that Huanfang, the quantitative private-fund giant behind it, was the only company outside the big tech firms to have stockpiled 10,000 A100 GPUs. A year later, it was the trigger of China's large-model price war.

In May, amid a barrage of AI releases, DeepSeek shot to fame. The reason was an open-source model it released, DeepSeek V2, which offered unprecedented cost-effectiveness: inference cost was cut to just 1 yuan per million tokens, roughly one-seventh that of Llama 3 70B and one-seventieth that of GPT-4 Turbo.

DeepSeek was quickly dubbed the "Pinduoduo of AI", and giants such as ByteDance, Tencent, Baidu and Alibaba could not hold back, cutting prices one after another. A price war for large models in China was imminent.

The smoke of that war obscures a fact: unlike many large companies burning money on subsidies, DeepSeek is profitable.

Behind this is DeepSeek's across-the-board innovation in model architecture. It proposed MLA (multi-head latent attention), a new attention architecture that reduces memory usage to 5%-13% of that of MHA, the most commonly used architecture until now, while its original DeepSeekMoE sparse structure cuts the amount of computation to a minimum. Together, these drive the cost down.
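To give a rough sense of why caching a compressed latent instead of full per-head keys and values can shrink memory so sharply, here is a back-of-the-envelope sketch in Python. The dimensions below (head count, head size, latent width, sequence length, layer count) are illustrative assumptions, not DeepSeek V2's actual configuration; they only show how such a ratio can land in the single-digit-percent range the article cites.

```python
# Back-of-the-envelope KV-cache comparison: standard MHA vs. an MLA-style scheme.
# All dimensions are made-up, illustrative values -- not DeepSeek V2's real config.

n_heads  = 32     # attention heads
d_head   = 128    # per-head dimension
d_latent = 512    # hypothetical width of the shared compressed latent (MLA-style)
seq_len  = 4096   # number of cached tokens
n_layers = 60     # transformer layers
bytes_pe = 2      # bytes per element (fp16/bf16)

# Standard MHA: cache full K and V for every head, at every layer.
mha_cache = seq_len * n_layers * n_heads * d_head * 2 * bytes_pe

# MLA-style caching: store one compressed latent per token per layer,
# from which keys and values are reconstructed at attention time.
mla_cache = seq_len * n_layers * d_latent * bytes_pe

print(f"MHA cache: {mha_cache / 2**20:.0f} MiB")   # ~3840 MiB with these numbers
print(f"MLA cache: {mla_cache / 2**20:.0f} MiB")   # ~240 MiB
print(f"ratio:     {mla_cache / mha_cache:.1%}")   # ~6%, in the ballpark of the quoted 5%-13%
```

With these made-up numbers the compressed cache is about 6% of the MHA cache, which is the kind of reduction the 5%-13% figure refers to; the exact ratio depends entirely on the latent width and model dimensions actually chosen.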

In Silicon Valley, DeepSeek is called "the mysterious power from the East". The chief analyst of SemiAnalysis believes the DeepSeek V2 paper "may be the best one this year". Andrew Carr, a former OpenAI employee, finds the paper "full of amazing wisdom" and applied its training settings to his own model. Jack Clark, former policy director at OpenAI and co-founder of Anthropic, believes DeepSeek has "hired a group of unfathomable geniuses" and that large models made in China "will become a force that cannot be ignored, just like drones and electric vehicles".

This is a rare occurrence in an AI wave whose narrative has largely been driven by Silicon Valley. Many industry insiders told us that the strong response stems from innovation at the architectural level, something rarely attempted by domestic large-model companies, or even by global open-source base models. One AI researcher said that the Attention architecture has been around for years and has hardly ever been successfully modified, let alone validated at scale. "It's the kind of idea that gets cut at the decision-making stage, because most people lack the confidence for it."

On the other hand, domestic large-model efforts have rarely touched architectural innovation, because few have set out to break a long-held assumption: the United States is better at 0-to-1 technological innovation, while China is better at 1-to-10 application innovation. Besides, the behavior looks uneconomical: a new generation of models will appear in a few months anyway, and Chinese companies need only follow it and focus on applications. Innovating on model structure means there is no path to follow and plenty of failures along the way, at great cost in time and money.

DeepSeek is clearly swimming against this current. Amid the chorus that large-model technology is bound to converge and that following is the smarter shortcut, DeepSeek values the experience accumulated on "detours" and believes that Chinese large-model entrepreneurs can join global technological innovation, not just application innovation.

Many of DeepSeek's choices differ from everyone else's. So far, among China's seven large-model startups, it is the only one that has given up the "do both" route, focusing on research and technology rather than toC applications. It is also the only one that has not put commercialization first, has firmly chosen the open-source route, and has not even raised funding. These choices often leave it out of the conversation, yet its work is frequently spread by users in the community.

How did DeepSeek come about? We interviewed Liang Wenfeng, the founder of DeepSeek who rarely appears in public.

This post-80s founder, who has been quietly immersed in technology behind the scenes since the Huanfang era, carries that low-key style into the DeepSeek era. Like all of his researchers, he "reads papers, writes code, and takes part in group discussions" every day.

Unlike many quantitative-fund founders with overseas hedge-fund experience and degrees in physics or mathematics, Liang Wenfeng has an entirely domestic background: in his early years he studied artificial intelligence in the Department of Electronic Engineering at Zhejiang University.

Many industry insiders and DeepSeek researchers told us that Liang Wenfeng is a very rare person in China's AI community who "has both strong infra engineering and model research capabilities and the ability to mobilize resources", "can make accurate judgments from a high level and is better than first-line researchers in details". He has "terrifying learning ability" and at the same time "is not like a boss at all, but more like a geek".

This is a rare interview, in which this technological idealist offers a voice that is especially scarce in China's tech community today: he is one of the few who puts "right and wrong" before "interests", who reminds us to see the inertia of the times, and who puts "original innovation" on the agenda.

A year ago, when DeepSeek was just getting started, we first interviewed Liang Wenfeng in the article "Crazy Huanfang: The Large-Model Road of an Invisible AI Giant". Back then, "you have to be crazy ambitious and crazy sincere" was still just a beautiful slogan; a year later, it has become action.

The following is part of the conversation:

How was the first shot in the price war fired?

"Undercurrent": After the release of the DeepSeek V2 model, it quickly triggered a bloody large-scale model price war. Some people said that you are a catfish in the industry.

Liang Wenfeng: We didn't intend to be a catfish, we just became a catfish accidentally.

"Undercurrent": Does this result surprise you?

Liang Wenfeng: It was very unexpected. I didn't expect that people would be so sensitive to price. We just do things at our own pace and then calculate the cost and set the price. Our principle is not to lose money or make huge profits. This price is also a little bit of profit above the cost.

"Undercurrent": Five days later, Zhipu AI followed suit, followed by ByteDance, Alibaba, Baidu, Tencent and other major companies.

Liang Wenfeng: Zhipu AI cut the price of an entry-level product, while its model at the same level as ours remained very expensive. ByteDance was the first to really follow: it cut its flagship model to the same price as ours, which then triggered the other big players to cut prices. Because the big players' model costs are much higher than ours, we did not expect anyone to do this at a loss, and it ended up turning into the money-burning, subsidy logic of the internet era.

"Undercurrent": From an outside perspective, price cuts look like an attempt to snatch users, which is usually the case with price wars in the Internet era.

Liang Wenfeng: Attracting users is not our main purpose. We lowered the price because we have already reduced the cost while exploring the structure of the next generation model. On the other hand, we also believe that both API and AI should be accessible to everyone and affordable.

"Undercurrent": Before this, most Chinese companies would directly copy this generation of Llama structure for application. Why did you start from the model structure?

Liang Wenfeng: If the goal is to build applications, then carrying on with the Llama structure and shipping products quickly is a reasonable choice. But our destination is AGI, which means we need to study new model structures in order to achieve stronger model capability with limited resources. This is part of the basic research needed to scale up to larger models. Beyond model structure, we have done a lot of other research, including how to construct data and how to make models more human-like, all of which is reflected in the models we have released. Besides, the Llama structure is probably two generations behind the foreign state of the art in training efficiency and inference cost.

"Undercurrent": Where does this generation gap mainly come from?

Liang Wenfeng: First, there is a gap in training efficiency. We estimate that the gap between the best domestic and foreign models in model structure and training dynamics is about a factor of two; for that reason alone we have to burn twice the compute to achieve the same result. There may also be a factor-of-two gap in data efficiency, meaning we have to burn twice the training data and compute to reach the same result. Combined, that is four times the compute. What we need to do is keep narrowing these gaps.

"Undercurrent": Most Chinese companies choose to have both models and applications. Why does DeepSeek currently choose to only do research and exploration?

Liang Wenfeng: Because we think the most important thing now is to participate in the wave of global innovation. In the past many years, Chinese companies have been accustomed to others making technological innovations and we take them over to monetize applications, but this is not a matter of course. In this wave, our starting point is not to take advantage of the opportunity to make a fortune, but to be at the forefront of technology and promote the development of the entire ecosystem.

"Undercurrent": The inertial perception left to most people in the era of the Internet and mobile Internet is that the United States is good at technological innovation, while China is better at applications.

Liang Wenfeng: We believe that as the economy develops, China must gradually become a contributor rather than keep free-riding. In the IT wave of the past thirty-odd years, we basically did not take part in real technological innovation. We have grown used to Moore's Law falling from the sky, to better hardware and software arriving every 18 months, and the Scaling Law is being treated the same way.

But in fact, these things were created, generation after generation, by the tireless efforts of a Western-led technology community. It is only because we were not involved in that process that we have ignored their existence.

The real gap is not one or two years, but the difference between originality and imitation.

"Undercurrent": Why did DeepSeek V2 surprise many people in Silicon Valley?

Liang Wenfeng: This is a very ordinary innovation among the many that happen in the United States every day. What surprised them is that it came from a Chinese company, joining their game as a contributor of innovation. After all, most Chinese companies are used to following rather than innovating.

"Undercurrent": But this choice is too extravagant in the Chinese context. The big model is a heavy investment game, and not all companies have the capital to only research innovation instead of considering commercialization first.

Liang Wenfeng: The cost of innovation is certainly not low, and the past inertia of copying was also tied to the conditions of the past. But now, whether you look at the size of China's economy or at the profits of big companies like ByteDance and Tencent, none of this is low by global standards. What we lack for innovation is definitely not capital; what we lack is confidence, and the knowledge of how to organize high-density talent to achieve effective innovation.

"Undercurrent": Why do Chinese companies - including large companies that have no shortage of money - so easily regard rapid commercialization as the top priority?

Liang Wenfeng: In the past thirty years, we have only emphasized making money and ignored innovation. Innovation is not entirely driven by business, but also requires curiosity and creativity. We are just bound by the inertia of the past, but it is also a phased process.

"Undercurrent": But you are a commercial organization, not a public welfare research institution. You choose to innovate and share it through open source. Where do you build a moat? For example, the innovation of the MLA architecture in May will be quickly copied by other companies, right?

Liang Wenfeng: In the face of disruptive technology, a moat built on closed source is temporary. Even OpenAI's choice to stay closed cannot stop it from being overtaken. So we anchor our value in the team: our colleagues grow through the process, accumulate a great deal of know-how, and form an organization and culture capable of innovation. That is our moat.

Open source and publishing papers actually do not lose anything. For technical personnel, being followed is a very fulfilling thing. In fact, open source is more like a cultural behavior rather than a commercial behavior. Giving is actually an extra honor. A company doing this will also have cultural appeal.

"Undercurrent": What do you think of market believers like Zhu Xiaohu?

Liang Wenfeng: Zhu Xiaohu is self-consistent, but his approach is more suitable for companies that make money quickly. If you look at the most profitable companies in the United States, they are all high-tech companies that have accumulated strength over time.

"Undercurrent": But when it comes to making large models, it is difficult to form an absolute advantage simply by taking the lead in technology. What is the bigger thing you are betting on?

Liang Wenfeng: What we see is that China's AI cannot stay a follower forever. We often say there is a gap of one or two years between China's AI and America's, but the real gap is the one between originality and imitation. If that does not change, China will always be a follower, so some exploration is unavoidable.

Nvidia's lead is not the result of one company's efforts alone but of the entire Western technology community and industry working together. They can see the next generation of technology trends and have a roadmap in hand. The development of China's AI needs an ecosystem like that. Many domestic chips fail to develop precisely because they lack a supporting technology community and have only second-hand information to go on. That is why China needs people standing at the frontier of technology.

More investment does not necessarily lead to more innovation

"Undercurrent": DeepSeek now has a kind of idealistic temperament of OpenAI's early days, and it is also open source. Will you choose to close the source later? OpenAI and Mistral have both gone through the process of going from open source to closed source.

Liang Wenfeng: We will not close the source. We think it is more important to have a strong technology ecosystem first.

"Undercurrent": Do you have any financing plans? According to media reports, Huanfang has plans to spin off DeepSeek and list it independently. AI startups in Silicon Valley will inevitably be tied to large companies in the end.

Liang Wenfeng: There is no financing plan in the short term. The problem we face has never been money, but the ban on high-end chips.

"Undercurrent": Many people believe that doing AGI and doing quantitative work are two completely different things. Quantitative work can be done quietly, but AGI may require a high-profile approach and alliances, which can increase your investment.

Liang Wenfeng: More investment does not necessarily lead to more innovation. Otherwise, large companies can take over all innovations.

"Undercurrent": You don't develop applications now, is it because you don't have the genes for operations?

Liang Wenfeng: We believe the current stage is an explosive period for technological innovation, not for applications. In the long run, we hope to form an ecosystem in which the industry uses our technology and output directly: we handle only the base models and frontier innovation, and other companies build toB and toC businesses on top of DeepSeek. If a complete upstream-downstream industry takes shape, there is no need for us to build applications ourselves. Of course, nothing stops us from building applications if necessary, but research and technological innovation will always come first.

"Undercurrent": But if you choose an API, why choose DeepSeek instead of a big company?

Liang Wenfeng: The future is likely to be a world of specialized division of labor. Base models need continuous innovation, and big companies have limits to their capabilities; they are not necessarily the best fit for that.

"Undercurrent": But can technology really make a difference? You also said that there are no absolute technological secrets.

Liang Wenfeng: Technology holds no secrets, but redoing it takes time and money. Nvidia's GPUs, in theory, contain no technical secrets and are easy to copy, but reorganizing a team and catching up with the next generation of technology takes time, so the actual moat is still very wide.

"Undercurrent": After you lowered your prices, ByteDance was the first to follow suit, which shows that they still feel a certain threat. What do you think of the new solutions for startups to compete with large companies?

Liang Wenfeng: To be honest, we don’t really care about this, we just did it by the way. Providing cloud services is not our main goal. Our goal is still to achieve AGI.

At present, we have not seen any new solutions, but the big companies have no obvious advantage. The big companies have ready-made users, but their cash flow business is also a burden for them, and they will also become the target of disruption at any time.

"Undercurrent": What do you think about the final outcome of the six large model startups other than DeepSeek?

Liang Wenfeng: Maybe 2 or 3 companies will survive. Now they are still in the stage of burning money, so those with clear self-positioning and more refined operations have a better chance of survival. Other companies may be reborn. Valuable things will not disappear, but they will change in a different way.

"Undercurrent": In the era of magic squares, the attitude towards competition is evaluated as "going one's own way", and people rarely care about horizontal comparison. What is the origin of your thinking about competition?

Liang Wenfeng: What I often think about is whether something makes society more efficient, and whether you can find a place for yourself in its chain of division of labor. As long as the end result is a more efficient society, it is valid. Much of what happens in between is only a phase, and paying too much attention to it inevitably leaves you dazzled.

A group of young people doing "unfathomable" things

"Undercurrent": Jack Clark, former policy director of OpenAI and co-founder of Anthropic, believes that DeepSeek has hired "a group of unfathomable geniuses". What kind of people made DeepSeek v2?

Liang Wenfeng: There are no mysterious geniuses. They are all recent graduates from top universities, interns in their fourth and fifth year of doctoral studies, and some young people who have graduated just a few years ago.

"Undercurrent": Many large model companies are persistent in recruiting people from overseas. Many people think that the top 50 talents in this field may not be in Chinese companies. Where do your people come from?

Liang Wenfeng: No one on the V2 team came back from overseas; they are all local. The top 50 experts may not be in China, but perhaps we can train such people ourselves.

"Undercurrent": How did this MLA innovation come about? I heard that the idea first came from the personal interest of a young researcher?

Liang Wenfeng: After summarizing some of the mainstream changes in the Attention architecture, he suddenly came up with an idea to design an alternative solution. However, it was a long process from idea to implementation. We formed a team for this and it took several months to get it working.

"Undercurrent": The birth of this divergent inspiration is closely related to the structure of your completely innovative organization. In the era of magic squares, you rarely assigned goals or tasks from top to bottom. But for AGI, a frontier exploration full of uncertainty, do you have more management actions?

Liang Wenfeng: DeepSeek is also entirely bottom-up. And we generally do not divide up work in advance; it happens naturally. Everyone comes with their own growth experience and their own ideas, so there is no need to push them. When they hit a problem during exploration, they pull others in to discuss it. That said, when an idea shows potential, we do allocate resources top-down.

"Undercurrent": I heard that DeepSeek is extremely flexible in how it deploys GPUs and people.

Liang Wenfeng: There is no cap on mobilizing GPUs or people. If someone has an idea, they can use the training cluster's GPUs at any time without approval. And because there is no hierarchy and no departmental walls, people can flexibly pull in anyone, as long as the other person is interested too.

"Undercurrent": A loose management style also depends on you selecting a group of people with strong passion drive. I heard that you are very good at recruiting people based on details, and you can select excellent people with some non-traditional evaluation indicators.

Liang Wenfeng: Our selection criteria have always been passion and curiosity, so many people will have some unique experiences, which are very interesting. Many people's desire to do research far exceeds their concern for money.

"Undercurrent": Transformer was born in Google's AI Lab, and ChatGPT was born in OpenAI. What do you think is the difference between the value of innovation generated by a large company's AI Lab and a startup's?

Liang Wenfeng: Whether it is Google Labs, OpenAI, or even the AI ​​Labs of Chinese giants, they are all very valuable. In the end, it was OpenAI that did it, which was also a historical accident.

"Undercurrent": Is innovation also largely a matter of chance? I see that the row of conference rooms in the middle of your office area has doors on both sides that can be pushed open at will. Your colleagues said that this is to leave room for chance. In the birth of transfomer, there was a story that someone who happened to pass by heard it and joined in, and finally turned it into a universal framework.

Liang Wenfeng: I think innovation is first of all a matter of belief. Why is Silicon Valley so innovative? First of all, it is daring. When ChatGPT came out, there was a general lack of confidence in frontier innovation within China; from investors to big companies, everyone felt the gap was too big and it was better to just do applications. But innovation requires confidence first, and that confidence is usually more visible in young people.

"Undercurrent": But you do not participate in financing and rarely speak out. Your social voice is definitely not as strong as those companies with active financing. How do you ensure that DeepSeek is the first choice for people who make large models?

Liang Wenfeng: Because we are doing the hardest thing. What attracts top talent most is the chance to solve the world's hardest problems. In fact, top talent is underestimated in China; because there is so little hard-core innovation at the level of society as a whole, they have had no chance to be recognized. We are doing the hardest things, and that is attractive to them.

"Undercurrent": The previous release of OpenAI did not wait for GPT5. Many people think that this means that the technology curve is obviously slowing down, and many people have begun to question the Scaling Law. What do you think?

Liang Wenfeng: We are optimistic, and the entire industry seems to be in line with expectations. OpenAI is not a god, and it is impossible for it to always be at the forefront.

"Undercurrent": How long do you think it will take to realize AGI? Before releasing DeepSeek V2, you released code generation and mathematical models, and also switched from dense models to MOE. So what are the coordinates of your AGI roadmap?

Liang Wenfeng: It may be 2 years, 5 years or 10 years, but it will be realized in our lifetime. As for the roadmap, there is no consensus even within our company. But we do bet on three directions. One is mathematics and code, the second is multimodality, and the third is natural language itself. Mathematics and code are the natural testing grounds for AGI, a bit like Go, a closed and verifiable system that may be able to achieve high intelligence through self-learning. On the other hand, multimodality and participation in learning in the real world of humans may also be necessary for AGI. We remain open to all possibilities.

"Undercurrent": What do you think the final outcome of the big model will be?

Liang Wenfeng: There will be specialized companies providing base models and basic services, and a long chain of professional division of labor. More people will build on top of that to meet society's diverse needs.

All the routines are the product of the previous generation

"Undercurrent": Over the past year, there have been many changes in China's large-scale model entrepreneurship. For example, Wang Huiwen, who was very active at the beginning of last year, withdrew midway, and the companies that joined later also began to show differentiation.

Liang Wenfeng: Wang Huiwen shouldered all the losses himself and let everyone else walk away unscathed. He made a choice that was worst for himself but good for everyone, so he is a very decent person, and I admire him for that.

"Undercurrent": Where are you focusing most of your energy now?

Liang Wenfeng: The main focus is on studying the next generation of large models. There are still many unresolved issues.

"Undercurrent": Several other large model startups insist on having both. After all, technology will not bring permanent leadership, and it is also important to seize the time window to apply technological advantages to products. Is DeepSeek bold enough to focus on model research because its model capabilities are not enough?

Liang Wenfeng: All routines are the product of the previous generation and may not be valid in the future. Using the business logic of the Internet to discuss the profit model of AI in the future is like discussing General Electric and Coca-Cola when Ma Huateng started his business. It is likely to be a case of trying to find a sword by carving a mark on the boat.

"Undercurrent": In the past, Huan Fang has had strong technical and innovative genes, and its growth has been relatively smooth. Is this the reason why you are optimistic?

Liang Wenfeng: Huanfang did, to some extent, strengthen our confidence in technology-driven innovation, but it was hardly all smooth sailing. We went through a long process of accumulation. What the outside world sees is the Huanfang of 2015 onward, but in fact we have been at this for 16 years.

"Undercurrent": Let's go back to the topic of original innovation. Now that the economy is starting to go down and capital is also entering a cold cycle, will it bring more inhibition to original innovation?

Liang Wenfeng: I don't think so. The adjustment of China's industrial structure will rely more on hard-core technological innovation. Once more people realize that the quick money of the past mostly came from the luck of the era, they will be more willing to bend down and do real innovation.

"Undercurrent": So you are also optimistic about this matter?

Liang Wenfeng: I grew up in a fifth-tier city in Guangdong in the 1980s. My father was a primary-school teacher. In the 1990s there were plenty of opportunities to make money in Guangdong, and quite a few parents came to our house back then, basically arguing that studying was useless. But looking back now, that mindset has changed, because money is harder to make; even the chance to drive a taxi may be gone. Things changed within a single generation.

There will be more and more hard-core innovation in the future. It may be hard to understand now, because society as a whole still needs to be educated by facts. When this society lets hard-core innovators succeed and become famous, the collective mindset will change. We just need a pile of facts, and a process.