
Urgent shortage! Universities are in urgent need of GPUs, Fei-Fei Li asks Hinton for help

2024-07-18



Smart Things
Compiled by Chen Junda
Edited by Panken

According to foreign media reports on July 18, many US universities are facing a serious shortage of computing power due to the high prices of AI computing clusters and the rush of orders from large companies. This has led to a lag in AI research in universities and the loss of AI research talent.

The computing power shortage in universities has existed for a long time, and even top universities and leading academics are troubled by it. In May this year, Stanford University professor Fei-Fei Li said that the academic community is facing a serious shortage of AI computing resources. Stanford University's NLP lab has only 64 GPUs (NVIDIA A100). When a student asked Turing Award winner Geoffrey Hinton for help, he even said: “I don’t know if there is any other way to solve this problem except asking the government for help.”

In stark contrast, Facebook parent company Meta is expected to have a huge computing cluster with computing power equivalent to 600,000 NVIDIA H100s, which is almost 10,000 times that of the Stanford NLP lab cluster.

However, even the 64 GPUs in Stanford University's NLP lab are out of reach for students at many other universities. In fact, apart from a few top universities such as Princeton University and RWTH Aachen University in Germany, many universities do not have even a single Nvidia A100 GPU.

In a related discussion on the Reddit forum, a doctoral student from a North American university said that small universities can only get the V100 GPU, which Nvidia released many years ago. The situation is even more serious in Europe and Asia, where many universities can only use Nvidia's consumer graphics cards for AI research. Even so, computing power is extremely scarce, and some students must buy graphics cards at their own expense or apply for computing subsidies from Nvidia, Amazon Web Services (AWS) and others.

Many universities are also working hard to change the status quo, such as establishing shared computing clusters through inter-school cooperation, or turning to other AI research directions that require lower computing power.

1. Shortage of computing power and loss of talent, how serious is the GPU shortage in universities?

In fact, for a long time, universities were at the forefront of AI research. Many breakthroughs were made by university researchers. For example, in 2015, Jascha Sohl-Dickstein, a postdoctoral fellow at Stanford University, invented the world's first diffusion model, which became the basis for many subsequent image and video generation models.

While basic research at universities has been critical to this wave of technological innovation, recent generative AI research has been led by private companies, largely because they have access to the computing power and data needed to build and train large models like ChatGPT and Gemini.

Generative AI research is expensive. OpenAI CEO Sam Altman once estimated that the cost of training GPT-4 was about $100 million. Meta CEO Mark Zuckerberg announced plans in early 2024 to purchase 350,000 Nvidia H100 GPUs to expand Meta's computing power to the equivalent of 600,000 Nvidia H100 GPUs. At a price of nearly $40,000 per H100, this will be a huge order worth tens of billions of dollars.
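The scale of these figures can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration using the numbers quoted in this article; it treats "nearly $40,000" as an upper bound on the unit price and compares clusters by device count alone, ignoring the performance gap between the A100 and the H100. The constant and function names are illustrative.

```python
# Figures quoted in the article; the unit price is an approximate upper bound.
H100_UNIT_PRICE_USD = 40_000
META_PLANNED_PURCHASE = 350_000   # H100 GPUs Meta planned to buy
META_TOTAL_H100_EQUIV = 600_000   # Meta's total compute in H100 equivalents
STANFORD_NLP_GPUS = 64            # A100 GPUs in the Stanford NLP lab


def order_cost_usd(units: int, unit_price_usd: int) -> int:
    """Total cost of an order at a flat unit price (no volume discount assumed)."""
    return units * unit_price_usd


purchase_cost = order_cost_usd(META_PLANNED_PURCHASE, H100_UNIT_PRICE_USD)
cluster_ratio = META_TOTAL_H100_EQUIV / STANFORD_NLP_GPUS

print(f"Estimated H100 order: ${purchase_cost / 1e9:.0f} billion")
print(f"Meta vs. Stanford NLP lab, by device count: {cluster_ratio:,.0f}x")
```

Under these assumptions the order comes to about $14 billion, and the device-count ratio is 9,375, which matches the "almost 10,000 times" comparison with the Stanford NLP lab.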

Currently, no university in the world can afford this level of AI computing infrastructure. Princeton University, a school with a strong computer science program, has one of the largest single AI computing clusters among American universities. But this cluster has only 300 Nvidia H100 GPUs and was officially introduced only in March this year.

Sanjeev Arora, director of the Center for Language and Intelligence at Princeton University, said: “If you don’t have computing power, you can’t conduct large-scale research, and you can’t even participate in the conversation.”

In a related discussion on the Reddit forum, a PhD student from one of the top five machine learning labs in the United States said that they don't have even one Nvidia H100 so far.


▲Questions from PhD students from the top five machine learning labs in the United States (Source: Reddit)

A PhD student from Asia faces the same dilemma. Most of the GPUs he uses are consumer grade, and he has only one or two of them rather than a cluster. His school only recently got a server with 8 H100s, and it can be accessed only for limited periods. The PhD student said that in the two weeks he was lucky enough to train on the H100 GPUs, he obtained more data than he had collected in the previous six months.


▲A student doing CV research in Asia recalls a series of GPUs he has used (Source: Reddit)

Another student shared that his school could not provide any computing power support. He could only obtain a $1,000 AWS cloud computing credit through his internship company. Used to run a cluster of 8 H100s, this amount would last only about one day, and this level of computing power is simply not enough to produce high-quality research. He also said that this is the norm for AI research in third-world countries.
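The student's estimate can be turned into a rough hourly rate. The sketch below is an illustration only: it derives the per-GPU price implied by "$1,000 for 8 H100s for about 1 day" and then shows how long the same budget would last at a hypothetical flat rate of $5 per GPU-hour. That rate is an assumption chosen for the example, not a quoted AWS price, and the function names are illustrative.

```python
def implied_rate_per_gpu_hour(budget_usd: float, num_gpus: int, hours: float) -> float:
    """Price per GPU-hour implied by spending a whole budget on a cluster for `hours`."""
    return budget_usd / (num_gpus * hours)


def runtime_hours(budget_usd: float, num_gpus: int, rate_per_gpu_hour: float) -> float:
    """How long a budget lasts on `num_gpus` GPUs at a flat per-GPU hourly rate."""
    return budget_usd / (num_gpus * rate_per_gpu_hour)


# Figures from the student's post: $1,000 of credit, 8 H100s, about 1 day.
rate = implied_rate_per_gpu_hour(budget_usd=1_000, num_gpus=8, hours=24)

# Hypothetical flat rate of $5 per GPU-hour (an assumption, not a real price list).
hours_at_5 = runtime_hours(budget_usd=1_000, num_gpus=8, rate_per_gpu_hour=5.0)

print(f"Implied rate: ${rate:.2f} per GPU-hour")
print(f"At $5 per GPU-hour, $1,000 lasts {hours_at_5:.0f} hours on 8 GPUs")
```

Under these assumptions, the quoted figures imply a little over $5 per GPU-hour, so the whole credit buys roughly a single day on one 8-GPU server.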


▲A master's student shared his experience of obtaining computing credits through an internship company (Source: Reddit)

The computing resource situation at European universities is not encouraging either. A student studying in Germany shared that he was very lucky because his school could provide 16 A100 GPUs and dozens of GPUs of other models. In Europe, many universities and research laboratories provide little computing power support.


▲A European student is grateful for the computing resources he has (Source: Reddit)

Another student from RWTH Aachen University in Germany shared that his school has more than 200 NVIDIA H100 GPUs, which has attracted the envy of many netizens. However, these resources are shared by all colleges and also with external institutions. If a longer computing time is required, a special application is required.


▲Students from RWTH Aachen University in Germany share the school’s computing power (Source: Reddit)

People from the industry were surprised by the GPU shortage in colleges and universities. One industry insider said that he works for a major cloud computing provider, where he routinely works with H100 GPUs, developing and fixing software for them. Another industry insider said that cutting-edge GPUs in high demand, such as the H100, are usually pre-ordered in large quantities by large corporate customers before being added to data centers, so the H100 is "rare" for most researchers.


▲Industry insiders are surprised by the shortage of GPUs in universities (Source: Reddit)

When computing resources are insufficient, long training runs become extremely costly. Access to university AI computing clusters often has to be requested days or even weeks in advance, and once granted, usage time is limited. Many larger training tasks cannot be completed within one usage window, so researchers must spend extra effort writing checkpoint and recovery code.
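To make the checkpoint-and-recovery burden concrete, here is a minimal sketch of the pattern using only the Python standard library. It is not any particular lab's setup: the training step is a placeholder calculation, the state is a small dictionary saved as JSON, and all names (`save_checkpoint`, `load_checkpoint`, `train`, `max_steps_this_session`) are hypothetical. A real job would save model and optimizer state with its framework's own tools.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file, then rename it, so a killed job never
    leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic replacement of the old checkpoint


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, max_steps_this_session: int,
          checkpoint_every: int = 10) -> dict:
    """Run until the job finishes or the session's step budget runs out."""
    state = load_checkpoint(path)
    steps_run = 0
    while state["step"] < total_steps and steps_run < max_steps_this_session:
        # Placeholder for a real training step.
        state["loss_sum"] += 1.0 / (state["step"] + 1)
        state["step"] += 1
        steps_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # always save before the session ends
    return state


# Two limited sessions on the same checkpoint file complete one 100-step job.
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = train(ckpt, total_steps=100, max_steps_this_session=35)
second = train(ckpt, total_steps=100, max_steps_this_session=1000)
print(first["step"], second["step"])
```

The first session stops when its time allowance is used up, and the second picks up from the saved step instead of starting over, which is the extra engineering that time-limited cluster access forces on researchers.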

The shortage of computing resources has also led to a loss of talent from universities. Students who are interested in generative AI research have turned to large companies, because large technology companies generally have hundreds or thousands of times more computing power than universities, which is extremely attractive to AI talent.

2. Establishing a computing power alliance and changing research directions, universities are unwilling, and cannot afford, to fall behind

Faced with the crisis of lagging behind in AI research and the loss of AI talent, many universities are striving for additional computing power and shifting their research focus to non-computing-intensive AI research areas.

“Academic institutions are scrambling to get computing power,” said Hod Lipson, chairman of Columbia University’s mechanical engineering department. He also stressed that while it is important for industry and government to participate in AI research, others, such as academia and open source developers, should also have a say in the development of this technology in order to balance these two forces.

In order to alleviate the shortage of computing power in colleges and universities, many colleges and universities have involved the government in the construction of computing power clusters. In early 2024, seven universities and research institutions, including Columbia University, Cornell University, New York University and Rensselaer Polytechnic Institute, jointly with the New York State government and charities, created a computing power alliance called Empire AI.


▲Empire AI alliance members (Source: Empire AI official website)

The computing alliance has raised nearly $400 million in funding, of which $275 million came from the government, and the rest came from the seven universities and research institutions participating in the alliance. They will use this money to build an advanced AI computing center, and the alliance members can share these computing resources, while also effectively sharing the holding costs.

Explaining the reasons for establishing this alliance, the New York Governor's Office stated that AI computing resources are increasingly concentrated in the hands of large technology companies, which have enormous control over the AI development ecosystem. As a result, researchers, nonprofits, and small companies are being left behind. This has huge implications for AI safety and society as a whole.

Academia and industry are also actively collaborating, which is already common in Silicon Valley, Seattle, Austin and other American technology centers. Dan Grossman, associate dean of the School of Computer Science and Engineering at the University of Washington, said that they have some programs that allow academic researchers to work in industry. Academic staff can get better resources, and universities can also retain these talents.

In fact, many important areas of AI research do not require high computing power, such as research on AI explainability and on AI planning and reasoning abilities. Constrained by limited computing power, university researchers have begun to do more targeted research to ensure that academia is not completely overtaken by industry.

Kavita Bala, dean of the School of Computing and Information Sciences at Cornell University, said that universities can reduce their investment in building and training large language models and focus more on developing applications based on large language models. Such applications can still be cutting-edge and play a huge role in unique application areas.

Armando Solar-Lezama, a professor at MIT whose work focuses on code development using AI, believes that building large models from scratch is simply not feasible in academia. Students and researchers can focus on developing applications or even creating synthetic data that can be used to train large language models.

Solar-Lezama said professors at his college have also taken the initiative to contribute money to buy servers and chips, but funding is not the only problem. Even with the money, it is difficult to get a top-notch GPU.

Conclusion: AI computing power shortage in universities continues, but multi-party cooperation may offer hope for a breakthrough

In the current situation where large technology companies dominate AI research, AI research in universities is an effective supplement to these studies. Unlike researchers in companies, university researchers are not affected by short-term factors such as financial reports and market demand. If they can obtain more computing resources, they may be able to produce results with significant impact in areas that companies will not or are unwilling to pay attention to.

In fact, for much of the past few decades, AI was an unpopular research field and had to be presented under names such as deep learning and machine learning. However, it is precisely because university researchers such as Hinton, Yann LeCun and Yoshua Bengio persisted in this research for decades that the current AI boom has a foundation to build on.

In addition to computing alliances such as Empire AI in New York State, many universities and research institutions in North America have also carried out cross-institutional cooperation of varying sizes to share computing resources. At the end of 2023, more than a dozen Chinese universities also established the China University Computing Alliance. Perhaps this kind of cooperation can offer hope of easing the computing power shortage in universities.

Source: The Wall Street Journal, Reddit