
"AI Godmother" Fei-Fei Li: Sora is still a two-dimensional image, only three-dimensional spatial intelligence can achieve AGI|Titanium Media

2024-08-03



Titanium Media App reported on August 2 that at a closed-door meeting of the Asian American Scholar Forum held at Stanford University, Fei-Fei Li, the Stanford professor known as the "godmother of AI," told Titanium Media App exclusively that although OpenAI's Sora model can generate videos, it is in essence still a flat, two-dimensional model with no ability to understand three-dimensional space. "Spatial intelligence," she said, is the future direction of AGI.

Fei-Fei Li was responding to a question from Hejuan Zhao, founder of Titanium Media, about the relationship between "spatial intelligence" models and large language models. She explained that most current models, such as GPT-4o and Gemini 1.5, are still language models: language in, language out. Even the multimodal ones remain confined to language, and any video they handle is built from two-dimensional flat images. A key link in achieving AGI, she argued, is "spatial intelligence," which requires a three-dimensional visual model.

She cited Sora's AI-generated video of a Japanese woman walking through the neon-lit streets of Tokyo as an example.


"If you want the algorithm to change the angle and show the video of the woman walking on the street, such as putting the camera behind the woman, Sora cannot do it because the model does not have a real deep understanding of the three-dimensional world. Humans can imagine the scene behind the woman in their minds," said Fei-Fei Li. "Humans can understand how to act in complex environments. We know how to grasp, how to control, how to make tools, and how to build cities. Fundamentally, spatial intelligence is geometry, the relationship between objects, and three-dimensional space. Spatial intelligence is about releasing the ability to generate (visual maps) and reason and plan actions in three-dimensional space. Its applications are wide-ranging, such as for AR and VR, for robots, and the design of apps also requires spatial intelligence."

Fei-Fei Li emphasized to Titanium Media App: "Natural evolution enables animals to understand the three-dimensional world, and to live, predict, and interact in three-dimensional space. This ability has a 540-million-year history. When the trilobite first saw light in the water, it had to 'navigate' the three-dimensional world; if it could not, it would soon become a feast for other animals. As evolution progressed, animals' spatial intelligence was strengthened. We understand shapes, we understand depth."

Fei-Fei Li, 48, is a renowned computer scientist, a member of the US National Academy of Engineering and National Academy of Medicine, and director of the Human-Centered AI Institute at Stanford University. In 2009 she led the creation of the ImageNet image database and its visual recognition competition, which labeled and classified images at massive scale, advanced computer vision recognition, and became one of the key drivers of AI's rapid development. Last year, she announced VoxPoser, which became a key technical direction in the development of embodied AI.

In July this year, World Labs, the AI company founded by Fei-Fei Li, announced the completion of two rounds of financing, with investors including a16z (Andreessen Horowitz); the company's latest valuation has reached US$1 billion (approximately RMB 7.26 billion).

At the closed-door meeting of the Asian American Scholar Forum at the end of July, Fei-Fei Li's speech also helped more people understand World Labs and her "spatial intelligence" development concept: making AI go truly "from seeing to doing."

How to get from "seeing" to "doing"

The so-called "spatial intelligence" refers to the ability of people or machines to perceive, understand and interact in three-dimensional space.

The concept was first proposed by American psychologist Howard Gardner in his theory of multiple intelligences: the capacity to form a model of the external spatial world in the mind and to use and operate on that model. In practice, spatial intelligence enables people to think in three dimensions, perceive external and internal images, and reproduce, transform, or modify those images, so that they can move freely through space, manipulate the positions of objects at will, and generate or interpret graphic information.

In a broad sense, spatial intelligence includes not only the ability to perceive spatial orientation, but also visual discrimination and visual thinking. For machines, spatial intelligence refers to the ability to process visual data in three-dimensional space, make accurate predictions, and take actions based on those predictions. This capability lets machines navigate, operate, and make decisions in a complex three-dimensional world as humans do, transcending the limitations of traditional two-dimensional vision.

In a TED talk in April this year, Fei-Fei Li argued that visual ability triggered the Cambrian explosion, and that the evolution of the nervous system brought intelligence. "We want not only AI that can see and speak, but also AI that can do."

In Fei-Fei Li's view, spatial intelligence is "the key to solving AI technology problems."

At the closed-door event at the end of July, Fei-Fei Li first reviewed the three major driving forces behind the modern AI wave that began 10 years ago: algorithms built on "neural networks," namely "deep learning"; modern chips, chiefly Nvidia GPUs; and big data.

Since 2009, the field of computer vision has grown explosively. Machines can rapidly recognize objects with performance comparable to humans. But that is just the tip of the iceberg: computer vision can not only recognize stationary objects and track moving ones, but also segment objects into parts and even understand the relationships between objects. Built on big image data, the field has made great progress.

Fei-Fei Li clearly remembers that about 10 years ago, her student Andrej Karpathy worked with her on an image-captioning algorithm: they showed the computer a picture, and through a neural network the computer could output natural language such as "This is a cat lying on the bed."

“I remember telling Andrej, let’s reverse it. For example, give a sentence and let the computer give a picture. We all laughed and thought that it might never be realized, or it would be realized in the very distant future,” Fei-Fei Li recalled.
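Today, the forward direction she describes, picture in, sentence out, takes only a few lines of code. As a point of reference (not the system Karpathy and Li built), here is a minimal captioning sketch using the open-source BLIP model from Hugging Face's transformers library; the image file name is illustrative:

```python
# Minimal image-captioning sketch using the open-source BLIP model.
# Illustrates the "picture in, natural language out" direction described
# above; this is NOT the original Stanford system.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("cat_on_bed.jpg")               # hypothetical input photo
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)  # decode a short caption
print(processor.decode(out[0], skip_special_tokens=True))
# e.g. "a cat lying on a bed"
```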

Generative AI has developed rapidly over the past two years. A few months ago, in particular, OpenAI released the video-generation model Sora. She showed a similar product developed by her students at Google that was of very good quality, existed a few months before Sora's release, and used far less GPU (graphics processing unit) compute than Sora. The question is: where will AI go next?

"For many years, I have said that 'seeing' is 'understanding the world'. But I would like to push this concept a step further. 'Seeing' is not just for understanding, but for doing. Nature created animals with perception like us, but in fact, such animals have existed for 450 million years. Because this is a necessary condition for evolution: seeing and doing are a closed loop," said Fei-Fei Li.

She used her favorite cat as an example.


Consider a picture of a cat, a glass of milk, and a plant on a table. When you see this picture, you actually see a three-dimensional video in your mind: you see the shapes, you see the geometry.

In fact, you see what happened a few seconds ago and what might happen a few seconds later. You see the three-dimensional space behind this photo. You plan what to do next. Your brain is working, calculating what to do to save your rug, especially since the cat is yours and the rug is yours.

"I call all this spatial intelligence, which is modeling the three-dimensional world and reasoning about objects, places, events, etc. in three-dimensional space and time. In this example, I'm talking about the real world, but it can also refer to the virtual world. But the bottom line of spatial intelligence is to connect "seeing" and "doing". One day, AI will be able to do this," said Fei-Fei Li.

Next, Fei-Fei Li demonstrated a 3D video reconstructed from multiple photos, and then one reconstructed from a single photo. Such techniques can be used in design.
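The geometric core of reconstructing 3D structure from multiple photos is multi-view triangulation. As a minimal, self-contained illustration (classical linear triangulation, far simpler than the learned reconstruction she demonstrated), the sketch below recovers a 3D point from its projections in two calibrated views; the camera parameters and the point are synthetic values made up for the demo:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.
    P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) pixel coordinates."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # solve A @ X = 0 in the least-squares sense
    X = Vt[-1]
    return X[:3] / X[3]           # homogeneous -> Euclidean coordinates

# Two synthetic cameras: one at the origin, one shifted 1 unit along x.
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=float)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.3, -0.2, 5.0, 1.0])   # ground-truth 3D point
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]      # its projection in view 1
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]      # its projection in view 2
print(triangulate(P1, P2, x1, x2))         # -> approx. [0.3, -0.2, 5.0]
```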

Fei-Fei Li said that embodied intelligent AI or humanoid robots can form a closed loop of "seeing" and "doing".

She said that colleagues at Stanford University and chip giant Nvidia are jointly conducting a study called BEHAVIOR, which builds a benchmark of household activities in dynamic spaces to evaluate how various robots perform in home environments. "We are studying how to connect language models with large visual models so that robots can make plans and begin to act," she said. She gave three examples, all instructed in human natural language: a robot opening a drawer, a robot unplugging a mobile phone charging cable, and a robot making a sandwich.
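To make that division of labor concrete, here is a hypothetical sketch of the "seeing, then doing" loop she describes: a vision model grounds the scene, a language model turns an instruction into a plan, and the robot executes primitives. The interfaces and primitives are stand-ins invented for illustration, not the BEHAVIOR benchmark's actual code:

```python
# Hypothetical "language model + vision model -> robot actions" loop.
# All model calls and robot primitives are stand-ins for illustration only.
from dataclasses import dataclass

@dataclass
class Observation:
    objects: list[str]   # objects the vision model grounded in the scene

def vision_model(camera_image) -> Observation:
    # Stand-in for a large visual model that perceives the 3D scene.
    return Observation(objects=["drawer", "phone", "charging_cable"])

def language_model(instruction: str, scene: Observation) -> list[str]:
    # Stand-in for an LLM that turns an instruction plus the grounded
    # scene into a sequence of executable robot primitives.
    if "unplug" in instruction:
        return ["locate(charging_cable)", "grasp(charging_cable)", "pull(charging_cable)"]
    return ["locate(drawer)", "grasp(drawer_handle)", "pull(drawer_handle)"]

def execute(primitive: str) -> None:
    print(f"robot executing: {primitive}")   # real system: motor commands

scene = vision_model(camera_image=None)      # perception ("seeing")
for step in language_model("unplug the phone charging cable", scene):
    execute(step)                            # action ("doing")
```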

Finally, she offered a vision of a future that belongs to "spatial intelligence": a person sits there wearing an EEG cap with sensors and, without speaking, tells a remote robot by thought alone, "Make a Japanese meal." The robot receives the signal, decodes the thought, and can prepare the full meal.

“When we connect ‘seeing’ and ‘doing’ through spatial intelligence, we can do it,” she said.

Fei-Fei Li also said that she has witnessed the exciting development of AI over the past 20 years. But she believes the key link in AI, or AGI, is spatial intelligence: through it, robots can see the world, perceive the world, understand the world, and act in it, forming a virtuous closed loop.

Will robots take over from humans?

Fei-Fei Li said at the conference that people today exaggerate what AI will be able to do. She warned against confusing ambitious, courageous goals with reality, an argument people hear far too much of.

In fact, AI has reached a turning point, especially large language models. "However, it is still a limited technology, full of errors, that requires deep human involvement and an understanding of its limitations. A very dangerous argument now is the so-called risk of human extinction, that is, that AI is becoming humanity's machine master. I think this is very harmful to society, and such remarks will bring many unintended consequences. The limitations of AI are not fully understood by humans. We need thoughtful, balanced, and unbiased communication and education about AI," Fei-Fei Li emphasized.

Fei-Fei Li believes that AI should be rooted in humans: humans created it, humans are developing it, humans are using it, and humans should manage it.

Fei-Fei Li said that Stanford University's Human-Centered AI Institute approaches AI at three levels, individual, community, and societal:

  • At the individual level, we must participate in and embrace AI. This is a civilizational technology. AI changes how children learn, how doctors diagnose, how artists design, and how teachers teach. Whether you are a technologist or not, you can play your part and use AI responsibly.
  • At the community level, AI can empower communities and meet their environmental protection or agricultural needs. Some agricultural communities use machine learning technology to detect community water quality. Artist communities are not only using AI, but also expressing their concerns and ideas on how to solve problems and mitigate risks.
  • At the societal level, governments, research institutes, businesses, federal agencies, and international institutions should all take this technology seriously. There are energy issues, which affect geopolitics. There is still a big debate over open source versus closed source, which affects the economy and the ecosystem. There are also governance issues, such as the risks and safety of AI. A proactive, multi-party, whole-of-society approach must be taken; there is no turning back now, said Fei-Fei Li, who led the AI project at Google from 2017 to 2018, served on the board of Twitter from 2020 to 2022, and is currently an AI advisor to the White House.

Fei-Fei Li shared her views on the impact of AI on work.

Fei-Fei Li pointed out that the Human-Centered AI Institute at Stanford has a Digital Economy Lab, led by Professor Erik Brynjolfsson. The impact of AI on work is a very complex, many-sided issue. She particularly emphasized that "job" and "task" are two different concepts, because in reality everyone's job consists of multiple tasks.

She used American nurses as an example: it is estimated that during an 8-hour shift, a nurse performs hundreds of tasks. So when people discuss AI taking over or replacing human work, they must distinguish whether it is replacing tasks or entire jobs.

Fei-Fei Li believes that AI changes multiple tasks within a job, and will therefore gradually change the nature of the job itself. In call centers, AI improved the work quality of novices by 30%, but did not improve that of skilled workers. An article from Stanford's Digital Economy Lab echoes her point; its title: "AI will not replace managers' jobs: managers who use AI are replacing managers who don't use AI."

Fei-Fei Li stressed that technology brings productivity gains, but productivity gains do not automatically translate into shared prosperity across society. This, she pointed out, has happened many times in history.

(This article was first published on Titanium Media App, author: Chelsea_Sun, editor: Lin Zhijia)