Want to understand Fei-Fei Li's entrepreneurial direction? Here is a list of papers on robotics 3D

2024-08-16

Machine Heart Report

Editor: Zhang Qian

More than 80 papers charting research progress in "Robotics + 3D".

Recently, multiple media outlets reported that World Labs, a startup founded by renowned AI scholar and Stanford University professor Fei-Fei Li, had completed two rounds of financing in just three months. The latest round raised about US$100 million, pushing the company's valuation past US$1 billion and making it a new unicorn.

World Labs focuses on "spatial intelligence": developing models that can understand the three-dimensional physical world and simulate the physical properties, spatial positions, and functions of objects. Fei-Fei Li believes spatial intelligence is a key link in the development of AI. In her Stanford laboratory, her team is training computers and robots to act in the three-dimensional world, for example using a large language model to let a robotic arm perform tasks such as opening doors and making sandwiches in response to verbal commands. (For details, see "Fei-Fei Li interprets the entrepreneurial direction of 'spatial intelligence' to make AI truly understand the world".)



To explain the concept of spatial intelligence, Li showed a picture of a cat with its claws outstretched pushing a glass toward the edge of a table. In a split second, she said, the human brain can assess "the geometry of this glass, its position in three-dimensional space, its relationship to the table, the cat, and all the other things," then predict what will happen and take action to stop it.

In fact, beyond Fei-Fei Li, many research teams are now focusing on the intersection of 3D vision and robotics. These teams believe that many limitations of current AI stem from models lacking a deep understanding of the 3D world, and that completing this puzzle will require more research effort in 3D vision. Moreover, 3D vision provides depth perception and spatial understanding of the environment, which is crucial for robots to navigate, manipulate, and make decisions in a complex three-dimensional world.

So, is there any systematic research material that can be used as a reference for researchers in this field? Synced recently found one:



Project link: https://github.com/zubair-irshad/Awesome-Robotics-3D

This GitHub repository, named "Awesome-Robotics-3D", collects more than 80 papers in the field of "3D Vision + Robotics"; most entries provide links to the paper, project page, and code.



The papers can be grouped into the following topics:

  • Policy Learning
  • Pre-training
  • VLM and LLM
  • Representation
  • Simulations, Datasets, and Benchmarks

These papers include arXiv preprints as well as papers from top robotics conferences such as RSS, ICRA, IROS, and CoRL, and from top computer vision and machine learning conferences such as CVPR, ICLR, and ICML, making the list a valuable resource.

The list of papers in each section is as follows:

1. Policy Learning





2. Pre-training



3. VLM and LLM





4. Representation





5. Simulations, Datasets, and Benchmarks





In addition, the author provides two survey papers for reference:

  • Paper 1: When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models
  • Paper link: https://arxiv.org/pdf/2405.10255

Paper introduction: This paper provides a comprehensive overview of methods that enable LLMs to process, understand, and generate 3D data, highlighting unique advantages of LLMs such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, which are expected to significantly advance spatial understanding and interaction in embodied AI systems. The survey covers 3D data representations from point clouds to neural radiance fields (NeRFs) and examines their integration with LLMs for tasks such as 3D scene understanding, caption generation, question answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also briefly reviews other methods that integrate 3D and language. Through a meta-analysis of these studies, it documents the significant progress made and emphasizes the need for new methods to fully exploit the potential of 3D-LLMs.

To support this survey, the authors maintain a project page that collects and lists related papers: https://github.com/ActiveVisionLab/Awesome-LLM-3D



  • Paper 2: A Comprehensive Study of 3-D Vision-Based Robot Manipulation
  • Paper link: https://ieeexplore.ieee.org/document/9541299

Paper introduction: This paper comprehensively analyzes recent progress in 3D vision for robotic manipulation, particularly in imitating human intelligence and giving robots more flexible manipulation capabilities. It discusses the 2D vision systems that traditional robotic manipulation usually relies on and their limitations, and points out the challenges 3D vision systems face in the open world, such as general object recognition against cluttered backgrounds, occlusion estimation, and flexible human-like manipulation. The article covers key technologies including 3D data acquisition and representation, robot vision calibration, 3D object detection/recognition, 6-DOF pose estimation, grasp estimation, and motion planning. It also introduces public datasets, evaluation criteria, comparative analyses, and current challenges. Finally, it explores application areas of robotic manipulation and discusses future research directions and open problems.

Interested readers can click on the project link to start learning.