
Meta releases Sapiens vision models to enable AI to analyze and understand human actions in images and videos

2024-08-24


IT Home reported on August 24 that Meta Reality Labs has recently launched a family of AI vision models called Sapiens, designed for four fundamental human-centric vision tasks: 2D pose estimation, body part segmentation, depth estimation, and surface normal prediction.

The models range from 300 million to 2 billion parameters. They adopt a vision transformer architecture in which all tasks share the same encoder, while each task has its own decoder head.
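The shared-encoder, per-task-decoder layout can be sketched as follows. This is a toy illustration of the design, not Meta's actual implementation: the weight shapes, task names, and channel counts (e.g. 17 keypoints, 28 body-part classes) are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((48, 64))  # toy stand-in for the shared encoder

def encode(image_patches):
    """Shared encoder: map flattened image patches to a common feature space."""
    return image_patches @ W_enc

# Task-specific decoder heads: each projects the shared features differently.
HEADS = {
    "pose":    rng.standard_normal((64, 17)),  # e.g. 17 keypoint heatmap channels
    "parts":   rng.standard_normal((64, 28)),  # e.g. 28 body-part classes
    "depth":   rng.standard_normal((64, 1)),   # one depth value per patch
    "normals": rng.standard_normal((64, 3)),   # one surface-normal vector per patch
}

patches = rng.standard_normal((196, 48))       # a 14x14 grid of toy patches
features = encode(patches)                     # computed once, reused by every task
outputs = {task: features @ W for task, W in HEADS.items()}

for task, out in outputs.items():
    print(task, out.shape)
```

The point of this structure is that the expensive encoder pass runs once, and adapting the model to a new task only requires training a comparatively small decoder head.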

2D Pose Estimation: This task involves detecting and localizing key points of a human body in a 2D image. These key points usually correspond to joints such as elbows, knees, and shoulders, and help understand a person's posture and movements.
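A common way such models report keypoints is one heatmap per joint, with the predicted location at the heatmap's peak. A minimal decoding sketch, assuming that heatmap convention (the source does not specify Sapiens' exact output format):

```python
import numpy as np

def decode_keypoints(heatmaps):
    """heatmaps: (K, H, W) array, one map per keypoint -> list of (x, y) pixels."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1).argmax(axis=1)  # peak index per keypoint
    return [(int(i % W), int(i // W)) for i in flat]

# Toy example: a single "elbow" heatmap with its peak at (x=3, y=2).
hm = np.zeros((1, 5, 5))
hm[0, 2, 3] = 1.0
print(decode_keypoints(hm))  # [(3, 2)]
```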

Body Part Segmentation: This task segments an image into different body parts, such as head, torso, arms, and legs. Each pixel in the image is classified as belonging to a specific body part, which is useful for applications such as virtual try-on and medical imaging.
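Per-pixel classification usually means the decoder emits a score for every class at every pixel, and each pixel takes the highest-scoring class. A sketch of that final step, with an illustrative 4-class palette rather than Sapiens' actual label set:

```python
import numpy as np

CLASSES = ["background", "head", "torso", "arm"]  # illustrative, not the real labels

def segment(logits):
    """logits: (C, H, W) per-class scores -> (H, W) integer label map."""
    return logits.argmax(axis=0)

rng = np.random.default_rng(1)
logits = rng.standard_normal((len(CLASSES), 4, 4))  # toy decoder output
labels = segment(logits)
print(labels.shape)  # (4, 4), each entry an index into CLASSES
```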

Depth Estimation: The task is to estimate the distance of each pixel in the image from the camera, effectively recovering 3D structure from a 2D image. This is crucial for applications such as augmented reality and autonomous driving, where understanding the spatial layout is important.
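To see why a per-pixel depth map recovers 3D structure: under a pinhole camera model, each pixel plus its depth back-projects to a 3D point. A sketch with toy intrinsics (the focal length and principal point here are assumed values for illustration):

```python
import numpy as np

def backproject(depth, f=100.0, cx=2.0, cy=2.0):
    """depth: (H, W) metric depths -> (H, W, 3) camera-space 3D points."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]          # pixel row (v) and column (u) grids
    x = (u - cx) * depth / f           # standard pinhole back-projection
    y = (v - cy) * depth / f
    return np.stack([x, y, depth], axis=-1)

depth = np.full((5, 5), 2.0)           # a flat surface 2 m from the camera
points = backproject(depth)
print(points.shape)                    # (5, 5, 3)
print(points[2, 2])                    # principal-point pixel lies on the axis: [0. 0. 2.]
```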

Surface Normal Prediction: The task is to predict the orientation of surfaces in an image. Each pixel is assigned a normal vector that indicates which direction the surface is facing. This information is very valuable for 3D reconstruction and understanding the geometry of objects in the scene.
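A surface normal map is simply a unit vector per pixel. To show what it encodes, the sketch below derives normals from a depth map by finite differences; this illustrates the geometry only and is not how the network computes its learned prediction:

```python
import numpy as np

def normals_from_depth(depth):
    """depth: (H, W) -> (H, W, 3) unit normals, z pointing toward the camera."""
    dz_dv, dz_du = np.gradient(depth)          # depth slope along rows and columns
    n = np.stack([-dz_du, -dz_dv, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

depth = np.full((4, 4), 3.0)                   # a fronto-parallel plane
n = normals_from_depth(depth)
print(n[1, 1])                                 # every normal faces the camera: [0. 0. 1.]
```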

Meta said the models natively support 1K high-resolution inference and are very easy to adapt to individual tasks, simply by pre-training on more than 300 million in-the-wild human images.

Even when labeled data is scarce or entirely synthetic, the resulting models show excellent generalization to in-the-wild data.