
The PyTorch team released its first technical roadmap, with nearly 100 pages of documents disclosing the development direction for the second half of 2024

2024-07-15



New Intelligence Report

Editor: Qiao Yang

[New Intelligence Introduction] Recently, the PyTorch team published its development roadmap for the first time. Adapted directly from internal technical documents, it discloses the next development directions for this classic open source library.

If you develop AI applications in Python, PyTorch is surely an old friend. In 2017, Meta AI (then Facebook AI Research) released this open source machine learning and deep learning library, which is now in its seventh year.

According to AssemblyAI's 2021 statistics, the top 30 most popular models on HuggingFace all run on PyTorch, and 92% of them are PyTorch-exclusive, a share far beyond the reach of competitors including TensorFlow.


On July 10, PyTorch's engineering team publicly released its roadmap documents for the first time, outlining the development direction for the second half of 2024.

Soumith Chintala, co-creator of PyTorch and leader of the PyTorch team at Meta, announced the news on Twitter.

He said he hoped to make the engineers' motivations and goals public.

“While all PyTorch development is public on GitHub, the actual planning and roadmap documents written by the teams at the various PyTorch affiliates are not public, so we decided to make this change to provide more transparency.”


Gott Brath, technical program manager of the PyTorch team, also made a similar statement in the forum.


We've been thinking about how to share a roadmap for the work the team is doing on PyTorch. We plan semi-annually, so these are some public versions of our 2024 H2 OSS plans for several key areas in PyTorch.

These files are essentially the PyTorch team's internal documents and work plans. After some content was redacted, they were released as a roadmap covering the following areas of PyTorch:

- Core libraries and core performance

- Distributed

- torchtune, TorchRec, TorchVision

- PyTorch Edge

- Data Loading

- Compiler core and deployment

- Developer Infrastructure

Each document contains at least three sections, organized around the OKR concept:

- Background

- Top 5 focus areas and goals: objectives, key results, known or unknown risks and corresponding mitigations (one page maximum)

- Top 3-5 areas for improving engineering quality: BE Pillar classification, goals, metrics/status/specific targets, known or unknown risks and mitigations, impact/cost, priority/confidence level (one page maximum)

The BE (Better Engineering) Pillars can be seen as five maxims Meta gives its development teams:

Better Code, Better Doc, Empowering teams, Modern Code, Better Architecture

One wonders whether the "one page maximum" rule has troubled developers who worry about document length; after all, quality matters more than length. Condensing numerous development requirements into a single page not only saves colleagues' time but also tests the writer's skill.

The documents also reveal some of the Meta development team's good practices, such as an emphasis on collaboration between module teams, API integration and co-development with external partners, and interaction with open source communities and developers.

When launching a new code base such as ExecuTorch, or trying to expand the influence of the PyTorch compiler, the team generally takes two approaches: pushing performance hard to reach the SOTA goal head-on, and starting from deep integrations to provide more out-of-the-box use cases.

Perhaps these are the keys to Meta's success in the open source field over the years.

The following are some excerpts and summaries of the contents of each document.


Original address: https://dev-discuss.pytorch.org/t/meta-pytorch-team-2024-h2-roadmaps/2226

Core libraries and core performance

The core libraries covered in the document include TensorDict, torchao, NN, TorchRL, and others.

In terms of performance, the PyTorch team has set the goal of achieving SOTA performance in model training and inference. Measures include introducing architectural optimization techniques and high-performance kernels that combine with the rest of the PyTorch stack.

The past year has witnessed the rapid development of GenAI, and many external libraries have emerged to support research and development, but many of them do not depend directly on PyTorch, which threatens PyTorch's dominance in research.

To catch up, PyTorch will provide support for common development techniques such as quantization, sparsification, MoE, and low-precision training, including building blocks and APIs (mainly integrated in torchao) to help improve the performance of Transformer-architecture models.

The torchao library lets researchers customize high-performance dtypes, layouts, and optimization techniques within the PyTorch framework, covering training, inference, and tuning scenarios.
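As a flavor of what torchao provides today, here is a minimal sketch of post-training weight-only quantization using its `quantize_` API. This is an illustrative example, not from the roadmap itself; import paths follow the torchao repository as of 2024 and may differ across versions.

```python
# Minimal sketch: int8 weight-only quantization with torchao.
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).eval()

# Swap Linear weights for int8 weight-only quantized versions in place.
quantize_(model, int8_weight_only())

x = torch.randn(8, 1024)
with torch.no_grad():
    y = model(x)  # runs with quantized weights
```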

In addition, updates to the core library will include the following:

- The automatic optimization library torchao has achieved breakthrough success. The next step is to improve its code organization and separate the numerical operations from the core library.

- Fix TensorDict's core modularity, support serialized load/store, and make it run 2x faster in eager mode (a short usage sketch follows this list)

- Building on the success of memory-mapped loading in the first half of the year, continue to improve the performance and safety of model loading/storing

- Reduce TorchRL overhead by 50%

- Add core support for NoGIL (free-threaded Python)

- Fix user-reported issues where TORCH_ environment variables do not take effect
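For context on the TensorDict item above, here is a minimal sketch of the tensordict library's core abstraction: a batched dictionary of tensors. The calls shown are the library's basic documented API; the shapes are illustrative.

```python
# Minimal sketch of tensordict (pip install tensordict).
import torch
from tensordict import TensorDict

td = TensorDict(
    {"obs": torch.randn(32, 4), "reward": torch.zeros(32, 1)},
    batch_size=[32],
)

# Batch-level operations apply to every entry at once.
sample = td[:8]             # slice along the batch dimension
td_cpu = td.to("cpu")       # move all tensors in one call
print(sample["obs"].shape)  # torch.Size([8, 4])
```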

The document also mentions deprecating the nn.Transformer module; a series of tutorials and examples will be released showing how to build Transformers from modules such as torch.compile, sdpa, NJT, FlexAttention, custom_op, and torchao.
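To illustrate the "build your own Transformer" direction, here is a hedged sketch of a single attention block written against F.scaled_dot_product_attention (sdpa) and wrapped with torch.compile. The module structure and hyperparameters are illustrative, not the tutorials' actual code.

```python
# Sketch: a causal self-attention block built on sdpa, then compiled.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, n_heads, t, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        # sdpa dispatches to FlashAttention/efficient kernels where available.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))

block = torch.compile(SelfAttention(dim=256, n_heads=8))
y = block(torch.randn(2, 128, 256))
```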

Distributed

LLM pre-training usually spans dozens or even thousands of GPUs, and as model parameter counts grow, inference and fine-tuning are also becoming difficult to complete on a single GPU.

Therefore, PyTorch's next steps in "distributed" comprehensively cover training, inference, and fine-tuning: ultra-large-scale distributed training, memory-efficient fine-tuning, and multi-host distributed inference.

Training

The parallel modes natively supported by PyTorch mainly include the following:

- Fully sharded data parallel (FSDP)

- Hybrid sharded data parallel (HSDP)

- Tensor parallel (TP)

- Pipeline parallel (PP)

- Sequence parallel (SP)

- Context parallel (CP)

PyTorch hopes to further modularize various parallel methods in TorchTitan so that developers can freely combine them to achieve N-dimensional parallelism as needed.
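A sketch of the mechanism such composition builds on: PyTorch's DeviceMesh API, which arranges ranks into an N-dimensional grid whose sub-meshes can be handed to different parallelisms. This must run under torchrun with matching world size; the "dp"/"tp" names and the 2x4 layout are illustrative, not TorchTitan's defaults.

```python
# Sketch: a 2-D device mesh combining data parallelism and tensor parallelism.
# Launch with: torchrun --nproc-per-node=8 this_script.py
from torch.distributed.device_mesh import init_device_mesh

# 8 GPUs arranged as 2 data-parallel replicas x 4 tensor-parallel shards.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

dp_mesh = mesh_2d["dp"]  # sub-mesh to hand to FSDP
tp_mesh = mesh_2d["tp"]  # sub-mesh to hand to parallelize_module for TP
```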


The document specifically mentions that support needs to be added for two emerging directions, MoE and multimodality, for example expert parallelism and routing-algorithm optimizations.

In addition to the updates to TorchTitan itself, the distributed team also needs to work closely with the compiler team to better integrate with the torch.compile module to bring additional performance improvements to large-scale distributed scenarios.

Fine-tuning and inference

Fine-tuning: work with torchtune to bring the FSDP2 LoRA/QLoRA schemes into use, and support NF4 quantization of the model state dict

Inference: PP and DP have become the core of the distributed APIs; the next focus is torchtitan-based distributed inference, supporting large-model PP + asynchronous TP, with case studies to follow.

The document also mentions that HuggingFace's inference API will be migrated from PiPPy to PyTorch (to be completed by HuggingFace).

torchtune, TorchRec, TorchVision

torchtune

torchtune was launched to help users fine-tune LLMs more conveniently; it is also the official solution for fine-tuning the Llama models.

torchtune defines "fine-tuning" very broadly, covering three main scenarios:

- Model adaptation to specific domain datasets or downstream tasks

- Reward and preference modeling, such as RLHF, DPO, etc.

- Training processes such as distillation and quantization

Updates in the second half of the year will support fine-tuning for agent workflows, with a focus on improving fine-tuning performance.

The team will collaborate with compile, core, and distributed modules to provide efficient fine-tuning and establish representative fine-tuning performance benchmarks within the PyTorch ecosystem.

Since torchtune is also a relatively new open source library, interaction with the open source community is also essential.

The document proposes ways to improve user understanding, such as publishing blog posts and tutorials and holding technical workshops; it will also define quantitative metrics to measure torchtune's contribution to the LLM ecosystem.

In addition to the open source community, torchtune will also integrate with at least one partner and participate in its community to promote adoption.

TorchVision

TorchVision is the clear leader in the CV field and its technology is relatively mature, so the roadmap proposes few updates.

The team will continue to work on preprocessing: supporting more formats (such as WebP and HEIC) and platforms (such as CUDA) in image encoding/decoding, and improving JPEG encode/decode performance on the GPU.
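GPU-side decoding is already available in current torchvision; here is a minimal sketch. The device argument to decode_jpeg requires a CUDA build of torchvision (it dispatches to nvJPEG), and "example.jpg" is a placeholder path.

```python
# Sketch: decoding a JPEG directly on the GPU with torchvision.
import torch
from torchvision.io import read_file, decode_jpeg

data = read_file("example.jpg")          # raw bytes as a uint8 tensor
img = decode_jpeg(data, device="cuda")   # decoded on the GPU via nvJPEG
print(img.shape, img.device)             # (C, H, W) tensor on cuda:0
```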

TorchRec

TorchRec aims to provide sparsity and parallelism primitives commonly used in large-scale recommendation systems, and will release its first stable version, TorchRec 1.0, in the fall.

Edge

Currently, the open source library ExecuTorch has shipped an Alpha release; it relies mainly on torch.compile and torch.export to support model analysis, debugging, and inference on mobile and edge devices (such as AR/VR headsets and wearables).
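A hedged sketch of the export path ExecuTorch is built around: capture a model with torch.export, lower it to the edge dialect, and serialize a .pte program. Import paths follow the ExecuTorch alpha documentation and may change; the model and file name are placeholders.

```python
# Sketch: lowering a toy model to an ExecuTorch program.
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x) * 2

exported = torch.export.export(TinyModel(), (torch.randn(4),))
edge = to_edge(exported)         # edge dialect, still a graph program
program = edge.to_executorch()   # runtime-ready program

with open("tiny_model.pte", "wb") as f:
    f.write(program.buffer)      # load this file from the on-device runtime
```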

In the second half of the year, the Edge team will ship the Beta version of ExecuTorch and provide solutions within the PyTorch ecosystem for Meta's Llama-series models and other open source models.

The key goals mainly cover two directions. The first is to provide basic functions and reliable infrastructure for on-device AI, including:

- Ensure API stability for C++ and Python

- Implement a set of core functions: model compression, management of delegate cache placement, and separation of data and program

The second is to nurture this nascent code base: cultivate influence within the open source community and maintain good working relationships with companies such as Arm, Apple, and Qualcomm.

The community-influence goal is even quantified: the repository should reach 3k stars and 500 forks on GitHub. Interested readers can watch whether the team completes this OKR by the end of the year.

Data loading

In recent years, the HuggingFace datasets library, built on the Apache Arrow format, has risen rapidly thanks to high-speed loading and storage without memory limits, and seems to have stolen the limelight from PyTorch's own data-loading functionality.

The data loading document opens with an ambitious goal: make the TorchData library great again and re-establish PyTorch's dominance in data loading.

To achieve this, the relevant functionality needs to be flexible, scalable, high-performance, and memory-efficient, while remaining easy to use and supporting multimodal training at various scales.

The specific update goals include the following aspects:

- DataLoader's feature development and interfaces will follow a GitHub-first principle; DataPipes and DataLoader v2 will be gradually deprecated and removed

- Ensure clear boundaries and good interoperability between torchtune, TorchTitan, HuggingFace, and TorchData, and support multi-dataset and multimodal data loading

- Get HuggingFace to adopt the StatefulDataLoader API, ensure compatibility, and keep samples and test cases updated in a timely manner
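The StatefulDataLoader mentioned in the last item is torchdata's checkpointable drop-in for DataLoader. Here is a minimal sketch of the checkpoint/resume pattern; method names follow the torchdata documentation, and the toy dataset is illustrative.

```python
# Sketch: mid-epoch checkpoint and resume with torchdata's StatefulDataLoader.
from torchdata.stateful_dataloader import StatefulDataLoader

dataset = list(range(100))
loader = StatefulDataLoader(dataset, batch_size=10, num_workers=2)

it = iter(loader)
_ = next(it)                 # consume one batch
state = loader.state_dict()  # capture mid-epoch position

# Later (e.g., after a preemption): a fresh loader resumes where we left off.
resumed = StatefulDataLoader(dataset, batch_size=10, num_workers=2)
resumed.load_state_dict(state)
for batch in resumed:
    pass  # iteration continues from the second batch
```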

Compiler core and deployment

After years of development, the core functionality of PyTorch's compiler has matured. What remains to be filled in is deeper integration and more optimization support in the LLM and GenAI fields.

The roadmap proposes to bring the torch.compile() function to every stage of the LLM and GenAI lifecycle (inference, fine-tuning, pre-training), so that important models ship with native PyTorch compilation support on release.

To achieve this goal, the document proposes many specific measures, such as working with torchtune and TorchTitan teams to improve compilation performance, and releasing native PyTorch compiled versions of at least two high-profile models in the second half of the year.
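For readers unfamiliar with it, the user-facing entry point the roadmap is investing in is simply torch.compile. A toy sketch follows; real LLM usage typically compiles the decoding step, often with mode="reduce-overhead".

```python
# Sketch: compiling a function with torch.compile.
import torch

@torch.compile
def mlp_step(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    return torch.relu(x @ w1) @ w2

x, w1, w2 = torch.randn(16, 64), torch.randn(64, 256), torch.randn(256, 64)
y = mlp_step(x, w1, w2)  # first call triggers compilation; later calls reuse it
```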

In addition, the compiler may gain visualization capabilities that generate model graphs expressing the forward computation and backward propagation in non-eager training modes.

There are also many plans for user support, such as improving the system's monitoring and observability to help users debug compilation problems on their own. Key goals also include establishing a user support team to address issues developers post on platforms such as GitHub in several key areas (dataclasses, context managers, etc.).

References:

https://dev-discuss.pytorch.org/t/meta-pytorch-team-2024-h2-roadmaps/2226

https://x.com/soumithchintala/status/1811060935211049046

https://www.assemblyai.com/blog/pytorch-vs-tensorflow-in-2023/