news

PaddlePaddle 3.0: Five new features including "large model training-inference integration"

2024-08-01


As basic software, the deep learning framework not only promotes the rapid progress of deep learning technology, but also lays a solid foundation for the widespread application of artificial intelligence technology.

The deep learning framework provides developers with convenient and easy-to-use development interfaces. These interfaces highly abstract data and operations, allowing developers to focus more on the design of algorithms and models without having to get bogged down in the details of processing underlying data. Through these interfaces, developers do not need to directly perceive and deal with complex underlying hardware development details, which greatly improves development efficiency and experience. Secondly, the deep learning framework also provides a powerful function called automatic differentiation. Developers usually only need to write the code for the forward propagation network, while the tedious back-propagation network is automatically completed by the framework.

As China's first independently developed, feature-rich, open-source deep learning platform, PaddlePaddle has evolved from version 1.0, which used static graphs by default, to version 2.0, which uses dynamic graphs by default and unifies dynamic and static execution as well as training and inference. The PaddlePaddle framework integrates the flexibility of dynamic graphs with the efficiency of static graphs, and supports hybrid parallel training of models. Recently, version 3.0, designed for the era of large models, was officially released, opening the road to a new generation of framework technology!

Design Thoughts

The design of the deep learning framework is crucial to promoting the development of artificial intelligence technology. Its core design goal is to make the innovation and application of deep learning technology easier.

How to do this?

The framework needs to fully consider the needs of developers and hardware manufacturers.

From the user's perspective, an excellent deep learning framework should provide developers with the ultimate development experience. This means not only a user-friendly development environment, but more importantly a significant reduction in developers' learning and time costs, together with a marked improvement in development convenience. To this end, the PaddlePaddle framework proposes the concept of "dynamic-static unification, training-inference integration, and automatic parallelization", which greatly improves development efficiency.

From the perspective of hardware adaptation, modern deep learning applications often need to run on a variety of hardware platforms. The framework must therefore be compatible with and adaptable to many different hardware devices. This requires the framework to intelligently abstract away the differences between hardware interfaces and achieve broad hardware adaptability. At the same time, in order to fully exploit hardware performance, the framework also needs software-hardware co-design capabilities to ensure that hardware resources deliver their best performance.

At the same time, a good framework also needs to take into account the overall development trend of AI technology and the actual application needs of the industry.

In terms of technology development, cutting-edge directions such as large language models (LLM), MoE (Mixture of Experts), multimodality, and AI for Science have gradually become new research hotspots. As model complexity increases, computing bottlenecks, storage bottlenecks, memory access bottlenecks, and communication bottlenecks have become prominent, and the demand for distributed training and general performance optimization has grown increasingly urgent.

At the industrial level, the framework needs to support the full workflow of training, compression, and inference. This means that from model training, to optimization, to actual deployment and inference, the framework should provide a complete and efficient solution to meet the industry's real needs for deep learning technology.

Only a framework that keeps up with trends and stands the test of continued refinement can provide continuous and stable support to developers across industry, academia, and research.


Design concept and main features of PaddlePaddle 3.0

In summary, PaddlePaddle aims to provide developers with a deep learning framework that is "dynamic-static unified, training-inference integrated, automatically parallel, automatically optimized, and broadly hardware-adaptable". Developers can write distributed code the way they write single-machine code and develop large models without having to perceive complex communication and scheduling logic; they can write neural networks in Python as if writing mathematical formulas, and achieve efficient execution without writing complex operator kernel code in hardware development languages.

Version 3.0 of the PaddlePaddle framework continues the 2.x design concept of dynamic-static unification and training-inference integration, and its development interface is fully compatible with the 2.x versions. This means that, in most cases, code developed with 2.x runs directly on 3.0 without modification. Four new features are introduced: dynamic-static unified automatic parallelism, compiler automatic optimization, large model training-inference integration, and large model multi-hardware adaptation. These features began development in version 2.6 or earlier and have now reached the stage where they can be tried externally, bringing significant improvements in user experience, performance, convenience of secondary development, and hardware adaptability. Version 3.0 also includes improvements to existing 2.x functionality, which remains mature and stable even when the new features are not used.

Framework Architecture Overview

In order to achieve the above characteristics of the deep learning framework, the architecture of the framework must be carefully designed to ensure that it can support the construction of various complex models and seamlessly connect with a variety of chips. Next, through an intuitive architecture diagram, we will show in detail the functional modules covered in the new generation of PaddlePaddle framework, as well as the interactions and connections between these modules. The following is the architecture diagram of the PaddlePaddle framework 3.0.


PaddlePaddle Framework 3.0 Architecture Diagram

Rich interfaces: The PaddlePaddle framework provides a variety of development interfaces for deep learning, covering tensor representation, mathematical computation, network construction, optimization strategies, and more. Through these interfaces, developers can easily build and train their own deep learning models without delving into the underlying technical details.

Under the development interface, the PaddlePaddle framework can be divided into four layers: the representation layer, the scheduling layer, the operator layer, and the adaptation layer.

Representation layer: focuses on the expression and conversion of computational graphs. Through the highly scalable intermediate representation PIR, it provides solid support for core functions such as dynamic-to-static conversion (dynamic graphs are converted to static graphs), automatic differentiation, automatic parallelization, operator combination, and computational graph optimization.

Scheduling layer: responsible for intelligent orchestration and efficient scheduling of code or computational graphs, with the ability to optimize GPU memory and host memory management according to actual needs, supporting efficient execution of both dynamic and static graphs. Whether developers choose dynamic or static graphs for model development, the PaddlePaddle framework provides an efficient execution environment while ensuring optimal resource utilization.

Operator layer: It is composed of the neural network compiler CINN and the operator library PHI, covering key functions such as tensor definition, operator definition, automatic operator fusion and operator kernel implementation.

Adaptation layer: It is used to achieve adaptation with the underlying chip, including device management, operator adaptation, communication adaptation, and compilation access.

The following sections focus on the major new upgrades of the PaddlePaddle 3.0 architecture, which mainly include the following modules:

1) Highly scalable intermediate representation (PIR): by creating a unified intermediate representation for the entire architecture, it breaks through the barriers between modules in the framework layer and enhances PaddlePaddle's potential in scientific computing, compilation optimization, and large models;

2) Automatic optimization by the neural network compiler, which greatly improves end-to-end model performance through automatic fusion and strategy tuning;

3) Automatic parallelization, which reduces the cost of model development and performance optimization in large model scenarios and significantly improves the user experience for large model workloads.

Highly Scalable Intermediate Representation (PIR)

The intermediate representation (IR) of the computational graph is an important cornerstone of deep learning framework performance optimization, inference deployment, compilers, and other directions. In recent years, more and more frameworks and researchers have introduced compiler technology into the optimization of deep learning models, using compiler concepts, techniques, and tools to automatically optimize neural networks and generate code. In the era of large models, there are higher requirements for IR in terms of flexibility, scalability, and completeness.

Therefore, in version 3.0, PaddlePaddle standardized the definition of the intermediate representation (IR) at the infrastructure level and achieved a unified representation across the entire architecture, so that development results can be shared across all upstream and downstream directions. PaddlePaddle's new generation IR architecture focuses on two important dimensions: high flexibility and high scalability. Through more complete and robust semantic expression capabilities, a unified training-inference representation across the entire architecture, and an efficient, pluggable performance-optimization (Pass) development mechanism, it supports complex semantics, more conveniently supports the rich sharding strategies required by automatic parallelism for large models, and connects seamlessly to the neural network compiler for automatic performance optimization and multi-hardware adaptation.



PaddlePaddle Intermediate Representation (PIR) abstracts a set of highly scalable basic components at the bottom layer, covering Type, Attribute, Op, Trait, and Interface, and introduces the concept of Dialect, giving developers the ability to flexibly extend and freely customize the IR, thereby providing comprehensive and robust semantic expression capabilities. At the model representation layer, modular management of multiple Dialects and a unified multi-end representation provide a unified, training-inference integrated representation of the entire architecture, enabling seamless connection between operators and the compiler and supporting automatic optimization and multi-hardware adaptation. At the graph transformation layer, unified underlying modules and simplified basic concepts give users a low-cost, easy-to-use, high-performance development experience, along with a rich, pluggable Pass optimization mechanism. PaddlePaddle PIR adheres to the static single assignment (SSA) principle, ensuring that the model is equivalent to a directed acyclic graph, and abstracts the computational graph with Value and Operation, where Operation represents nodes and Value represents edges.

Operation represents a node in the computational graph: each Operation represents an operator and contains zero or more Regions. A Region represents a closure and can contain zero or more Blocks. A Block represents a basic block that complies with the static single assignment (SSA) principle and contains zero or more Operations. These three can be nested recursively to construct arbitrarily complex syntactic structures.

Value represents a directed edge in the computational graph: it connects two Operations and thereby describes the use-define (UD) chain in the program. OpResult serves as the definition end, defining a Value; OpOperand serves as the use end, describing a use of a Value.
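As a conceptual sketch only (plain Python dataclasses, not the actual PIR C++ classes), the nesting and edge relationships described above can be pictured like this:

```python
from dataclasses import dataclass, field
from typing import List

# Conceptual sketch: an Operation owns Regions, a Region owns Blocks,
# a Block owns Operations, and Values connect defining ops to their users.

@dataclass
class Value:
    defining_op: "Operation"          # the OpResult end: the op that defines this value

@dataclass
class Operation:
    name: str
    operands: List[Value] = field(default_factory=list)    # OpOperand ends: values this op uses
    results: List[Value] = field(default_factory=list)     # OpResult ends: values this op defines
    regions: List["Region"] = field(default_factory=list)  # closures owned by this op

@dataclass
class Region:
    blocks: List["Block"] = field(default_factory=list)

@dataclass
class Block:
    ops: List[Operation] = field(default_factory=list)     # contents of an SSA basic block
```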

PaddlePaddle provides two Pass development mechanisms, PatternRewriter and Declarative Rewrite Rule (DRR), which balance the flexibility of customization with ease of development. The three-stage Pass development approach lets developers focus on the Pass logic itself without attending to the details of the underlying IR. Using the PIR Pass development mechanism, Pass development cost was reduced by 58%; applied to inference scenarios, more than 84% of models saw inference speedups of over 10%.

Neural Network Compiler Automatic Optimization

There are three reasons why we need to develop compiler technology:

1) Hardware development trend: judging from the history of hardware development and the characteristics of its technological evolution, compute capability is growing much faster than memory access performance, CPU performance, and bus bandwidth. Memory access performance limits memory-intensive operators (normalization, activation, etc.), while CPU performance and bus bandwidth limit scheduling performance. General optimization based on the compiler's automatic fusion can fuse multiple operators into one large operator, greatly improving model performance by reducing the amount of memory access and the number of operators; compiler technology will therefore become a standard component of deep learning frameworks.

2) Model development trend: model structures are diverse, and supporting this diversity relies heavily on the general-purpose optimization provided by the compiler.

3) Multi-hardware optimization: There are many types of hardware on the market. Different hardware platforms have different characteristics and optimization requirements. Each hardware requires a lot of manpower to optimize. With the help of compiler technology, the cost of such optimization technology can be greatly reduced.

Let's illustrate this with an example: RMS Normalization (Root Mean Square Layer Normalization), which is frequently used in the Llama model. Its calculation formula is relatively simple and clear.
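In its commonly used form, the computation is:

$$
\operatorname{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2} + \epsilon}} \odot g
$$

where $n$ is the size of the normalized dimension, $g$ is a learnable scale vector, and $\epsilon$ is a small constant added for numerical stability.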



Assuming we need to implement the RMS Normalization calculation, the easiest way is to use the tensor operation development interfaces provided by the PaddlePaddle framework, calling square, sum, division, square root, and other operations. The code is as follows:
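A minimal sketch of such a composition-based implementation, assuming the standard paddle tensor interfaces (the exact listing in the original may differ):

```python
import paddle

def rms_norm(x, weight, epsilon=1e-6):
    # Mean of squares over the last (hidden) dimension
    variance = x.pow(2).mean(axis=-1, keepdim=True)
    # Normalize by the root mean square, then apply the learnable scale
    return x * paddle.rsqrt(variance + epsilon) * weight
```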



The above code is easy to develop, but it performs poorly and consumes a large amount of GPU memory. Developers can instead hand-write a FusedRMSNorm operator, but this demands much more expertise and development effort.

With the help of neural network compiler technology, we can achieve significant performance improvements while maintaining high flexibility and ease of use. The performance test results of the RMSNorm operator on the A100 platform clearly demonstrate this: compared with the implementation that combines Python development interfaces, the compiled and optimized operator runs 4 times faster; even compared with manual operator fusion, a 14% performance improvement was achieved. This result fully demonstrates the balance the PaddlePaddle framework strikes between flexibility and performance.
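As a rough illustration of how one might hand such a function to the compiler, one possible approach (the build_cinn_pass flag and to_static usage below are assumptions and may vary across PaddlePaddle versions) is to convert the function to static graph mode with the CINN pass enabled:

```python
import paddle

def rms_norm(x, weight, epsilon=1e-6):
    variance = x.pow(2).mean(axis=-1, keepdim=True)
    return x * paddle.rsqrt(variance + epsilon) * weight

# Assumed usage: convert the composed operators to a static graph and turn on
# the CINN build pass so the compiler can fuse and optimize them.
build_strategy = paddle.static.BuildStrategy()
build_strategy.build_cinn_pass = True
compiled_rms_norm = paddle.jit.to_static(rms_norm, build_strategy=build_strategy)
```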

To this end, PaddlePaddle has taken neural network compiler technology as an important research and development direction. Below is the overall architecture diagram of the PaddlePaddle compiler.



At the representation layer, with the help of PIR's extensibility, the CINN frontend module handles graph-level transformations, including operator splitting, recomputation, subgraph partitioning, dimension derivation, and other modules, finally producing multiple subgraphs for which the compiler backend can generate optimized code. In the compiler backend, for these fused subgraphs, the compiler further calls the Compute function to lower them into a low-level intermediate representation (IR) composed of an abstract syntax tree (AST), and performs loop fusion on this basis to ensure they can be fused into a single kernel; on the CINN low-level IR, performance-tuning analysis is performed to obtain the optimal configuration; finally, the low-level IR is carefully converted into concrete code implementations.

Experimental results on the generative large language model Llama and the text-to-image model Stable Diffusion show that, using the compiler's optimization technology, inference speed is improved by 36% and 30% respectively compared with the baseline version without manual performance optimization.

Dynamic-Static Unified Automatic Parallelism

Why do we need automatic parallelization?

The current mainstream training method for large models uses a variety of parallel strategies. These strategies are based on "manual" parallelism implemented in dynamic graph mode: starting from single-card code, developers manually handle splitting (splitting tensors and the computational graph), communication (adding communication operators), GPU memory optimization (memory sharing, re-computation), scheduling optimization (pipeline orchestration, overlapping computation and communication), and other strategies. Developers must be familiar with the model structure and have a deep understanding of parallel strategies and framework scheduling logic, which makes the threshold for developing and optimizing large models very high. Besides a dedicated algorithm team responsible for model and algorithm innovation, a separate team dedicated to parallel optimization of the model is also needed, which creates many obstacles to the innovation and iteration of large models.

Let's take a simple example to illustrate the difference between large model development and single-card logic. Because parallel strategies change the shape of tensors at runtime, every operator involved in shape handling needs to consider whether it is affected by the parallel strategy. For example, in the reshape below, the splitting strategy changes the input shape, so the output shape must be adjusted accordingly:
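A hypothetical illustration of this point (the sizes, names, and tensor-parallel degree below are made up for the example):

```python
import paddle

batch, seq_len, hidden, num_heads = 8, 1024, 4096, 32

# Single-card logic: reshape an activation into attention heads.
x = paddle.randn([batch, seq_len, hidden])
y = x.reshape([batch, seq_len, num_heads, hidden // num_heads])

# Under tensor-parallel splitting with degree 2, each device holds only half of
# the hidden dimension, so a hand-written reshape must adjust its target shape.
tp_degree = 2
x_local = paddle.randn([batch, seq_len, hidden // tp_degree])
y_local = x_local.reshape(
    [batch, seq_len, num_heads // tp_degree, hidden // num_heads]
)
```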



To this end, we proposed an automatic parallel solution unified across dynamic and static graphs. Developers only need a small number of tensor sharding annotations, and the framework automatically derives the distributed sharding status of all tensors and operators and adds appropriate communication operators to guarantee correct results. Finally, based on the model structure and cluster information, and combined with GPU memory and scheduling-layer optimization, it automatically finds the most efficient distributed parallel strategy.

In the automatic parallel design, developers only need a small number of tensor sharding annotations. We abstract two kinds of sharding: sharding tensors (parameters, inputs) and sharding the computational graph (pipeline stages). To implement these, the framework needs a mechanism to describe the mapping between distributed tensors and computing devices. To this end, we introduce two distributed concepts, ProcessMesh and Placements. A ProcessMesh maps each GPU card to a process and maps multiple devices to a one-dimensional or multi-dimensional array of processes. The following figure shows two different ProcessMesh representations built from 8 devices, and the sketch below shows how such meshes might be constructed in code.
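A minimal sketch of constructing such meshes (device IDs and dimension names are illustrative assumptions):

```python
import paddle.distributed as dist

# The same 8 devices arranged as a one-dimensional mesh...
mesh_1d = dist.ProcessMesh([0, 1, 2, 3, 4, 5, 6, 7], dim_names=["x"])

# ...or as a two-dimensional 2x4 mesh, e.g. a data-parallel axis "dp"
# and a model-parallel axis "mp".
mesh_2d = dist.ProcessMesh([[0, 1, 2, 3], [4, 5, 6, 7]], dim_names=["dp", "mp"])
```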



Placements is a list whose elements are one of three distributed tags: Replicate, Shard, and Partial. Its length matches the number of dimensions of the ProcessMesh, and each element indicates how the distributed tensor is split along the corresponding device dimension. The three distributed tags are described in detail as follows:

As shown in the following figure, Replicate means the tensor exists as identical copies on different devices; Shard means the tensor is split along a specific dimension across different devices; Partial means the tensor on each device is incomplete and needs a Reduce Sum, Reduce Mean, or similar operation across devices to become complete.



After completing the distributed tag abstraction, we can call the paddle.distributed.shard_tensor() interface to mark how a tensor is sharded. Through these markings and the automatic derivation of tensor sharding, we can express complex distributed hybrid parallelism. The following figure shows a concrete example of hybrid parallelism consisting of data parallelism, tensor model parallelism, and pipeline parallelism.



The following code shows a concrete example of hybrid parallelism.
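A minimal sketch using the interfaces described above (the mesh layout, shapes, and names are assumptions for illustration, not the original listing):

```python
import paddle
import paddle.distributed as dist

# Assume 8 devices arranged as a 2x4 mesh: a data-parallel axis "dp" and a
# tensor-model-parallel axis "mp".
mesh = dist.ProcessMesh([[0, 1, 2, 3], [4, 5, 6, 7]], dim_names=["dp", "mp"])

# A weight matrix that is replicated along "dp" and split along its second
# dimension across "mp".
w = paddle.randn([4096, 4096])
w_dist = dist.shard_tensor(w, mesh, [dist.Replicate(), dist.Shard(1)])

# A Partial placement would instead mark a tensor whose per-device values
# still need a Reduce Sum / Reduce Mean to become complete.
```

In a real run this would typically be launched as multiple processes (for example via paddle.distributed.launch), with the framework automatically deriving the sharding of downstream tensors and inserting the required communication.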



By adopting the automatic parallel development approach, developers no longer need to consider complex communication logic. Taking the Llama task as an example, the amount of core distributed training code was reduced by 50%, greatly lowering the difficulty of development; our experiments also show that, with the help of global analysis and other optimizations, performance can exceed that of manually parallelized dynamic graphs.

In the future, we will further explore fully automatic parallelism that requires no tensor sharding annotations at all, allowing developers to write distributed code as if it were single-machine code and further improving the large model development experience.

Industrial advantages

In general, the new generation of the PaddlePaddle framework, PaddlePaddle Framework 3.0-Beta, is designed for large models and heterogeneous multi-chip hardware. Downward, it adapts to heterogeneous chips and fully releases hardware potential; upward, it supports large model training and inference in an integrated manner. It provides four major capabilities: dynamic-static unified automatic parallelism, compiler automatic optimization, large model training-inference integration, and large model multi-hardware adaptation, comprehensively improving its ability to serve industry.

Dynamic-static unified automatic parallelism: this feature greatly reduces the cost of industrial development and training. Users only need to add a small amount of tensor sharding annotation on a single card, and the PaddlePaddle framework automatically derives the distributed sharding information and adds communication operators to guarantee logical correctness. At the same time, based on the model structure and cluster information, and combined with GPU memory and scheduling-layer optimization, PaddlePaddle automatically finds the most efficient distributed parallel strategy, greatly reducing the development cost of hybrid parallel training and allowing developers to focus on model and algorithm innovation.

Compiler automatic optimization: this feature significantly reduces the cost of performance optimization. PaddlePaddle's compiler is designed to be integrated with the framework and supports efficient training and dynamic-shape inference of generative models, scientific computing models, and others, providing a good balance between computing flexibility and high performance. Through automatic operator fusion and code generation, the inference performance of generative models such as Llama2 and Stable Diffusion has improved by more than 30%.

Large model training-inference integration: this feature provides the industry with an ultimate development experience. It enables training and inference capabilities to be reused, providing a unified development experience and ultimate training efficiency across the entire large model workflow. Through dynamic-to-static conversion, training and inference work can be seamlessly connected. Generation computations in the RLHF (reinforcement learning from human feedback) training process can reuse inference optimizations, achieving a 2.1x acceleration; inference quantization scenarios reuse the distributed automatic parallel strategy from training, improving efficiency by 3.8x.

Large model multi-hardware adaptation: one of PaddlePaddle's important features is adapting to heterogeneous multi-chip hardware and fully unleashing hardware potential. In terms of the access mechanism, PaddlePaddle provides simple and efficient abstract interfaces and a basic operator system, reducing adaptation cost. In terms of the operating mechanism, it optimizes scheduling orchestration, storage sharing, and other mechanisms, improving scheduling efficiency. In terms of operator kernels, PaddlePaddle provides a compiler-based automatic fusion and tuning solution to improve end-to-end performance. At the same time, PaddlePaddle has built R&D infrastructure for new hardware manufacturers, such as code integration, continuous integration, and model regression testing. These mechanisms ensure that new hardware is included in PaddlePaddle's normal release pipeline, so users can install and try it directly without compiling. PaddlePaddle's complete and low-cost access mechanism has attracted hardware manufacturers to contribute 3,456 PRs to PaddlePaddle, comprising more than 25,000 commits.

This is PaddlePaddle's new generation framework 3.0. Currently, the 3.0-Beta version is open to developers, and all development interfaces are fully compatible with 2.0. Developers are welcome to try it out and provide feedback.