news

another chip to challenge gpu

2024-10-04


summary

for a 3-billion-parameter llm, a research prototype inference appliance with 16 ibm aiu northpole processors delivered a system throughput of 28,356 tokens/second at less than 1 ms/token latency per user, while the 16 northpole cards together consume only 672 w in a compact 2u form factor. focusing on low latency and high energy efficiency, the 12 nm northpole is compared with a set of gpus (7/5/4 nm) at various power consumptions. at the lowest gpu latency, northpole provides a 72.7x better energy efficiency metric (tokens/s/w) while still providing lower latency.

introduction

large language models (llms) have achieved impressive performance across a range of ai tasks, such as assisting programming with code suggestions, performing well on standardized tests, and aiding the creation of articles, blogs, images, and videos.

large-scale deployment of llms, and of artificial intelligence more broadly, raises two major and conflicting challenges: energy consumption and response latency.

first, since llms require substantial energy for both training and inference, a sustainable computing infrastructure is needed for their efficient and widespread deployment. as data center carbon footprints expand and data centers become increasingly energy constrained, data center energy efficiency becomes increasingly important. according to a report from the world economic forum:

"currently, the environmental carbon footprint of data centers is mainly divided into two parts: training accounts for 20%, and inference accounts for 80%. as artificial intelligence models develop in different fields, the demand for inference and its environmental footprint will escalate."

second, many applications, such as interactive conversations and autonomous workflows, require very low latency. within a given computing architecture, reducing latency can be achieved by reducing throughput, but this results in reduced energy efficiency. to paraphrase a classic system maxim:

"the throughput problem can be solved with money, but the delay problem is more complicated because the speed of light is fixed." (paraphrased from [10], replacing "bandwidth" with "throughput".)

gpus can achieve lower latency by using smaller batch sizes, but at the expense of throughput and energy efficiency. in addition, gpu sharding reduces latency by using data parallelism across multiple gpus, but again at the expense of energy efficiency. sharded or not, gpus appear to hit a hard lower bound on latency. the gpu trade-off between energy efficiency and latency is shown in figure 1.

figure 1: northpole (12 nm) performance relative to current state-of-the-art gpus (7/5/4 nm) on energy and system latency metrics, where system latency is the total latency experienced by each user. at the lowest gpu latency (h100, point p2), northpole provides a 72.7x better energy efficiency metric (tokens/second/w). at the best gpu energy efficiency metric (l4, point p1), northpole provides 46.9x lower latency.

therefore, a key research question explored in this paper is how to simultaneously achieve the two conflicting goals of low latency and high energy efficiency.

northpole is an ecosystem of inference accelerator chips and software co-designed from first principles to deliver superior efficiency for neural network inference. although northpole was not specifically designed for llms, this paper demonstrates, perhaps surprisingly, that the new northpole architecture can achieve low-latency, energy-efficient llm inference (figure 1, figure 2, and table i).

table i: performance measurements

measured performance of northpole and gpu systems on a per-card basis. for each metric, ↓ indicates lower is better and ↑ indicates higher is better. for the 16-card northpole appliance, power consumption is measured per card and total system throughput is divided across the 16 cards, while northpole latency is measured across all 16 cards. p1, p2, p3, and p4 refer to the points marked in figures 1 and 2, indicating the best gpu energy efficiency metric, the lowest overall gpu latency, the best gpu space metric, and the lowest latency of the most energy-efficient gpu (l4), respectively.

the main research results of this article are as follows:

for a large language model (llm) with 3 billion parameters, whose structure is derived from the ibm granite-8b-code-base model and is consistent with llama 3 8b and mistral 7b [14], this paper demonstrates a research prototype inference appliance configured with 16 northpole processors.

in terms of absolute performance, the appliance delivers 28,356 tokens/second of system throughput at a per-user latency of under 1 ms per token, while consuming 672 watts across the 16 northpole cards in a 2u form factor.

in terms of relative performance, comparing the 12 nm northpole with a range of gpus (the 7 nm a100, 5 nm l4 and l40s, and 4 nm h100) at various power consumptions, figures 2(a) and 2(c) show that: at the lowest gpu latency (point p2), northpole provides a 72.7x better energy efficiency metric (tokens/second/w) and a 15.9x better space metric (tokens/second/transistor), while still providing 2.5x lower latency; at the best gpu energy efficiency metric (point p1), northpole provides 46.9x lower latency and a 2.1x better space metric, while still providing a 2.2x better energy efficiency metric; at the best gpu space metric (point p3), northpole provides 20.3x lower latency and a 5.3x better energy efficiency metric, while still providing a 1.4x better space metric.

in particular, comparing the 12 nm northpole with the 5 nm l4 gpu at comparable power consumption, figure 2(e) shows that at the highest l4 throughput (under 50 ms per token, point p1), northpole provides 46.9x lower latency while delivering 1.3x higher throughput; and at the lowest l4 latency (point p4), northpole provides 36.0x higher throughput (tokens/second/card) while still providing 5.1x lower latency.

figure 2: panels (a)–(d) show the performance of the 12 nm northpole relative to current state-of-the-art gpus (7/5/4 nm) on energy efficiency, space, and system latency metrics, where system latency is the total latency experienced by each user.

panel (a) is the same as figure 1, with point p3 added. panels (a) and (c) use a single gpu, while panels (b) and (d) use sharding, which can reduce latency but only at the expense of energy and space efficiency. at the lowest gpu latency (h100, point p2), northpole provides a 72.7x better energy efficiency metric (tokens/second/w) and a 15.9x better space metric (tokens/second/transistor), while still providing 2.5x lower latency; at the best gpu energy efficiency metric (l4, point p1), northpole provides 46.9x lower latency and a 2.1x better space metric, while still providing a 2.2x better energy efficiency metric; at the best gpu space metric (a100, point p3), northpole provides 20.3x lower latency and a 5.3x better energy efficiency metric, while still providing a 1.4x better space metric.

panel (e) shows the performance of the 12 nm northpole relative to the 5 nm l4 gpu on throughput (tokens/second/card) and system latency metrics. at the lowest l4 latency (point p4), northpole provides 36.0x higher throughput; at the highest l4 throughput (under 50 ms per token, point p1), northpole provides 46.9x lower latency. the gpu power consumption used to compute each energy efficiency metric is shown in table i. since no instrumentation was available to measure actual power consumption at different batch sizes, the same power is used for all batch sizes, which may underestimate the energy efficiency metric, but the qualitative results still hold.

northpole architecture

as shown in figure 3, the northpole processor is manufactured in a 12 nm process, has 22 billion transistors, and occupies 795 mm². its architecture is inspired by the brain, optimized for silicon, and derived from ten complementary design axioms spanning compute, memory, communication, and control, enabling northpole to significantly outperform other architectures on standard ai inference tasks, even when compared with processors built in more advanced process technologies.

for the detailed axioms of the northpole architecture, see [11], [12]. briefly, northpole arranges 256 modular cores in a 16×16 two-dimensional array. each core contains a vector-matrix multiplier (vmm) that performs 2048, 4096, or 8192 operations per cycle at int8, int4, and int2 precision, respectively. each core also includes a 4-way, 32-slice fp16 vector unit and a 32-slice activation function unit. the core array has a total of 192 mb of sram, with each core holding 0.75 mb. on-chip memory is tightly coupled to the compute units and control logic, with a total bandwidth of 13 tb/s between core memory and compute. in addition, each core has 4096 wires crossing it horizontally and vertically for passing parameters, instructions, activations, and partial sums over four dedicated networks-on-chip (nocs). to prevent stalls, an on-chip frame buffer with 32 mb of sram decouples off-chip communication of input and output data from the core array's on-chip computation.
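
as a quick sanity check on the scale of these figures, here is a minimal python sketch using only the numbers quoted above; the per-cycle totals are simple aggregates across the core array, not a performance claim:

```python
# back-of-envelope check of the on-chip figures quoted above; all inputs come
# from the text, and the totals are aggregate array-level counts only.

CORES = 16 * 16                        # 256 cores in a 16x16 array
SRAM_PER_CORE_MB = 0.75                # per-core SRAM
FRAME_BUFFER_MB = 32                   # on-chip buffer decoupling off-chip I/O
OPS_PER_CYCLE_PER_CORE = {"int8": 2048, "int4": 4096, "int2": 8192}

total_sram_mb = CORES * SRAM_PER_CORE_MB
print(f"core-array SRAM: {total_sram_mb:.0f} MB (+ {FRAME_BUFFER_MB} MB frame buffer)")

for precision, ops in OPS_PER_CYCLE_PER_CORE.items():
    print(f"{precision}: {CORES * ops:,} ops/cycle across the core array")
```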

figure 3: northpole processor: silicon (left), die (middle), packaged module (right).

inference appliance

the northpole design has been prototyped on a pcie gen3 x8 card, shown in figure 4, and 16 such cards are installed in an off-the-shelf 2u server to form the research prototype inference appliance shown in figure 5. the server contains two intel xeon gold 6438m processors, each with 32 cores and a 60 mb cache, clocked at 2.2 ghz, along with 512 gb of 4800 mhz ddr5 memory. two pcie gen5 x16 buses are connected to each server processor, providing a total of 256 gb/s of pcie bandwidth (bidirectional). these four buses are extended to the system's 16 pcie slots via pcie bridges, with one northpole card installed in each slot. the 16 northpole cards use up to half of the available 256 gb/s of pcie bandwidth.
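
a small arithmetic sketch of this bandwidth budget; the ~8 gb/s per-direction figure for a pcie gen3 x8 link is an approximation that ignores protocol overhead:

```python
# rough consistency check of the pcie budget described above; the gen3 x8
# per-direction bandwidth is an approximation (protocol overhead ignored).
HOST_PCIE_GBPS = 256          # stated total host pcie bandwidth
CARDS = 16
GEN3_X8_GBPS = 8              # ~8 GB/s per direction for a pcie gen3 x8 card (approx.)

aggregate = CARDS * GEN3_X8_GBPS
print(f"16 cards x gen3 x8 ≈ {aggregate} GB/s "
      f"≈ {aggregate / HOST_PCIE_GBPS:.0%} of the {HOST_PCIE_GBPS} GB/s host budget")
```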

figure 4: northpole pcie card.

figure 5: exploded view of the research prototype device showing the installation of 16 northpole pcie cards. northpole cards can communicate with the host through the standard pcie endpoint model, or directly and more efficiently with each other through additional hardware capabilities on each card.

the system runs red hat enterprise linux 8.9, and northpole uses the built-in vfio kernel driver so that user-space software can manage the hardware. the system uses the iommu for address translation and enables security features such as device isolation and virtualization, so applications can run in virtual machines or containers.

each northpole card receives and transmits data via a dma engine residing on the card. these dma engines work independently and can receive and transmit tensors simultaneously in multiple ways. the first is the standard pcie endpoint model, in which the card's dma engine reads input tensors from host memory and writes output tensors back to host memory after the computation completes. the second leverages additional hardware capabilities on each card that allow northpole cards to communicate directly with one another over pcie, without staging transfers through host memory or requiring additional software management at runtime. direct northpole-to-northpole communication enables larger models to span multiple northpole chips while reducing the communication latency and overhead of a purely software-managed system.
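
a hypothetical sketch of the two transfer paths just described; the class and method names below are illustrative stand-ins, not the actual northpole runtime api:

```python
# hypothetical illustration of the two dma transfer paths described above;
# none of these names correspond to the real northpole runtime api.
from enum import Enum, auto

class TransferPath(Enum):
    HOST_STAGED = auto()    # standard pcie endpoint model: card dma <-> host memory
    PEER_TO_PEER = auto()   # direct card-to-card transfer over pcie, no host staging

def send_activation(tensor, src_card, dst_card, path=TransferPath.PEER_TO_PEER):
    """Route an intermediate tensor between cards along one of the two paths."""
    if path is TransferPath.HOST_STAGED:
        staged = src_card.dma_to_host(tensor)       # hypothetical call
        return dst_card.dma_from_host(staged)       # hypothetical call
    return src_card.dma_to_peer(tensor, dst_card)   # hypothetical card-to-card dma
```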

mapping llms to northpole devices

the strategy for mapping llms, illustrated in figure 6, is motivated by three key observations. first, for sufficiently large models, an entire transformer layer fits in the memory of a single northpole chip with weights, activations, and kv cache in int4 format ("w4a4"), while the output layer fits on two chips. second, if the weights and kv cache reside entirely on-chip, the runtime only needs to transfer small embedding tensors between layers, which fits within the bandwidth of pcie gen3 x8. third, a prototype northpole appliance can be easily assembled by installing 16 northpole pcie cards in an off-the-shelf server.

this suggests a strategy of mapping each transformer layer to its own northpole card using gpipe-style pipeline parallelism, splitting the output layer across two northpole cards using tensor parallelism, and sending the embedding tensors between layers over pcie gen3 x8. during inference, a mini-batch of user requests (say, n requests) is divided into m equal micro-batches and pipelined through the 16 northpole cards.

while pipeline parallelism has been exploited in llm training (where there are no latency constraints), its use in inference has been limited by the batch size required to reduce the idle time of each pipeline stage, or pipeline bubbles. for example, some studies have found that efficient training requires the number of micro-batches m to be roughly four times the number of pipeline stages. the mini-batch size n is limited by (a) the per-token latency required by the system and (b) the memory available to store the kv cache of the entire mini-batch. low-latency compute and 13 tb/s of on-chip memory bandwidth enable northpole to achieve extremely low per-token latency, so the limiting factor when choosing n is the on-chip memory available to store the entire kv cache. furthermore, we find that setting the number of micro-batches m equal to the number of pipeline stages is sufficient to make pipeline idle time negligible.

in the experiments reported in this paper, we chose a mini-batch size of n = 28, divided into m = 14 equal micro-batches, giving a micro-batch size of 2 for each northpole card's computation. our architectural design choices for efficient computation at such small batch sizes are key to achieving the efficiencies shown in figure 1 and table i.
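
an idealized python sketch of this mapping, using the numbers above (14 transformer-layer cards, the output layer sharded over two more, and n = 28 requests split into m = 14 micro-batches of 2); the round-robin steady-state schedule is a simplification, not the actual scheduler:

```python
# idealized sketch of the pipeline mapping described above: 14 transformer
# layers on cards 0-13, the output layer sharded over cards 14-15, and
# n = 28 requests divided into m = 14 micro-batches of size 2.

N_REQUESTS, M_MICRO, STAGES = 28, 14, 14
MICRO_BATCH_SIZE = N_REQUESTS // M_MICRO            # = 2 requests per micro-batch

card_of_layer = {layer: layer for layer in range(STAGES)}   # transformer layer i -> card i
output_layer_cards = (14, 15)                               # tensor-parallel output layer

def micro_batch_on(card, step):
    """In steady state, card c works on micro-batch (step - c) mod M, GPipe style."""
    return (step - card) % M_MICRO if step >= card else None  # None while the pipeline fills

# once the pipeline is full, every card holds a distinct micro-batch each step:
print([micro_batch_on(c, step=20) for c in range(STAGES)])
```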

llm model and training method

a. llm model

the model used to test our system is based on the open-source ibm granite-8b-code-base model, an 8-billion-parameter transformer decoder with 36 transformer layers, a hidden size of 4096, an ffn intermediate size of 14,336, 32 attention heads, 8 key-value heads using grouped-query attention (gqa), and a vocabulary of 49,152 tokens. to fit on a single server with 16 northpole cards, we used a 3-billion-parameter version of the model with 14 transformer layers and an output layer, quantized to w4a4 precision but otherwise structurally unchanged.

notably, this model configuration matches llama 3 8b [13] and mistral 7b [14] on a per-layer basis, differing only in the number of layers, model vocabulary size, and training data used.
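
as a rough check that this configuration lands near 3 billion parameters, the sketch below counts weights assuming a llama-style gated ffn (three projection matrices) and an untied output layer, ignoring biases and normalization weights; it is an estimate, not the exact figure:

```python
# rough parameter count for the 14-layer variant described above; assumes a
# llama-style gated ffn and untied input/output embeddings, and ignores
# biases and normalization weights, so the total is approximate.

d_model, d_ffn, n_layers, vocab = 4096, 14336, 14, 49152
n_heads, n_kv_heads = 32, 8
d_head = d_model // n_heads              # 128
d_kv = n_kv_heads * d_head               # 1024 key/value width under gqa

attention = 2 * d_model * d_model + 2 * d_model * d_kv   # q and o full width, k and v reduced
ffn = 3 * d_model * d_ffn                                # gate, up, and down projections
per_layer = attention + ffn

embeddings = vocab * d_model             # input embedding table
output_layer = vocab * d_model           # output (lm head) layer, split across two cards

total = n_layers * per_layer + embeddings + output_layer
print(f"≈ {total / 1e9:.2f} billion parameters")   # roughly 3B; exact count depends on included terms
```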

b. training with full accuracy

to restore the original model's task accuracy after quantization, the following procedure was used to create the model weights. first, a baseline model was trained from scratch on 1 trillion code tokens spanning 116 languages at full fp16 precision, following the recipe of [4]. next, the baseline model's output layer weights and inputs, and its silu activations, were quantized to int8, while all other weights, linear-layer inputs, and matrix-multiplication inputs were quantized to int4. finally, quantization-aware training with the lsq algorithm on a further 8.5 billion tokens from the python subset of the training data, with a learning rate of 8×10⁻⁵ and a batch size of 128, recovered the accuracy lost to quantization. the step sizes of the activation quantizers are trained with a warm start that raises their learning rate by a factor of 200 for the first 250 training steps, helping them adapt quickly to the data.
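
for intuition, here is a minimal numpy sketch of symmetric int4 fake-quantization of the kind used for w4a4 weights and activations; unlike this sketch, the procedure above learns the step size with lsq (with a warm start) rather than deriving it from the tensor range:

```python
import numpy as np

# minimal fake-quantization sketch for int4 (symmetric, per-tensor). in the lsq
# procedure described above the step size s is a learned parameter trained by
# backpropagation; here it is simply derived from the tensor's dynamic range.

def fake_quant_int4(x: np.ndarray) -> np.ndarray:
    qmin, qmax = -8, 7                         # signed int4 range
    s = float(np.abs(x).max()) / qmax          # step size (lsq would learn this)
    if s == 0.0:
        return x
    q = np.clip(np.round(x / s), qmin, qmax)   # quantize onto the integer grid
    return q * s                               # dequantize back to real values

w = (0.02 * np.random.randn(4096, 4096)).astype(np.float32)
w_q = fake_quant_int4(w)
print("max abs quantization error:", float(np.abs(w - w_q).max()))
```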

the baseline fp16 model running on a gpu and the quantized model running on northpole achieve pass@10 accuracies on humanevalsynthesize-python within 0.01 of each other (0.3001 on the gpu vs. 0.2922 on northpole). compared with the granite-8b-code-base model, overall training was scaled down, since the focus here is on characterizing hardware performance rather than pushing the boundaries of task accuracy.

runtime application

during inference, as shown in figure 6, tokens are generated by a highly pipelined user application running on the host cpu. it preprocesses text into input tensors using a tokenizer and embedding layer, feeds the input tensors to the first northpole card in the appliance, receives the resulting output tensor from the last northpole card, post-processes that output with a decoder and detokenizer, and loops the resulting token back as the next input. the user application is also responsible for the user interface and for higher-level optimizations such as prompt prefill.

to offload the neural network workload to northpole, the user application calls a user-space runtime library with a simple api that configures each northpole card's layer weights and kv cache at initialization time and sends and receives input and output tensors at runtime. the weights and kv cache are configured to remain in on-chip memory and never need to be streamed off-chip at runtime. the runtime library also manages the on-chip frame buffers so that the northpole cores never stall for lack of input data or of receivers for output data. intermediate tensors are passed between cards without host intervention, as described in section 4.
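
a hypothetical sketch of the host-side token loop just described; the tokenizer, embedding, decode, and runtime objects are stand-ins passed in as arguments, since the actual user-space api is not given in the text:

```python
# hypothetical host-side generation loop mirroring the flow described above;
# the tokenizer, embed, decode, and runtime objects are illustrative stand-ins,
# not the real northpole user-space runtime api. prompt prefill is omitted.

def generate(prompt, tokenizer, embed, runtime, decode, max_new_tokens=1024):
    token_ids = tokenizer.encode(prompt)            # preprocess: text -> token ids
    for _ in range(max_new_tokens):
        x = embed(token_ids[-1])                    # embedding layer on the host cpu
        runtime.send_to_first_card(x)               # hypothetical: input tensor to card 0
        y = runtime.receive_from_last_card()        # hypothetical: output of the last card
        next_id = decode(y)                         # greedy decode on the host
        token_ids.append(next_id)                   # loop the token back as the next input
    return tokenizer.decode(token_ids)
```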

performance results

the 16-card northpole appliance achieved a throughput of 28,356 tokens/second on the 3-billion-parameter llm. the sequence length was configured as 2048 (a 1024-token prompt and 1024 generated tokens), and the decoder uses greedy sampling.

for comparison with gpus, we measured the single-card performance of two gpus designed for low-power inference (l4 and l40s) and two designed for high-throughput training (a100 and h100). all systems run the same llm model and configuration, with northpole running at w4a4 precision and the gpus running at the best available w4a16 precision, since, to our knowledge, no w4a4 cuda kernels are available. in our gpu experiments, we used a gptq-quantized model and benchmarked it with the vllm (version 0.5.4) marlin kernels for comparison with northpole. gptq quantization provides strong inference performance on the gpu by reducing weight precision while maintaining acceptable accuracy, and the marlin kernels optimize matrix operations, especially sparse and dense matrix multiplications. benchmarking against the vllm runtime allows us to evaluate throughput and latency while ensuring the model runs as well as possible on the given hardware configuration. in the multi-gpu experiments, tensor parallelism equal to the number of available cards was used over nvlink to obtain the smallest possible latency. our experiments show that although sharding reduces latency, it lowers per-card gpu throughput. it is worth noting that northpole's advantage comes mainly from its huge on-chip memory bandwidth, and only secondarily from its lower precision.
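
the gpu baselines are described only at a high level, but a minimal vllm (0.5.x) harness of the kind implied might look like the sketch below; the model path is a placeholder and the paper's exact benchmarking setup is not given:

```python
import time
from vllm import LLM, SamplingParams

# illustrative sketch of a vllm gptq benchmark of the kind described above;
# the model path is a placeholder, not the actual checkpoint used in the paper.
llm = LLM(model="path/to/gptq-quantized-3b-model",   # hypothetical gptq (w4a16) checkpoint
          quantization="gptq",                       # vllm dispatches to marlin kernels where supported
          tensor_parallel_size=1)                    # >1 when sharding across gpus

params = SamplingParams(temperature=0.0, max_tokens=1024)   # greedy decoding, 1024 new tokens
prompts = ["..."] * 28                                      # a mini-batch of user prompts

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start
tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput ≈ {tokens / elapsed:.0f} tokens/s")
```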

table i shows the measured performance results for northpole and gpu systems on a per-card basis. basic metrics include throughput, latency, space, and energy metrics, defined below.

the total number of tokens generated in response to a mini-batch of input prompts is:

tokens_gen = M × micro_batch_size × tok_seq_len    (1)

where M is the number of micro-batches, micro_batch_size is the number of user requests per micro-batch, and tok_seq_len is the number of output tokens generated for a single user. system throughput is the total number of tokens generated in response to the input prompts (tokens_gen), divided by the total time required to process them, including the prompt prefill time (prompt_time) and the token generation time (token_gen_time):

throughput_system = tokens_gen / (prompt_time + token_gen_time)    (2)

throughput is compared on a per-card basis by dividing the system throughput by the number of processing cards in the system:

throughput_card = throughput_system / num_cards    (3)

latency measures the average time between output tokens generated for a particular user: the time it takes an embedding tensor to flow through the processing pipeline, plus the prompt prefill time amortized over the number of tokens generated for that user:

latency = token_gen_time / tok_seq_len + prompt_time / tok_seq_len    (4)

similarly, combining equations 1, 2, and 4:

latency = mini_batch_size / throughput_system    (5)

where mini_batch_size = M × micro_batch_size. note that this is the system latency seen by each user.

to compare systems with different numbers of cards, we extend the space and energy metrics defined in [11] by normalizing by the number of cards in the system. the resulting space and energy metrics are the per-card throughput, normalized by the number of processor transistors per card and by the power per card, respectively:

space metric = throughput_card / transistors_per_card    (6)

energy metric = throughput_card / power_per_card    (7)

if system throughput scaled proportionally with the number of pipeline cards in the system, the per-card normalization would cancel out, leaving the space and energy metrics constant with the number of cards. in practice, system throughput typically scales sublinearly with the number of cards because of communication and synchronization overhead.
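
plugging the headline northpole numbers from the text into these definitions (treating the 672 w as the total for the 16 cards, as stated above) gives the following sketch:

```python
# recomputing the headline northpole figures from the metric definitions above;
# inputs are taken from the text, and 672 w is treated as the 16-card total.

system_throughput = 28_356            # tokens/s for the 16-card appliance
num_cards = 16
total_power_w = 672                   # total power across the 16 northpole cards
transistors_per_card = 22e9           # transistors per northpole processor
mini_batch_size = 28                  # user requests in flight (n)

throughput_card = system_throughput / num_cards                  # eq. 3, tokens/s/card
energy_metric = throughput_card / (total_power_w / num_cards)    # eq. 7, tokens/s/w
space_metric = throughput_card / transistors_per_card            # eq. 6, tokens/s/transistor
latency_ms = mini_batch_size / system_throughput * 1e3           # eq. 5, ms/token per user

print(f"{throughput_card:.0f} tokens/s/card, {energy_metric:.1f} tokens/s/W, "
      f"{space_metric:.1e} tokens/s/transistor, {latency_ms:.2f} ms/token")
```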

conclusion

we make the following contributions:

we demonstrated a multi-card northpole research prototype appliance.

we showed that large neural network models such as llms can be efficiently split across multiple northpole processors, extending our previous work showing that a single northpole processor outperforms other architectures on visual inference tasks (resnet50, yolo-v4).

we demonstrate that northpole's unique architecture is well suited for llm inference, enabling it to significantly outperform edge and data center gpus on the twin goals of low latency and high energy efficiency.

because the northpole device must be used as a unit, it is most efficient for high-throughput applications.

this preliminary paper provides a springboard for further research into energy efficiency optimization, mapping of larger llms on correspondingly larger northpole devices, new llm models co-optimized with the northpole architecture, and future system and chip architectures.