
Revealed! A 47-page report dissects Apple Intelligence, from architecture and data to training and optimization

2024-07-31


Synced (Machine Heart) Report

Synced Editorial Department

At the 2024 Worldwide Developers Conference, Apple launched Apple Intelligence, a new personalized intelligence system that provides practical intelligent services across iPhone, iPad, and Mac and is deeply integrated into iOS 18, iPadOS 18, and macOS Sequoia.

Tim Cook has said that Apple Intelligence is a new chapter in Apple's innovation and will change the way users use its products. He emphasized that Apple's unique approach combines generative artificial intelligence with users' personal information to provide truly useful intelligent services. In addition, Apple Intelligence can access information in a completely private and secure way to help users accomplish what matters most to them. This, he said, is Apple's unique AI experience.

Now, more than a month after Apple Intelligence was officially announced, the technology has landed on devices and the accompanying technical documentation has been released.

In the past day, users with an iPhone 15 Pro or iPhone 15 Pro Max have been able to download the iOS 18.1 developer beta and try out Apple Intelligence features.

With the release of this 47-page technical report, we can gain a deeper understanding of the secret weapon behind Apple Intelligence.



Report address: https://machinelearning.apple.com/papers/apple_intelligence_foundation_language_models.pdf

The report details two of these models: AFM-on-device (AFM stands for Apple Foundation Model), a language model with about 3 billion parameters, and AFM-server, a larger server-based language model. Both can perform specialized tasks efficiently, accurately, and responsibly (Figure 1).

These two base models exist as part of Apple's larger family of generative models.



Architecture and training

The AFM base models are dense decoder-only models built on the Transformer architecture, with the following design choices (a minimal code sketch follows the list):

Shared input/output embedding matrices to reduce the memory used for parameters.

RMSNorm pre-normalization to improve training stability.

Query/key normalization to improve training stability.

Grouped Query Attention (GQA) with 8 key-value heads to reduce the KV cache memory footprint.

SwiGLU activation to improve efficiency.

RoPE positional embeddings with the base frequency set to 500k to support long contexts.
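To make these design choices concrete, here is a minimal NumPy sketch of the individual components (shared embeddings, RMSNorm, SwiGLU, RoPE frequencies, and the GQA head counts). Apart from the 8 KV heads and the 500k RoPE base taken from the report, the dimensions and head counts are illustrative assumptions, not the real AFM hyperparameters.

```python
import numpy as np

N_Q_HEADS  = 24         # hypothetical number of query heads (not from the report)
N_KV_HEADS = 8          # grouped-query attention: 8 key-value heads (per the report)
ROPE_BASE  = 500_000    # RoPE base frequency set to 500k for long context

# Shared input/output embeddings: one matrix both embeds tokens and produces logits.
vocab, toy_dim = 1000, 64
embedding = np.random.randn(vocab, toy_dim) * 0.02
hidden = np.random.randn(4, toy_dim)
logits = hidden @ embedding.T            # output projection reuses the embedding matrix

def rms_norm(x, weight, eps=1e-5):
    """RMSNorm, used for pre-normalization and for query/key normalization."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: SiLU(x W_gate) * (x W_up), then project back down."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(x) = x * sigmoid(x)
    return (silu * (x @ w_up)) @ w_down

def rope_frequencies(head_dim, base=ROPE_BASE):
    """Rotary-embedding frequencies; a larger base lengthens the wavelengths,
    which is what supports longer contexts."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

# GQA: groups of query heads share a KV head, so the KV cache only stores
# N_KV_HEADS heads instead of N_Q_HEADS.
print(f"KV cache size relative to standard multi-head attention: {N_KV_HEADS / N_Q_HEADS:.2f}")
```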



The AFM pre-training process plays a key role in developing high-performance language models to power the range of Apple Intelligence features. The research team focused on efficiency and data quality to achieve a high-quality end-to-end user experience.

In terms of post-training, the research team found that improving general-purpose post-training lifts the performance of all Apple Intelligence features, because the model gains stronger instruction-following, reasoning, and writing capabilities.

To ensure that these model capabilities remain consistent with Apple's commitment to protecting user privacy and with its Responsible AI principles, the post-training work includes a series of innovations in data collection and generation, instruction tuning, and alignment. The post-training process consists of two stages: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The research team proposed two new post-training algorithms: (1) a rejection-sampling fine-tuning algorithm with a teacher committee (iTeC), and (2) an RLHF algorithm for reinforcement-learning iterations that uses mirror descent policy optimization and a leave-one-out advantage estimator (MDLOO); both significantly improve model quality.
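The report's text here does not spell out MDLOO in detail, but the leave-one-out advantage idea itself is simple: for each prompt, sample several responses and baseline each response's reward with the mean reward of the other samples. Below is a minimal sketch of a generic leave-one-out estimator of this kind, not Apple's exact formulation.

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """Leave-one-out advantages for k sampled responses to one prompt:
    each reward is compared against the mean reward of the other k-1 samples."""
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    baselines = (r.sum() - r) / (k - 1)   # mean reward of the other samples
    return r - baselines

# Example: four sampled responses scored by a reward model.
print(leave_one_out_advantages([0.2, 0.9, 0.5, 0.4]))   # positive = better than the rest
```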

Apple Intelligence Features

The base model is designed specifically for Apple Intelligence, the personal intelligence system that powers iPhone, iPad, and Mac.

Apple found that it could bring small models to state-of-the-art performance by fine-tuning them for specific tasks, and it developed an architecture based on runtime-swappable adapters that lets a single base model specialize for dozens of such tasks. Figure 2 shows a high-level overview.



Adapter Architecture

Apple uses LoRA adapters to fine-tune models for specific tasks. For each task, the researchers adapt all linear projection matrices in the AFM self-attention layers and the fully connected layers in the pointwise feed-forward networks. By fine-tuning only the adapters, the original parameters of the pre-trained base model remain unchanged, preserving the model's general knowledge while customizing each adapter to support a specific task.
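As a rough illustration of the mechanism (not Apple's implementation), a LoRA adapter adds a trainable low-rank update on top of a frozen linear projection. The sketch below uses rank 16, the value the report later identifies as the best capacity/performance trade-off; the other dimensions are arbitrary.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16):
    """y = x W + (alpha / r) * x A B, with W frozen and only A, B trained per task."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)

d_in, d_out, rank = 512, 512, 16            # rank 16 per the trade-off discussed below
W = np.random.randn(d_in, d_out) * 0.02     # frozen pretrained projection weight
A = np.random.randn(d_in, rank) * 0.02      # trainable down-projection
B = np.zeros((rank, d_out))                 # trainable up-projection, zero-initialized
                                            # so the adapter starts as a no-op
x = np.random.randn(2, d_in)
print(lora_linear(x, W, A, B).shape)        # (2, 512), same shape as the frozen layer
```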

Quantization

To fit AFM into edge devices with a limited memory budget and to reduce inference cost, quantization techniques need to be considered. Previous studies have found that 4-bit quantized models suffer only a small quality loss compared with the original 32/16-bit floating-point versions.
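For intuition, here is a minimal round-to-nearest symmetric 4-bit quantization sketch; the report's actual scheme (and its sub-4-bit average, described below) is more sophisticated than this.

```python
import numpy as np

def quantize_4bit(w):
    """Map float weights to 4-bit integers in [-8, 7] with a per-tensor scale."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = (np.random.randn(4, 4) * 0.1).astype(np.float32)
q, s = quantize_4bit(w)
print(np.abs(w - dequantize(q, s)).max())   # small reconstruction error
```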

In order to achieve the best balance between model capacity and inference performance, Apple has developed a state-of-the-art quantization method and a framework that uses accuracy-recovery adapters. This allows the model to achieve nearly lossless quantization with an average of less than 4 bits per weight, and provides flexible quantization options.

Method

After post-training, the model is compressed and quantized to an average of less than 4 bits per weight. Quantized models usually show a moderate quality loss, so Apple does not use the quantized model directly for feature development; instead, it attaches a set of parameter-efficient LoRA adapters to recover the lost quality.

It is worth noting that training the accuracy-recovery adapters is sample-efficient and can be seen as a mini version of training the base model. During the adapters' pre-training phase, only about 10 billion tokens (about 0.15% of base-model training) are needed to fully recover the quantized model's capabilities.

Since application adapters will be fine-tuned from these accuracy-recovery adapters, they do not incur any additional memory usage or inference cost. Regarding adapter size, Apple found that an adapter rank of 16 provides the best trade-off between model capacity and inference performance.

However, for flexibility, Apple provides a set of accuracy-recovery adapters with different ranks {8, 16, 32} for application teams to choose from.

Mixed Precision Quantization

Every transformer block and every layer in AFM has residual connections. Therefore, it is unlikely that all layers are equally important. Based on this intuition, Apple further reduces memory usage by pushing some layers to use 2-bit quantization (the default is 4-bit). On average, AFM-on-device can be compressed to only about 3.5 bits per weight (bpw) without significant quality loss.
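As a toy calculation (the layer split below is an assumption for illustration, not Apple's actual assignment), mixing mostly 4-bit layers with some 2-bit layers is what brings the average down to roughly 3.5 bits per weight:

```python
def average_bpw(layer_specs):
    """Average bits per weight given (relative weight count, bits) per layer group."""
    total_bits = sum(n * b for n, b in layer_specs)
    total_weights = sum(n for n, _ in layer_specs)
    return total_bits / total_weights

# Hypothetical split: 3/4 of the weights at 4-bit, 1/4 pushed down to 2-bit.
print(average_bpw([(75, 4), (25, 2)]))   # -> 3.5
```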

Evaluation

The research team used common open source evaluation tools and benchmarks to evaluate the AFM pre-trained model. Table 2 shows the results of AFM-on-device and AFM-server on HELM MMLU v1.5.0.



These benchmarks demonstrate that the AFM pre-trained model has strong language and reasoning capabilities, providing a solid foundation for post-training and feature fine-tuning.





Figure 3 below compares AFM with open-source models (Phi-3, Gemma-1.1, Llama-3, Mistral, DBRX-Instruct) and commercial models (GPT-3.5 and GPT-4). Human evaluators favored the AFM models over the other models. In particular, AFM-on-device achieved a 47.7% win rate against Phi-3-mini despite being 25% smaller, and it even outperforms the strong open-source baselines Gemma-7B and Mistral-7B.



To measure the model’s ability to generate responses that follow the instructions in the prompt, the research team evaluated AFM-on-device and AFM-server on the IFEval benchmark. The results are shown in Figure 4 below:



As shown in Figure 5, AFM-server achieves the best overall accuracy, outperforming Gemini-1.5-Pro-Preview-0514 and GPT-4.



Apple compared AFM with some of the best models as well as smaller open-source models. As shown in Figure 6, AFM-on-device achieves performance comparable to or better than Gemma-7B and Mistral-7B. AFM-server significantly outperforms DBRX-Instruct and GPT-3.5 and is comparable to GPT-4.



Figure 7 compares the performance of the post-trained AFM on a math benchmark. It is found that the AFM-on-device significantly outperforms Mistral-7B and Gemma-7B, even though it is less than half their size.



The figure below shows the quality of the AFM-on-device adapter, Phi-3-mini, Llama-3-8B, and Gemma-7B as judged by human raters on the summarization task. Figure 8 shows that the AFM-on-device adapter outperforms the other models overall.



Responsible AI

Apple Intelligence is developed and designed with user privacy in mind.

Figure 9 summarizes the violation rates given by human raters on different models, where lower is better. Both AFM-on-device and AFM-server are robust to adversarial prompts, with significantly lower violation rates than the open-source and commercial models.



Figure 10 shows that human raters favored the AFM models over the other models.