
How is Apple Intelligence developed? Here is the most complete explanation

2024-07-31


Written by Ma Xuewei

Siri has finally transformed into "AI Siri", and the much-anticipated Apple Intelligence is here.

Along with the launch of Apple Intelligence in iOS 18, iPadOS 18, and macOS Sequoia, Apple also released a technical report on its in-house large models, disclosing a large number of technical details that drew considerable attention from the industry.

According to the report, Apple Intelligence comprises multiple high-performance generative models that are fast, efficient, designed for users' everyday tasks, and able to adapt instantly to a user's current activity. The foundation models built into Apple Intelligence are fine-tuned for user experiences such as writing and polishing text, prioritizing and summarizing notifications, creating playful images for conversations with family and friends, and taking in-app actions to simplify interactions across apps.

In the technical report, the Apple team detailed how two of these models were built and adapted to perform specialized tasks efficiently and accurately: AFM-on-device, a language model with approximately 3 billion parameters, and AFM-server, a larger server-based language model. (AFM stands for Apple Foundation Model.)

Figure | AFM model overview

These two foundational models are part of a larger family of generative models Apple has created to support users and developers; this includes a programming model based on the AFM language model, used to build intelligence in Xcode, and a diffusion model that helps users express themselves visually, such as in the Messages app.

How does AFM perform?

AFM underwent rigorous evaluation during development. The results show that the models perform well in pre-training, post-training, and task-specific evaluations, and align with Apple's core values and responsible AI principles.

1. Pre-training evaluation

The Apple team used public benchmarks such as HELM MMLU, HELM Lite, and the OpenLLM leaderboard to evaluate the language understanding and reasoning capabilities of the AFM models. The models achieved strong results across multiple metrics, demonstrating solid language understanding and reasoning and laying the foundation for subsequent post-training and task-specific applications.

2. Post-training evaluation

The Apple team evaluated the AFM models on both general capabilities and specific ones such as instruction following, tool use, and writing, using a combination of human evaluation and automated benchmarks. The results are as follows:

  • Human evaluation: The AFM models perform on par with or better than other open-source and commercial models on multiple tasks, demonstrating that they can understand and follow complex instructions and generate high-quality text.

Figure | Human raters prefer the AFM models over comparable open-source and commercial models.


  • Instruction-following evaluation: The AFM models achieve excellent results on benchmarks such as IFEval and AlpacaEval 2.0 LC, demonstrating that they can effectively understand and follow instructions.

Figure | Instruction-following capability of the AFM models compared with related models, measured using IFEval; higher values indicate better capability.

  • Tool-use evaluation: The AFM models achieve the best overall accuracy on the Berkeley Function Calling Leaderboard benchmark, indicating that they can use tools effectively.

Figure | AFM-server achieves the best overall accuracy, outperforming Gemini-1.5-Pro-Preview-0514 and GPT-4.

  • Writing evaluation: The AFM models perform well on internal summarization and composition benchmarks, demonstrating that they can generate fluent, high-quality text.

Figure | AFM compared with some of the most prominent models as well as smaller open-source models. AFM-on-device achieves performance comparable to or better than Gemma-7B and Mistral-7B. AFM-server significantly outperforms DBRX-Instruct and is comparable to GPT-3.5 and GPT-4.

  • Mathematics evaluation: The AFM models achieve excellent results on benchmarks such as GSM8K and MATH, indicating that they can effectively solve mathematical problems.

Figure | Performance of the post-trained AFM models on mathematical benchmarks, including GSM8K and MATH. AFM-on-device performs significantly better than Mistral-7B and Gemma-7B.

In addition, the team ran task-specific and safety evaluations. Using human evaluation and task-specific benchmarks, they measured the AFM models on tasks such as email, message, and notification summarization; the models outperformed others in accuracy, completeness, and readability.

On the safety side, the team used adversarial datasets and human evaluation to probe the AFM models' resistance to harmful content and sensitive topics. The models showed good robustness to adversarial prompts and sensitive topics, largely avoiding harmful or inappropriate responses.

How AFM is built

Architecture

Like most mainstream models, AFM is based on the Transformer architecture, but adopts specific design choices to improve efficiency and performance. The main components are:

  • Transformer blocks: AFM uses standard Transformer blocks, including multi-head attention and feed-forward neural networks.

  • Shared input/output embedding matrices: This design reduces the number of model parameters and improves memory efficiency.

  • Pre-normalization and RMSNorm: these techniques improve training stability and help the model learn more complex patterns.

  • Query/Key Normalization: This technique further improves the stability of training.

  • Grouped Query Attention (GQA): The GQA mechanism reduces memory usage and improves computational efficiency.

  • SwiGLU activation function: This activation function improves the efficiency of the model.

  • RoPE Position Embedding: The RoPE mechanism supports encoding of long texts and improves the model’s ability to represent context.

Figure | AFM-on-device uses a model dimension of 3072 and 26 Transformer layers, each with 24 query heads and 8 key/value heads (head dimension 128), making it suitable for on-device inference.
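To make the components above concrete, here is a minimal NumPy sketch of RMSNorm, the SwiGLU feed-forward block, and the key/value-head sharing behind grouped-query attention. The tensor sizes are toy values chosen for illustration (except the 24-query/8-KV head split, which matches the figure); this is not Apple's implementation.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: normalize by the root mean square of the features (no mean subtraction).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward block: a SiLU-gated linear unit.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def gqa_expand_kv(k, n_query_heads):
    # Grouped-query attention: each key/value head serves a group of query
    # heads, so KV tensors are repeated to match the query-head count.
    n_kv_heads = k.shape[0]
    return np.repeat(k, n_query_heads // n_kv_heads, axis=0)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16                    # toy sizes; AFM-on-device uses d_model = 3072
x = rng.normal(size=(2, d_model))
h = rms_norm(x, weight=np.ones(d_model))
out = swiglu(h,
             rng.normal(size=(d_model, d_ff)),
             rng.normal(size=(d_model, d_ff)),
             rng.normal(size=(d_ff, d_model)))
k = rng.normal(size=(8, 5, 128))         # 8 KV heads, sequence length 5, head dim 128
k_full = gqa_expand_kv(k, n_query_heads=24)
print(out.shape, k_full.shape)           # (2, 8) (24, 5, 128)
```

Storing 8 KV heads instead of 24 is what shrinks the KV cache and gives GQA its memory savings, since the repeat happens only transiently at compute time.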

Pre-training

The pre-training of the AFM models is designed to produce a powerful language model that supports the various features of the Apple Intelligence system. The models are trained on Cloud TPU clusters using the AXLearn framework, which supports large-scale models and long sequence lengths and provides efficient training and inference performance.

The AFM pre-training dataset consists of various types of high-quality data, including:

  • Web content: Publicly available information crawled using Applebot and filtered.

  • Licensed datasets: High-quality datasets obtained from publishers, providing diverse long-text data.

  • Code: Open source code data obtained from GitHub, covering multiple programming languages.

  • Mathematics: Web data containing mathematical content such as math problems, forums, blogs, tutorials, and seminars.

  • Public datasets: Publicly available datasets that have been evaluated and screened.

AFM pre-training is divided into three stages:

  • Core stage: train on the largest dataset, with the primary goal of learning basic language knowledge and patterns.

  • Continued stage: building on the core stage, add code and mathematical data and reduce the weight of web data to further broaden the model's knowledge.

  • Context-extension stage: building on the continued stage, use longer sequence lengths and synthetic long-form data to improve the model's ability to process long texts.

Post-training

AFM acquires strong language understanding during pre-training, but post-training is required to adapt it to specific tasks such as email, message, and notification summarization. The post-training process includes:

  • Supervised Fine-tuning (SFT):

    • Data collection: use human-annotated and synthetic data, ensuring the data is high-quality, diverse, and covers a wide range of natural-language usage scenarios.

    • Data blending: Carefully select and combine human data and synthetic data to form high-quality data blends.

    • Fine-tuning method: Use the LoRA adapter to fine-tune the model, adjusting only the adapter parameters and retaining the general knowledge of the model.

  • Reinforcement Learning with Human Feedback (RLHF):

    • Reward Model: Use human preference data to train a reward model and evaluate the quality of the model’s responses.

    • Iterative Teaching Committee (iTeC): Iteratively improves the model using multiple preference optimization algorithms including rejection sampling, direct preference optimization, and online reinforcement learning.

    • Online RLHF algorithm (MDLOO): Uses Mirror Descent policy optimization and Leave-One-Out advantage estimator to maximize rewards and improve model quality.
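The LoRA fine-tuning used in the SFT step above can be sketched in a few lines: the pretrained weight stays frozen, and only a low-rank update is trained. The sizes and the alpha scaling below are illustrative defaults, not values from the report.

```python
import numpy as np

def lora_forward(x, w_frozen, a, b, alpha=16.0):
    # LoRA: the frozen pretrained weight plus a trainable low-rank update.
    # Only `a` (rank x d_in) and `b` (d_out x rank) are updated in training.
    rank = a.shape[0]
    return x @ w_frozen + (x @ a.T @ b.T) * (alpha / rank)

rng = np.random.default_rng(1)
d_in, d_out, r = 16, 16, 2               # toy sizes; real adapters use the model dims
w = rng.normal(size=(d_in, d_out))       # frozen pretrained weight
a = rng.normal(size=(r, d_in)) * 0.01    # A initialized small
b = np.zeros((d_out, r))                 # B initialized to zero: adapter starts as a no-op
x = rng.normal(size=(4, d_in))

# At initialization the adapted model matches the base model exactly,
# which is what preserves the model's general knowledge.
assert np.allclose(lora_forward(x, w, a, b), x @ w)
```

Because the adapter is just the pair (A, B), it can be trained per task and swapped at runtime, which is exactly the property the "runtime-swappable adapters" optimization below relies on.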

Advantages of post-training:

  • Improved model quality: Post-training significantly improves the quality and performance of the AFM model, making it excel in specific tasks.

  • Compliance with Apple's core values and responsible AI principles: the post-training process carefully accounts for data quality, safety, and the filtering of harmful content, ensuring the model complies with Apple's core values and responsible AI principles.

  • Scalability: The post-training method can be extended to other tasks, enabling the AFM model to support more Apple Intelligence features.

Inference Optimization

AFM not only needs to have strong language understanding capabilities, but also needs to be able to run efficiently on devices such as iPhone, iPad and Mac, as well as Private Cloud Compute on Apple silicon servers. To achieve this goal, Apple has developed a series of optimization techniques to ensure that the AFM model runs efficiently on specific tasks while maintaining the overall model quality.

Optimization:

  • Model quantization: AFM models are quantized using 4-bit quantization technology, significantly reducing model size and inference cost.

  • Accuracy-recovery adapters: LoRA adapters restore the accuracy of the quantized model to near that of the unquantized model.

  • Mixed Precision Quantization: Quantize each layer of the model using 4-bit and 2-bit quantization precision to further reduce memory usage while maintaining model quality.

  • Interactive model analysis: the Talaria tool is used to analyze the model's latency and power consumption, guide quantization bit-rate selection, and optimize model performance.

  • Runtime-swappable adapters: LoRA adapters fine-tune the model for specific tasks while preserving its general knowledge, and can be swapped at runtime.
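A minimal sketch of symmetric 4-bit weight quantization illustrates the first optimization above: weights are rounded to 16 integer levels with a per-row scale, shrinking storage roughly 8x versus float32 at the cost of a small reconstruction error, which is the gap the accuracy-recovery adapters are trained to close. The per-row scheme here is a common textbook choice, not necessarily Apple's exact recipe.

```python
import numpy as np

def quantize_4bit(w):
    # Symmetric per-row 4-bit quantization: integers in [-8, 7] plus a float scale.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.normal(size=(4, 64)).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
err = float(np.abs(w - w_hat).mean())
print(f"mean abs reconstruction error: {err:.4f}")  # small but nonzero
```

Mixed-precision quantization extends the same idea by choosing the integer range (for example 4-bit vs 2-bit) per layer according to each layer's sensitivity.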

Optimization case: summarization:

  • Data collection: collect input data covering emails, messages, and notifications, and perform cleaning and deduplication.

  • Synthetic summary generation: use AFM-server to generate synthetic summaries that meet product requirements, with rule- and model-based filters to ensure data quality.

  • Training-data injection: add the summaries generated by AFM-server to the training data, helping the on-device AFM model better understand and generate summaries.
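A toy rule-based filter shows what the "rules" part of the filtering step above might look like. The length limits and checks here are entirely hypothetical illustrations, not the actual criteria from the report.

```python
# Hypothetical rule-based filter for synthetic summaries (illustrative only).
def keep_summary(source: str, summary: str,
                 min_words: int = 3, max_words: int = 30) -> bool:
    words = summary.split()
    if not (min_words <= len(words) <= max_words):
        return False  # too short to be useful, or too long to be a summary
    if summary.strip().lower() == source.strip().lower():
        return False  # a summary must not simply copy the input
    return True

examples = [
    ("Meeting moved to 3pm tomorrow, room 204.", "Meeting moved to 3pm in room 204."),
    ("Long email body about quarterly planning ...", "ok"),  # rejected: too short
]
print([keep_summary(src, summ) for src, summ in examples])  # [True, False]
```

In the report's pipeline a model-based quality check would run alongside rules like these; only summaries passing both would enter the adapter's training set.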

In addition, Apple Intelligence follows a set of responsible AI principles: empower users, represent users, design with care, and protect privacy. In the technical report, Apple rebutted accusations that it used ethically questionable methods to train certain models, reiterating that it did not use private user data but rather a combination of publicly available and licensed data for Apple Intelligence, and emphasizing that AFM's training data was obtained in a "responsible" manner.