
Apple AI arrives on the iPhone, but the upgraded Siri does not come with ChatGPT! A 47-page technical report reveals the secrets of the self-developed models

2024-07-31



New Intelligence Report

Editor: Editorial Department

【New Intelligence Introduction】This morning, developers were stunned by the sudden release of the iOS 18.1 beta. Unexpectedly, Apple AI is already available to try, and a flood of hands-on reviews has swept the internet. Even more surprising, the 47-page technical report on the foundation models behind Apple AI has also been published online.

Early in the morning, the long-awaited first preview version of "Apple AI" was officially pushed to developers!


The three major systems, iOS 18.1, iPadOS 18.1, and macOS Sequoia 15.1, are all equipped with Apple's latest AI capabilities.

Users who got the iOS 18.1 beta first are already cheering, and wave after wave of hands-on impressions has flooded the internet.


The latest preview version contains many surprises (quick preview):

  • New Siri: When awakened, it will light up softly at the edge of the screen; communicate with users and switch between text and voice at will; understand commands even when the speaker stumbles; and can also answer troubleshooting questions about Apple products

  • Writing Tools: Rewrite, proofread, and summarize text in any scenario (available in Notes, documents, and third-party apps)

  • Focus Mode (Reduce Interruptions): Only show notifications you need to see right away

  • Photo function: Search photos and make videos using natural language

  • Generate AI summaries for emails, messages, and voicemail transcriptions


In addition, there are some features that Apple says will launch next year, including ChatGPT integration, image and emoji generation, automatic photo cleanup, and a smarter Siri with on-screen awareness.

Note that for now the iOS 18.1 beta (along with iPadOS and macOS) is only available in the United States and has not yet launched in China.

Moreover, only iPhone 15 Pro and iPhone 15 Pro Max support the new system.



According to the system information, the iOS 18.1 beta takes up a total of 15.44 GB of storage, of which the iOS system itself accounts for 12.58 GB and Apple AI only 2.86 GB.

This is because the model used by Apple on its end devices has only 3 billion parameters.


A more detailed introduction to the model is tucked inside the freshly released Apple AI technical report.

The 47-page paper covers the design and evaluation of Apple's LLMs, including architecture, data curation, pre-training and post-training recipes, optimization, feature adaptation, and evaluation results.


Paper address: https://machinelearning.apple.com/papers/apple_intelligence_foundation_language_models.pdf

Specifically, Apple has developed two new basic language models that form the core of Apple AI:

One is the end-side model AFM-on-device, which has approximately 3 billion parameters. After optimization, it can run on the iPhone and other terminal devices with higher efficiency and responsiveness.

The other is a larger model called AFM-server that runs on Apple's cloud servers; it is designed for compute-intensive tasks and uses the Private Cloud Compute system to protect user data.


Recall that at last month's WWDC, Cook announced Apple's AI capabilities to the world, giving the entire Apple product family an epic upgrade.

Across the internet, the consensus was that whether AI catches on or not, you still have to look at Apple AI.


Normally, Apple releases the main iOS 18 system first.

Unexpectedly, this time Apple pushed the beta to the first batch of developers so quickly.

Bloomberg's latest report points out that Apple broke its usual software release rhythm because Apple AI still needs more testing time.


I wonder what new worlds the first batch of early adopters discovered?

Netizens put it to the test

Apple tech blogger Brandon Butch immediately put out a video walkthrough of the Apple AI features in the iOS 18.1 beta, the most comprehensive one so far.


However harsh the words, they can be made pleasant to the ear

He said Apple AI helped him find a better way to express what he wanted to say.


In the message interface, write down what you want to say in the input box.

Then select all of it and tap the Apple AI button to use the "Friendly" option in the writing tools; the AI will immediately make the tone of the passage more tactful.


Let's look at another netizen who deliberately wrote a swear word and felt much better after having the AI rewrite it.


Grammar and spelling correction

Butch also exclaimed that Grammarly has been rendered obsolete, and that this is the real Apple AI.


Take the following passage: "informutive" is misspelled, the first letter of "What" is not capitalized, and "what do you think" should end with a question mark instead of a period.

It can be seen that Apple AI has corrected all of them for you.


Apple's AI capabilities in Mail are just as impressive.


It also supports the capabilities of the writing tools in memos and messages, including proofreading, rewriting, etc.


The summary of an email will be displayed at the top.


The animations in Apple's AI writing tools are very Apple-like. Rather than a dense stream of tokens pouring out of the model, everything feels smooth.



The new Siri is super smooth

Looking at the screen edge effect when calling Siri, I have to say that Apple understands design the best.



Let’s take a look at the iPad version of Siri.


An AI engineer at Humane who formerly worked at Apple praised Siri after testing it, saying that Apple's AI is very, very fast.


Ask Siri how tall the Eiffel Tower is and where it is located.


Then ask it for some recent news about the Paris Olympics and how to watch the events.

In a short while, Apple AI answered all the questions.


AI transcription and summaries: never miss an important phone call

In addition, Apple AI can also help you transcribe phone calls into notes to record what you talked about.


When the record button is pressed, both the caller and the recipient hear a tone indicating that the call is being recorded.


After the recording is completed, you can directly enter the notification pop-up window to view the recording content.


Focus Mode

Use Apple AI to automatically analyze notification content and detect important notifications!


Notifications from important people will be pinned at the bottom of the screen.


Photo search draws plenty of complaints

Of course, the reason iOS 18.1 was released to developers first is to allow more testing, surface problems, and better improve Apple's AI capabilities.

Just recently, a YouTube blogger found that Siri was still rather clueless when testing the photo feature.


The blogger first asked, "Siri, show me photos of my 2022 Thanksgiving trip." Siri instead replied with something about how many times he had opened the Health app...

Then he repeated the question, "Siri, find photos of Thanksgiving in Photos."


What’s funny is that Siri searched a lot of Thanksgiving-related pictures directly from the Internet.

When he asked again, "Siri, show me photos of my trip to Taiwan," Siri took the words literally as keywords and searched "My Trip to Taiwan" on the Internet.

Then he continued to ask, but Siri still didn't understand.

A stubborn blogger versus a clueless Siri; it's quite the comedy...

As mentioned at the beginning, Apple's ability to put AI onto end devices rests on the team's self-developed foundation models, and they are the real highlight.

The iPhone's AI Revolution: 3 Billion Parameters in Your Pocket

Specifically, AFM is a decoder-only dense model based on the Transformer architecture.


The design ideas are as follows (a minimal code sketch follows the list):

  • Share input/output embedding matrices to reduce parameter memory usage

  • Use RMSNorm pre-normalization to improve training stability

  • Query/key normalization to improve training stability

  • Grouped Query Attention (GQA) with 8 key-value heads, reducing the memory footprint of the KV cache

  • More efficient SwiGLU activation

  • RoPE position embedding with base frequency of 500k, supporting long context
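
Put together, the bullet points above describe a fairly standard modern decoder block. The sketch below, in PyTorch, shows how those pieces (RMSNorm pre-normalization, query/key normalization, GQA with 8 KV heads, SwiGLU, and RoPE with a 500k base) fit into one block; the hidden sizes, head count, and layer names are illustrative assumptions, not Apple's actual configuration.

```python
# Minimal sketch of an AFM-style decoder block. Dimensions are illustrative;
# the full model would also tie (share) the input embedding and output
# projection weights, which is not shown here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def rope(x, base=500_000.0):
    # Rotary position embedding with a 500k base frequency for long context.
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(t, dtype=torch.float32, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class DecoderBlock(nn.Module):
    def __init__(self, dim=2048, n_heads=16, n_kv_heads=8, hidden=5632):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        self.q_norm, self.k_norm = RMSNorm(self.head_dim), RMSNorm(self.head_dim)
        self.attn_norm, self.mlp_norm = RMSNorm(dim), RMSNorm(dim)  # pre-normalization
        self.w_gate = nn.Linear(dim, hidden, bias=False)  # SwiGLU feed-forward
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.wq(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)            # query/key normalization
        q, k = rope(q), rope(k)
        rep = self.n_heads // self.n_kv_heads            # GQA: queries share KV heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(b, t, -1))
        h = self.mlp_norm(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))

block = DecoderBlock()
out = block(torch.randn(1, 16, 2048))   # (batch, sequence, dim)
```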


Adapter Architecture

By using a LoRA adapter, Apple's base model can be dynamically specialized on the fly based on the task at hand.

These small neural network modules can be plugged into the layers of a base model to fine-tune the model for a specific task.

To facilitate the training of adapters, Apple has also created an efficient infrastructure that enables adapters to be quickly added, retrained, tested, and deployed when the base model or training data is updated or new features are required.
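
As a rough illustration of the idea, the sketch below shows one common way a LoRA adapter is attached to a frozen linear layer; the rank, scaling, and layer choice are assumptions for illustration, and the report does not disclose Apple's exact adapter implementation.

```python
# Sketch: a LoRA adapter wrapped around a frozen base linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Swapping a task adapter amounts to reloading the small lora_a/lora_b weights
# while the (quantized) base model stays resident in memory.
layer = LoRALinear(nn.Linear(2048, 2048, bias=False), rank=16)
y = layer(torch.randn(1, 8, 2048))
```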

Optimization

In order to meet the daily needs of users, the team adopted a variety of optimization and quantization techniques to significantly reduce memory usage, latency, and power consumption while maintaining model quality.


Method

During the post-training phase, Apple compressed and quantized the model, reducing each weight to less than 4 bits on average.

The quantized model usually has a certain degree of quality loss. Therefore, instead of directly handing the quantized model to the application team for function development, the R&D team attached a set of parameter-efficient LoRA adapters to restore the model quality.

Each product team then fine-tuned the LoRA adapter for their specific functionality by initializing the adapter weights from the accuracy-recovery adapters, while keeping the quantized base model unchanged.

It is worth noting that training the accuracy-recovery adapter is sample efficient and can be thought of as training a miniature version of the base model.

During the adapter's pre-training stage, only about 10 billion tokens (roughly 0.15% of the base model's training) are needed to fully restore the capabilities of the quantized model.

Since the application adapters will be fine-tuned from these accuracy recovery adapters, they will not incur any additional memory usage or inference cost.

Regarding the size of the adapter, the team found that a rank-16 adapter provided the best balance between model capacity and inference performance.

However, to provide more flexibility, Apple provides a set of precision recovery adapters of different ranks for application teams to choose from.
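
Assuming the adapters are stored as plain LoRA weight dictionaries, a feature adapter initialized from an accuracy-recovery adapter might look like the sketch below; the key names and shapes are hypothetical.

```python
# Sketch: each product team's feature adapter starts from the accuracy-recovery
# adapter of the same rank; the quantized base model is never modified.
import torch

def init_feature_adapter(recovery_adapter: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    # Copy the recovery weights, then fine-tune only these tensors
    # on the feature's training data.
    return {name: w.clone() for name, w in recovery_adapter.items()}

# Illustrative rank-16 recovery adapter for a single 2048x2048 projection.
recovery = {
    "layer0.wq.lora_a": torch.randn(16, 2048) * 0.01,
    "layer0.wq.lora_b": torch.randn(2048, 16) * 0.01,
}
summarization_adapter = init_feature_adapter(recovery)
```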

Quantization

Another benefit brought by precision recovery adapters is that they allow for more flexible choice of quantization schemes.

In the past, when quantizing large language models, one would usually divide the weights into small blocks, normalize each block by its corresponding maximum absolute value to filter out outliers, and then apply the quantization algorithm on a block basis.

Although a larger block size reduces the number of effective bits per weight and improves throughput, the quantization loss also increases. To balance this trade-off, the block size is usually set to a smaller value, such as 64 or 32.
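
A minimal sketch of this block-wise scheme is shown below, using symmetric 4-bit quantization and a block size of 32 as illustrative choices; Apple's exact quantization algorithm is not specified in this level of detail.

```python
# Sketch: block-wise low-bit quantization. Each block is scaled by its max
# absolute value, then rounded to a 4-bit integer grid (stored in int8 here).
import torch

def quantize_blockwise(w: torch.Tensor, block_size: int = 32, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1                            # 7 for 4-bit symmetric
    flat = w.reshape(-1, block_size)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(flat / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(2048, 2048)
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
```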

But in Apple’s experiments, the team found that the precision recovery adapter can significantly improve the Pareto front of this trade-off.

The more aggressive the quantization scheme, the more error the adapters recover. As a result, Apple can use efficient quantization schemes for AFM without worrying about a loss of model capacity.

Mixed Precision Quantization

There are residual connections in every Transformer block and every layer of AFM. Therefore, it is unlikely that all layers have the same importance.

Based on this intuition, Apple further reduced memory usage by pushing some layers to 2-bit quantization (the default is 4-bit).

On average, AFM-on-device can be compressed to approximately 3.5 bits per weight (bpw) without significant loss of quality.

In production, Apple chose 3.7 bpw, as this already meets the memory requirements.
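
The average bits-per-weight figure falls directly out of how many parameters sit in 2-bit versus 4-bit layers, as the back-of-the-envelope sketch below shows; the parameter split is a made-up example.

```python
# Sketch: average bits per weight under a mixed 2-bit / 4-bit assignment.
def average_bpw(layer_params: list[int], layer_bits: list[int]) -> float:
    total_bits = sum(p * b for p, b in zip(layer_params, layer_bits))
    return total_bits / sum(layer_params)

# E.g. if a quarter of the weights are quantized to 2 bits and the rest to 4 bits:
params = [25, 75]                  # relative parameter shares
bits = [2, 4]
print(average_bpw(params, bits))   # 3.5 bits per weight on average
```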

Evaluation Results

Pre-training

Table 2 shows the results of AFM-on-device and AFM-server on HELM MMLU v1.5.0, a 5-shot multiple-choice benchmark spanning 57 subjects.


Tables 3 and 4 show the results of AFM-server on the HuggingFace OpenLLM leaderboard V1 and HELM-Lite v1.5.0 benchmarks, respectively.



It can be seen that the AFM pre-training model has strong language and reasoning capabilities, which provides a solid foundation for post-training and feature fine-tuning.

Post-training Human Evaluation

For Apple AI application scenarios, human evaluation is closer to user experience.

To evaluate the general capabilities of the model, the team collected a comprehensive set of 1,393 prompts.

The prompts are quite comprehensive, covering different categories and difficulty levels, including analytical reasoning, brainstorming, chatbot, classification, closed-ended question answering, coding, extraction, mathematical reasoning, open-ended question answering, rewriting, safety, summarization, and writing.

Figure 3 shows the comparison of AFM with open source models (Phi-3, Gemma-1.1, Llama-3, Mistral, DBRX-Instruct) and commercial models (GPT-3.5 and GPT-4).


It was found that human evaluators preferred the AFM model over competitor models.

In particular, despite having a 25% smaller model size, AFM-on-device achieves a 47.7% win rate over Phi-3-mini, and even outperforms the open-source strong baselines Gemma-7B and Mistral-7B, which have more than twice the number of parameters.

Compared with closed-source models, AFM-server also showed certain competitiveness, with a win rate of over 50% and a draw rate of 27.4% against GPT-3.5.
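
For readers unfamiliar with pairwise human evaluation, the win and tie rates quoted here are simply the fractions of prompts on which evaluators preferred one model, preferred the other, or called it a tie; the sketch below tallies them from fabricated example verdicts.

```python
# Sketch: tallying win/tie/loss rates from per-prompt human judgments.
from collections import Counter

def preference_rates(verdicts: list[str]) -> dict[str, float]:
    counts = Counter(verdicts)
    n = len(verdicts)
    return {k: counts[k] / n for k in ("win", "tie", "loss")}

# Fabricated verdicts, purely to show the computation.
verdicts = ["win"] * 52 + ["tie"] * 27 + ["loss"] * 21
print(preference_rates(verdicts))   # {'win': 0.52, 'tie': 0.27, 'loss': 0.21}
```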

Instruction Following

Instruction following (IF) is a core capability that the Apple team has high hopes for in language models, because real-world prompts or instructions are often complex.

Here, the team used the public IFEval benchmark, which evaluates whether large language models can accurately follow the instructions in the prompt when generating responses. These usually include specific requirements for the length, format, and content of the response.
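
Since each IFEval prompt can contain several verifiable instructions, the two accuracy levels differ: instruction-level accuracy counts each instruction separately, while prompt-level accuracy requires every instruction in a prompt to be satisfied. A small sketch with fabricated results:

```python
# Sketch: instruction-level vs prompt-level accuracy in an IFEval-style setup.
def ifeval_accuracy(results: list[list[bool]]) -> tuple[float, float]:
    all_instr = [ok for prompt in results for ok in prompt]
    instruction_level = sum(all_instr) / len(all_instr)
    prompt_level = sum(all(prompt) for prompt in results) / len(results)
    return instruction_level, prompt_level

# Three prompts with 2-3 checked instructions each (made-up booleans).
results = [[True, True], [True, False, True], [True, True, True]]
print(ifeval_accuracy(results))   # (0.875, 0.666...)
```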

As shown in Figure 4, AFM-on-device and AFM-server both perform well in instruction-level and prompt-level accuracy.


In addition, the Apple team also benchmarked the AFM model on the AlpacaEval 2.0 LC benchmark to measure its general instruction following capabilities, and the results showed that its model is very competitive.

Tool Usage

In the tool usage scenario, after receiving a user request and a list of potential tools with descriptions, the model can choose to call a specific tool by providing structured output, specifying the tool name and parameter values.
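
In other words, the model's output is a small structured record rather than free text. The sketch below shows one plausible shape for such a call; the JSON schema and tool definition are assumptions, as the report does not publish Apple's exact format.

```python
# Sketch: validating a structured tool call against a provided tool list.
import json

tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {"city": "string"},
    }
]

# A structured output the model might produce for "What's the weather in Paris?"
model_output = '{"tool": "get_weather", "arguments": {"city": "Paris"}}'

call = json.loads(model_output)
assert call["tool"] in {t["name"] for t in tools}   # tool name must exist
print(call["tool"], call["arguments"])
```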

The team evaluated the model on the public Berkeley Function Calling Leaderboard benchmark using AST metrics with native support for function calls.

As shown in Figure 5, AFM-server performs best in overall accuracy, surpassing Gemini-1.5-Pro-Preview-0514 and GPT-4.


Writing

Writing is one of the most important capabilities of a large language model as it enables a variety of downstream applications such as changing tone, rewriting, and summarizing.

The team evaluated AFM’s writing capabilities on internal summarization and writing benchmarks. Following the LLM-as-a-judge approach, scoring instructions were designed for each summarization and writing task, prompting GPT-4 Turbo to score the model response on a scale of 1 to 10.
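
As a rough illustration of the LLM-as-a-judge setup, the sketch below builds a scoring prompt and parses the 1-10 grade out of the judge's reply; the prompt wording and the ask_judge call are placeholders, since the report does not publish its exact scoring instructions.

```python
# Sketch: LLM-as-a-judge scoring on a 1-10 scale.
import re

def build_judge_prompt(task: str, source_text: str, response: str) -> str:
    return (
        f"You are grading a {task} response.\n"
        f"Source:\n{source_text}\n\nResponse:\n{response}\n\n"
        "Rate the response from 1 (worst) to 10 (best). Reply with the number only."
    )

def parse_score(judge_reply: str) -> int:
    match = re.search(r"\b(10|[1-9])\b", judge_reply)
    if match is None:
        raise ValueError(f"no score found in: {judge_reply!r}")
    return int(match.group(1))

prompt = build_judge_prompt("summarization", "Long email text ...", "Short summary ...")
# reply = ask_judge(prompt)   # hypothetical call to the judge model (e.g. GPT-4 Turbo)
print(parse_score("8"))       # -> 8
```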

As shown in Figure 6, AFM-on-device shows comparable or better performance than Gemma-7B and Mistral-7B. AFM-server significantly outperforms DBRX-Instruct and GPT-3.5, and is even comparable to GPT-4.

It is worth noting that there are some limitations and biases in using LLM scores, such as length bias.


Math

In Figure 7, the team compares AFM's performance on math benchmarks.

The researchers used 8-shot CoT prompts for GSM8K and 4-shot CoT prompts for MATH.

The results show that AFM-on-device significantly outperforms Mistral-7B and Gemma-7B despite being less than half their size.


Summarization Feature

The product team developed a custom set of guidelines, metrics, and a specialized scoring rubric for summaries of emails, messages, and notifications to assess summary quality, using a variety of open source, licensed, and proprietary datasets.

According to the predefined product specifications, if any sub-dimension is rated as “poor”, the summary is classified as “poor”. Similarly, the summary is classified as “good” only when all sub-dimensions are rated as “good”.
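
The aggregation rule is simple to state in code; the sub-dimension names in the sketch below are illustrative, not the product team's actual rubric.

```python
# Sketch: aggregating sub-dimension ratings into an overall summary grade.
def classify_summary(ratings: dict[str, str]) -> str:
    if any(r == "poor" for r in ratings.values()):
        return "poor"            # any poor sub-dimension makes the summary poor
    if all(r == "good" for r in ratings.values()):
        return "good"            # good only if every sub-dimension is good
    return "fair"                # otherwise somewhere in between

print(classify_summary({"composition": "good", "faithfulness": "poor"}))   # poor
print(classify_summary({"composition": "good", "faithfulness": "good"}))   # good
```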

Figure 8 shows that the overall performance of the AFM-on-device+adapter is better than that of Phi-3-mini, Llama-3-8B, and Gemma-7B.


Safety Evaluation

Figure 9 shows the evaluation results of human reviewers on model violations, where lower values are better.

It can be seen that AFM-on-device and AFM-server show strong robustness in dealing with adversarial cues, with violation rates significantly lower than open source and commercial models.


Figure 10 shows the preferences of human reviewers for safety assessment prompts.

The AFM model once again won by providing a safer and more helpful response.


The above is a key glimpse into Apple’s AI model.

When will everyone be able to use Apple's AI capabilities?

Every year, Apple launches new products at its fall conference, and the initial version of iOS 18 will be released simultaneously with the iPhone 16.


However, everyone will have to wait until October to experience it.

References:

https://machinelearning.apple.com/papers/apple_intelligence_foundation_language_models.pdf

https://x.com/BrandonButch/status/1817982978540404776