Apple Intelligence has a major security flaw that can be broken with just a few lines of code! Karpathy issued a warning

2024-08-15

New Intelligence Report

Editor: Ear Qiao Yang

【New Wisdom Introduction】Apple Intelligence is about to be launched, and a guy exposed the security flaws of Apple Intelligence with a few lines of code.

At the 2024 Worldwide Developers Conference (WWDC), Apple released Apple Intelligence, an AI feature that will be included in iOS 18.1.

As the official launch is about to take place in October, a "civilian expert" has discovered a major flaw in the Beta version of Apple Intelligence provided by MacOS 15.1.

Developer Evan Zhou successfully manipulated Apple Intelligence using prompt injection, bypassing intended instructions and allowing the AI to respond to arbitrary prompts.

It turns out that it, like other AI systems based on large language models, is vulnerable to a "cue word injection attack." Developer Evan Zhou demonstrated this vulnerability in a YouTube video.

What is a hint word injection attack?

There is an organization called OWASP, which stands for Open World Application Security Project, and they analyzed the main vulnerabilities that large language models may face. Guess what they ranked first? That’s right, prompt word injection.

Prompt Injection Attack is a new type of attack method with different forms, including prompt word injection, prompt word leakage and prompt word jailbreak.

This attack occurs when an attacker manipulates an AI to cause the model to perform unintended actions or leak sensitive information. Such manipulation can cause the AI to misinterpret malicious input as legitimate commands or queries.

With the widespread use of Large Language Models (LLMs) by individuals and businesses, and the continued advancement of these technologies, the threat of hint injection attacks is increasing significantly.

So how does this happen in the first place? Why are systems vulnerable to this type of attack?

In fact, in traditional systems, developers will pre-set programs and instructions, and they will not change.

Users can enter their information, but the program's code and the input remain separate.

However, this is not the case for large language models. That is, the boundary between instructions and inputs becomes blurred because large models often use the inputs to train the system.

Therefore, the encoding and input of the large language model do not have clear and definite boundaries as in the past. This gives it great flexibility, but it also makes it possible for the model to do something it shouldn't do.

Bruce Schneier, a technology security expert and lecturer at Harvard Kennedy School, published an article in May in Communications of the ACM detailing this security issue with LLM. In his words, this stems from "not separating the data and control paths."

Hint injection attacks can lead to data leakage, generation of malicious content, and spread of false information.

A prompt injection attack occurs when an attacker cleverly constructs input instructions to manipulate an AI model, thereby tricking it into revealing confidential or sensitive information.

This risk is particularly acute in models trained on datasets containing proprietary or personal data. Attackers can exploit the model’s natural language processing capabilities to craft instructions that appear harmless on the surface but are actually designed to extract specific information.

With careful planning, attackers can trick the model into generating responses that contain personal details, internal company operations, or even security protocols embedded in the model’s training data.

Such data breaches not only violate personal privacy, but also pose a significant security threat, which may lead to potential financial losses, reputational damage, and legal disputes.

Back to Zhou's case, Zhou's goal was to manipulate the "rewrite" function of Apple Intelligence, that is, to rewrite and improve user-input text.

During the operation, Zhou discovered that a simple "ignore previous instructions" command failed.

If this is an "airtight" LLM, it would be relatively difficult to continue digging. But coincidentally, the Apple Intelligence prompt template was recently dug out by a Reddit user.

From these templates, Zhou discovered a special token that was used to separate the AI system role from the user role.

Using this information, Zhou created a prompt that overwrote the original system prompt.

He terminated the user character prematurely, inserted a new system prompt, instructed the AI to ignore the previous instructions and respond to the following text, and then triggered the AI's response.

After some experimentation, the attack worked: Apple Intelligence responded with information Zhou had not requested, meaning the tip injection attack worked. Zhou published his code on GitHub.

Twitter users break GPT-3

The prompt injection issue has been known since at least the release of GPT-3 in May 2020, but has not been resolved.

Remoteli.io, a bot based on the GPT-3 API, fell victim to this vulnerability on Twitter. The bot was supposed to automatically post remote jobs and respond to remote work requests.

Yet, with the above prompt, the Remoteli bot became the butt of jokes among some Twitter users: they forced the bot to say sentences it would not have said based on its original instructions.

For example, the bot threatens users, taking full responsibility for the Challenger space shuttle disaster, or defaming a U.S. congressman as a serial killer.

In some cases, the bot spread false news or posted content that violated Twitter's policies and should have led to its expulsion.

Data scientist Riley Goodside was the first to recognize the problem and describe it on Twitter.

By inserting prompts into the sentences being translated, Goodside showed how vulnerable GPT-3-based translation robots are to attack.

British computer scientist Simon Willison discussed this security issue in detail on his blog, naming it "prompt injection."

Willison found that the hint injection instructions for large language models could cause all sorts of weird and potentially dangerous things. He went on to describe various defense mechanisms, but ultimately dismissed them. At this point, he doesn't know how to reliably close the security hole from the outside.

Of course, there are ways to mitigate these vulnerabilities, for example, using rules that search for dangerous patterns in user input.

But there is no such thing as 100% security. Willison said that every time a large language model is updated, the security measures taken must be re-examined. In addition, anyone who can write a language is a potential attacker.

“Language models like GPT-3 are the ultimate black box. No matter how many automated tests I write, I can never be 100% sure that a user won’t come up with some prompt word I didn’t anticipate that will subvert my defenses,” Willison wrote.

Willison believes that separating instruction input from user input is a possible solution, which is what the ACM article mentioned above refers to as "separation of data and control paths." He believes that developers will eventually be able to solve the problem, but wants to see research to prove that the method is indeed effective.

Some companies deserve credit for taking steps to make prompt injection attacks relatively difficult.

When Zhou cracked Apple Intelligence, he also needed to find a special token through the backend prompt template; in some systems, prompt injection attacks can be as simple as inserting a corresponding length of text in a chat window or in an input image.

In April 2024, OpenAI introduced the instruction hierarchy as a countermeasure. It assigns different priorities to instructions from developers (highest priority), users (medium priority), and third-party tools (low priority).

The researchers distinguished between "aligned instructions" (which match higher-priority instructions) and "misaligned instructions" (which contradict higher-priority instructions). When instructions conflict, the model follows the highest-priority instruction and ignores conflicting lower-priority instructions.

Even with countermeasures in place, systems like ChatGPT or Claude can still be vulnerable to prompt injection in some cases.

LLM also has a SQL injection vulnerability

In addition to the prompt word injection attack, Andrej Karpathy recently pointed out on Twitter another security vulnerability in LLM, which is equivalent to the traditional "SQL injection attack".

The LLM tokenizer parses the special tokens in the input string (such as<|endoftext|>etc.), although direct input seems convenient, it may cause trouble at best and security issues at worst.

Always remember that user input strings cannot be trusted!!

Just like SQL injection attacks, hackers can cause models to behave in unexpected ways through carefully constructed inputs.

Karpathy then provided a set of examples on Huggingface using the default values of the Llama 3 tokenizer and found two strange things:

1、<|beginoftext|>token (128000) is added to the front of the sequence;

2. Parse from the string<|endoftext|> Marked as a special token (128001). Text input from the user may now disrupt the token specification, making the model output uncontrolled.

Karpathy gave two suggestions:

Always use the two additional flag values, (1) add_special_tokens=False and (2) split_special_tokens=True, and add special tokens yourself in your code.

For the chat model, you can also use the chat template apply_chat_template.

According to Karpathy's method, the output word segmentation results look more correct.<|endoftext|> is treated as an arbitrary string rather than a special token, and is broken down by the underlying BPE tokenizer like any other string:

In summary, Karpathy believes that encode/decode calls should never handle special tokens by parsing strings, and this functionality should be deprecated completely and only added explicitly programmatically through a separate code path.

Currently, these issues are difficult to find and poorly documented, and it is estimated that about 50% of the code currently has related issues.

In addition, Karpathy discovered that even ChatGPT has this bug.

In the best case it will just delete the token spontaneously, in the worst case LLM will not understand what you mean and will not even be able to repeat the output as instructed<|endoftext|> This string:

Some netizens raised questions in the comment area. If the code is written correctly, but the training data is input<|endoftext|> what happens?

Karpathy responded that if the code was correct, nothing would happen. The problem is that a lot of code might not be correct, which would quietly break their LLM.

Finally, in order to avoid security issues caused by LLM vulnerabilities, Karpathy reminds everyone: Be sure to visualize your tokens and test your code.

References:

https://the-decoder.com/apple-intelligence-in-macos-15-1-beta-1-is-vulnerable-to-a-classic-ai-exploit/

news

Apple Intelligence has a major security flaw that can be broken with just a few lines of code! Karpathy issued a warning

Introduction

My contact information