
Llama 3.1 405B vs. Mistral Large 2: who is the king of open source? | AI review

2024-07-27



Author | Salt and Pepper Rabbit
Email: [email protected]

Two large-scale AI models have been released recently.

On July 23, Meta announced the Llama 3.1 405B model, which supports 8 human languages and is proficient in multiple programming languages, as shown below:


Then on July 24, Mistral AI released its latest model, Mistral Large 2. It supports dozens of human languages and is proficient in more than 80 programming languages, including Python, Java, C, C++, JavaScript, and Bash, as well as some more specialized languages such as Swift and Fortran.


Base64 is an encoding method that converts binary data into text and is often used to transmit binary data over text-based protocols. It is widely used in data preprocessing, model input and output, and data security.


Through Base64 we can evaluate the multilingual processing abilities of AI models: can they accurately understand and translate the encoded information, across different languages and encoding formats? This in turn tests their multilingual translation ability, answer accuracy, and reasoning ability.

Decoding is the reverse of encoding. If an AI model can accurately interpret and process Base64 encoding, or decode the information it carries, it will be better equipped for everyday programming tasks, parsing network data, and even extracting information from complex files.
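To make that round trip concrete, here is a minimal Python sketch using the standard library's base64 module (the example word is our own illustration; the actual test strings come later):

```python
import base64

# Encoding: text -> UTF-8 bytes -> Base64 text
encoded = base64.b64encode("Justice".encode("utf-8")).decode("ascii")
print(encoded)   # SnVzdGljZQ==

# Decoding is the exact reverse: Base64 text -> bytes -> text
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)   # Justice
```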

Today, we will use this seemingly obscure pair of operations, Base64 encoding and decoding, to test the multilingual capabilities of large AI models.

Next, we are going to play a Base64 puzzle game, with a bit of a "detective" spirit.

Although the main contenders are Llama 3.1 405B and Mistral Large 2, we also brought in Qwen2-72B and GPT-4o: one a leading Chinese open-source project, the other the closed-source representative. Can they really handle these "coding challenges" as easily as they handle ordinary language? Let's find out!

Game rules:

We will use Base64-encoded strings for a multilingual test covering Chinese and English. Through this test we can see how each model performs in multilingual translation, answer accuracy, and reasoning ability.

- The test consists of 2 rounds, with 3 questions in each round; each correct answer scores 1 point.

- To ensure fairness, we instruct the models not to decode using code tools.

- Prompt: "This is a Base64 message [ ], please tell me what this message says without using code tools."


First, we need a rough sense of the steps involved in Base64 encoding and decoding.

Base64 encoding represents binary data as a sequence drawn from a specific set of 64 characters (A-Z, a-z, 0-9, +, /). If any step of the decoding process is done incorrectly, or the string is not valid Base64, the decoded result may be wrong or meaningless. To check what a Base64 string actually represents, you can decode it with an online tool or a library in your programming language.
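As a rough sketch of those steps (our own illustration, not any model's actual procedure), a manual decoder that follows the same recipe a model must reason through, without the base64 library, looks like this:

```python
# The 64-character Base64 alphabet, in index order.
B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def manual_b64decode(s: str) -> bytes:
    s = s.rstrip("=")  # '=' is padding and carries no data
    # Step 1: map each character to its 6-bit value in the alphabet
    bits = "".join(format(B64_ALPHABET.index(c), "06b") for c in s)
    # Step 2: regroup the bit string into 8-bit bytes, dropping leftover pad bits
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))

print(manual_b64decode("S2luZG5lc3M=").decode("utf-8"))  # Kindness
```

An error in either step, a wrong alphabet index or a mis-grouped bit, corrupts every byte that follows, which is exactly the failure mode we will see below.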


Round 1: English decoding

This round uses English words to convert to Base64 encoding. The encoded strings are:

Justice: SnVzdGljZQo=

Bravery: QnJhdmVyeQo=

Kindness: S2luZG5lc3M=
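For reference, these strings can be reproduced with a few lines of Python; the "Qo=" / "eQo=" endings of the first two reveal that a trailing newline was encoded along with the word:

```python
import base64

for word in ("Justice\n", "Bravery\n", "Kindness"):
    print(word.strip(), "->", base64.b64encode(word.encode("utf-8")).decode("ascii"))
# Justice -> SnVzdGljZQo=
# Bravery -> QnJhdmVyeQo=
# Kindness -> S2luZG5lc3M=
```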

Let's first run the English test and see how the large models perform. Llama 3.1 405B got all the answers correct, earning 3 points, but all its answers were in English, which is not very friendly for Chinese readers.

That said, it comes with its own signature emoticons. Who doesn't like a "human touch"? It delivers plenty of emotional value.


Mistral Large 2 decoded two of the three English Base64 strings correctly, earning 2 points. In the second question, the original word was "Bravery" but it decoded the string as "brave". The most likely source of the error is in the character-to-index conversion, the index-to-binary conversion, or the regrouping of the binary digits.

To its credit, though, during decoding it first explains the principle, then works through 5 steps of analysis and reasoning before finally producing the answer. It is detailed, clear, and very easy to follow.



ChatGPT-4o's answers were as concise and quick as always, and this time the decoded content was also entirely correct, so it scores 3 points.



Finally, let's look at Qwen2-72B's English decoding: 3 points. All three answers are correct, and it also explains the caveats of the actual encoding process, which is easy to understand and thoughtful.



Round 2: Chinese decoding, will anyone survive?

This round is more difficult: Chinese words are converted to Base64. The encoded strings are:

Justice:

Brave: 5YuH5pWi

Kindness: 5ZaE6Imv

Let's first see how the extra-large Llama 3.1 405B answers:

Across the three questions, Llama 3.1 405B again answered in English, but the words it produced, "Hello World", "Hello", and "Goodbye", were essentially all wrong. It scores 0 points this round.

At first glance, a Base64 conversion would not normally produce results like these unless the original data really was those words. Llama 3.1 405B goes wrong at the second step, its "Base64 character to ASCII" mapping, so every result after that is necessarily wrong.

During decoding, each Base64 character should map to a specific 6-bit binary value. If this character-to-binary mapping is wrong, the decoded result will naturally be wrong too.
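To see what the correct mapping looks like on this round's data, here is a self-contained sketch that walks the "Brave" string (5YuH5pWi) through the proper steps:

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

# Each character maps to a 6-bit value; the bits regroup into bytes.
bits = "".join(format(B64.index(c), "06b") for c in "5YuH5pWi")
raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
print(raw.hex())            # e58b87e695a2
print(raw.decode("utf-8"))  # 勇敢 ("brave")
```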

The interesting thing, though, is that Llama 3.1 405B is the more "human": every answer comes with little emoticons and some modal particles mixed into the text. It really is becoming more and more human-like.



Now let's look at the newly released Mistral Large 2.

After three questions, it did not decode any of the Chinese correctly, scoring 0 points.

Because Mistral Large 2's decoding reasoning is very detailed and spells out each step, it is easier to see exactly which step went wrong: the second step, mapping Base64 characters to binary, fails, so the subsequent reasoning steps are also wrong, and the result necessarily is too.

In this step, the Base64 characters are incorrectly mapped directly to ASCII characters instead of their correct binary values; for example, '5' is mapped to 'H'. This ignores how Base64 actually works: each Base64 character represents a 6-bit binary number, not an ASCII character.
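A small sketch of the difference (the 'H' output above is what the article observed from the model; the snippet simply contrasts the two readings of '5'):

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

# Correct: '5' stands for the 6-bit value 57 in the Base64 alphabet.
print(B64.index("5"), format(B64.index("5"), "06b"))  # 57 111001

# Wrong: reading '5' as an ASCII code point (53) and emitting characters
# directly skips the 6-bit step entirely.
print(ord("5"))  # 53
```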

It seems that this ability needs to be strengthened.



Next, let's look at ChatGPT-4o, which understands Chinese better: it directly gives the decoded content, all of it correct, earning 3 points this round.


Finally, let's look at Qwen2-72B, the strongest domestic contender. Its decoding results were "test", "hello", and "world", essentially all wrong, so it scores 0 this round.

Looking more closely at Qwen2-72B's thinking, its answers contain only a sketch of reasoning: it omits the various conversion steps and jumps straight to a result, which makes errors very likely. In other words, Qwen2-72B's mistakes are concentrated in its understanding of Base64 encoding and its execution of the decoding steps.

For example, it goes directly from the Base64 string to specific Chinese characters, which is implausible: that would require first recovering the correct byte sequence and then interpreting the binary data with the correct character encoding (such as UTF-8).
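A short sketch of that missing last step, using this round's "Kindness" bytes:

```python
# Base64 decoding yields raw bytes; only a character encoding such as
# UTF-8 turns those bytes into Chinese characters.
raw = bytes([0xE5, 0x96, 0x84, 0xE8, 0x89, 0xAF])  # decoded from 5ZaE6Imv
print(raw.decode("utf-8"))    # 善良 ("kindness")

# Interpreting the same bytes with the wrong encoding yields mojibake:
print(raw.decode("latin-1"))  # not meaningful text
```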


The final score is:


Clearly, ChatGPT-4o's 6 points put it far ahead of the other large models. Whether the content is Chinese or English, it easily turns Base64 into meaning we can understand.

Llama 3.1 405B and Qwen2-72B each scored 3 points, performing well on English decoding but falling clearly short on Chinese. Llama 3.1 405B is the more "human" in its responses and offers more emotional value, but its replies default to English, with relatively little Chinese, unless it is explicitly required to reply in Chinese.

Bottom-placed Mistral Large 2 lost a point to one wrong English decode, but its decoding reasoning was very detailed and clear, showing strong step-by-step reasoning, an area where the other models varied widely.

Through this test, we found that large models perform unevenly when decoding across human and programming languages: today's large models are somewhat unbalanced in multilingual processing, with English responses generally accurate and clear, while Chinese responses were far less accurate.


Finally

Encoding is a series of logical transformations that humans apply to information itself in order to transmit it efficiently; we usually think of it as the "language of computers". Yet this test shows that correct encoding and decoding is still a hard problem for large language models. Especially in a multilingual environment, every encoding and decoding pass involves multiple steps and multiple encoding rules; if a single link fails, or even one bit of binary arithmetic is wrong, an accurate answer is impossible.

Overall, GPT-4o is indeed stronger. In this game, at least, Qwen2-72B is evenly matched with Llama 3.1 405B, and it is somewhat surprising that Mistral Large 2 comes last.

If you enjoyed our little game, you are welcome to follow us. If you would like to discuss further, you are also welcome to scan the QR code below and join our community.