
Are VLMs collectively "blind"? GPT-4o and Claude 3.5 fail the vision test miserably

2024-07-16



New Intelligence Report

Editor: Editorial Department

【New Intelligence Introduction】Visual language models collectively "crash" on the most basic visual tasks: even simple shape recognition trips them up. Perhaps these most advanced VLMs have not yet developed true visual capabilities?

The latest round of language models, such as GPT-4o and Gemini 1.5 Pro, were billed as "natively multimodal" at release, able to understand multiple forms of input, including images, audio, and text.

These multimodal LLMs use terms such as “visual capability” and “visual understanding” in related introductions, marketing, and even academic papers.

This seems to suggest that the model can see and understand things in a sense that matches that of humans.

So let's ask: if we gave these visual language models an eye exam, would they turn out to have perfect eyesight (a 5.2 on the Chinese acuity chart), be severely nearsighted, or be unable to see anything at all?

A new study suggests that large language models don’t actually have the human-like vision capabilities that people had hoped for. In fact, they’re blind.

Researchers from Auburn University and the University of Alberta tested four of today’s most advanced multimodal models on a series of very simple visual tasks and found that the results were disappointing.

These tasks are extremely simple for humans, such as whether two shapes overlap, how many pentagons are in a picture, or which letter in a word is circled.

However, the vision of these advanced models is at best "nearsighted", with details reduced to a blur. At worst, a model behaves like an intelligent blind person making educated guesses.


Paper address: https://arxiv.org/pdf/2407.06581

7 major tasks

Now the vision test officially begins: the VLMs need to complete 7 small tasks.


Anh Nguyen, co-author of the paper, emphasized that "our seven tasks are so simple that humans can perform them with 100% accuracy."

So, how will the AI models perform when faced with questions that even first-graders can answer correctly?


Task 1: How many intersection points do two broken lines have?

Given that VLMs have performed amazingly well on previous chart-related benchmarks (for example, Claude 3.5 Sonnet scores 94.7% on AI2D and 90.8% on ChartQA), we can reasonably infer that this type of problem should not be difficult for them.

As shown in the figure below, a total of 150 line charts are drawn on a white canvas, each consisting of two polylines, where each polyline is defined by three points.

The x-coordinates of these three points are fixed and equidistant, and the y-coordinates are obtained by random sampling, thus creating two polylines with 0, 1, or 2 intersection points.
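To make the setup concrete, here is a minimal Python sketch of how one such chart and its ground-truth answer could be generated. The canvas size, value range, and plotting details are illustrative assumptions, not the authors' exact generator.

```python
# Minimal sketch (not the authors' code): generate one Task-1 style image and
# its ground-truth number of intersections between two 3-point polylines.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def make_chart(path="task1_example.png"):
    x = np.array([0.0, 1.0, 2.0])            # fixed, equidistant x-coordinates
    y_blue = rng.uniform(0, 1, size=3)        # random y-coordinates, blue line
    y_red = rng.uniform(0, 1, size=3)         # random y-coordinates, red line

    # Ground truth: on each of the two segments the piecewise-linear difference
    # changes sign exactly once per crossing. With continuous random sampling,
    # exact ties have probability ~0 and are ignored here.
    diff = y_blue - y_red
    crossings = int(np.sum(diff[:-1] * diff[1:] < 0))   # 0, 1, or 2

    fig, ax = plt.subplots(figsize=(4, 4))
    ax.plot(x, y_blue, color="blue", linewidth=2)
    ax.plot(x, y_red, color="red", linewidth=2)
    ax.set_axis_off()
    fig.savefig(path, dpi=150, facecolor="white")
    plt.close(fig)
    return crossings

print(make_chart())   # ground-truth answer for the saved image
```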


The experiment asked each model the question in two different wordings, such as "How many times do the blue and red graphs cross each other?" and "How many times do the blue and red lines cross each other?"

By calculating the average accuracy of each model in answering these two types of questions, we can eliminate some of the influence of prompts and achieve more accurate results.


By comparison, Sonnet-3.5 does somewhat better on this task, with an average accuracy of 77.33%, while the other models perform poorly.

Although 77.33% sounds like a decent score, there are only three possible answers (0, 1, and 2), so random guessing already achieves about 33% accuracy.

It is worth noting that VLMs tend to perform worse as the gap between the two polylines narrows. In short, VLMs cannot reliably identify and count line intersections.


Task 2: Problems of intersection, tangency and separation of circles


This problem belongs to junior-high geometry: circles that intersect, are tangent, or are separate (no one forgets the sight of a teacher drawing a perfect circle freehand on the blackboard).

However, the VLMs are not examined in those terms; instead they face a simple test of overlapping shapes, arguably one of the simplest visual reasoning tasks imaginable.
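For reference, the ground truth here is pure arithmetic: assuming, as in the paper's figures, two circles of equal size placed side by side, their relationship follows from comparing the distance between their centers with the sum of their radii. A minimal sketch:

```python
# Illustrative ground-truth check for Task 2 (not the paper's code).
import math

def circle_relation(c1, r1, c2, r2, eps=1e-9):
    """Classify two circles as overlapping, touching, or separate."""
    d = math.dist(c1, c2)                    # distance between centers
    if d < r1 + r2 - eps:
        return "overlapping"
    if abs(d - (r1 + r2)) <= eps:
        return "touching"
    return "separate"

print(circle_relation((0, 0), 1.0, (1.5, 0), 1.0))   # overlapping
print(circle_relation((0, 0), 1.0, (2.0, 0), 1.0))   # touching
print(circle_relation((0, 0), 1.0, (3.0, 0), 1.0))   # separate
```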

Unfortunately, whether the two circles slightly overlap, just touch, or sit some distance apart, the models cannot consistently make the correct judgment.


For example, when the two circles were far apart, GPT-4o answered correctly more than 95% of the time; but when the gap between them was zero or very small, it was correct only 18% of the time, below the 50% expected from random guessing.


Gemini Pro 1.5 performed the best, with an average accuracy of 92.78%, but the accuracy was only 70% when the two circles were close.


Task 3: Identify the circled letters

The letters in the word are circled with a red circle ⭕, one at a time, and the task requires the VLM to identify the circled letters.

Obviously, this task is easy for humans, but the authors’ hypothesis is that if the VLM’s vision is blurred, it may not be able to identify the exact letters that are circled because of the small spacing between adjacent letters.


The words Acknowledgement, Subdermatoglyphic, and the string tHyUiKaRbNqWeOpXcZvM were chosen because they contain characters of varying widths and heights. (Fun fact: subdermatoglyphic is the longest word with no repeated letters.)
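A rough sketch of how such a test image could be produced with Pillow is shown below; the font, sizes, and oval padding are illustrative assumptions rather than the paper's exact settings.

```python
# Illustrative Task-3 generator sketch: render a word on a white canvas and
# draw a red oval around one letter (font and padding are assumptions).
from PIL import Image, ImageDraw, ImageFont

def circle_letter(word, index, path="task3_example.png"):
    font = ImageFont.load_default()            # default font; the paper's fonts may differ
    img = Image.new("RGB", (600, 120), "white")
    draw = ImageDraw.Draw(img)

    x, y = 20, 50
    for i, ch in enumerate(word):
        left, top, right, bottom = draw.textbbox((x, y), ch, font=font)
        if i == index:
            # red oval slightly larger than the letter's bounding box
            draw.ellipse((left - 4, top - 6, right + 4, bottom + 6),
                         outline="red", width=2)
        draw.text((x, y), ch, fill="black", font=font)
        x = right + 2                           # advance past this letter
    img.save(path)

circle_letter("Subdermatoglyphic", index=5)     # circles the "r"
```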

The experiment found that although the VLMs can accurately recognize the shape of the red oval and can spell the word perfectly, "reading the circled letter" is difficult for all models. For example, when the letter is slightly covered by the red oval, the VLMs often misidentify it.


When an error occurs, the VLM often predicts letters that are adjacent to the circled letter.

Sometimes the models would hallucinate: despite being able to spell the word accurately, they would answer with characters that do not exist in Subdermatoglyphic (e.g. 9, n, ©).


With the exception of GPT-4o, all models scored slightly higher (between 2 and 6 points) on the two English words than on the random string, suggesting that familiarity with the word itself may help a VLM make a more educated guess.

Gemini-1.5 and Sonnet-3.5 are the top two models (92.81% and 89.22%), nearly 20 points higher than GPT-4o and Sonnet-3.

In summary, a VLM may be able to guess the circled letter from the spelling of the word, which slightly improves accuracy, but that does not mean it can actually see the letter inside the red oval.

Task 4: Interlocking Problems

Next, the VLMs face an "interlocking" problem: counting the number of interlocked circles in an image.

Cue the background music: "Ah~ Fifth Ring, you have one more ring than the Fourth Ring~" (a nod to a popular Chinese novelty song about Beijing's ring roads).


The results of this test were a bit strange: when there were five rings in the image, the models were 100% correct; as soon as one more ring was added, the VLMs were completely thrown off.


Gemini lost its way and answered incorrectly every time, Sonnet-3.5 was correct a third of the time, and GPT-4o was correct nearly half the time.


The author suggests that the high accuracy of identifying the "five rings" is closely related to the common Olympic "five rings" logo.

As can be seen from Table 5 of the paper, the four models answer "5" far more often when counting circles than when counting pentagons.


This test shows that whatever these models are doing, they don’t have what we humans understand as “vision.” The main problem is that their performance is very unstable, with huge differences in recognition success rates across images with different numbers and shapes.


Task 5: Nested Squares

Task 2 showed that VLMs have difficulty judging intersecting circles, so how would they perform if each square were nested entirely inside a larger square, so that their edges never intersect?

As shown in the figure below, on a canvas of size C×C, the authors render N∈{2,3,4,5} nested squares.


The outermost square is rendered first with a random side length d and a line thickness drawn from {2,3,4}px. Each of the remaining N-1 squares has a side length of 0.75× that of the square enclosing it and is placed at random coordinates so that it does not touch the outer square.

The authors generated 10 images (with the squares in different random positions) for each of the three line-thickness settings, and repeated the process for all values of N, for a total of 120 images.
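A minimal sketch of this generation procedure, with the canvas size, side-length range, and margins chosen for illustration rather than taken from the paper:

```python
# Illustrative nested-squares generator for Task 5 (not the authors' code).
import random
from PIL import Image, ImageDraw

def nested_squares(n, canvas=512, thickness=3, path="task5_example.png"):
    img = Image.new("RGB", (canvas, canvas), "white")
    draw = ImageDraw.Draw(img)

    side = random.uniform(0.5, 0.8) * canvas          # outermost square
    x = random.uniform(0, canvas - side)
    y = random.uniform(0, canvas - side)

    for _ in range(n):
        draw.rectangle((x, y, x + side, y + side),
                       outline="black", width=thickness)
        new_side = 0.75 * side                         # shrink factor 0.75
        margin = thickness + 2                         # keep squares apart
        x = random.uniform(x + margin, x + side - new_side - margin)
        y = random.uniform(y + margin, y + side - new_side - margin)
        side = new_side
    img.save(path)

nested_squares(n=4)   # ground-truth answer: 4 nested squares
```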

It can be found that calculating the number of nested squares is a difficult task for VLM to complete accurately.


Model accuracy varies greatly, with GPT-4o (48.33%) and Sonnet-3 (55.00%) lagging behind Gemini-1.5 (80.00%) and Sonnet-3.5 (87.50%) by at least 30 points.


Task 6: How many columns and rows does the table have?

The results of the previous tasks showed that VLMs cannot cope with overlap (Task 4) or nesting (Task 5), so the authors changed tack to see how they perform on problems involving adjacent shapes.

The authors arranged the squares into a grid and asked VLMs to count them. These VLMs have performed well in DocVQA (≥ 90% accuracy), which contains many questions with tables, so this task should be easy for VLMs.

To simplify the task, the authors only ask the model to count the number of rows and columns in a given table.
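For illustration, a blank grid of the kind used in this task can be drawn in a few lines; the canvas and cell sizes below are assumptions, not the paper's values:

```python
# Illustrative Task-6 grid generator: draw a rows x cols table of blank cells.
from PIL import Image, ImageDraw

def draw_grid(rows, cols, cell=60, path="task6_example.png"):
    img = Image.new("RGB", (cols * cell + 1, rows * cell + 1), "white")
    draw = ImageDraw.Draw(img)
    for r in range(rows + 1):                 # horizontal lines
        draw.line((0, r * cell, cols * cell, r * cell), fill="black", width=2)
    for c in range(cols + 1):                 # vertical lines
        draw.line((c * cell, 0, c * cell, rows * cell), fill="black", width=2)
    img.save(path)
    return rows, cols                         # ground-truth answer

print(draw_grid(4, 6))   # a 4-row, 6-column blank grid
```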


It turns out that the models consistently fail to count the rows and columns of a blank grid correctly.


However, when text is included in the grid cells, the performance of all VLMs improves, especially Sonnet-3.5.


Task 7: Identify the roadmap

This task tests the VLM's ability to identify color-coded paths and follow the colored lines from a given starting point to a destination, an important ability required for reading and understanding maps.

As shown in the figure below, a subway map is created on an image of size C×C, where C∈{512, 1024}px.

Four station names (A, B, C, D) are written at 4 fixed coordinates. The canvas is divided into an invisible grid of 18×18 cells, and 3 path starting points are initialized at a distance of C/18 px from each station.

A depth-first search algorithm draws a path starting from a random station and a random starting point, where each step can move one cell in any direction. This process is repeated so that each station has N∈{1,2,3} outgoing paths, for a total of 180 maps.
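The path-drawing step can be sketched as a randomized depth-first search on the invisible grid. The snippet below returns one cell-by-cell path between two grid positions, which would then be rendered as a single-colored polyline; station placement, colors, and rendering are omitted, and the details are illustrative rather than the authors' implementation.

```python
# Illustrative randomized DFS for Task 7: find one path between two cells of an
# invisible 18x18 grid, moving one cell at a time.
import random

def dfs_path(start, goal, size=18, seed=None):
    rng = random.Random(seed)
    stack, visited = [(start, [start])], {start}
    while stack:
        (r, c), path = stack.pop()
        if (r, c) == goal:
            return path                        # list of grid cells to draw
        moves = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        rng.shuffle(moves)                     # randomize the path's shape
        for nr, nc in moves:
            if 0 <= nr < size and 0 <= nc < size and (nr, nc) not in visited:
                visited.add((nr, nc))
                stack.append(((nr, nc), path + [(nr, nc)]))
    return None

print(dfs_path((0, 0), (17, 17), seed=1))      # one path across the grid
```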


Given two designated stations, the task requires the VLM to calculate how many different colored paths there are between the two stations.

The experimental results show that even when there is only one colored path between two stations, no model achieves 100% accuracy.


Sonnet-3.5 has the highest accuracy, reaching 95% when there is only one path; however, when there are two paths, its accuracy drops sharply to just 50.18%.


Most VLMs show significant performance degradation as path complexity increases from 1 path to 3 paths.

VLM’s “blindness”

Why do the VLMs perform so erratically on the visual tests above?

Perhaps we can find clues from the model's preference for the "Olympic Rings" in Task 4. The most common sense explanation is:

In VLM's training data, images of the "Olympic Rings" appear repeatedly and are described in detail in many text materials.

However, the training data presumably contains few, if any, images of 6 or 7 interlocking rings, which is why the models' answers fall apart there.

Because VLMs probably don't even know what they're "seeing," nor do they really understand what rings, overlaps, or any other concepts are.

However, even if we use the word "blind" to describe the model's performance, we are still anthropomorphizing the model, just like the big model companies promote their "visual ability".

Researcher Nguyen also said that "blindness" has many definitions even for humans, and there is currently no word to describe the blindness/insensitivity of artificial intelligence to the images we show it.

The behavior of a VLM is a complex function of the input text prompt, the input image, and billions of weights, and there is currently no technology that can visualize exactly what the model sees.

He speculated that these models are not completely "blind", they can only extract "approximate" and abstract visual information from images, but cannot make visual judgments, and therefore behave like a person who understands the image but cannot actually see it.

Nguyen uses an example to support the above hypothesis:


User: What color is the area where the two circles overlap?

GPT-4o: The area where the two circles overlap is turquoise (a shade of cyan).

Gemini-1.5: The area where the green circle overlaps with the blue circle will be cyan.

Sonnet-3.5: I'm sorry, but there are only two separate circles in the image, one green and one blue, and there is no overlapping area.

Obviously, GPT-4o and Gemini-1.5 are merely "imagining" the image rather than actually "seeing" it; only Sonnet-3.5 answers based on what is actually there.

So, does this research mean that these “vision” AI models are useless?

This is not the case. Each of these models demonstrates high accuracy on a wide range of tasks, such as recognizing human actions and expressions, everyday objects, and environmental photographs.

The significance of this research is that it helps us dispel the overly “anthropomorphic” marketing strategy of VLM.

If we believe the marketing rhetoric of the tech giants, we might actually think that big visual models can "see".

But with just a few small tests, we can easily find the essential difference between VLM and humans. Its "anthropomorphism" actually highlights its inhuman nature.

References:

https://arxiv.org/abs/2407.06581

https://techcrunch.com/2024/07/11/are-visual-ai-models-actually-blind/?_refluxos=a10

https://vlmsareblind.github.io/