2024-08-14
Machine Heart Report
Editor: Zhang Qian, Xiao Zhou
Someone said, "We were expecting strawberry, but they released kale." Let's see what this "kale" is used for.
The programming ability of large models has long attracted attention, and the emergence of the "super AI programmer" Devin pushed the question "Can AI replace programmers?" to the forefront. Recently, Devin gained a new competitor: Genie, an autonomous AI programmer launched by the startup Cosine. The company says Genie easily outperforms Devin, scoring 30% on the third-party benchmark SWE-bench, while Devin scored just 13.8%.
SWE-bench is a benchmark dataset for evaluating an LLM's ability to solve real-world software problems on GitHub. It collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. During testing, the LLM is given a code base and an issue description and must generate a patch that resolves the problem described in the issue. The dataset has been widely used to evaluate AI programming capabilities.
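For readers who want to inspect the data directly, the test set is distributed on Hugging Face. The sketch below assumes the dataset id `princeton-nlp/SWE-bench` and a few of its field names (`repo`, `problem_statement`, `patch`); the exact schema should be confirmed on the dataset card.

```python
# A minimal sketch of inspecting the benchmark with the Hugging Face
# `datasets` library. The dataset id and field names below are assumptions
# based on the public release; check the dataset card for the exact schema.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")  # assumed dataset id
print(len(swe_bench))  # 2,294 Issue-Pull Request pairs in the full test set

sample = swe_bench[0]
print(sample["repo"])               # one of the 12 popular Python repositories
print(sample["problem_statement"])  # the GitHub issue text given to the model
print(sample["patch"])              # the reference solution from the linked PR
```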
As AI programming capabilities evolve, so does this benchmark. This morning, the rumored release of OpenAI's "Strawberry" model failed to materialize once again, but OpenAI did release something new: SWE-bench Verified, an improved version of SWE-bench.
OpenAI pointed out that the original SWE-bench has some problems that can lead to underestimating a model's autonomous software engineering capabilities. They therefore worked with the original authors of SWE-bench to manually screen and improve the benchmark, ensuring that the scope of each unit test is appropriate and that each problem description is clear.
In new tests on SWE-bench Verified, many AI programming agents scored higher than before; UIUC's Agentless solution even doubled its score. OpenAI takes this as evidence that the previous benchmark did underestimate AI programming capabilities.
But for netizens around the world who were waiting for "Strawberry", this release still felt perfunctory. As some put it, "We were expecting strawberries, but they released kale."
Background information about SWE-bench
Each sample in the SWE-bench test set is created from a resolved GitHub issue in one of 12 open source Python repositories. Each sample has an associated pull request (PR) that includes the solution code and unit tests to verify the correctness of the code. These unit tests are called FAIL_TO_PASS tests because they fail before the solution code in the PR is added and pass after it is added. Each sample also includes PASS_TO_PASS tests, which pass both before and after the PR is merged, to check whether the PR breaks features in the code base that are unrelated to the problem.
In SWE-bench, the AI agent is given the original text from the GitHub issue, i.e., the problem statement, and has access to the code repository. Given this information, the agent must edit the files in the code repository to solve the problem.
The edits produced by the AI agent are evaluated by running the FAIL_TO_PASS and PASS_TO_PASS tests. If the FAIL_TO_PASS tests pass, the edit has fixed the issue; if the PASS_TO_PASS tests pass, the edit has not broken unrelated parts of the codebase. To fully resolve the original GitHub issue, both sets of tests must pass.
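In code form, the grading rule reduces to a simple conjunction over the two test sets. The helper below is illustrative only; in the real harness the model's patch is applied inside the repository and the tests are run with the project's own test runner.

```python
# Illustrative only: in the actual harness the model's patch is applied to
# the repository and the tests are executed there; here we just assume a
# mapping from test name to pass/fail outcome.
def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """A sample counts as resolved only if the edit fixes the issue
    (all FAIL_TO_PASS tests now pass) without breaking anything else
    (all PASS_TO_PASS tests still pass)."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())


# Example: the fix works, but it broke an unrelated test, so the sample is not resolved.
print(is_resolved({"test_issue_fixed": True}, {"test_other_feature": False}))  # False
```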
Three directions for improving the robustness and reliability of SWE-bench
To improve the robustness and reliability of SWE-bench, the development team identified three main directions for improvement: ensuring that unit tests do not unfairly reject valid solutions, ensuring that problem descriptions are clearly specified rather than ambiguous, and making the evaluation environment easier to set up reliably.
SWE-bench Verified
To address these issues, OpenAI launched a manual annotation campaign in which professional software developers screened every sample in the SWE-bench test set, checking that the scope of the unit tests is appropriate and that the problem description is clear and unambiguous.
Together with the authors of SWE-bench, they released SWE-bench Verified: a subset of the original SWE-bench test set containing 500 samples that have been verified by human annotators. This version replaces the original SWE-bench and SWE-bench Lite test sets. They are also releasing the human annotations for all SWE-bench test samples.
They also worked with the SWE-bench authors to develop a new evaluation tool that uses a containerized Docker environment to make evaluation on SWE-bench easier and more reliable.
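As a rough idea of how the Docker-based tooling is driven, the invocation below is modeled on the open source `swebench` package; the module path, flag names, and dataset id are assumptions and should be checked against the project's README.

```python
# A hedged sketch of invoking the containerized evaluation harness.
# The module path `swebench.harness.run_evaluation`, its flags, and the
# dataset id are assumptions based on the open source tooling, not a
# verified interface.
import subprocess

subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Verified",  # assumed dataset id
        "--predictions_path", "predictions.jsonl",             # model-generated patches
        "--max_workers", "4",                                  # parallel Docker builds/runs
        "--run_id", "verified-eval",
    ],
    check=True,
)
```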
Improvement methods
OpenAI worked with 93 software developers experienced in Python to manually screen SWE-bench samples, annotating 1,699 random samples from the SWE-bench test set and ultimately producing SWE-bench Verified.
The annotation is meant to ensure that the test is fair and accurate. Specifically, it focuses on two key points: first, whether the problem description is detailed enough, so that overly vague descriptions do not make the test unfair; second, whether the FAIL_TO_PASS unit tests would incorrectly filter out valid solutions.
Each annotation criterion is assigned a label from [0, 1, 2, 3] in order of increasing severity. Labels 0 and 1 are minor; labels 2 and 3 are severe, indicating that the sample is inadequate in some way and should be discarded.
Additionally, OpenAI assessed the difficulty of each sample by asking annotators to estimate how long it would take a developer to decide on and implement a solution, assuming the sample had no problems. Finally, OpenAI provided a free-form input option to flag any other major issues with the sample.
To build SWE-bench Verified, OpenAI filtered out any samples with a problem statement or FAIL_TO_PASS unit test severity of 2 or above from the original test set, and also filtered out all samples marked with other severe issues.
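The filtering rule itself is straightforward. The sketch below assumes each annotated sample carries the two severity labels and an "other major issue" flag described above; the field names are hypothetical stand-ins for the released annotation schema.

```python
# Sketch of the severity-based filter described above. The field names are
# hypothetical stand-ins for whatever schema the released annotations use.
def keep_sample(annotation: dict) -> bool:
    """Keep a sample only if neither criterion is rated severe (label 2 or 3)
    and the annotators flagged no other major issue."""
    return (
        annotation["problem_statement_severity"] < 2
        and annotation["fail_to_pass_severity"] < 2
        and not annotation["other_major_issue"]
    )


example = {
    "problem_statement_severity": 1,  # minor ambiguity in the issue text
    "fail_to_pass_severity": 0,       # unit tests are fine
    "other_major_issue": False,
}
print(keep_sample(example))  # True: this sample would make it into Verified
```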
Annotation results
By the new standard, a large portion of the samples in the original SWE-bench do not qualify. As shown in the figure, 38.3% of samples were flagged because the problem statement was not clear enough, and 61.1% were flagged because the unit tests might unfairly mark valid solutions as incorrect (severity 2 and 3 combined). Overall, the annotation process filtered out 68.3% of SWE-bench samples due to unclear problem statements, unfair unit tests, or other issues.
The figure below compares the difficulty distribution of the original SWE-bench dataset with that of the new SWE-bench Verified dataset. The difficulty distribution of SWE-bench was estimated from the random subset of 1,699 samples.
As the figure shows, in the original SWE-bench dataset the estimated completion time for most (77.8%) samples is less than one hour of work for an experienced software engineer. SWE-bench Lite and the new SWE-bench Verified dataset further increase this proportion, with fewer than 10% of problems estimated to take more than one hour to solve. However, the mechanism behind this change is very different: SWE-bench Lite subsampled the original dataset to make the benchmark easier, while SWE-bench Verified attempts to remove infeasible samples from the dataset.
Performance of each agent on SWE-bench Verified
On the new SWE-bench Verified dataset, the development team tested the performance of GPT-4o using several open source scaffolds that performed well on the original SWE-bench leaderboard.
GPT-4o's performance with the best-performing scaffold reached 33.2% on SWE-bench Verified, more than double its 16% score on the original SWE-bench. Overall, this confirmed OpenAI's initial suspicion that the original SWE-bench underestimated the capabilities of the agents.
It is worth noting that the jump from SWE-bench Lite to SWE-bench Verified is less pronounced, because SWE-bench Lite was already filtered to be easier than the full dataset.
Performance analysis by difficulty
Part of the improved performance on SWE-bench Verified may simply reflect a test distribution skewed toward easier samples.
OpenAI investigated this by plotting performance stratified by difficulty. If the new dataset merely shifted the difficulty distribution toward easier samples, the stratified performance within each difficulty class would not change, which is what happens when going from the original SWE-bench to SWE-bench Lite.
Instead, OpenAI observed that when switching to SWE-bench Verified, the agent's performance improved across difficulty categories. This is consistent with the expected effect of removing impossible samples from all categories, rather than simply removing difficult ones.
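A stratified view like this can be produced by grouping per-sample results by the annotated difficulty bucket and computing the resolution rate within each group. The records and field names below are illustrative, not OpenAI's actual data.

```python
# Sketch of stratifying the resolution rate by annotated difficulty bucket.
# The records and field names are illustrative, not OpenAI's actual results.
from collections import defaultdict

records = [
    {"difficulty": "<15 min",         "resolved": True},
    {"difficulty": "<15 min",         "resolved": True},
    {"difficulty": "15 min - 1 hour", "resolved": False},
    {"difficulty": "1 - 4 hours",     "resolved": False},
]

totals, solved = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["difficulty"]] += 1
    solved[r["difficulty"]] += int(r["resolved"])

for bucket, n in totals.items():
    print(f"{bucket}: {solved[bucket] / n:.0%} resolved ({solved[bucket]}/{n})")
```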
Reference link: https://openai.com/index/introducing-swe-bench-verified/