SWE-bench Pro vs Verified: Why AI Scores Don't Compare

SWE-bench Verified and SWE-bench Pro are two different tests of the same skill: whether an AI model can fix a real software bug on its own. Verified is a cleaner, easier 500-task subset; Pro is a harder, contamination-resistant benchmark. The gap between them is not small. DeepSeek V4-Pro scores 80.6 on SWE-bench Verified but only 55.4 on SWE-bench Pro — the exact same model, two different tests, a 25-point drop. That single fact is why a Verified score and a Pro score can never be compared head to head, no matter how similar the names look.

This guide explains what each benchmark actually measures, why the same model lands so far apart on the two, and how to read an AI coding announcement without fooling yourself. It is the reference we point back to every time we publish a model comparison, because getting this wrong is the most common mistake in AI coding coverage — including from people who should know better.

What SWE-bench Actually Measures

SWE-bench is a benchmark that tests whether a language model can resolve a real, previously unsolved software issue. It was created by a team at Princeton NLP — Carlos E. Jimenez, John Yang, and colleagues — and presented as an oral paper at ICLR 2024. Instead of asking a model to write a small function from a prompt, SWE-bench hands it an entire codebase and a genuine GitHub issue, then asks it to produce a patch that fixes the problem.

The original SWE-bench contains 2,294 task instances built from real GitHub issues and their merged fixes across a dozen popular open-source Python projects such as Django, scikit-learn, and SymPy. Each task is graded automatically: the model's patch is applied inside a Docker container, the project's own hidden unit tests are run, and the task counts as solved only if those tests pass. There is no partial credit and no human judging the "vibe" of the answer — either the tests go green or they do not.

This design is what makes SWE-bench meaningful. It rewards the messy, real-world skill that matters in production: reading unfamiliar code, locating the true source of a bug, and editing the right lines without breaking anything else. It is also why a single headline number carries so much weight in model launches — and why the fine print underneath it matters even more.

The SWE-bench family — the original 2,294-task benchmark, the 500-task Verified subset, and the harder 1,865-task Pro benchmark — One family, three tests. Verified is a curated subset of the original; Pro is a separate, harder benchmark. Illustration by ThePlanetTools.ai.

SWE-bench Verified: The Cleaner, Easier Subset

SWE-bench Verified is a 500-task subset of the original benchmark, released on August 13, 2024 in collaboration with OpenAI. Its purpose was to fix a quality problem in the original set: some of the 2,294 tasks were underspecified, some had broken or flaky tests, and a few were effectively impossible to solve from the issue text alone. Professional software engineers reviewed the tasks by hand and kept only 500 that were confirmed solvable, well-specified, and reliably graded.

Removing the noise made SWE-bench Verified more trustworthy, but it also made it easier. When you strip out the ambiguous and unsolvable tasks, the remaining ones are, by definition, the ones a capable model can actually crack. Scores rose accordingly. By 2026, frontier models routinely post SWE-bench Verified numbers in the high 80s — Claude Opus 4.8 reports 88.6 and DeepSeek V4-Pro reports 80.6 — and the leaderboard is crowded near the top. In benchmarking terms, Verified has become saturated: the best models are bunched together, and small differences say more about test noise than real capability.

None of this makes Verified a bad benchmark. It is clean, widely reported, and useful for tracking progress over time. But a saturated, curated, easier test is a very different measuring stick from a hard one, and treating the two as interchangeable is exactly the trap this article exists to prevent.

SWE-bench Pro: Harder and Built to Resist Contamination

SWE-bench Pro is a separate, more demanding benchmark created by Scale AI to address the weaknesses that make older tests easy to game. Two problems in particular motivated it: saturation (models scoring so high the benchmark stops discriminating) and contamination (models having effectively seen the answers because the public repositories and their fixes ended up in training data).

Pro attacks both. It is larger and structured in three parts: a public set of 731 instances drawn from repositories under strong copyleft licenses such as GPL, a private set of 276 instances from proprietary startup codebases that are not publicly available, and a held-out set of 858 instances reserved for future evaluation — 1,865 tasks in total across 41 repositories. The copyleft and private code is deliberately chosen because it is far less likely to have leaked into a model's training data, so a high score reflects reasoning rather than memorization.

The tasks are also simply harder. Reference solutions on SWE-bench Pro average 107.4 lines of code spread across 4.1 files, meaning a model has to coordinate a genuine multi-file change rather than patch a single function. The result is a dramatic difficulty gap. On Scale AI's neutral public leaderboard, top agents land far below where they sit on Verified — the strict harness puts leading systems in a range that would look like failure next to a Verified scoreboard. Pro was built to hurt, and it does.

The Proof in One Number: 80.6 Versus 55.4

Here is the cleanest possible demonstration that these two benchmarks are not interchangeable. DeepSeek V4-Pro, running at maximum reasoning, scores 80.6 on SWE-bench Verified and 55.4 on SWE-bench Pro. Same model. Same vendor doing the reporting. Same underlying skill being tested. The only thing that changed is which benchmark it ran — and the score fell by more than 25 points.

Same model, two scores — DeepSeek V4-Pro reports 80.6 on SWE-bench Verified and 55.4 on SWE-bench Pro, a 25-point gap — The single number that settles the argument: one model, a 25-point spread between the two tests. Vendor-reported figures. Illustration by ThePlanetTools.ai.

This is not a DeepSeek quirk. The same pattern shows up on every model that reports both tests. Claude Opus 4.8 posts 88.6 on Verified and 69.2 on Pro — a 19-point gap on the same model. Wherever a lab publishes both numbers, Verified is systematically higher because it is the easier test. The gap is a property of the benchmarks, not of any one model.

Once you have seen the 80.6-versus-55.4 spread, the rule writes itself: a Verified score tells you roughly nothing about how a model will place on Pro, and vice versa. They live on different scales. You can read our full breakdown in the Claude Sonnet 5 vs DeepSeek V4 comparison, where we deliberately show DeepSeek's 80.6 only as a single-sided Verified figure and never line it up against a rival's Pro score.

Why Mixing Verified and Pro Inflates or Deflates a Model

The danger is not abstract. Cross-benchmark mixing is how a weaker model gets made to look like a leader — and it usually happens by accident, not malice.

Picture two coding models. Model A reports 80.6, and Model B reports 63.2. On paper, Model A wins comfortably. But Model A's 80.6 is a SWE-bench Verified number and Model B's 63.2 is a SWE-bench Pro number. Put both on the same test — Pro versus Pro — and the ranking flips: Model A drops to 55.4 while Model B holds at 63.2. The "clear win" was an illusion created by comparing an easy-test score to a hard-test score. This is exactly the DeepSeek-versus-Sonnet situation, and it is why our Claude Sonnet 5 and DeepSeek V4 pages keep their shared benchmark strictly Pro-to-Pro.

What you compare	Model A	Model B	Apparent winner	Honest?
A's Verified vs B's Pro	80.6 (Verified)	63.2 (Pro)	Model A	No — mixed tests
Pro vs Pro (same test)	55.4 (Pro)	63.2 (Pro)	Model B	Yes — like for like

The lesson is uncomfortable but simple: a higher number is meaningless unless it comes from the same test. Mixing benchmarks does not just add uncertainty — it can completely reverse who is actually better.

The Shared-Test Scoreboard: Pro Versus Pro

When you hold the benchmark constant and line models up on SWE-bench Pro alone, a clean and useful picture emerges. These are the vendor-reported Pro figures for the current generation of coding models, all on the same test:

SWE-bench Pro scoreboard — Opus 4.8 at 69.2, Sonnet 5 at 63.2, GLM-5.2 at 62.1, GPT-5.5 at 58.6, DeepSeek V4 at 55.4 — Held constant on SWE-bench Pro, the ranking becomes meaningful. Vendor self-reported figures. Illustration by ThePlanetTools.ai.

Model	SWE-bench Pro (vendor-reported)
Claude Opus 4.8	69.2
Claude Opus 4.7	64.3
Claude Sonnet 5	63.2
GLM-5.2	62.1
GPT-5.5	58.6
DeepSeek V4-Pro	55.4

Now the gaps mean something. The spread from top to bottom is about 14 points, and every number is measured on identical terms. This is the scoreboard we build our comparisons from — for example, Claude Opus 4.8 vs GPT-5.5 rests on the honest 69.2-versus-58.6 Pro gap, and GLM-5.2 vs DeepSeek V4 turns on 62.1 versus 55.4. When we compare Claude Sonnet 5 vs GPT-5.5 or Claude Sonnet 5 vs GLM-5.2, the same discipline applies: one shared test, no mixing.

Same Name Is Not Enough: Variants, Harnesses, and Testers

Matching the benchmark name is necessary but not sufficient. Two deeper traps hide inside scores that look identical on the surface.

Trap one: same benchmark, different version

Benchmarks get revised, and version numbers matter. A model reporting Terminal-Bench 2.1 has not been measured on the same test as a model reporting Terminal-Bench 2.0 — the task set changed between versions. We see this in the Opus 4.8 versus GPT-5.5 data, where one vendor cites Terminal-Bench 2.1 and the other cites 2.0, so those two numbers are flagged as not comparable even though the benchmark name is the same. The same caution applies to SWE-bench itself: always confirm you are comparing Pro to Pro or Verified to Verified, not Pro to Verified.

Trap two: same benchmark, different tester

Who ran the test, and with what scaffolding, changes the number. Vendors self-report their own SWE-bench Pro scores using their own agent harnesses, and those figures tend to run higher than the strict, uniform harness on Scale AI's neutral public leaderboard. Neither is lying — a self-report with a strong agent loop and a third-party run with a fixed harness are measuring slightly different things. But it means a vendor's self-reported 62.1 and a leaderboard's independent score for the same model are not automatically the same evidence. When a number is self-reported, we say so; when it is third-party verified, that carries more weight.

This is why our internal rule includes the tester, not just the benchmark. A DeepSeek self-report against a Claude self-report is defensible if both are the vendor's own Pro run; a self-report against a third-party leaderboard result is not a clean comparison, even when the benchmark and version match.

How to Read a Vendor Benchmark Claim

Model launches are marketing events, and benchmark charts are marketing assets. That does not make them dishonest — most labs report real numbers — but it does mean the framing is chosen to flatter. Here is how we read a benchmark claim before we trust it.

Checklist for reading an AI benchmark claim — same benchmark, same variant, same tester, self-report or third-party — Four questions to ask before you trust a benchmark chart. Illustration by ThePlanetTools.ai.

Which exact benchmark and variant is this? A number floating next to the word "SWE-bench" is not enough. Verified and Pro are different tests; a chart that does not say which one is hiding the most important detail. If it says Terminal-Bench, ask which version.

Who reported it — the vendor or an independent tester? Self-reported numbers use the vendor's own harness and are chosen by the vendor. Third-party results from a neutral leaderboard are harder to cherry-pick. Both are useful; they are not equal.

What is conspicuously missing? When a launch shows a glowing SWE-bench Verified score but no Pro number, that absence is information. When a rival's numbers appear on a competitor's slide, assume the most flattering framing for the presenter was selected.

Could contamination explain it? A very high score on an old, public benchmark can mean the model memorized the answers as much as it reasoned them out. Contamination-resistant tests like SWE-bench Pro exist precisely because this is a real effect, not a hypothetical one.

The One Rule That Keeps You Honest

Everything above collapses into a single rule you can apply in five seconds: only compare scores that share the same benchmark, the same variant, and — ideally — the same tester.

Same benchmark means Pro to Pro or Verified to Verified, never across the two. Same variant means Terminal-Bench 2.1 to 2.1, not 2.1 to 2.0. Same tester means vendor self-report against vendor self-report, or third-party against third-party, rather than mixing a marketing figure with an independent one. If a comparison fails any of those three checks, the ranking it implies is not trustworthy — full stop.

Apply this rule and most viral "Model X crushes Model Y" charts fall apart on inspection, because they quietly mix an easy-test score with a hard-test score. Apply it consistently and you will read AI coding claims more clearly than most of the industry.

How We Apply This Across Our Comparisons

This is not a theoretical standard for us; it is the operating rule behind every head-to-head we publish. When two models both report SWE-bench Pro, we build the verdict on that shared number and present everything else as single-sided, clearly labeled figures. When one model reports only Verified and the other only Pro, we refuse to invent a shared row and say so plainly.

You can see the discipline in practice across our library: Claude Sonnet 5 vs Gemini 3.1 Pro, Claude Sonnet 5 vs Kimi K2.7, Claude Sonnet 5 vs Qwen 3.6, Kimi K2.7 vs DeepSeek V4, and Claude Opus 4.8 vs Kimi K2.7 all keep their shared coding benchmark strictly like for like. The same rule shapes how we score coding agents such as Claude Code and OpenAI Codex — and it is the reason our Claude Code vs OpenAI Codex verdict leans on hands-on testing rather than a single benchmark either vendor happens to favor.

If you take one thing from this page, make it the rule: match the benchmark, match the variant, match the tester. A score without that context is a headline, not evidence.

Frequently Asked Questions

What is the difference between SWE-bench Verified and SWE-bench Pro?

SWE-bench Verified is a cleaner, easier 500-task subset of the original SWE-bench, human-validated in collaboration with OpenAI and released in August 2024. SWE-bench Pro is a separate, harder benchmark from Scale AI with 1,865 tasks across 41 repositories, built to resist training-data contamination using copyleft and private code. Pro tasks are larger — averaging 107.4 lines across 4.1 files — so scores on Pro run far below scores on Verified.

Why does the same model score higher on SWE-bench Verified than on SWE-bench Pro?

Because Verified is the easier test. It was curated to remove ambiguous and unsolvable tasks, leaving only problems a capable model can crack, and it has become saturated near the top. Pro deliberately uses harder, multi-file, contamination-resistant tasks. DeepSeek V4-Pro shows the gap on a single model: 80.6 on Verified versus 55.4 on Pro, a drop of more than 25 points.

Is SWE-bench Pro harder than SWE-bench Verified?

Yes, substantially. Pro's reference solutions average 107.4 lines of code across 4.1 files, and it draws on copyleft and private repositories that are unlikely to be in training data. On Scale AI's neutral public leaderboard, top agents land far below their Verified scores, and every model that reports both tests posts a lower number on Pro.

Can you compare a SWE-bench Verified score to a SWE-bench Pro score?

No. They are different tests on different scales, so a Verified number and a Pro number cannot be ranked against each other. Doing so can completely reverse which model looks better: DeepSeek V4-Pro's 80.6 on Verified appears to beat Claude Sonnet 5's 63.2 on Pro, but on the shared Pro test DeepSeek scores only 55.4 and Sonnet 5 leads. Always compare Pro to Pro or Verified to Verified.

Who created SWE-bench, SWE-bench Verified, and SWE-bench Pro?

The original SWE-bench was created by a team at Princeton NLP, including Carlos E. Jimenez and John Yang, and presented at ICLR 2024. SWE-bench Verified was released in August 2024 in collaboration with OpenAI as a human-validated 500-task subset. SWE-bench Pro was built by Scale AI as a harder, contamination-resistant benchmark.

What is a good SWE-bench Pro score in 2026?

On vendor-reported figures, the leading coding models cluster in the high 50s to high 60s: Claude Opus 4.8 at 69.2, Claude Sonnet 5 at 63.2, GLM-5.2 at 62.1, GPT-5.5 at 58.6, and DeepSeek V4-Pro at 55.4. Anything in that band is strong. Because Pro is designed to be hard, these numbers are far lower than the high-80s scores the same tier of models posts on SWE-bench Verified.

Why do vendor-reported SWE-bench Pro scores differ from Scale AI's public leaderboard?

Because the tester and the harness differ. Vendors self-report using their own agent scaffolding, which tends to produce higher numbers than the strict, uniform harness on Scale AI's neutral public leaderboard. Neither is dishonest, but a self-reported score and a third-party leaderboard score are not the same evidence, which is why matching the tester matters as much as matching the benchmark.

How should I read an AI model's benchmark announcement?

Ask four questions: which exact benchmark and variant is this, who reported it (the vendor or an independent tester), what is conspicuously missing, and could contamination explain a very high score on an old public test. Only compare numbers that share the same benchmark, the same variant, and ideally the same tester. A score without that context is a headline, not evidence.

SWE-bench Pro vs SWE-bench Verified: Why These AI Coding Scores Don't Compare