Claude Sonnet 5 vs Gemini 3.1 Pro: Coding Leader vs Reasoning Leader (2026)
Claude Sonnet 5 wins the shared SWE-bench Pro test (63.2% vs 54.2%); Gemini 3.1 Pro owns reasoning (94.3% GPQA) and native multimodal. Our 2026 split verdict.
Feature Comparison
| Feature | Claude Sonnet 5 | Gemini 3.1 Pro Preview |
|---|---|---|
| SWE-bench Pro (shared coding benchmark) | 63.2% | 54.2% |
| SWE-bench Verified (Gemini single-sided) | Not in published set | 80.6% |
| GPQA Diamond reasoning (single-sided) | Not published | 94.3% |
| ARC-AGI-2 abstract reasoning (single-sided) | Not published | 77.1% |
| Documented computer use (single-sided) | OSWorld-Verified 81.2% | Not published |
| Native multimodal input | Text, images | Text, images, audio, video, PDF |
| Input price per 1M tokens (up to 200K) | $2 intro / $3 standard | $2 |
| Output price per 1M tokens (up to 200K) | $10 intro / $15 standard | $12 |
| Context window | 1M tokens | 1M input / 64K output |
| Availability | Generally available, default on Claude.ai | Preview (since Feb 19, 2026) |
| Ecosystem and distribution | Claude Code, Claude API, Claude.ai | Gemini API, Vertex AI, AI Studio, Gemini app |
Detailed Comparison
Claude Sonnet 5 and Gemini 3.1 Pro are both mid-tier, value-priced models with a 1-million-token context window, but they lead in different lanes. On the one benchmark both vendors report on the same scale — SWE-bench Pro — Claude Sonnet 5 leads with 63.2% against Gemini 3.1 Pro's 54.2%, and it also publishes a computer-use score Gemini does not. Gemini 3.1 Pro answers with a stronger published reasoning profile (94.3% on GPQA Diamond, 77.1% on ARC-AGI-2) and richer native multimodality, accepting audio and video as well as text and images. This one does not have a single overall winner: Sonnet 5 is the pick for coding, computer use, and shipping maturity, while Gemini 3.1 Pro is the pick for reasoning and multimodal work.
Quick Verdict
If you want the single sentence: pick Claude Sonnet 5 when coding, documented computer use, and a generally available model matter most; pick Gemini 3.1 Pro when top-tier published reasoning scores and native audio-and-video multimodality are the priority. This one is a genuine split, and we are not going to invent a winner where the two models barely compete on the same tests.
The honest headline is that these two rarely meet on the same benchmark. The one place they do — SWE-bench Pro, the harder, contamination-resistant coding-agent test — Claude Sonnet 5 posts 63.2% from Anthropic's system card against Gemini 3.1 Pro's 54.2% from Google's model card, a nine-point edge to Sonnet 5. Almost everywhere else, each model publishes numbers the other does not: Gemini 3.1 Pro reports 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2 for reasoning, while Claude Sonnet 5 reports 81.2% on OSWorld-Verified for graphical computer use. Neither vendor mirrors the other, so we describe those figures as single-sided rather than pretending they are a head-to-head. What tips your decision is which lane you live in.
- Best shared-benchmark coding: Claude Sonnet 5 (63.2% vs 54.2% on SWE-bench Pro)
- Best published reasoning: Gemini 3.1 Pro (94.3% GPQA Diamond, 77.1% ARC-AGI-2 — scores Sonnet 5 does not publish)
- Best documented computer use: Claude Sonnet 5 (81.2% OSWorld-Verified, which Gemini does not report)
- Best native multimodality: Gemini 3.1 Pro (accepts text, images, audio, video, and PDFs; Sonnet 5 takes text and images)
- Best maturity and availability: Claude Sonnet 5 (generally available and the default on Claude.ai, while Gemini 3.1 Pro is still in preview)
- Best value at short context and agentic breadth: Gemini 3.1 Pro (cheaper standard input at up to 200K tokens, and a broad tool-use benchmark suite)
- Overall winner: None — this is a split. Choose by lane, not by a headline.
How We Compared Them
Honesty first. We have limited first-day hands-on time with Claude Sonnet 5, which launched on June 30, 2026, and our read on Gemini 3.1 Pro is research-led — we have not run either model as a month-long production driver at ThePlanetTools. So this is not a "we ran both side by side for weeks" piece. It is a structured comparison built on each vendor's published benchmarks and pricing pages, cross-checked against Google's model card and Anthropic's system card, plus the limited hands-on signal we have on the Anthropic side.
One rule shaped every number below: we place two figures head to head only when both vendors report the same benchmark on the same scale. That rule does more work than usual here, because these two models publish almost non-overlapping benchmark suites. The single clean overlap is SWE-bench Pro, where Anthropic's system card lists Claude Sonnet 5 at 63.2% and Google's model card lists Gemini 3.1 Pro at 54.2% (Google labels its figure "SWE-bench Pro (Public)," a single attempt averaged over five runs). We treat that as the one shared coding number, with a caveat: because each vendor runs its own harness and may score a slightly different subset, read the nine-point gap as indicative of a real coding edge for Sonnet 5 rather than as a refereed, independently reproduced result.
Everywhere else, we refuse to fabricate a mirror. Google publishes SWE-bench Verified at 80.6% for Gemini 3.1 Pro, but that is a different, more saturated benchmark than SWE-bench Pro, and Claude Sonnet 5's locked benchmark set centers on SWE-bench Pro and OSWorld rather than Verified — so we never line up an 80.6% Verified score against a 63.2% Pro score. Gemini 3.1 Pro reports 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2 for reasoning, which Sonnet 5 does not publish; Sonnet 5 reports 81.2% on OSWorld-Verified for graphical computer use, which Google does not publish for Gemini 3.1 Pro (we verified that absence across Google's model page, model card, and evaluation report, so we do not invent one). Those are single-sided strengths, and we label them as such. Pricing we took directly from each vendor's own pricing documentation — Anthropic's API pricing and Google's Gemini API and Vertex AI pricing pages — rather than from search snippets, and confirmed both were current at the time of writing.
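The head-to-head rule described above can be sketched in a few lines. This is a minimal illustration rather than the tooling we actually used: the dictionary layout and the function names `shared_benchmarks` and `single_sided` are assumptions made for the example, and the scores are the vendor-reported figures quoted in this article.

```python
# Vendor-reported scores quoted in this article, in percent.
# The dict layout is an illustrative assumption.
SONNET_5 = {
    "SWE-bench Pro": 63.2,
    "OSWorld-Verified": 81.2,
}
GEMINI_31_PRO = {
    "SWE-bench Pro": 54.2,
    "SWE-bench Verified": 80.6,
    "GPQA Diamond": 94.3,
    "ARC-AGI-2": 77.1,
}


def shared_benchmarks(a: dict[str, float], b: dict[str, float]) -> dict[str, float]:
    """Return the gap (a minus b) only for benchmarks both sides report."""
    return {name: round(a[name] - b[name], 1) for name in a if name in b}


def single_sided(a: dict[str, float], b: dict[str, float]) -> list[str]:
    """Return the benchmarks that a reports and b does not."""
    return sorted(set(a) - set(b))
```

Called as `shared_benchmarks(SONNET_5, GEMINI_31_PRO)`, the sketch returns a gap for SWE-bench Pro only. Everything else lands in one of the two `single_sided` lists, which is why SWE-bench Verified and SWE-bench Pro never end up in the same row.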
Meet Both Models
Claude Sonnet 5 — Anthropic's mid-tier coding workhorse
Claude Sonnet 5 is Anthropic's mid-tier model, released June 30, 2026, and described by Anthropic as its most agentic midsize model — built to plan, drive browsers and terminals, and run multi-step tasks. It sits below the Claude Opus 4.8 flagship and replaces Claude Sonnet 4.6 as the default workhorse. Its headline is agentic performance at a mid-tier price: 63.2% on SWE-bench Pro, roughly 91% of Opus 4.8's 69.2%, plus 81.2% on OSWorld-Verified for computer use. It is a closed model — you reach it through the Claude API with the model id claude-sonnet-5, inside Claude Code, and as the default model on the free and Pro plans of Claude.ai. That last point shortens evaluation: the same model powering production agents is the one a free user chats with in the browser. Its introductory price of $2 per million input tokens and $10 per million output tokens runs through August 31, 2026, then steps up to $3 and $15 on September 1, 2026. On multimodal input it accepts text and images and returns text.
Gemini 3.1 Pro — Google's reasoning-and-multimodal Pro model
Gemini 3.1 Pro is the "Pro" tier of Google's Gemini 3 series, published February 19, 2026, and still offered in preview as of this writing (its model id is gemini-3.1-pro-preview). It sits above the Gemini Flash tiers and below Google's higher-effort Gemini 3 Deep Think; there is no separate "Ultra" model, since "Ultra" refers to a Google subscription plan rather than a model. Its strengths are reasoning and multimodality: Google reports 94.3% on GPQA Diamond graduate-science reasoning, 77.1% on ARC-AGI-2 abstract reasoning (ARC Prize Verified), and 51.4% on Humanity's Last Exam, alongside a broad agentic suite that includes 85.9% on BrowseComp and 68.5% on Terminal-Bench 2.0. It is natively multimodal, accepting text, images, audio, video, and PDFs (and even whole code repositories) as input, and returning text. It ships a 1-million-token input context window with a 64,000-token output limit, and is distributed through the Gemini API, Vertex AI, Google AI Studio, and the Gemini app. Standard pricing is $2 per million input tokens and $12 per million output tokens for prompts up to 200K tokens, rising to $4 and $18 above that.
Head-to-Head at a Glance
| Dimension | Claude Sonnet 5 | Gemini 3.1 Pro | Edge |
|---|---|---|---|
| SWE-bench Pro (shared coding benchmark) | 63.2% | 54.2% | Sonnet 5 (+9.0) |
| SWE-bench Verified (Gemini single-sided) | Not in published set | 80.6% | Gemini (its own board) |
| GPQA Diamond reasoning (single-sided) | Not published | 94.3% | Gemini |
| ARC-AGI-2 abstract reasoning (single-sided) | Not published | 77.1% | Gemini |
| Documented computer use (single-sided) | OSWorld-Verified 81.2% | Not published | Sonnet 5 |
| Native multimodal input | Text, images | Text, images, audio, video, PDF | Gemini |
| Input price per 1M tokens (up to 200K) | $2 intro / $3 standard | $2 | Near tie |
| Output price per 1M tokens (up to 200K) | $10 intro / $15 standard | $12 | Sonnet 5 (intro), Gemini (standard) |
| Context window | 1M tokens | 1M input / 64K output | Near tie |
| Availability | Generally available, default on Claude.ai | Preview (since Feb 19, 2026) | Sonnet 5 |
| Ecosystem and distribution | Claude Code, Claude API, Claude.ai | Gemini API, Vertex AI, AI Studio, Gemini app | Different lanes |
The table splits almost cleanly down the middle, and that is the point. Sonnet 5 takes the one shared coding benchmark, documented computer use, shipping maturity, and output price during its introductory window; Gemini 3.1 Pro takes published reasoning, native multimodality, and the short-context price edge on both input and output once Sonnet 5 moves to standard rates. The two effectively tie on context window and on the value of their respective ecosystems. Which column of "edge" matters more depends entirely on whether your work is coding-and-computer-use heavy or reasoning-and-multimodal heavy.
Capability: What the Benchmarks Actually Say
The cleanest capability signal is the one benchmark both vendors report on the same scale: SWE-bench Pro, the coding-agent test built to resist contamination. Anthropic's system card puts Claude Sonnet 5 at 63.2%; Google's model card puts Gemini 3.1 Pro at 54.2% (labeled "SWE-bench Pro (Public)," a single attempt averaged over five runs). The nine-point gap favors Sonnet 5 and is the widest apples-to-apples capability difference we could establish between these two. Two caveats keep it honest. Both numbers are vendor-reported and not independently reproduced by our team, and because each lab runs its own harness the exact subset may differ slightly — so treat this as a real but indicative coding edge rather than a refereed scoreboard. Still, the direction is clear: on the harder shared coding benchmark, Anthropic's mid-tier model is ahead.
Gemini 3.1 Pro's strongest single claims sit on benchmarks Claude Sonnet 5 does not publish. On reasoning, Google reports 94.3% on GPQA Diamond — a graduate-level science test near the top of the field — and 77.1% on ARC-AGI-2, the abstract-reasoning benchmark verified by the ARC Prize team, plus 51.4% on the brutal Humanity's Last Exam. Claude Sonnet 5's locked benchmark set does not include a comparable GPQA or ARC-AGI-2 figure, so we do not place a Sonnet number beside these. That is a genuine, single-sided strength for Gemini 3.1 Pro: if published reasoning scores are what you optimize for, it brings the receipts and Sonnet 5, at least publicly, does not. Google also reports SWE-bench Verified at 80.6% for Gemini 3.1 Pro, but Verified is a different, more saturated coding benchmark than Pro, and we keep the two apart rather than mixing an 80.6% Verified score against a 63.2% Pro score.
On computer use, the roles reverse. Claude Sonnet 5 reports 81.2% on OSWorld-Verified, the benchmark that measures whether a model can operate real graphical software — clicking through dashboards, filling forms, and extracting data from interfaces without an API. Google does not publish an OSWorld or equivalent graphical computer-use score for Gemini 3.1 Pro; we checked Google's model page, model card, and evaluation report, and the capability is simply not in the reported set (Google's related tool-use numbers, such as 85.9% on BrowseComp and 68.5% on Terminal-Bench 2.0, cover agentic search and command-line workflows, which are different tests). So for documented graphical computer use, Sonnet 5 is the measured choice, and we refuse to invent a Gemini figure to sit next to its 81.2%. The takeaway across the benchmark spread: the two models are strong in non-overlapping directions — Sonnet 5 in coding and computer use, Gemini 3.1 Pro in reasoning and broad agentic tool use.
Reasoning and Multimodality: Gemini's Lane
Where Gemini 3.1 Pro pulls clearly ahead is the pairing of reasoning and multimodal input. Its 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2 are the sort of numbers that anchor a "smartest model" pitch, and Google backs them with 80.5% on MMMU-Pro (multimodal understanding) and 92.6% on multilingual MMLU. Just as important for real products, Gemini 3.1 Pro is natively multimodal in a way Sonnet 5 is not: it accepts audio and video alongside text, images, and PDFs, so a single call can reason over a meeting recording, a screen capture video, or a stack of scanned documents. Claude Sonnet 5 accepts text and images and returns text; it has no native audio or video input on this base model. For pipelines that ingest audio, video, or mixed media — transcription-plus-analysis, video understanding, multilingual document processing — Gemini 3.1 Pro is the more natural fit, and that gap is not close.
There is a maturity asterisk to weigh against Gemini's capability lead, though. Gemini 3.1 Pro has been available since February 19, 2026, yet it is still labeled preview, which means Google reserves the right to change behavior before general availability. Claude Sonnet 5, despite launching only on June 30, 2026, shipped as a generally available model and the default on Claude.ai. So the newer model is the more production-stable of the two on paper — an unusual inversion that matters if you are putting either into a system you cannot easily re-tune.
Coding, Computer Use, and Agentic Work
For an agentic coding workflow, Claude Sonnet 5 has the more direct evidence. It leads the shared SWE-bench Pro benchmark, it integrates tightly with Claude Code, and its 81.2% OSWorld-Verified score speaks to the "drive a real UI" loop that pure coding benchmarks miss. If your agent needs to read a dashboard screenshot, decide the next click, and act, Sonnet 5 has published that capability where Gemini 3.1 Pro has not. Gemini 3.1 Pro is not weak at agentic work, though — far from it. Google reports 85.9% on BrowseComp for agentic web search, 68.5% on Terminal-Bench 2.0 for command-line workflows, and strong tool-use results on tau2-bench and MCP Atlas, which point to a model built to plan and call tools across long horizons. The difference is emphasis: Sonnet 5's agentic story is documented most strongly in code and graphical computer use, while Gemini 3.1 Pro's is documented most strongly in search, terminal, and tool orchestration.
That distinction should map onto your architecture. Teams building coding agents, computer-use bots, or anything that manipulates a graphical interface will find Sonnet 5's published numbers more directly reassuring. Teams building research agents, browser-driven data gathering, or multi-tool pipelines will find Gemini 3.1 Pro's BrowseComp and tool-use suite more relevant. Neither is a bad agentic model; they simply publish their strengths in different places, and a single benchmark headline will not tell you which one fits your loop.
Pricing: A Nuanced Split
Price is closer than the capability lanes suggest, and it does not hand a clean win to either side. Both are mid-tier, value-priced models, and we verified both directly from vendor pricing pages. The wrinkle is that Gemini 3.1 Pro charges by context length while Claude Sonnet 5 charges a flat rate that changes on a calendar date.
| Cost dimension | Claude Sonnet 5 | Gemini 3.1 Pro |
|---|---|---|
| Input per 1M tokens (up to 200K) | $2 introductory, $3 standard from September 1, 2026 | $2 |
| Input per 1M tokens (above 200K) | $2 introductory, $3 standard (flat, no context tier) | $4 |
| Output per 1M tokens (up to 200K) | $10 introductory, $15 standard | $12 |
| Output per 1M tokens (above 200K) | $10 introductory, $15 standard (flat) | $18 |
| Cached input per 1M tokens | $0.20 introductory, $0.30 standard | $0.20 (up to 200K), $0.40 (above 200K), plus storage |
| Availability and free access | Generally available; free as the default on Claude.ai | Preview; testable in Google AI Studio and the Gemini app |
Read it carefully and the winner depends on your prompt shape and on the date. During Claude Sonnet 5's introductory window through August 31, 2026, it is cheaper than or equal to Gemini 3.1 Pro across the board: the same $2 input at short context, and cheaper output at $10 against $12. Once Sonnet 5 moves to standard pricing on September 1, 2026, the picture flips at short context: Gemini 3.1 Pro undercuts Sonnet 5 on both input ($2 against $3) and output ($12 against $15). For prompts above 200,000 tokens, Sonnet 5's flat rate is the cheaper option again, because Gemini 3.1 Pro steps up to $4 input and $18 output while Sonnet 5 stays at $3 and $15. Since output tokens usually dominate the bill on generation-heavy and agentic work, the output rate is the number to watch: it favors Sonnet 5 during the introductory window and on long prompts, and Gemini 3.1 Pro at short context once standard rates apply. Measure on your own token mix before assuming either is cheaper.
One honesty note keeps this from being a tidy verdict: cross-vendor token pricing is not perfectly one-to-one, because Anthropic and Google use different tokenizers, so the real cost difference on your actual workload may not track the headline per-token ratio exactly. Both also offer discount modes — Gemini 3.1 Pro publishes batch and flex tiers that lower its rate, and Sonnet 5 is free to use as the default on Claude.ai — so the effective cost depends on how you run it, not just the sticker rate.
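To make the prompt-shape dependence concrete, here is a minimal cost sketch built only from the per-million-token rates in the table above. The function names are illustrative, and two simplifications are assumptions rather than vendor guidance: the Gemini tier is selected by input length for the whole request, and token counts are treated as identical across vendors even though the tokenizers differ. Cached-input, batch, and flex rates are left out.

```python
from datetime import date

# Per-1M-token rates in USD, as (input, output), taken from the table above.
SONNET_INTRO = (2.00, 10.00)      # through August 31, 2026
SONNET_STANDARD = (3.00, 15.00)   # from September 1, 2026
SONNET_STANDARD_START = date(2026, 9, 1)

GEMINI_SHORT = (2.00, 12.00)      # prompts up to 200K tokens
GEMINI_LONG = (4.00, 18.00)       # prompts above 200K tokens
GEMINI_TIER_LIMIT = 200_000


def sonnet_cost(input_tokens: int, output_tokens: int, on: date) -> float:
    """Flat rate that depends only on the calendar date."""
    in_rate, out_rate = SONNET_STANDARD if on >= SONNET_STANDARD_START else SONNET_INTRO
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def gemini_cost(input_tokens: int, output_tokens: int) -> float:
    """Tiered rate. Assumption: the prompt length picks the tier for the whole request."""
    in_rate, out_rate = GEMINI_LONG if input_tokens > GEMINI_TIER_LIMIT else GEMINI_SHORT
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

For a request with 50,000 input tokens and 5,000 output tokens, this works out to $0.15 on Sonnet 5 during the introductory window and $0.225 at standard rates, against $0.16 on Gemini 3.1 Pro. At 300,000 input tokens and the same output, Sonnet 5 at standard rates comes to $0.975 against $1.29 for Gemini 3.1 Pro.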
Positioning, Availability, and Ecosystem
Both models are the value-tier workhorses of their families, but they sit in differently shaped lineups. Claude Sonnet 5 is Anthropic's mid-tier default, with Claude Opus 4.8 reserved above it for the hardest, most safety-sensitive slice of work. Gemini 3.1 Pro is the "Pro" tier of the Gemini 3 series, above the Flash models and below Gemini 3 Deep Think for the heaviest science and research reasoning. So both are the "reach for this first" option in their stack, with a bigger sibling available when you need it.
Availability is where they diverge most in day-to-day terms. Claude Sonnet 5 is generally available through the Claude API, Claude Code, and Claude.ai, and it is the default model non-technical users meet in the browser — which means you can evaluate the exact production model for free before spending on the API. Gemini 3.1 Pro, despite being months older, is still in preview, distributed through the Gemini API, Vertex AI, Google AI Studio, and the Gemini app; it is testable in AI Studio and the app, but its preview label means Google may adjust it before general availability. On raw reach, Google's distribution through Vertex AI and Workspace-adjacent surfaces is enormous, and Anthropic's is a tighter, developer-focused ecosystem anchored by Claude Code. If deep Google Cloud integration or the Gemini app matters to you, Gemini 3.1 Pro is the natural fit; if you want a stable, generally available model with a lean migration path and free in-browser evaluation, Sonnet 5 has the edge.
Winner by Category
Best shared-benchmark coding: Claude Sonnet 5
On the one clean shared benchmark, Sonnet 5's 63.2% beats Gemini 3.1 Pro's 54.2% on SWE-bench Pro, the harder, contamination-resistant coding test. If your priority is the best documented result on the benchmark where these two actually meet, Sonnet 5 has a nine-point edge — with the caveat that both figures are vendor-reported rather than independently refereed.
Best published reasoning: Gemini 3.1 Pro
Gemini 3.1 Pro reports 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2, top-tier reasoning scores that Claude Sonnet 5 does not publish. For graduate-science reasoning and abstract problem-solving where you want documented numbers, Gemini 3.1 Pro is the model with the receipts.
Best documented computer use: Claude Sonnet 5
Sonnet 5 publishes an 81.2% OSWorld-Verified score for graphical computer use — operating real software without an API. Google does not report an OSWorld figure for Gemini 3.1 Pro, so for documented graphical computer use Sonnet 5 is the measured, safer choice.
Best native multimodality: Gemini 3.1 Pro
Gemini 3.1 Pro accepts text, images, audio, video, and PDFs as input, while Claude Sonnet 5 takes text and images. For any pipeline that ingests audio or video, or reasons over mixed media in a single call, Gemini 3.1 Pro is the clear fit.
Best maturity and availability: Claude Sonnet 5
Claude Sonnet 5 shipped generally available and as the default on Claude.ai, while Gemini 3.1 Pro — although released back in February 2026 — remains in preview. If you need a production-stable model you can rely on not changing under you, Sonnet 5 is the steadier bet today.
Best short-context value and agentic breadth: Gemini 3.1 Pro
At standard rates, Gemini 3.1 Pro's $2 input and $12 output up to 200K tokens undercut Sonnet 5's $3 and $15, and its broad tool-use suite (85.9% on BrowseComp, strong tau2-bench and MCP Atlas results) makes it a strong pick for research and multi-tool agents. Sonnet 5 answers with a lower output price during its introductory window and a flat rate that is cheaper above 200K tokens, so weigh this by your token mix.
Overall winner: A split, not a headline
There is no across-the-board winner here, and calling one would be dishonest. The two models publish almost non-overlapping benchmarks and lead in different lanes, so the right pick is the one whose lane is yours: Sonnet 5 for coding, computer use, and shipping stability; Gemini 3.1 Pro for reasoning and multimodal reach.
Pros and Cons of Each
Claude Sonnet 5
What stands out:
- Leads the one shared coding benchmark: 63.2% SWE-bench Pro against Gemini 3.1 Pro's 54.2%, a nine-point edge
- Publishes an 81.2% OSWorld-Verified computer-use score for driving real graphical software, which Gemini 3.1 Pro does not report
- Generally available and the default model on Claude.ai, so you can evaluate the exact production model for free before any API spend
- Cheaper output during the introductory window ($10 per million tokens against Gemini 3.1 Pro's $12), plus a flat rate that stays cheaper on prompts above 200K tokens
- Tight developer ecosystem with Claude Code and a one-line model-string migration from prior Claude models
Where it falls short:
- Publishes no GPQA Diamond or ARC-AGI-2 reasoning score to match Gemini 3.1 Pro's 94.3% and 77.1%
- Accepts only text and images — no native audio or video input
- Introductory pricing rises to $3 input and $15 output per million tokens on September 1, 2026, ceding both the short-context input and output price to Gemini
- Brand new (June 30, 2026 launch), so its long-run independent track record is still thin
- Smaller distribution surface than Google's Vertex AI and Gemini app reach
Gemini 3.1 Pro
What stands out:
- Top-tier published reasoning: 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2, scores Claude Sonnet 5 does not publish
- Natively multimodal — accepts text, images, audio, video, and PDFs, a genuine advantage for mixed-media pipelines
- Broad agentic tool-use suite, including 85.9% on BrowseComp and 68.5% on Terminal-Bench 2.0
- Cheaper standard input at short context ($2 per million tokens up to 200K) and huge distribution through Vertex AI, AI Studio, and the Gemini app
- Also publishes SWE-bench Verified at 80.6%, so it is no slouch at coding on its own board
Where it falls short:
- Trails Sonnet 5 by nine points on the shared SWE-bench Pro benchmark (54.2% vs 63.2%)
- Publishes no OSWorld or graphical computer-use score to match Sonnet 5's 81.2%
- Still labeled preview months after its February 2026 release, so behavior may change before general availability
- Higher output price than Sonnet 5's introductory rate ($12 against $10 per million tokens), and a context-tiered rate that rises to $4 input and $18 output above 200K tokens
- Text-only output, and a 64,000-token output cap
When to Pick Which
Pick Claude Sonnet 5 if...
Your work is coding-heavy, computer-use-heavy, or you need a generally available model you can trust not to shift under you. Sonnet 5 is the stronger default when you trust the shared SWE-bench Pro benchmark, when your agents drive graphical interfaces where its OSWorld score is directly relevant, when your prompts regularly exceed 200K tokens and its flat rate undercuts Gemini's long-context tier, or when free in-browser evaluation on Claude.ai and a lean Claude Code integration shorten your rollout. It is also the pragmatic pick during the introductory pricing window through August 31, 2026, when it is cheaper than or equal to Gemini 3.1 Pro across the board. Reach up to Claude Opus 4.8 only for the hardest slice above Sonnet 5.
Pick Gemini 3.1 Pro if...
Your work leans on reasoning, mixed-media input, or the Google ecosystem. Gemini 3.1 Pro is the better choice when documented reasoning scores like GPQA Diamond and ARC-AGI-2 are what you optimize for, when your pipeline ingests audio or video and needs native multimodality, when you are building research or browser-driven agents where its BrowseComp and tool-use results shine, or when you are already on Vertex AI and want tight Google Cloud integration. The caveats to weigh first: it is still in preview, so behavior may change, and its rates run higher than Sonnet 5's on prompts above 200K tokens and, for output, during Sonnet 5's introductory window. If those do not block you, its reasoning and multimodal reach are strong reasons to choose it.
Or run a split stack
The two are not mutually exclusive, and because they lead in different lanes a split stack is unusually natural here. A common 2026 pattern is to route by task: send coding and computer-use work to Claude Sonnet 5, where its shared-benchmark edge and documented OSWorld score earn their keep, and send reasoning-intensive or multimodal work to Gemini 3.1 Pro, where its GPQA, ARC-AGI-2, and native audio-video handling do. Route generation-heavy work by cost: Sonnet 5's output rate is lower during the introductory window and on prompts above 200K tokens, and Gemini 3.1 Pro's is lower at short context once standard rates apply. If you want to see how each stacks up against neighboring models, our Claude Sonnet 5 vs GPT-5.5, Claude Opus 4.8 vs Gemini 3.1 Pro, and Claude Sonnet 5 vs Claude Opus 4.8 comparisons cover the surrounding matchups.
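A route-by-task rule of this kind can be sketched as a small pure function. Treat it as an illustration under stated assumptions: the task labels, the `pick_model` name, and the rule that any audio or video input forces Gemini are all hypothetical choices for the example. Only the two model ids come from this article, and the function makes no API calls.

```python
from collections.abc import Iterable

SONNET = "claude-sonnet-5"          # model id quoted earlier in this article
GEMINI = "gemini-3.1-pro-preview"   # model id quoted earlier in this article

# Hypothetical task labels; rename them to match your own pipeline.
SONNET_TASKS = {"coding", "computer_use"}
GEMINI_TASKS = {"reasoning", "research"}

# Input types this article lists for Gemini 3.1 Pro and not for Sonnet 5.
GEMINI_ONLY_MEDIA = {"audio", "video"}


def pick_model(task: str, media: Iterable[str] = ()) -> str:
    """Pick a model id from the task label and the media types in the request."""
    # Media check comes first: a coding task with a video attachment still
    # needs a model that accepts video input.
    if set(media) & GEMINI_ONLY_MEDIA:
        return GEMINI
    if task in SONNET_TASKS:
        return SONNET
    if task in GEMINI_TASKS:
        return GEMINI
    raise ValueError(f"no route defined for task {task!r}")
```

Raising on unknown tasks instead of silently defaulting is a deliberate choice in this sketch: it forces each new workload to be assigned a lane explicitly, ideally after measuring it on both models.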
Frequently Asked Questions
Is Claude Sonnet 5 or Gemini 3.1 Pro better for coding?
On the one benchmark both vendors report on the same scale — SWE-bench Pro — Claude Sonnet 5 leads with 63.2% against Gemini 3.1 Pro's 54.2%, a nine-point edge from Anthropic's system card versus Google's model card. Sonnet 5 also publishes an 81.2% OSWorld-Verified computer-use score that Gemini does not report. Gemini 3.1 Pro is no slouch at coding on its own board, reporting SWE-bench Verified at 80.6%, but that is a different benchmark we do not line up against Pro. For agentic coding and computer use, Sonnet 5 is the better-documented choice.
Is Gemini 3.1 Pro or Claude Sonnet 5 better at reasoning?
Gemini 3.1 Pro, based on published numbers. Google reports 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2 for Gemini 3.1 Pro, top-tier reasoning scores. Claude Sonnet 5's locked benchmark set does not include comparable GPQA or ARC-AGI-2 figures, so we do not place a Sonnet number beside them. If documented reasoning performance is your priority, Gemini 3.1 Pro is the model with the receipts.
Which model is cheaper, Claude Sonnet 5 or Gemini 3.1 Pro?
It depends on your prompt shape and on the date. During Sonnet 5's introductory window through August 31, 2026, it is cheaper than or equal to Gemini 3.1 Pro across the board ($2 input, $10 output per million tokens against Gemini's $2 and $12). At standard rates from September 1, 2026, Gemini 3.1 Pro is cheaper on prompts up to 200K tokens for both input ($2 against $3) and output ($12 against $15), while Sonnet 5 is cheaper on prompts above 200K tokens, where Gemini rises to $4 input and $18 output against Sonnet 5's flat $3 and $15. Output usually dominates the bill, so check which tier most of your requests fall into and measure on your own token mix.
Why is Gemini 3.1 Pro still in preview if it launched before Claude Sonnet 5?
Gemini 3.1 Pro was published on February 19, 2026, and Google released it in preview to validate updates before general availability; as of this writing it is still labeled preview (model id gemini-3.1-pro-preview). Claude Sonnet 5 launched later, on June 30, 2026, but shipped as a generally available model and the default on Claude.ai. So the newer model is the more production-stable of the two today, an unusual inversion worth noting if you need a model that will not change under you.
Do both models have a 1-million-token context window?
Effectively yes on input. Claude Sonnet 5 supports a 1-million-token context window, and Gemini 3.1 Pro supports 1 million input tokens with a 64,000-token output limit. For very large multi-document or multi-repository analysis, both comfortably handle long contexts. Note that Gemini 3.1 Pro's pricing steps up above 200,000 tokens, while Sonnet 5 keeps a single flat rate regardless of prompt length.
Which model handles audio and video?
Gemini 3.1 Pro. It is natively multimodal and accepts text, images, audio, video, and PDFs as input, returning text. Claude Sonnet 5 accepts text and images and returns text, with no native audio or video input on this base model. For transcription-plus-analysis, video understanding, or mixed-media pipelines, Gemini 3.1 Pro is the more natural fit.
Does either model do computer use or browser automation?
They document different agentic strengths. Claude Sonnet 5 reports 81.2% on OSWorld-Verified, the test for operating real graphical software without an API. Google does not publish an OSWorld figure for Gemini 3.1 Pro; its related agentic numbers are 85.9% on BrowseComp for agentic web search and 68.5% on Terminal-Bench 2.0 for command-line workflows. Because those are different tests, we do not compare the scores directly. For documented graphical computer use, Sonnet 5 is the measured choice; for agentic web search, Gemini 3.1 Pro has the published edge.
What is Gemini 3.1 Pro's SWE-bench Pro score?
Google's model card lists Gemini 3.1 Pro at 54.2% on SWE-bench Pro, labeled "SWE-bench Pro (Public)" and measured as a single attempt averaged over five runs. Claude Sonnet 5 reports 63.2% on SWE-bench Pro from Anthropic's system card, a nine-point lead. Both are vendor-reported figures we have not independently reproduced, and because each lab runs its own harness the exact subset may differ slightly, so read the gap as indicative rather than refereed.
Is Gemini 3.1 Pro a flagship model?
It is Google's "Pro" tier of the Gemini 3 series — the value-tier workhorse, above the Flash models and below Gemini 3 Deep Think for the heaviest reasoning. There is no separate "Ultra" model; "Ultra" refers to a Google subscription plan, not a model. Claude Sonnet 5 sits in a similar spot as Anthropic's mid-tier default below the Claude Opus 4.8 flagship. Both are "reach for this first" models with a bigger sibling available above them.
Which model should I choose for a research or data-gathering agent?
Gemini 3.1 Pro has the more relevant published numbers for that use case. It reports 85.9% on BrowseComp for agentic web search plus strong tool-use results on tau2-bench and MCP Atlas, and its native multimodality helps when sources include images, audio, or PDFs. Claude Sonnet 5 is the stronger pick if your agent instead manipulates graphical software or writes and runs code, where its OSWorld and SWE-bench Pro results lead. Match the model to the shape of your agent's work.
Can I try both models for free?
Partly. Claude Sonnet 5 is free to use as the default model on Claude.ai's free plan, so you can evaluate the exact production model in the browser before any API spend. Gemini 3.1 Pro is a paid model through the Gemini API and Vertex AI, but you can test it in Google AI Studio and through the Gemini app. For production, both are billed per token, and you should run a representative evaluation on your own workload before committing.
Which model is the better overall pick in 2026?
Neither wins outright — this is a genuine split. The two publish almost non-overlapping benchmarks and lead in different lanes. Claude Sonnet 5 is the better pick for coding, documented computer use, and a generally available, production-stable model. Gemini 3.1 Pro is the better pick for published reasoning scores, native audio and video multimodality, and Google-ecosystem integration. Choose by which lane your work lives in, not by a single headline number.
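Choosing by lane can be sketched as a small lookup table. This is an illustration under stated assumptions, not a recommended architecture: the lane names, the `pick_model` helper, and the fallback default are all hypothetical, and the model strings are display labels rather than API model identifiers.

```python
# Display labels only; these are NOT real API model identifiers.
SONNET = "Claude Sonnet 5"
GEMINI = "Gemini 3.1 Pro"

# Hypothetical lane names, mapped to the model this article favors for each.
LANES = {
    "coding": SONNET,
    "computer_use": SONNET,
    "reasoning": GEMINI,
    "audio_input": GEMINI,
    "video_input": GEMINI,
    "web_research": GEMINI,
}


def pick_model(lane: str, default: str = SONNET) -> str:
    """Return the favored model for a lane, or the default for unknown lanes.

    The default is an assumption made for this sketch; choose your own after
    evaluating both models on your workload.
    """
    return LANES.get(lane, default)


print(pick_model("coding"))
print(pick_model("video_input"))
```

A table like this is only a starting point. The mapping should be replaced by whatever your own evaluation shows.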
Final Verdict
This comparison does not have a single winner, and pretending otherwise would be dishonest. Claude Sonnet 5 and Gemini 3.1 Pro lead in different lanes, and they barely meet on the same benchmark. Where they do meet — SWE-bench Pro — Sonnet 5 is ahead, 63.2% to 54.2%, and it backs that with a documented 81.2% OSWorld-Verified computer-use score and the reassurance of being generally available. If your work is coding, computer use, or anything you need to ship on a stable model today, Claude Sonnet 5 is the answer.
But Gemini 3.1 Pro owns the reasoning-and-multimodal lane just as clearly. Its 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2 are reasoning scores Sonnet 5 does not publish, and its native handling of audio, video, and PDFs is a capability Sonnet 5 simply does not have. Add a broad agentic tool-use suite and deep Google Cloud distribution, and it is the rational pick for reasoning-heavy or mixed-media work — with the honest caveat that it is still in preview months after its February 2026 release. The two models are close enough in tier and price, and different enough in strengths, that the smart move for many teams is a split stack: Sonnet 5 for code and computer use, Gemini 3.1 Pro for reasoning and multimodal input. Measure both on your own workload, and let the shape of your work — not a single benchmark headline — make the call.
Last compared: July 2026. Claude Sonnet 5 launched June 30, 2026; Gemini 3.1 Pro was published February 19, 2026, and remains in preview as of this writing. Our Sonnet 5 assessment reflects limited first-day hands-on time plus Anthropic's published benchmarks; our Gemini 3.1 Pro assessment is research-led, built on Google's model card, model page, and evaluation report. Benchmark figures are vendor-reported and not independently reproduced by our team; SWE-bench Pro and SWE-bench Verified are different benchmarks and are never compared against each other, and single-sided scores (such as Gemini's GPQA Diamond and Sonnet 5's OSWorld-Verified) are labeled as such rather than mirrored. Pricing verified directly from Anthropic's and Google's pricing pages at the time of writing.
Our Verdict
This is a genuine split with no single overall winner. Claude Sonnet 5 and Gemini 3.1 Pro lead in different lanes and barely meet on the same benchmark. Where they do — SWE-bench Pro — Sonnet 5 is ahead, 63.2% to 54.2%, and it backs that with a documented 81.2% OSWorld-Verified computer-use score and generally available, production-stable shipping. Gemini 3.1 Pro owns the reasoning-and-multimodal lane just as clearly: 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2 are scores Sonnet 5 does not publish, and native handling of audio, video, and PDFs is a capability Sonnet 5 lacks — with the caveat that Gemini 3.1 Pro is still in preview months after its February 2026 release. Best for coding, computer use, and maturity: Claude Sonnet 5. Best for reasoning and multimodal input: Gemini 3.1 Pro. Many teams should run a split stack.
Choose Claude Sonnet 5
Anthropic's most agentic midsize model — near-Opus 4.8 coding and computer use at $2 per million input tokens (introductory through August 2026).
Try Claude Sonnet 5 →
Choose Gemini 3.1 Pro Preview
Google DeepMind's Pro-tier Gemini 3.1 Pro Preview: 94.3% GPQA Diamond, 77.1% ARC-AGI-2, 1M-token context, multimodal input with text output, vibe coding plus agentic tool use. Still in preview as of July 2026.
Try Gemini 3.1 Pro Preview →
Frequently Asked Questions
Is Claude Sonnet 5 better than Gemini 3.1 Pro Preview?
Not across the board. Claude Sonnet 5 leads on the one shared benchmark, SWE-bench Pro (63.2% to 54.2%), and adds a documented 81.2% OSWorld-Verified computer-use score and general availability. Gemini 3.1 Pro leads on published reasoning (94.3% on GPQA Diamond, 77.1% on ARC-AGI-2) and on native audio, video, and PDF input, but it is still in preview. Pick Sonnet 5 for coding and computer use, and Gemini 3.1 Pro for reasoning and multimodal input.
Which is cheaper, Claude Sonnet 5 or Gemini 3.1 Pro Preview?
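It depends on which Sonnet 5 rate applies. Claude Sonnet 5 costs $2 input / $10 output per 1M tokens at the introductory rate (through August 2026) and $3 input / $15 output at the standard rate; it is also available on Claude.ai's free plan. Gemini 3.1 Pro Preview costs $2 input / $12 output per 1M tokens. Both sets of prices apply to prompts up to 200K tokens. At the introductory rate Sonnet 5 is cheaper on output and equal on input; at the standard rate Gemini 3.1 Pro Preview is cheaper on both. Check the pricing comparison section above for a full breakdown.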
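To see how those per-token prices play out on a concrete workload, here is a minimal Python sketch. The prices are the ones quoted in this comparison for prompts up to 200K tokens. The dictionary keys are hypothetical labels rather than API model identifiers, and the 50M input / 10M output monthly volume is an invented example, not a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Prices as quoted in this article (prompts up to 200K tokens).
# Keys are hypothetical labels, NOT real API model identifiers.
PRICES = {
    "sonnet-5-intro": Price(2.0, 10.0),
    "sonnet-5-standard": Price(3.0, 15.0),
    "gemini-3.1-pro-preview": Price(2.0, 12.0),
}


def token_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost for the given token volumes at the given price."""
    return (input_tokens / 1_000_000) * price.input_per_m + (
        output_tokens / 1_000_000
    ) * price.output_per_m


# Invented example workload: 50M input tokens and 10M output tokens per month.
for label, price in PRICES.items():
    print(f"{label}: ${token_cost(price, 50_000_000, 10_000_000):.2f}")
```

On this invented workload the three rates come to $200, $300, and $220 respectively, so the ranking flips once Sonnet 5's introductory pricing ends. Output-heavy workloads widen that difference, since the output prices differ more than the input prices.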
What are the main differences between Claude Sonnet 5 and Gemini 3.1 Pro Preview?
We compared the two models across 11 features. On SWE-bench Pro, the one shared coding benchmark, Claude Sonnet 5 scores 63.2% and Gemini 3.1 Pro Preview scores 54.2%. On SWE-bench Verified, Gemini 3.1 Pro Preview reports 80.6%, while Sonnet 5 has no figure in its published set. On GPQA Diamond, Gemini 3.1 Pro Preview reports 94.3%, and no Sonnet 5 score has been published. See the full feature comparison table above for all details.

