Qwen3 Goes Sparse: 9.36x Prefill at 1M Context

RTPurbo is a research method, described in a May 2026 arXiv preprint by authors from Nanjing University and Alibaba Group, for converting a full-attention large language model into a sparse-attention one in only a few hundred training steps. Applied to Qwen3-family models, the authors report a prefill speedup of up to 9.36x at one million tokens of context versus FlashAttention-2, plus roughly 2.01x faster decoding, while claiming near-lossless accuracy on long-context and reasoning benchmarks. This is a not-yet-peer-reviewed academic paper, not a shipped product or an official Qwen feature.

Editorial note: This is an analytical explainer by Anthony Martinez (CEO & Founder, ThePlanetTools.ai). ThePlanetTools.ai has no affiliation with Alibaba, the Qwen team, or Nanjing University, and earns nothing from this piece. There are no affiliate links here. The findings below come from a preprint and are reported as the authors' claims, not independently reproduced results. Internal links point to our own editorial coverage.

What the paper actually claims

On May 16, 2026, a group of researchers affiliated with Nanjing University and Alibaba Group posted a preprint to arXiv with a deliberately provocative title: "Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps." The method they describe is called RTPurbo, and the headline number is a prefill speedup of up to 9.36x at one million tokens of context, measured against FlashAttention-2, the de facto baseline for fast attention on modern GPUs.

I want to be precise about what this is before we get into why it matters. This is a research paper — a preprint that has not yet been peer-reviewed — from a university lab working alongside engineers at Alibaba. It is not a feature that Alibaba has shipped, and it is not something the Qwen team has folded into a public model. The Qwen3 models in the paper are used as experimental subjects, because they are strong open-weight models that anyone can download and modify. When you read "9.36x," read it as "the authors report 9.36x in their own experiments," because no one outside this group has reproduced it yet.

With that framing in place, the claim is still striking. The authors say they can take an existing full-attention model and convert it into a sparse-attention model in a few hundred training steps — not a full retrain costing millions of dollars, but a short, cheap fine-tuning pass — and come out the other side with a model that runs dramatically faster on long inputs while losing almost no accuracy. If that holds up, it changes the economics of long-context inference, which is the part of the AI bill that has been quietly ballooning over the past two years.

Prefill versus decode: where the time actually goes

To understand why a "prefill speedup" is a big deal, you have to understand how a language model spends its time when it answers you. There are two distinct phases, and they behave very differently.

The first phase is prefill. This is when the model reads your entire prompt — every token of it — before producing a single word of output. If you paste a 200-page contract, a long codebase, or an hour of meeting transcripts into the context window, prefill is the model chewing through all of that at once. The second phase is decode, where the model generates its response one token at a time, each new token attending back over everything that came before.

Here is the uncomfortable math. In a standard transformer, attention cost grows quadratically with the length of the input. Double the context and you roughly quadruple the prefill work. The reason is that every token attends to every other token: with N tokens, that is N times N comparisons. At a few thousand tokens this is fine. At one million tokens — the kind of context window that frontier models now advertise — it becomes the dominant cost of the whole request, often by a wide margin.

Diagram contrasting the prefill phase, which scales quadratically with input length, against the decode phase, which generates one token at a time — Prefill reads the whole prompt at once and scales quadratically; decode generates one token at a time.

This is exactly why the long-context era has been more expensive than the marketing suggests. We covered the architectural side of this when DeepSeek V4 launched with a hybrid attention design and a 1M-token context window: the headline feature is the context length, but the hidden story is what it costs to actually use it. RTPurbo is attacking that same prefill bottleneck, but from a different angle — not by designing a new architecture from scratch, but by cheaply transferring an existing model into a sparse regime.

Why one million tokens of context is so expensive

Let me put the quadratic problem in concrete terms, because it is easy to wave at "quadratic" and move on without feeling it.

Suppose a model handles a 32,000-token prompt comfortably. Now extend that to one million tokens, roughly a 31x increase in length. Because attention scales with the square of the length, the raw attention work does not grow 31x — it grows closer to 31 times 31, or nearly a thousand-fold, before you account for the optimizations that real systems use. That is the wall. It is why a million-token request can cost orders of magnitude more compute than a short one, and why providers price long-context usage carefully.

There is a second cost that gets less attention in headlines: the KV cache. As the model processes your prompt, it stores a key and a value vector for every token in every attention layer so it does not have to recompute them. This is the KV cache, and it grows linearly with context length. At one million tokens it can swallow an enormous amount of GPU memory — frequently more than the model weights themselves. Memory pressure forces smaller batch sizes, which means fewer requests served per GPU, which means higher cost per query. So the long-context bill has two components: the quadratic compute of prefill, and the linear-but-huge memory of the KV cache.

Visualization of the quadratic attention cost curve rising sharply toward one million tokens of context, alongside the linearly growing KV cache memory block — The long-context bill has two parts: quadratic prefill compute and a large, linearly growing KV cache.

Any method that genuinely shrinks both — without wrecking accuracy — is economically interesting. The authors' reported numbers (9.36x faster prefill, a smaller KV cache footprint via selective retention) target precisely these two pain points. The honest caveat, again, is that these are single-paper numbers on a specific model family.

What sparse attention is, in plain language

Sparse attention is the family of techniques built on a simple observation: when a model is reading a long document, most tokens do not need to look at most other tokens. A word in chapter twelve rarely depends on a specific word in chapter two. Full attention computes every pairwise interaction anyway, which is exhaustive but wasteful. Sparse attention tries to compute only the interactions that matter and skip the rest.

The catch has always been the same: how do you decide, on the fly, which interactions matter? Pick wrong and you drop the one token that the answer actually depended on — and accuracy collapses on exactly the long-context retrieval tasks where long context is supposed to shine. Many sparse-attention schemes trade real accuracy for their speed, and that trade is why full attention has kept "striking back," to borrow the paper's phrase.

The interesting move in this preprint is the claim that you do not have to design a sparse model from the ground up and retrain it from scratch. Instead, you take a fully trained full-attention model that already knows how to use its context well, and you transfer that knowledge into a sparse configuration with a short fine-tuning pass. The authors lean on what they call the intrinsic sparsity of large language models — the empirical finding that trained models already concentrate most of their attention on a small subset of tokens, so the sparse structure is, in a sense, already latent in the weights. The method just makes it explicit and efficient.

How RTPurbo works: retrieval heads, a tiny indexer, and top-p selection

The method rests on three ideas that fit together. None of them is magic on its own; the combination is what produces the reported numbers.

The first idea is retrieval heads. Not all attention heads in a transformer do the same job. Prior research has shown that a small minority of heads are responsible for the model's ability to fetch specific information from far back in the context — the heads that actually carry long-range dependencies. RTPurbo keeps the full KV cache only for these retrieval heads, where dropping information would be catastrophic, and treats the rest more aggressively. That alone cuts a large share of the memory and compute, because the expensive full-attention behavior is reserved for the heads that genuinely need it.

The second idea is a lightweight token indexer. To decide which past tokens a query should attend to, you need a fast way to score relevance. Computing that in the model's full hidden dimension would be expensive — that is the very cost you are trying to avoid. So the authors use a tiny 16-dimensional indexer: a compressed representation that estimates which tokens are worth attending to, cheaply. Sixteen dimensions is small enough to be nearly free relative to the main computation, yet apparently expressive enough to pick the right tokens.

Grid of attention heads with only a few highlighted as retrieval heads, connected to a small sixteen-dimensional indexer and a top-p token selection badge — Full KV cache is kept only for retrieval heads, scored by a tiny 16-dimensional indexer with adaptive top-p selection.

The third idea is dynamic top-p token selection rather than a fixed top-k. A fixed top-k always keeps exactly the same number of tokens — say, the 256 highest-scoring — regardless of the situation. That is rigid: sometimes the answer depends on a handful of tokens, sometimes on a broad swath of the document. Top-p instead keeps as many tokens as needed to cover a target share of the attention mass, so the budget flexes with the input. When relevance is concentrated, it keeps few tokens and runs fast; when relevance is diffuse, it keeps more and protects accuracy. That adaptivity is the authors' answer to the classic sparse-attention failure mode.

Reading the benchmark numbers carefully

The speedup is not a single flat figure — it scales with context length, which is exactly what you would hope to see from a method targeting the quadratic term. The authors report roughly 2.83x faster prefill at 32,000 tokens, climbing to 9.36x at one million tokens, all measured against FlashAttention-2. On the decode side, where the bottleneck is memory bandwidth rather than quadratic compute, they report about 2.01x faster generation.

Line chart showing the reported prefill speedup rising from 2.83x at 32K tokens to 9.36x at one million tokens, measured against FlashAttention-2 — The reported speedup grows with context length — from 2.83x at 32K to 9.36x at 1M versus FlashAttention-2.

The shape of that curve is the tell. A method that only helped at short context would be uninteresting, because short context is already cheap. The fact that the advantage grows as the context grows — from under 3x at 32K to over 9x at 1M — is consistent with a technique that is genuinely cutting the quadratic prefill cost rather than shaving a constant factor. If the numbers hold, the method gets more valuable precisely where long-context inference hurts most.

On accuracy, the authors describe their results as near-lossless across long-context and reasoning benchmarks. This is the claim to scope most carefully. "Near-lossless" is the authors' characterization of their own measurements on the benchmarks they chose, using the Qwen3 models they tested. It is not an independent verdict, and it is not a guarantee that the method preserves accuracy on every task, every model, or every benchmark you might care about. Sparse-attention papers have a long history of looking near-lossless on the evaluations selected and then degrading on adversarial long-context retrieval. The honest reading is: promising, and worth reproducing.

Why the experiments use Qwen3 models

The paper runs its experiments on two members of the Qwen family: Qwen3-Coder-30B-A3B and Qwen3-30B-A3B-Think. The "A3B" notation refers to a mixture-of-experts design where the model has roughly 30 billion total parameters but activates only about 3 billion per token — a now-common pattern that keeps inference cheap relative to the parameter count.

The choice of Qwen3 is pragmatic. These are strong, openly available models that any researcher can download, modify, and run, which makes them ideal subjects for a method that fine-tunes and re-instruments an existing network. Alibaba's Qwen line has become one of the most-used open-weight families in research precisely because it is capable and accessible — we wrote about that momentum when Qwen 3.6 outscored Google's Gemma 4 on coding benchmarks. It is worth repeating that using Qwen3 as a testbed does not make this an official Qwen capability; the Qwen models here are the experiment, not the product.

The bigger pattern: Chinese labs and the efficiency frontier

Step back from the specific method and a pattern comes into view. Over the past two years, a striking share of the most aggressive efficiency research in large language models has come out of Chinese labs and the universities feeding into them. DeepSeek's attention and mixture-of-experts work, Alibaba's Qwen releases, and now a Nanjing University and Alibaba collaboration on cheap sparsification all point the same direction: squeeze more capability out of less compute.

A glass globe with one region glowing brighter, surrounded by research-paper cards and GPU chips, labeled as the efficiency frontier of large language model research — Under compute constraints, efficiency research becomes a competitive edge rather than a nice-to-have.

There is a structural reason for this emphasis. Export controls have constrained access to the highest-end accelerators in China, which makes raw compute scarcer and more expensive there than in the US. When you cannot simply buy your way past a bottleneck with more GPUs, efficiency stops being a nice-to-have and becomes the competitive edge. A method that delivers a 9x prefill speedup is, in that light, not just an academic result — it is a way to make a constrained GPU fleet behave like a larger one. We have written before about how efficiency research is becoming the real frontier of AI competition, and this preprint fits squarely in that story.

The open-weight angle matters too. Because the experiments run on downloadable Qwen3 models and the method is described in enough detail to reproduce, anyone with the GPUs and the patience can attempt to replicate it. That is the open-research flywheel that has made the open-weight ecosystem move so fast — the same dynamic we covered around Google's Apache-2.0 release of Gemma 4. Methods do not stay siloed; they get tried, broken, fixed, and folded into the next release across the whole field.

What a real 9x prefill speedup would change

Suppose, for a moment, that the numbers reproduce and the method generalizes. What actually changes downstream?

The most immediate effect is on the unit cost of long-context inference. If prefill at one million tokens gets 9x cheaper in compute, the per-request cost of feeding an entire codebase, a full legal discovery set, or a year of email threads into a model drops sharply. Applications that were technically possible but economically painful — whole-repository code analysis, document-set question answering, long-horizon agent memory — move closer to being routine rather than premium. The smaller KV-cache footprint compounds this: less memory per request means more requests packed onto the same GPU, which lowers cost per query independent of the raw speedup.

There is also a latency story. Prefill is the part of a long-context request the user waits through before any text appears. Cutting it materially shortens the awkward pause between submitting a giant prompt and seeing the first token. For interactive long-context tools, that responsiveness is often the difference between a feature people use and one they avoid. We have seen this efficiency-equals-accessibility pattern elsewhere — for example when Cohere shipped a 111B open-weights model that runs on just two GPUs with a 256K context window, the headline was not raw capability but the fact that the hardware bar had dropped.

The caveats, stated plainly

I have scattered cautions throughout, but they deserve a clean summary, because the gap between a striking preprint and a deployed capability is where a lot of AI hype goes to quietly die.

First, this is a preprint. It has not been peer-reviewed, and the benchmark results are mono-method: produced by the authors, on their setup, with no independent reproduction at the time of writing. Second, "near-lossless" is a claim about the specific benchmarks the authors chose, not a universal property; sparse-attention methods have a documented habit of looking clean on selected evaluations and then degrading on harder long-context retrieval. Third, the experiments cover two Qwen3 models, not a broad sweep of architectures, so generalization to other model families is an open question. Fourth, a 9.36x figure measured in a research harness is not the same as a 9.36x improvement in a production serving stack, where batching, quantization, and real traffic patterns all interact.

None of this means the work is unimportant. The cheap-transfer framing — convert an existing full-attention model to sparse in a few hundred steps, rather than retraining from scratch — is a genuinely useful idea if it holds, because it lets the whole open-weight ecosystem retrofit speed onto models people already trust. But the right posture toward a single, unreproduced preprint is curiosity with a hand on the brake, not a victory lap.

The bottom line

RTPurbo is a research method from Nanjing University and Alibaba Group, posted as an arXiv preprint in May 2026, that converts full-attention Qwen3 models into sparse-attention ones in a few hundred training steps and reports up to 9.36x faster prefill at one million tokens of context, plus roughly 2.01x faster decoding, with what the authors call near-lossless accuracy. The technical core — keep the full KV cache only for retrieval heads, score relevance with a tiny 16-dimensional indexer, and select tokens with adaptive top-p rather than fixed top-k — is elegant and targets the exact place long-context inference hurts: the quadratic cost of prefill.

If it reproduces, it is a meaningful nudge to the economics of long-context AI, and another data point in the broader story of Chinese labs leading on efficiency under compute constraints. If it does not, it joins the long list of sparse-attention methods that looked near-lossless until someone ran the hard retrieval tests. Either way, the question worth tracking is not whether attention can be made sparse — we have known that for years — but whether it can be made sparse cheaply, on models we already have, without quietly losing the long-range memory that made long context worth wanting in the first place. This preprint is a serious attempt at that question. It is not yet an answer.

Frequently asked questions

What is RTPurbo?

RTPurbo is a research method described in a May 2026 arXiv preprint (arXiv:2605.16928) by authors from Nanjing University and Alibaba Group. It converts a full-attention large language model into a sparse-attention one in only a few hundred training steps, rather than retraining from scratch. Applied to Qwen3-family models, the authors report up to a 9.36x prefill speedup at one million tokens of context versus FlashAttention-2. It is an academic preprint, not a shipped product or an official Qwen feature.

Did Alibaba ship RTPurbo as a product or add it to Qwen?

No. RTPurbo is described in a research paper, not deployed in a product. The authors are affiliated with Nanjing University and Alibaba Group, and they use the open-weight Qwen3 models as experimental subjects. There is no indication that the Qwen team has folded the method into a public model. It should be read as "researchers report," not "Alibaba shipped" or "Qwen now has."

What is the 9.36x speedup, exactly?

It is a prefill speedup that the authors report at one million tokens of context, measured against FlashAttention-2 (FA2). The advantage scales with context length: roughly 2.83x at 32,000 tokens, climbing to 9.36x at one million tokens. The paper also reports about 2.01x faster decoding. These are single-paper numbers on Qwen3 models and have not been independently reproduced at the time of writing.

What is the difference between prefill and decode?

Prefill is the phase where a model reads your entire prompt before producing any output; its cost grows quadratically with input length, so long prompts dominate the bill. Decode is the phase where the model generates its answer one token at a time, where the main bottleneck is memory bandwidth rather than quadratic compute. RTPurbo targets both, with a larger reported gain on prefill (up to 9.36x) than on decode (about 2.01x).

Why is one million tokens of context so expensive?

Standard transformer attention scales with the square of the input length, so going from 32,000 tokens to one million tokens increases attention work close to a thousand-fold before optimizations. On top of that, the KV cache — the keys and values stored for every token — grows linearly and can consume more GPU memory than the model weights, forcing smaller batches and raising the cost per query. Long context is expensive on both compute and memory axes.

What is sparse attention?

Sparse attention is a family of techniques based on the observation that, in long documents, most tokens do not need to attend to most other tokens. Instead of computing every pairwise interaction like full attention does, sparse attention computes only the interactions that matter and skips the rest. The hard part is deciding which interactions matter without dropping the one token an answer depends on, which is where many sparse methods lose accuracy.

How does RTPurbo decide which tokens to keep?

It combines three ideas: it keeps the full KV cache only for the small set of "retrieval heads" that carry long-range dependencies; it uses a lightweight 16-dimensional token indexer to cheaply score which past tokens are worth attending to; and it selects tokens with dynamic top-p rather than a fixed top-k, so the budget flexes with the input — keeping few tokens when relevance is concentrated and more when it is diffuse.

What are retrieval heads?

Retrieval heads are the minority of attention heads in a transformer that are mainly responsible for fetching specific information from far back in the context. Prior research found that only a small subset of heads carry these long-range dependencies. RTPurbo exploits this by retaining the full KV cache only for retrieval heads, where losing information would be catastrophic, and treating the remaining heads more aggressively to save memory and compute.

Why top-p token selection instead of top-k?

A fixed top-k always keeps the same number of tokens regardless of the situation, which is rigid: some answers depend on a handful of tokens and others on a broad swath of the document. Top-p instead keeps as many tokens as needed to cover a target share of the attention mass, so it runs fast when relevance is concentrated and protects accuracy when relevance is diffuse. The authors present this adaptivity as their answer to the classic sparse-attention failure mode.

Which models were tested in the paper?

The experiments use two members of the Qwen family: Qwen3-Coder-30B-A3B and Qwen3-30B-A3B-Think. The "A3B" notation refers to a mixture-of-experts design with roughly 30 billion total parameters but only about 3 billion activated per token. These are strong, openly available models, which makes them convenient subjects for a method that fine-tunes and re-instruments an existing network.

Is the "near-lossless accuracy" claim independently verified?

No. "Near-lossless" is the authors' characterization of their own measurements on the long-context and reasoning benchmarks they chose, using the Qwen3 models they tested. The benchmark results are mono-method — produced by the authors with no independent reproduction at the time of writing. Sparse-attention methods have a documented history of looking near-lossless on selected evaluations and then degrading on harder long-context retrieval tasks, so the claim should be scoped accordingly.

How does this relate to DeepSeek and other efficiency research?

It fits a broader pattern of Chinese labs leading on inference efficiency under compute constraints. DeepSeek V4 launched with a hybrid attention design and a 1M-token context window, and Alibaba's Qwen line is one of the most-used open-weight families in research. With export controls making high-end GPUs scarcer in China, efficiency becomes a competitive edge rather than a nice-to-have, which is part of why cheap-sparsification methods like RTPurbo are emerging from this ecosystem.

About this analysis: Written by Anthony Martinez, CEO & Founder of ThePlanetTools.ai. This explainer reports on a not-yet-peer-reviewed arXiv preprint and presents its results as the authors' claims, not independently verified findings. ThePlanetTools.ai has no affiliation with Alibaba, the Qwen team, or Nanjing University. Primary source: arXiv:2605.16928.

Alibaba and Nanjing University Turned Qwen3 Sparse in a Few Hundred Steps — and Hit a 9.36x Prefill Speedup at 1M Context