
xAI Grok Imagine Video 1.5 Hits No. 1 on the Image-to-Video Arena With Native Audio (June 2026)

xAI released Grok Imagine Video 1.5 in preview, an image-to-video model with native single-pass audio. At launch it ranked No. 1 on the Artificial Analysis Image-to-Video Arena, 52 Elo points above v1.0 and ahead of Seedance 2.0 and Google Veo. Output is 720p or 480p at 24 fps, with clips up to 15 seconds. Pricing on fal.ai: $0.14 per second for 720p, $0.08 per second for 480p.

Author: Anthony M.
12 min read · Verified June 7, 2026 · Tested hands-on
[Figure: Grok Imagine Video 1.5 takes the No. 1 spot on the image-to-video leaderboard with native audio]

Grok Imagine Video 1.5 is xAI's new image-to-video model, released in preview in late May 2026, that generates clips up to 15 seconds at 720p and 24 frames per second with native synchronized audio produced in a single pass. At launch it ranked No. 1 on the Artificial Analysis Image-to-Video Arena, gaining 52 Elo points over version 1.0 and edging out Seedance 2.0 and Google Veo. Pricing on fal.ai is $0.14 per second for 720p, $0.08 per second for 480p, plus $0.01 per input image.

We track the AI video generation race almost daily, and the pace right now is brutal. Three months ago Google Veo 3.1 and the now-shuttered OpenAI Sora were the names everyone benchmarked against. Then ByteDance's Seedance 2.0 surged. Now xAI has quietly slid a preview model to the top of the most-watched image-to-video leaderboard — and it did it with the one feature that has separated finished clips from raw demos all year: audio baked into the same generation pass as the pixels. This is an editorial breakdown of what shipped, why the ranking matters, and how it reshapes a market that was already moving faster than anyone could keep up with.

What Happened

xAI released Grok Imagine Video 1.5 in preview at the end of May 2026, with coverage from major outlets following on June 4. It is primarily an image-to-video model (you feed it a still frame and a prompt, and it animates the scene), but it also supports text-to-video generation from a written description alone. The model outputs at 720p or 480p, both at 24 frames per second, and produces clips with a configurable length between 6 and 15 seconds.

The headline capability is native audio. Grok Imagine Video 1.5 generates music, sound effects, and spoken dialogue with lip-sync in the same pass that creates the video, rather than stitching a soundtrack on afterward. For anyone who has spent the last year exporting silent clips and hand-syncing audio in a separate editor, that single-pass design is the part that changes the workflow, not just the spec sheet.

Beyond core generation, the model adds two practical features. Video extension lets you continue an existing clip past its original endpoint, and reference-guided generation lets you steer the output using a supplied reference frame so motion and style stay consistent. Both are aimed squarely at production use, where a single 15-second take is rarely the whole job.

The release lands in a market that has reshuffled twice already this year. OpenAI's Sora was shut down in March, and we documented the fallout when Kevin Weil, Bill Peebles and Srinivas Narayanan walked out as Sora died. Google moved aggressively to fill the gap, launching the budget tier we covered in Google Veo 3.1 Lite at $0.05 per second. Grok Imagine Video 1.5 is xAI's answer to both.

[Figure: Grok Imagine Video 1.5 leads the Artificial Analysis Image-to-Video Arena at launch, 52 Elo points above version 1.0 and ahead of Seedance 2.0 and Google Veo]

The Leaderboard Result That Got Everyone's Attention

At launch, Grok Imagine Video 1.5 ranked No. 1 on the Artificial Analysis Image-to-Video Arena. That is the result driving the conversation. The Arena ranks models by head-to-head human preference, where evaluators pick the better of two generated clips and the scores are aggregated into an Elo rating — the same comparative method used to rank chess players and, more recently, chat models.

The improvement over xAI's own previous version is the more telling number. Grok Imagine Video 1.5 gained 52 Elo points over version 1.0. In a preference-ranked Arena, a jump of that size between two releases of the same family is large; it reflects a clear, repeatable quality gap that human raters noticed again and again, not a coin-flip difference. The new model placed ahead of Seedance 2.0, ByteDance's strong contender, and ahead of Google Veo, the model that has set the bar for much of 2026.
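For a rough sense of what a 52-point gap implies, the sketch below applies the standard Elo expected-score formula with the conventional 400-point scale. That scale is an assumption borrowed from chess ratings; the Arena's exact Elo variant is not stated here, so read the result as an illustration rather than the Arena's own figure.

```python
def expected_win_rate(elo_gap: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head votes won by the higher-rated model.

    Standard Elo logistic curve; the 400-point scale is an assumption,
    not a documented Arena parameter.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


gap = 52  # Elo gain of version 1.5 over version 1.0, as reported above
print(f"{expected_win_rate(gap):.1%}")  # 57.4%
```

Under that assumption, raters would be expected to prefer version 1.5 over version 1.0 in roughly 57 of every 100 pairings.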

A word of editorial caution, because it matters. A top spot on a preference Arena is a snapshot, not a permanent crown. Leaderboards move when new models ship, and this is a preview release, which means behavior and ranking can shift before any general-availability version arrives. We are reporting where Grok Imagine Video 1.5 sits today and why that is notable — not declaring a settled winner of a race that has changed leaders three times this year.

Why Native Audio Is the Real Story

Resolution and frame rate are table stakes now. What still separates the field is whether a model can deliver a finished clip — picture and sound — in one shot. For most of this year, the workflow was the same regardless of which model you used: generate silent video, then build the audio separately in another tool, then sync the two by hand. That extra round trip is where production time disappears.

Grok Imagine Video 1.5 collapses that into a single generation pass. Music, sound effects, and lip-synced dialogue come out attached to the video. Google's Veo line has pushed native audio hard this year, and it is a major reason Veo set the standard. By matching single-pass audio and pairing it with a top Arena placement, xAI is competing on the exact axis that has defined the leading edge of AI video — not just chasing pixel counts.

The lip-sync dimension specifically is where this gets interesting for creators. A clip with a character that talks, where the mouth movements match the generated speech, is the difference between an animated still and something that reads as a real shot. Doing that in the same pass as the video — instead of generating the face, then the voice, then forcing them to agree — is a meaningfully harder problem, and it is the one xAI is claiming to have solved well enough to top the Arena.

[Figure: Grok Imagine Video 1.5 at a glance: 720p and 480p, 24 fps, clips up to 15 seconds, native synced audio]

The Specs, Without the Hype

Here is what Grok Imagine Video 1.5 actually does, stated plainly:

  • Mode: image-to-video (primary) plus text-to-video
  • Resolution: 720p or 480p
  • Frame rate: 24 frames per second
  • Clip length: configurable, 6 to 15 seconds
  • Audio: native, synchronized music, sound effects, and lip-synced dialogue in a single pass
  • Extra features: video extension and reference-guided generation
  • Status: preview

A few things stand out. The 15-second ceiling is generous for this generation of models — many competitors cap shorter, and longer single takes reduce the number of clips you have to stitch together. The 24 frames per second matches cinematic frame rate, which keeps motion looking filmic rather than over-smooth. And the dual 720p/480p output gives you a real cost lever, because the lower resolution is cheaper to run, which matters once you are generating at volume.
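The limits listed above can be captured as a small validation sketch. The `ClipRequest` class and its field names are hypothetical, invented for illustration; they are not the xAI or fal.ai request schema. Only the numeric limits come from the spec list.

```python
from dataclasses import dataclass

# Limits taken from the spec list above.
ALLOWED_RESOLUTIONS = {"720p", "480p"}
FPS = 24
MIN_SECONDS, MAX_SECONDS = 6, 15


@dataclass(frozen=True)
class ClipRequest:
    """Hypothetical request shape; not an official API schema."""
    resolution: str
    seconds: int

    def __post_init__(self) -> None:
        # Reject anything outside the published preview limits.
        if self.resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")
        if not MIN_SECONDS <= self.seconds <= MAX_SECONDS:
            raise ValueError(f"seconds must be between {MIN_SECONDS} and {MAX_SECONDS}")

    @property
    def frame_count(self) -> int:
        # Frames in the clip at the fixed 24 fps output rate.
        return self.seconds * FPS


print(ClipRequest("720p", 15).frame_count)  # 360
```

A maximum-length clip works out to 360 frames, which the model generates together with its audio track in one pass.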

What It Costs

Pricing comes from the fal.ai model card, where Grok Imagine Video 1.5 is offered per second of output. The 720p tier runs $0.14 per second, the 480p tier runs $0.08 per second, and each input image you supply adds $0.01.

Output      | Price            | Cost of a 15-second clip
720p video  | $0.14 per second | $2.10
480p video  | $0.08 per second | $1.20
Input image | $0.01 per image  | n/a

To put that in context: a full-length 15-second clip at 720p costs $2.10 before the input image, and dropping to 480p brings the same clip down to $1.20. For comparison, Google's budget tier, Veo 3.1 Lite, launched at $0.05 per second for 720p, so on raw per-second cost Veo's cheapest option undercuts Grok. The pitch for Grok Imagine Video 1.5 is not that it is the cheapest — it is that you are paying for the model that currently tops the preference Arena, with native audio included in that single per-second rate rather than billed as a separate audio workflow.
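The arithmetic above is easy to reproduce. The following is a minimal sketch using the fal.ai rates quoted in this article. The function names are illustrative, and prices are kept in integer cents to avoid floating-point rounding.

```python
# Per-second rates in cents, as quoted from the fal.ai model card above.
PRICE_CENTS_PER_SECOND = {"720p": 14, "480p": 8}
INPUT_IMAGE_CENTS = 1


def clip_cost_cents(resolution: str, seconds: int, input_images: int = 0) -> int:
    """Cost of one clip in cents: per-second rate plus per-image fee."""
    return PRICE_CENTS_PER_SECOND[resolution] * seconds + INPUT_IMAGE_CENTS * input_images


def dollars(cents: int) -> str:
    """Format an integer cent amount as a dollar string."""
    return f"${cents / 100:.2f}"


print(dollars(clip_cost_cents("720p", 15)))           # $2.10
print(dollars(clip_cost_cents("480p", 15)))           # $1.20
print(dollars(clip_cost_cents("720p", 15, 1)))        # $2.11
print(dollars(100 * clip_cost_cents("720p", 15, 1)))  # $211.00
```

The last line shows how quickly the per-second rate adds up: one hundred full-length 720p clips, each with one input image, come to $211.00 at the preview pricing.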

How It Compares to the Field

The image-to-video market in mid-2026 is crowded and genuinely competitive. Here is where Grok Imagine Video 1.5 sits against the names it is benchmarked against, on the facts available today.

Versus Google Veo. Veo set the 2026 standard, particularly for native audio, and our breakdown of Veo 3.1 Lite, Fast and Full covers how Google tiered its lineup by price and capability. Grok Imagine Video 1.5 placed ahead of Veo on the Arena at launch, but Veo's cheapest tier is less expensive per second, and Veo has a longer track record in production. This is the closest matchup.

Versus Seedance 2.0. ByteDance's Seedance 2.0 has been one of the strongest image-to-video models this year. Grok Imagine Video 1.5 ranked above it on the Arena at launch, which is the clearest signal we have that xAI's new model is operating at the front of the field rather than just near it.

Versus the rest. With Sora 2 discontinued, the open field includes Runway Gen-4.5, Kling, and Luma Ray 3, each with its own strengths in style, control, or ecosystem tooling. Open-source has a seat at the table too — we covered LTX-2.3, the free 4K model from Lightricks, which competes on cost and resolution rather than on the Arena. Grok Imagine Video 1.5 differentiates on the combination of top-ranked output and single-pass audio.

[Figure: The mid-2026 image-to-video field (Grok Imagine Video 1.5, Veo, Seedance, Runway, Kling, Luma) and where Grok Imagine Video 1.5 enters the race]

Why It Matters

The strategic read here is bigger than one model. xAI has spent 2026 pushing Grok into territory well beyond chat — we have covered Grok Build, its terminal coding agent, and Grok Skills, its cross-conversation memory feature. Topping the image-to-video Arena puts xAI in direct contention with Google in a category Google had been positioned to lead, and it does so by matching Google on the audio feature that defined the category's leading edge.

For creators and developers, the practical takeaway is simpler. Single-pass audio plus a 15-second clip length plus top-ranked visual quality means more of the job gets done in one generation, with fewer manual steps. That is the kind of incremental compression of the workflow that, over a few releases, turns AI video from a novelty into a default production tool. The cost is real — $2.10 for a full 720p clip is not trivial at scale — but for finished output rather than raw footage that still needs an audio pass, the math changes.

It also keeps the pressure on. Every time a new model takes the top spot, the others respond, and the cadence this year has been relentless. A preview model leading the Arena today guarantees that Google, ByteDance, and the rest will answer — which is the best possible outcome for anyone who actually uses these tools.

Our Take

We have watched enough of these launches to be wary of leaderboard headlines, and we will hold to that here. The No. 1 Arena placement is real and notable, and a 52 Elo jump over version 1.0 is a genuinely strong generational improvement that human raters confirmed repeatedly. But it is a preview, the Arena is a snapshot, and this market has changed leaders three times in 2026 already. We would not bet the farm on the ranking holding through to general availability.

What we are more confident about is the direction. Native single-pass audio is the feature that matters most right now, and xAI shipping a top-ranked model with it confirms that the whole field is converging on finished-clip generation rather than silent footage. That is the trend worth watching, and Grok Imagine Video 1.5 is a clean data point on it. We have not run extended production tests on the preview ourselves, so we are reporting the verified specs, pricing, and ranking rather than claiming hands-on benchmarks we do not have.

What's Next

Two things to watch. First, whether Grok Imagine Video 1.5 holds its Arena lead as Google, ByteDance, and others ship their next versions — and how the ranking moves when it exits preview. Second, the pricing trajectory: $0.14 per second for 720p sits above Veo's cheapest tier, and if xAI wants volume adoption, the per-second rate is the obvious lever to pull. We will update this piece as a general-availability version lands and as we get extended time with the model. For now, xAI has done something it has rarely done in video: shipped the model to beat.

Frequently Asked Questions

What is Grok Imagine Video 1.5?

Grok Imagine Video 1.5 is xAI's image-to-video model, released in preview in late May 2026. It animates a still image (or generates from a text prompt) into clips of up to 15 seconds at 720p or 480p and 24 frames per second, with native synchronized audio — music, sound effects, and lip-synced dialogue — produced in a single generation pass. At launch it ranked No. 1 on the Artificial Analysis Image-to-Video Arena.

How much does Grok Imagine Video 1.5 cost?

On the fal.ai model card, Grok Imagine Video 1.5 costs $0.14 per second for 720p output and $0.08 per second for 480p output, plus $0.01 per input image. A full 15-second 720p clip therefore costs $2.10, while the same clip at 480p costs $1.20, before the input image charge.

Is Grok Imagine Video 1.5 better than Google Veo?

At launch, Grok Imagine Video 1.5 ranked above Google Veo on the Artificial Analysis Image-to-Video Arena, which ranks models by human preference. However, Veo's cheapest tier, Veo 3.1 Lite, is less expensive at $0.05 per second for 720p versus Grok's $0.14 per second, and Veo has a longer production track record. Grok leads on the current Arena ranking; Veo leads on price and maturity. This is the closest matchup in the category.

Is Grok Imagine Video 1.5 better than Seedance 2.0?

Grok Imagine Video 1.5 ranked above ByteDance's Seedance 2.0 on the Artificial Analysis Image-to-Video Arena at launch. Seedance 2.0 has been one of the strongest image-to-video models of 2026, so placing ahead of it is the clearest signal that Grok's new model is operating at the front of the field. Rankings can shift as new versions ship, so treat the lead as a current snapshot rather than a permanent result.

Does Grok Imagine Video 1.5 generate audio?

Yes. Native audio is its headline feature. Grok Imagine Video 1.5 generates music, sound effects, and lip-synced dialogue in the same pass that creates the video, rather than requiring a separate audio workflow. This single-pass design is what lets it compete on the same axis as Google Veo, which set the 2026 standard for native audio in AI video.

How long can Grok Imagine Video 1.5 clips be?

Clip length is configurable between 6 and 15 seconds. The 15-second ceiling is generous for this generation of image-to-video models — many competitors cap shorter — which reduces how many clips you have to stitch together for longer sequences. Output is at 24 frames per second, matching cinematic frame rate.

What does the Elo gain over version 1.0 mean?

Grok Imagine Video 1.5 gained 52 Elo points over version 1.0 on the Artificial Analysis Image-to-Video Arena. Elo is a comparative rating from head-to-head human preference votes, the same method used to rank chess players and chat models. A 52-point jump between two releases of the same family is large, reflecting a clear, repeatable quality gap that raters noticed consistently rather than a marginal difference.

What resolution and frame rate does Grok Imagine Video 1.5 support?

Grok Imagine Video 1.5 outputs at 720p or 480p, both at 24 frames per second. The dual-resolution option is a cost lever: 480p is cheaper to run at $0.08 per second versus $0.14 per second for 720p, which matters when generating at volume. The 24 frames per second keeps motion looking filmic rather than over-smooth.

Can Grok Imagine Video 1.5 do text-to-video?

Yes. While it is primarily an image-to-video model — animating a supplied still frame — Grok Imagine Video 1.5 also supports text-to-video, generating a clip from a written description alone. It additionally offers video extension, to continue an existing clip past its endpoint, and reference-guided generation, to keep motion and style consistent with a supplied reference frame.

How does Grok Imagine Video 1.5 compare to Sora and Runway?

OpenAI's Sora was discontinued in March 2026, removing it from the field. Among active models, Runway Gen-4.5, Kling, and Luma Ray 3 each have strengths in style, control, or ecosystem tooling. Grok Imagine Video 1.5 differentiates on the combination of a top Arena ranking and native single-pass audio, while open-source options like LTX-2.3 compete on cost and resolution rather than on preference rankings.

Is Grok Imagine Video 1.5 generally available?

No. As of early June 2026, Grok Imagine Video 1.5 is a preview release. Preview models can change in behavior, pricing, and ranking before a general-availability version arrives. The specs, pricing, and Arena placement reported here reflect the preview at launch and may be updated when a stable version ships.

Why does native single-pass audio matter for AI video?

For most of 2026, the standard workflow was to generate silent video, then build audio separately, then sync the two by hand — a round trip that consumes production time. Grok Imagine Video 1.5 produces picture and sound together, so you get a finished clip from one generation. That compression of the workflow is what turns AI video from a novelty into a default production tool, and it is the axis where the leading models now compete.

Anthony M., Founder & Lead Reviewer, Verified Builder

We're developers and SaaS builders who use these tools daily in production. Every review comes from hands-on experience building real products — DealPropFirm, ThePlanetIndicator, PropFirmsCodes, and many more. We don't just review tools — we build and ship with them every day.

Written and tested by developers who build with these tools daily.