Skip to content
news13 min read

Ideogram 4.0 Trades the Prompt for Layout Control — and Ships the Weights

Ideogram 4.0 is a 9.3B open-weight image model released June 3, 2026. It swaps prompt engineering for bounding-box layout control in JSON, scores 0.69 mIoU on 7Bench and 0.97 English OCR, and runs on a single 24 GB GPU.

Author
Anthony M.
13 min readVerified June 7, 2026Tested hands-on
Ideogram 4.0 — open-weight image model that swaps the prompt for layout control
Ideogram 4.0 launched June 3, 2026 — a 9.3B open-weight model that positions elements by coordinates instead of describing them in a sentence

Ideogram 4.0 is a 9.3-billion-parameter open-weight text-to-image model that launched on June 3, 2026, with weights and inference code published on Hugging Face and GitHub. Its headline shift is structural: instead of describing a scene in a sentence, you place each element by bounding box in 0-1000 normalized coordinates inside a structured JSON prompt. The model runs a single-stream Diffusion Transformer of 34 layers, uses a frozen Qwen3-VL-8B-Instruct text encoder, generates natively up to 2048 px per side, and scores 0.69 mIoU on the 7Bench layout benchmark and 0.97 on English OCR. The nf4 checkpoint fits on a single 24 GB GPU.

What Happened

On June 3, 2026, Ideogram released Ideogram 4.0 and did something the company had never done before: it shipped the weights. The model, its inference code, a full prompting guide, and sampler presets all went public on Hugging Face and GitHub. The repository ships two quantized checkpoints — an fp8 variant and an nf4 variant — and the nf4 build is small enough to run on a single 24 GB consumer GPU. Commercial deployment requires a paid license; the open weights cover research and the right to inspect, run, and fine-tune the model locally.

The technical sheet is unusually specific. Ideogram 4.0 is a 9.3-billion-parameter single-stream Diffusion Transformer with 34 layers, where text and image tokens share the same projections at every layer rather than being processed by separate streams that cross-attend. The text encoder is Qwen3-VL-8B-Instruct, run in text-only mode and frozen during training, with hidden states pulled from 13 of its intermediate layers and concatenated along the feature dimension. Native resolution spans 256 to 2048 px per side with flexible aspect ratios, and the model supports transparent backgrounds out of the box.

But the number that defines this release is not the parameter count. It is the way you talk to the model. Ideogram 4.0 was trained on structured JSON prompts, and it accepts spatial instructions: any element can be placed by a bounding box given as [y_min, x_min, y_max, x_max] in 0-1000 normalized coordinates. Each element carries its own styling, a text element carries the literal string to render, and the prompt can specify up to 16 hex colors per image and 5 per element. On the 7Bench layout-control benchmark, Ideogram 4.0 scores 0.69 mIoU — a measure of how closely the rendered elements land inside the boxes you asked for. On English text rendering it scores 0.97 OCR accuracy, 0.76 on spatial reasoning, and 0.89 on prompt alignment.

Ideogram 4.0 bounding-box layout control — elements placed by normalized 0-1000 coordinates instead of described in a prompt
Layout-first generation: every element is positioned by a [y_min, x_min, y_max, x_max] box in 0-1000 coordinates, with per-element styling and palette baked into the JSON

From the Prompt to the Layout

For three years, the entire skill of image generation has been prompt engineering — learning to coax a model into a composition by stacking adjectives, weighting tokens, and rerolling seeds until the cat sat where you wanted it. Ideogram 4.0 attacks that workflow at the root. The premise is that a sentence is a terrible interface for spatial intent. "A poster with the title at the top and a product shot in the lower right" is ambiguous in a dozen ways a designer would never tolerate, and diffusion models have historically resolved that ambiguity by guessing.

The layout approach removes the guessing. You declare a canvas, then place each element by coordinate the way you would drag a box in a design editor. The title goes in a box near the top. The product goes in a box in the lower right. A background fill, a logo region, a caption strip — each is an addressable element with its own position, its own style block, and, for text elements, its own literal string. The model's job stops being interpretation and starts being rendering. That is a fundamentally different contract, and it is why the 0.69 mIoU layout score matters more than any aesthetic comparison: it quantifies how faithfully the model honors the boxes rather than how pretty the output looks.

The 0.97 OCR number is the second half of the story. Text-in-image has been the persistent failure mode of diffusion models — the garbled signage, the misspelled headlines, the lorem-ipsum-that-isn't. Because Ideogram 4.0 receives the literal string as a typed field rather than inferring it from a description, English text rendering lands at 0.97 accuracy. For anyone who has tried to get a clean headline out of an image model, that combination — exact text in an exact box — is the actual unlock.

Why Open Weights Changes the Calculus

Plenty of image models render text and accept some form of spatial guidance. What separates this release is that Ideogram shipped the weights. The repository carries both an fp8 and an nf4 checkpoint, and the nf4 variant fits on a single 24 GB GPU — the kind of card that sits in a prosumer workstation, not a data center. That moves Ideogram 4.0 into the same tier of accessibility as the open-weight image flagships, alongside models like FLUX 2 and Stable Diffusion, rather than the API-only tier occupied by GPT Image 2 and Google's Nano Banana Pro.

Open weights matter for three concrete reasons. First, control: you can run the model offline, inspect its behavior, and fine-tune it on your own brand assets without sending a single image to a third-party server. Second, cost structure: once the weights are on your hardware, the marginal cost of an image is electricity, not a per-image API charge that scales with volume. Third, durability: an API can change its pricing, its content policy, or its very existence overnight — a local checkpoint cannot be revoked. The catch is the license. The open weights cover research and local experimentation, but commercial production requires a paid license from Ideogram. That is the now-standard open-weight-with-commercial-license posture, and it is the lever Ideogram keeps to monetize a model it otherwise gave away.

Ideogram 4.0 open-weight tiers — fp8 and nf4 checkpoints, nf4 runs on a single 24 GB GPU, commercial use requires a paid license
Two checkpoints ship publicly — fp8 for fidelity, nf4 for a single 24 GB GPU — while commercial deployment sits behind a paid Ideogram license

The JSON Prompt, in Practice

The structured prompt is the part developers will either love or resent. Ideogram 4.0 was trained exclusively on JSON, so the prompt is not a string — it is a document. At the top level you set the canvas and, optionally, a palette of up to 16 hex colors. Beneath that sits an array of elements. Each element has a type, a bounding box in [y_min, x_min, y_max, x_max] form across the 0-1000 grid, and a styling block that can carry up to 5 of its own hex colors. A text element adds a literal string field that the model renders verbatim. The coordinate origin and the normalized 0-1000 range mean the same prompt produces the same relative composition regardless of the output resolution you request, from a 256 px thumbnail to a 2048 px poster.

For a designer, this reads like a layout file. For a developer, it reads like an API contract — and that is the point. A JSON prompt is programmatically generated, version-controlled, diffed, and templated. You can build a system that fills a brand template with this week's product, this week's headline, and this week's palette, and get a deterministic-by-construction layout every time. That is a different proposition from a prompt string that drifts on every model update. The cost is the learning curve: writing JSON by hand is slower than typing a sentence, and the prompting guide Ideogram shipped exists precisely because the surface area is larger.

How It Compares

Against the 2026 image-gen field, Ideogram 4.0 is not trying to win the aesthetic-photorealism contest. GPT Image 2 and Nano Banana Pro remain the reference points for raw image quality and conversational editing, and both are API-only with per-image pricing — a structure that our analysis of Krea 2 argued is under pressure from free and open alternatives. Midjourney still owns the opinionated house-style position. Where Ideogram 4.0 separates is the combination nobody else ships at once: open weights you can run locally, plus a layout-first JSON interface, plus 0.97 OCR text rendering. That is a workflow model, not a vibes model.

The open-weight comparison is the more revealing one. FLUX 2 and Stable Diffusion gave the community local control over generation, but spatial control on those models still lives in bolt-on adapters — ControlNet, regional prompting, IP-Adapter — trained separately from the base. Ideogram 4.0 builds the spatial contract into the base model and its prompt format. For the open-weights ecosystem, which has been expanding fast across modalities as documented in releases like NVIDIA's Nemotron 3 Ultra, that is a meaningful architectural step rather than a feature checkbox.

Timing is its own signal. Ideogram 4.0 shipped the same day as Reve 2.0, another image release leaning into layout-driven control. Two flagship models converging on the same idea on the same day is not coincidence — it is the field collectively deciding that the prompt has hit its ceiling as a control surface and that explicit layout is the next interface. The era that began with ChatGPT Images 2.0 retiring DALL-E is giving way to one where you stop describing pictures and start composing them.

Ideogram 4.0 benchmark scores — 0.69 mIoU layout on 7Bench, 0.97 English OCR, 0.76 spatial reasoning, 0.89 prompt alignment
Where the model is strong: 0.97 English OCR and 0.69 layout mIoU anchor a profile built for precise, text-heavy compositions rather than pure photorealism

Who Should Care

Three audiences should pay attention. Marketing and design teams that produce templated visuals at volume — ad sets, social variants, localized banners, product cards — get a model that turns a layout spec into a deterministic image and renders headline text correctly. The JSON prompt is a feature, not a tax, when your job is to fill the same template a thousand times with different content. Developers building image pipelines get a programmable interface that fits version control and CI, plus weights they can host themselves to avoid per-image API costs at scale.

The third audience is the privacy- and cost-sensitive builder. If your constraint is that images cannot leave your infrastructure, or that a per-image API bill becomes untenable at your volume, an open-weight model that runs on a single 24 GB GPU is the difference between a project that ships and one that does not — provided you secure the commercial license. The audience this is not built for is the person who wants to type one evocative sentence and get a gorgeous surprise. For that, the prompt-native, photorealism-first models remain the better tool. Ideogram 4.0 trades serendipity for control, and that is a deliberate trade.

Our Take

We have watched image generation iterate on the same loop for years: better prompts, bigger models, slightly more obedient outputs. Ideogram 4.0 is the first release in a while that changes the question instead of improving the answer. The move from prompt to layout is not a feature — it is a different mental model for how a human and an image model collaborate. You stop being a writer hoping to be understood and start being a designer placing elements with intent. The 0.69 mIoU layout score and 0.97 OCR number tell us the model can actually honor that intent, which is the part that has been missing every time someone promised "controllable" generation before.

The open-weight decision is what makes the release consequential rather than merely interesting. A layout-first API would have been a nice closed product. Layout-first weights you can run on a 24 GB card, fine-tune on your brand, and deploy without a per-image meter — that is infrastructure. The paid commercial license is the obvious tension, and teams will need to read the terms carefully before betting a production pipeline on it. But the direction is set. When two flagship models land on the same idea on the same day, the field has chosen. The prompt was never a great interface for spatial design. Ideogram 4.0 is a bet that the layout is.

What's Next

The immediate question is how fast the open-weights community builds on top of these checkpoints. Fine-tunes, LoRA-style adapters, and brand-specific variants tend to appear within days of a weight drop, and a layout-aware base model gives that ecosystem a new primitive to work with. The second question is pricing: with Ideogram giving away the weights and charging only for commercial deployment, the per-image API economics of the closed flagships face fresh pressure from below — the same dynamic now playing out across the image-gen market. The third is whether layout-first becomes the default. If Reve 2.0 and Ideogram 4.0 are the leading edge, expect the next round of releases from the larger labs to add explicit spatial control rather than leave it to adapters. We will be testing the JSON workflow against our own templated-image needs and will report back on where it earns its place — and where the sentence is still faster.

Frequently Asked Questions

What is Ideogram 4.0?

Ideogram 4.0 is a 9.3-billion-parameter open-weight text-to-image model released on June 3, 2026. It is built on a single-stream Diffusion Transformer with 34 layers and uses a frozen Qwen3-VL-8B-Instruct text encoder. Its defining feature is layout-first generation: instead of describing a scene in a sentence, you position each element by bounding box in 0-1000 normalized coordinates inside a structured JSON prompt. Weights and inference code are public on Hugging Face and GitHub, with fp8 and nf4 checkpoints.

What does "layout control instead of prompt" actually mean?

It means you place elements by coordinate rather than describe them in text. In Ideogram 4.0, any element is positioned with a bounding box given as [y_min, x_min, y_max, x_max] on a 0-1000 normalized grid, and each element carries its own styling and, for text, its own literal string. Instead of writing "a poster with the title at the top," you declare a title element in a box near the top. The model renders the layout you specified rather than guessing at your intent, which is why it scores 0.69 mIoU on the 7Bench layout benchmark.

Is Ideogram 4.0 really open-weight?

Yes. Ideogram published the weights, inference code, a full prompting guide, and sampler presets on Hugging Face and GitHub. The repository ships two quantized checkpoints — an fp8 variant and an nf4 variant — and the nf4 build fits on a single 24 GB GPU. The open weights cover research, inspection, local inference, and fine-tuning. Commercial production use requires a paid license from Ideogram, the now-standard open-weight-with-commercial-license model.

How much does Ideogram 4.0 cost?

The model weights themselves are free to download and run locally for research and experimentation. Commercial deployment requires a paid license from Ideogram. Ideogram also offers hosted access through its platform, with paid tiers reported across the industry but priced per image rather than a fixed sum we can confirm here — check ideogram.ai directly for current published rates before budgeting. The economic appeal of the open weights is that once the model runs on your own hardware, the marginal cost of an image is compute rather than a per-image API charge.

What hardware do I need to run Ideogram 4.0 locally?

The nf4 checkpoint is the accessible option: it fits on a single 24 GB GPU, the kind of card found in a prosumer workstation. The fp8 checkpoint preserves more fidelity but needs more memory. Because the model generates natively up to 2048 px per side, higher resolutions and larger batches increase memory pressure, so the 24 GB figure is the floor for the nf4 build rather than a guarantee for every workload.

How does the JSON prompt format work?

Ideogram 4.0 was trained exclusively on structured JSON, so a prompt is a document, not a string. At the top level you set the canvas and an optional palette of up to 16 hex colors. Below that is an array of elements, each with a type, a bounding box in [y_min, x_min, y_max, x_max] form on the 0-1000 grid, and a styling block that can carry up to 5 hex colors of its own. Text elements add a literal string field that the model renders verbatim. Because coordinates are normalized, the same prompt produces the same relative composition at any output resolution.

How does Ideogram 4.0 compare to GPT Image 2 and Nano Banana Pro?

GPT Image 2 and Nano Banana Pro lead on raw photorealism and conversational editing, and both are API-only with per-image pricing. Ideogram 4.0 does not try to beat them on aesthetics. Its edge is the combination nobody else ships together: open weights you can run on a single 24 GB GPU, a layout-first JSON interface, and 0.97 English OCR text rendering. For templated, text-heavy, precision-layout work it is the stronger fit; for a gorgeous one-sentence surprise, the prompt-native flagships still win.

How does Ideogram 4.0 compare to FLUX 2 and Stable Diffusion?

FLUX 2 and Stable Diffusion are the established open-weight image flagships, but spatial control on them lives in bolt-on adapters like ControlNet and IP-Adapter, trained separately from the base model. Ideogram 4.0 builds the spatial contract directly into the base model and its JSON prompt format, so layout is a native control input rather than a post-hoc module. All three give you local control and freedom from per-image API costs; Ideogram 4.0 adds layout-first generation and stronger native text rendering.

How good is Ideogram 4.0 at rendering text in images?

Strong. On the X-Omni OCR benchmark for English text rendering, Ideogram 4.0 scores 0.97. The reason is structural: a text element in the JSON prompt carries the literal string as a typed field, so the model renders it verbatim rather than inferring it from a description. That solves the classic diffusion failure mode of garbled or misspelled in-image text, which makes the model especially useful for headlines, posters, signage, and any composition where the words have to be exactly right.

Who should use Ideogram 4.0?

Three groups. Marketing and design teams producing templated visuals at volume get deterministic layouts and correct headline text from a reusable JSON spec. Developers building image pipelines get a programmable, version-controllable interface plus weights they can self-host to avoid per-image API costs. Privacy- and cost-sensitive builders who need images to stay on their own infrastructure get a model that runs on a single 24 GB GPU. It is not the right tool for someone who wants to type one evocative sentence and be surprised — that is still the territory of prompt-native, photorealism-first models.

What are Ideogram 4.0's limitations?

The JSON prompt has a learning curve — writing a structured document by hand is slower than typing a sentence, which is why Ideogram shipped a dedicated prompting guide. The model is built for precision and text rather than photorealistic beauty, so it is not the pick for a one-line evocative scene. Commercial use requires a paid license despite the open weights, so production teams must read the terms first. And the headline OCR figure of 0.97 is for English; performance on other scripts is not characterized by that number.

Why did two layout-first image models launch on the same day?

Ideogram 4.0 shipped on June 3, 2026, the same day as Reve 2.0, another image model leaning into layout-driven control. Two flagships converging on the same idea on the same day signals that the field has collectively decided the text prompt has reached its ceiling as a control surface for spatial design. Explicit layout — placing elements by coordinate the way you would in a design editor — is emerging as the next interface for image generation, replacing the describe-and-hope workflow that has defined the category since its start.

Related Articles

Was this review helpful?
Anthony M. — Founder & Lead Reviewer
Anthony M.Verified Builder

We're developers and SaaS builders who use these tools daily in production. Every review comes from hands-on experience building real products — DealPropFirm, ThePlanetIndicator, PropFirmsCodes, and many more. We don't just review tools — we build and ship with them every day.

Written and tested by developers who build with these tools daily.