Gemini Omni: Google's New AI Video World Model (2026)

Gemini Omni is Google DeepMind's video world model from May 2026, generating 10-second 1080p clips with conversational editing for AI Plus, Pro, and Ultra users.

Gemini Omni is Google DeepMind's multimodal world model, announced May 19, 2026 at Google I/O alongside Gemini 3.5 Flash, unifying Veo, Imagen, Lyria and Nano Banana into one any-to-any video generator capped at 10-second 1080p clips. It is bundled into Google AI Plus ($20/month), Pro ($30/month) and Ultra ($100/month) via the Gemini app and Google Flow, with a developer API still pending as of June 2026.

Gemini Omni, announced by Google DeepMind on May 19, 2026, is a multimodal world model that generates and edits video from text, image, audio and video inputs. The Omni Flash variant generates a 5-second 1080p clip in under 15 seconds and caps each clip at 10 seconds. It is bundled into Google AI Plus ($20/month), Pro ($30/month) and Ultra ($100/month); a developer API was promised but remained unreleased as of June 2026.

Provider: Google DeepMind · Family: Gemini Omni

Input modalities: text, image, audio, video · Output: video, audio, image, text

About Gemini Omni

Gemini Omni is a multimodal "world model" introduced by Google DeepMind at Google I/O on May 19, 2026, alongside Gemini 3.5 Flash and the Antigravity 2.0 developer platform. Unlike the numbered Gemini reasoning line (3, 3.1, 3.5), Omni is a separate generative system that unifies capabilities previously split across Veo (video), Imagen (image), Lyria (audio) and Nano Banana (image editing) into a single any-to-any reasoning pass. The first publicly available variant is Gemini Omni Flash, a faster, lower-cost tier aimed at consumer-scale rollout; Google has signaled that a higher-quality "Omni Pro" tier will follow. The headline feature is treating video as a first-class output: Omni accepts text, images, audio, and existing video clips as input and produces new or edited video grounded in Gemini's world knowledge, with consistent objects, characters and physics across frames.

Because Omni's outputs are video and audio rather than text completions, it has not been run through the standard LLM benchmark suite (SWE-bench Verified, GPQA Diamond, AIME, MMLU-Pro, ARC-AGI 2), and Google has not published comparable scores. Early hands-on reviews instead grade it on generative-video metrics. Testers reported that it scored noticeably better than prior Google video models on "object permanence" checks, where a character that walks behind an obstacle re-emerges with the same clothing, face and proportions. This is a common failure mode for earlier diffusion-based video generators such as Veo 3.1 and rivals like Seedance. No independently verified numeric score for these tests has been published as of June 2026.

Google has not disclosed a token-based context window for Omni; its limits are expressed as clip length and resolution instead. Omni Flash is capped at 10-second clips at up to 1080p. Google DeepMind researcher Nicole Brichtova told TechCrunch that the 10-second ceiling is a deployment choice to manage compute demand during rollout rather than an architectural limit, implying that longer outputs are possible once capacity allows. A 5-second 1080p preview clip generates in under 15 seconds on Omni Flash, fast enough for iterative, conversational editing sessions.

Omni accepts text, static images, audio clips and video as input, and can combine several of these in one prompt: for example, a photo plus a voice note plus a reference clip, with a request for a new video that incorporates all three. Its standout feature is conversational editing. After a clip is generated, a user can request follow-up changes such as swapping the location, weather, or a character's outfit without re-describing the whole scene, and the model preserves character identity and scene context across the edit chain. Omni does not expose function calling, tool use, or code execution. It is a generation and editing model, not an agentic reasoning model; those capabilities remain with the Gemini 3.5 line.

As of June 2026, Gemini Omni Flash is bundled into Google's existing consumer subscriptions rather than billed separately: it is included for Google AI Plus ($20/month), AI Pro ($30/month) and AI Ultra ($100/month) subscribers through the Gemini app and Google Flow. Google also offers it free on YouTube Shorts and the YouTube Create app, though the free tier is rationed to roughly 50 Flow credits per day, enough for one or two short generations. A standalone developer API via Vertex AI was promised "in the coming weeks" at I/O but had not shipped as of this writing, so no official per-token or per-second API price exists yet. Third-party estimates project $1.50 to $2.50 per 1M input tokens and $0.20 to $0.60 per second of generated video, anchored to existing Veo 3.1 and Gemini 3.5 Flash rates, but these are unconfirmed projections, not Google pricing.

The only confirmed access paths today are the Gemini consumer app, Google Flow (Google's AI filmmaking tool), and the YouTube Shorts Remix and YouTube Create apps. There is no self-hosting option: Omni is closed-weight and available only through Google-hosted surfaces, consistent with the rest of the proprietary Gemini line (Google's open-weight releases are branded separately as Gemma). Vertex AI access is the expected enterprise on-ramp once the API ships, following the pattern of prior Gemini launches where consumer access preceded Vertex AI by several weeks.

Google did not publish a dedicated system card for Omni at launch. Generated video and audio outputs carry Google's SynthID invisible watermark for provenance tracking, consistent with Google's policy for Veo- and Imagen-derived generative media. Beyond the SynthID disclosure, no Omni-specific red-teaming partners, refusal-rate benchmarks, or training-data cutoff have been disclosed. The model inherits Google DeepMind's general responsible-AI commitments, but Omni-specific documentation was still pending as of June 2026.

Omni is best suited for short-form social video creators, marketers producing quick ad variations, and YouTube Shorts creators who want to generate or edit a clip conversationally without a traditional editor. It is not yet a fit for production pipelines that need an API: there is no programmatic access, the 10-second clip cap rules out longer-form content, and pricing for any future API tier is unknown. Teams budgeting a video workflow should provisionally plan around Veo 3.1 (which has a published API) or wait for Omni Pro and the Vertex AI release. Teams needing text-output reasoning over multimodal inputs, such as document QA or coding agents, should use Gemini 3.5 Flash or 3.1 Pro instead, since Omni does not produce text completions or support tool use.

Gemini Omni Flash is explicitly positioned as the first step in a longer rollout. Google has signaled that an "Omni Pro" tier is coming, likely with longer clip durations and higher resolution, and a Vertex AI developer API is expected within weeks of the May 19, 2026 announcement. Until those ship, Omni should be treated as a consumer-facing preview of Google's "world model" direction rather than a finished platform component.

Pricing

No standalone API pricing exists as of June 2026. Omni Flash is bundled into Google AI Plus ($20/month), AI Pro ($30/month) and AI Ultra ($100/month) via the Gemini app and Google Flow. Third-party projections for a future API anchor to Veo 3.1 / Gemini 3.5 Flash rates ($1.50-$2.50 per 1M input tokens, $0.20-$0.60 per second of video output) but these are unconfirmed.
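To make the projected ranges concrete, the arithmetic can be sketched as below. This is a rough budgeting aid under stated assumptions, not a price calculator: the rates are the unconfirmed third-party projections quoted above, and the function name, constants and the idea of summing a per-second and a per-token charge are illustrative assumptions rather than anything Google has published.

```python
# Unconfirmed third-party projections quoted in this article, NOT Google pricing.
PROJECTED_USD_PER_SECOND = (0.20, 0.60)          # (low, high) per second of video
PROJECTED_USD_PER_1M_INPUT_TOKENS = (1.50, 2.50)  # (low, high) per 1M input tokens
MAX_CLIP_SECONDS = 10                             # Omni Flash clip cap


def projected_clip_cost(seconds, input_tokens=0):
    """Return a (low, high) projected USD cost for one clip.

    Hypothetical helper: assumes a future API would bill video seconds
    plus input tokens, which has not been confirmed.
    """
    if not 0 < seconds <= MAX_CLIP_SECONDS:
        raise ValueError(f"clip length must be in (0, {MAX_CLIP_SECONDS}] seconds")
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")

    millions_of_tokens = input_tokens / 1_000_000
    low = seconds * PROJECTED_USD_PER_SECOND[0] \
        + millions_of_tokens * PROJECTED_USD_PER_1M_INPUT_TOKENS[0]
    high = seconds * PROJECTED_USD_PER_SECOND[1] \
        + millions_of_tokens * PROJECTED_USD_PER_1M_INPUT_TOKENS[1]
    # Round to avoid floating-point noise in the reported range.
    return round(low, 4), round(high, 4)


low, high = projected_clip_cost(10)
print(f"10-second clip: ${low:.2f} to ${high:.2f}")  # 10-second clip: $2.00 to $6.00
```

On these projections a maximum-length 10-second clip would land between $2 and $6 in video charges alone, so a single clip could cost a meaningful fraction of a $20 monthly subscription if the estimates hold.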

Frequently Asked Questions

What is Gemini Omni and who built it?

Gemini Omni is a multimodal 'world model' built by Google DeepMind and announced on May 19, 2026 at Google I/O, alongside Gemini 3.5 Flash and the Antigravity 2.0 developer platform. It unifies capabilities previously split across separate Google models (Veo for video, Imagen for images, Lyria for audio, and Nano Banana for image editing) into a single any-to-any reasoning pass. The first public variant is Gemini Omni Flash, a faster and lower-cost tier aimed at consumer-scale rollout, with a higher-quality 'Omni Pro' tier signaled for later. Omni's headline ability is treating video as a first-class output: it can take text, images, audio and existing video clips as input and produce a new or edited video that stays consistent with Gemini's world knowledge. Google has not disclosed a parameter count or architecture details beyond describing it as a unified multimodal system. It sits alongside, not inside, the numbered Gemini reasoning line (3, 3.1, 3.5), which remains text-output focused. Its closest competitors are OpenAI's Sora 2 and other video-generation models such as Seedance.

How much does Gemini Omni cost?

As of June 2026, Gemini Omni Flash has no standalone per-token or per-second API price; Google has not published developer API pricing. Instead, it is bundled into Google's existing consumer subscriptions: Google AI Plus at $20 per month, AI Pro at $30 per month, and AI Ultra at $100 per month, all accessed through the Gemini app and Google Flow. Google also made Omni Flash free on YouTube Shorts Remix and the YouTube Create app, though free usage is rationed to roughly 50 Flow credits per day, enough for about one to two short generations. A developer API via Vertex AI was promised 'in the coming weeks' at I/O 2026 but had not shipped as of this writing. Third-party analysts have projected future API rates of $1.50 to $2.50 per 1 million input tokens and $0.20 to $0.60 per second of generated video, anchored to existing Veo 3.1 and Gemini 3.5 Flash pricing, but these are unconfirmed estimates, not official Google pricing. There is no self-hosting option since Omni is closed-weight.

What is Gemini Omni's context window and output limit?

Google has not published a token-based context window for Gemini Omni, unlike the numbered Gemini reasoning models (Gemini 3.1 Pro reportedly supports up to 2 million tokens). Instead, Omni's limits are expressed in terms of video clip length and resolution. The Omni Flash variant is hard-capped at 10-second clips at up to 1080p resolution. Google DeepMind researcher Nicole Brichtova told TechCrunch this 10-second ceiling is a deployment choice to manage compute demand during the initial rollout, not a fixed architectural limit, suggesting longer clips could become available as capacity increases. In practice, a 5-second 1080p preview clip generates in under 15 seconds on Omni Flash. There is no separate 'extended output' tier disclosed yet, and no information on how Omni handles multi-file or long-document inputs the way text-based Gemini models do.
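Because the limits are stated as clip length and resolution rather than tokens, a planning check is simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 1080p limit is modeled as a maximum frame height of 1080 pixels, and producing a longer piece by generating several separate clips and joining them in an external editor is an assumption, not a workflow Google has described.

```python
import math

# Omni Flash limits as reported in this article (June 2026).
MAX_CLIP_SECONDS = 10
MAX_HEIGHT_PX = 1080  # assumption: "up to 1080p" modeled as frame height


def fits_omni_flash(seconds, height_px):
    """True if a single requested clip is within the reported caps."""
    return 0 < seconds <= MAX_CLIP_SECONDS and 0 < height_px <= MAX_HEIGHT_PX


def clips_needed(target_seconds):
    """Number of separate capped clips needed to cover a target duration.

    Hypothetical planning helper: assumes clips are joined outside Omni.
    """
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return math.ceil(target_seconds / MAX_CLIP_SECONDS)


print(fits_omni_flash(5, 1080))   # True
print(fits_omni_flash(12, 1080))  # False
print(clips_needed(45))           # 5
```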

How does Gemini Omni compare to Gemini 3.5 Flash and OpenAI's Sora 2?

Gemini Omni and Gemini 3.5 Flash are different kinds of models released at the same Google I/O 2026 event. Gemini 3.5 Flash is a text-output reasoning model that accepts text, code, images, audio, video and PDFs and scores 78% on SWE-bench Verified and 90.4% on GPQA Diamond, but it cannot generate images, audio or video. Gemini Omni is the inverse: it is a generative video, audio and image model that does not produce text completions, function calls, or structured output at all. Compared to OpenAI's Sora 2, the closest direct competitor, Gemini Omni's distinguishing feature is conversational editing, the ability to revise an already-generated clip with follow-up instructions while preserving character identity, rather than re-generating from scratch. Neither Omni nor Sora 2 publishes directly comparable numeric quality benchmarks as of June 2026, so comparisons rely on hands-on reviews, which reported Omni performing well on 'object permanence' consistency tests relative to prior Google video models.

Is Gemini Omni open source or proprietary?

Gemini Omni is fully proprietary and closed-weight. Google does not release model weights or architectural details for any model in the Gemini line, including Omni; that is reserved for Google's separately branded Gemma open-weight model family. There is no Hugging Face listing, no downloadable checkpoint, and no quantized or self-hosted deployment option for Omni. Access is entirely through Google-controlled surfaces: the consumer Gemini app, Google Flow, and the YouTube Shorts Remix and YouTube Create apps. A developer-facing API via Vertex AI was promised at I/O 2026 but was not live as of June 2026, and even once it ships, it will be a hosted API rather than a downloadable model. There are no commercial-use restrictions specific to Omni beyond Google's standard generative AI usage policies and the SynthID watermarking applied to outputs.

What modalities does Gemini Omni support?

Gemini Omni accepts text, static images, audio clips and video as input, and can combine several of these in a single prompt, for example a reference photo, a voice-over audio clip and a short text instruction together. Its outputs are primarily video with synchronized audio, along with supporting image and text elements for the conversational editing interface. All confirmed modalities are live in the Gemini app and Google Flow as of June 2026; no modalities have been described as 'coming soon' beyond the broader Omni Pro tier. Omni does not support function calling, tool use, structured JSON output, code execution, or web browsing; those capabilities belong to the separate Gemini 3.5 reasoning line. There is no computer-use or agentic loop support in Omni. It is purely a generation and editing model, and its main capability is conversational, multi-turn editing of previously generated video.

Does Gemini Omni train on user data, and what is its data policy?

Google has not published an Omni-specific data retention or training policy as of June 2026. Generated video and audio outputs from Omni carry Google's SynthID invisible watermark for provenance tracking, consistent with Google's policy for Veo- and Imagen-derived media, which helps identify AI-generated content even after the file is shared or edited. Beyond the SynthID disclosure, Omni inherits the general data handling and Activity controls of the consumer Gemini app, which let users review and delete stored prompts and generated media. No SOC 2, ISO 27001, HIPAA, or GDPR compliance statements specific to Omni have been published, and no enterprise zero-retention option has been announced, likely because there is no enterprise API yet. Once Vertex AI access ships, it would be expected to inherit Google Cloud's existing enterprise data governance commitments, but this has not been confirmed for Omni specifically.

Who is Gemini Omni best for, and who should avoid it?

Gemini Omni is best for short-form social video creators making YouTube Shorts or similar vertical clips, marketers who want to quickly prototype video ad variations, and Google AI Plus, Pro or Ultra subscribers already using Google Flow for AI filmmaking. Its conversational editing, where a generated clip can be revised with follow-up instructions like changing the weather or a character's outfit while keeping identity consistent, is its strongest differentiator for iterative creative work. Teams that need a developer API for automated video pipelines should avoid Omni for now, since no Vertex AI or API access existed as of June 2026; Veo 3.1 remains the API-accessible alternative. Anyone needing video longer than 10 seconds per clip is also blocked by Omni Flash's hard cap. Finally, teams needing text reasoning, coding assistance, or agentic tool use should use Gemini 3.5 Flash or Gemini 3.1 Pro instead, since Omni produces no text completions and has no function calling.
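The guidance above amounts to a small decision table, which can be written out as follows. This is a reading aid under stated assumptions, not an official selection tool: the `Workload` fields and the `recommend` function are hypothetical names, and the order in which the checks are applied is an editorial choice.

```python
from dataclasses import dataclass

OMNI_FLASH_MAX_SECONDS = 10  # Omni Flash clip cap reported in this article


@dataclass
class Workload:
    """Hypothetical description of what a team needs."""
    needs_text_or_tools: bool = False  # text reasoning, coding, function calling
    needs_api: bool = False            # automated pipeline, programmatic access
    clip_seconds: float = 10.0         # longest single clip required


def recommend(w: Workload) -> str:
    """Map a workload to the option this article suggests (as of June 2026)."""
    if w.needs_text_or_tools:
        # Omni produces no text completions and has no function calling.
        return "Gemini 3.5 Flash or Gemini 3.1 Pro"
    if w.needs_api:
        # No Vertex AI or API access for Omni yet; Veo 3.1 has an API.
        return "Veo 3.1"
    if w.clip_seconds > OMNI_FLASH_MAX_SECONDS:
        return "not Omni Flash (10-second cap)"
    return "Gemini Omni Flash"


print(recommend(Workload()))                   # Gemini Omni Flash
print(recommend(Workload(needs_api=True)))     # Veo 3.1
print(recommend(Workload(clip_seconds=30)))    # not Omni Flash (10-second cap)
```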
