
Google Gemini Arrives — Multimodal AI Gets Real

·873 words·5 mins
Osmond van Hemert
AI Models & Releases - This article is part of a series.

Google just dropped Gemini, and the AI landscape shifted again. After months of speculation and what felt like an eternity of playing catch-up to OpenAI, Google DeepMind has released what they claim is their most capable model to date — one that natively understands text, code, audio, image, and video. Having watched this space for three decades now, I can tell you: the pace of change in 2023 has been unlike anything I’ve seen before.

What Makes Gemini Different

The key selling point here isn’t just benchmark performance — though Google is certainly eager to highlight that Gemini Ultra reportedly outperforms GPT-4 on several standard benchmarks including MMLU. What’s genuinely interesting is the architecture: Gemini was built from the ground up as a multimodal model, not a text model with vision bolted on afterward.

This matters more than it might seem at first glance. When you retrofit multimodal capabilities onto a text-first architecture, you get something that can process images but doesn’t truly reason across modalities. Google’s claim is that Gemini can natively interleave understanding across text, images, audio, and video in a way that feels more integrated. Whether that holds up in real-world usage remains to be seen — benchmarks and demos have a way of looking better than production reality.

The model comes in three sizes: Ultra (the full powerhouse), Pro (the balanced middle tier already rolling out in Bard), and Nano (designed to run on-device, specifically on the Pixel 8 Pro). That tiered approach is smart — it acknowledges that not every use case needs the biggest model, and on-device inference is where a lot of practical value lives.

The Developer Angle

For those of us building applications, the immediate impact comes through the Gemini API, available via Google AI Studio and Vertex AI. Gemini Pro is accessible now, and it slots into the space where many teams have been using GPT-3.5 Turbo — fast, capable, cost-effective.
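To give a sense of how lightweight the entry point is, here is a minimal sketch of a Gemini Pro call. It assumes the google-generativeai Python SDK and an API key from Google AI Studio; the Vertex AI route uses a different client and auth flow, so treat the details as illustrative rather than definitive.

```python
# Minimal sketch: calling Gemini Pro through the Google AI Studio SDK.
# Assumes `pip install google-generativeai` and a GOOGLE_API_KEY from AI Studio.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Summarize the trade-offs between cloud and on-device inference in three bullet points."
)
print(response.text)
```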

What I find most compelling from a development perspective is the potential for truly multimodal application logic. Right now, most AI-powered applications treat different input types as separate pipelines: you have your text processing, your image analysis, maybe some audio transcription, and you glue them together with application code. A natively multimodal model opens the door to much simpler architectures where you can throw heterogeneous inputs at a single endpoint and get coherent reasoning back.
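To make that concrete, here is a rough sketch of what a single multimodal call could look like, again assuming the google-generativeai SDK and the gemini-pro-vision variant. The file name and prompt are placeholders; in a retrofit architecture this would be two separate pipelines plus glue code.

```python
# Sketch of one multimodal request: a text prompt and an image in a single call.
# Assumes the google-generativeai SDK and the gemini-pro-vision model; the image
# path and question are hypothetical stand-ins for your own inputs.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # or load it from the environment

model = genai.GenerativeModel("gemini-pro-vision")
screenshot = Image.open("dashboard_screenshot.png")  # placeholder local file

response = model.generate_content([
    "What anomalies do you see in this monitoring dashboard, and what would you check first?",
    screenshot,
])
print(response.text)
```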

Of course, “opens the door” and “works reliably in production” are two very different things. I’ve been burned enough times by demo-day promises to maintain healthy skepticism. But the direction is clear, and competition in this space benefits all of us who build on these platforms.

The Competitive Landscape

This launch puts real pressure on OpenAI and the open-source community. OpenAI has been the clear front-runner since ChatGPT’s launch a year ago, and while Google had Bard and PaLM 2, they never quite matched the developer mindshare that OpenAI captured. Gemini feels like a more serious response.

But here’s what I think matters more than the Google vs. OpenAI narrative: this accelerates the expectation that AI models should be multimodal by default. Meta’s Llama 2 pushed open-source text models forward. Now the bar is moving to multimodal capabilities. Mistral just released Mixtral 8x7B this week as well — the open-source ecosystem isn’t standing still.

For teams evaluating their AI stack, the practical takeaway is that model choice is increasingly about ecosystem fit rather than raw capability. Google’s integration with Cloud Platform, OpenAI’s partnership with Microsoft Azure, and the flexibility of open-source models all represent different trade-offs that matter more than a few percentage points on benchmarks.
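One practical consequence for your own codebase: keep the model behind a thin seam so that switching providers is a configuration change rather than a rewrite. Here is a rough sketch of that idea, with placeholder functions standing in for the real SDK calls.

```python
# Sketch of a provider-agnostic seam: application code depends on complete(),
# not on any one vendor's SDK. The provider functions below are placeholders;
# wire them to the actual Gemini / OpenAI / local-model clients in your stack.
from typing import Callable, Dict

def gemini_complete(prompt: str) -> str:
    raise NotImplementedError("call the Gemini API here")

def openai_complete(prompt: str) -> str:
    raise NotImplementedError("call the OpenAI API here")

PROVIDERS: Dict[str, Callable[[str], str]] = {
    "gemini-pro": gemini_complete,
    "gpt-3.5-turbo": openai_complete,
}

def complete(prompt: str, provider: str = "gemini-pro") -> str:
    # Swapping models becomes a config change rather than a code change.
    return PROVIDERS[provider](prompt)
```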

The On-Device Story

The Nano variant deserves special attention. Running capable AI models directly on mobile hardware is a game-changer for applications where latency matters, where privacy is a concern, or where connectivity is unreliable. Google shipping this in the Pixel 8 Pro suggests they see on-device AI as a mainstream feature, not a research curiosity.

I’ve been working with edge computing long enough to know that the gap between “runs on device” and “runs well on device” can be enormous. But the trajectory is promising. If Gemini Nano delivers even 70% of what the demos suggest, it opens up entire categories of mobile and IoT applications that currently require round-trips to cloud APIs.
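If you do go down this road, the routing logic tends to look the same regardless of platform: try the on-device model first, fall back to the cloud when it can't handle the request. A purely illustrative sketch of that pattern follows; both functions are hypothetical stand-ins, and on Android the on-device path would go through whatever Gemini Nano integration the platform exposes.

```python
# Illustrative routing sketch: prefer on-device inference, fall back to the cloud.
# run_on_device() and run_in_cloud() are hypothetical placeholders, not real APIs.
from typing import Optional

def run_on_device(prompt: str) -> Optional[str]:
    # Hypothetical: call the local model; return None when the prompt is too
    # large or the device doesn't have the model available.
    return None

def run_in_cloud(prompt: str) -> str:
    # Hypothetical: call a hosted model such as Gemini Pro via its API.
    return "cloud response placeholder"

def generate(prompt: str, allow_cloud: bool = True) -> str:
    result = run_on_device(prompt)  # low latency, data stays on the device
    if result is not None:
        return result
    if not allow_cloud:
        raise RuntimeError("on-device model unavailable and cloud fallback disabled")
    return run_in_cloud(prompt)
```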

My Take

After thirty years in this industry, I’ve learned to separate signal from noise in product launches. Gemini is signal. It may not immediately dethrone GPT-4 in every benchmark or use case, but it validates the multimodal-native approach and introduces genuine competition at the top of the AI capability spectrum.

What I’m watching for is the developer experience. The best model in the world doesn’t matter if the API is flaky, the documentation is sparse, or the pricing model doesn’t work for real applications. Google has historically struggled with developer relations compared to smaller, more focused companies. If they get that right with Gemini, the impact on our industry could be substantial.

For now, I’d recommend any team currently building with LLMs to allocate some time to evaluate Gemini Pro through the API. Competition is good for all of us, and having viable alternatives to OpenAI’s offerings makes our architectures more resilient.
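A small side-by-side harness is usually enough for a first pass: run a handful of prompts representative of your workload against your current model and Gemini Pro, and compare the outputs by hand. Here is a rough sketch that reuses the complete() seam from earlier; the prompts and provider names are placeholders for your own setup.

```python
# Rough evaluation harness: run representative prompts against two providers
# and print the outputs side by side for manual review. Assumes the complete()
# seam sketched above; the prompts below are placeholders.
PROMPTS = [
    "Extract the invoice number and total from the following text: ...",
    "Rewrite this error message so a non-technical user can act on it: ...",
]

def compare(prompts, providers=("gpt-3.5-turbo", "gemini-pro")):
    for prompt in prompts:
        print(f"\n=== {prompt[:60]} ===")
        for name in providers:
            try:
                answer = complete(prompt, provider=name)
            except Exception as exc:  # keep the run going if one provider fails
                answer = f"<error: {exc}>"
            print(f"[{name}]\n{answer}\n")

if __name__ == "__main__":
    compare(PROMPTS)
```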

AI Models & Releases - This article is part of a series.
Part : This Article