NVIDIA's Q2 Numbers Are Staggering — What It Tells Us About AI Infrastructure Demand

Osmond van Hemert

NVIDIA reported its Q2 FY2025 earnings yesterday, and the numbers are worth pausing on. Revenue hit $30 billion for the quarter — up 122% year-over-year. Data center revenue alone was $26.3 billion, a 154% increase. The company is now worth over $3 trillion.

I’m not a financial analyst, and I have zero interest in stock picks. But as someone who’s been building and deploying software infrastructure for three decades, these numbers tell me something important about where our industry is heading. The demand for AI compute is not slowing down — it’s accelerating.

Following the Money to Understand the Architecture

When you look at who’s buying NVIDIA’s GPUs, a pattern emerges. The hyperscalers — Microsoft, Google, Amazon, Meta — are collectively spending tens of billions on GPU clusters. Meta alone has said it expects around 350,000 H100 GPUs by the end of the year. Microsoft’s capital expenditure has ballooned to support Azure AI services and OpenAI’s infrastructure needs.

This level of spending tells us several things:

Training is still compute-hungry: Despite advances in model efficiency, the frontier labs are training ever-larger models that require more compute, not less. The Llama 3.1 405B model that Meta released last month was trained on a cluster of 16,000 H100 GPUs. The next generation will likely require even more.

Inference is becoming the bigger market: As AI features ship in production products — search, code completion, document summarization, image generation — the inference workload is growing exponentially. Every ChatGPT query, every Copilot suggestion, every AI-generated search summary requires GPU compute. Jensen Huang noted that inference now represents about 40% of data center revenue.

The supply chain is the bottleneck: NVIDIA’s gross margins are above 75%, which in hardware is extraordinary. This pricing power exists because demand far exceeds supply. TSMC’s advanced packaging capacity, specifically CoWoS (Chip on Wafer on Substrate), is the physical constraint. Every GPU needs this packaging, and there simply aren’t enough production lines.

What This Means for Cloud Costs

If you’re deploying AI workloads in the cloud, you’ve probably noticed that GPU instances are expensive and often unavailable. This isn’t going to get better soon. The demand dynamics suggest that GPU compute will remain a scarce, premium resource for the foreseeable future.

This has practical implications for architecture decisions:

Right-size your model: Running a 70B parameter model when a fine-tuned 7B model would suffice is burning money. The trend toward smaller, specialized models isn’t just an academic exercise — it’s an economic necessity. I’ve seen teams cut their inference costs by 80% by switching from GPT-4-class models to well-tuned smaller models for specific tasks.

Quantization matters: Techniques like GPTQ and AWQ that reduce model precision from FP16 to INT4 can cut GPU memory requirements by 4x with minimal quality loss. If you’re not quantizing your inference models, you’re probably over-provisioning (there’s a back-of-envelope memory sketch after this list).

Batch your inference: If your workload allows it, batching multiple requests together dramatically improves GPU utilization. A single H100 running one request at a time is catastrophically underutilized. Frameworks like vLLM and TensorRT-LLM handle this automatically, but you need to architect your application to support it; see the vLLM sketch below.
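
To make the sizing and quantization math concrete, here’s a rough back-of-envelope in Python. The bytes-per-parameter figures are the standard approximations for FP16, INT8, and INT4 weights; the 20% overhead factor for KV cache, activations, and runtime buffers is an assumption for illustration, not a measured number.

```python
# Rough GPU memory estimate for serving an LLM: weights only, plus an
# assumed ~20% overhead for KV cache, activations, and runtime buffers.
# The overhead factor is illustrative; real usage depends on batch size,
# context length, and the serving framework.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billions: float, precision: str, overhead: float = 0.20) -> float:
    weights_gb = params_billions * BYTES_PER_PARAM[precision]  # 1B params at N bytes each ~= N GB
    return weights_gb * (1 + overhead)

for size in (7, 70):
    for precision in ("fp16", "int4"):
        print(f"{size}B @ {precision}: ~{estimate_vram_gb(size, precision):.0f} GB")

# 7B  @ fp16: ~17 GB   -> fits on a single 24 GB GPU
# 7B  @ int4: ~4 GB    -> fits on almost anything
# 70B @ fp16: ~168 GB  -> needs a multi-GPU node (H100s are 80 GB each)
# 70B @ int4: ~42 GB   -> fits on a single 80 GB H100
```

The exact numbers aren’t the point; the order of magnitude is. A quantized 70B model fits on one 80 GB card, while the FP16 version needs a multi-GPU node.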
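
On the batching point, here is a minimal sketch using vLLM’s offline API, which does continuous batching under the hood. The model name, prompts, and sampling settings are placeholders; adjust them to whatever you actually serve.

```python
from vllm import LLM, SamplingParams

# Placeholder model name; swap in whatever you actually serve.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

prompts = [
    "Summarize this support ticket: ...",
    "Write a one-line commit message for this diff: ...",
    "Extract the invoice total from this text: ...",
]
sampling = SamplingParams(temperature=0.2, max_tokens=128)

# vLLM packs these requests onto the GPU together (continuous batching)
# instead of running them one at a time, which is where the utilization
# win comes from.
outputs = llm.generate(prompts, sampling)
for output in outputs:
    print(output.outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server for online serving, where the same batching happens across concurrent callers. The architectural point is the same either way: let the framework pack requests onto the GPU rather than dedicating a GPU per request.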

The Blackwell Generation

NVIDIA’s next-generation Blackwell GPUs (B100, B200, and the GB200 “superchip”) are expected to ship later this year and ramp in early 2025. The performance claims are significant: up to 4x faster training and up to 30x faster inference for large language models compared to H100.

If those numbers hold even partially, the economics of AI inference could shift meaningfully. Operations that currently require a cluster of H100s might run on a single Blackwell node. That could democratize access to larger models and make AI features viable for smaller companies that currently can’t afford the compute.

But there’s a catch: the initial supply will be constrained, just like H100s were. Early access will go to the hyperscalers and large enterprises with existing purchase agreements. If you’re planning your 2025 infrastructure around Blackwell availability, build in contingency plans.

Beyond NVIDIA: The Competitive Landscape

It’s worth noting that NVIDIA isn’t the only game in town, even if it feels that way:

AMD’s MI300X is gaining traction, particularly for inference workloads. It offers competitive performance at a lower price point, and major cloud providers are adding MI300X instances.

Google’s TPUs continue to be a strong option if you’re in the Google Cloud ecosystem. TPU v5e is particularly cost-effective for inference.

Custom silicon from AWS (Trainium, Inferentia) and Microsoft (Maia) is designed specifically for their cloud customers. These chips won’t match NVIDIA’s flexibility, but they could offer better price-performance for specific workloads.

Groq’s LPUs and other specialized inference accelerators promise dramatically faster and cheaper inference for specific model architectures.

My Take

The NVIDIA earnings story isn’t really about NVIDIA — it’s about the fundamental reshaping of computing infrastructure. We’re in the middle of a build-out that’s comparable to the original cloud computing wave. The hyperscalers are effectively building a new tier of compute infrastructure specifically for AI, and the spend is unprecedented.

For those of us who design and deploy systems, the practical takeaway is clear: GPU compute is expensive and will remain so. Design your AI features with cost efficiency in mind from day one. Use the smallest model that meets your quality bar, optimize your inference pipeline, and keep an eye on the rapidly evolving hardware landscape.

The companies that succeed with AI won’t necessarily be the ones with the most GPUs — they’ll be the ones that use their GPUs most efficiently. That’s always been true with infrastructure, and it’s no different now.
