A month into the global lockdown, and infrastructure teams everywhere are running on caffeine and Terraform plans. The shift to remote work didn’t just increase traffic — it fundamentally changed traffic patterns. VPN concentrators that handled 10% of the workforce are now handling 100%. Video conferencing backends that provisioned for peak meeting hours are seeing all-day sustained load. And the teams managing all of this are doing it from their kitchen tables.
I’ve been talking to infrastructure engineers across several companies this week, and a clear pattern is emerging: organizations that invested in Infrastructure as Code (IaC) practices are weathering this storm dramatically better than those still managing infrastructure manually. It’s not even close.
The Scaling Stories
Let me share a few anonymized examples of what I’m hearing:
Company A (financial services, ~5,000 employees) had their VPN infrastructure defined in Terraform with auto-scaling groups. When remote work mandates hit, they modified a few variables (instance count, instance type), ran terraform plan, reviewed the diff, and ran terraform apply. New VPN capacity was live in under an hour. Total downtime: zero.
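To make that concrete, here is a minimal sketch of the shape such a setup might take. Every name in it (vpn_instance_count, vpn_instance_type, the launch template, the subnet variable) is my invention, not Company A's; the point is that the scaling decision lives in two variables:

```hcl
# Hypothetical illustration: the two knobs that matter are plain variables.
variable "vpn_instance_count" {
  type    = number
  default = 4 # raised sharply when remote work mandates hit
}

variable "vpn_instance_type" {
  type    = string
  default = "c5.large" # bumped to a larger type for sustained load
}

variable "vpn_ami_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

resource "aws_launch_template" "vpn" {
  name_prefix   = "vpn-"
  image_id      = var.vpn_ami_id
  instance_type = var.vpn_instance_type
}

resource "aws_autoscaling_group" "vpn" {
  desired_capacity    = var.vpn_instance_count
  min_size            = var.vpn_instance_count
  max_size            = var.vpn_instance_count * 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.vpn.id
    version = "$Latest"
  }
}
```

Scaling up is then a two-line change to the defaults (or a -var override), followed by terraform plan to review the diff and terraform apply to execute it.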
Company B (similar size, similar industry) managed their VPN infrastructure through a combination of manual provisioning and shell scripts accumulated over five years. Scaling up required someone to log into the AWS console, manually launch instances, configure them by hand, update load balancer targets, and modify security groups. The process took three days and involved two misconfigurations that caused outages.
The difference isn’t talent. Company B has excellent engineers. The difference is that Company A had codified their infrastructure decisions into version-controlled, reviewable, repeatable artifacts. When the crisis hit, they could adapt quickly because changing infrastructure meant changing code, not performing a sequence of manual steps from memory.
Terraform, CloudFormation, and the State Problem
Terraform has emerged as the de facto standard for multi-cloud IaC, and its usage has reportedly spiked in the past month. HashiCorp’s Terraform Cloud saw a significant increase in runs as teams scrambled to scale.
But Terraform’s state management — always its Achilles’ heel — is causing pain at scale. Teams that were casually sharing state files in S3 buckets without proper locking are experiencing state corruption during concurrent modifications. When three engineers all need to scale different parts of the infrastructure simultaneously, state locking isn’t optional anymore.
I’ve been recommending Terraform Cloud or at minimum a properly configured S3 backend with DynamoDB locking to every team I talk to. The free tier of Terraform Cloud handles remote state management well enough for most teams. But honestly, this should have been set up before the crisis, not during it.
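For teams doing this setup now, the configuration is mercifully small. A sketch, with the bucket and table names as obvious placeholders:

```hcl
# Remote state in S3, with a DynamoDB table providing the lock.
terraform {
  backend "s3" {
    bucket         = "example-corp-terraform-state" # placeholder name
    key            = "network/vpn/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks" # this line enables locking
  }
}

# The lock table itself. The S3 backend expects a string hash key
# named exactly "LockID".
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

One wrinkle: the lock table has to exist before the backend can use it, so it is typically created from a separate bootstrap configuration (or, just this once, by hand).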
CloudFormation users, meanwhile, are dealing with their own challenges. AWS service limits — which many teams never thought about because they never approached them — are suddenly relevant. EC2 instance limits, EIP limits, NAT Gateway limits per AZ — all of these require support tickets to increase, and AWS support response times have understandably slowed as every customer makes the same requests simultaneously.
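One partial mitigation worth knowing about: many of these limits are now exposed through AWS Service Quotas, and the Terraform AWS provider can file the increase request itself, which at least puts the request into code review alongside the scaling change. A sketch; the quota code below is the one I believe maps to running on-demand standard EC2 instances, so verify it against your own account before relying on it:

```hcl
# Request a quota increase through Service Quotas instead of a hand-written
# support ticket. Approval is still asynchronous and not guaranteed; this
# only codifies the request.
resource "aws_servicequotas_service_quota" "ec2_on_demand_standard" {
  service_code = "ec2"
  quota_code   = "L-1216C47A" # Running On-Demand Standard instances (verify)
  value        = 512
}
```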
Ansible and Configuration Drift
Infrastructure provisioning is only half the story. Once the servers exist, they need to be configured. Ansible, which many teams use for configuration management, is proving its worth — but also exposing a common anti-pattern.
Teams that ran Ansible playbooks regularly (ideally on every commit to their config repo) are in good shape. Their configurations are consistent, their playbooks are tested, and scaling out means running the same playbook against new hosts. Teams that only ran Ansible during initial setup and then made manual changes to running systems are discovering the hard way what “configuration drift” means.
I spoke with one SRE who described their situation bluntly: “We have Ansible playbooks, but they haven’t been run against production in eight months. Nobody trusts them anymore.” They’re essentially back to manual configuration, but with the added confusion of Ansible playbooks that may or may not reflect reality. A first step back is running the playbooks in check mode (ansible-playbook --check --diff), which reports what would change without touching anything; that at least tells you how far reality has drifted.
The lesson is one that the DevOps community has been preaching for years: IaC only works if it’s the only way you change infrastructure. The moment someone SSHes in and makes a manual change, your code and your reality diverge.
The Monitoring Gap
Scaling infrastructure is one thing. Knowing whether it’s working is another. I’m seeing a lot of teams that scaled their application infrastructure but forgot to scale their monitoring. Prometheus instances running out of memory because they’re scraping three times as many targets. Elasticsearch clusters for log aggregation hitting storage limits. PagerDuty alert fatigue as thresholds calibrated for normal traffic fire constantly under pandemic loads.
The best-prepared teams had their monitoring infrastructure defined in the same IaC pipelines as their application infrastructure. Scale the app, the monitoring scales with it. But that level of maturity is still rare. Most organizations have a gap between their IaC practices for “the application” and “everything else.”
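One pattern that closes the gap, sketched below with entirely hypothetical names and sizing breakpoints: derive the monitoring tier's capacity from the same variable that sizes the application tier, so a single plan scales both.

```hcl
variable "app_instance_count" {
  type    = number
  default = 30
}

variable "monitoring_ami_id" {
  type = string
}

variable "monitoring_subnet_id" {
  type = string
}

# Size Prometheus from the number of scrape targets instead of hard-coding
# it. These breakpoints are invented; measure your own scrape load.
locals {
  prometheus_instance_type = var.app_instance_count > 50 ? "r5.2xlarge" : "r5.large"
}

resource "aws_instance" "prometheus" {
  ami           = var.monitoring_ami_id
  instance_type = local.prometheus_instance_type
  subnet_id     = var.monitoring_subnet_id

  tags = {
    Role = "monitoring"
  }
}
```

When app_instance_count goes from 30 to 90, the same apply that adds application capacity also replaces the Prometheus node with a bigger one. (Replacing a stateful monitoring instance has its own costs; the point here is the coupling, not this particular mechanism.)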
Grafana dashboards, at least, have become a universal language. I’ve seen more screenshots of Grafana boards in Slack channels this month than in the previous year combined. If nothing else, this crisis is teaching everyone the value of observability.
What We Should Learn
When this crisis eventually subsides (and it will), I hope infrastructure teams take three lessons forward:
IaC is not optional. If your infrastructure can’t be reproduced from code, it can’t be scaled reliably under pressure. Full stop.
Practice your scaling procedures. Chaos engineering — which sometimes feels like a luxury — is actually preparation for exactly this kind of scenario. If you’ve never tested scaling your VPN infrastructure by 10x, you don’t actually know if your Terraform configs support it.
Treat monitoring and observability as first-class infrastructure. Your Prometheus, Grafana, and logging stack should be in the same Terraform modules as your application. If they’re not, they won’t scale when you need them to.
My Take
I’ve been advocating for Infrastructure as Code since long before it had a catchy name. We used to call it “scripting your environment” and it was considered a nice-to-have. The pandemic is proving what many of us always knew: it’s a necessity.
The good news is that the tooling has never been more mature. Terraform, Ansible, Pulumi, CloudFormation, CDK — there are excellent options for every use case and cloud provider. The barrier isn’t tooling; it’s organizational discipline.
If your team is currently in firefighting mode, manually scaling things to keep the lights on — that’s okay. Survive first. But once the immediate crisis passes, take what you learned about your infrastructure’s weaknesses and codify the fixes. Write the Terraform. Write the Ansible. Set up the state locking. Make sure the next crisis — and there will be a next one — finds you in Company A’s position, not Company B’s.
