Docker announced Model Runner this week, a new feature in Docker Desktop that lets you pull and run AI models locally using the same familiar workflow you’d use for container images. If you’ve ever wished you could just docker run a language model the way you spin up a PostgreSQL container, that’s essentially what this enables. And after spending a couple of days experimenting with it, I think it’s a bigger deal than the understated announcement might suggest.
The feature is currently in beta as part of Docker Desktop 4.41, and it supports a growing catalog of models from Docker Hub's ai/ namespace, Hugging Face, and other OCI-compatible registries. You can run models locally on your machine's GPU (or CPU, with the corresponding performance trade-offs), and the models integrate with Docker's existing networking and volume systems.
How It Works
The basic workflow is surprisingly straightforward. Docker has extended its CLI and Desktop UI to support model management as a first-class concept. You can pull models from registries, list available models, and run inference — all through the Docker interface you already know.
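To make that concrete, here is roughly what a first session looks like. The model name and exact subcommands are illustrative; they reflect the beta's `docker model` plugin and Docker Hub's ai/ catalog, and may shift as the feature matures.

```bash
# Pull a model the same way you'd pull an image (model name is an example)
docker model pull ai/llama3.2

# See what's cached locally
docker model list

# Run a one-off prompt against it from the CLI
docker model run ai/llama3.2 "Explain Docker Model Runner in one sentence."
```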
Under the hood, Docker Model Runner leverages llama.cpp and similar inference engines, packaged and managed by Docker’s runtime. The models are stored in Docker’s content store alongside your container images, and they benefit from the same layer caching and deduplication mechanisms. If you pull two models that share a base architecture, Docker is smart enough to reuse the common layers.
What makes this different from just running Ollama or llama.cpp directly is the integration story. Docker Model Runner exposes a local API endpoint that’s compatible with the OpenAI API format. This means any application code that talks to OpenAI’s API can be pointed at your local model runner with just a URL change. For development and testing, this is extremely practical.
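Here is a minimal sketch of that URL swap using the official openai Python package. The port, path, and model name are assumptions that depend on your setup; in the beta, host access to the runner goes through a TCP endpoint you enable in Docker Desktop.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Model Runner instead of api.openai.com.
# Base URL and model name are illustrative; check Docker Desktop for the actual
# host/port, and note that TCP access to the runner has to be enabled first.
client = OpenAI(
    base_url="http://localhost:12434/engines/v1",
    api_key="not-needed-locally",  # the local runner doesn't validate keys
)

response = client.chat.completions.create(
    model="ai/llama3.2",
    messages=[{"role": "user", "content": "Give me three edge cases for a URL parser."}],
)
print(response.choices[0].message.content)
```

Everything else about the calling code stays the same, which is the point.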
The Docker Compose integration is where things get really interesting. You can define a model as a service in your docker-compose.yml alongside your application containers, your database, your message queue — everything spins up together. Your application connects to the model service over Docker’s internal network, just like it would connect to any other service.
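A sketch of what that wiring can look like is below. The Compose syntax for model services has shifted across beta releases, so treat the `provider` block and the internal endpoint as illustrative of the 4.4x-era form rather than a stable reference.

```yaml
services:
  app:
    build: .
    environment:
      # The app speaks OpenAI-format HTTP to the runner over Docker's internal
      # network; the variable name is our own convention, not a Docker standard.
      LLM_BASE_URL: http://model-runner.docker.internal/engines/v1
    depends_on:
      - llm

  llm:
    # Beta-era syntax for a Model Runner-backed service (illustrative).
    provider:
      type: model
      options:
        model: ai/llama3.2
```

With that in place, `docker compose up` starts the application and makes the model available in one step.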
Why This Matters for Development Workflows
The pain point Docker is addressing here is real. Right now, if you're building an application that uses AI models, your development workflow probably takes one of two shapes: either you call an external API (OpenAI, Anthropic, etc.) during development, which means API keys, network access, and paying per request, or you run something like Ollama separately, manage it as a distinct tool, and wire everything together manually.
Docker Model Runner collapses this into the standard Docker development workflow. Clone a repo, run docker compose up, and you have your entire application stack — including the AI model — running locally. No external API calls, no separate tool management, no API keys needed for basic development.
For teams working on AI-powered applications, this addresses several practical problems:
Offline development: You can work on AI features on a plane, in a coffee shop with bad WiFi, or in an air-gapped environment. The model runs locally, no network required.
Cost control during development: Every time a developer hits “run tests” against an external AI API, the meter is ticking. Local models eliminate that cost for development and testing cycles.
Reproducibility: When the model is part of your Docker Compose stack, every developer on the team is running the same model version. No more “works on my machine” issues caused by different API model versions or rate limiting.
Privacy: For applications handling sensitive data, running inference locally during development means that data never leaves the machine. This is significant for healthcare, finance, and other regulated industries.
Performance Reality Check
Let’s be honest about the limitations. Running a 7B parameter model on a developer’s laptop is not the same as hitting GPT-4o or Claude via API. The model quality is lower, the inference speed depends heavily on your hardware, and the larger models that produce better results require serious GPU memory.
On my MacBook Pro with an M3 Max and 64GB of unified memory, I can run 13B parameter models comfortably and get reasonable inference speeds — maybe 15-20 tokens per second. That’s workable for development but not great for anything interactive. Smaller 7B models run faster, around 30-40 tokens per second, but with correspondingly lower quality.
The sweet spot I’ve found is using local models for development and testing of the integration layer — making sure your prompts are structured correctly, your response parsing handles edge cases, and your application logic works — while accepting that the actual model quality will be different in production. Think of it like developing against a local SQLite database when your production runs PostgreSQL. The interface is the same, the behavior is close enough for development, but you still need to test against the real thing.
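One way to keep that split explicit is to drive the endpoint and model from configuration, so the same integration code hits the local runner in development and the hosted API in production. A minimal sketch, with variable names and defaults that are purely our own convention:

```python
import os
from openai import OpenAI

# Defaults target the local Model Runner; production overrides all three via env.
# (Variable names, the local URL, and the model name are illustrative.)
BASE_URL = os.environ.get("LLM_BASE_URL", "http://localhost:12434/engines/v1")
MODEL = os.environ.get("LLM_MODEL", "ai/llama3.2")
API_KEY = os.environ.get("LLM_API_KEY", "not-needed-locally")

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

def summarize(text: str) -> str:
    # The integration layer under test: prompt structure and response parsing
    # are identical whether the backend is local or hosted.
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Summarize in one sentence:\n{text}"}],
    )
    return response.choices[0].message.content.strip()
```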
The Docker Ecosystem Play
What Docker is really doing here is extending their platform strategy. They’ve already won the “how developers run local infrastructure” battle with Docker Compose. By adding AI models to that same ecosystem, they’re ensuring that Docker Desktop remains the central tool in the development workflow even as AI becomes a core component of more applications.
It’s a smart move. The alternative was that developers would adopt a separate tool for local AI — Ollama, LM Studio, or something similar — and Docker’s role would be limited to the non-AI parts of the stack. By integrating model running into Docker itself, they maintain their position as the unified development environment.
I also expect this to drive model registry standards in interesting directions. Docker has already influenced container image standards through OCI. If they push for similar standardization around model packaging and distribution, that could benefit the entire ecosystem.
My Take
Docker Model Runner is one of those features that seems incremental until you actually use it. The moment you add a model service to your Docker Compose file and have your entire AI-powered application stack running with a single command, the developer experience improvement is tangible.
Is it going to replace cloud-based AI APIs in production? No. Is it going to change how teams develop and test AI-powered applications? I think so, yes. The friction reduction is significant, and in my experience, reducing friction in development workflows has an outsized impact on team productivity and code quality.
If you’re building anything that integrates AI models, I’d recommend trying Docker Model Runner in your development stack this week. The setup takes about fifteen minutes, and the workflow improvement is immediate. Just make sure your laptop has enough RAM — those models are hungry.
