LLM Agents in Production — Moving Beyond Chat Interfaces · Osmond van Hemert — Senior Software Engineer

Six months ago, talking about “agent-based systems” at an enterprise architecture meeting would have gotten you some skeptical looks. Today, it’s the question every engineering leader is asking: “How do we move our LLMs from chatbots into autonomous agents?”

The shift is real, and it’s reshaping how teams think about AI deployment. We’re moving past the era of “ChatGPT in a product” into the era of “systems that coordinate complex workflows without human intervention.” This is harder than it sounds, and the teams getting it right are finding that the architectural challenges are bigger than the model challenges.

I’ve been working with several companies navigating this transition over the past few months. What I’m seeing is a clear inflection point: companies that treat LLM agents as a straightforward extension of chatbot UX are going to struggle. The ones that respect the architectural complexity are building systems that are genuinely useful.

From Chat Interfaces to Autonomous Workflows

Let me be concrete about what’s changed. A year ago, the standard pattern was: “We built a chat interface to GPT-4 and deployed it.” That worked fine for customer support, basic information retrieval, and copilot-style interfaces where a human is involved in every decision.

Today’s pattern is different: “Our system receives a request, spawns multiple specialized agents to investigate different aspects in parallel, orchestrates their findings, flags risks, and executes or recommends an action — all without waiting for human feedback.”

That’s not a chat interface. That’s a distributed system.

The classic example is code analysis. Instead of asking a single LLM “analyze this codebase for security issues,” you can spawn agents that run in parallel: one analyzing dependency trees, another scanning authentication patterns, another looking for SQL injection vectors, another auditing data handling. They report their findings to an orchestrator, which synthesizes them into a coherent risk assessment. The whole workflow completes in seconds, and each agent has a specific job it’s optimized for.

This architecture mirrors what we’ve learned from microservices, except the “services” are LLMs with specific prompts and contexts. And just like microservices, this brings real power but also real complexity.

The Architecture That Actually Works in Production

The teams I’m advising are converging on a few core patterns:

The Orchestration Layer sits at the center. It’s not an LLM — it’s a state machine or a workflow engine that knows how to spawn agents, collect their results, handle failures, and make decisions about what happens next. This is where most engineering complexity lives. Some teams are writing this themselves. Others are adopting frameworks like LangGraph or AutoGen. The choice matters less than recognizing that you need this layer.
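To make the point that the orchestrator is plain control flow rather than a model, here is a minimal sketch of such a layer as a small state machine. The step names, the `Workflow` fields, and the agent callables are all hypothetical; in a real system each agent would wrap a model call, and a framework such as LangGraph or AutoGen might replace this hand-rolled loop.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable

# Hypothetical agent signature: takes the request text, returns a finding.
Agent = Callable[[str], str]


class Step(Enum):
    INVESTIGATE = auto()
    SYNTHESIZE = auto()
    DONE = auto()
    FAILED = auto()


@dataclass
class Workflow:
    request: str
    step: Step = Step.INVESTIGATE
    findings: list[str] = field(default_factory=list)
    summary: str = ""


def advance(wf: Workflow, agents: dict[str, Agent]) -> Workflow:
    """Perform exactly one state transition."""
    if wf.step is Step.INVESTIGATE:
        try:
            wf.findings = [f"{name}: {agent(wf.request)}" for name, agent in agents.items()]
            wf.step = Step.SYNTHESIZE
        except Exception:
            # Any agent error moves the workflow to an explicit failure state.
            wf.step = Step.FAILED
    elif wf.step is Step.SYNTHESIZE:
        wf.summary = "; ".join(wf.findings)
        wf.step = Step.DONE
    return wf


def run(wf: Workflow, agents: dict[str, Agent], max_transitions: int = 10) -> Workflow:
    """Drive the workflow until it reaches a terminal state or the budget runs out."""
    for _ in range(max_transitions):
        if wf.step in (Step.DONE, Step.FAILED):
            break
        advance(wf, agents)
    return wf
```

Because every transition is an explicit branch, the failure path is as visible as the happy path, which is the main reason to keep this layer out of the model.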

Specialized Agents are narrow and focused. An agent that “does everything” is an agent that does nothing well. The ones that work in production have a specific domain: “you analyze database queries,” “you audit permissions,” “you generate test cases.” Each agent has a fixed set of tools it can call, a specific prompt optimized for its domain, and known constraints. This is the opposite of the “general-purpose AI” narrative we hear in marketing.
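One way to enforce the fixed-tool-set idea is to declare each agent's allowed tools up front and check every call against that declaration. The sketch below assumes a hypothetical `AgentSpec` record and a toy `TOOLS` registry; the tool names and their lambda bodies are placeholders, not a real API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentSpec:
    name: str
    system_prompt: str
    allowed_tools: frozenset[str]


# Hypothetical tool registry; real tools would call into your own services.
TOOLS: dict[str, Callable[[str], str]] = {
    "explain_query": lambda sql: f"plan for: {sql}",
    "list_roles": lambda user: f"roles of {user}",
}


class ToolNotAllowed(Exception):
    pass


def call_tool(agent: AgentSpec, tool: str, arg: str) -> str:
    """Dispatch a tool call only if the agent's spec lists that tool."""
    if tool not in agent.allowed_tools:
        raise ToolNotAllowed(f"{agent.name} may not call {tool}")
    return TOOLS[tool](arg)


query_analyst = AgentSpec(
    name="query_analyst",
    system_prompt="You analyze database queries.",
    allowed_tools=frozenset({"explain_query"}),
)
```

The check lives in the dispatcher, not in the prompt, so a confused or manipulated agent still cannot reach a tool outside its domain.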

Memory and State is the hidden complexity. In a chat interface, context is simple — it’s the conversation history. In a multi-agent system, you have shared state (what did Agent A discover?), individual agent state (is Agent B still thinking?), and the overall workflow state (have we collected enough information to make a decision?). Getting this wrong means agents hallucinate, repeat work, or get stuck in loops. The teams handling this best are treating agent state like database transactions — with explicit commit points, rollbacks, and consistency guarantees.

I worked with one fintech startup that spent weeks debugging an issue where agents were making conflicting recommendations. It turned out that two agents were reading stale state — they’d both queried the same account information 500ms apart, and the data had changed in between. That’s a distributed systems problem, not an AI problem. They needed to add explicit state versioning and locks.
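The stale-read story above is a textbook case for optimistic concurrency: every value carries a version, and a write is rejected if the version the writer read is no longer current. The following is a minimal in-process sketch, not what that team built; `SharedState`, `Versioned`, and `StaleWrite` are illustrative names, and a production system would put this check in the database or KV store itself.

```python
import threading
from dataclasses import dataclass
from typing import Any


class StaleWrite(Exception):
    pass


@dataclass(frozen=True)
class Versioned:
    value: Any
    version: int


class SharedState:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict[str, Versioned] = {}

    def read(self, key: str) -> Versioned:
        # Unknown keys read as version 0 with no value.
        with self._lock:
            return self._data.get(key, Versioned(None, 0))

    def write(self, key: str, value: Any, expected_version: int) -> Versioned:
        """Commit only if nobody else has written since the caller's read."""
        with self._lock:
            current = self._data.get(key, Versioned(None, 0))
            if current.version != expected_version:
                raise StaleWrite(
                    f"{key}: expected v{expected_version}, found v{current.version}"
                )
            updated = Versioned(value, current.version + 1)
            self._data[key] = updated
            return updated
```

An agent that hits `StaleWrite` should re-read the state and redo its reasoning rather than retry the same write.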

Error Recovery and Graceful Degradation matter more in production than in prototypes. What happens if an agent times out? What if a tool call fails? What if an agent confidently produces a wrong answer? The best systems I’ve seen treat these cases explicitly: they have fallback agents, they log unexpected behaviors for human review, they can degrade to simpler workflows if complex ones fail. This is boring engineering, but it’s essential.
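As a small illustration of explicit timeout handling and degradation, the sketch below wraps a primary agent in a timeout and falls back to a simpler one on any failure, logging the event for later human review. The `primary` and `fallback` coroutines are hypothetical stand-ins for real agent calls, and the five-second default is an arbitrary assumption.

```python
import asyncio
import logging

log = logging.getLogger("agents")


async def run_with_fallback(primary, fallback, request: str, timeout_s: float = 5.0):
    """Try the primary agent; on timeout or error, degrade to the fallback."""
    try:
        return await asyncio.wait_for(primary(request), timeout=timeout_s)
    except Exception as exc:
        # Timeouts raised by wait_for are ordinary exceptions and land here too.
        log.warning("primary agent failed (%r); degrading to fallback", exc)
        return await fallback(request)
```

This handles timeouts and tool errors. It does not catch the third case from the paragraph above, an agent that is confidently wrong, which needs a separate validation step on the output.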

Tool Integration: The Real Bottleneck

Here’s something that surprised me: the bottleneck in agent systems isn’t the LLM, it’s the tools.

When an agent needs to take action — query a database, call an API, write to a file system, trigger a deployment — it has to be deeply integrated with your infrastructure. And every integration is a potential security boundary, a failure mode, a place where the agent can go wrong.

The teams doing this well have built abstraction layers. Instead of giving an agent direct access to your production database, you build a small API service that exposes specific, safe queries. Instead of direct file system access, you expose a file upload API with strict constraints. This is more work upfront, but it means you can scale the number of agents without scaling the security audit effort linearly.

One team I worked with was about to give agents direct access to their AWS account via boto3. I suggested we build a small safety layer instead — a service that agents query, which then validates requests against a policy before executing them. It added maybe two days of work. It prevented what could have been a catastrophic mistake.

This pattern echoes what we’ve learned from cloud infrastructure security: don’t scale complexity; scale abstraction. Agents should not see your actual infrastructure. They should see a carefully curated interface to it. AWS best practices for service authentication and the principle of least privilege apply directly to agent tool access control.
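A safety layer of the kind described above can be sketched as a gateway that checks each request against an allow-list before anything touches the real backend. Everything here is hypothetical: the `POLICY` table, the action labels, and the `backend` callable stand in for whatever policy format and infrastructure client a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolRequest:
    agent: str
    action: str    # illustrative label such as "storage:read"
    resource: str  # illustrative path such as "reports/q3.csv"


# Hypothetical policy: agent name -> list of (action, resource prefix) it may use.
POLICY: dict[str, list[tuple[str, str]]] = {
    "report_agent": [("storage:read", "reports/")],
}


class Denied(Exception):
    pass


def authorize(req: ToolRequest) -> bool:
    """Deny by default; allow only exact action matches under an allowed prefix."""
    rules = POLICY.get(req.agent, [])
    return any(
        req.action == action and req.resource.startswith(prefix)
        for action, prefix in rules
    )


def execute(req: ToolRequest, backend: Callable[[ToolRequest], str]) -> str:
    """Run the request against the backend only after the policy check passes."""
    if not authorize(req):
        raise Denied(f"{req.agent} may not {req.action} on {req.resource}")
    return backend(req)
```

Adding a new agent then means adding a policy entry, not auditing a new set of credentials.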

Memory Management at Scale

I’ve talked with teams running hundreds of concurrent agents, and they all hit the same wall: memory management becomes the limiting factor.

Each agent needs context. If you’re storing the full conversation history, the full state of the system, and the full output of prior agents, you can fit maybe ten concurrent agents on a reasonable machine before memory explodes. One team hit this at exactly nine agents running in parallel on their setup.

The solution is ruthless context pruning. You keep the last N steps of the workflow, you summarize intermediate results, you only pass agents the information they need to make their decision. You trade some theoretical completeness for practical scalability.
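The pruning rule can be sketched in a few lines: keep the most recent steps verbatim and collapse everything older into one summary entry. The `summarize` argument is an assumption; in practice it would probably be a cheap model call, and here it is just any function from a list of steps to a string.

```python
from typing import Callable


def prune_context(
    steps: list[str],
    keep_last: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Return at most keep_last recent steps, preceded by a summary of the rest."""
    if len(steps) <= keep_last:
        return list(steps)
    cut = len(steps) - keep_last
    older, recent = steps[:cut], steps[cut:]
    return [summarize(older)] + recent


# Usage with a placeholder summarizer:
history = ["s1", "s2", "s3", "s4", "s5"]
pruned = prune_context(history, 2, lambda older: f"summary of {len(older)} steps")
# pruned == ["summary of 3 steps", "s4", "s5"]
```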

The other emerging pattern is using specialized storage. Instead of keeping everything in memory, you store the full state in a fast KV store (Redis, DynamoDB) and only load what each agent needs into its context window. This adds latency, but it’s a tradeoff most production systems are willing to make.
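A rough sketch of the storage pattern: agents write their full results to a key-value store, and each agent loads only the keys it declares it needs. The `InMemoryStore` below is a stand-in so the example runs on its own; a Redis or DynamoDB client would sit behind the same two-method interface. The key layout (`workflow_id:name`) is an assumption for illustration.

```python
import json
from typing import Any, Optional, Protocol


class KVStore(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Stand-in for an external KV store, used only to keep the sketch runnable."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def save_result(store: KVStore, workflow_id: str, name: str, result: Any) -> None:
    """Persist one agent's full result as JSON under a per-workflow key."""
    store.set(f"{workflow_id}:{name}", json.dumps(result))


def load_context(store: KVStore, workflow_id: str, needed: list[str]) -> dict[str, Any]:
    """Load only the named results; keys that do not exist are skipped."""
    context: dict[str, Any] = {}
    for name in needed:
        raw = store.get(f"{workflow_id}:{name}")
        if raw is not None:
            context[name] = json.loads(raw)
    return context
```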

Think of it like database query optimization: you’re reducing the working set to what actually matters. Models with extended thinking, such as Claude’s latest release, give agents more room to reason, but you still need to manage how much context you pass to them.

The Agent Orchestration Patterns That Scale

The orchestrator is where the craft lives. I’m seeing three major patterns emerge:

Sequential orchestration is straightforward: Agent A completes, its output goes to Agent B, which completes, output goes to Agent C. This is the easiest to reason about, but it’s slow — you can’t parallelize. Use this when order matters and latency isn’t critical.

Parallel orchestration spawns multiple agents at once (like the code analysis example earlier) and waits for all results before proceeding. This is faster but more complex — you need to handle partial failures (what if one agent errors?), manage concurrent state updates, and synthesize conflicting results.
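For the partial-failure question, one common approach in Python is `asyncio.gather` with `return_exceptions=True`, which lets the orchestrator separate successful findings from failures instead of losing everything when one agent errors. The agent coroutines and the 30-second default timeout below are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable

AsyncAgent = Callable[[str], Awaitable[str]]  # hypothetical agent signature


async def fan_out(
    agents: dict[str, AsyncAgent],
    request: str,
    timeout_s: float = 30.0,
) -> tuple[dict[str, str], dict[str, str]]:
    """Run all agents concurrently; return (findings, failures) keyed by agent name."""

    async def guarded(agent: AsyncAgent) -> str:
        # Each agent gets its own timeout so one slow agent cannot stall the rest.
        return await asyncio.wait_for(agent(request), timeout=timeout_s)

    names = list(agents)
    results = await asyncio.gather(
        *(guarded(agents[name]) for name in names),
        return_exceptions=True,
    )

    findings: dict[str, str] = {}
    failures: dict[str, str] = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failures[name] = repr(result)
        else:
            findings[name] = result
    return findings, failures
```

The orchestrator can then decide whether the surviving findings are enough to proceed or whether a missing agent invalidates the whole assessment.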

Conditional/Branching orchestration routes based on results. “If Agent A found a critical security issue, invoke the Security Agent. Otherwise, proceed to the standard review.” This is the most flexible but also the most complex. You need a clear state machine to avoid infinite loops or contradictory branches.
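The routing rule quoted above can be written as a tiny graph runner in which each node returns the name of the next node, and a step budget guards against loops. The node names, the `severity` field, and the default budget of 20 steps are all illustrative assumptions.

```python
from typing import Callable, Optional

# A node mutates the shared state and returns the next node's name, or None to stop.
Node = Callable[[dict], Optional[str]]


def run_graph(nodes: dict[str, Node], start: str, state: dict, max_steps: int = 20) -> dict:
    """Follow node-to-node routing until a node returns None or the budget is spent."""
    current: Optional[str] = start
    steps = 0
    while current is not None:
        if steps >= max_steps:
            raise RuntimeError(f"exceeded {max_steps} steps; possible routing loop")
        state.setdefault("path", []).append(current)
        current = nodes[current](state)
        steps += 1
    return state


def triage(state: dict) -> Optional[str]:
    return "security_review" if state["severity"] == "critical" else "standard_review"


def security_review(state: dict) -> Optional[str]:
    state["outcome"] = "escalated"
    return None


def standard_review(state: dict) -> Optional[str]:
    state["outcome"] = "approved"
    return None


NODES: dict[str, Node] = {
    "triage": triage,
    "security_review": security_review,
    "standard_review": standard_review,
}
```

Recording the path in the state also gives you a cheap audit trail of which branch was taken.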

The best systems I’ve seen combine all three, depending on the workflow. As a rough rule of thumb from the workflows I’ve seen, about 70% of the steps are sequential (this step logically depends on the last one), 20% are parallel (these aspects can be analyzed independently), and 10% are conditional (some paths fork based on results).

The Debugging and Observability Crisis

Here’s a problem nobody talks about enough: agent systems are incredibly hard to debug.

When a chatbot gives a wrong answer, you can see the conversation and ask the user what went wrong. When an agent system makes a wrong decision, you have to trace through:

What instructions did it receive?
What tools did it call?
What were the results?
What state was it operating from?
Did it have conflicting information?
Where exactly did it diverge from what should have happened?

One team spent two days debugging why an agent kept recommending the wrong approach to a customer. Turns out the agent was reading stale state — a previous customer’s information was cached in its context. By the time they figured it out, the system had already made bad recommendations to twelve customers.

The solution is comprehensive observability. Every agent call, tool invocation, state update, and decision needs to be logged with context. You need to be able to replay the entire execution trace. You need to know not just what the agent decided, but why, what alternatives it considered, and what data it was working from. Tracing tools like LangSmith and observability platforms like Datadog APM are built for this kind of inspection, and the OpenTelemetry specification offers standardized instrumentation patterns that are not tied to any one agent framework.
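To show the shape of the data involved, here is a minimal standard-library sketch of an execution trace: one structured event per prompt, tool call, state read, or decision, all sharing a trace ID so a run can be replayed in order. It deliberately does not use the LangSmith, Datadog, or OpenTelemetry APIs; the `Trace` and `TraceEvent` names and the event kinds are assumptions.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    trace_id: str
    agent: str
    kind: str      # illustrative kinds: "prompt", "tool_call", "state_read", "decision"
    payload: dict
    ts: float = field(default_factory=time.time)


class Trace:
    """Append-only record of everything that happened in one workflow run."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.events: list[TraceEvent] = []

    def record(self, agent: str, kind: str, **payload) -> None:
        self.events.append(TraceEvent(self.trace_id, agent, kind, payload))

    def for_agent(self, agent: str) -> list[TraceEvent]:
        # Events are returned in the order they were recorded.
        return [e for e in self.events if e.agent == agent]

    def to_jsonl(self) -> str:
        """Serialize one JSON object per line, ready to ship to a log pipeline."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.events)
```

Recording the state key and version on every read is what would have surfaced the cached-customer bug described above in minutes rather than days.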

This is more engineering overhead than most teams budget for. But I’d argue it’s non-negotiable in production. If your agent system is making decisions that affect customers or business outcomes, you need to be able to explain those decisions after the fact.

Teams That Got It Right

The companies navigating this successfully share a few traits:

First, they’re clear about scope. They’re not trying to build a general-purpose autonomous system. They’re building a system for a specific problem: contract analysis, incident response, customer onboarding, code review — something concrete with defined success criteria.

Second, they’re paranoid about safety. They assume agents will occasionally make wrong decisions, and they build detection and correction mechanisms into the workflow. They don’t assume “better models = fewer mistakes.” They engineer on the assumption that mistakes will happen and make sure those mistakes get caught. Evaluation frameworks like HELM and adversarial testing approaches provide structured methodologies for identifying failure modes.

Third, they’re patient with infrastructure. They don’t try to run production agent systems on a weekend hack. They invest in proper orchestration frameworks, observability, state management, and tool safety layers. This work isn’t glamorous, but it’s essential.

Fourth, they’re honest about limitations. I’ve yet to meet a team that successfully deployed agents without human oversight loops. The pattern is: agents do the work, humans review the high-risk decisions or outcomes. This isn’t a failure of the approach — it’s a realistic assessment of what autonomous systems can do today.
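The oversight loop can be as simple as a gate between an agent's decision and its execution. In this sketch the `risk` score and the 0.5 threshold are hypothetical; how risk is actually estimated is the hard part and is outside the scope of the example.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str
    risk: float  # hypothetical score in [0.0, 1.0] from a separate risk check


def route(decision: Decision, review_queue: list, threshold: float = 0.5) -> str:
    """Queue high-risk decisions for a human; let low-risk ones through."""
    if decision.risk >= threshold:
        review_queue.append(decision)
        return "pending_review"
    return "auto_approved"
```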

The Talent Gap

One thing I haven’t mentioned enough: building production agent systems requires different skills than building chatbots.

You need systems engineers who understand distributed state management, not just prompt engineers who know how to coax better outputs from LLMs. You need people who understand failure modes and recovery, not just people who read LLM documentation. You need people who treat agent systems like infrastructure, not like clever scripts.

Most companies have plenty of prompt engineers and too few systems engineers. This is creating a talent bottleneck. If you’re hiring right now for agent teams, you want systems engineers with platform experience, not just AI engineers with ChatGPT experience. The skill transfer is real but non-trivial.

The Next Phase: Agents Coordinating Other Agents

Once you have one agent system working, the next logical step is having agents spawn other agents. This is where things get genuinely complex.

Imagine a system where a high-level agent receives a business request, decomposes it into subproblems, spawns sub-agents to handle each subproblem, aggregates their results, and decides on a final action. Each sub-agent might spawn its own sub-agents. This is a recursive, hierarchical agent system — and it’s where the hardest problems live.

How do you prevent infinite recursion? How do you track state across multiple levels? How do you handle a sub-agent that takes an hour to complete? How do you debug a problem that emerged from a sub-agent’s sub-agent’s decision?
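For the first of those questions, the usual guard is a budget passed down the hierarchy: a maximum depth and a cap on the total number of agents spawned. This is only a sketch of that guard, not a working hierarchical agent system; `decompose` and `solve_leaf` are hypothetical callables standing in for model-backed agents.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Budget:
    max_depth: int
    max_agents: int
    spawned: int = 0


def solve(
    task: str,
    decompose: Callable[[str], list[str]],
    solve_leaf: Callable[[str], Any],
    budget: Budget,
    depth: int = 0,
) -> Any:
    """Recursively split a task, refusing to exceed the depth or agent budget."""
    budget.spawned += 1
    if budget.spawned > budget.max_agents:
        raise RuntimeError("agent budget exhausted")

    # At the depth limit the task is handled directly, even if it could be split further.
    subtasks = decompose(task) if depth < budget.max_depth else []
    if not subtasks:
        return solve_leaf(task)
    return [solve(sub, decompose, solve_leaf, budget, depth + 1) for sub in subtasks]
```

The shared `Budget` object also answers part of the state-tracking question, since every level of the hierarchy updates the same counter.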

I haven’t seen this pattern in production yet. A few teams are experimenting with it, and the ones I’m advising are treating it carefully. It’s the next frontier, but it’s not “ready” for mission-critical systems. Give it another year.

My Take

The shift from chat interfaces to production agent systems is real and accelerating. But it’s not a simple “bigger models = better agents” story. It’s a complex architectural problem that requires systems thinking, careful engineering, and honest assessment of limitations.

The companies that get this right will have significant competitive advantages. Being able to automate complex, knowledge-intensive workflows autonomously is genuinely valuable. But the path there is longer and more engineering-intensive than most teams expect.

If you’re considering agent systems for your organization, start with a specific, bounded problem. Don’t try to solve everything at once. Invest in orchestration, observability, and safety mechanisms upfront. Assume agents will occasionally be wrong and build detection and correction into your workflows. And hire systems engineers, not just AI enthusiasts.

The technology is ready. The hard part is building the systems discipline to use it safely. The teams that respect that challenge will ship production agent systems that actually work. The teams that see it as “just add AI” will learn some expensive lessons the hard way.

Production LLM agents are here. The question isn’t whether to use them. It’s how to use them responsibly at scale.

AI Models & Releases - This article is part of a series.
