April 20, 2026

Demos are Easy, Production is War

Your LLM isn't failing; your infrastructure is. An engineer's guide to the 11 pillars of production-grade AI agents.

Agentic AI, System Architecture, LLMOps, Infrastructure

Demos are cute. Production is a warzone.

A basic ReAct loop is fine for a weekend project, but scale it to 10+ autonomous steps, like having an agent orchestrate complex workflows across a suite of micro-apps, and watch it silently choke. When an agent hallucinates a tool call, forgets its objective, or gets stuck in an infinite loop, developers are quick to blame the model.

Usually, the model isn't the problem. Your infrastructure is.

Consider this: LangChain jumped from outside the top 30 to rank 5 on TerminalBench 2.0 without changing the model weights. They just overhauled the infrastructure.

As Beren Millidge put it: a raw LLM is just a CPU. It has no RAM, no disk, and no I/O. In an agentic architecture, your context window is the RAM, your databases are the disk, and your tools are the device drivers.

The code binding it all together, whether you're writing it in Go, TypeScript, or Python, is the operating system. We call it The Harness.

"If you're not the model, you're the harness." — Vivek Trivedy

The Three Levels of Agent Engineering

To graduate from toy scripts to autonomous systems, you have to master three concentric layers:

Prompt Engineering: Tweaking the instructions and few-shot examples. (Table stakes.)
Context Engineering: Curating what the model sees and when it sees it to maximize signal-to-noise.
Harness Engineering: Building the heavy-duty infrastructure: state persistence, orchestration, error recovery, and safety.

The 11 Pillars of a Production Harness

A production-grade harness isn't just a while loop. It’s a complex state machine built on 11 core components:

The Engine & State

Orchestration Loop: The heartbeat (Thought-Action-Observation). It manages the exact turn-by-turn handoffs between the LLM and the environment.
State Management: Time-travel debugging. Model your state as typed dictionaries with git-style checkpoints. If a branch fails, you roll back to the last good checkpoint instead of starting over.
Memory: Multi-timescale persistence. Don't dump everything into context. Use a hierarchy: a tiny always-loaded index, on-demand topic files, and search-only raw transcripts.
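
To make the loop and the checkpoint idea concrete, here is a minimal sketch in Python. It is not any framework's API: `AgentState`, `CheckpointStore`, `run_loop`, and the two `demo_*` stand-ins are hypothetical names invented for illustration, and the "model" is a scripted function rather than a real LLM call. The point is only the shape: commit a snapshot after every good step, and restore it when a tool fails midway.

```python
import copy
from typing import Callable, TypedDict


class AgentState(TypedDict):
    objective: str
    step: int
    observations: list[str]
    errors: list[str]


class CheckpointStore:
    """Append-only list of state snapshots, loosely like git commits."""

    def __init__(self) -> None:
        self._snapshots: list[AgentState] = []

    def commit(self, state: AgentState) -> int:
        # Deep-copy so later mutations cannot rewrite history.
        self._snapshots.append(copy.deepcopy(state))
        return len(self._snapshots) - 1

    def restore(self, checkpoint_id: int) -> AgentState:
        return copy.deepcopy(self._snapshots[checkpoint_id])


def run_loop(
    state: AgentState,
    decide: Callable[[AgentState], str],
    act: Callable[[AgentState, str], None],
    store: CheckpointStore,
    max_steps: int = 10,
) -> AgentState:
    last_good = store.commit(state)
    for _ in range(max_steps):  # hard cap so the loop cannot spin forever
        action = decide(state)  # Thought -> Action
        if action == "finish":
            break
        try:
            act(state, action)  # Action -> Observation (mutates state)
        except RuntimeError as exc:
            # Discard any half-written state, but keep the error visible
            # so the next decision can take it into account.
            state = store.restore(last_good)
            state["errors"].append(f"{action}: {exc}")
        else:
            state["step"] += 1
        last_good = store.commit(state)
    return state


# --- Scripted stand-ins for the model and the tools (assumptions) ---
def demo_decide(state: AgentState) -> str:
    if not state["errors"]:
        return "flaky"
    if not state["observations"]:
        return "safe"
    return "finish"


def demo_act(state: AgentState, action: str) -> None:
    state["observations"].append(f"partial output of {action}")
    if action == "flaky":
        raise RuntimeError("tool crashed midway")
    state["observations"][-1] = f"result of {action}"
```

Running `run_loop` with these stand-ins lets the first tool fail after it has already written a partial observation. The rollback throws that partial write away, records the error, and the second attempt proceeds from clean state.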

I/O & Context

Tools: The agent's hands. This requires strict schema registration, sandboxed execution environments, and flawless result capture.
Context Management: Context rot is real; performance can drop by 30% or more when vital information is buried in the middle of a long prompt. Use just-in-time retrieval to keep token counts lean.
Prompt Construction: Hierarchical assembly. Cap your developer instructions at 32 KiB to prevent the model from "forgetting" how to follow instructions.
Output Parsing: Regex won't save you here. Use native tool calling with strict schema constraints (like Pydantic) for machine-readable reliability.
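
Here is a rough sketch of what strict schema registration and structured parsing can look like. It uses only the standard library so it runs anywhere; in a real harness you would more likely lean on Pydantic models or your provider's native tool calling. The names `ToolSpec`, `register`, `parse_tool_call`, and `execute`, and the `{"tool": ..., "args": ...}` payload shape, are assumptions made up for this example.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    params: dict[str, type]  # parameter name -> exact required type
    handler: Callable[..., Any]


class ToolCallError(ValueError):
    """The model proposed a call that does not match a registered schema."""


REGISTRY: dict[str, ToolSpec] = {}


def register(spec: ToolSpec) -> None:
    REGISTRY[spec.name] = spec


def parse_tool_call(raw: str) -> tuple[ToolSpec, dict[str, Any]]:
    """Validate a JSON tool call against the registry before anything runs."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolCallError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ToolCallError("tool call must be a JSON object")

    name = payload.get("tool")
    if not isinstance(name, str) or name not in REGISTRY:
        raise ToolCallError(f"unknown tool: {name!r}")
    spec = REGISTRY[name]

    args = payload.get("args", {})
    if not isinstance(args, dict):
        raise ToolCallError("args must be a JSON object")
    if set(args) != set(spec.params):
        raise ToolCallError(
            f"expected args {sorted(spec.params)}, got {sorted(args)}"
        )
    for key, expected in spec.params.items():
        # Exact type match: "2" is not accepted where an int is required.
        if type(args[key]) is not expected:
            raise ToolCallError(f"arg {key!r} must be {expected.__name__}")
    return spec, args


def execute(raw: str) -> Any:
    spec, args = parse_tool_call(raw)
    return spec.handler(**args)


# Hypothetical tool used only for the demo.
register(ToolSpec("add", {"a": int, "b": int}, lambda a, b: a + b))
```

A well-formed call such as `execute('{"tool": "add", "args": {"a": 2, "b": 3}}')` reaches the handler. A call with `"a": "2"` raises `ToolCallError` before any tool code runs, and that error message is exactly the kind of thing you can hand back to the model.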

Safety & Reliability

Error Handling: The math of failure is brutal. A 10-step task with 99% per-step success succeeds only about 90% of the time end-to-end (0.99^10 ≈ 0.904). Your harness must auto-retry transient errors and feed LLM-recoverable errors back into the context so the model can self-correct.
Guardrails: The tripwires. Architecturally separate reasoning from permissions. Never let the model approve its own high-risk actions (like dropping a database).
Verification Loops: The "Ralph Loop." Use linters or LLM-as-judge checks before outputting to the user. This alone can 2x output quality.
Subagent Orchestration: Specialist delegation. When tasks get too broad, spin up isolated sub-agents with narrow scopes to handle specific domains.
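
The error-handling and guardrail ideas above can be sketched in a few lines. Treat this as an illustration under assumptions, not a reference implementation: the two exception classes, the `HIGH_RISK` set, and the function names are all hypothetical, and a real harness would classify errors from actual tool and API responses.

```python
import time
from typing import Callable, Optional


class TransientError(Exception):
    """E.g. a timeout or rate limit: the same call may succeed on retry."""


class RecoverableError(Exception):
    """E.g. bad arguments: the model must see the message and adjust."""


HIGH_RISK = {"drop_database"}  # hypothetical action names


def is_permitted(action: str, human_approved: bool) -> bool:
    # The permission check lives in code, outside the model's reasoning,
    # so the model cannot talk itself into approving a risky action.
    return action not in HIGH_RISK or human_approved


def call_with_recovery(
    tool: Callable[[], str],
    context: list[str],
    max_retries: int = 3,
    base_delay: float = 0.1,
) -> Optional[str]:
    """Retry transient failures; surface recoverable ones to the model."""
    for attempt in range(max_retries):
        try:
            return tool()
        except TransientError:
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2**attempt)
        except RecoverableError as exc:
            # Retrying unchanged would fail again, so feed the error back
            # into the context and let the next model turn fix the call.
            context.append(f"tool error: {exc}")
            return None
    context.append("tool error: retries exhausted")
    return None


def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step**steps
```

The split matters: a timeout gets retried silently, while a malformed call lands in `context` as text the model can read on its next turn. `end_to_end_success(0.99, 10)` is the arithmetic behind the roughly 90% figure, assuming step failures are independent.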

The Scaffolding Metaphor

Think of your harness as scaffolding. It’s temporary infrastructure that allows a model to reach heights it can’t yet achieve natively.

As underlying foundation models get smarter, your harness should naturally thin out. If you plug a next-generation model into your architecture and performance scales up without needing new code complexity, your harness is architecturally sound.

The Architect’s Cheat Sheet

Before you deploy your next agentic system, run through this checklist:

Agent Count: Keep it single-agent unless you have 10+ tools or strictly isolated domains. Multi-agent setups introduce massive overhead.
Context Strategy: Aggressively compact your context. Prioritize reasoning traces over raw, verbose tool outputs.
Tool Scoping: Only expose the exact tools needed for the current step. Extra tools are just "distraction tokens."
Thickness: Decide upfront which logic is strictly deterministic (stays in code) and which is probabilistic (delegated to the model).
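
Two of the checklist items, tool scoping and context compaction, are easy to show in code. The sketch below is one possible shape, with made-up step names, tool names, and a `Message` type that exist only for this example; the truncation rule is a deliberately crude stand-in for whatever summarization you actually use.

```python
from dataclasses import dataclass

# Hypothetical tool catalog: name -> short description shown to the model.
ALL_TOOLS = {
    "search_docs": "Search the internal documentation.",
    "read_file": "Read a file from the workspace.",
    "write_file": "Write a file to the workspace.",
    "run_tests": "Run the test suite.",
}

# Which tools each workflow step is allowed to see (an assumption).
STEP_TOOLS = {
    "research": ["search_docs", "read_file"],
    "implement": ["read_file", "write_file"],
    "verify": ["run_tests"],
}


def tools_for_step(step: str) -> dict[str, str]:
    """Expose only the tools needed for the current step."""
    return {name: ALL_TOOLS[name] for name in STEP_TOOLS[step]}


@dataclass
class Message:
    role: str  # "reasoning" or "tool"
    text: str


def compact(history: list[Message], max_tool_chars: int = 80) -> list[Message]:
    """Keep reasoning traces intact; truncate long raw tool outputs."""
    compacted = []
    for msg in history:
        if msg.role == "tool" and len(msg.text) > max_tool_chars:
            dropped = len(msg.text) - max_tool_chars
            kept = msg.text[:max_tool_chars]
            compacted.append(
                Message("tool", f"{kept} [truncated {dropped} chars]")
            )
        else:
            compacted.append(msg)
    return compacted
```

During the "verify" step the model sees a single tool, so there is nothing else to be distracted by. A 5,000-character log from an earlier step shrinks to its first 80 characters plus a marker, while the model's own reasoning stays word for word.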

The model is the brains, but the harness is where the actual engineering lives. Build it right, or prepare for casualties in production.