Last updated on May 20, 2026

Why AI Agents Need Classic System Design

There is a common belief floating around the engineering community: in the age of AI, system design is no longer important, because prompts, vector databases, and LLM API calls are all you need to ship a product.

If you believe this, just wait until your two-hour autonomous agent run fails at step 95 of 100.

This scenario is not a hypothetical headache. It is a daily reality for anyone trying to move AI agents from simple playground demonstrations into production environments. As engineering leader Arpit Bhayani recently noted:

"System design is not important anymore - if you believe this, just wait until your 2-hour agent run fails at step 95 of 100 :) I hope it does not happen, but it actually might. Agentic apps have two classes: short-running and long-running."

When we move from short-running helpers to long-running, autonomous agents, building the app becomes a distributed systems engineering problem, not a prompt engineering one.

Here is why system design is more critical than ever in the age of AI, and how we can apply classic engineering patterns to build resilient agentic systems.


The Two Classes of Agentic Applications

To understand why agents fail, we must first categorize them by their execution lifecycle.

1. Short-Running Agents

These are stateless, low-latency applications designed for quick, single-turn or few-turn tasks.

  • Examples: Customer support chatbots, translation assistants, and text summarizers.
  • System Complexity: Low. They typically wrap an LLM API in a simple synchronous HTTP request.
  • Failure Recovery: If a call fails due to a network timeout or API rate limit, the client simply retries the request. The user suffers a delay of a few seconds, and the token cost of retrying is negligible.
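For short-running agents, the "simply retry" recovery described above can be sketched in a few lines. This is a minimal illustration, not a production client: `TransientError` and `call_with_retry` are hypothetical names, and the backoff parameters are assumptions chosen for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait a random time up to base_delay * 2^(attempt - 1) before retrying.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Because the whole request is cheap to repeat, no state has to survive between attempts. That assumption is exactly what breaks for long-running agents.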

2. Long-Running Agents

These are stateful, high-latency applications designed to execute complex, multi-step workflows autonomously over minutes or hours.

  • Examples: Autonomous software engineering workspaces (like Devin or Edward), recursive research agents, multi-hour web scraping pipelines, and multi-agent workflow coordinators.
  • System Complexity: High. They execute loops, run code, use tools, self-correct, and make non-deterministic routing decisions.
  • Failure Recovery: If a network blip or API rate limit occurs at step 95 of a 100-step run, restarting from step 1 is unacceptable. It wastes valuable time, burns thousands of LLM tokens, and completely destroys user trust.

To make long-running agents viable, we must design them using classic backend architectural patterns.


The Resilient Agent Blueprint: Core System Design Patterns

A dependable long-running agent rests on five foundational system design pillars.

Mermaid
graph TD
    A[Client Request] --> B[API Gateway]
    B --> C[Asynchronous Task Queue - BullMQ/Redis]
    C --> D[Agent Execution Loop]
    D --> E[State Store - Postgres Checkpointing]
    D --> F[Secure Docker Sandbox - Isolated Code Run]
    D --> G[Streaming State-Machine Parser]
    D --> H[Semantic Cache - Redis]
    G --> I[LLM API]
    H --> I
    D --> J[Observability Engine - Kafka/ClickHouse Logs]

1. Checkpointing and State Persistence (Durability)

An agent loop must write its state to a persistent datastore after every successful action. If the system crashes, the agent must be able to restore its memory, directory structure, task list, and execution history from the last known checkpoint.

  • How it works in practice: In Edward (an AI coding workspace), we implemented a preview and recovery architecture. By persisting the complete workspace state to PostgreSQL, we ensured that the agent could resume its work without starting over. If a multi-turn run is interrupted, the system restores the environment and picks up from the exact step where it stopped.
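The checkpoint-and-resume loop can be sketched as follows. This is not Edward's implementation: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table layout, `run_agent`, and the `act` callback are assumptions made for illustration.

```python
import json
import sqlite3
from typing import Callable, Optional, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the checkpoint store (SQLite here as a stand-in for PostgreSQL)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        " run_id TEXT NOT NULL, step INTEGER NOT NULL, state TEXT NOT NULL,"
        " PRIMARY KEY (run_id, step))"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, run_id: str, step: int, state: dict) -> None:
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (run_id, step, state) VALUES (?, ?, ?)",
            (run_id, step, json.dumps(state)),
        )


def load_latest(conn: sqlite3.Connection, run_id: str) -> Optional[Tuple[int, dict]]:
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE run_id = ? ORDER BY step DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))


def run_agent(
    conn: sqlite3.Connection, run_id: str, total_steps: int, act: Callable[[int], str]
) -> dict:
    """Run (or resume) a run; `act` is a hypothetical per-step action that may raise."""
    latest = load_latest(conn, run_id)
    step, state = latest if latest else (0, {"history": []})
    while step < total_steps:
        step += 1
        state["history"].append(act(step))  # a failure here leaves the last checkpoint intact
        save_checkpoint(conn, run_id, step, state)  # persist only after the action succeeds
    return state
```

If `act` raises at step 95, calling `run_agent` again with the same `run_id` reloads step 94 and repeats only step 95 onward. Writing the full state on every step keeps the sketch short; a real system would more likely store incremental deltas.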

2. Decoupled Asynchronous Processing (Fault Tolerance)

A client should never hold an HTTP connection open while waiting for an agent to run a multi-hour loop. The execution must be fully decoupled from the request-response cycle.

  • How it works in practice: We use an asynchronous background worker architecture. By routing agent runs through BullMQ and Redis, we isolate the execution loop from web server restarts or network drops. If an individual worker node dies, another worker claims the active job and loads the latest checkpoint from the database.
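The "another worker claims the active job" behavior can be illustrated with a lease-based claim. The sketch below is a generic in-memory stand-in, not BullMQ's API: `LeaseQueue`, its methods, and the lease duration are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Job:
    run_id: str
    owner: Optional[str] = None
    lease_expires_at: float = 0.0
    done: bool = False


class LeaseQueue:
    """In-memory job queue where a claim is only valid until its lease expires."""

    def __init__(self, lease_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._jobs: Dict[str, Job] = {}
        self._lease = lease_seconds
        self._clock = clock

    def enqueue(self, run_id: str) -> None:
        self._jobs[run_id] = Job(run_id)

    def claim(self, worker_id: str) -> Optional[Job]:
        """Claim an unowned job, or one whose previous owner let the lease lapse."""
        now = self._clock()
        for job in self._jobs.values():
            if not job.done and (job.owner is None or job.lease_expires_at <= now):
                job.owner = worker_id
                job.lease_expires_at = now + self._lease
                return job
        return None

    def heartbeat(self, worker_id: str, run_id: str) -> bool:
        """Extend the lease; returns False if the job now belongs to someone else."""
        job = self._jobs[run_id]
        if job.owner != worker_id or job.done:
            return False
        job.lease_expires_at = self._clock() + self._lease
        return True

    def complete(self, worker_id: str, run_id: str) -> bool:
        job = self._jobs[run_id]
        if job.owner != worker_id:
            return False
        job.done = True
        return True
```

A healthy worker calls `heartbeat` between agent steps. A dead worker stops doing so, its lease lapses, and the next `claim` hands the job to a live worker, which then loads the latest checkpoint before continuing.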

3. Structured Streaming Parsers (Input/Output Safety)

LLMs stream their outputs token by token. An agent relies on parsing these streams in real time to detect tool calls (e.g., running a terminal command or writing a file). If your parser is fragile, a single unexpected markdown character from the LLM can crash the entire execution loop.

  • How it works in practice: In Edward, we built a custom AI orchestration layer featuring a streaming state-machine parser. Instead of waiting for the full LLM response, the parser processes raw text chunks on the fly, converting them into structured, type-safe events. This keeps the execution loop safe from malformed model outputs.
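A streaming state-machine parser of this kind might look like the sketch below. It is not Edward's parser: the `<tool>...</tool>` wire format, the JSON payload shape, and the event classes are assumptions invented for the example.

```python
import json
from dataclasses import dataclass
from typing import List, Union

OPEN, CLOSE = "<tool>", "</tool>"  # hypothetical tool-call delimiters


@dataclass
class TextEvent:
    text: str


@dataclass
class ToolCallEvent:
    name: str
    args: dict


@dataclass
class ErrorEvent:
    raw: str
    reason: str


Event = Union[TextEvent, ToolCallEvent, ErrorEvent]


def _held_back(buf: str, tag: str) -> int:
    """Length of the longest suffix of buf that could be the start of tag."""
    for n in range(min(len(buf), len(tag) - 1), 0, -1):
        if buf.endswith(tag[:n]):
            return n
    return 0


class StreamParser:
    """Two-state machine: outside a tool call (text) or inside one (payload)."""

    def __init__(self) -> None:
        self._buf = ""
        self._in_tool = False

    def feed(self, chunk: str) -> List[Event]:
        self._buf += chunk
        events: List[Event] = []
        while True:
            tag = CLOSE if self._in_tool else OPEN
            idx = self._buf.find(tag)
            if idx == -1:
                break
            body, self._buf = self._buf[:idx], self._buf[idx + len(tag):]
            events.extend(self._emit(body))
            self._in_tool = not self._in_tool  # state transition on every tag
        if not self._in_tool:
            # Flush plain text, but hold back a tail that may be a split "<tool>" tag.
            cut = len(self._buf) - _held_back(self._buf, OPEN)
            safe, self._buf = self._buf[:cut], self._buf[cut:]
            if safe:
                events.append(TextEvent(safe))
        return events

    def _emit(self, body: str) -> List[Event]:
        if not self._in_tool:
            return [TextEvent(body)] if body else []
        try:
            payload = json.loads(body)
            return [ToolCallEvent(str(payload["name"]), dict(payload.get("args", {})))]
        except (ValueError, KeyError, TypeError, AttributeError) as exc:
            # Malformed model output becomes an event instead of crashing the loop.
            return [ErrorEvent(body, type(exc).__name__)]
```

The two properties that matter are that a tag split across chunks is still recognized, and that a malformed payload surfaces as an `ErrorEvent` the agent loop can react to, rather than as an exception.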

4. Sandbox Isolation and Resource Limits (Security)

Autonomous agents write and execute code. Running generated code directly on your application server is a major security risk. It can lead to infinite loops, memory exhaustion, or malicious system access.

  • How it works in practice: Agent execution must occur in an isolated environment. In Edward, we designed a sandbox execution pipeline in which every command and every compilation step runs inside a dedicated Docker container. The write pipeline is buffered via Redis to handle burst loads, and each container has strict memory and CPU limits to prevent resource exhaustion.
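As a rough sketch of what "strict memory and CPU limits" can mean in practice, the helper below assembles a `docker run` command line with resource and network restrictions. The function names, default limits, and `/workspace` mount point are assumptions for illustration, not Edward's actual configuration.

```python
import subprocess
from typing import List


def build_sandbox_argv(
    image: str,
    command: List[str],
    workdir_host: str,
    memory: str = "512m",
    cpus: str = "1.0",
    pids_limit: int = 128,
) -> List[str]:
    """Build a `docker run` invocation that caps resources for untrusted code."""
    return [
        "docker", "run", "--rm",
        "--network", "none",              # no network access from generated code
        "--memory", memory,               # hard memory cap
        "--cpus", cpus,                   # CPU quota
        "--pids-limit", str(pids_limit),  # bounds the damage from fork bombs
        "--read-only",                    # read-only root filesystem
        "--volume", f"{workdir_host}:/workspace",
        "--workdir", "/workspace",
        image, *command,
    ]


def run_in_sandbox(argv: List[str], timeout_s: float = 60) -> subprocess.CompletedProcess:
    """Run the sandboxed command and capture its output and exit code."""
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
```

One caveat with this sketch: the `timeout` here stops the Docker CLI process, which does not by itself guarantee the container is stopped. A production pipeline would likely name each container and stop it explicitly when the timeout fires.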

5. Semantic Caching and Cost Control (Efficiency)

Long-running agents query LLMs repeatedly in loops, often with near-identical prompts. Without caching, every one of those calls adds to the API bill and to end-to-end latency.

  • How it works in practice: In Agentic Chat, we built a semantic caching layer that compares each incoming query against past queries for semantic similarity before hitting the LLM. This pattern reduced API costs by 40% and enabled an intent-classification router to run in under 100 ms with 92% path-selection accuracy.
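The lookup logic behind a semantic cache can be sketched as below. This is not the Agentic Chat implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, the list scan stands in for a Redis-backed vector lookup, and the 0.8 similarity threshold is an arbitrary assumption.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: lowercase word counts (a real system would call a model)."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector] = toy_embed, threshold: float = 0.8):
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the answer for the most similar past query, if similar enough."""
        q = self._embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self._embed(query), answer))


def cached_call(cache: SemanticCache, query: str, llm: Callable[[str], str]) -> str:
    """Consult the cache first; only call the LLM on a miss."""
    hit = cache.get(query)
    if hit is not None:
        return hit
    answer = llm(query)
    cache.put(query, answer)
    return answer
```

The threshold is the main tuning knob: set it too low and the cache returns answers to questions that were never asked, set it too high and it rarely hits.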

Observability: The Key to Debugging Non-Determinism

Even with checkpointing and sandboxing, agents will make logical errors because LLM outputs are non-deterministic. Debugging a system that behaves differently on every run requires deep observability.

To audit an agent, you need to see:

  1. The exact prompt sent at each step.
  2. The raw text and tool calls returned by the model.
  3. The console logs and exit codes from the sandboxed tools.

For this, standard application logs are insufficient. You need real-time, structured trace logs. In our systems, we use a streaming pipeline, inspired by the architecture of DeployNinja, where live execution logs are streamed through Kafka and written to ClickHouse. This allows us to inspect agent steps in real time, analyze performance regressions, and trace failure points instantly.
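One way to capture the three items listed above is a single structured record per agent step, serialized as one JSON line. The field names and the `sink` callback below are assumptions for illustration; in a pipeline like the one described, the sink would be a Kafka producer rather than a Python list.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepTrace:
    run_id: str
    step: int
    prompt: str                       # the exact prompt sent at this step
    raw_output: str                   # raw model text, before parsing
    tool_calls: List[dict] = field(default_factory=list)
    tool_stdout: str = ""             # console output from the sandboxed tool
    exit_code: Optional[int] = None   # None when no tool ran at this step
    ts: float = field(default_factory=time.time)


def emit_trace(trace: StepTrace, sink: Callable[[str], None]) -> None:
    """Serialize one step as a single JSON line and hand it to the sink."""
    sink(json.dumps(asdict(trace), sort_keys=True))
```

Keeping one self-contained record per step means a failed run can be replayed by filtering on `run_id` and sorting by `step`, without joining several log streams.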


Conclusion: AI is 10% LLM, 90% Engineering

The next wave of AI adoption will not be driven by smarter models alone. It will be driven by systems that make existing models reliable.

An agentic application is ultimately a software system. The AI is simply a non-deterministic component within that system. To build agents that survive in production, we must stop focusing solely on prompt templates and start focusing on state management, message queues, secure sandboxes, and trace logging.

System design is not dead. It is the only thing keeping your agents alive.
