Chapter 8: Agent Architectures In Action

From Architecture to Execution Reality

Overview

Enterprise agent architectures are defined less by model choice and more by control flow: how state evolves, how reasoning transitions into action, how failures are absorbed, and how humans remain meaningfully in the loop.

Key 2025-2026 Shifts

  • Orchestration Evolution: From chains to explicit graphs and state machines for controllable long-running execution
  • Evaluation Maturity: Beyond single-run success toward repeated-trial stability, robustness to paraphrases, and tolerance to tool faults
  • Governance Integration: Security posture resembles non-human identity management with risk-tiered actions

🎯 Control Flow Design

Reactive vs deliberative as state and execution decisions

🤝 Multi-Agent Patterns

Practical coordination for reliability and governance

📊 LangGraph Production

Graph-based orchestration with explicit state channels

💾 Durable Architectures

Persistent state for long-running, resumable execution

Reactive vs Deliberative Agents

Control Flow and State as Primary Design Axes

Reactive Agents

Implement a tight observe → act loop with minimal internal planning.

  • Strong tool contracts and deterministic routing
  • Bounded retries and fast execution
  • Externalized state (request IDs, short-lived caches)
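
A reactive loop with these properties can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `TransientToolError`, `run_reactive`, and the `routes` table are hypothetical names, and the intent is assumed to be classified before the request arrives.

```python
from typing import Callable, Dict


class TransientToolError(Exception):
    """Hypothetical error type marking a failure that is worth retrying."""


def run_reactive(
    request: Dict,
    routes: Dict[str, Callable[[Dict], Dict]],
    max_retries: int = 2,
) -> Dict:
    """Observe the request, route deterministically, act with bounded retries."""
    request_id = request["request_id"]  # state stays external, keyed by request ID
    tool = routes.get(request["intent"])  # deterministic routing: a plain lookup
    if tool is None:
        return {"request_id": request_id, "status": "unrouted", "attempts": 0}

    for attempt in range(1, max_retries + 2):  # first try plus max_retries retries
        try:
            result = tool(request)
        except TransientToolError:
            continue  # bounded: the loop ends after max_retries retries
        return {"request_id": request_id, "status": "ok",
                "result": result, "attempts": attempt}
    return {"request_id": request_id, "status": "failed",
            "attempts": max_retries + 1}
```

The loop never plans; it either finds a route and acts, or reports that it could not. Anything that needs more than that is a candidate for escalation.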

Deliberative Agents

Construct explicit intermediate artifacts for multi-step execution.

  • Plans, hypotheses, task graphs, validation rubrics
  • Internalized state (checkpointed, meaningful on its own)
  • Support for pause, resume, and time-travel debugging

Execution Trade-offs and Failure Modes

Pattern        Wins                        Common Failures
Reactive       Throughput, cost, latency   Overcommit under ambiguity, retry storms
Deliberative   Robustness, auditability    Plan bloat, brittleness to stale assumptions

Hybrid Mode Switching

Strong systems start reactively, then escalate to deliberation when needed.

mode_selector:
  default: reactive
  escalate_to_deliberative_if:
    - missing_required_fields: true
    - conflicting_constraints: true
    - "risk_tier >= medium"
    - "tool_failure.class in [deterministic, policy_violation]"
    - "retries_used >= 1"
  degrade_to_reactive_if:
    - task_is_transactional: true
    - "plan_steps <= 2"
    - "latency_slo_ms <= 1200"
  budgets:
    max_tool_calls: 12
    max_tokens: 15000
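
The selector above could be evaluated roughly as follows. The sketch makes two assumptions the config does not state: each condition list is treated as any-of, and escalation takes precedence over degradation. `TaskSignals` and `select_mode` are illustrative names, and budget enforcement is left out.

```python
from dataclasses import dataclass
from typing import Optional

RISK_ORDER = {"low": 0, "medium": 1, "high": 2}
ESCALATING_FAILURES = {"deterministic", "policy_violation"}


@dataclass
class TaskSignals:
    """Signals the selector reads; field names mirror the config keys."""
    missing_required_fields: bool = False
    conflicting_constraints: bool = False
    risk_tier: str = "low"
    tool_failure_class: Optional[str] = None
    retries_used: int = 0
    task_is_transactional: bool = False
    plan_steps: Optional[int] = None       # None means no plan exists yet
    latency_slo_ms: Optional[int] = None   # None means no latency SLO is set


def should_escalate(s: TaskSignals) -> bool:
    # Assumption: any single condition is enough to escalate.
    return (
        s.missing_required_fields
        or s.conflicting_constraints
        or RISK_ORDER[s.risk_tier] >= RISK_ORDER["medium"]
        or s.tool_failure_class in ESCALATING_FAILURES
        or s.retries_used >= 1
    )


def should_degrade(s: TaskSignals) -> bool:
    # Assumption: any single condition is enough to degrade.
    return (
        s.task_is_transactional
        or (s.plan_steps is not None and s.plan_steps <= 2)
        or (s.latency_slo_ms is not None and s.latency_slo_ms <= 1200)
    )


def select_mode(s: TaskSignals, current: str = "reactive") -> str:
    # Assumption: escalation wins when both sets of conditions hold.
    if should_escalate(s):
        return "deliberative"
    if current == "deliberative" and should_degrade(s):
        return "reactive"
    return current
```

Giving escalation priority is the conservative choice: a transactional task that has already hit a policy violation stays in the deliberative path.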

Case: Support Triage

A reactive triage agent resolves high-volume, low-risk requests (status lookups, invoice retrieval). Ambiguous or mixed-intent requests escalate to a deliberative planner-executor workflow, with a critic gate before any irreversible action.

Practical Multi-Agent Patterns

Multi-agent design is a reliability and governance strategy: separate concerns so that each agent has a narrower tool surface, clearer success criteria, and a bounded autonomy envelope.

Decomposition vs Specialization

Decomposition

Splits the work into steps or subproblems (planner-executor, map-reduce retrieval).

Objective: Reduced cognitive load, clearer intermediate artifacts

Specialization

Assigns stable roles that persist across tasks (planner, executor, verifier, safety critic).

Objective: Predictability through role-specific prompts and tool allowlists

Coordination Cost and Observability

The hidden cost of multi-agent systems is coordination, not tokens.

coordination_metrics:
  per_agent:
    - role
    - model
    - tool_allowlist
    - avg_turns
    - avg_tool_calls
    - repair_rate
    - harm_rate
  handoffs:
    - schema_valid_pass
    - missing_fields_rate
    - contradiction_rate
  system:
    - p95_latency_ms
    - escalation_rate
    - rollback_rate
    - pass_k_at_10

Key Insight

If you cannot answer "which agent decided what, and based on which evidence," you do not have a multi-agent system. You have a distributed rumor mill.
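
One way to keep that question answerable is to record every decision together with the evidence it cites. The sketch below is an in-memory illustration; `Decision`, `DecisionLog`, and the evidence ID format are assumptions, and a production log would be persisted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Decision:
    trace_id: str               # correlates decisions across agents in one run
    agent: str                  # which agent decided
    decision: str               # what it decided
    evidence: Tuple[str, ...]   # IDs of the tool outputs or documents it relied on


class DecisionLog:
    def __init__(self) -> None:
        self._entries: List[Decision] = []

    def record(self, entry: Decision) -> None:
        # Refuse decisions that cite nothing: those are the "rumors".
        if not entry.evidence:
            raise ValueError("a decision must cite at least one piece of evidence")
        self._entries.append(entry)

    def explain(self, trace_id: str) -> List[str]:
        """Answer 'which agent decided what, and based on which evidence'."""
        return [
            f"{e.agent} decided {e.decision!r} based on {', '.join(e.evidence)}"
            for e in self._entries
            if e.trace_id == trace_id
        ]
```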

Pattern 1: Supervisor and Worker Specialists

Most common enterprise pattern with centralized governance.

  • Supervisor classifies intent, selects workers, consolidates outcomes
  • Workers are narrow specialists with constrained tool access
  • Single governance choke point at supervisor level

Case: Enterprise IT Help Desk

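A supervisor coordinates specialists for identity, device management, and ticketing. The device worker can access only MDM tools, and the ticketing worker can only write to the ITSM system. Because credential tools sit outside the device worker's allowlist, a credential reset cannot happen during device troubleshooting.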
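
The constraint in this case comes down to a per-worker allowlist checked before every tool call. A minimal sketch, in which the worker names follow the case and every tool name is hypothetical:

```python
from typing import Any, Callable, Dict, Set


class ToolNotAllowed(Exception):
    """Raised when a worker requests a tool outside its allowlist."""


# Hypothetical tool names; only the worker roles come from the case above.
ALLOWLISTS: Dict[str, Set[str]] = {
    "identity": {"lookup_user", "reset_credentials"},
    "device": {"mdm_query", "mdm_push_profile"},
    "ticketing": {"itsm_write"},
}


def call_tool(
    worker: str,
    tool: str,
    registry: Dict[str, Callable[..., Any]],
    **kwargs: Any,
) -> Any:
    """Single choke point: every tool call is checked against the allowlist."""
    if tool not in ALLOWLISTS.get(worker, set()):
        raise ToolNotAllowed(f"{worker} may not call {tool}")
    return registry[tool](**kwargs)
```

Because the check lives in one function rather than in each worker's prompt, a misbehaving device worker can ask for `reset_credentials` but cannot obtain it.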

Pattern 2: Planner-Executor with Verification Gates

Multi-agent analog of clean architecture with explicit validation.

handoff_contract:
  planner_output:
    objective: string
    steps:
      - id: string
        tool: string
        inputs: object
        success_check: string
    risk_tier: low|medium|high
    stop_conditions:
      - budget_exceeded
      - missing_required_id
      - policy_gate_triggered
  executor_rules:
    - execute_steps_in_order
    - no_tool_outside_allowlist
    - log_all_inputs_outputs
    - on_failure_return_structured_error

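Why it works: The planner interprets failures and decides whether to retry, replan, or escalate; the executor only runs the steps it is given, in order and within its allowlist, so its behavior stays deterministic and bounded.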
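
The executor rules in the contract can be sketched as a loop that runs steps in order, logs inputs and outputs, and returns a structured error instead of raising. `Step` and `execute_plan` are illustrative names, the error classes are assumptions, and the `success_check` field is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Set


@dataclass
class Step:
    id: str
    tool: str
    inputs: Dict[str, Any]


def execute_plan(
    steps: List[Step],
    tools: Dict[str, Callable[..., Any]],
    allowlist: Set[str],
) -> List[Dict[str, Any]]:
    """Run steps in order, log inputs and outputs, stop at the first failure."""
    log: List[Dict[str, Any]] = []
    for step in steps:
        entry: Dict[str, Any] = {"step": step.id, "tool": step.tool, "inputs": step.inputs}
        if step.tool not in allowlist:
            # no_tool_outside_allowlist: refuse before touching the tool
            entry.update(ok=False, error={"class": "policy_violation",
                                          "detail": "tool not in allowlist"})
            log.append(entry)
            break
        try:
            entry.update(ok=True, output=tools[step.tool](**step.inputs))
        except Exception as exc:
            # on_failure_return_structured_error: the planner decides what happens next
            entry.update(ok=False, error={"class": "tool_error", "detail": str(exc)})
        log.append(entry)
        if not entry["ok"]:
            break
    return log
```

The executor never retries or replans on its own. It hands the log back, and the planner interprets the last entry.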

Pattern 3: Critic and Judge Loops

Separates generation from evaluation for high-impact actions.

  • Pre-action critics: Validate plan before any tool call
  • Post-action critics: Validate end state, not narrative

Use when: The action is irreversible, tool output needs semantic validation, or a compliance requirement exists
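
A critic gate can be sketched as a wrapper around the action: one critic inspects the plan before anything runs, another inspects the observed end state afterwards. All names here (`run_with_critics`, the critic callables, the status strings) are illustrative assumptions; in practice a critic might be a rule set, a second model, or both.

```python
from typing import Any, Callable, Dict, Tuple

Verdict = Tuple[bool, str]  # (approved, reason)


def run_with_critics(
    plan: Dict[str, Any],
    pre_critic: Callable[[Dict[str, Any]], Verdict],
    act: Callable[[Dict[str, Any]], None],
    read_end_state: Callable[[], Dict[str, Any]],
    post_critic: Callable[[Dict[str, Any], Dict[str, Any]], Verdict],
) -> Dict[str, Any]:
    # Pre-action critic: validate the plan before any tool call.
    approved, reason = pre_critic(plan)
    if not approved:
        return {"status": "blocked", "reason": reason}

    act(plan)

    # Post-action critic: judge the observed end state, not the agent's narrative.
    end_state = read_end_state()
    verified, reason = post_critic(plan, end_state)
    return {
        "status": "verified" if verified else "needs_review",
        "reason": reason,
        "end_state": end_state,
    }
```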

Pattern 4: Router-Worker for Cost and Latency Control

Cheap classifier assigns tasks to workers based on intent, risk tier, required tools.

Router owns:

  • Model selection and fallbacks
  • Risk-tier assignment
  • Reactive vs deliberative determination
  • Human approval requirements
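
The four responsibilities can be captured in a single routing decision. The sketch below assumes the intent has already been produced by the cheap classifier; the intent table, worker names, model names, and the rule that only high-risk routes need approval are all placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Route:
    worker: str
    models: Tuple[str, ...]      # ordered preference: primary first, then fallbacks
    risk_tier: str
    mode: str                    # "reactive" or "deliberative"
    needs_human_approval: bool


# Placeholder table: intent -> (worker, risk tier).
INTENT_TABLE: Dict[str, Tuple[str, str]] = {
    "status_lookup": ("lookup_worker", "low"),
    "invoice_retrieval": ("billing_worker", "low"),
    "refund": ("billing_worker", "high"),
}


def route(intent: str) -> Route:
    # Unknown intents fall back to a generalist at medium risk (an assumption).
    worker, risk = INTENT_TABLE.get(intent, ("generalist_worker", "medium"))
    low_risk = risk == "low"
    return Route(
        worker=worker,
        models=("small-model", "large-model") if low_risk else ("large-model",),
        risk_tier=risk,
        mode="reactive" if low_risk else "deliberative",
        needs_human_approval=(risk == "high"),
    )
```

Keeping the decision in one immutable record means downstream workers receive their model, mode, and approval requirements rather than inferring them.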

Quick Engineering Heuristic

  • Ambiguous or multi-step with external dependencies → planner-executor
  • Broad domain with different tool surfaces → supervisor-worker
  • High volume, low cost requirements → router-worker
  • High cost of wrong action → add critic gates

LangGraph Production Pipelines

LangGraph represents a shift from prompt-centric orchestration to explicit execution graphs. Most production failures occur at boundaries between steps, not inside reasoning.

Why Graphs Beat Linear Chains

Graphs Externalize Control Flow

  • Explicit recovery paths: Failures branch to recovery nodes instead of blind retries
  • Composable governance: Policy checks, budget enforcement as nodes
  • Analyzable control flow: Worst-case paths, latency, and cost can be bounded from the graph structure

Nodes, Edges, and Channels

graph:
  nodes:
    plan:
      type: reasoning
    execute:
      type: tool_call
    validate:
      type: critic
    recover:
      type: recovery
  edges:
    - plan -> execute
    - execute -> validate
    - validate -> plan
    - validate -> recover
  channels:
    - plan_state
    - execution_results
    - validation_signals

Key Advantage: Channels make state explicit, enabling versioning, validation, and replay
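
To show the mechanics without depending on any framework, here is a small plain-Python runner for the graph above. It is not the LangGraph API: `run_graph` and the node functions are illustrative, each node returns its channel updates plus the name of the next node, and the rule that validation replans twice before recovering is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]
Node = Callable[[State], Tuple[State, Optional[str]]]  # (channel updates, next node)


def plan(state: State) -> Tuple[State, Optional[str]]:
    version = int(state.get("plan_state", 0)) + 1
    return {"plan_state": version}, "execute"


def execute(state: State) -> Tuple[State, Optional[str]]:
    # Stand-in for a tool call: the first plan fails, the revised plan succeeds.
    ok = int(state["plan_state"]) >= 2
    return {"execution_results": {"ok": ok}}, "validate"


def validate(state: State) -> Tuple[State, Optional[str]]:
    if state["execution_results"]["ok"]:
        return {"validation_signals": "pass"}, None      # terminal
    if int(state["plan_state"]) < 3:
        return {"validation_signals": "replan"}, "plan"  # validate -> plan
    return {"validation_signals": "recover"}, "recover"  # validate -> recover


def recover(state: State) -> Tuple[State, Optional[str]]:
    return {"validation_signals": "escalated"}, None


def run_graph(nodes: Dict[str, Node], start: str, state: State,
              max_steps: int = 20) -> Tuple[State, List[str]]:
    """Walk the graph, merging each node's updates into the shared channels."""
    path: List[str] = []
    current: Optional[str] = start
    while current is not None:
        if len(path) >= max_steps:
            raise RuntimeError("step budget exceeded")
        path.append(current)
        updates, current = nodes[current](state)
        state = {**state, **updates}
    return state, path


NODES: Dict[str, Node] = {"plan": plan, "execute": execute,
                          "validate": validate, "recover": recover}
final_state, path = run_graph(NODES, "plan", {})
```

Because every transition passes through `run_graph`, the visited path and the channel contents at each step are available for logging, budgeting, and replay.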

Determinism, Replay, and Debuggability

Graph-based orchestration enables deterministic replay of recorded executions. When state transitions and tool calls are explicit and persisted, a failed execution can be reconstructed step by step from its recorded state rather than by re-running the model.

  • Persist graph state and channel contents to durable backends (DynamoDB, PostgreSQL)
  • Enable pause, resume, and replay across infrastructure restarts
  • Reconstruct exact reasoning and execution paths for debugging
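
Checkpoint persistence can be sketched with a single table keyed by run and step. The example uses Python's built-in `sqlite3` purely as a stand-in for a durable backend such as PostgreSQL or DynamoDB; the schema and function names are assumptions.

```python
import json
import sqlite3
from typing import Any, Dict, List, Optional, Tuple

Checkpoint = Tuple[int, str, Dict[str, Any]]  # (step, node, channel state)


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT, step INTEGER, node TEXT, state TEXT, "
        "PRIMARY KEY (run_id, step))"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, run_id: str, step: int,
                    node: str, state: Dict[str, Any]) -> None:
    """Persist the channel contents after a node finishes."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
        (run_id, step, node, json.dumps(state)),
    )
    conn.commit()


def load_latest(conn: sqlite3.Connection, run_id: str) -> Optional[Checkpoint]:
    """Resume point: the most recent checkpoint of a run, if any."""
    row = conn.execute(
        "SELECT step, node, state FROM checkpoints "
        "WHERE run_id = ? ORDER BY step DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    if row is None:
        return None
    return row[0], row[1], json.loads(row[2])


def replay(conn: sqlite3.Connection, run_id: str) -> List[Checkpoint]:
    """Reconstruct the execution path in order, for debugging."""
    rows = conn.execute(
        "SELECT step, node, state FROM checkpoints "
        "WHERE run_id = ? ORDER BY step",
        (run_id,),
    ).fetchall()
    return [(step, node, json.loads(state)) for step, node, state in rows]
```

With a file path instead of `:memory:`, the same calls survive a process restart: a new process calls `load_latest` and continues from the recorded node.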

Integrating ARC+ Patterns

ARC+ execution patterns map naturally onto graph structures:

  • Failure classification → validation node
  • Recovery ladders → explicit recovery branches
  • Budget enforcement → gating node that halts or reroutes

Case: Incident Remediation

An infrastructure agent uses LangGraph with explicit branches for each failure class. Transient failures route to retry with exponential backoff. Deterministic failures route to replanning. Policy violations route to human approval. This eliminated the retry storms seen in the earlier chain-based implementation.
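
The routing in this case reduces to a lookup on the failure class plus a capped backoff schedule. A minimal sketch: the branch names, the choice to send unknown failure classes to human approval, and the backoff parameters are all assumptions.

```python
from typing import List

# Failure class -> branch, following the case above.
FAILURE_ROUTES = {
    "transient": "retry_with_backoff",
    "deterministic": "replan",
    "policy_violation": "human_approval",
}


def route_failure(failure_class: str) -> str:
    # Assumption: anything unclassified is treated conservatively.
    return FAILURE_ROUTES.get(failure_class, "human_approval")


def backoff_delays(base_s: float = 0.5, factor: float = 2.0,
                   max_retries: int = 4, cap_s: float = 30.0) -> List[float]:
    """Exponential backoff schedule, bounded in both count and delay."""
    return [min(cap_s, base_s * factor ** attempt) for attempt in range(max_retries)]
```

Only the transient branch ever sleeps and retries, and it does so a fixed number of times. That bound is what keeps a deterministic failure from turning into a retry storm.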

When LangGraph Is the Right Choice

  • Short, single-step tasks: Chains or reactive execution sufficient
  • Multi-step with retries, branching, governance: Graph-based orchestration essential

Rule of thumb: If you care about resumability, auditability, or bounded recovery, you want a graph.

Persistent and Durable Agent Architectures

Durability is the point at which agent systems stop behaving like demos and start behaving like software.

Why Durability Changes the Architecture

Stateless Agents

Fail silently when a process crashes or a deployment rolls over. The system forgets what it was doing.

Durable Agents

Fail recoverably. Resume with context, intent, and prior decisions intact.

Three Non-Negotiable Requirements

  • Explicit state representation: Progress captured as structured state, not implied by prompt history
  • Persistent storage: State survives process restarts and infrastructure failures
  • Resumable execution: System can continue from last checkpoint without data loss

Durable Execution Patterns

Checkpointed State Machines

Execution progress saved at key transition points. Enables pause, resume, and rollback.

Event Sourcing

All state changes captured as immutable events. Complete audit trail and time-travel debugging.
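
Event sourcing can be illustrated by folding an append-only list of events into the current state. The event kinds and field names below are hypothetical; the point is that state at any earlier moment is recoverable by replaying a prefix of the log.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Event:
    seq: int                 # position in the append-only log
    kind: str
    data: Dict[str, Any]


def apply_event(state: Dict[str, Any], event: Event) -> Dict[str, Any]:
    """Pure transition: returns a new state and never mutates the old one."""
    new_state = dict(state)
    if event.kind == "plan_created":
        new_state["plan"] = list(event.data["steps"])
        new_state["completed"] = []
    elif event.kind == "step_completed":
        new_state["completed"] = state.get("completed", []) + [event.data["step_id"]]
    elif event.kind == "run_paused":
        new_state["paused"] = True
    return new_state


def rebuild(events: List[Event], upto_seq: Optional[int] = None) -> Dict[str, Any]:
    """Fold events in order; pass upto_seq to see the state at an earlier point."""
    state: Dict[str, Any] = {}
    for event in sorted(events, key=lambda e: e.seq):
        if upto_seq is not None and event.seq > upto_seq:
            break
        state = apply_event(state, event)
    return state
```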

Distributed Workflow Orchestration

Workflow engines (AWS Step Functions, Temporal) handle durability, retries, and compensation.

Production Durability Requirements

  • State persisted to durable storage (PostgreSQL, DynamoDB, managed workflow stores)
  • Execution traces with correlation IDs for distributed tracing
  • Replay capability from any checkpoint
  • Graceful degradation when external dependencies fail
  • Human approval workflows with asynchronous continuations
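
An asynchronous approval step can be sketched as suspend-then-resume: the run parks its pending action and state, and a later callback continues from exactly that point. `ApprovalQueue` is a hypothetical in-memory illustration; a durable system would keep the pending record in persistent storage so it survives restarts.

```python
from typing import Any, Callable, Dict


class ApprovalQueue:
    def __init__(self) -> None:
        # In-memory for illustration only; persist this in a real deployment.
        self._pending: Dict[str, Dict[str, Any]] = {}

    def suspend(self, run_id: str, action: Dict[str, Any],
                state: Dict[str, Any]) -> str:
        """Park the run before a gated action and release the worker."""
        self._pending[run_id] = {"action": action, "state": state}
        return "awaiting_approval"

    def resume(self, run_id: str, approved: bool,
               execute: Callable[[Dict[str, Any]], Any]) -> Dict[str, Any]:
        """Continuation: invoked whenever the human decision arrives."""
        pending = self._pending.pop(run_id)
        if not approved:
            return {**pending["state"], "status": "rejected"}
        result = execute(pending["action"])
        return {**pending["state"], "status": "completed", "result": result}
```

Nothing blocks while the approval is outstanding. The run is just a stored record until `resume` is called, which is what lets approvals take minutes or days.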

Why Agent Architectures In Action Matter

Agent architectures in action wrap probabilistic reasoning in controlled, observable, and resilient production systems that balance autonomy with oversight.