Chapter 8: Agent Architectures In Action

Overview

Enterprise agent architectures are defined less by model choice and more by control flow: how state evolves, how reasoning transitions into action, how failures are absorbed, and how humans remain meaningfully in the loop.

Key 2025-2026 Shifts

Orchestration Evolution: From chains to explicit graphs and state machines for controllable long-running execution
Evaluation Maturity: Beyond single-run success toward repeated-trial stability, robustness to paraphrases, and tolerance to tool faults
Governance Integration: Security posture resembles non-human identity management with risk-tiered actions

🎯 Control Flow Design

Reactive vs deliberative as state and execution decisions

🤝 Multi-Agent Patterns

Practical coordination for reliability and governance

📊 LangGraph Production

Graph-based orchestration with explicit state channels

💾 Durable Architectures

Persistent state for long-running, resumable execution

Reactive vs Deliberative Agents

Control Flow and State as Primary Design Axes

Reactive Agents

Implement a tight observe → act loop with minimal internal planning.

Strong tool contracts and deterministic routing
Bounded retries and fast execution
Externalized state (request IDs, short-lived caches)
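
A minimal sketch of the reactive loop described above, assuming a hypothetical `TransientToolError` type and a plain dict that maps intents to tool callables. None of these names come from a specific framework; they only illustrate deterministic routing, bounded retries, and state carried externally via a request ID.

```python
from typing import Any, Callable, Dict


class TransientToolError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def handle_request(
    intent: str,
    payload: Dict[str, Any],
    tools: Dict[str, Callable[[Dict[str, Any]], Any]],
    max_retries: int = 2,
) -> Dict[str, Any]:
    """Observe -> act: route by intent, call one tool, retry a bounded number of times."""
    request_id = payload.get("request_id")  # externalized state: only an ID is carried
    if intent not in tools:
        # Deterministic routing: unknown intents are rejected, never guessed.
        return {"status": "unroutable", "request_id": request_id}

    tool = tools[intent]
    last_error = ""
    for attempt in range(1, max_retries + 2):  # first try plus max_retries retries
        try:
            result = tool(payload)
            return {"status": "ok", "result": result,
                    "attempts": attempt, "request_id": request_id}
        except TransientToolError as exc:
            last_error = str(exc)
    return {"status": "failed", "error": last_error,
            "attempts": max_retries + 1, "request_id": request_id}
```

The loop never plans: it either answers within its retry budget or returns a structured failure that a caller can use as an escalation signal.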

Deliberative Agents

Construct explicit intermediate artifacts for multi-step execution.

Plans, hypotheses, task graphs, validation rubrics
Internalized state (checkpointed, meaningful on its own)
Support for pause, resume, and time-travel debugging

Execution Trade-offs and Failure Modes

Pattern	Wins	Common Failures
Reactive	Throughput, cost, latency	Overcommit under ambiguity, retry storms
Deliberative	Robustness, auditability	Plan bloat, brittleness to stale assumptions

Hybrid Mode Switching

Strong systems start reactively, then escalate to deliberation when needed.

mode_selector:
  default: reactive
  escalate_to_deliberative_if:
    - missing_required_fields: true
    - conflicting_constraints: true
    - "risk_tier >= medium"
    - "tool_failure.class in [deterministic, policy_violation]"
    - "retries_used >= 1"
  degrade_to_reactive_if:
    - task_is_transactional: true
    - "plan_steps <= 2"
    - "latency_slo_ms <= 1200"
  budgets:
    max_tool_calls: 12
    max_tokens: 15000
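
One way to read the mode_selector config above is as a small pure function. The sketch below makes several assumptions the config does not state: conditions within each list are OR'd, escalation takes precedence over degradation, and degradation is only considered when the agent is already deliberative. `TaskSignals` and `select_mode` are hypothetical names, and budget enforcement is left out.

```python
from dataclasses import dataclass
from typing import Optional

RISK_ORDER = {"low": 0, "medium": 1, "high": 2}
ESCALATING_FAILURES = {"deterministic", "policy_violation"}


@dataclass
class TaskSignals:
    """Hypothetical per-request signals; field names mirror the config keys."""
    missing_required_fields: bool = False
    conflicting_constraints: bool = False
    risk_tier: str = "low"
    tool_failure_class: Optional[str] = None
    retries_used: int = 0
    task_is_transactional: bool = False
    plan_steps: Optional[int] = None
    latency_slo_ms: Optional[int] = None


def select_mode(s: TaskSignals, current: str = "reactive") -> str:
    """Return "reactive" or "deliberative" (assumed OR semantics, escalation wins)."""
    escalate = (
        s.missing_required_fields
        or s.conflicting_constraints
        or RISK_ORDER[s.risk_tier] >= RISK_ORDER["medium"]
        or s.tool_failure_class in ESCALATING_FAILURES
        or s.retries_used >= 1
    )
    if escalate:
        return "deliberative"
    if current == "deliberative":
        degrade = (
            s.task_is_transactional
            or (s.plan_steps is not None and s.plan_steps <= 2)
            or (s.latency_slo_ms is not None and s.latency_slo_ms <= 1200)
        )
        return "reactive" if degrade else "deliberative"
    return "reactive"
```

Keeping the selector free of side effects makes it easy to unit-test each escalation rule and to log which rule fired.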

Case: Support Triage

A reactive triage agent resolves high-volume, low-risk requests (status lookups, invoice retrieval). Ambiguous or mixed-intent requests escalate to a deliberative planner-executor workflow, with a critic gate before any irreversible action.

Practical Multi-Agent Patterns

Multi-agent design is a reliability and governance strategy: separate concerns so that each agent has a narrower tool surface, clearer success criteria, and a bounded autonomy envelope.

Decomposition vs Specialization

Decomposition

Splits the work into steps or subproblems (planner-executor, map-reduce retrieval).

Objective: Reduced cognitive load, clearer intermediate artifacts

Specialization

Assigns stable roles that persist across tasks (planner, executor, verifier, safety critic).

Objective: Predictability through role-specific prompts and tool allowlists

Coordination Cost and Observability

The hidden cost of multi-agent systems is coordination, not tokens.

coordination_metrics:
  per_agent:
    - role
    - model
    - tool_allowlist
    - avg_turns
    - avg_tool_calls
    - repair_rate
    - harm_rate
  handoffs:
    - schema_valid_pass
    - missing_fields_rate
    - contradiction_rate
  system:
    - p95_latency_ms
    - escalation_rate
    - rollback_rate
    - pass_k_at_10

Key Insight

If you cannot answer "which agent decided what, and based on which evidence," you do not have a multi-agent system. You have a distributed rumor mill.
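
A minimal sketch of what answering that question could require in practice: an append-only decision log in which every entry names the agent, the decision, and the evidence it relied on. `DecisionRecord` and `DecisionLog` are hypothetical names, and the evidence IDs are illustrative strings rather than any particular tracing format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DecisionRecord:
    agent: str                 # which agent decided
    decision: str              # what it decided
    evidence: Tuple[str, ...]  # IDs of the tool outputs or documents it relied on


class DecisionLog:
    """Append-only log; records are immutable once written."""

    def __init__(self) -> None:
        self._records: List[DecisionRecord] = []

    def record(self, agent: str, decision: str, evidence: List[str]) -> None:
        if not evidence:
            # Refuse decisions that cannot be traced back to anything.
            raise ValueError("a decision without evidence cannot be audited")
        self._records.append(DecisionRecord(agent, decision, tuple(evidence)))

    def explain(self, decision: str) -> List[DecisionRecord]:
        """Return every record for a decision: who made it and based on what."""
        return [r for r in self._records if r.decision == decision]
```

For example, `log.record("planner", "refund_order", ["tool:orders.get#42"])` followed by `log.explain("refund_order")` returns the planner's record together with its evidence ID.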

Pattern 1: Supervisor and Worker Specialists

Most common enterprise pattern with centralized governance.

Supervisor classifies intent, selects workers, consolidates outcomes
Workers are narrow specialists with constrained tool access
Single governance choke point at supervisor level

Case: Enterprise IT Help Desk

A supervisor coordinates specialists for identity, device management, and ticketing. The device worker can access only MDM tools, and the ticketing worker can only write to the ITSM system. This separation prevents credential resets from being triggered during device troubleshooting.
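
The constrained tool access in this case can be sketched as a per-role allowlist checked before every call. All role and tool names below (`mdm_get_device`, `itsm_create_ticket`, and so on) are hypothetical placeholders, not real MDM or ITSM APIs.

```python
from typing import Any, Callable, Dict, Set


class ToolNotAllowed(Exception):
    """Raised when a worker requests a tool outside its allowlist."""


# Hypothetical allowlists: each worker sees only its own tool surface.
ALLOWLISTS: Dict[str, Set[str]] = {
    "identity": {"lookup_user", "reset_credentials"},
    "device": {"mdm_get_device", "mdm_push_profile"},
    "ticketing": {"itsm_create_ticket"},
}


def call_tool(role: str, tool: str,
              registry: Dict[str, Callable[..., Any]], **kwargs: Any) -> Any:
    """Invoke a tool only if the calling role is allowed to use it."""
    if tool not in ALLOWLISTS.get(role, set()):
        raise ToolNotAllowed(f"{role!r} may not call {tool!r}")
    return registry[tool](**kwargs)
```

With this check in place, a device worker that tries to call `reset_credentials` fails with `ToolNotAllowed` regardless of what its prompt or the model's reasoning says.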

Pattern 2: Planner-Executor with Verification Gates

Multi-agent analog of clean architecture with explicit validation.

handoff_contract:
  planner_output:
    objective: string
    steps:
      - id: string
        tool: string
        inputs: object
        success_check: string
    risk_tier: low|medium|high
    stop_conditions:
      - budget_exceeded
      - missing_required_id
      - policy_gate_triggered
  executor_rules:
    - execute_steps_in_order
    - no_tool_outside_allowlist
    - log_all_inputs_outputs
    - on_failure_return_structured_error

Why it works: Planner interprets failures and decides retry/replan/escalate; executor is deterministic and safe.
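
The executor rules in the handoff contract above could look roughly like the following. This is a sketch under stated assumptions: the plan is a plain dict shaped like `planner_output`, tools are ordinary callables, and `success_check` evaluation is omitted because the contract only defines it as a string. `run_plan` and `_error` are hypothetical names.

```python
from typing import Any, Callable, Dict, List, Set


def _error(step_id: str, error_class: str, message: str,
           log: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the structured error handed back to the planner."""
    return {"ok": False, "failed_step": step_id, "error_class": error_class,
            "message": message, "log": log}


def run_plan(plan: Dict[str, Any], tools: Dict[str, Callable[..., Any]],
             allowlist: Set[str]) -> Dict[str, Any]:
    """Execute steps in order; stop at the first failure and report it."""
    log: List[Dict[str, Any]] = []
    for step in plan["steps"]:
        step_id, tool, inputs = step["id"], step["tool"], step["inputs"]
        if tool not in allowlist:
            # no_tool_outside_allowlist: refuse before calling anything.
            return _error(step_id, "policy_violation",
                          f"tool {tool!r} not in allowlist", log)
        try:
            output = tools[tool](**inputs)
        except Exception as exc:
            # log_all_inputs_outputs applies to failed attempts too.
            log.append({"step": step_id, "inputs": inputs, "error": repr(exc)})
            return _error(step_id, "tool_error", repr(exc), log)
        log.append({"step": step_id, "inputs": inputs, "output": output})
    return {"ok": True, "log": log}
```

The executor never retries or improvises. It returns `failed_step` and `error_class`, and the planner decides whether to retry, replan, or escalate.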

Pattern 3: Critic and Judge Loops

Separates generation from evaluation for high-impact actions.

Pre-action critics: Validate plan before any tool call
Post-action critics: Validate end state, not narrative

Use when: Action is irreversible, tool output needs semantic validation, compliance requirement exists
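
"Validate end state, not narrative" can be illustrated with a post-action critic that reads the system of record back instead of trusting the agent's own summary. In this sketch, `read_state` is a hypothetical callable standing in for whatever lookup the target system offers.

```python
from typing import Any, Callable, Dict


def post_action_critic(expected: Dict[str, Any],
                       read_state: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Compare the expected end state with what the system of record reports."""
    actual = read_state()  # fresh read-back, independent of the agent's claims
    mismatches = {
        key: {"expected": value, "actual": actual.get(key)}
        for key, value in expected.items()
        if actual.get(key) != value
    }
    return {"approved": not mismatches, "mismatches": mismatches}
```

If the agent reports "refund issued" but the read-back still shows `status: pending`, the critic returns `approved: False` together with the mismatching field.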

Pattern 4: Router-Worker for Cost and Latency Control

Cheap classifier assigns tasks to workers based on intent, risk tier, required tools.

Router owns:

Model selection and fallbacks
Risk-tier assignment
Reactive vs deliberative determination
Human approval requirements

Quick Engineering Heuristic

Ambiguous or multi-step with external dependencies → planner-executor
Broad domain with different tool surfaces → supervisor-worker
High volume, low cost requirements → router-worker
High cost of wrong action → add critic gates

LangGraph Production Pipelines

LangGraph represents a shift from prompt-centric orchestration to explicit execution graphs. Most production failures occur at boundaries between steps, not inside reasoning.

Why Graphs Beat Linear Chains

Graphs Externalize Control Flow

Explicit recovery paths: Failures branch to recovery nodes instead of blind retries
Composable governance: Policy checks, budget enforcement as nodes
Static analyzability: Worst-case paths, latency, and cost can be reasoned about from the graph structure itself

Nodes, Edges, and Channels

graph:
  nodes:
    plan:
      type: reasoning
    execute:
      type: tool_call
    validate:
      type: critic
    recover:
      type: recovery
  edges:
    - plan -> execute
    - execute -> validate
    - validate -> plan
    - validate -> recover
  channels:
    - plan_state
    - execution_results
    - validation_signals

Key Advantage: Channels make state explicit, enabling versioning, validation, and replay
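
To make the node, edge, and channel vocabulary concrete, here is a small plain-Python runner for the graph above. It deliberately does not use the LangGraph API. Each node is a function that writes to named channels in a shared state dict and returns the name of the next node. The `END` terminal, the single-replan limit, and the `inject_fault` flag are assumptions added for illustration.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def plan(state: State) -> str:
    state["plan_state"] = ["restart_service"]  # hypothetical one-step plan
    return "execute"


def execute(state: State) -> str:
    outcome = "failed" if state.get("inject_fault") else "ok"
    state["execution_results"] = {step: outcome for step in state["plan_state"]}
    return "validate"


def validate(state: State) -> str:
    passed = all(v == "ok" for v in state["execution_results"].values())
    state["validation_signals"] = {"passed": passed}
    if passed:
        return "END"                     # assumed terminal; not in the source graph
    if state.get("replans", 0) < 1:      # allow one replan before recovery
        state["replans"] = state.get("replans", 0) + 1
        return "plan"
    return "recover"


def recover(state: State) -> str:
    state["validation_signals"]["recovered"] = True
    return "END"


NODES: Dict[str, Callable[[State], str]] = {
    "plan": plan, "execute": execute, "validate": validate, "recover": recover,
}


def run(state: State, start: str = "plan", max_steps: int = 10) -> List[str]:
    """Walk the graph from `start`, returning the visited path."""
    node, path = start, []
    while node != "END":
        if len(path) >= max_steps:
            raise RuntimeError("step budget exceeded")
        path.append(node)
        node = NODES[node](state)
    return path
```

Because control flow lives in the return values and state lives in named channels, the visited path and every channel value can be logged, persisted, or replayed without inspecting any prompt.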

Determinism, Replay, and Debuggability

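Graph-based orchestration makes replay practical. When state transitions and tool calls are explicit and their inputs and outputs are recorded, a failed execution can be reconstructed step by step, even though the underlying model calls are not themselves deterministic.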

Persist graph state and channel contents to durable backends (DynamoDB, PostgreSQL)
Enable pause, resume, and replay across infrastructure restarts
Reconstruct exact reasoning and execution paths for debugging

Integrating ARC+ Patterns

ARC+ execution patterns map naturally onto graph structures:

Failure classification → validation node
Recovery ladders → explicit recovery branches
Budget enforcement → gating node that halts or reroutes

Case: Incident Remediation

An infrastructure agent uses LangGraph with explicit branches for transient and deterministic failures. Transient failures route to retry with exponential backoff, deterministic failures route to replanning, and policy violations route to human approval. This design eliminated the retry storms seen in the earlier chain-based implementation.
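
The routing in this case reduces to two small functions: one maps a failure class to a branch, the other produces a capped exponential backoff schedule for the retry branch. The branch names, default parameters, and the choice to send unknown failure classes to human approval are assumptions for illustration.

```python
from typing import List

# Failure class -> graph branch, as described in the case above.
ROUTES = {
    "transient": "retry",
    "deterministic": "replan",
    "policy_violation": "human_approval",
}


def route_failure(failure_class: str) -> str:
    """Pick the recovery branch; unknown classes go to a human (assumed default)."""
    return ROUTES.get(failure_class, "human_approval")


def backoff_delays(base_s: float = 0.5, factor: float = 2.0,
                   max_retries: int = 4, cap_s: float = 30.0) -> List[float]:
    """Delays in seconds before each retry: base * factor**i, capped at cap_s."""
    return [min(cap_s, base_s * factor ** i) for i in range(max_retries)]
```

With the defaults, `backoff_delays()` returns `[0.5, 1.0, 2.0, 4.0]`. Bounding `max_retries` and routing only transient failures to retry is what keeps a retry storm from forming. Production systems often add random jitter to each delay, which this sketch omits.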

When LangGraph Is the Right Choice

Short, single-step tasks: Chains or reactive execution sufficient
Multi-step with retries, branching, governance: Graph-based orchestration essential

Rule of thumb: If you care about resumability, auditability, or bounded recovery, you want a graph.

Persistent and Durable Agent Architectures

Durability is the point at which agent systems stop behaving like demos and start behaving like software.

Why Durability Changes the Architecture

Stateless Agents

Fail silently when a process crashes or a deployment rolls over. The system forgets what it was doing.

Durable Agents

Fail recoverably. Resume with context, intent, and prior decisions intact.

Three Non-Negotiable Requirements

Explicit state representation: Progress captured as structured state, not implied by prompt history
Persistent storage: State survives process restarts and infrastructure failures
Resumable execution: System can continue from last checkpoint without data loss

Durable Execution Patterns

Checkpointed State Machines

Execution progress saved at key transition points. Enables pause, resume, and rollback.
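
A minimal checkpointing sketch follows. It uses SQLite from the Python standard library purely so the example is self-contained; it stands in for a durable backend such as PostgreSQL or DynamoDB. The table layout and function names are assumptions, not a prescribed schema.

```python
import json
import sqlite3
from typing import Any, Dict, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the checkpoint store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT, seq INTEGER, next_node TEXT, state TEXT, "
        "PRIMARY KEY (run_id, seq))"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, run_id: str, seq: int,
                    next_node: str, state: Dict[str, Any]) -> None:
    """Persist state at a transition point, before the next node runs."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
        (run_id, seq, next_node, json.dumps(state)),
    )
    conn.commit()


def load_latest(conn: sqlite3.Connection, run_id: str) -> Optional[Dict[str, Any]]:
    """Return the most recent checkpoint for a run, or None if there is none."""
    row = conn.execute(
        "SELECT seq, next_node, state FROM checkpoints "
        "WHERE run_id = ? ORDER BY seq DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    if row is None:
        return None
    seq, next_node, state = row
    return {"seq": seq, "next_node": next_node, "state": json.loads(state)}
```

After a restart, a worker would call `load_latest(conn, run_id)` and continue from `next_node` with the restored state. Rolling back amounts to loading an earlier `seq` instead of the latest one.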

Event Sourcing

All state changes captured as immutable events. Complete audit trail and time-travel debugging.
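
Event sourcing can be sketched as an immutable event list plus a pure fold that rebuilds state. The event kinds below (`plan_created`, `step_completed`, `approval_granted`) are hypothetical. Replaying only up to a given sequence number gives a simple form of the time-travel debugging mentioned above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Event:
    seq: int
    kind: str
    data: Dict[str, Any]


def apply_event(state: Dict[str, Any], event: Event) -> Dict[str, Any]:
    """Pure transition: returns a new state and never mutates the input."""
    new = dict(state)
    if event.kind == "plan_created":
        new["plan"] = list(event.data["steps"])
        new["done"] = []
    elif event.kind == "step_completed":
        new["done"] = new.get("done", []) + [event.data["step_id"]]
    elif event.kind == "approval_granted":
        new["approved_by"] = event.data["approver"]
    return new


def replay(events: List[Event], up_to_seq: Optional[int] = None) -> Dict[str, Any]:
    """Rebuild state from the log, optionally stopping at an earlier point in time."""
    state: Dict[str, Any] = {}
    for event in sorted(events, key=lambda e: e.seq):
        if up_to_seq is not None and event.seq > up_to_seq:
            break
        state = apply_event(state, event)
    return state
```

Because current state is always derived from the log, the log itself serves as the audit trail, and a bug can be investigated by replaying to the event just before things went wrong.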

Distributed Workflow Orchestration

Workflow engines (AWS Step Functions, Temporal) handle durability, retries, and compensation.

Production Durability Requirements

State persisted to durable storage (PostgreSQL, DynamoDB, managed workflow stores)
Execution traces with correlation IDs for distributed tracing
Replay capability from any checkpoint
Graceful degradation when external dependencies fail
Human approval workflows with asynchronous continuations

Why Agent Architectures In Action Matter

Production Readiness: Move from demos to real systems with failure handling
Control Flow Mastery: Explicit state transitions and recovery paths
Observable Behavior: Deterministic replay and debugging capabilities
Governance Integration: Policy enforcement at architectural boundaries
Operational Resilience: Durability, resumability, and graceful degradation

Agent architectures in action wrap probabilistic reasoning in deterministic control flow, yielding observable and resilient production systems that balance autonomy with oversight.