Overview
Enterprise agent architectures are defined less by model choice than by control flow: how state evolves, how reasoning transitions into action, how failures are absorbed, and how humans remain meaningfully in the loop.
Key 2025-2026 Shifts
- Orchestration Evolution: From chains to explicit graphs and state machines for controllable long-running execution
- Evaluation Maturity: Beyond single-run success toward repeated-trial stability, robustness to paraphrases, and tolerance to tool faults
- Governance Integration: Security posture resembles non-human identity management with risk-tiered actions
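Repeated-trial stability can be made concrete as an all-of-k pass rate: a task counts only if every one of its k runs succeeds. The following is a minimal sketch under that assumption; `all_of_k_pass_rate` and the sample results are hypothetical and not taken from any particular evaluation framework.

```python
from typing import Dict, List


def all_of_k_pass_rate(trials: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k trials ALL succeeded.

    A task with fewer than k recorded trials counts as a failure,
    because stability over k runs was never demonstrated.
    """
    if not trials:
        return 0.0
    stable = sum(
        1 for runs in trials.values()
        if len(runs) >= k and all(runs[:k])
    )
    return stable / len(trials)


# Hypothetical results: task id -> outcome of each repeated run.
results = {
    "refund_lookup": [True, True, True],
    "invoice_fetch": [True, False, True],
    "status_check": [True, True, True],
    "address_change": [True, True, False],
}

single_run = all_of_k_pass_rate(results, k=1)  # looks only at the first run
three_runs = all_of_k_pass_rate(results, k=3)  # requires three clean runs
```

With this sample data the single-run rate is perfect while the three-run rate drops to half, which is the gap that single-run success metrics hide.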
🎯 Control Flow Design: Reactive vs deliberative as state and execution decisions
🤝 Multi-Agent Patterns: Practical coordination for reliability and governance
📊 LangGraph Production: Graph-based orchestration with explicit state channels
💾 Durable Architectures: Persistent state for long-running, resumable execution
Reactive vs Deliberative Agents
Control Flow and State as Primary Design Axes
Reactive Agents
Implement a tight observe → act loop with minimal internal planning.
- Strong tool contracts and deterministic routing
- Bounded retries and fast execution
- Externalized state (request IDs, short-lived caches)
Deliberative Agents
Construct explicit intermediate artifacts for multi-step execution.
- Plans, hypotheses, task graphs, validation rubrics
- Internalized state (checkpointed, meaningful on its own)
- Support for pause, resume, and time-travel debugging
Execution Trade-offs and Failure Modes
| Pattern | Wins | Common Failures |
|---|---|---|
| Reactive | Throughput, cost, latency | Overcommit under ambiguity, retry storms |
| Deliberative | Robustness, auditability | Plan bloat, brittleness to stale assumptions |
Hybrid Mode Switching
Strong systems start reactively, then escalate to deliberation when needed.
mode_selector:
  default: reactive
  escalate_to_deliberative_if:
    - missing_required_fields: true
    - conflicting_constraints: true
    - risk_tier >= medium
    - tool_failure.class in [deterministic, policy_violation]
    - retries_used >= 1
  degrade_to_reactive_if:
    - task_is_transactional: true
    - plan_steps <= 2
    - latency_slo_ms <= 1200
  budgets:
    max_tool_calls: 12
    max_tokens: 15000
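The mode selector above can be read as a small pure function. The sketch below is one possible interpretation, not a definitive implementation: it assumes each condition list is any-of, that escalation takes precedence over degradation, and that degradation is only considered while already in deliberative mode. `RequestSignals` and `select_mode` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

RISK_ORDER = {"low": 0, "medium": 1, "high": 2}
ESCALATING_FAILURES = {"deterministic", "policy_violation"}


@dataclass
class RequestSignals:
    """Hypothetical per-request signals mirroring the config keys."""
    missing_required_fields: bool = False
    conflicting_constraints: bool = False
    risk_tier: str = "low"
    tool_failure_class: Optional[str] = None
    retries_used: int = 0
    task_is_transactional: bool = False
    plan_steps: int = 0
    latency_slo_ms: int = 5000


def select_mode(current: str, s: RequestSignals) -> str:
    """Return 'reactive' or 'deliberative' for the next execution step."""
    escalate = (
        s.missing_required_fields
        or s.conflicting_constraints
        or RISK_ORDER[s.risk_tier] >= RISK_ORDER["medium"]
        or s.tool_failure_class in ESCALATING_FAILURES
        or s.retries_used >= 1
    )
    if escalate:
        return "deliberative"  # assumption: escalation always wins
    if current == "deliberative":
        degrade = (
            s.task_is_transactional
            or s.plan_steps <= 2
            or s.latency_slo_ms <= 1200
        )
        return "reactive" if degrade else "deliberative"
    return "reactive"  # the configured default
```

Keeping the selector free of side effects means the same signals always produce the same mode, which makes mode switches easy to log and to test.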
Case: Support Triage
A reactive triage agent resolves high-volume, low-risk requests (status lookups, invoice retrieval). Ambiguous or mixed-intent requests escalate to a deliberative planner-executor workflow with a critic gate before any irreversible action.
Practical Multi-Agent Patterns
Multi-agent design is a reliability and governance strategy: separate concerns so that each agent has a narrower tool surface, clearer success criteria, and a bounded autonomy envelope.
Decomposition vs Specialization
Decomposition
Splits the work into steps or subproblems (planner-executor, map-reduce retrieval).
Objective: Reduced cognitive load, clearer intermediate artifacts
Specialization
Assigns stable roles that persist across tasks (planner, executor, verifier, safety critic).
Objective: Predictability through role-specific prompts and tool allowlists
Coordination Cost and Observability
The hidden cost of multi-agent systems is coordination, not tokens.
coordination_metrics:
  per_agent:
    - role
    - model
    - tool_allowlist
    - avg_turns
    - avg_tool_calls
    - repair_rate
    - harm_rate
  handoffs:
    - schema_valid_pass
    - missing_fields_rate
    - contradiction_rate
  system:
    - p95_latency_ms
    - escalation_rate
    - rollback_rate
    - pass_k_at_10
Key Insight
If you cannot answer "which agent decided what, and based on which evidence," you do not have a multi-agent system. You have a distributed rumor mill.
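One way to make "which agent decided what, and based on which evidence" answerable is an append-only decision trace. The sketch below is illustrative only; `Decision`, `DecisionTrace`, and the evidence identifiers are hypothetical names, not part of any specific framework.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Decision:
    agent: str
    action: str
    evidence: Tuple[str, ...]  # ids of tool outputs or messages relied on


@dataclass
class DecisionTrace:
    records: List[Decision] = field(default_factory=list)

    def record(self, agent: str, action: str, evidence: Iterable[str]) -> None:
        """Append one decision; records are never edited afterwards."""
        self.records.append(Decision(agent, action, tuple(evidence)))

    def who_decided(self, action: str) -> List[Decision]:
        """All decisions that produced the given action."""
        return [d for d in self.records if d.action == action]

    def unsupported(self) -> List[Decision]:
        """Decisions made without citing any evidence."""
        return [d for d in self.records if not d.evidence]


trace = DecisionTrace()
trace.record("supervisor", "route_to_device_worker", ["msg:1"])
trace.record("device_worker", "reenroll_device", ["tool:mdm_status:7"])
trace.record("ticketing_worker", "close_ticket", [])
```

Here `trace.unsupported()` surfaces the ticket closure that cited no evidence, which is the kind of handoff a reviewer would want to inspect first.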
Pattern 1: Supervisor and Worker Specialists
The most common enterprise pattern, with centralized governance.
- Supervisor classifies intent, selects workers, consolidates outcomes
- Workers are narrow specialists with constrained tool access
- Single governance choke point at supervisor level
Case: Enterprise IT Help Desk
A supervisor coordinates specialists for identity, device management, and ticketing. The device worker can access only MDM tools, and the ticketing worker can only write to the ITSM system. Because the device worker holds no identity tools, it cannot reset credentials during device troubleshooting.
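The help desk case can be sketched with per-worker tool allowlists enforced at call time. Everything here is a stand-in: the tool names, the `Worker` class, and the intent-to-worker mapping in `supervise` are hypothetical, and the tools are stubs rather than real MDM, identity, or ITSM clients.

```python
from typing import Callable, Dict, Set

# Stub tools; a real system would call MDM, identity, and ITSM APIs.
TOOLS: Dict[str, Callable[[str], str]] = {
    "mdm.get_device_status": lambda arg: f"device {arg}: compliant",
    "idp.reset_credentials": lambda arg: f"credentials reset for {arg}",
    "itsm.create_ticket": lambda arg: f"ticket opened: {arg}",
}


class Worker:
    def __init__(self, name: str, allowlist: Set[str]) -> None:
        self.name = name
        self.allowlist = allowlist

    def can_call(self, tool: str) -> bool:
        return tool in self.allowlist

    def call_tool(self, tool: str, arg: str) -> str:
        # The allowlist is checked on every call, not just at routing time.
        if not self.can_call(tool):
            raise PermissionError(f"{self.name} may not call {tool}")
        return TOOLS[tool](arg)


WORKERS = {
    "device": Worker("device", {"mdm.get_device_status"}),
    "identity": Worker("identity", {"idp.reset_credentials"}),
    "ticketing": Worker("ticketing", {"itsm.create_ticket"}),
}


def supervise(intent: str) -> Worker:
    """Single choke point: the supervisor alone maps intents to workers."""
    routes = {
        "device_issue": "device",
        "password_reset": "identity",
        "open_ticket": "ticketing",
    }
    return WORKERS[routes[intent]]


device_worker = supervise("device_issue")
status = device_worker.call_tool("mdm.get_device_status", "laptop-42")
```

Because enforcement lives in `call_tool`, a prompt-injected or confused device worker still cannot reach the credential reset tool.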
Pattern 2: Planner-Executor with Verification Gates
Multi-agent analog of clean architecture with explicit validation.
handoff_contract:
  planner_output:
    objective: string
    steps:
      - id: string
        tool: string
        inputs: object
        success_check: string
        risk_tier: low|medium|high
    stop_conditions:
      - budget_exceeded
      - missing_required_id
      - policy_gate_triggered
  executor_rules:
    - execute_steps_in_order
    - no_tool_outside_allowlist
    - log_all_inputs_outputs
    - on_failure_return_structured_error
Why it works: The planner interprets failures and decides whether to retry, replan, or escalate; the executor stays deterministic and confined to its allowlist.
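The executor rules in the contract can be illustrated with a short function that runs steps in order, refuses tools outside the allowlist, logs inputs and outputs, and returns a structured error instead of deciding what to do next. This is a sketch under assumptions: `run_plan`, the error class names, and the sample tools are hypothetical, and `success_check` evaluation is deliberately left out.

```python
from typing import Any, Callable, Dict, List, Set


def run_plan(
    steps: List[Dict[str, Any]],
    tools: Dict[str, Callable[..., Any]],
    allowlist: Set[str],
) -> Dict[str, Any]:
    """Execute steps in order; stop at the first failure with a structured error."""
    log: List[Dict[str, Any]] = []
    for step in steps:
        tool = step["tool"]
        if tool not in allowlist:
            return {"status": "error", "step_id": step["id"],
                    "error_class": "policy_violation", "log": log}
        try:
            output = tools[tool](**step["inputs"])
        except Exception as exc:  # report the failure, do not interpret it
            log.append({"step_id": step["id"], "inputs": step["inputs"],
                        "error": str(exc)})
            return {"status": "error", "step_id": step["id"],
                    "error_class": "tool_failure", "log": log}
        log.append({"step_id": step["id"], "inputs": step["inputs"],
                    "output": output})
    return {"status": "ok", "log": log}


# Hypothetical tools and plan for illustration.
tools = {
    "lookup_order": lambda order_id: {"order_id": order_id, "state": "shipped"},
    "issue_refund": lambda order_id: {"refunded": order_id},
}
plan_steps = [
    {"id": "s1", "tool": "lookup_order", "inputs": {"order_id": "A17"}},
    {"id": "s2", "tool": "issue_refund", "inputs": {"order_id": "A17"}},
]
result = run_plan(plan_steps, tools, allowlist={"lookup_order"})
```

The executor never retries or replans on its own; it hands the structured error back so the planner can choose between retry, replan, and escalation.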
Pattern 3: Critic and Judge Loops
Separates generation from evaluation for high-impact actions.
- Pre-action critics: Validate plan before any tool call
- Post-action critics: Validate end state, not narrative
Use when: The action is irreversible, tool output needs semantic validation, or a compliance requirement exists
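A pre-action critic and a post-action critic can wrap execution as two gates. In the sketch below the critics are plain rule checks rather than model calls, which is an assumption made to keep the example runnable; `pre_action_critic`, `post_action_critic`, `gated_run`, and the `approved` field are hypothetical.

```python
from typing import Callable, Dict, List


def pre_action_critic(plan: List[dict]) -> List[str]:
    """Return objections found before any tool call is made."""
    return [
        f"step {s['id']} is high risk and has no approval"
        for s in plan
        if s.get("risk_tier") == "high" and not s.get("approved", False)
    ]


def post_action_critic(expected: Dict[str, object],
                       actual: Dict[str, object]) -> List[str]:
    """Compare the observed end state with the expected one, key by key."""
    return [k for k, v in expected.items() if actual.get(k) != v]


def gated_run(plan: List[dict],
              execute: Callable[[List[dict]], None],
              read_state: Callable[[], Dict[str, object]],
              expected: Dict[str, object]) -> Dict[str, object]:
    objections = pre_action_critic(plan)
    if objections:
        return {"status": "blocked", "objections": objections}
    execute(plan)
    # Validate what the system looks like now, not what the agent says it did.
    mismatches = post_action_critic(expected, read_state())
    if mismatches:
        return {"status": "failed_validation", "mismatches": mismatches}
    return {"status": "ok"}


system = {"account_status": "active"}


def close_account(plan: List[dict]) -> None:
    system["account_status"] = "closed"


unapproved = [{"id": "s1", "tool": "close_account", "risk_tier": "high"}]
approved = [{"id": "s1", "tool": "close_account", "risk_tier": "high",
             "approved": True}]
goal = {"account_status": "closed"}

blocked = gated_run(unapproved, close_account, lambda: system, goal)
status_after_block = system["account_status"]
done = gated_run(approved, close_account, lambda: system, goal)
```

The blocked run leaves the system untouched, and the approved run is accepted only because the observed end state matches the goal.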
Pattern 4: Router-Worker for Cost and Latency Control
A cheap classifier assigns tasks to workers based on intent, risk tier, and required tools.
Router owns:
- Model selection and fallbacks
- Risk-tier assignment
- Reactive vs deliberative determination
- Human approval requirements
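The router's four responsibilities can be captured in a single routing record. The sketch below uses a keyword stub in place of a real classifier; the model names, worker names, and routing table are placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    worker: str
    model: str
    fallback_model: str
    risk_tier: str
    mode: str  # "reactive" or "deliberative"
    needs_human_approval: bool


# Placeholder routing table; every name here is hypothetical.
ROUTES = {
    "status_lookup": Route("lookup_worker", "small-model", "medium-model",
                           "low", "reactive", False),
    "refund_request": Route("billing_worker", "medium-model", "large-model",
                            "high", "deliberative", True),
}
# Unknown intents take the cautious path.
DEFAULT_ROUTE = Route("generalist_worker", "large-model", "large-model",
                      "medium", "deliberative", True)


def classify_intent(text: str) -> str:
    """Keyword stub standing in for a cheap classifier model."""
    lowered = text.lower()
    if "status" in lowered:
        return "status_lookup"
    if "refund" in lowered:
        return "refund_request"
    return "unknown"


def route(text: str) -> Route:
    return ROUTES.get(classify_intent(text), DEFAULT_ROUTE)


lookup = route("What is the status of order 12?")
refund = route("I want a refund for my last invoice")
other = route("hello there")
```

Defaulting unknown intents to the deliberative, approval-required route trades some latency for safety when the classifier is unsure.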
Quick Engineering Heuristic
- Ambiguous or multi-step with external dependencies → planner-executor
- Broad domain with different tool surfaces → supervisor-worker
- High volume, low cost requirements → router-worker
- High cost of wrong action → add critic gates
LangGraph Production Pipelines
LangGraph represents a shift from prompt-centric orchestration to explicit execution graphs. Most production failures occur at boundaries between steps, not inside reasoning.
Why Graphs Beat Linear Chains
Graphs Externalize Control Flow
- Explicit recovery paths: Failures branch to recovery nodes instead of blind retries
- Composable governance: Policy checks, budget enforcement as nodes
- Analyzable structure: Worst-case paths, latency, and cost can be reasoned about from the graph itself
Nodes, Edges, and Channels
graph:
  nodes:
    plan:
      type: reasoning
    execute:
      type: tool_call
    validate:
      type: critic
    recover:
      type: recovery
  edges:
    - plan -> execute
    - execute -> validate
    - validate -> plan
    - validate -> recover
  channels:
    - plan_state
    - execution_results
    - validation_signals
Key Advantage: Channels make state explicit, enabling versioning, validation, and replay
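The node, edge, and channel layout above can be illustrated without any framework. The following is a plain-Python sketch, not LangGraph's API: node functions read and write named channels in a shared state dict and return the name of the next node, and the runner rejects any transition that is not a declared edge. The terminal transition (`None`) and the stubbed node logic are assumptions added so the example terminates.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

State = Dict[str, dict]
Node = Callable[[State], Optional[str]]


def plan(state: State) -> Optional[str]:
    attempt = state.get("plan_state", {}).get("attempt", 0) + 1
    state["plan_state"] = {"steps": ["fetch", "apply"], "attempt": attempt}
    return "execute"


def execute(state: State) -> Optional[str]:
    # Stub: the first attempt fails, the second succeeds.
    state["execution_results"] = {"ok": state["plan_state"]["attempt"] >= 2}
    return "validate"


def validate(state: State) -> Optional[str]:
    ok = state["execution_results"]["ok"]
    state["validation_signals"] = {"passed": ok}
    if ok:
        return None  # assumption: None marks the end of the run
    return "plan" if state["plan_state"]["attempt"] < 2 else "recover"


def recover(state: State) -> Optional[str]:
    state["validation_signals"]["escalated"] = True
    return None


NODES: Dict[str, Node] = {"plan": plan, "execute": execute,
                          "validate": validate, "recover": recover}
EDGES: Dict[str, Set[Optional[str]]] = {
    "plan": {"execute"},
    "execute": {"validate"},
    "validate": {"plan", "recover", None},
    "recover": {None},
}


def run(start: str, max_steps: int = 20) -> Tuple[State, List[str]]:
    state: State = {}
    path: List[str] = []
    current: Optional[str] = start
    while current is not None:
        if len(path) >= max_steps:
            raise RuntimeError("step budget exceeded")
        path.append(current)
        nxt = NODES[current](state)
        if nxt not in EDGES[current]:
            raise ValueError(f"undeclared edge {current} -> {nxt}")
        current = nxt
    return state, path


final_state, path = run("plan")
```

Because every transition is checked against `EDGES`, the set of possible paths is known ahead of time, and the channel contents after each node form a natural unit for versioning and replay.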
Determinism, Replay, and Debuggability
Graph-based orchestration enables deterministic replay, provided model and tool outputs are recorded rather than re-invoked. When state transitions and tool calls are explicit, failed executions can be reconstructed step by step.
- Persist graph state and channel contents to durable backends (DynamoDB, PostgreSQL)
- Enable pause, resume, and replay across infrastructure restarts
- Reconstruct exact reasoning and execution paths for debugging
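Replay stays deterministic only if nondeterministic calls are served from a recorded log instead of being executed again. The sketch below shows that idea with a hypothetical `ToolRecorder`; the log is serialized with JSON to stand in for a durable backend, and the ticket id tool is a stub.

```python
import itertools
import json
from typing import Any, Callable, Dict, List, Optional


class ToolRecorder:
    """Records tool calls when live; serves them from the log when replaying."""

    def __init__(self, tools: Dict[str, Callable[..., Any]],
                 log: Optional[List[dict]] = None) -> None:
        self.tools = tools
        self.replaying = log is not None
        self.log: List[dict] = list(log) if log is not None else []
        self._cursor = 0

    def call(self, name: str, **inputs: Any) -> Any:
        if self.replaying:
            entry = self.log[self._cursor]
            self._cursor += 1
            # A mismatch means the code path changed since the recording.
            if entry["tool"] != name or entry["inputs"] != inputs:
                raise RuntimeError(f"replay diverged at call {self._cursor}")
            return entry["output"]
        output = self.tools[name](**inputs)
        self.log.append({"tool": name, "inputs": inputs, "output": output})
        return output


def workflow(rec: ToolRecorder) -> List[str]:
    return [rec.call("next_ticket_id", prefix="INC"),
            rec.call("next_ticket_id", prefix="INC")]


# Live run: the stub tool returns a different value on every call.
counter = itertools.count(1)
live = ToolRecorder({"next_ticket_id": lambda prefix: f"{prefix}-{next(counter)}"})
first = workflow(live)

# Replay: no tools are available, so every result must come from the log.
saved = json.dumps(live.log)
replayed = workflow(ToolRecorder({}, log=json.loads(saved)))
```

The divergence check is what turns a replay into a debugging aid: if the workflow code changes its call sequence, the replay fails loudly instead of silently producing a different path.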
Integrating ARC+ Patterns
ARC+ execution patterns map naturally onto graph structures:
- Failure classification → validation node
- Recovery ladders → explicit recovery branches
- Budget enforcement → gating node that halts or reroutes
Case: Incident Remediation
An infrastructure agent uses LangGraph with explicit branches for transient and deterministic failures. Transient failures route to retry with exponential backoff. Deterministic failures route to replanning. Policy violations route to human approval. This design eliminated the retry storms seen in the earlier chain-based implementation.
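The routing in this case can be sketched as a failure classifier plus a branch selector. Several details are assumptions made for illustration: failures are classified from HTTP-style status codes, exhausted transient retries fall through to replanning, and the backoff has no jitter so the delays are easy to read. `classify_failure` and `next_node` are hypothetical names.

```python
from typing import Tuple

# Assumed mapping from HTTP-style status codes to failure classes.
TRANSIENT_CODES = {429, 502, 503, 504}
POLICY_CODES = {401, 403}


def classify_failure(status_code: int) -> str:
    if status_code in TRANSIENT_CODES:
        return "transient"
    if status_code in POLICY_CODES:
        return "policy_violation"
    return "deterministic"


def next_node(failure_class: str, retries_used: int,
              max_retries: int = 3, base_delay_s: float = 0.5) -> Tuple[str, float]:
    """Return (branch, delay in seconds) for a classified failure."""
    if failure_class == "transient":
        if retries_used < max_retries:
            # Exponential backoff: 0.5s, 1s, 2s, ... (no jitter in this sketch)
            return "retry", base_delay_s * (2 ** retries_used)
        return "replan", 0.0  # assumption: a bounded retry budget, then replan
    if failure_class == "policy_violation":
        return "human_approval", 0.0
    return "replan", 0.0


first_retry = next_node(classify_failure(503), retries_used=0)
third_retry = next_node(classify_failure(503), retries_used=2)
exhausted = next_node(classify_failure(503), retries_used=3)
```

Bounding retries and routing deterministic failures away from the retry branch is what prevents a retry storm: a request that can never succeed is retried zero times.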
When LangGraph Is the Right Choice
- Short, single-step tasks: Chains or reactive execution sufficient
- Multi-step with retries, branching, governance: Graph-based orchestration essential
Rule of thumb: If you care about resumability, auditability, or bounded recovery, you want a graph.
Persistent and Durable Agent Architectures
Durability is the point at which agent systems stop behaving like demos and start behaving like software.
Why Durability Changes the Architecture
Stateless Agents
Fail silently when a process crashes or a deployment rolls. The system forgets what it was doing.
Durable Agents
Fail recoverably. Resume with context, intent, and prior decisions intact.
Three Non-Negotiable Requirements
- Explicit state representation: Progress captured as structured state, not implied by prompt history
- Persistent storage: State survives process restarts and infrastructure failures
- Resumable execution: System can continue from last checkpoint without data loss
Durable Execution Patterns
Checkpointed State Machines
Execution progress saved at key transition points. Enables pause, resume, and rollback.
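A checkpointed state machine can be illustrated with the standard library alone. In the sketch below, SQLite stands in for a durable backend such as PostgreSQL or DynamoDB, the step names are placeholders, and the crash is simulated with an exception; `open_store`, `save_checkpoint`, `load_latest`, and `advance` are hypothetical helpers.

```python
import json
import sqlite3
from typing import Optional, Tuple

STEPS = ["start", "collect", "apply", "finished"]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT, seq INTEGER, state TEXT, PRIMARY KEY (run_id, seq))"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, run_id: str,
                    seq: int, state: dict) -> None:
    conn.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                 (run_id, seq, json.dumps(state)))
    conn.commit()


def load_latest(conn: sqlite3.Connection, run_id: str) -> Tuple[int, dict]:
    row = conn.execute(
        "SELECT seq, state FROM checkpoints WHERE run_id = ? "
        "ORDER BY seq DESC LIMIT 1", (run_id,)
    ).fetchone()
    if row is None:
        return 0, {"step": "start", "done": []}
    return row[0], json.loads(row[1])


def advance(conn: sqlite3.Connection, run_id: str,
            crash_after: Optional[int] = None) -> dict:
    """Run from the latest checkpoint, saving after every transition."""
    seq, state = load_latest(conn, run_id)
    while state["step"] != "finished":
        if crash_after is not None and seq >= crash_after:
            raise RuntimeError("simulated crash")
        state["done"].append(state["step"])
        state["step"] = STEPS[STEPS.index(state["step"]) + 1]
        seq += 1
        save_checkpoint(conn, run_id, seq, state)
    return state


conn = open_store()
try:
    advance(conn, "run-1", crash_after=1)
except RuntimeError:
    pass  # the process "dies" here; the checkpoint survives in the store
final = advance(conn, "run-1")
```

The second call resumes from the saved checkpoint, so the step completed before the crash is not executed again.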
Event Sourcing
All state changes captured as immutable events. Complete audit trail and time-travel debugging.
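Event sourcing derives the current state by folding over an immutable event log, and time-travel debugging is the same fold stopped early. The sketch below assumes three hypothetical event kinds; `Event`, `apply_event`, and `replay` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    seq: int
    kind: str
    data: dict


def apply_event(state: dict, event: Event) -> dict:
    """Pure transition: returns a new state and never mutates the old one."""
    new = dict(state)
    if event.kind == "plan_created":
        new["plan"] = list(event.data["steps"])
        new["completed"] = []
    elif event.kind == "step_completed":
        new["completed"] = new.get("completed", []) + [event.data["step"]]
    elif event.kind == "approval_granted":
        new["approved_by"] = event.data["approver"]
    return new


def replay(events: List[Event], upto: Optional[int] = None) -> dict:
    """Rebuild state from the log; `upto` stops after that sequence number."""
    state: dict = {}
    for event in events:
        if upto is not None and event.seq > upto:
            break
        state = apply_event(state, event)
    return state


log = [
    Event(1, "plan_created", {"steps": ["fetch", "apply"]}),
    Event(2, "step_completed", {"step": "fetch"}),
    Event(3, "approval_granted", {"approver": "oncall"}),
    Event(4, "step_completed", {"step": "apply"}),
]

state_at_2 = replay(log, upto=2)  # the state as it was before approval
current = replay(log)             # the state after all events
```

Since events are never edited, the log doubles as the audit trail: the approval and the step it unblocked appear in order, each with its own sequence number.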
Distributed Workflow Orchestration
Workflow engines (AWS Step Functions, Temporal) handle durability, retries, and compensation.
Production Durability Requirements
- State persisted to durable storage (PostgreSQL, DynamoDB, managed workflow stores)
- Execution traces with correlation IDs for distributed tracing
- Replay capability from any checkpoint
- Graceful degradation when external dependencies fail
- Human approval workflows with asynchronous continuations
Why Agent Architectures In Action Matter
- Production Readiness: Move from demos to real systems with failure handling
- Control Flow Mastery: Explicit state transitions and recovery paths
- Observable Behavior: Deterministic replay and debugging capabilities
- Governance Integration: Policy enforcement at architectural boundaries
- Operational Resilience: Durability, resumability, and graceful degradation
Agent architectures in action wrap probabilistic reasoning in deterministic, observable, and resilient control flow, yielding production systems that balance autonomy with oversight.