AI Agent Patterns — The 9-Tool Toolbox
8 patterns + 1 overlay. Pick the smallest scaffold that fills your actual gap.
TL;DR
AI agent patterns are not a ladder you climb. They are a toolbox you reach into when you have a specific problem.
There are 8 patterns (ReAct · Reflexion · Plan-Execute · Supervisor · Sequential Crew · Hierarchical Crew · Multi-agent Swarm · Debate) and 1 overlay (HITL — Human-in-the-Loop). That’s the full 2026 set.
Each pattern compensates for one specific agent weakness. Pick the smallest scaffold that fills your actual gap.
The cost spread is wild: ~25× between the cheap end (ReAct, ~$0.05-0.20 per task) and the most expensive pattern (Debate, ~$1-5 per task).
The #1 production cost mistake teams make: over-graduation. Adding a heavier pattern than the actual gap requires. Prerequisites being met ≠ pattern being justified.
HITL is NOT a peer pattern. It’s an overlay — wraps any of the other 8 patterns at irreversible-action checkpoints.
For most production teams in 2026, the right starting architecture is Supervisor + HITL with ReAct workers. Graduate to Swarm or Debate only on evidence of a specific problem the simpler pattern can’t solve.
Quick reference — the 9 patterns
| # | Pattern | Weakness it compensates for | Cost per task (rough) |
|---|---------|-----------------------------|-----------------------|
| 1 | ReAct (Reasoning + Acting) | Agent doesn’t know all answers; needs tools | $0.05 – $0.20 |
| 2 | Reflexion (Actor + Evaluator + Self-Reflection memory) | Agent makes correctable mistakes when an evaluator can score them | $0.15 – $0.60 |
| 3 | Plan-and-Execute (planner upfront + cheap executor) | Agent is short-sighted on long-horizon tasks | $0.05 – $0.10 |
| 4 | Supervisor (orchestrator + workers) | Task decomposes into specialist domains; routing needs reasoning | $0.20 – $0.50 |
| 5 | Sequential Crew (fixed pipeline) | Workflow is known and linear; no routing decisions | $0.10 – $0.30 |
| 6 | Hierarchical Crew (manager + workers + review loops) | Output quality > throughput; each stage needs validation | $0.30 – $1.00 |
| 7 | Multi-agent Swarm (peer-to-peer handoffs) | Supervisor latency dominates; specialists are sharply separated | $0.15 – $0.40 |
| 8 | Debate / Adversarial (multi-agent argument + vote) | Single-agent answer carries unacceptable risk | $1.00 – $5.00 |
| 9 | HITL overlay (durable human-approval gate) | Some actions are irreversible / high-blast-radius | Pattern cost + human time |
Decision checklist
Before adding any pattern to your agent design:
What is the agent NOT able to do alone? Name the specific weakness in one sentence.
Is that weakness actually present in production today, or hypothetical? “We might want X someday” is not a justification.
What’s the smallest pattern that fills the gap? Default to less. Graduate only on evidence.
Is the agent making decisions that are irreversible? If yes — add HITL on those checkpoints. Not on every step. Only at the blast-radius boundaries.
Have I confused prerequisites with justification? Prerequisites met (we have observability, we have specialists) ≠ pattern justified (we have evidence the simpler pattern is breaking).
If you can’t answer all five, you’re not ready to pick a pattern.
The detailed walkthrough — each pattern with a concrete example
1. ReAct — Reasoning + Acting
What it is. Single agent in a loop. Each iteration is one LLM call that produces a Thought, an Action (tool call), and an Observation (tool result). The agent decides what to do next based on the observation.
Concrete example. User asks: “What’s the weather in Doha right now?”
THOUGHT: I need real-time weather. I'll call the weather tool.
ACTION: get_weather(city="Doha")
OBSERVATION: { "temp_c": 38, "condition": "sunny" }
THOUGHT: I have the data. I can answer.
RESPONSE: "It's 38°C and sunny in Doha right now."
One iteration, one tool call, done. If the question requires chained lookups (“Find the CEO of Hugging Face on LinkedIn”), the loop runs multiple iterations, and each observation shapes the next reason+act step.
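The loop above can be sketched in a few lines. This is a minimal illustration, not a framework implementation: `scripted_policy` is a hypothetical stand-in for the LLM call, and `get_weather` returns canned data rather than calling a real weather API.

```python
from typing import Callable

def get_weather(city: str) -> dict:
    """Hypothetical tool: returns canned data for the sketch."""
    return {"temp_c": 38, "condition": "sunny"}

TOOLS = {"get_weather": get_weather}

def scripted_policy(question: str, observations: list) -> dict:
    """Stand-in for the LLM: picks the next step from what has been observed."""
    if not observations:
        return {"thought": "I need real-time weather.",
                "action": "get_weather", "args": {"city": "Doha"}}
    obs = observations[-1]
    return {"thought": "I have the data. I can answer.",
            "response": f"It's {obs['temp_c']}°C and {obs['condition']} in Doha right now."}

def react(question: str, policy: Callable = scripted_policy, max_steps: int = 6):
    """Run Thought -> Action -> Observation until the policy answers.

    Returns (answer, number_of_policy_calls).
    """
    observations = []
    for calls in range(1, max_steps + 1):
        decision = policy(question, observations)
        if "response" in decision:
            return decision["response"], calls
        tool = TOOLS[decision["action"]]
        observations.append(tool(**decision["args"]))
    raise RuntimeError("step budget exhausted without an answer")

answer, llm_calls = react("What's the weather in Doha right now?")
print(answer, llm_calls)
```

The `max_steps` cap is the part worth keeping in any real version: it bounds cost when the policy never converges on an answer.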
Cost shape. Each iteration is one full LLM call. 6-step ReAct on GPT-4o ≈ $0.15. The number of calls grows linearly with chain length, but cumulative input tokens grow roughly quadratically, because each call re-sends every previous thought and observation in the prompt.
Pick ReAct when. You’re prototyping. The task path can’t be predetermined. Tool inventory is bounded. Cost is tolerable for the iteration count you expect.
Skip ReAct when. Task is known and structured (Plan-Execute is cheaper). Errors compound and you have an evaluator (Reflexion adds value). Volume is high and the chain length is long (cost gets ugly).
2. Reflexion — Verbal reinforcement
What it is. Three roles work together: an Actor (the agent itself, running ReAct), an Evaluator (scores the output), and a Self-Reflection writer (writes a verbal note about what went wrong). The note is stored in memory, and the next attempt reads that memory first.
Concrete example. User asks: “Find the LinkedIn profile of the CEO of Hugging Face.”
Attempt 1:
Actor: searches "Hugging Face CEO" → tool returns "Yann LeCun" (wrong)
Actor: looks up Yann LeCun's LinkedIn, returns answer
Evaluator: ❌ WRONG. Yann LeCun is at Meta FAIR. The CEO is Clem Delangue.
Reflection: "I trusted the first search without cross-checking.
Next time, verify a CEO/founder claim with a second source."
Memory: [saves the reflection]
Attempt 2:
Actor: reads memory note → cross-checks
searches "Hugging Face CEO" AND "founders of Hugging Face"
Both return "Clem Delangue"
Actor: returns the correct LinkedIn profile
Evaluator: ✅ PASS
Cost shape. 2-3× ReAct (Actor + Evaluator + Reflection are separate LLM calls).
Pick Reflexion when. You have a credible Evaluator (test results, ground-truth data, judge LLM). Errors are correctable rather than catastrophic. The agent will see the same problem class repeatedly.
Skip Reflexion when. No evaluation signal exists. Single-shot quality is fine. Cost-sensitive.
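The attempt, evaluate, reflect, retry cycle from the example can be sketched as follows. All three roles are hypothetical stubs standing in for LLM calls; the point is the control flow and the memory hand-off, not the model behaviour.

```python
def actor(question: str, memory: list) -> str:
    """Stand-in for the ReAct actor: cross-checks only if a reflection says so."""
    if any("second source" in note for note in memory):
        return "Clem Delangue"   # cross-checked answer
    return "Yann LeCun"          # deliberately wrong first search hit

def evaluator(answer: str) -> bool:
    """Stand-in for a credible evaluator (ground truth, tests, or a judge LLM)."""
    return answer == "Clem Delangue"

def reflect(answer: str) -> str:
    """Stand-in for the self-reflection writer."""
    return "Trusted the first search result. Next time, verify with a second source."

def reflexion(question: str, max_attempts: int = 3):
    """Retry until the evaluator passes; returns (answer, attempts, memory)."""
    memory = []
    answer = ""
    for attempt in range(1, max_attempts + 1):
        answer = actor(question, memory)
        if evaluator(answer):
            return answer, attempt, memory
        memory.append(reflect(answer))
    return answer, max_attempts, memory

answer, attempts, memory = reflexion("Who is the CEO of Hugging Face?")
print(answer, attempts, memory)
```

Note that the loop is only as good as `evaluator`: if it returns noise, the reflections stored in `memory` are noise too.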
3. Plan-and-Execute — Upfront planning
What it is. A Planner (strong LLM) produces a multi-step plan upfront. An Executor (cheaper LLM, deterministic code, or a small ReAct agent) carries out each step in order. The reasoning happens once; the execution is cheap.
Concrete example. User asks: “Write me a 5-section research report on the 2026 vector database landscape.”
PLANNER (GPT-4o):
Step 1: List the top 10 vector DBs by 2026 adoption
Step 2: For each, get key differentiators (5 fields)
Step 3: Group into categories (managed / OSS / embedded / hybrid)
Step 4: Write the 5 sections (intro, categories, comparison, picks, conclusion)
Step 5: Format as markdown report
EXECUTOR (Haiku, runs each step):
Step 1: ... → output
Step 2: ... → output
... [each step is cheap]
FINAL: assembled report
Cost shape. One expensive Planner call + N cheap Executor calls. Often cheaper than pure ReAct on the same task.
Pick Plan-Execute when. Task structure is reliably knowable. Inference cost matters. The plan is auditable before execution begins (good HITL insertion point).
Skip Plan-Execute when. The world changes mid-execution and the executor can’t adapt. Tasks are genuinely exploratory. Without a replanner, the pattern is brittle.
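A minimal sketch of the split between one planning call and many cheap execution calls, with an approval hook where the plan can be audited before anything runs. `planner`, `executor`, and `approve` are hypothetical stand-ins; there is no replanner here, so this version shares the brittleness noted above.

```python
def planner(task: str) -> list:
    """Stand-in for the single expensive planning call."""
    return [
        "list top vector DBs",
        "collect differentiators",
        "group into categories",
        "write sections",
        "format as markdown",
    ]

def executor(step: str, previous_outputs: list) -> str:
    """Stand-in for one cheap execution call."""
    return f"done: {step}"

def plan_and_execute(task: str, approve=lambda plan: True) -> dict:
    """Plan once, optionally gate on approval, then execute each step in order."""
    plan = planner(task)
    outputs = []
    if approve(plan):                      # auditable before execution begins
        for step in plan:
            outputs.append(executor(step, outputs))
    return {"plan": plan, "outputs": outputs,
            "planner_calls": 1, "executor_calls": len(outputs)}

result = plan_and_execute("Report on the 2026 vector database landscape")
rejected = plan_and_execute("Same task", approve=lambda plan: False)
print(result["executor_calls"], rejected["executor_calls"])
```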
4. Supervisor — Orchestrator + specialists
What it is. An orchestrator agent decides which specialist agent runs next. Specialists do the actual work. The orchestrator routes based on the input and the running state.
Concrete example. Airline customer service.
                 ┌──────────────────────┐
                 │      SUPERVISOR      │
                 │    (Claude Opus)     │
                 └──────────┬───────────┘
                            │ routes to:
         ┌──────────────────┼──────────────────┐
         ▼                  ▼                  ▼
    ┌──────────┐       ┌──────────┐       ┌──────────┐
    │ BOOKING  │       │ BAGGAGE  │       │  REFUND  │
    │  AGENT   │       │  AGENT   │       │  AGENT   │
    │ (Haiku)  │       │ (Haiku)  │       │ (Haiku)  │
    └──────────┘       └──────────┘       └──────────┘
User asks: “My flight QR123 was delayed 6 hours and my bag is missing.”
Supervisor sees TWO sub-tasks (flight info + bag tracking) and routes
→ Baggage Agent: track bag → observation: in Doha, delivery tomorrow
→ Booking Agent: explain delay + compensation policy
→ Refund Agent: prepare EU 261 compensation offer
Supervisor synthesizes the final answer
Cost trick (production secret). Strong model (Opus / GPT-4o) for the supervisor (routing needs reasoning); cheap model (Haiku / GPT-4o-mini) for workers (execution doesn’t). Cost typically drops 60-70% with no impact on routing accuracy.
Pick Supervisor when. Production default in 2026. Task decomposes into 3-7 specialist roles. Routing depends on the input. Debuggability matters.
Skip Supervisor when. Workflow order is fixed (Sequential Crew is cheaper). Latency is the bottleneck AND you have observability (Swarm).
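The routing step can be sketched as a function that maps a message to an ordered list of workers, followed by a synthesis step. The keyword rules below are a hypothetical stand-in for the strong-model routing call, and the worker lambdas stand in for cheap-model specialists.

```python
# Hypothetical specialists; each would be a cheap-model agent in practice.
WORKERS = {
    "baggage": lambda msg: "bag located; delivery tomorrow",
    "booking": lambda msg: "delay explained; compensation policy attached",
    "refund": lambda msg: "compensation offer prepared",
}

def supervisor_route(message: str) -> list:
    """Stand-in for the supervisor's routing decision (keyword rules here)."""
    text = message.lower()
    routes = []
    if "bag" in text:
        routes.append("baggage")
    if "delay" in text:
        routes += ["booking", "refund"]
    return routes

def handle(message: str):
    """Route, run each selected worker, and synthesize one answer."""
    routes = supervisor_route(message)
    results = [WORKERS[name](message) for name in routes]
    return routes, "; ".join(results)

routes, summary = handle("My flight QR123 was delayed 6 hours and my bag is missing.")
print(routes)
print(summary)
```

Because every decision passes through `supervisor_route`, there is a single place to log and debug routing, which is the debuggability argument made above.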
5. Sequential Crew — Predetermined pipeline
What it is. Agents arranged in a fixed linear sequence. Agent A → Agent B → Agent C. Each agent’s output is the next agent’s input. No routing decisions.
Concrete example. Content production pipeline.
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│RESEARCH ├──►│ WRITE ├──►│ EDIT ├──►│ PUBLISH ├──► OUTPUT
└─────────┘   └─────────┘   └─────────┘   └─────────┘
For “Write a blog post on 2026 inference engines”: Research agent gathers facts → Writer drafts → Editor polishes → Publisher formats and posts. No supervisor needed. The order is the design.
Cost shape. Cheapest multi-agent pattern. One LLM call per stage. No routing overhead.
Pick Sequential Crew when. Workflow order is known and fixed. Predictability matters. Throughput beats flexibility. Examples: research → write → edit; extract → validate → save; classify → enrich → commit.
Skip Sequential Crew when. Order depends on intermediate results. Conditional branching is needed. Backtracking is common.
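A fixed pipeline is just function composition: each stage takes the previous stage's output. The four stage functions below are hypothetical stubs standing in for one LLM call each.

```python
def research(topic: str) -> dict:
    """Stage 1: gather facts (canned here)."""
    return {"topic": topic, "facts": ["fact A", "fact B"]}

def write(notes: dict) -> str:
    """Stage 2: turn research notes into a draft."""
    return f"Draft on {notes['topic']}: " + ", ".join(notes["facts"])

def edit(draft: str) -> str:
    """Stage 3: polish the draft."""
    return draft.replace("Draft", "Post")

def publish(post: str) -> str:
    """Stage 4: format for publication."""
    return f"# {post}"

PIPELINE = [research, write, edit, publish]

def run_pipeline(stages: list, initial):
    """Feed each stage's output into the next; no routing decisions."""
    result = initial
    for stage in stages:
        result = stage(result)
    return result

output = run_pipeline(PIPELINE, "2026 inference engines")
print(output)
```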
6. Hierarchical Crew — Manager with review loops
What it is. A manager agent coordinates subordinates and reviews their output. The manager can send work back for revision before accepting it. Validation loops are built in.
Concrete example. Compliance-audited document drafting.
              ┌──────────────────────┐
              │    MANAGER AGENT     │
              │  - assigns work      │
              │  - reviews drafts    │
              │  - sends back if     │
              │    not compliant     │
              └──────────┬───────────┘
                         │
          ┌──────────────┼──────────────┐
          ▼              ▼              ▼
      ┌────────┐     ┌────────┐     ┌────────┐
      │RESEARCH│     │ LEGAL  │     │ WRITER │
      └────────┘     └────────┘     └────────┘
For “Draft an updated privacy policy”:
- Manager → Researcher → returns findings → Manager reviews: “include UAE PDPL too” → Researcher revises
- Manager → Writer → drafts policy → Manager reviews: “Section 4 says we share data; we don’t. Revise.”
- Manager → Legal → verifies → flags 2 issues → Manager sends back to Writer
- Cycle until Manager is satisfied
Cost shape. 2-3× Supervisor (validation rounds add up).
Pick Hierarchical Crew when. Output quality > throughput. Each stage benefits from explicit review. Compliance-relevant work where mistakes are expensive.
Skip Hierarchical Crew when. Supervisor would suffice. Throughput matters. Risk of “micromanager anti-pattern” (manager re-does subordinate work, bloating cost without improving output).
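The review loop can be sketched as a manager that either accepts a draft or returns feedback, with a hard cap on rounds. `writer` and `manager_review` are hypothetical stubs; the `max_rounds` cap is one assumed way to keep the micromanager anti-pattern from running up cost.

```python
from typing import Optional

def writer(brief: str, feedback: list) -> str:
    """Stand-in for a subordinate agent that revises based on feedback."""
    if "remove data-sharing claim" in feedback:
        return "We do not share data with third parties."
    return "We share data with partners."

def manager_review(draft: str) -> Optional[str]:
    """Stand-in for the manager's review: feedback string, or None to accept."""
    if "share data with partners" in draft:
        return "remove data-sharing claim"
    return None

def run_with_review(brief: str, max_rounds: int = 3):
    """Draft, review, and revise until accepted or the round budget runs out."""
    feedback = []
    for round_no in range(1, max_rounds + 1):
        draft = writer(brief, feedback)
        note = manager_review(draft)
        if note is None:
            return draft, round_no
        feedback.append(note)
    raise RuntimeError("review budget exhausted without an accepted draft")

draft, rounds = run_with_review("Draft an updated privacy policy")
print(draft, rounds)
```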
7. Multi-agent Swarm — Peer-to-peer handoffs
What it is. Agents hand off directly to each other. No central orchestrator. Each agent knows when to transfer control to a peer. LangGraph implements this with Command(goto=...) from handoff tools.
Concrete example. Same airline customer service, but as a Swarm.
   USER
     │
     ▼
┌─────────┐        ┌─────────┐         ┌─────────┐
│ BOOKING ├───────►│ BAGGAGE ├────────►│ REFUND  │
│  AGENT  │◄───────┤  AGENT  │◄────────┤  AGENT  │
└─────────┘        └─────────┘         └─────────┘
     ▲                                      │
     └──────────────────────────────────────┘
             (peer-to-peer handoffs)
User starts with Booking → mentions missing bag → Booking transfers control: transfer_to(baggage_agent, context="bag missing") → Baggage takes over → user asks about refund → Baggage transfers to Refund → etc.
Cost shape. Faster than Supervisor (no orchestrator hop). Fewer LLM calls.
Pick Swarm when. Conversational handoff feels natural (customer-service style). High traffic where supervisor latency is the bottleneck. Sharp specialization boundaries. AND you have production-grade observability (LangSmith / Arize) — without it, debugging peer handoffs is functionally impossible.
Skip Swarm when. You don’t have the observability. Routing accuracy matters more than latency. Specialization boundaries are fuzzy. Supervisor is the production default — graduate to Swarm only on evidence the simpler pattern is breaking.
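A peer handoff can be sketched as each agent returning either a reply or the name of the peer that should take over. This is a plain-Python illustration of the control flow, not LangGraph's handoff API; the agent functions and keyword rules are hypothetical.

```python
def booking_agent(msg: str) -> dict:
    if "bag" in msg.lower():
        return {"handoff": "baggage", "context": "bag missing"}
    return {"reply": "Booking: here is your itinerary."}

def baggage_agent(msg: str) -> dict:
    if "refund" in msg.lower():
        return {"handoff": "refund", "context": "refund requested"}
    return {"reply": "Baggage: your bag is being traced."}

def refund_agent(msg: str) -> dict:
    return {"reply": "Refund: your claim has been opened."}

AGENTS = {"booking": booking_agent, "baggage": baggage_agent, "refund": refund_agent}

def swarm_turn(active: str, msg: str, max_hops: int = 3):
    """Let the active agent answer or hand off; returns (agent, reply, path)."""
    path = [active]
    for _ in range(max_hops):
        result = AGENTS[active](msg)
        if "reply" in result:
            return active, result["reply"], path
        active = result["handoff"]
        path.append(active)
    raise RuntimeError("handoff loop exceeded max_hops")

agent1, reply1, path1 = swarm_turn("booking", "My bag is missing")
agent2, reply2, path2 = swarm_turn(agent1, "Can I get a refund?")
print(path1, path2)
```

The returned `path` is the minimum trace worth recording per turn; without it there is no central place that shows why a conversation ended up with a given agent.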
8. Debate / Adversarial — Multiple agents argue + vote
What it is. Multiple agents are given the same question and reason independently. A selector or vote picks the winning answer. Often multi-round: agents see each other’s positions and refine. AG2’s GroupChat is the canonical implementation.
Concrete example. AML transaction flagging.
"Is transaction T-9472 suspicious for AML?"
                      │
     ┌────────────────┼────────────────┐
     ▼                ▼                ▼
┌──────────┐     ┌──────────┐     ┌──────────┐
│ AGENT A  │     │ AGENT B  │     │ AGENT C  │
│  "yes"   │     │  "yes"   │     │   "no"   │
│ pattern  │     │ velocity │     │  legit   │
│  match   │     │  spike   │     │  vendor  │
└────┬─────┘     └────┬─────┘     └────┬─────┘
     │                │                │
     └────────────────┼────────────────┘
                      ▼
               ┌──────────────┐
               │   SELECTOR   │ ← weighted majority + reasoning
               │  vote: YES   │
               └──────┬───────┘
                      ▼
               FLAG FOR REVIEW
Multiple perspectives reduce single-agent over-confidence. The disagreement IS the signal: if all three agents agree, the call is robust; if two of three say flag, as here, a human reviews; if the vote splits with no clear majority, escalate further.
Cost shape. The most expensive pattern by a wide margin. 4 agents × 5 rounds = 20+ LLM calls minimum. Often 50+ with multi-round refinement.
Pick Debate when. High-stakes single decisions where being wrong is much more expensive than 20-50 LLM calls. Examples: AML flagging, legal contract analysis, medical differential diagnosis, audit-grade financial review, regulatory compliance gates.
Skip Debate when. Real-time UX matters. Cost-sensitive. Routine queries. The pattern is overkill for 95% of agent workloads. Don’t use it because it “sounds robust.”
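The selector step can be sketched as a weighted majority vote whose margin decides what happens next. The weights and the three outcome labels are assumptions chosen to mirror the rule described above; the debate rounds themselves are not modelled.

```python
from collections import defaultdict

def weighted_vote(votes: list):
    """votes is a list of (answer, weight); returns (winner, share of total weight)."""
    totals = defaultdict(float)
    for answer, weight in votes:
        totals[answer] += weight
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())

def decide(votes: list):
    """Map the vote margin to an action: accept, human_review, or escalate."""
    winner, share = weighted_vote(votes)
    if share == 1.0:
        outcome = "accept"          # unanimous: treat the call as robust
    elif share > 0.5:
        outcome = "human_review"    # clear majority with dissent
    else:
        outcome = "escalate"        # no clear majority
    return winner, outcome

example = decide([("yes", 1.0), ("yes", 1.0), ("no", 1.0)])
print(example)
```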
9. HITL overlay — Human-in-the-Loop
The critical thing about HITL: it’s NOT a peer pattern. It’s an OVERLAY. Any of the other 8 patterns can be wrapped with HITL at irreversible-action checkpoints.
What it is. The agent pauses at a defined checkpoint, persists its current state to a checkpointer (database), and waits for a human response. When the human approves (or modifies, or rejects), the agent resumes from the saved state. LangGraph’s interrupt() function is the canonical 2026 implementation.
Concrete example. Customer support agent + HITL on irreversible actions.
USER: "Refund my last 6 months of charges, then close my account."
AGENT (ReAct):
THOUGHT: Refund + account close are irreversible. Pause for approval.
ACTION: interrupt({
          "reason": "refund $1,247 + close account",
          "options": ["approve", "modify", "reject"]
        })
LANGGRAPH: pauses graph execution
persists state to checkpointer
returns control to calling application
[time passes — could be 5 min or 10 hours]
HUMAN: reviews the proposal, clicks "approve"
AGENT: resumes from saved state
refund processed; account close scheduled
response sent
Anti-rule. Do NOT interrupt on every step. Interrupt only on irreversible, high-blast-radius actions. Over-gating kills agent value. Examples of when to gate: financial disbursements above a threshold, legal agreements, modifications to production infrastructure, communications sent on behalf of executives. Examples of when NOT to gate: every tool call, every search, every routing decision.
Cost shape. Pattern cost + human time. State is durable, so the human can take 10 minutes or 10 hours without losing intermediate work.
Pick HITL when. Any action is irreversible (data deletion, money movement, public communication, legal commitment). Compliance demands an explicit human sign-off. The cost of being wrong materially outweighs the cost of delay.
Skip HITL when. Action is reversible. No regulator or stakeholder demands the gate. The cost of human time exceeds the cost of an occasional autonomous wrong decision.
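The pause, persist, resume mechanics can be illustrated without any framework. The sketch below writes state to a JSON file as a stand-in checkpointer; it is not LangGraph's API, the function names are hypothetical, and the "modify" option is left out for brevity.

```python
import json
import os
import tempfile

def run_until_gate(request: str, store_path: str) -> dict:
    """Prepare the irreversible action, persist state, and stop before executing."""
    state = {
        "request": request,
        "proposal": "refund $1,247 + close account",
        "status": "awaiting_approval",
    }
    with open(store_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    return state

def resume(store_path: str, decision: str) -> dict:
    """Reload the saved state and apply the human decision."""
    with open(store_path, encoding="utf-8") as f:
        state = json.load(f)
    if state["status"] != "awaiting_approval":
        raise ValueError("nothing to resume")
    if decision == "approve":
        state["status"] = "executed"
    elif decision == "reject":
        state["status"] = "cancelled"
    else:
        raise ValueError(f"unsupported decision: {decision}")
    with open(store_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    return state

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.json")
    paused = run_until_gate("Refund my charges, then close my account.", path)
    # Any amount of time can pass here; the state lives on disk, not in memory.
    resumed = resume(path, "approve")

print(paused["status"], resumed["status"])
```

Because `resume` reads everything it needs from the store, the process that paused and the process that resumes do not have to be the same one, which is what makes a multi-hour approval window workable.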
Cost reality — the ~25× spread
| Pattern | Typical cost per task |
|---------|-----------------------|
| ReAct (6 iterations) | $0.05 – $0.20 |
| Plan-Execute | $0.05 – $0.10 |
| Sequential Crew (5 stages) | $0.10 – $0.30 |
| Swarm (4 agents) | $0.15 – $0.40 |
| Reflexion (3 attempts) | $0.15 – $0.60 |
| Supervisor (4 workers, cost-optimized) | $0.20 – $0.50 |
| Hierarchical Crew | $0.30 – $1.00 |
| Debate (4 agents × 5 rounds) | $1.00 – $5.00 |
| HITL overlay | Pattern cost + human time |
The cost insight: the cheapest and most expensive patterns differ by ~25×. Picking a heavier pattern than your gap requires is not a small mistake. It compounds with every task your agent runs.
A team running 10,000 tasks/day on Debate when ReAct would suffice spends ~$10,000-$50,000/day, versus roughly $500-$2,000/day on ReAct; nearly all of the difference is unnecessary cost. The same team on Supervisor with a cost-optimized model split (Opus supervisor + Haiku workers) runs at $2,000-$5,000/day. The pattern choice is a fundamental cost lever.
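The daily figures follow directly from the per-task ranges in the table. A small helper makes the arithmetic explicit; the per-task ranges are the article's rough estimates, not measured prices.

```python
# Rough per-task cost ranges (low, high) in USD, taken from the table above.
COST_PER_TASK = {
    "ReAct": (0.05, 0.20),
    "Supervisor": (0.20, 0.50),
    "Debate": (1.00, 5.00),
}

def daily_cost(pattern: str, tasks_per_day: int):
    """Return the (low, high) daily cost in USD for a pattern at a given volume."""
    low, high = COST_PER_TASK[pattern]
    return round(low * tasks_per_day, 2), round(high * tasks_per_day, 2)

for name in COST_PER_TASK:
    print(name, daily_cost(name, 10_000))
```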
Banking and sovereign-AI overlay
For Gulf banks, healthcare in regulated jurisdictions, and sovereign-AI customers building agentic systems in 2026:
| Constraint | Implication | Default architecture |
|------------|-------------|----------------------|
| Regulatory irreversibility (transactions, account changes) | HITL mandatory at state-mutation boundaries | HITL overlay on whatever underlying pattern runs |
| Multi-channel customer service (Arabic + English) | Channel-specific specialists; supervisor routes | Supervisor with channel-specific workers |
| Compliance review on every output | Each stage needs a checker | Hierarchical Crew OR Supervisor + dedicated compliance-reviewer agent |
| Cost ceiling at scale | Per-transaction cost must stay under threshold | Supervisor (Opus + Haiku split) > Debate (too expensive) |
| Banking PII | Agents must not exfiltrate PII even in inter-agent handoffs | Supervisor (centralized state easier to PII-scrub) > Swarm (distributed state harder to audit) |
Director-level recommendation for a Gulf bank’s first agentic system in 2026: Supervisor + HITL overlay. Anthropic Claude Sonnet for the supervisor (Arabic + reasoning). Haiku for workers. Arize AX BYOC for trace + compliance audit. Swarm only after 6-12 months of supervisor-pattern operational evidence. Debate only on offline legal / regulatory Q&A — never in customer-facing real-time paths.
The over-graduation anti-pattern — the #1 production cost mistake
The single most expensive mistake teams make with agent patterns is over-graduation — adding a heavier pattern than the actual gap requires.
The mechanism: a team learns about Swarm, or Debate, or Hierarchical Crew. The pattern sounds sophisticated. The conditions to use it appear to be met (we have LangSmith → Swarm prerequisites satisfied; we have multiple specialists → Hierarchical Crew prerequisites satisfied). The team graduates.
Prerequisites being met is NOT the same as the pattern being justified.
The right test for graduating: “What problem does the simpler pattern fail to solve that the heavier pattern solves?” If you cannot name a specific, present, observed problem the simpler pattern is failing on — stay simple.
Real examples of over-graduation:
“We have LangSmith, so Swarm is fine” — NO. Latency must be the actual bottleneck. If supervisor latency is acceptable, you don’t need Swarm. You just have its prerequisite checked off.
“We use Claude Opus for everything, so we get Debate-quality answers” — NO. You are paying for it on every task, even the routine ones. The model is not the pattern.
“We added Reflexion because it sounds robust” — NO. Reflexion needs a credible Evaluator. Without one, the Self-Reflection step reflects on noise.
The discipline: stay with the simplest pattern that fills the actual gap. Graduate only on evidence. Engineers love graduating to “more sophisticated” patterns. Production teams love staying with the simpler one until they have evidence to move.
Leadership questions
Before adding any pattern to your design:
What can the agent NOT do alone today? Name the gap in one specific sentence. If you cannot, you do not have a gap; you have an aspiration.
Is the smallest pattern that fills the gap also the simplest one your team can debug at 2am? Pattern selection is not only about capability. It is about operability under stress.
What is the cost of being wrong on this task type, and how does that compare to paying roughly 20-25× more per task to run the heavier pattern on every task? If the math does not favor the heavier pattern, stay simple.
If you cannot answer all three, you are not ready to graduate up the stack.
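The third question is an expected-cost comparison and can be written down directly. Every number in the usage lines is a hypothetical input (error rates, cost of an error, per-task costs); the sketch only shows the shape of the calculation.

```python
def expected_cost(cost_per_task: float, error_rate: float, cost_of_error: float) -> float:
    """Expected total cost per task: run cost plus probability-weighted error cost."""
    return cost_per_task + error_rate * cost_of_error

def heavier_pattern_pays_off(cost_of_error: float,
                             simple=(0.15, 0.05),
                             heavy=(3.00, 0.01)) -> bool:
    """Each tuple is (cost_per_task, error_rate); both tuples are assumed values."""
    return expected_cost(*heavy, cost_of_error) < expected_cost(*simple, cost_of_error)

routine = heavier_pattern_pays_off(cost_of_error=50)          # low-stakes task
high_stakes = heavier_pattern_pays_off(cost_of_error=10_000)  # high-stakes task
print(routine, high_stakes)
```

With these assumed inputs the heavier pattern only wins when an error is very expensive, which matches the advice to reserve Debate for high-stakes decisions.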
Why this matters now
The 2026 AI agent space is full of pattern hype. New frameworks announce new patterns weekly. Vendors pitch the most sophisticated thing they have. Comparison charts proliferate.
The framing in this piece — smallest sufficient scaffold per actual weakness, with HITL as overlay on irreversible-action checkpoints — is the operational counterweight. It cuts through the marketing layer. It gives Directors a way to ask the right questions in vendor due-diligence and team architecture reviews.
The toolbox is not a ladder. Do not climb it. Reach into it.
Most production teams in 2026 should be running ReAct (for simple tasks) or Supervisor (for multi-specialty tasks) with HITL on irreversible actions. That is it. The other 6 patterns are tools for specific gaps — not defaults to aspire to.
Smaller scaffold beats bigger scaffold when the gap is small. And most gaps are smaller than teams believe.
Innamul Hassan Abdul Azeez

