AI Agent Architecture Patterns
Why architects and developers are shifting from rigid scripts to adaptive agent systems

Over the last two years, I’ve watched teams replace brittle, hard-coded automations with agent systems that can plan, call tools, and recover from failures. The shift feels practical rather than hype-driven: we want systems that adapt to changing context rather than break when the first unexpected input arrives. If you’ve ever maintained a long chain of if-then rules that ballooned with every new edge case, you know why this matters.
In this post, I’ll walk through common AI agent architecture patterns with a focus on how they actually behave in production. We’ll look at when to use a simple chain, when to add tools, when memory is worth the complexity, and how to handle safety and cost. I’ll include Python examples based on real project work, and I’ll note where I’ve personally made mistakes and learned better approaches. The goal is to give you a mental model you can apply to your own systems, not just a glossary of terms.
Where agent systems fit today
Agent systems sit between traditional automation and full decision-making software. In practice, teams use them for workflows that require judgment under uncertainty: triaging customer support tickets, parsing semi-structured documents, or coordinating multi-step tasks across APIs. They are especially useful when inputs vary, when tool usage spans multiple systems, and when the “right answer” depends on context that’s hard to encode in advance.
Who typically builds them? Platform engineers, backend developers, and ML engineers. The difference from pure automation is the addition of a planning loop and the capability to call external tools. Compared to scheduled jobs or event-driven microservices, agents are more flexible but introduce new failure modes: unreliable tool calls, hallucinations, and unpredictable latency. Compared to fully custom decision trees, agents are easier to maintain when the domain changes frequently, but they demand guardrails.
The most common production setup today is a Python-based service wrapping an LLM call, with tool definitions grounded in OpenAPI specs or function signatures. We orchestrate with lightweight frameworks when needed, but many teams start with a small custom loop to understand cost and reliability before adopting heavier tooling.
Core architecture patterns
Let’s ground the patterns in practical behavior rather than abstract definitions. The pattern you choose depends on three factors: the complexity of the planning needed, the number and type of tools, and whether state must persist across interactions.
Single-turn chain
This is the simplest pattern: a prompt that processes input and produces output. It’s useful for deterministic transformations like summarizing, rewriting, or classifying short text. The risk is overusing it when you need tool calling or memory.
Example: a clean, minimal chain that classifies incoming tickets.
from openai import OpenAI
client = OpenAI()
def classify_ticket(text: str) -> str:
system = "You are a support classifier. Output one category: billing, technical, or other."
user = f"Ticket: {text}"
resp = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "system": system},
{"role": "user", "content": user},
],
temperature=0.0,
)
return resp.choices[0].message.content.strip()
# Usage
ticket = "Customer cannot access invoice PDF from account page."
print(classify_ticket(ticket))
Notes from real use: keep temperature low for classification, validate output against an enum, and log both the input and the model’s response for later review. This is not an “agent” yet, but many systems start here and only add complexity if the failure rate is non-trivial.
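Here is roughly what that enum check looks like in practice, as a sketch wrapping the classifier above (the fallback to "other" and the print-based logging are my own choices, not a prescribed approach):
VALID_CATEGORIES = {"billing", "technical", "other"}

def classify_ticket_safe(text: str) -> str:
    # Wrap the raw classifier and coerce anything unexpected to a safe default.
    raw = classify_ticket(text).strip().lower()
    if raw not in VALID_CATEGORIES:
        # Log the surprise for later review rather than letting it flow downstream.
        print(f"Unexpected category {raw!r} for input: {text[:80]}")
        return "other"
    return raw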
Tool-using agent (reactive)
Here the agent decides when to call one or more tools based on the input. This pattern is useful when tasks require lookups (e.g., fetch user data, query a database, search documentation). The simplest loop is: plan, act (tool call), observe (result), and repeat until the final answer is ready.
Example: a small agent that calls a weather API. The key is to return the function result into the conversation and ask the model to generate a final answer.
import json
import httpx
from openai import OpenAI
client = OpenAI()
def get_weather(city: str) -> str:
# Replace with a real API key and endpoint in production.
url = f"https://api.openweathermap.org/data/2.5/weather"
params = {"q": city, "appid": "YOUR_API_KEY", "units": "metric"}
try:
resp = httpx.get(url, params=params, timeout=10)
data = resp.json()
if resp.status_code != 200:
return f"Error: {data.get('message', 'unknown')}"
return json.dumps(data, indent=2)
except Exception as e:
return f"Error calling weather API: {e}"
def run_weather_agent(city: str) -> str:
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}
]
messages = [
{"role": "system", "content": "You answer weather questions. Use get_weather when you need current conditions."},
{"role": "user", "content": f"What's the weather in {city}?"}
]
# First turn: request tool call
resp = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
tool_choice="auto",
)
msg = resp.choices[0].message
if not msg.tool_calls:
# Model chose to answer directly
return msg.content or "No answer."
messages.append(msg) # Include assistant tool_calls in history
# Execute each tool call and append results
for tool_call in msg.tool_calls:
func_name = tool_call.function.name
args = json.loads(tool_call.function.arguments)
if func_name == "get_weather":
result = get_weather(args["city"])
else:
result = "Unknown tool"
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"name": func_name,
"content": result,
})
# Second turn: generate final answer
final = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
)
return final.choices[0].message.content
# Usage
print(run_weather_agent("San Francisco"))
In production, I’ve seen teams add retries and fallbacks here. If the API fails, the agent should be instructed to say it couldn’t fetch the data rather than invent numbers.
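A sketch of that pattern, wrapping the get_weather function above (the retry count, backoff, and failure wording are illustrative choices):
import time

def get_weather_with_retries(city: str, attempts: int = 3) -> str:
    # Retry transient failures with a short exponential backoff.
    for attempt in range(attempts):
        result = get_weather(city)
        if not result.startswith("Error"):
            return result
        time.sleep(2 ** attempt)
    # Returning an explicit failure string lets the model admit it could not
    # fetch current conditions instead of inventing numbers.
    return "Error: weather service unavailable; tell the user you could not fetch current data."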
Planning and decomposition
When a task is complex, the agent can plan first: break the request into steps, then execute them. This is useful for multi-stage tasks like generating a report from data across multiple systems.
Example: a simple planner that outputs a step list, then executes them sequentially.
from openai import OpenAI
client = OpenAI()
def plan_and_execute(request: str):
system = (
"You break a request into steps. Output each step on a new line as 'Step N: description'. "
"Steps should be actionable and small."
)
resp = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": request},
],
temperature=0.0,
)
content = resp.choices[0].message.content
steps = [line.strip() for line in content.splitlines() if line.strip().startswith("Step")]
# In a real system, parse and map each step to a tool or function.
# For demonstration, we just echo the steps and simulate execution.
execution_log = []
for step in steps:
execution_log.append(f"Executing: {step}")
# Replace with actual tool calls; here we pretend.
execution_log.append(f"Result: OK")
return {"plan": steps, "execution": execution_log}
# Usage
print(plan_and_execute("Create a summary of Q3 sales for Acme Corp and email it to ops@example.com"))
Planning adds latency and requires strict validation of step formats. In practice, I often constrain the plan to a fixed schema or generate a plan via a structured output endpoint when available.
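A sketch of the schema-constrained variant using the JSON-object response format and Pydantic (the Plan and PlanStep models are my own illustration, not a standard API):
from typing import List, Optional
from pydantic import BaseModel, ValidationError
from openai import OpenAI

client = OpenAI()

class PlanStep(BaseModel):
    number: int
    description: str

class Plan(BaseModel):
    steps: List[PlanStep]

def plan_structured(request: str) -> Optional[Plan]:
    system = (
        "Break the request into small, actionable steps. Return JSON with a 'steps' array; "
        "each step has 'number' and 'description'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": request},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    try:
        # Reject malformed plans up front instead of executing them blindly.
        return Plan.model_validate_json(resp.choices[0].message.content)
    except ValidationError:
        return None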
Memory and context management
Agents often need to remember things across turns: user preferences, conversation history, or intermediate results. Memory can be short-term (in-session) or long-term (persisted).
Short-term memory is typically the messages list. Long-term memory requires storage and retrieval. A common pattern is vector memory: store embeddings of past interactions and retrieve the top-k relevant items based on the current query.
Example: a simple vector memory using sentence embeddings. In production, you would use a managed vector database and embeddings service, but this illustrates the concept.
from typing import List, Dict
import numpy as np
from openai import OpenAI
client = OpenAI()
def embed(text: str) -> List[float]:
resp = client.embeddings.create(model="text-embedding-3-small", input=text)
return resp.data[0].embedding
def cosine_similarity(a: List[float], b: List[float]) -> float:
a = np.array(a)
b = np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
class Memory:
def __init__(self):
self.items: List[Dict] = [] # each dict: {"text": str, "embedding": List[float]}
def add(self, text: str):
emb = embed(text)
self.items.append({"text": text, "embedding": emb})
def retrieve(self, query: str, top_k: int = 3) -> List[str]:
q_emb = embed(query)
scores = [(self.items[i]["text"], cosine_similarity(q_emb, self.items[i]["embedding"]))
for i in range(len(self.items))]
scores.sort(key=lambda x: x[1], reverse=True)
return [text for text, _ in scores[:top_k]]
def run_with_memory(memory: Memory, query: str) -> str:
context = memory.retrieve(query)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
resp = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
)
return resp.choices[0].message.content
# Usage
mem = Memory()
mem.add("User prefers concise summaries under 100 words.")
mem.add("User works at Acme Corp and reports on quarterly sales.")
print(run_with_memory(mem, "Summarize Q3 sales for Acme."))
Note on tradeoffs: vector memory introduces new costs (embeddings, storage) and failure modes (retrieval quality). Start with session memory and only add long-term retrieval when you see a clear need.
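Session memory in its simplest form is just the messages list, trimmed so it does not grow without bound; a minimal sketch (the cap of eight messages is arbitrary):
from typing import Dict, List

def trim_history(messages: List[Dict], max_messages: int = 8) -> List[Dict]:
    # Keep system prompts plus the most recent turns; drop the middle.
    if len(messages) <= max_messages:
        return messages
    system_msgs = [m for m in messages if m.get("role") == "system"]
    keep = max(max_messages - len(system_msgs), 1)
    recent = [m for m in messages[-keep:] if m.get("role") != "system"]
    return system_msgs + recent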
Multi-agent coordination
Sometimes you want multiple specialized agents that hand off to each other. A common pattern is a supervisor that routes tasks to subagents, or a pipeline where each agent refines the output.
Example: a supervisor that routes to a “data” agent or “email” agent based on the request.
from openai import OpenAI
client = OpenAI()
def supervisor_router(request: str) -> str:
system = (
"You are a supervisor. If the request is about data retrieval or analysis, reply with ROUTE: data. "
"If the request is about sending emails, reply with ROUTE: email. Otherwise, reply with ROUTE: general."
)
resp = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": request},
],
temperature=0.0,
)
return resp.choices[0].message.content.strip()
def data_agent(request: str) -> str:
# Placeholder for a data agent that might query a database or file.
return f"Data agent: processed '{request}'"
def email_agent(request: str) -> str:
# Placeholder for an email agent that constructs and sends emails.
return f"Email agent: prepared email for '{request}'"
def route_request(request: str) -> str:
route = supervisor_router(request)
if "ROUTE: data" in route:
return data_agent(request)
elif "ROUTE: email" in route:
return email_agent(request)
else:
return "General agent: please clarify your request."
# Usage
print(route_request("Pull Q3 sales figures"))
print(route_request("Send a summary to ops@example.com"))
Multi-agent systems can become opaque. In production, I log every handoff and include an explanation field in the supervisor’s response to make audits easier.
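A sketch of that auditable variant of the supervisor: the route and a one-sentence explanation come back as JSON, and the handoff is logged before dispatching (the field names are my own):
import json
import logging
from openai import OpenAI

logger = logging.getLogger("agent.handoffs")
client = OpenAI()

def supervisor_router_json(request: str) -> dict:
    system = (
        "You are a supervisor. Reply with JSON containing 'route' "
        "(one of: data, email, general) and 'explanation' (one sentence)."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": request},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    decision = json.loads(resp.choices[0].message.content)
    # Record every handoff with the supervisor's stated reason for later audits.
    logger.info("handoff route=%s explanation=%s",
                decision.get("route"), decision.get("explanation"))
    return decision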
Practical examples and workflows
Structured extraction with validation
A frequent real-world need is extracting structured data from text and validating it. This is common in document processing and support workflows.
Example: extract a support ticket and validate fields.
import json
from typing import TypedDict, Optional
from openai import OpenAI
class Ticket(TypedDict):
customer_id: str
category: str
priority: str
summary: str
client = OpenAI()
def extract_ticket(text: str) -> Optional[Ticket]:
system = (
"You extract a support ticket JSON with fields: customer_id, category, priority, summary. "
"category must be billing, technical, or other. priority must be low, medium, or high."
)
resp = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": text},
],
response_format={"type": "json_object"},
temperature=0.0,
)
try:
data = json.loads(resp.choices[0].message.content)
except Exception:
return None
# Basic validation
if data.get("category") not in {"billing", "technical", "other"}:
return None
if data.get("priority") not in {"low", "medium", "high"}:
return None
return Ticket(customer_id=data.get("customer_id", ""), category=data["category"], priority=data["priority"], summary=data.get("summary", ""))
# Usage
ticket_text = "Customer 12345 reports missing invoice PDF. Priority medium."
print(extract_ticket(ticket_text))
In real projects, I often add a second validation pass with traditional rules or schema validation libraries. The LLM handles ambiguity well, but deterministic checks prevent silent errors.
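A sketch of that second pass with Pydantic, mirroring the constraints in the prompt above (the model and field names are my own choices):
from typing import Literal, Optional
from pydantic import BaseModel, ValidationError, field_validator

class TicketModel(BaseModel):
    customer_id: str
    category: Literal["billing", "technical", "other"]
    priority: Literal["low", "medium", "high"]
    summary: str

    @field_validator("customer_id")
    @classmethod
    def customer_id_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("customer_id must not be empty")
        return v

def validate_ticket(data: dict) -> Optional[TicketModel]:
    try:
        return TicketModel.model_validate(data)
    except ValidationError:
        # A failed deterministic check sends the extraction to review, not downstream.
        return None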
Tool orchestration across APIs
When agents call multiple tools, you need predictable behavior: retries, timeouts, and backoff. I like to write thin wrappers around each tool to isolate failures.
Example: a tool wrapper with retries.
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def call_api(url: str, params: dict, timeout: float = 10.0) -> dict:
resp = httpx.get(url, params=params, timeout=timeout)
resp.raise_for_status()
return resp.json()
Here, “tenacity” is a helpful library for retry patterns. In production, I’ve paired this with structured logging so we know which tool failed and why.
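A minimal sketch of that pairing with the standard logging module, wrapping the call_api function above (the field names in the log lines are illustrative):
import logging
import time

logger = logging.getLogger("agent.tools")

def call_api_logged(name: str, url: str, params: dict) -> dict:
    # Wrap call_api from above so every tool call records its outcome and timing.
    start = time.monotonic()
    try:
        result = call_api(url, params)
        logger.info("tool=%s status=ok duration_ms=%.0f",
                    name, (time.monotonic() - start) * 1000)
        return result
    except Exception as exc:
        logger.error("tool=%s status=error duration_ms=%.0f error=%s",
                     name, (time.monotonic() - start) * 1000, exc)
        raise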
Honest evaluation: strengths, weaknesses, and tradeoffs
Strengths:
- Flexibility: agents adapt to varied inputs and can coordinate multiple tools.
- Maintainability: planning and tool abstraction can centralize logic rather than scattering it across many scripts.
- Human-in-the-loop: you can design checkpoints where the agent asks for clarification.
Weaknesses:
- Unpredictability: LLMs can be inconsistent; outputs require validation.
- Latency: planning loops and tool calls add overhead.
- Cost: more tokens and more API calls increase cost, especially with long contexts.
- Opacity: multi-agent systems can be hard to debug without thorough logging.
When to use agents:
- When tasks are semi-structured and require judgment (e.g., classifying, routing, extracting).
- When you need to coordinate multiple tools or systems with a natural language interface.
- When the domain changes frequently and maintaining hard-coded rules is costly.
When not to use agents:
- When you need deterministic, low-latency responses (e.g., real-time bidding).
- When the toolset is simple and can be handled by traditional automation.
- When you cannot afford the cost or compliance risks of external LLM calls (consider on-prem or smaller models).
Tradeoffs to consider:
- Tool granularity: smaller, focused tools are easier to test and reuse.
- Context window: keep prompts lean to manage cost and latency; retrieve context as needed.
- Safety: always validate tool outputs and agent responses; keep a human review step for critical actions.
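On the last point, a concrete way to keep a human review step is to gate side effects behind an approval queue; a minimal sketch (the action names and in-memory queue are hypothetical placeholders):
PENDING_REVIEW: list = []
REQUIRES_APPROVAL = {"send_email", "write_database"}

def execute_action(action: str, payload: dict) -> str:
    # Critical actions go to a review queue instead of running immediately.
    if action in REQUIRES_APPROVAL:
        PENDING_REVIEW.append({"action": action, "payload": payload})
        return "queued for human approval"
    # Non-critical actions can run directly (dispatch logic omitted here).
    return f"executed {action}"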
Personal experience: learning curves and common mistakes
The first agent I shipped was a simple support triager. I made three mistakes:
- I didn’t validate the classification output. The model sometimes invented a fourth category, which broke downstream processing. Adding an enum check fixed this.
- I treated the first version as “done.” As volume grew, costs spiked because I included too much context in every call. I switched to retrieval of relevant tickets instead of always sending full history.
- I assumed tool calls were reliable. One API started returning 500s at peak times, and my agent retried too aggressively. Adding backoff and circuit breakers saved the day.
What surprised me most was how often agent behavior improved when I rewrote prompts to be explicit about output formats and failure handling. Small constraints, like “return JSON with keys a, b, c,” gave big gains in reliability. Another lesson was to invest early in logging: capture inputs, outputs, tool responses, and timings. That data made it possible to improve the system with confidence.
Getting started: setup and mental models
You don’t need a heavy framework to start. A simple Python service with an LLM client and a small set of tools is enough to learn the core loop.
Project structure:
agentsvc/
├─ app/
│ ├─ __init__.py
│ ├─ main.py # Entry point (FastAPI or Flask)
│ ├─ agent.py # Core agent loop
│ ├─ tools.py # Tool wrappers
│ ├─ memory.py # Short-term and vector memory
│ └─ schema.py # Pydantic models for validation
├─ tests/
│ ├─ test_agent.py
│ └─ test_tools.py
├─ config/
│ ├─ settings.py # API keys, model names
│ └─ prompts/ # Prompt templates
├─ requirements.txt
└─ README.md
Mental model:
- The agent is a loop: observe, decide, act, and evaluate.
- Tools are functions with clear inputs and outputs; they must be safe to call repeatedly.
- Memory should be retrieved, not dumped; prioritize relevance over volume.
- Validate all outputs before side effects (sending emails, writing DBs).
- Log everything you need to debug and audit.
Example skeleton of an agent loop in a service:
# app/agent.py
from typing import Any, Dict, List
from openai import OpenAI
from .tools import TOOL_REGISTRY
from .memory import Memory
client = OpenAI()
class Agent:
def __init__(self, memory: Memory):
self.memory = memory
def run(self, user_input: str) -> Dict[str, Any]:
# Retrieve context
context = self.memory.retrieve(user_input)
messages = [
{"role": "system", "content": "You are a helpful assistant with access to tools."},
{"role": "user", "content": user_input}
]
if context:
messages.insert(1, {"role": "system", "content": "Context:\n" + "\n".join(context)})
# Plan or act
resp = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=[t.schema for t in TOOL_REGISTRY.values()],
tool_choice="auto",
)
msg = resp.choices[0].message
if not msg.tool_calls:
return {"answer": msg.content}
messages.append(msg)
results = []
for tool_call in msg.tool_calls:
tool = TOOL_REGISTRY.get(tool_call.function.name)
if not tool:
results.append({"tool": tool_call.function.name, "error": "not found"})
continue
try:
output = tool.call(tool_call.function.arguments)
results.append({"tool": tool_call.function.name, "output": output})
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"name": tool_call.function.name,
"content": output,
})
except Exception as e:
results.append({"tool": tool_call.function.name, "error": str(e)})
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"name": tool_call.function.name,
"content": f"Error: {str(e)}",
})
# Final answer
final = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=[t.schema for t in TOOL_REGISTRY.values()],
)
return {"answer": final.choices[0].message.content, "results": results}
Configuration is best handled via environment variables or a simple settings module:
# config/settings.py
import os
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MODEL = os.getenv("MODEL", "gpt-4o")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
Tool registry example:
# app/tools.py
import json
from typing import Callable, Dict, Any
class Tool:
def __init__(self, name: str, schema: Dict[str, Any], func: Callable[[str], str]):
self.name = name
self.schema = schema
self.func = func
def call(self, args_json: str) -> str:
args = json.loads(args_json)
return self.func(args)
def weather_tool_func(args: Dict[str, Any]) -> str:
# Placeholder; replace with real implementation.
city = args.get("city", "unknown")
return f"Weather data for {city}"
TOOL_REGISTRY: Dict[str, Tool] = {
"get_weather": Tool(
name="get_weather",
schema={
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
},
func=weather_tool_func,
)
}
What makes agent systems stand out
The standout feature is the abstraction layer between intent and action. Instead of writing separate scripts for every use case, you define tools and let the agent decide how to use them. This reduces code duplication and centralizes control.
From a developer experience perspective, agent systems are a mixed bag. The flexibility is empowering, but you trade certainty for adaptability. Maintainability improves when you:
- Keep tools small and well-tested.
- Use typed schemas for tool inputs and outputs.
- Validate and sanitize results before side effects.
- Document the decision loop and failure modes.
Ecosystem strengths:
- Python has mature tooling for HTTP clients, structured outputs, and embeddings.
- Managed vector databases simplify retrieval; many offer open-source versions you can self-host.
- Lightweight frameworks exist when you need them, but starting custom is often the best way to learn.
Real outcomes I’ve seen:
- Faster iteration: product teams can tweak prompts and tool descriptions without redeploying logic.
- Better user experience: agents ask clarifying questions instead of failing.
- Cost control: when retrieval and validation are in place, you avoid unnecessary LLM calls.
Free learning resources
- OpenAI API documentation: https://platform.openai.com/docs
  - Useful for understanding tool calling, structured outputs, and embedding models.
- LangChain conceptual guides: https://python.langchain.com/docs/concepts/
  - Helpful for agent patterns like chains, tools, and memory; use them as inspiration even if you implement custom loops.
- Hugging Face Transformers docs: https://huggingface.co/docs/transformers
  - For running smaller models on-prem or exploring embedding models beyond OpenAI.
- Tenacity documentation: https://tenacity.readthedocs.io/
  - Essential for building resilient tool calls with retries and backoff.
- Pydantic documentation: https://docs.pydantic.dev/
  - Best for validating tool outputs and enforcing schemas.
- OpenAPI specification: https://swagger.io/specification/
  - Many tools are derived from APIs; using OpenAPI ensures consistent schemas.
Summary: who should use agents, and who might skip them
Use agent architecture when your tasks are varied and require judgment, when you have multiple tools to orchestrate, and when your domain evolves quickly. Start simple: single-turn chains, then add tools and memory. Validate everything, log thoroughly, and monitor costs and latencies.
Skip agents if you need strict determinism, ultra-low latency, or if your problem is already solved well by a small, fixed set of scripts. Also skip if you cannot meet compliance or data privacy requirements with external LLMs; consider on-prem or smaller models instead.
The most valuable takeaway from building agent systems is to treat them like any other software: design interfaces, write tests, and build observability. The LLM is a component, not a magic wand. When you ground it with solid tools, clear prompts, and robust validation, it becomes a practical part of a reliable system.




