Agent Memory & Planning
Цей контент ще не доступний вашою мовою.
Agent Memory & Planning
Section titled “Agent Memory & Planning”AI/ML Engineering Track | Complexity:
[COMPLEX]| Time: 5-6 hoursReading Time: 8-9 hours
Prerequisites: Prompt engineering fundamentals, retrieval-augmented generation, basic Python, API error handling, and Module 1.5
Heureka Moment: An agent becomes reliable when memory, planning, tools, budgets, and verification are engineered as one controlled system.
Learning Outcomes
Section titled “Learning Outcomes”By the end of this module, you will be able to:
- Design a hybrid memory architecture that separates working context, durable facts, episodic history, and compressed summaries according to task risk and retrieval needs.
- Compare Plan-and-Execute, ReWOO, Tree of Thought, and reactive planning patterns using latency, token cost, failure recovery, and dependency structure as decision criteria.
- Trace a complex agent task from user request through memory retrieval, plan construction, tool execution, replanning, verification, and durable learning.
- Evaluate multi-agent coordination topologies and choose supervisor, swarm, hierarchical, or debate patterns for realistic production scenarios.
- Debug runaway agent behavior by applying execution budgets, reflection limits, memory hygiene, tool governance, and observability metrics.
Why This Module Matters
Section titled “Why This Module Matters”In February 2024, Air Canada was held legally liable for hallucinations generated by its customer service chatbot after the system incorrectly promised a bereavement discount that the company later refused to honor. The dollar amount was small compared with the cost of a major outage, but the lesson was larger than the award: a conversational system can create real obligations when users rely on it. A stateless chatbot can already cause harm when it gives a confident wrong answer, and an autonomous agent with memory, tools, and repeated execution loops can multiply that harm across databases, payment systems, support workflows, and infrastructure APIs.
A senior engineer should treat an agent as a distributed system, not as a clever prompt. Once the model can remember user facts, call tools, revise plans, and keep working after an error, it has state, side effects, control flow, and failure modes. Those are software engineering concerns, and they need the same seriousness you would bring to queues, retries, transactions, rate limits, audit logs, and incident response. The agent’s intelligence does not remove the need for architecture; it increases the cost of weak architecture.
This module teaches agent memory and planning from first principles, then connects those ideas to production-grade constraints. You will start with the simplest memory problem, add retrieval and summarization, compare planning algorithms, examine multi-agent topologies, and finish with a fully traced worked example. The goal is not to memorize pattern names. The goal is to make defensible design decisions when an agent must complete a complex task safely, cheaply, and observably.
Part 1: What Changes When a Model Becomes an Agent
Section titled “Part 1: What Changes When a Model Becomes an Agent”A plain chat model receives a prompt and returns a response, but an agent repeatedly chooses actions that affect the world. That difference sounds small until you inspect the control loop. A chatbot can be wrong once, while an agent can be wrong, store the wrong conclusion, retrieve it later, use it to choose a tool, call the tool, observe a failure, replan from the bad observation, and then persist the whole episode as a misleading precedent.
The core engineering question is therefore not “Can the model reason?” but “What state and authority does the system give to that reasoning?” Memory determines what the model sees, planning determines what sequence it attempts, tools determine what side effects it can create, and budgets determine when the system stops. A reliable agent is built by designing those boundaries explicitly instead of hoping the model will remain well behaved.
flowchart TD User[User request] --> Intake[Intent and risk intake] Intake --> Memory[Memory retrieval] Memory --> Planner[Planner] Planner --> Executor[Tool executor] Executor --> Observer[Observation parser] Observer --> Verifier[Verifier] Verifier -->|pass| Response[Final response] Verifier -->|recoverable issue| Replan[Localized replan] Replan --> Executor Verifier -->|unsafe or exhausted| Escalate[Human escalation] Response --> Learn[Memory write-back] Learn --> MemoryThe diagram is deliberately circular because production agents are loops. Every loop needs a stopping condition, and every side effect needs a policy boundary. If you do not specify those boundaries, you have still made a design decision; you have chosen unbounded behavior by default.
The beginner mental model is “memory helps the agent remember things.” The senior mental model is “memory is a state management layer with consistency, retention, retrieval, privacy, and staleness problems.” The beginner mental model is “planning helps the agent break down tasks.” The senior mental model is “planning is a control strategy whose cost and failure recovery depend on the dependency graph between tool calls.”
Before we build the parts, keep one principle in view: an agent should only receive the context and authority needed for the current task. More memory, more tools, more reflection, and more autonomy are not automatically better. They increase capability, but they also increase latency, token cost, attack surface, and the number of ways the system can fail.
Active learning prompt: Imagine a support agent that can issue refunds, look up orders, and remember customer preferences. Which single design mistake would be most dangerous: no long-term memory, no planning step, no tool permission policy, or no execution budget? Choose one before reading further, then compare your answer against the failure modes in later sections.
Part 2: Agent Memory Systems
Section titled “Part 2: Agent Memory Systems”An agent without memory behaves like a brilliant temporary worker who forgets every previous shift. It may answer the current question well, but it cannot carry forward a user’s preferences, previous failed attempts, open tasks, or lessons from past incidents. The tempting fix is to store everything, but that creates a different problem: the agent drowns in outdated facts, duplicate messages, and low-value chatter.
The right design separates memory by purpose. Recent dialogue belongs in working context because exact wording matters. Durable facts belong in long-term memory because the agent may need them tomorrow. Completed workflows belong in episodic memory because the sequence and outcome matter. Older conversation belongs in summaries because raw logs are too expensive to keep injecting into prompts.
flowchart TD subgraph Agent Memory Architecture direction TB ShortTerm[SHORT-TERM BUFFER\nrecent exact turns] Summary[SUMMARY MEMORY\ncompressed older context] LongTerm[LONG-TERM VECTOR MEMORY\ndurable facts and preferences] Episodic[EPISODIC MEMORY\ncompleted workflows and outcomes] Policy[MEMORY POLICY\nretention, privacy, importance] Builder[CONTEXT BUILDER\nrank, filter, format]
ShortTerm --> Builder Summary --> Builder LongTerm --> Builder Episodic --> Builder Policy --> ShortTerm Policy --> Summary Policy --> LongTerm Policy --> Episodic endShort-term memory is the rolling buffer of recent messages. It is the highest-fidelity memory because it preserves exact phrasing, tool results, and unresolved references like “that second option.” It is also the easiest memory to misuse because engineers often append messages until the context window is full. When the buffer grows without policy, relevant recent details compete with old filler, and the model may miss the thing it needs.
Long-term memory stores durable facts such as “the user prefers concise output,” “the team deploys with Argo CD,” or “the billing account was migrated last month.” In modern agents this is often implemented with embeddings and vector search, but vector similarity is not a database transaction log. A sentence about an old address can look semantically similar to a sentence about a new address, so metadata, recency, and conflict resolution matter.
Episodic memory stores complete experiences rather than isolated facts. An episode might capture “the agent investigated a failed rollout, found that a readiness probe path changed, patched the manifest, and verified recovery.” That memory is useful because future debugging tasks often resemble previous incidents structurally, even when the exact service name changes. Episodic memory helps the agent reuse a successful strategy without pretending that the old answer is automatically the new answer.
Summary memory compresses older conversation into a small representation. It is not a replacement for source-of-truth data, and it should not be treated as legally exact. A summary is useful for continuity, but it can omit details, overgeneralize, or freeze a conclusion that later became stale. Good agents mark summaries as summaries and retrieve original records when exactness matters.
| Memory Type | Best Use | Main Risk | Production Guardrail |
|---|---|---|---|
| Short-term buffer | Exact recent dialogue, unresolved references, immediate tool outputs | Token exhaustion and distraction from stale turns | Fixed window, role filtering, and task-aware trimming |
| Long-term vector memory | Durable facts, preferences, reusable knowledge, semantic recall | Stale or conflicting facts retrieved by similarity alone | Metadata filters, recency weighting, conflict resolution |
| Episodic memory | Past workflows, incident patterns, successful procedures, lessons learned | Treating old outcomes as guaranteed current truth | Store outcome, environment, date, and verification evidence |
| Summary memory | Continuity across long conversations without full logs | Compression loses edge cases and exact commitments | Link summaries to source turns and refresh after major decisions |
| Policy memory | User consent, retention rules, sensitivity classifications | Sensitive data retained or injected into prompts incorrectly | Privacy tags, TTLs, redaction, and access checks |
A reliable memory layer makes write decisions as carefully as read decisions. Many weak agents ask “what should I retrieve?” but ignore “what should I store?” That omission creates memory hoarding, where every greeting, repeated question, and temporary draft becomes searchable forever. A mature design evaluates importance, sensitivity, freshness, and conflict before writing.
flowchart LR Message[Incoming message or tool result] --> Classify{Classify} Classify -->|recent task state| Buffer[Short-term buffer] Classify -->|durable fact| Validate[Validate and normalize] Classify -->|completed workflow| Episode[Create episode] Classify -->|low value| Drop[Do not persist] Validate --> Conflict{Conflicts with existing fact?} Conflict -->|yes| Resolve[Resolve by timestamp, source, or user confirmation] Conflict -->|no| Store[Store with metadata] Resolve --> Store Store --> Index[Vector and metadata indexes] Episode --> IndexThe classification step is where many production systems succeed or fail. If the user says, “My temporary hotel address this week is 12 Pine Street,” the memory should not overwrite the user’s permanent address unless the system has a field model that distinguishes temporary and permanent addresses. If the user says, “Forget my previous phone number,” the system needs deletion or invalidation semantics, not another vector entry that says the previous phone number should be forgotten.
A useful retrieval pipeline ranks candidate memories by more than similarity. It can apply hard filters first, such as tenant, user, sensitivity, and data domain. It can then score by semantic relevance, recency, source reliability, and importance. Finally, it should format retrieved memory for the model with clear labels so the model can distinguish facts, summaries, and similar past episodes.
from __future__ import annotations
from dataclasses import dataclass, fieldfrom datetime import datetime, timezonefrom math import sqrtfrom typing import Iterable
@dataclassclass MemoryRecord: content: str embedding: list[float] kind: str user_id: str source: str importance: float created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc)) expires_at: datetime | None = None invalidated: bool = False
class SimpleEmbeddingModel: """Deterministic toy embedding model for local examples, not production use."""
def embed(self, text: str) -> list[float]: buckets = [0.0, 0.0, 0.0, 0.0] for index, char in enumerate(text.lower()): buckets[index % len(buckets)] += (ord(char) % 31) / 31 return buckets
class MemoryStore: """Small runnable memory store that combines metadata filtering and similarity scoring."""
def __init__(self, embedding_model: SimpleEmbeddingModel) -> None: self.embedding_model = embedding_model self.records: list[MemoryRecord] = []
def store(self, content: str, *, kind: str, user_id: str, source: str, importance: float) -> None: if importance < 0.3: return record = MemoryRecord( content=content, embedding=self.embedding_model.embed(content), kind=kind, user_id=user_id, source=source, importance=min(importance, 2.0), ) self.records.append(record)
def retrieve(self, query: str, *, user_id: str, allowed_kinds: Iterable[str], limit: int = 5) -> list[MemoryRecord]: now = datetime.now(timezone.utc) query_embedding = self.embedding_model.embed(query) allowed = set(allowed_kinds) scored: list[tuple[float, MemoryRecord]] = []
for record in self.records: if record.user_id != user_id or record.kind not in allowed or record.invalidated: continue if record.expires_at is not None and record.expires_at <= now: continue similarity = self._cosine_similarity(query_embedding, record.embedding) age_hours = max((now - record.created_at).total_seconds() / 3600, 0.0) recency = 1.0 / (1.0 + age_hours / 72.0) score = similarity * 0.65 + record.importance * 0.25 + recency * 0.10 scored.append((score, record))
scored.sort(key=lambda item: item[0], reverse=True) return [record for _, record in scored[:limit]]
@staticmethod def _cosine_similarity(left: list[float], right: list[float]) -> float: dot = sum(a * b for a, b in zip(left, right)) left_norm = sqrt(sum(a * a for a in left)) right_norm = sqrt(sum(b * b for b in right)) if left_norm == 0 or right_norm == 0: return 0.0 return dot / (left_norm * right_norm)
if __name__ == "__main__": store = MemoryStore(SimpleEmbeddingModel()) store.store( "User prefers release summaries with risk, rollback, and verification sections.", kind="preference", user_id="u-123", source="chat", importance=1.4, ) store.store( "User said hello during onboarding.", kind="conversation", user_id="u-123", source="chat", importance=0.1, ) matches = store.retrieve( "How should I format the release note?", user_id="u-123", allowed_kinds=["preference", "episode"], ) for match in matches: print(f"{match.kind}: {match.content}")The example is intentionally small, but it shows the mechanism that matters. The store refuses low-importance chatter, filters by user and kind before ranking, and mixes similarity with importance and recency. A production system would use a real embedding model and database, but the design pressure is the same: retrieval should be relevant, authorized, fresh enough, and labeled.
Active learning prompt: Suppose a user says, “I used to prefer Terraform, but our team has moved all new infrastructure work to Crossplane.” Which memory records should be created, updated, or invalidated? Write your answer in terms of facts, timestamps, and conflict handling rather than vague “remember this” behavior.
Memory also needs observability because failed recall often looks like model incompetence from the outside. A user asks, “What repository did we decide to use?” and the agent answers incorrectly. The root cause may be a missing write, an overly strict metadata filter, a weak query rewrite, a stale summary, a vector ranking issue, or a context builder that dropped the correct record after retrieval. Debugging requires logging each stage separately.
flowchart TD Query[User query] --> Rewrite[Query rewrite] Rewrite --> Filters[Metadata filters] Filters --> CandidateSearch[Candidate search] CandidateSearch --> Rank[Rank by similarity, importance, recency] Rank --> ContextBudget[Fit into context budget] ContextBudget --> Prompt[Prompt assembly] Prompt --> Model[Model response] Model --> Audit[Recall audit] Audit -->|wrong answer| StageLogs[Inspect stage-level logs]Senior teams often test memory with scenario fixtures rather than isolated unit tests. A fixture can create three conflicting addresses, one deletion request, one preference update, and one old episode, then verify that the context builder includes the current address and excludes the deleted one. This is how you catch the difference between “the vector store works” and “the agent remembers correctly under realistic conflict.”
Part 3: Planning Algorithms
Section titled “Part 3: Planning Algorithms”Planning turns an agent from a reactive responder into a task executor. The planner decides which steps are needed, which tools to use, which dependencies exist between steps, and when to stop. A planning algorithm is not just a prompt style; it is an execution strategy with cost, latency, and recovery behavior.
The simplest agent pattern is reactive: observe the current state, decide the next action, execute it, observe the result, and repeat. This is flexible, but it spends a model call at each turn and can drift when the task is long. More structured patterns create a plan first, execute known steps, and only replan when evidence changes. More deliberative patterns explore multiple possible solutions before committing to one.
flowchart LR Task[Task] --> Strategy{Choose planning strategy} Strategy -->|small uncertain task| ReAct[Reactive loop] Strategy -->|known multi-step workflow| PlanExecute[Plan-and-Execute] Strategy -->|predictable tool chain| ReWOO[ReWOO] Strategy -->|ambiguous high-stakes reasoning| ToT[Tree of Thought] Strategy -->|large specialized workload| MultiAgent[Multi-agent plan]The key decision is dependency structure. If each step depends heavily on the previous observation, a reactive loop or localized replanning may be necessary. If most tool calls are predictable from the start, upfront planning reduces repeated reasoning. If the task requires comparing several candidate strategies, Tree of Thought can improve answer quality at the cost of extra model calls. If the task requires different expertise areas, multi-agent coordination can reduce cognitive overload.
| Planning Pattern | Best Fit | Weak Fit | Cost Profile | Failure Recovery |
|---|---|---|---|---|
| Reactive loop | Uncertain tasks where each observation changes the next action | Predictable workflows with many repetitive tool calls | Many small model calls and growing context | Natural but can drift or loop |
| Plan-and-Execute | Workflows with clear ordered steps and moderate uncertainty | Tasks where early observations invalidate the whole plan | One planning call plus execution calls | Needs explicit replan hooks |
| ReWOO | Deterministic tool chains where evidence can be gathered upfront | Exploratory debugging where tool output changes strategy | Low model-call count and predictable execution | Weak unless planner anticipated branches |
| Tree of Thought | Hard reasoning, design trade-offs, and ambiguous decisions | Simple lookup or high-volume tasks | Expensive because branches are generated and scored | Strong for reasoning, not side effects |
| Multi-agent planning | Work requiring distinct specialist perspectives | Small tasks where coordination overhead dominates | More calls and orchestration complexity | Depends on supervisor or routing constraints |
Plan-and-Execute is usually the first production pattern to implement because it is understandable, testable, and easy to constrain. The planner produces structured steps, the executor runs each step, and the verifier checks whether the result satisfies the task. The mistake is treating the generated plan as sacred. A plan is a hypothesis about how to solve the task, and execution may reveal that the hypothesis needs local repair.
flowchart TD Task[Task] --> Planner[Planner creates structured plan] Planner --> Plan[Plan with steps, dependencies, tools] Plan --> Step1[Step 1] Step1 --> Verify1{Step verified?} Verify1 -->|yes| Step2[Step 2] Verify1 -->|no, recoverable| Replan1[Replan failed step] Replan1 --> Step1 Step2 --> Verify2{Step verified?} Verify2 -->|yes| Step3[Step 3] Verify2 -->|no, exhausted| Escalate[Escalate] Step3 --> FinalVerify[Final verification] FinalVerify --> Result[Result]A structured plan should include step identifiers, dependencies, tool names, typed inputs, expected outputs, retry policy, and verification criteria. Without those fields, the executor is forced to infer too much from natural language. That inference becomes brittle when the model changes wording, a tool returns an unexpected shape, or a later step needs evidence from an earlier one.
from __future__ import annotations
from dataclasses import dataclass, fieldfrom enum import Enumfrom typing import Callable
class StepStatus(str, Enum): PENDING = "pending" COMPLETE = "complete" FAILED = "failed"
@dataclassclass PlanStep: step_id: str description: str tool: str tool_input: dict[str, str] depends_on: list[str] = field(default_factory=list) expected_key: str = "result" retries: int = 1 status: StepStatus = StepStatus.PENDING result: dict[str, str] = field(default_factory=dict)
class PlanExecutor: """Minimal structured executor with dependency checks and bounded retries."""
def __init__(self, tools: dict[str, Callable[[dict[str, str]], dict[str, str]]]) -> None: self.tools = tools
def execute(self, steps: list[PlanStep]) -> dict[str, dict[str, str]]: completed: dict[str, dict[str, str]] = {} by_id = {step.step_id: step for step in steps}
for step in steps: missing = [dep for dep in step.depends_on if dep not in completed] if missing: step.status = StepStatus.FAILED step.result = {"error": f"Missing dependencies: {', '.join(missing)}"} raise RuntimeError(step.result["error"])
tool = self.tools.get(step.tool) if tool is None: step.status = StepStatus.FAILED step.result = {"error": f"Unknown tool: {step.tool}"} raise RuntimeError(step.result["error"])
last_error = "" for _ in range(step.retries + 1): try: step.result = tool(step.tool_input) if step.expected_key not in step.result: raise ValueError(f"Missing expected key: {step.expected_key}") step.status = StepStatus.COMPLETE completed[step.step_id] = step.result break except Exception as exc: last_error = str(exc)
if step.status != StepStatus.COMPLETE: step.result = {"error": last_error} by_id[step.step_id] = step raise RuntimeError(f"Step {step.step_id} failed: {last_error}")
return completedReWOO, or Reason Without Observation, separates the reasoning phase from the evidence-gathering phase. The planner creates tool calls with evidence variables, the executor fills those variables, and a final solver synthesizes the answer. This is powerful when the tool chain is predictable, such as “search documents, fetch metadata, summarize fields, compare values.” It is weaker when each observation may change the investigation strategy.
sequenceDiagram participant User participant Planner participant Executor participant SearchTool participant DatabaseTool participant Solver User->>Planner: Complex task Planner->>Executor: Plan with #E1 and #E2 Executor->>SearchTool: Execute tool call for #E1 SearchTool-->>Executor: Evidence #E1 Executor->>DatabaseTool: Execute tool call using #E1 DatabaseTool-->>Executor: Evidence #E2 Executor->>Solver: Task, plan, evidence map Solver-->>User: Final answer grounded in evidenceReWOO reduces repeated model calls because the model does not reason after every observation. That is valuable for high-volume extraction, predictable research, and workflows with stable tool contracts. The trade-off is that the first plan must be good enough. If the planner fails to include a needed branch, the executor may collect the wrong evidence efficiently, which is still wrong.
Tree of Thought explores multiple reasoning paths and scores them before synthesizing an answer. It is best for problems where choosing the approach is the hard part: architecture trade-offs, root cause hypotheses, migration strategies, and policy decisions. It is usually a poor fit for routine tool execution because branch exploration is expensive and can accidentally multiply side effects if not isolated.
graph TD Root[Problem] --> A[Hypothesis A\nlow migration risk] Root --> B[Hypothesis B\nbetter long-term design] Root --> C[Hypothesis C\nfastest rollback] A --> A1[Check hidden dependency\nscore 0.64] B --> B1[Validate operational cost\nscore 0.86] B --> B2[Validate security boundary\nscore 0.82] C --> C1[Check data loss risk\nscore 0.51] B1 --> Final[Selected reasoning path]The safest Tree of Thought implementations keep branch exploration in a sandbox until a final strategy is chosen. For example, a cloud remediation agent should not actually change production resources while exploring branches. It should simulate, inspect, and evaluate candidate plans first, then request approval or execute only the selected plan under policy.
The practical senior move is to combine patterns. A system might use a classifier to route simple requests to normal chat, predictable workflows to ReWOO, complex operational tasks to Plan-and-Execute with replanning, and high-stakes design decisions to Tree of Thought. The goal is not pattern purity. The goal is matching the control strategy to the task.
flowchart TD Request[Incoming request] --> Risk{Risk and complexity} Risk -->|simple answer| Chat[Stateless or RAG answer] Risk -->|predictable extraction| ReWOOFlow[ReWOO flow] Risk -->|operational workflow| PlanFlow[Plan-and-Execute with verification] Risk -->|architecture decision| ThoughtFlow[Tree of Thought sandbox] Risk -->|unsafe request| Human[Human review] ReWOOFlow --> Metrics[Record cost, latency, success] PlanFlow --> Metrics ThoughtFlow --> Metrics Chat --> MetricsActive learning prompt: Your team needs an agent to inspect failed CI runs, read logs, identify the likely cause, and open a draft pull request when the fix is obvious. Which planning pattern would you choose for log inspection, which pattern would you choose for making code changes, and where would you require human approval?
Part 4: Multi-Agent Architectures
Section titled “Part 4: Multi-Agent Architectures”A single agent can become overloaded when a task requires research, code generation, security review, product judgment, and user communication. Multi-agent systems split work across specialized roles, but specialization introduces coordination problems. You gain narrower prompts and clearer responsibilities, while paying for routing, synthesis, disagreement resolution, and loop prevention.
The supervisor pattern is the easiest multi-agent topology to govern. A central supervisor receives the task, decomposes it, delegates work to specialists, checks their outputs, and synthesizes the final result. It is less flexible than a free-form swarm, but it is easier to audit because one node owns global state and stopping conditions.
flowchart TD Sup[SUPERVISOR\nowns plan and budget] --> R[RESEARCHER\nfinds evidence] Sup --> C[CODER\nimplements change] Sup --> S[SECURITY REVIEWER\nchecks policy] Sup --> Q[QA REVIEWER\nverifies behavior] R --> Sup C --> Sup S --> Sup Q --> SupA supervisor should not blindly concatenate worker outputs. It should validate that each worker answered the assigned question, identify contradictions, request targeted revisions if necessary, and preserve evidence for audit. If the researcher says a library supports a feature and the security reviewer says the feature is unsafe, the supervisor needs an explicit arbitration rule instead of averaging the opinions.
Swarm architectures allow peer agents to hand off work dynamically. They can feel more natural for exploratory tasks because the agent closest to the problem can decide who should handle the next step. The danger is a delegation loop, where a QA agent returns work to a developer, the developer returns it to QA, and neither has authority to finish or escalate.
flowchart LR Researcher[Researcher] <--> Writer[Writer] Writer <--> Reviewer[Reviewer] Reviewer <--> Developer[Developer] Developer <--> Researcher Reviewer --> Escalation[Escalation policy] Developer --> EscalationA swarm needs routing history, handoff limits, role confidence thresholds, and escalation rules. Without those constraints, dynamic collaboration becomes uncontrolled recursion. A good swarm coordinator records which agents handled which task, what changed at each handoff, why the next handoff is justified, and whether the maximum handoff count has been reached.
Hierarchical teams are useful when work naturally decomposes by domain and subdomain. An executive agent can assign broad goals to lead agents, and each lead can manage worker agents. This pattern resembles an organization chart, which can be helpful for large research or migration tasks. It also adds latency and can obscure responsibility if every layer merely restates the task.
flowchart TD Exec[EXECUTIVE\nsets objective and budget] --> PlatformLead[PLATFORM LEAD\ninfra and deployment] Exec --> AppLead[APPLICATION LEAD\ncode and tests] Exec --> RiskLead[RISK LEAD\nsecurity and compliance] PlatformLead --> P1[Cluster worker] PlatformLead --> P2[Observability worker] AppLead --> A1[Backend worker] AppLead --> A2[Test worker] RiskLead --> R1[Policy worker] RiskLead --> R2[Audit worker]Debate architectures assign agents to argue competing positions, then ask a judge to evaluate the arguments. They can expose weak assumptions in design decisions, such as whether to use a vector store, a relational table, or both for memory. Debate is not magic truth discovery. If every participant shares the same blind spot or the judge lacks evidence, the final answer can still be wrong.
| Topology | Use When | Avoid When | Required Control |
|---|---|---|---|
| Supervisor | You need auditability, clear ownership, and bounded delegation | The task is tiny or purely conversational | Central budget, worker contracts, synthesis checks |
| Swarm | The right specialist is hard to know upfront and handoffs are natural | Infinite ping-pong would be costly or unsafe | Routing history, max handoffs, escalation policy |
| Hierarchical team | The task has multiple domains and many subtasks | Layers would only repeat instructions without adding expertise | Budget per tier and clear acceptance criteria |
| Debate | You need to evaluate competing strategies or surface assumptions | There is no evidence base for claims | Judge rubric, source requirements, and final decision owner |
| Single agent | The task fits one context and one skill set | The prompt becomes overloaded with conflicting roles | Strong tool policy and verification loop |
The right multi-agent design often starts with a single agent and one explicit reviewer. If that works, add specialists only where they reduce real failure. Adding five agents because the architecture looks sophisticated usually increases cost and makes debugging harder. Production systems reward boring clarity more often than clever choreography.
Part 5: Self-Correction, Tool Creation, and Execution Budgets
Section titled “Part 5: Self-Correction, Tool Creation, and Execution Budgets”Reflection lets an agent critique and improve its own output before finalizing it. This can catch missing steps, inconsistent formatting, weak reasoning, and obvious factual errors. It can also waste enormous cost when the agent keeps polishing a good-enough answer. Reflection must therefore have a budget, a rubric, and a stopping rule.
Self-correction is stronger when it uses external verification rather than pure self-opinion. A code agent should run tests. A data agent should validate schema constraints. A support agent should check policy documents. A Kubernetes remediation agent should inspect rollout status after a change. The model can propose a correction, but the environment should verify whether the correction worked.
flowchart TD Draft[Draft output or action plan] --> Verify[External verification] Verify -->|pass| Final[Finalize] Verify -->|minor issue| Reflect[Reflect with bounded rubric] Reflect --> Improve[Improve once] Improve --> Verify Verify -->|major issue| Replan[Replan or escalate] Verify -->|budget exhausted| Stop[Stop with partial result and evidence]Tool creation is the most dangerous pattern in this module. An agent that writes and executes new code can expand its own capabilities, but it can also create security vulnerabilities, duplicate existing tools, bypass governance, or overload its own tool registry. Production systems should treat dynamic tool creation as privileged behavior that requires sandboxing, review, resource limits, and often human approval.
The safer default is tool composition, not tool creation. If the agent needs to inspect a log, parse JSON, and compare fields, it should use approved tools for file reading, JSON parsing, and comparison. A new tool should be created only when existing tools cannot reasonably perform the task, the tool specification is narrow, the runtime is sandboxed, and the resulting code can be tested.
| Control | What It Prevents | Implementation Example |
|---|---|---|
| Iteration limit | Infinite planning, reflection, or handoff loops | max_iterations, max_reflections, max_handoffs |
| Time budget | Long-running loops and user-visible hangs | Wall-clock deadline around the entire task |
| Token budget | Cost explosions from repeated context expansion | Per-stage token ceilings and truncation policy |
| Tool allowlist | Unauthorized side effects | Role-scoped tools and explicit permissions |
| Idempotency key | Duplicate external actions during retries | Stable request IDs for payments, tickets, and changes |
| Verification gate | Confident but wrong final answers | Tests, policy checks, schema validation, rollout checks |
| Human escalation | Unsafe autonomous decisions | Approval for destructive or irreversible actions |
A Kubernetes-hosted agent should have both application-level and platform-level limits. At the application layer, the agent should count iterations, retries, tool calls, and tokens. At the platform layer, a Kubernetes v1.35+ batch/v1 Job can enforce finite execution, and container resource limits can prevent a runaway process from consuming unlimited CPU or memory. These controls serve different purposes, and neither replaces the other.
apiVersion: batch/v1kind: Jobmetadata: name: support-agent-taskspec: backoffLimit: 1 activeDeadlineSeconds: 900 template: spec: restartPolicy: Never containers: - name: agent image: example.com/support-agent:1.0.0 args: ["run-task", "--task-id=$(TASK_ID)"] env: - name: TASK_ID value: "task-123" - name: MAX_ITERATIONS value: "8" - name: MAX_TOOL_CALLS value: "20" resources: requests: cpu: "250m" memory: "512Mi" limits: cpu: "1" memory: "1Gi"The Job manifest is not a complete safety system, but it shows the production mindset. The agent has an external deadline, bounded retries, explicit environment configuration, and resource limits. The application still needs tool permissions and verification gates because Kubernetes cannot tell whether a model is about to send a bad refund or write a misleading incident summary.
Part 6: Worked Example: A Support Agent Debugs a Failed Deployment
Section titled “Part 6: Worked Example: A Support Agent Debugs a Failed Deployment”This worked example traces a complete complex task before you build a smaller version in the hands-on exercise. The scenario is a platform support agent that helps an application team investigate a failed Kubernetes rollout. The agent has read-only cluster inspection tools, a documentation search tool, a ticket update tool, hybrid memory, and permission to propose a patch. It does not have permission to apply production changes without human approval.
The user says: “The checkout service rollout is stuck after today’s release. Find the likely cause, explain the fix, and draft the ticket update.” This is not a simple Q&A task. The agent must recall team context, inspect live state, plan tool use, adapt to evidence, avoid unsafe writes, and produce an auditable answer. A stateless chat response would be cheap but unreliable because it would not see the current cluster state or previous team decisions.
sequenceDiagram participant User participant Agent participant Memory participant Planner participant ClusterTool participant DocsTool participant Verifier participant TicketTool User->>Agent: Checkout rollout is stuck Agent->>Memory: Retrieve service context and past incidents Memory-->>Agent: Team uses readiness probes and Argo CD; similar failure last month Agent->>Planner: Build bounded investigation plan Planner-->>Agent: Inspect rollout, pods, events, recent manifest diff, docs Agent->>ClusterTool: Get rollout status and pods ClusterTool-->>Agent: New pods failing readiness probe Agent->>ClusterTool: Get pod events and container logs ClusterTool-->>Agent: Probe returns HTTP 404 on /ready Agent->>DocsTool: Search service release notes DocsTool-->>Agent: Health endpoint changed to /healthz Agent->>Verifier: Check diagnosis against evidence Verifier-->>Agent: Diagnosis supported, production patch needs approval Agent->>TicketTool: Draft update only TicketTool-->>Agent: Draft ticket comment created Agent-->>User: Cause, proposed fix, evidence, and next stepThe first stage is intake and risk classification. The agent identifies that the task concerns production deployment health, so it selects an operational workflow rather than casual chat. Because the requested action includes “find” and “explain” but not “apply,” the agent keeps tool permissions read-only except for drafting a ticket comment. This early classification prevents the agent from silently patching production just because it found a likely fix.
The second stage is memory retrieval. The agent pulls a team preference stating that checkout deployments are managed through GitOps, not direct cluster edits. It retrieves a past episode where a readiness probe path changed and caused a rollout to stall. It also retrieves a policy memory stating that production changes require a human approval comment on the deployment ticket. These memories do not answer the task directly, but they shape safe execution.
The third stage is planning. The agent creates a Plan-and-Execute workflow because each inspection step depends on the previous evidence, but the overall investigation structure is predictable. It decides to inspect rollout status, list new pods, read events, read logs if necessary, compare the current manifest to the expected health endpoint, verify the diagnosis, draft a ticket update, and stop before applying a change. This plan is specific enough to execute and bounded enough to audit.
| Trace Step | Agent Decision | Evidence Used | Risk Control |
|---|---|---|---|
| Intake | Treat as production support task, not casual advice | User mentions stuck rollout and service name | Restrict to read-only inspection plus ticket draft |
| Memory retrieval | Load GitOps preference and similar past incident | Team memory and episodic memory | Label memory as context, not proof |
| Planning | Use Plan-and-Execute with verification | Task has ordered diagnostic steps | Set max tool calls and no direct patching |
| Cluster inspection | Check rollout, pods, events, and logs | Rollout status and failing readiness probe | Read-only tools prevent accidental mutation |
| Evidence synthesis | Compare /ready failure with release note endpoint change | Logs plus documentation search | Require two independent evidence sources |
| Verification | Confirm proposed fix matches observed failure | Probe path mismatch and current manifest | Stop before production change |
| Memory write-back | Store episode summary after final response | Outcome and ticket draft ID | Avoid storing raw logs with sensitive data |
The fourth stage is execution. The cluster tool reports that the Deployment has unavailable replicas and the new ReplicaSet’s pods are failing readiness checks. Events show repeated readiness probe failures. Logs do not show application crashes, which narrows the cause away from startup failure and toward health check configuration. The agent then searches internal release notes and finds that checkout changed its health endpoint from /ready to /healthz during the release.
The fifth stage is localized reasoning, not blind pattern matching. The past episode about a readiness probe path change made the agent inspect probes earlier, but it did not use that episode as the answer. The answer came from current evidence: rollout status, pod events, and release notes. This distinction matters because episodic memory should guide investigation, while live evidence should support operational conclusions.
The sixth stage is verification. The agent checks that the current Deployment manifest still points the readiness probe at /ready, while the new application version exposes /healthz. It also checks whether liveness is failing; it is not. That means the service is running but not becoming ready, which matches the observed rollout behavior. The verifier marks the diagnosis as supported but notes that applying a manifest change is outside the agent’s permission boundary.
The seventh stage is output. The agent drafts a ticket comment that states the likely cause, the evidence, the proposed manifest change, the risk, and the required human approval. It tells the user that it has not applied the change. That final sentence is not bureaucratic filler; it is part of the safety contract. Users need to know whether the system has only advised, drafted, or acted.
Ticket draft:The checkout rollout is stalled because new pods are failing readiness checks.Evidence: Deployment rollout reports unavailable replicas; pod events show readiness probe HTTP 404 on /ready; release notes for the new version state that the health endpoint moved to /healthz.Proposed fix: update the readinessProbe HTTP path from /ready to /healthz in the GitOps-managed manifest, then let Argo CD reconcile.Risk: low if only readiness path changes, but production policy requires human approval before merge.Action taken: drafted this update only; no cluster or repository changes were applied.The eighth stage is memory write-back. The agent stores a sanitized episode: service name, symptom, evidence pattern, root cause, proposed fix, and approval boundary. It does not store raw logs or secrets. The next time checkout has a readiness issue, the agent can recall the pattern, but it still must inspect live evidence because the next failure may have a different cause.
This example illustrates the central lesson of the module. The agent succeeded because memory, planning, tools, verification, and permissions supported each other. Memory suggested where to look, planning organized the investigation, tools gathered current evidence, verification prevented overreach, and policy stopped the agent before an unauthorized production change. Remove any one of those controls, and the same intelligent model becomes much less reliable.
Part 7: Production Economics and Observability
Section titled “Part 7: Production Economics and Observability”Agent cost grows faster than many teams expect because every stage can add tokens, latency, and external calls. A simple chatbot might answer with one model call. An agent may rewrite the query, retrieve memory, generate a plan, call several tools, summarize observations, replan after an error, verify the result, reflect on the answer, and store a new episode. Each step may be justified, but the total must be budgeted.
pie title Token Consumption Breakdown "User request and system context" : 120 "Memory retrieval context" : 320 "Planning" : 520 "Tool observations" : 460 "Verification" : 260 "Response synthesis" : 420 "Memory write-back" : 180Latency also compounds. A planner call that takes a few seconds may feel acceptable, but a planner call plus three sequential tool calls plus verification plus reflection can feel broken in an interactive UI. Production systems often need progress updates, asynchronous job handling, cancellation, and partial results. The user experience should reflect the actual execution model instead of pretending the agent is a normal instant chatbot.
| Metric | What It Measures | Why It Matters | Example Target |
|---|---|---|---|
| Task completion rate | Percentage of tasks that reach a valid end state | Captures whether the agent actually finishes useful work | Above 85 percent for supported workflows |
| First-pass success | Percentage completed without replan or retry | Separates smooth execution from expensive recovery | Above 70 percent after stabilization |
| Tool-call count | Number of external actions per task | Detects loops and inefficient plans | Bounded by workflow class |
| Memory precision | Retrieved memories that were actually useful | Catches noisy vector recall and stale summaries | Reviewed with labeled fixtures |
| Verification failure rate | Outputs rejected by tests or policy checks | Shows whether generation quality is improving | Should decline as prompts and tools mature |
| Human escalation rate | Tasks requiring manual decision or approval | Indicates risk boundaries and automation coverage | Expected to remain nonzero for high-risk work |
| P95 latency | Slow-end user-visible response time | Determines whether UX needs async handling | Set by product workflow, not wishful thinking |
Observability should preserve the agent’s reasoning structure without logging sensitive prompt contents unnecessarily. A trace should show request classification, retrieved memory IDs, selected plan, tool calls, tool outcomes, retries, verification decisions, final action, and memory writes. Sensitive values should be redacted, and logs should respect tenant boundaries. Without these traces, incident response becomes guesswork.
flowchart LR Span1[Intake span] --> Span2[Memory span] Span2 --> Span3[Planning span] Span3 --> Span4[Tool execution spans] Span4 --> Span5[Verification span] Span5 --> Span6[Response span] Span6 --> Span7[Memory write span] Span4 --> ErrorBudget[Budget counters] Span5 --> ErrorBudgetThe most useful dashboards separate model quality from system behavior. If the agent fails because a tool returned a schema error, that is not the same as a hallucinated answer. If retrieval returned stale memory, that is not the same as a bad planner. If the verifier rejected a risky action, that may be a success, not a failure. Mature operations classify these outcomes so teams improve the right layer.
from dataclasses import dataclass, fieldfrom time import monotonic
@dataclassclass AgentBudget: max_seconds: float = 60.0 max_tool_calls: int = 12 max_replans: int = 2 started_at: float = field(default_factory=monotonic) tool_calls: int = 0 replans: int = 0
def check_time(self) -> None: elapsed = monotonic() - self.started_at if elapsed > self.max_seconds: raise TimeoutError(f"Agent exceeded {self.max_seconds} second budget")
def record_tool_call(self) -> None: self.tool_calls += 1 if self.tool_calls > self.max_tool_calls: raise RuntimeError("Agent exceeded tool-call budget")
def record_replan(self) -> None: self.replans += 1 if self.replans > self.max_replans: raise RuntimeError("Agent exceeded replan budget")
if __name__ == "__main__": budget = AgentBudget(max_seconds=10.0, max_tool_calls=3, max_replans=1) budget.record_tool_call() budget.record_tool_call() budget.check_time() print("Budget still healthy")The budget object is simple, but it encodes a non-negotiable production rule: every autonomous loop needs a counter. Engineers sometimes try to solve runaway behavior with better prompts alone. Prompts help, but counters, deadlines, allowlists, and verifiers are the controls that remain enforceable when the model is uncertain or wrong.
Part 8: Choosing the Right Design
Section titled “Part 8: Choosing the Right Design”Designing an agent begins with a task inventory, not with a framework choice. Write down what the user wants, what evidence is needed, what tools may be called, what can go wrong, what must never happen automatically, and how success will be verified. Only then should you choose memory and planning patterns. Frameworks can speed implementation, but they cannot decide your risk boundary.
For low-risk knowledge tasks, a RAG pipeline with short-term memory may be enough. For customer workflows, add durable user preferences, explicit policy memory, and carefully scoped tools. For operational workflows, add structured planning, external verification, and human approval for destructive actions. For long-running research or software engineering tasks, consider episodic memory and specialized agents, but keep budgets strict.
| Scenario | Recommended Memory | Recommended Planning | Required Safety Layer |
|---|---|---|---|
| Simple documentation assistant | Short-term buffer plus RAG snippets | Stateless answer or minimal routing | Citation and source freshness checks |
| Personal productivity assistant | Preferences, summaries, and recent task state | Plan-and-Execute for multi-step tasks | Calendar or email permission gates |
| Customer support agent | User facts, policy memory, and case episodes | Plan-and-Execute with escalation branches | Refund, account, and legal policy checks |
| Batch invoice extraction | Minimal memory and evidence variables | ReWOO for predictable tool chains | Schema validation and sample audits |
| Production operations assistant | Team context, incidents, and runbooks | Plan-and-Execute with localized replanning | Read-only default, approval for mutation |
| Architecture decision assistant | Prior decisions and design constraints | Tree of Thought or debate | Evidence requirements and decision record |
A useful design review asks adversarial questions. What stale memory could cause the worst answer? What tool call would be dangerous if repeated? What step would be expensive if the planner loops? What evidence would prove the final answer wrong? What should the agent do when it is uncertain? These questions turn “agent intelligence” into concrete engineering requirements.
The final design principle is reversibility. Prefer actions that can be inspected, drafted, simulated, or approved before execution. A draft ticket is safer than a sent email. A proposed patch is safer than a direct production edit. A dry-run policy check is safer than a live change. Agents are most valuable when they accelerate human work without erasing accountability.
Did You Know?
Section titled “Did You Know?”- Fact 1: Larger context windows reduce some memory pressure, but they do not remove the need for retrieval design because long prompts can still bury relevant details among distracting text.
- Fact 2: ReWOO-style planning can reduce repeated reasoning calls when a tool chain is predictable, but it becomes brittle when later observations should change the plan.
- Fact 3: Episodic memory is most useful when it stores the situation, actions, evidence, outcome, and lesson together, rather than storing a vague summary of success.
- Fact 4: A verifier that uses tests, schema checks, policy rules, or live system state is usually more reliable than a reflection loop that only asks the same model to judge itself.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Storing every conversation turn forever | Teams confuse “more memory” with “better recall” and fill the vector store with low-value chatter. | Add importance scoring, TTLs, duplicate detection, and explicit write policies before indexing. |
| Treating vector similarity as truth | Semantically similar old facts can outrank newer or more authoritative facts when metadata is ignored. | Combine similarity with recency, source reliability, user scope, and conflict resolution rules. |
| Building plans as unstructured prose | Natural-language steps are easy to read but hard for executors to validate and recover from. | Generate structured plans with step IDs, dependencies, tool names, expected outputs, and retry rules. |
| Using ReWOO for exploratory debugging | The planner assumes the evidence path upfront even though each observation should change the next question. | Use Plan-and-Execute or a reactive loop with localized replanning for uncertain investigations. |
| Letting reflection run until the answer feels perfect | The model keeps finding minor improvements and spends tokens after the output is already useful. | Set reflection limits, define severity thresholds, and prefer external verification over self-opinion. |
| Allowing swarm agents to hand off without history | Peer agents can bounce the same task between each other without a clear owner or finish condition. | Track routing history, enforce max handoffs, and escalate when no agent materially changes the state. |
| Giving tool-creating agents broad execution rights | Dynamic code can duplicate tools, bypass policy, or execute unsafe behavior in the host environment. | Require sandboxing, allowlists, overlap checks, tests, and human approval for privileged tool creation. |
-
Your team deploys a customer support agent that remembers user preferences. A user says, “I no longer want SMS updates; use email only.” The next day, the agent retrieves both the old SMS preference and the new email preference. What should you inspect and change first?
Answer
Inspect the memory write and conflict-resolution path, not just the prompt. The new statement should either invalidate the old SMS preference or create a newer authoritative record that the context builder ranks above the old one. A robust fix includes metadata for preference type, timestamp, source, and active status, then retrieval logic that excludes invalidated preferences from prompt context.
-
A document-processing agent extracts fields from thousands of invoices. Each invoice follows the same workflow: parse text, extract vendor, validate totals, and write structured JSON. The current reactive agent makes a model call after every tool result and costs too much. Which planning pattern should you evaluate, and why?
Answer
Evaluate ReWOO because the tool chain is predictable and evidence can be gathered according to an upfront plan. The model can plan the extraction and validation steps once, the executor can run the deterministic tools, and a final solver can synthesize or validate the result. This reduces repeated reasoning calls compared with a reactive loop that re-reads growing context after every observation.
-
A production operations agent remembers that checkout once failed because of a readiness probe path change. Today checkout is failing again, and the agent immediately recommends the same fix without inspecting the cluster. What design flaw caused this behavior?
Answer
The agent treated episodic memory as proof instead of using it as an investigation hint. The fix is to label retrieved episodes as prior examples, require live evidence for operational diagnoses, and add verification steps that inspect current rollout status, events, logs, and manifests before recommending a change. Episodic memory should guide where to look, not replace current observations.
-
A Plan-and-Execute agent generates a six-step deployment plan. Step three fails because a registry API times out once, and the agent abandons the whole task. What should the executor support?
Answer
The executor should support bounded retries and localized replanning. A transient API timeout should not automatically invalidate the whole plan, but the system also should not retry forever. The step definition should include retry policy, expected output, failure classification, and a path to replan the failed step if the error is recoverable.
-
A design assistant must recommend whether to store agent memory in a vector database, a relational database, or a hybrid architecture. The decision depends on privacy, conflict resolution, semantic recall, and auditability. Which reasoning pattern fits best, and what safety condition should you add?
Answer
Tree of Thought or a debate pattern fits because the hard part is evaluating competing architectures against several criteria. The safety condition is evidence-based evaluation: each branch or debate participant should cite constraints and trade-offs, and a judge or verifier should apply a rubric. The system should not choose based on rhetorical confidence alone.
-
A swarm system has a Developer agent and a QA agent. QA keeps rejecting a script, Developer keeps returning tiny edits, and the task never finishes. What controls should the coordinator enforce?
Answer
The coordinator should enforce routing history, max handoffs, material-change checks, and escalation. If a handoff does not change the task state meaningfully, the next loop should be blocked or sent to a supervisor. A supervisor topology may be better if the workflow needs a single owner to decide when the script is good enough.
-
A tool-creating agent proposes a new Python function to check CPU, another to check memory, and another to check disk, even though an approved metrics tool already returns all three. What should the tool governor do?
Answer
The governor should reject the new tools because they overlap with an existing approved capability. It should instruct the agent to use the metrics tool and only approve new tool creation when existing tools cannot reasonably satisfy the task. This prevents tool proliferation, context bloat, and unnecessary security review.
Hands-On Exercise: Build a Bounded Memory-and-Planning Agent
Section titled “Hands-On Exercise: Build a Bounded Memory-and-Planning Agent”In this lab you will build a small local agent loop with deterministic mock components. The goal is not to create a powerful AI system. The goal is to practice the architecture from the worked example: classify a task, retrieve relevant memory, execute a structured plan, verify the result, enforce budgets, and store a sanitized episode.
You will use only the Python standard library. The mock language model returns deterministic plans so the exercise is repeatable without an API key. Keep the implementation small, but pay attention to the boundaries: the agent should not run forever, should not store every message, and should not claim success without verification.
Task 1: Create the Workspace
Section titled “Task 1: Create the Workspace”Create an isolated directory and virtual environment.
mkdir -p ~/kubedojo-agent-memory-planningcd ~/kubedojo-agent-memory-planning.venv/bin/python --version 2>/dev/null || /Users/krisztiankoos/projects/kubedojo/.venv/bin/python --version/Users/krisztiankoos/projects/kubedojo/.venv/bin/python -m venv .venv.venv/bin/python -m pip install --upgrade pipSuccess criteria:
- The directory
~/kubedojo-agent-memory-planningexists. -
.venv/bin/python --versionprints a Python version. - No command requires a cloud API key.
Task 2: Write the Runnable Agent
Section titled “Task 2: Write the Runnable Agent”Create agent_lab.py with the following code.
from __future__ import annotations
from dataclasses import dataclass, fieldfrom enum import Enumfrom time import monotonicfrom typing import Callableimport json
class MockLLM: """Deterministic model stub for the lab."""
def generate(self, prompt: str) -> str: if "CREATE_PLAN" in prompt: return json.dumps( { "steps": [ { "step_id": "inspect", "description": "Inspect rollout symptoms", "tool": "inspect_rollout", "tool_input": {"service": "checkout"}, "depends_on": [], "expected_key": "symptom", }, { "step_id": "docs", "description": "Read release notes for health endpoint changes", "tool": "search_docs", "tool_input": {"service": "checkout"}, "depends_on": ["inspect"], "expected_key": "endpoint", }, { "step_id": "draft", "description": "Draft a ticket update with evidence", "tool": "draft_ticket", "tool_input": {"ticket": "SUP-123"}, "depends_on": ["inspect", "docs"], "expected_key": "draft_id", }, ] } ) if "VERIFY" in prompt and "/ready" in prompt and "/healthz" in prompt: return json.dumps({"passed": True, "reason": "Probe mismatch is supported by evidence."}) if "SUMMARIZE_EPISODE" in prompt: return "Checkout rollout failed readiness because the manifest used /ready while the app exposed /healthz." return "Mock response"
@dataclassclass MemoryRecord: kind: str content: str importance: float
class Memory: """Small memory layer with importance filtering and simple keyword retrieval."""
def __init__(self) -> None: self.records: list[MemoryRecord] = []
def store(self, kind: str, content: str, importance: float) -> None: if importance >= 0.3: self.records.append(MemoryRecord(kind=kind, content=content, importance=importance))
def retrieve(self, query: str) -> list[MemoryRecord]: terms = {term.lower() for term in query.split()} matches = [] for record in self.records: record_terms = {term.lower().strip(".,") for term in record.content.split()} if terms & record_terms: matches.append(record) matches.sort(key=lambda record: record.importance, reverse=True) return matches[:3]
@dataclassclass Budget: max_tool_calls: int = 5 max_seconds: float = 20.0 started_at: float = field(default_factory=monotonic) tool_calls: int = 0
def record_tool_call(self) -> None: if monotonic() - self.started_at > self.max_seconds: raise TimeoutError("Task exceeded time budget") self.tool_calls += 1 if self.tool_calls > self.max_tool_calls: raise RuntimeError("Task exceeded tool-call budget")
class StepStatus(str, Enum): PENDING = "pending" COMPLETE = "complete" FAILED = "failed"
@dataclassclass PlanStep: step_id: str description: str tool: str tool_input: dict[str, str] depends_on: list[str] expected_key: str status: StepStatus = StepStatus.PENDING result: dict[str, str] = field(default_factory=dict)
def inspect_rollout(args: dict[str, str]) -> dict[str, str]: service = args["service"] return { "service": service, "symptom": "new pods fail readiness checks at /ready with HTTP 404", }
def search_docs(args: dict[str, str]) -> dict[str, str]: service = args["service"] return { "service": service, "endpoint": "release notes say checkout now exposes /healthz", }
def draft_ticket(args: dict[str, str]) -> dict[str, str]: ticket = args["ticket"] return { "draft_id": f"{ticket}-draft", "status": "drafted only, no production change applied", }
class BoundedAgent: def __init__(self, llm: MockLLM, tools: dict[str, Callable[[dict[str, str]], dict[str, str]]]) -> None: self.llm = llm self.tools = tools self.memory = Memory()
def seed_memory(self) -> None: self.memory.store("policy", "Production checkout changes require human approval.", 1.2) self.memory.store("episode", "A previous readiness incident involved a changed health endpoint.", 0.9) self.memory.store("chatter", "The user said hello during onboarding.", 0.1)
def run(self, task: str) -> str: budget = Budget() memories = self.memory.retrieve(task) memory_context = "\n".join(f"- {record.kind}: {record.content}" for record in memories)
plan_data = json.loads(self.llm.generate(f"CREATE_PLAN\nTask: {task}\nMemory:\n{memory_context}")) steps = [PlanStep(**step) for step in plan_data["steps"]] completed: dict[str, dict[str, str]] = {}
for step in steps: missing = [dependency for dependency in step.depends_on if dependency not in completed] if missing: step.status = StepStatus.FAILED raise RuntimeError(f"Step {step.step_id} missing dependencies: {missing}")
budget.record_tool_call() tool = self.tools[step.tool] step.result = tool(step.tool_input)
if step.expected_key not in step.result: step.status = StepStatus.FAILED raise RuntimeError(f"Step {step.step_id} did not return {step.expected_key}")
step.status = StepStatus.COMPLETE completed[step.step_id] = step.result
evidence = json.dumps(completed, indent=2) verification = json.loads(self.llm.generate(f"VERIFY\nTask: {task}\nEvidence:\n{evidence}")) if not verification["passed"]: raise RuntimeError(f"Verification failed: {verification['reason']}")
episode = self.llm.generate(f"SUMMARIZE_EPISODE\nTask: {task}\nEvidence:\n{evidence}") self.memory.store("episode", episode, 1.0)
return ( "Diagnosis: checkout readiness is failing because the manifest still probes /ready, " "while the release notes say the app now exposes /healthz.\n" "Action: drafted a ticket update only; no production change was applied.\n" f"Evidence:\n{evidence}" )
if __name__ == "__main__": agent = BoundedAgent( MockLLM(), { "inspect_rollout": inspect_rollout, "search_docs": search_docs, "draft_ticket": draft_ticket, }, ) agent.seed_memory() print(agent.run("The checkout rollout is stuck after today's release. Find the likely cause."))Run the lab script.
.venv/bin/python agent_lab.pySuccess criteria:
- The output identifies the readiness probe path mismatch.
- The output states that only a ticket draft was created.
- The evidence includes results from
inspect,docs, anddraft. - The script exits without any network calls.
Task 3: Break the Budget on Purpose
Section titled “Task 3: Break the Budget on Purpose”Modify Budget(max_tool_calls=5) to Budget(max_tool_calls=2) inside the run method, then run the script again.
.venv/bin/python agent_lab.pySuccess criteria:
- The script fails before silently completing the third tool call.
- The failure message mentions the tool-call budget.
- You can explain why this is safer than letting the agent continue indefinitely.
Restore the tool budget to 5 after the experiment.
Task 4: Break Verification on Purpose
Section titled “Task 4: Break Verification on Purpose”Change the search_docs tool so it returns /ready instead of /healthz, then run the script again.
.venv/bin/python agent_lab.pySuccess criteria:
- The verifier no longer has evidence for the diagnosis.
- The agent does not produce a successful final diagnosis.
- You can identify the difference between tool execution success and task verification success.
Restore search_docs so it returns /healthz.
Task 5: Add a Memory Hygiene Check
Section titled “Task 5: Add a Memory Hygiene Check”Add one more stored memory inside seed_memory with low importance, then confirm it does not appear in retrieved memory.
self.memory.store("chatter", "The user asked whether coffee was available.", 0.2)Success criteria:
- Low-importance chatter is not stored.
- Operationally relevant policy and episode memory remain retrievable.
- You can describe why memory write policy matters as much as retrieval ranking.
Task 6: Explain the Architecture
Section titled “Task 6: Explain the Architecture”Write a short note beside your lab files that answers these questions in your own words.
- Which part of the lab represents short-term task context?
- Which part represents durable policy or episodic memory?
- Where is planning separated from execution?
- Where does verification prevent a confident but unsupported answer?
- Which tool permission boundary prevents the agent from applying a production change?
Next Module
Section titled “Next Module”Move on to Module 1.7: Multi-Agent Systems to deepen coordination patterns, observability tracing, RBAC-compliant tool execution, and human-in-the-loop approval design.
Sources
Section titled “Sources”- ReAct: Synergizing Reasoning and Acting in Language Models — Primary source for the ReAct pattern discussed in the planning section.
- ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models — Primary source for planning tool use upfront and reducing repeated reasoning overhead.
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Primary source for multi-branch reasoning with evaluation and pruning.
- Jobs — Kubernetes documentation for one-off batch workloads that run to completion.
- Resource Management for Pods and Containers — Kubernetes documentation for declaring CPU and memory requests and limits.
- Voyager: An Open-Ended Embodied Agent with Large Language Models — Primary source for the Voyager example in the module’s autonomous-agent discussion.
- Lost in the Middle: How Language Models Use Long Contexts — Further reading on why larger context windows do not eliminate retrieval and memory design problems.
- OpenAI API Pricing — Further reading for illustrative cost comparisons in the production economics section.