Dynamic Context Orchestration

Complexity: [COMPLEX]

Time to Complete: ~50 minutes

Prerequisites: Context Engineering Fundamentals, Repository Engineering for Agents, and Retrieval, Tools, and Memory Boundaries; comfort reading Python traces and basic shell.


By the end of this module, you will be able to apply the following skills in production harness design and review:

  • Design a per-turn context orchestration policy that decides what to load, retain, summarize, evict, or refresh against a fixed token budget.
  • Evaluate cache hit economics using provider TTL rules, especially Anthropic’s default five-minute ephemeral prompt cache lifetime and OpenAI’s prefix-cache behavior.
  • Implement compaction and handoff flows that preserve load-bearing decisions while dropping stale tool output and redundant retrieved snippets.
  • Compare just-in-time versus just-in-case context loading and diagnose when lazy retrieval beats eager front-loading.
  • Debug context bloat with attribution logs that answer why each block entered the model window on a given turn.

Mira’s agent harness finally has the static substrate in place. The repository exposes AGENTS.md, scoped rules, retrieval indexes, and memory boundaries, while the model has a large window, MCP tools, and a vector store that returns ranked snippets. On paper, the Context arc from modules 2.1 through 2.3 is complete, yet in production the expensive failures moved one layer up into runtime policy rather than static authoring.

Hypothetical scenario: Mira starts a long debugging session on a flaky deployment controller. Turn 12 still carries the full stack trace from turn 3 even though she fixed that error six turns ago. Turn 18 injects a fresh retrieval hit about an old runbook revision while a newer policy file sits unread in the repo map. Turn 22 misses a cache hit because someone appended a timestamp to the stable system prefix. Turn 28 opens a sub-agent for a multi-file refactor where the child session inherits the parent’s entire chat log instead of a narrow task contract.

None of these failures is fixed by writing a better single prompt, because they are runtime orchestration failures where the harness did not manage context as a dynamic resource with a budget, a freshness policy, and observable decisions. Static context tells the agent what the world looks like when the session begins, while dynamic context orchestration decides what the agent sees on turn N after tools, retrieval, memory, compaction, cache refresh, and sub-task fan-out have already reshaped the working set. Teams that skip this layer often blame the model when the real issue is an implicit policy equivalent to “forward the entire chat transcript forever,” which is neither measurable nor safe under cost pressure.

This module closes the Context triplet by giving you an explicit policy layer you can review in code, in configuration diffs, and in telemetry dashboards. You will treat each model call as a compiled turn composed of stable prefix, injected rules, retrieved evidence, tool results, summaries, and the current ask. You will learn when to spend tokens on just-in-case loading versus just-in-time fetches, how Anthropic’s five-minute cache TTL changes sleep-and-resume math, and how handoffs migrate durable state without dragging transcript noise. You will also learn how to measure whether orchestration improved cache hit rate instead of only shrinking prompts. LangChain’s context documentation frames the same idea as managing short-term versus long-term state across a run; this module focuses on the harness-owned compiler that decides which of those surfaces enters each turn. The design goal is not maximal context but correct context under budget, with enough logging that an on-call engineer can explain every injected block without downloading the full prompt verbatim.

Every agent turn is a small batch job where the harness gathers inputs, applies a policy, compiles a prompt, calls the model, records outputs, and updates internal state for the next turn. Module 2.1 named the working-set layers (stable prefix, task frame, evidence, and scratch), module 2.2 placed durable policy in the repository, and module 2.3 split runtime facts across retrieval, tools, and memory; module 2.4 owns the loop that connects those pieces across time instead of treating each API call as an isolated chat append. LlamaIndex describes querying as the composition step that turns indexes and retrievers into a final model input; in agent harnesses that composition must be repeatable and testable, not delegated entirely to the model’s appetite for more context.

+------------------------------------------------------------------+
| Runtime context loop (one turn) |
+------------------------------------------------------------------+
| 1. Read session state (budget, cache clock, open decisions) |
| 2. Classify intent (debug, edit, review, plan, handoff) |
| 3. Select static prefix (repo rules, tool schemas, skills) |
| 4. Decide dynamic inserts (retrieve? tool? memory? skip?) |
| 5. Apply eviction / compaction on existing working set |
| 6. Assemble ordered prompt + log attribution ledger |
| 7. Call model -> parse output -> update state for turn N+1 |
+------------------------------------------------------------------+

The loop is not optional infrastructure: if your product only forwards chat history to the API, you still have a policy, but it is implicit, unmeasured, and usually means “keep everything forever until the window breaks.” Explicit orchestration makes that policy reviewable in pull requests the same way you review authentication middleware, because the alternative is debugging production spend spikes by editing prose prompts. A practical maturity ladder helps teams prioritize: Level 0 forwards raw chat; Level 1 adds static repo files at session start; Level 2 adds per-turn retrieval and tool output with caps; Level 3 adds eviction, compaction, and ledgers; Level 4 adds cache-aware prefix layout and TTL-aware session pacing. Most production incidents in this arc sit between Level 2 and Level 3, where retrieval and tools exist but no harness-owned eviction runs.

Static context changes slowly relative to a task and includes repository instruction stacks, tool definitions, rubrics, and schema-stable skill files. Dynamic context changes every turn or on intent triggers and includes the latest user message, fresh tool output, newly retrieved chunks, conditional rule injections, and compaction summaries generated inside the session. The orchestrator’s job is to keep static bytes stable for caching while treating dynamic bytes as lease-held: they enter with metadata, earn their place by relevance to the current intent, and leave when superseded or resolved.

| Class | Examples | Typical load trigger | Risk if mishandled |
|---|---|---|---|
| Static | AGENTS.md, MCP tool schemas, output contracts | session start, cache-friendly prefix | stale repo policy if not refreshed after merge |
| Semi-static | issue body, branch name, feature flags | task start | wrong issue context carried across tasks |
| Dynamic | command output, file reads, retrieval hits | per turn or on tool event | bloat, stale evidence, cache breakage |
| Derived | compaction summaries, handoff notes | after compaction or /handoff | dropped load-bearing decisions |

Pause and predict: You are on turn 15 of a refactor where the agent no longer needs the full kubectl describe output from turn 4 because the pod is healthy now. Should that output stay in the window for cache stability, or should the harness summarize and evict it? Write down your choice before reading the compaction section, because the answer depends on whether the bytes are still decision-critical or only historical noise; keeping resolved logs “for stability” often destroys stability by pushing the cache breakpoint or evicting still-needed task-frame bytes under pressure.

Just-in-case loading front-loads context because it might become useful: it feels safe and reduces mid-task retrieval latency, but it spends budget early and pushes variable bytes into prefix positions that can break provider caches. Just-in-time loading waits until a specific decision needs a fact, then fetches or reads narrowly, which pairs well with staged context from module 2.1 but requires reliable intent detection and a retrieval budget per turn. The failure mode of pure just-in-case is prefix bloat and contradictory evidence; the failure mode of pure just-in-time is latency spikes and tool loops when the model does not know a corpus exists. Production coding agents therefore use hybrid staging with explicit logging of which branch fired.

flowchart LR
    A[Turn starts] --> B{Intent known?}
    B -- no --> C[Minimal prefix + task frame]
    B -- yes --> D{Need external fact?}
    D -- no --> E[Skip retrieval]
    D -- yes --> F[JIT fetch with budget cap]
    C --> G[Ask clarifying question or tool]
    E --> H[Assemble turn]
    F --> H
    G --> H

A practical default for production coding agents is hybrid staging: load the repo map and task frame just-in-case because almost every turn needs them, but load file bodies, logs, and vector snippets just-in-time with a per-turn cap. Log the reason code for each injection (intent:debug, tool:read_file, retrieve:policy) so reviewers can reconstruct the policy later during incident review. When two mechanisms could supply the same fact (repo file versus vector chunk versus tool output), the orchestrator should pick the freshest authoritative source and skip the others, logging the skip reason instead of silently stacking duplicates.
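The "freshest authoritative source" rule above can be sketched as a small selector. This is a minimal illustration, not a reference implementation: the `Candidate` fields, the `AUTHORITY` ranking, and the choice to let freshness win before authority are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical authority ranking: lower number wins when freshness ties.
AUTHORITY = {"tool": 0, "repo": 1, "vector": 2}


@dataclass(frozen=True)
class Candidate:
    fact_key: str      # what the block claims to answer, e.g. "deploy-policy"
    source_kind: str   # "tool", "repo", or "vector"
    captured_at: int   # unix seconds when the content was read or indexed
    tokens: int


def _rank(c):
    # Smaller tuple is better: fresher first, then more authoritative.
    return (-c.captured_at, AUTHORITY[c.source_kind])


def select_sources(candidates):
    """Keep one candidate per fact_key; return (kept, skip_log)."""
    best = {}
    for c in candidates:
        cur = best.get(c.fact_key)
        if cur is None or _rank(c) < _rank(cur):
            best[c.fact_key] = c
    kept = list(best.values())
    # Every candidate that lost gets a ledger row instead of a silent drop.
    skip_log = [
        {"fact_key": c.fact_key, "skipped": c.source_kind,
         "reason": f"superseded_by:{best[c.fact_key].source_kind}"}
        for c in candidates if best[c.fact_key] is not c
    ]
    return kept, skip_log


candidates = [
    Candidate("deploy-policy", "vector", 1_700_000_000, 400),
    Candidate("deploy-policy", "repo", 1_700_500_000, 650),
]
kept, skip_log = select_sources(candidates)
```

Here the newer repo file wins and the older vector chunk is recorded as skipped, so a reviewer can see why only one of the two reached the prompt.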

Intent triggers are guardrails, not magic: they map observable signals to context actions such as file-path globs that inject security rules, task labels that attach evaluation rubrics, and failure classes that allow larger log excerpts. Triggers should be versioned configuration checked into the repo, not ad-hoc prompt paragraphs, so changes receive code review and tests. A trigger that fires on **/deploy/** but ignores environment-specific overrides is a common source of “the agent knew the runbook but not the cluster policy” bugs.

orchestration_triggers:
  - when:
      paths_match: "**/deploy/**"
    inject:
      - docs/runbooks/deployment-checklist.md
    budget_tokens: 1200
    freshness: require_repo_head
  - when:
      intent: debug
    allow:
      tool_output_max_tokens: 3500
      retrieve: true
    evict:
      resolved_errors: true
  - when:
      intent: handoff
    action:
      compact_transcript: aggressive
      write_session_note: docs/session-state/

Triggers should be idempotent and logged, and if two triggers fire on the same turn the orchestrator needs deterministic precedence (for example, safety rules before convenience snippets) rather than whichever retrieval ranker spoke loudest. Precedence tables belong in configuration alongside budgets, because “both fired” is normal during refactors that touch deploy paths and tests simultaneously. Without precedence, you get oscillation: turn 19 loads a security rubric, turn 20 loads a performance tuning snippet, and turn 21 contradicts both because the model attended to whichever block appeared last.
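One way to make "both fired" deterministic is to sort fired triggers by a precedence class before admitting them against a budget. The sketch below assumes a three-class `PRECEDENCE` table and a flat dict shape for triggers; both are hypothetical and would live in versioned configuration in a real harness.

```python
# Hypothetical precedence table: lower number is applied first.
PRECEDENCE = {"safety": 0, "task": 1, "convenience": 2}


def resolve_triggers(fired, budget_tokens):
    """Order fired triggers deterministically and admit them until the budget runs out.

    `fired` is a list of dicts with keys: id, klass, tokens.
    Returns (admitted_ids, ledger).
    """
    # Sort by class first, then by id, so the result never depends on firing order.
    ordered = sorted(fired, key=lambda t: (PRECEDENCE[t["klass"]], t["id"]))
    admitted, ledger, used = [], [], 0
    for t in ordered:
        if used + t["tokens"] <= budget_tokens:
            used += t["tokens"]
            admitted.append(t["id"])
            ledger.append({"trigger": t["id"], "action": "inject", "tokens": t["tokens"]})
        else:
            ledger.append({"trigger": t["id"], "action": "skip", "reason": "budget"})
    return admitted, ledger


fired = [
    {"id": "perf-tuning", "klass": "convenience", "tokens": 900},
    {"id": "security-rubric", "klass": "safety", "tokens": 1000},
]
admitted, ledger = resolve_triggers(fired, budget_tokens=1500)
```

With a 1,500-token budget the security rubric is admitted and the performance snippet is logged as skipped, regardless of which trigger fired first.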

Lazy retrieval means the model does not receive corpus excerpts until the harness decides they are worth their token cost, and it should be paired with tool gating so the model cannot bypass the budget by spamming search tools. A simple gate is a per-turn retrieval allowance enforced before any snippet bytes are appended to the prompt. RECOMP-style compression (Retrieve, Compress, Prepend) is the research analogue: compress multiple retrieved documents into a short summary before prepending, and emit an empty summary when retrieval is irrelevant so the model is not forced to attend to noise. Your harness can implement a lighter version without training a compressor—dedupe by source hash, cap tokens, and require a one-line “why retrieved” justification in the ledger—but the economic intuition is the same: retrieval is not free just because the vector database returned a hit.

# Illustrative policy fragment, not a production harness
MAX_RETRIEVAL_TOKENS_PER_TURN = 1800


def allow_retrieval(state, query, estimated_tokens):
    # Refuse when this fetch would push the turn past its retrieval allowance.
    if state.retrieval_tokens_this_turn + estimated_tokens > MAX_RETRIEVAL_TOKENS_PER_TURN:
        state.log("retrieval_skipped", reason="turn_budget")
        return False
    # Skip retrieval when a recent tool result already answers the query.
    if state.has_fresh_tool_answer(query):
        state.log("retrieval_skipped", reason="fresh_tool_cache")
        return False
    return True

The gate turns retrieval from a model impulse into a harness decision, which is the core of dynamic orchestration. MCP tool definitions belong in the stable prefix when possible, but tool results are dynamic evidence; the MCP tools specification describes structured tool results so harnesses can validate and redact before injection rather than pasting raw JSON. That validation step is part of gating: a tool that returns ten thousand tokens of logs should not automatically become ten thousand tokens of model context.
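The lighter-weight filter described earlier (dedupe by source hash, cap tokens, require a one-line justification) might look like the sketch below. The snippet dict shape, the whitespace token counter, and the reason codes are assumptions for illustration; a real harness would plug in its own tokenizer and ledger.

```python
import hashlib


def filter_snippets(snippets, max_tokens, count_tokens=lambda s: len(s.split())):
    """Dedupe by content hash, require a justification, and cap total tokens.

    `snippets` is a list of dicts with keys: source, text, why.
    Input order is assumed to be rank order (best first).
    `count_tokens` is a whitespace stand-in for a real tokenizer.
    Returns (kept, ledger).
    """
    seen, kept, ledger, used = set(), [], [], 0
    for s in snippets:
        digest = hashlib.sha256(s["text"].encode("utf-8")).hexdigest()[:12]
        cost = count_tokens(s["text"])
        if not s.get("why"):
            reason = "missing_justification"
        elif digest in seen:
            reason = "duplicate_hash"
        elif used + cost > max_tokens:
            reason = "turn_budget"
        else:
            seen.add(digest)
            used += cost
            kept.append(s)
            ledger.append({"source": s["source"], "hash": digest, "action": "inject",
                           "tokens": cost, "why": s["why"]})
            continue
        ledger.append({"source": s["source"], "hash": digest,
                       "action": "skip", "reason": reason})
    return kept, ledger
```

When every snippet is rejected the function returns an empty list, which plays the same role as RECOMP's empty summary: the model is not asked to attend to evidence the harness could not justify.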

Active learning prompt: Open a recent agent trace from your environment and, for three injected blocks, answer whether each is static or dynamic, just-in-time or just-in-case, and what eviction rule should have removed it. If you cannot answer from the trace, list the telemetry fields you would add (block kind, source hash, freshness timestamp, load-bearing flag, cache hit class). Traces that only show “messages[]” without injection metadata are Level 0 systems; your goal in this module is Level 3 or better.

Token budgets are not only model limits—they are cost, latency, and cache contracts. Module 2.1 introduced prefix caching and effective attention budgets; this section adds turn-level economics covering what you pay when the cache hits, what you pay when it misses, and how Anthropic’s five-minute ephemeral TTL changes pause behavior. Google’s Gemini long-context documentation emphasizes that very large windows still reward selective placement of critical facts; orchestration remains necessary because “fits in the window” is not the same as “reliably used by the model.”

Treat each turn as debits against a working budget where every block has an owner and a renewal policy, not as a single “remaining tokens” gauge on the API client.

| Bucket | What consumes it | Orchestration knob |
|---|---|---|
| Stable prefix | instructions, schemas, maps | keep byte-stable across turns |
| Task frame | issue, acceptance criteria | refresh only when task changes |
| Evidence | tool output, file reads | cap size, summarize when resolved |
| Retrieval | vector snippets | rank + dedupe + TTL |
| Output reserve | model completion | never steal from input silently |

A turn that spends 90% of the budget on historical logs may technically fit the window while still failing the task because the acceptance criterion no longer fits in the effective attention zone described in Liu et al.’s lost-in-the-middle findings. Always reserve completion tokens explicitly and treat overrun as a harness bug, not as a model character flaw. When budgets tighten, cut in this order unless a load-bearing registry says otherwise: superseded retrieval, resolved tool logs, optional examples, narrative repetition, and only then semi-static session material.

turn_budget:
  model_limit_tokens: 200000
  target_input_tokens: 52000
  output_reserve_tokens: 6000
  allocations:
    stable_prefix: 14000
    task_frame: 2500
    evidence: 18000
    retrieval: 3500
    scratch_summaries: 4000
    headroom: 10000

Headroom is not waste: it absorbs unexpected tool output and prevents emergency compaction from deleting the wrong block under pressure when a single kubectl or test command dumps a larger-than-expected payload. Teams that run at 98% utilization every turn are optimizing for a demo, not for a week-long refactor where one noisy command should not collapse the session.
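A budget like the one above is easy to sanity-check in code. The sketch below mirrors the same numbers in a Python dict and adds two hypothetical helpers, `validate_budget` and `headroom_remaining`; the function names and the rule that overruns draw on headroom are assumptions for illustration.

```python
TURN_BUDGET = {
    "model_limit_tokens": 200_000,
    "target_input_tokens": 52_000,
    "output_reserve_tokens": 6_000,
    "allocations": {
        "stable_prefix": 14_000, "task_frame": 2_500, "evidence": 18_000,
        "retrieval": 3_500, "scratch_summaries": 4_000, "headroom": 10_000,
    },
}


def validate_budget(b):
    """Return a list of problems; an empty list means the budget is coherent."""
    problems = []
    allocated = sum(b["allocations"].values())
    if allocated != b["target_input_tokens"]:
        problems.append(
            f"allocations sum to {allocated}, target is {b['target_input_tokens']}")
    if b["target_input_tokens"] + b["output_reserve_tokens"] > b["model_limit_tokens"]:
        problems.append("input target plus output reserve exceeds model limit")
    return problems


def headroom_remaining(b, usage):
    """Headroom left after buckets that ran over their allocation draw on it.

    `usage` maps bucket name -> tokens actually used this turn.
    A negative result means compaction is needed before the model call.
    """
    overflow = sum(max(0, usage.get(name, 0) - cap)
                   for name, cap in b["allocations"].items() if name != "headroom")
    return b["allocations"]["headroom"] - overflow
```

For example, a noisy command that pushes evidence to 24,000 tokens leaves 4,000 tokens of headroom, while 30,000 tokens of evidence returns a negative value and should trigger scheduled compaction instead of an emergency cut.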

Anthropic Five-Minute TTL As A Control Variable

Anthropic documents ephemeral prompt caching with a default five-minute lifetime, refreshed on cache use, with optional longer TTL at higher cost. OpenAI documents automatic prefix caching with in-memory retention often on the order of five to ten minutes of inactivity for many models, with extended retention on newer model families. These numbers are not trivia—they are scheduling constraints for agent sessions where human review loops routinely exceed five minutes. Anthropic’s engineering guidance for Claude Code explicitly treats context window fill as the primary resource to manage, which aligns with treating TTL as part of session design rather than as vendor trivia.

Timeline (Anthropic ephemeral cache)
|-- write cache entry (turn 1) --|
|.......... 5 min TTL ..........|
| refresh on hit |
|.......... 5 min TTL ..........|
| expire -> full prefix reprocess (cache miss) |

Worked example: Suppose a stable prefix is 18,000 tokens long. Processed uncached, it bills as 18,000 input tokens; served from cache at the 0.1× read multiplier (per Anthropic’s published caching price table), it bills as the equivalent of 1,800. A cache miss on turn 20 therefore costs roughly the difference, about 16,200 token-equivalents before any cache-write premium, so three accidental misses in an hour can exceed the cost of a careful human-written summary policy. If your harness pauses for eight minutes while the human reviews a diff, the cache may expire and the next turn pays the miss unless you intentionally keep the session warm with low-cost heartbeat turns or you restructure the prefix so reprocessing is cheap enough to tolerate. Heartbeat turns carry both a cost and an ethics question: they consume tokens and model capacity, and they can create the illusion of progress while the human is away, so document them as an explicit policy with rate limits.
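The arithmetic in the worked example can be written down so it is easy to rerun with your own prefix size. This is a sketch under stated assumptions: `write_multiplier` defaults to 1.0 to match the simplification in the example, and providers may price cache writes above base input, so check the current price table before relying on the output.

```python
def prefix_cost(prefix_tokens, hit, read_multiplier=0.1, write_multiplier=1.0):
    """Billed input-token equivalents for the stable prefix on one turn."""
    return prefix_tokens * (read_multiplier if hit else write_multiplier)


def miss_penalty(prefix_tokens, read_multiplier=0.1, write_multiplier=1.0):
    """Extra token-equivalents paid when a turn misses instead of hitting."""
    return (prefix_cost(prefix_tokens, False, read_multiplier, write_multiplier)
            - prefix_cost(prefix_tokens, True, read_multiplier, write_multiplier))


# Numbers from the worked example: an 18,000-token prefix and a 0.1x cache read.
penalty_per_miss = miss_penalty(18_000)
penalty_three_misses = 3 * penalty_per_miss
```

Passing a larger `write_multiplier` shows how a cache-write premium widens the gap between a hit and a miss.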

Compare two orchestration choices during a coffee break where the human is away for eight minutes and the stable prefix is large enough that cache misses are material:

| Strategy | What happens after 8-minute pause | Tradeoff |
|---|---|---|
| Do nothing | cache likely expired; next turn reprocesses prefix | simple, predictable cost spike |
| Lightweight ping | may refresh TTL if provider counts the hit | spends tokens; may annoy rate limits |
| Split stable prefix externally | reload smaller compiled map | engineering work; smaller miss penalty |

None of these strategies is universally correct: the right choice depends on how often pauses exceed TTL and how large the stable prefix is. If pauses are long and prefixes are huge, handoff-first orchestration usually beats heartbeat-first orchestration because it resets dynamic tail noise while preserving promoted decisions in a semi-static artifact. If pauses are short and prefixes are modest, accepting occasional misses may be cheaper than engineering elaborate ping machinery.

Pause and predict: Your stable prefix is 22,000 tokens and your median inter-turn gap is six minutes during code review. Do you expect cache hits on most turns, or frequent misses? What orchestration change reduces miss cost without stuffing timestamps into the prefix? A correct answer usually involves moving clocks and request IDs into external logs, splitting tool schemas into a versioned attachment loaded only when tools change, and promoting acceptance criteria into a compact task frame that survives compaction.

Not every expensive turn is a “cache miss” in the provider sense, so classify misses by cause to keep telemetry actionable instead of lumping all costly turns into one bucket.

| Miss type | Symptom | Typical orchestration fix |
|---|---|---|
| Prefix drift | cache_read_input_tokens drops to 0 after harmless-looking edit | remove per-turn timestamps from stable prefix |
| Below minimum length | no cache fields despite cache_control | increase stable prefix or accept no cache |
| TTL expiry | miss after idle gap | shorten pauses, shrink prefix, or tolerate miss |
| Breakpoint too late | growing chat pushes breakpoint past 20-block lookback | add explicit breakpoint on semi-static boundary |
| Tool schema churn | tools changed between turns | version tool definitions separately |

Log provider usage fields every turn: for Anthropic, inspect cache_creation_input_tokens and cache_read_input_tokens; for OpenAI, inspect usage.prompt_tokens_details.cached_tokens. Without those counters, teams optimize prose instead of economics and will ship “shorter prompts” that still miss caches because a dynamic header moved by one byte. Pair provider counters with harness ledger hashes of the stable prefix so you can tell drift apart from TTL expiry in one glance.
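Pairing the provider counter with a prefix hash makes it possible to label each turn automatically. The classifier below is a rough sketch: the label names, the check order, and the `min_cacheable_tokens` parameter (which is provider- and model-specific and must be supplied by the caller) are assumptions, and it covers only the miss types that can be told apart from these few signals.

```python
import hashlib

DEFAULT_TTL_SECONDS = 300  # the five-minute ephemeral lifetime discussed above


def prefix_hash(prefix_text):
    """Short, stable fingerprint of the stable prefix for the harness ledger."""
    return hashlib.sha256(prefix_text.encode("utf-8")).hexdigest()[:16]


def classify_turn(prev_hash, cur_hash, gap_seconds, cache_read_tokens,
                  prefix_tokens, min_cacheable_tokens,
                  ttl_seconds=DEFAULT_TTL_SECONDS):
    """Label one turn for telemetry using ledger and provider signals."""
    if cache_read_tokens > 0:
        return "hit"
    if prev_hash is None:
        return "cold_start"            # first turn: nothing could have been cached yet
    if prefix_tokens < min_cacheable_tokens:
        return "below_minimum_length"
    if prev_hash != cur_hash:
        return "prefix_drift"          # bytes changed, so TTL is irrelevant
    if gap_seconds > ttl_seconds:
        return "ttl_expiry"
    return "unclassified_miss"         # e.g. breakpoint placement or tool churn
```

A turn labeled `unclassified_miss` is the cue to inspect breakpoint placement and tool schema versions by hand.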

Hypothetical scenario: A long-running agent session compacts aggressively every 30 turns, which shrinks the transcript but leaves a 25,000-token stable prefix intact; the human pauses for lunch, and after lunch the cache is cold. Re-priming requires re-sending the prefix plus re-loading two retrieved policy snippets the orchestrator thought were still “fresh enough” in memory. Sometimes the cheapest operational move is to start a new session with a structured handoff note rather than resurrecting the bloated internal state machine, and that is not failure—it is orchestration choosing a clean working set over nostalgic attachment to chat history. OpenAI’s harness engineering writing describes multi-session workflows where durable state lives outside the chat transcript; dynamic orchestration generalizes that pattern to any long-horizon agent product.

Compaction is lossy compression with obligations: you are allowed to drop bytes only when you can show either that the information is no longer decision-critical or that its durable form already lives in a better surface (repo doc, memory store, handoff file). Treat compaction as a scheduled batch job tied to turn count, budget pressure, or explicit /compact commands—not as an emergency-only panic button—because emergency compaction under pressure is when teams delete acceptance criteria. RECOMP research shows that compressing retrieved evidence into a short faithful summary before prepending can preserve task quality at a fraction of token cost; session compaction applies the same idea to tool logs and chat evidence inside a long agent run.

| Content type | Default action when resolved | Migrate to |
|---|---|---|
| Verbose tool logs | summarize to causal chain | scratch summary in session |
| Retrieved snippets | evict when superseded | link + hash in ledger |
| Open questions | keep until answered | task frame |
| Accepted decisions | promote summary | handoff note + issue comment |
| Rejected options | keep short veto line | session summary |
| Durable policy discovered mid-task | promote | repo doc via human PR |

Compaction should never delete the only copy of a load-bearing constraint: if the acceptance criterion existed only in turn 2 prose, compaction must lift it into the task frame or an explicit open_decisions block before the original text disappears. A promotion checklist before compaction runs prevents the most common regression: “the agent forgot it must not commit generated artifacts” after a summarize pass that sounded fluent but dropped negations. Run promotion first, compact second, and log both steps in the ledger so reviewers can see causality.

Before compaction (turn 19)
+------------------------------------------------+
| stable prefix |
| task frame (issue + AC) |
| tool log A (resolved) |
| tool log B (resolved) |
| retrieval chunk X (superseded) |
| fresh tool log C (active) |
| user ask |
+------------------------------------------------+
After compaction (turn 20)
+------------------------------------------------+
| stable prefix |
| task frame (issue + AC + promoted decisions) |
| summary: logs A+B merged into 12 lines |
| retrieval pointer: X archived in ledger |
| fresh tool log C (active) |
| user ask |
+------------------------------------------------+

Load-bearing decisions are constraints that change tool authorization, file edit scope, or merge requirements, such as “do not touch generated artifacts,” “must run .venv/bin/python scripts/test_pipeline.py,” and “split PR if diff exceeds 200 LOC.” Store them in a machine-visible list, not buried inside narrative summary prose, because summaries are optimized for fluency while registries are optimized for enforcement. The orchestrator should refuse to compact away any registry item unless it is promoted to the task frame or written to a handoff artifact with a backlink, mirroring how production policy engines refuse to delete rules without an explicit deprecation event.

load_bearing_decisions:
  - id: ac-3
    text: "Do not commit .pipeline/state.yaml"
    source_turn: 2
    expires: task_end
  - id: review-1
    text: "Cross-family review required before merge"
    source_turn: 11
    expires: task_end

Each entry carries an id, its source turn, and an expiry, so the compaction pass can check the registry mechanically before it drops any text.
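The "promote first, compact second" ordering can be sketched as a single pass that writes both steps to the ledger. The data shapes here (plain dicts for registry items and blocks, a list of strings for the task frame) are hypothetical simplifications.

```python
def promote_then_compact(registry, task_frame, blocks):
    """Promote registry items into the task frame, then drop resolved blocks.

    registry: list of {"id", "text"}; task_frame: list of strings;
    blocks: list of {"id", "resolved": bool, "tokens": int}.
    Returns (new_task_frame, remaining_blocks, ledger).
    """
    ledger = []
    frame = list(task_frame)
    # Step 1: every load-bearing decision gets a durable line in the task frame.
    for item in registry:
        line = f"[{item['id']}] {item['text']}"
        if line not in frame:
            frame.append(line)
            ledger.append({"step": "promote", "id": item["id"]})
    # Step 2: only now is it safe to drop resolved evidence.
    remaining = []
    for b in blocks:
        if b["resolved"]:
            ledger.append({"step": "compact", "block_id": b["id"],
                           "tokens_freed": b["tokens"]})
        else:
            remaining.append(b)
    return frame, remaining, ledger


registry = [
    {"id": "ac-3", "text": "Do not commit .pipeline/state.yaml"},
    {"id": "review-1", "text": "Cross-family review required before merge"},
]
blocks = [
    {"id": "tool-log-A", "resolved": True, "tokens": 3000},
    {"id": "tool-log-C", "resolved": False, "tokens": 1200},
]
frame, remaining, ledger = promote_then_compact(registry, ["Issue: flaky deploy"], blocks)
```

Because promotion rows always precede compaction rows in the ledger, a reviewer can confirm that no constraint lost its only copy.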

Handoffs are how dynamic orchestration survives session boundaries without dumping the entire transcript into the next prompt. A good handoff is HTML or markdown with stable sections: goal, current state, decisions, blockers, next actions, and links to evidence. KubeDojo’s own session workflow uses docs/session-state/ plus a STATUS.md index, and that pattern is intentional: the index stays small while the narrative lives in a dedicated artifact, which preserves cache-friendly prefixes in later sessions. The orchestration lesson generalizes to any product where session B should cold-start from pointers, not from replaying session A’s entire tool output history.

Session A ends
|
v
/handoff writer -> docs/session-state/2026-05-25-topic.html
|
v
STATUS.md index updated (pointers only)
|
v
Session B starts
|
v
cold-start API -> briefing/orient -> load handoff pointer
|
v
JIT repo reads only for files referenced in handoff

Dynamic orchestration for session B should treat the handoff as semi-static context for the first turns, then return to just-in-time expansion for file bodies and retrieval. Do not paste the handoff plus the entire previous chat log unless you are performing forensic review, because that duplicates decisions and breaks cache locality while giving the illusion of “more context.” Claude Code’s documented workflow explicitly recommends /clear between unrelated tasks and structured handoffs for larger features; your harness should encode the same separation between exploratory research sessions and implementation sessions.

Summaries fail in predictable ways: they smooth away negations, drop version numbers, or merge incompatible decisions. Add a quality gate before accepting a compaction summary, and keep raw evidence one more turn when the gate fails even if token pressure is high.

| Check | Question |
|---|---|
| Coverage | Does every load_bearing_decisions entry appear? |
| Freshness | Are timestamps and versions still present where needed? |
| Provenance | Can a reviewer open the source turn or artifact? |
| Conflict | Did we merge incompatible instructions? |

If the gate fails, keep the raw evidence block one more turn and tighten the summarizer prompt, because spending extra tokens for one turn is cheaper than shipping the wrong patch. Gates can be automated cheaply: require every load-bearing registry ID to appear verbatim in the summary, require version strings to match a regex, and require explicit “rejected option” lines when the session debated alternatives.
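The automated checks just described fit in a few lines. This is a minimal sketch: the version regex, the literal "rejected option" marker, and the substring match on registry IDs are illustrative choices, and a production gate would need stricter matching.

```python
import re

# Matches strings like "1.29" or "v1.30.2"; an assumed pattern for the sketch.
VERSION_RE = re.compile(r"\bv?\d+\.\d+(?:\.\d+)?\b")


def summary_gate(summary, registry_ids, source_text, debated_alternatives=False):
    """Return a list of failed checks; empty means the summary may replace the raw block."""
    failures = []
    # Coverage: every load-bearing ID must appear verbatim (substring match).
    missing = [rid for rid in registry_ids if rid not in summary]
    if missing:
        failures.append(f"coverage: missing {missing}")
    # Freshness: version strings present in the source must survive summarization.
    lost = set(VERSION_RE.findall(source_text)) - set(VERSION_RE.findall(summary))
    if lost:
        failures.append(f"freshness: dropped versions {sorted(lost)}")
    # Conflict: debated sessions must keep an explicit veto line.
    if debated_alternatives and "rejected option" not in summary.lower():
        failures.append("conflict: no rejected-option line")
    return failures
```

If the returned list is non-empty, the harness keeps the raw evidence block for another turn and records the failures in the ledger.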

Dynamic Prompt Assembly And Policy Injection

Dynamic prompt assembly is the compiler pass that turns policy into bytes: static repo files supply defaults, and the orchestrator selects which rules, skills, and schemas enter this turn. Treat assembly like a linker: unresolved symbols (missing skills, stale tool schemas) should fail closed or fall back to a known-safe minimal prefix, not silently link random documents because retrieval ranked them highly.

Think in layers, not one giant string, because monolithic system prompts defeat caching, review, and team ownership boundaries.

| Layer | Owner | Changes when | Cache impact |
|---|---|---|---|
| Platform | vendor / harness | rare | highest stability |
| Organization | company policy | weekly | high |
| Repository | AGENTS.md, rules | per merge | medium |
| Session | handoff, preferences | per session | medium-low |
| Turn | user ask, tool results | every turn | dynamic tail |

Assembly order should follow provider cache hierarchy: tools, system, then messages (Anthropic documents this ordering). Put stable layers first and append volatile layers last so cache breakpoints align with semi-static boundaries rather than with the latest user sentence. When tool lists change between turns, version them explicitly; MCP servers that declare the listChanged capability can send notifications/tools/list_changed, and harnesses that hot-swap schemas without adjusting breakpoints are a common source of silent cache invalidation.

flowchart TB
    subgraph prefix["Cache-friendly prefix"]
        P[Platform safety + contracts]
        O[Org policy excerpt]
        R[Repo AGENTS + scoped rules]
    end
    subgraph mid["Semi-stable"]
        S[Session handoff pointer]
        T[Task frame]
    end
    subgraph tail["Dynamic tail"]
        E[Evidence + retrieval]
        U[User ask]
    end
    P --> O --> R --> S --> T --> E --> U
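The layer ordering in the diagram can be expressed as a small assembler that also fingerprints the stable prefix. This is a sketch under assumptions: the layer names, the plain-text concatenation, and the fail-closed `ValueError` are illustrative, and real provider requests use structured tool, system, and message fields instead of one string.

```python
import hashlib

LAYER_ORDER = ["platform", "organization", "repository",
               "session", "task_frame", "evidence", "user_ask"]
STABLE_LAYERS = {"platform", "organization", "repository"}


def assemble_turn(layers):
    """Concatenate layers in fixed order and hash only the stable prefix.

    `layers` maps layer name -> text. Missing stable layers fail closed.
    Returns (prompt, stable_prefix_hash, ledger).
    """
    missing = STABLE_LAYERS - set(layers)
    if missing:
        raise ValueError(f"missing stable layers: {sorted(missing)}")
    # Empty or absent optional layers are skipped; order never changes.
    ordered = [(name, layers[name]) for name in LAYER_ORDER if layers.get(name)]
    prefix = "\n\n".join(text for name, text in ordered if name in STABLE_LAYERS)
    prompt = "\n\n".join(text for _, text in ordered)
    ledger = [{"layer": name, "chars": len(text)} for name, text in ordered]
    return prompt, hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16], ledger
```

Two turns that differ only in the user ask produce the same prefix hash, while a timestamp added to the platform layer changes it, which is the drift signal worth alerting on.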

Scoped rules are policies, not prose decorations: module 2.2 showed repository surfaces, and module 2.4 shows the runtime selector that decides which surfaces compile into today’s turn. Selectors should be conservative—inject the smallest rule set that covers the edited paths—because over-injection trains the model to ignore rules as noise.

rule_injection:
  - match:
      globs: ["src/content/docs/**"]
    rules: [".claude/rules/new-content-checklist.md"]
  - match:
      task_class: review
    rules: ["docs/quality-rubric.md"]
  - match:
      task_class: security
    rules: ["docs/security/agent-threat-model.md"]

The selector must log {rule_id, matched_glob, injected_tokens} because without logs debugging context bloat becomes guesswork during incidents. Anthropic’s Claude Code guidance recommends keeping CLAUDE.md concise and moving occasional workflows into skills loaded on demand; orchestration should mirror that split so the always-on prefix stays cache-stable while procedural depth loads only when triggers fire.
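A selector that emits those ledger fields could look like the sketch below. It flattens the configuration above into Python dicts and uses `fnmatch.fnmatchcase` from the standard library, whose `*` also matches path separators; that differs from gitignore-style globs, so treat the matching semantics as an assumption. The `rule_tokens` lookup is a hypothetical precomputed token count per rule file.

```python
from fnmatch import fnmatchcase

RULE_INJECTION = [
    {"globs": ["src/content/docs/**"], "rules": [".claude/rules/new-content-checklist.md"]},
    {"task_class": "review", "rules": ["docs/quality-rubric.md"]},
    {"task_class": "security", "rules": ["docs/security/agent-threat-model.md"]},
]


def select_rules(edited_paths, task_class, rule_tokens):
    """Return ledger rows {rule_id, matched_glob, injected_tokens} for matching rules."""
    rows, seen = [], set()
    for entry in RULE_INJECTION:
        matched = None
        # Path-based match: the first glob that covers any edited path wins.
        for glob in entry.get("globs", []):
            if any(fnmatchcase(path, glob) for path in edited_paths):
                matched = glob
                break
        # Task-class match: recorded in the same field with a prefix.
        if matched is None and "task_class" in entry and entry["task_class"] == task_class:
            matched = f"task_class:{task_class}"
        if matched is None:
            continue
        for rule in entry["rules"]:
            if rule not in seen:  # never inject the same rule file twice
                seen.add(rule)
                rows.append({"rule_id": rule, "matched_glob": matched,
                             "injected_tokens": rule_tokens.get(rule, 0)})
    return rows
```

An empty result is a valid outcome: editing a file outside every glob with no task class should inject no scoped rules at all.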

Skills are procedural context, and loading every skill at session start is just-in-case overkill. Load skills when triggers match, unload skill bodies from the prefix when the task class changes, and keep a compact index in the stable prefix so the model knows what can be loaded without paying full skill token cost up front.

| Approach | When to use | Failure mode |
|---|---|---|
| Eager skill load | tiny skill library | prefix bloat |
| Lazy skill load | large skill tree | model unaware skill exists |
| Triggered load | clear task taxonomy | misclassified intent |

A workable pattern is an index block in the prefix listing available skills with one-line descriptions while full skill bodies load on trigger, which preserves discoverability without paying thousands of tokens up front. Sub-agents described in Claude Code best practices are another form of conditional loading: they receive a narrow bundle instead of the parent transcript, which is the same orchestration boundary expressed for human-driven sessions.

Rules written as vague prose (“be careful with secrets”) are not machine-enforceable policy, while rules written as policy (“never print values matching AKIA*; use <TOKEN> placeholders”) support linting, tests, and orchestration before bytes reach the model.

| Style | Example | Orchestrator can |
|---|---|---|
| Prose | “Handle customer data responsibly” | hope |
| Policy | “Redact emails in tool logs before model injection” | regex + block |
| Policy + test | same, with CI fixture | fail closed |

Convert recurring prose rules into policy tables the harness enforces before bytes reach the model, because the model then receives already-sanitized context which is cheaper than arguing with it after the fact. Policy tables also make cross-family review possible: reviewers can diff orchestration config without reading ten thousand tokens of chat.
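As an illustration of policy enforced before bytes reach the model, the sketch below redacts two patterns from tool output. Both regexes are simplified assumptions (an AWS-style `AKIA` key ID followed by 16 uppercase alphanumerics, and a loose email shape), and the `<EMAIL>` placeholder is invented for the example; real deployments need broader, tested detectors.

```python
import re

# Illustrative policies only: (name, pattern, placeholder).
POLICIES = [
    ("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "<TOKEN>"),
    ("email", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "<EMAIL>"),
]


def sanitize(text):
    """Apply each redaction policy; return (clean_text, ledger of hit counts)."""
    ledger = []
    for name, pattern, placeholder in POLICIES:
        text, hits = pattern.subn(placeholder, text)
        if hits:
            ledger.append({"policy": name, "redactions": hits})
    return text, ledger


raw = "user ops@example.com used key AKIAABCDEFGHIJKLMNOP twice"
clean, ledger = sanitize(raw)
```

The ledger records how many redactions each policy made without storing the redacted values, so the audit trail does not become a second leak.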

Eviction, Freshness, And Multi-Agent Boundaries

Eviction is how orchestration reclaims budget without waiting for catastrophic window overflow, freshness is how orchestration decides whether to trust a remembered fact, and multi-agent boundaries are how orchestration prevents child tasks from polluting parent state. StreamingLLM research on attention sinks shows that retaining a small set of initial tokens can stabilize very long runs when using sliding windows; the lesson for harness design is not to copy KV caches literally, but to recognize that some early session anchors (task frame, load-bearing registry) should survive aggressive eviction of middle evidence that models otherwise under-attend.

Long agent sessions resemble streaming inference: middle turns pile up, attention rots, and naive sliding windows drop critical early constraints. Orchestration compensates by promoting early constraints into a durable task frame and registry, analogous to keeping sink tokens while evicting middle tool logs. Infini-attention style research (complementary to lost-in-the-middle) explores architectures that retain long-range state; until your provider exposes that transparently, harness policy is the retention layer you control today.

Retrieved snippets should carry metadata at injection time so eviction policies can reason about staleness, supersession, and budget pressure without re-parsing prose.

snippet_record:
  id: ret-9f2a
  source: vector://runbooks/deploy.md#restart
  injected_turn: 14
  tokens: 420
  freshness: 2026-05-20
  relevance_score: 0.82

Eviction candidates are evaluated each turn against the policies in the table below, and the orchestrator should log which policy fired when multiple candidates compete for the same bytes.

| Policy | Evict when | Keep when |
| --- | --- | --- |
| Staleness | freshness older than task SLA | still matches live tool verification |
| Superseded | newer snippet same topic | newer snippet lower quality |
| Low salience | relevance below floor for 3 turns | linked in load_bearing_decisions |
| Budget pressure | over allocation | promotes to task frame this turn |

Under budget pressure, evict in this order: superseded retrieval, resolved tool logs, old scratch summaries, optional examples, and only then touch semi-static session material; evict stable prefix only as a last resort and expect a cache miss tax when you do. Eviction without ledger entries is invisible in postmortems, so log {block_id, policy, tokens_freed} every time.
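The eviction order and the ledger requirement can be combined in one small routine. The sketch below is one possible shape, assuming hypothetical block `kind` labels that mirror the order in the paragraph above; kinds not listed (such as the task frame) are never evicted here.

```python
from __future__ import annotations

from dataclasses import dataclass

# Eviction order from the text, cheapest loss first. Kind names are illustrative.
EVICTION_ORDER = [
    "superseded_retrieval",
    "resolved_tool_log",
    "scratch_summary",
    "optional_example",
    "semi_static_session",
    "stable_prefix",  # last resort: expect a cache miss after evicting this
]


@dataclass
class Block:
    block_id: str
    kind: str
    tokens: int


def evict_to_budget(blocks: list[Block], budget: int) -> tuple[list[Block], list[dict]]:
    """Evict in priority order until the total fits; log every eviction."""
    kept = list(blocks)
    log: list[dict] = []
    total = sum(b.tokens for b in kept)
    for kind in EVICTION_ORDER:
        for block in [b for b in kept if b.kind == kind]:
            if total <= budget:
                return kept, log
            kept.remove(block)
            total -= block.tokens
            log.append({
                "block_id": block.block_id,
                "policy": f"budget_pressure:{kind}",
                "tokens_freed": block.tokens,
            })
    return kept, log
```

Returning the log alongside the surviving blocks keeps the eviction decision and its ledger entry in the same code path, so one cannot happen without the other.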

Memory is not truth—it is a cached claim with an owner. Require source, scope, captured_at, and verification_method on memory writes, and at read time orchestration should ask whether scope is still valid (user, repo, tenant), whether a fresher tool or repo source exists, and whether a deletion event invalidated the memory. If a fresher source exists, prefer re-verify over trusting memory, because module 2.3’s cross-user leakage scenario is what happens when this check is skipped. Memory should enter the prompt as a cited claim with freshness metadata, not as omniscient narrative authority.

| Signal | Action |
| --- | --- |
| Live tool contradicts memory | drop memory for this turn, log conflict |
| Repo file changed since memory | JIT re-read targeted file |
| Memory older than SLA | retrieve or tool-verify |
| Memory matches tool + repo | allow with citation |

Orchestration should surface conflicts to the model as structured deltas, not silent overwrites, because silent overwrite teaches the harness to lie confidently while looking efficient on token graphs. A structured delta might be: “memory says deployment freeze active; tool deploy_status reports rollout completed 10:05Z; using tool, archiving memory with conflict flag.”
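The signal table and the structured delta can be expressed as a read-time check. This is a sketch under stated assumptions: the `MemoryClaim` fields follow the required metadata named above, while the action labels, the `check_memory` signature, and the choice to allow a memory with citation when no tool evidence exists within the SLA are illustrative decisions, not a fixed contract.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MemoryClaim:
    key: str
    value: str
    source: str
    scope: str
    captured_at: float          # epoch seconds
    verification_method: str


def check_memory(claim: MemoryClaim, *, now: float, sla_sec: float,
                 tool_value: str | None = None,
                 repo_changed_at: float | None = None) -> tuple[str, dict | None]:
    """Return (action, delta); delta is set only when a live tool contradicts memory."""
    if tool_value is not None and tool_value != claim.value:
        delta = {
            "key": claim.key,
            "memory": claim.value,
            "tool": tool_value,
            "using": "tool",
            "memory_status": "archived_with_conflict_flag",
        }
        return "drop_memory_log_conflict", delta
    if repo_changed_at is not None and repo_changed_at > claim.captured_at:
        return "jit_reread_file", None
    if now - claim.captured_at > sla_sec:
        return "reverify", None
    return "allow_with_citation", None
```

The delta dictionary is what the model should see on a conflict turn, so the overwrite is visible in both the prompt and the ledger.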

Multi-step tasks invite sub-agents, but without boundaries children inherit parent bloat and return unmergeable essays. Child bundles should include a narrow task frame, file allowlist, token ceiling, and explicit output schema, while excluding parent chat logs and parent retrieval hits unless converted into short evidence cards with provenance.

Parent session
  |
  +-- spawn child with bundle:
  |     task_frame (narrow)
  |     file allowlist
  |     token ceiling
  |     no parent chat log
  |
  +-- child returns:
        patch summary
        test results
        open questions
        |
        v
Parent merges child contract into evidence bucket

Spawn fresh when the subtask is independently reviewable or needs a clean cache prefix, and continue in-process when the subtask is a single tool call’s worth of work. The parent merge step should validate child output against schema before appending to evidence, rejecting essays that ignore the contract.
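The merge-time validation step could be as small as the sketch below. The three required fields come from the child return contract in the diagram; the field types, the error labels, and the whitespace-based token estimate are assumptions made for the example (a real harness would use its provider's tokenizer).

```python
import json

# Required child output fields and their expected types (types are an assumption).
REQUIRED_FIELDS = {"patch_summary": str, "test_results": list, "open_questions": list}


def rough_tokens(text: str) -> int:
    """Crude whitespace proxy for token count; not a real tokenizer."""
    return len(text.split())


def validate_child_output(output: dict, token_ceiling: int) -> list[str]:
    """Return a list of contract violations; an empty list means the output is mergeable."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in output:
            errors.append(f"missing:{name}")
        elif not isinstance(output[name], expected):
            errors.append(f"wrong_type:{name}")
    extra = sorted(set(output) - set(REQUIRED_FIELDS))
    if extra:
        errors.append("unexpected:" + ",".join(extra))
    if rough_tokens(json.dumps(output)) > token_ceiling:
        errors.append("over_token_ceiling")
    return errors


def merge_child(evidence: list, output: dict, token_ceiling: int = 400) -> tuple[bool, list[str]]:
    """Append to the evidence bucket only when the child honored the contract."""
    errors = validate_child_output(output, token_ceiling)
    if errors:
        return False, errors
    evidence.append(output)
    return True, []
```

Rejected outputs never touch the evidence bucket, so a child that returns an essay costs the parent a retry rather than a budget overrun.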

| Use fresh child | Continue parent |
| --- | --- |
| parallel file refactors | one-file typo fix |
| cross-family review | formatting pass |
| long research branch | reread single constant |

Child prompts should not include parent retrieval hits unless converted into a short evidence card with provenance, otherwise you duplicate chunks under different message IDs and confuse eviction logic. Treat child sessions like microservices: contracts, timeouts, and idempotent merges, not like threads that share all memory by default.

If you cannot explain why a byte was present, you cannot operate dynamic orchestration in production, because observability turns context from a mystery-meat prompt into an auditable compile artifact. The minimum viable observability stack is a per-turn ledger in JSON, provider cache counters, and a diff of the stable-prefix hash between turns. Anything less leaves you tuning prompts during incidents.

Append a per-turn ledger alongside the model call so on-call engineers can answer “why was this in context?” without downloading full prompts containing customer data.

{
  "turn": 18,
  "intent": "debug",
  "budget": {"target_input": 52000, "actual_input": 49812, "output_reserve": 6000},
  "blocks": [
    {"kind": "stable_prefix", "tokens": 13840, "cache": "hit"},
    {"kind": "task_frame", "tokens": 2100, "cache": "n/a"},
    {"kind": "tool_output", "id": "kubectl_describe_pod", "tokens": 6200, "fresh": true},
    {"kind": "retrieval", "id": "ret-9f2a", "tokens": 420, "evicted_next_turn": false}
  ],
  "decisions": ["skipped_retrieval: fresh_tool_cache"]
}

The ledger answers “why is this in my context?” without reading the entire prompt, and redaction classes let you store hashes and source URIs in centralized logs while keeping raw text in the customer environment only. Pair ledgers with trace IDs shared across sub-agents so parent merges can reference child ledger slices.
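One way to keep raw text out of centralized logs is to record a content hash and a source URI per block. The sketch below assumes hypothetical helper names (`ledger_block`, `ledger_line`) and a hypothetical `redaction_class` label; only `hashlib` and `json` from the standard library are used.

```python
import hashlib
import json


def ledger_block(kind: str, text: str, source_uri: str,
                 redaction_class: str, tokens: int) -> dict:
    """Describe a context block without storing its raw text."""
    return {
        "kind": kind,
        "tokens": tokens,
        "source": source_uri,
        "redaction_class": redaction_class,
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


def ledger_line(turn: int, trace_id: str, blocks: list) -> str:
    """Serialize one turn's ledger entry as a single JSON line."""
    return json.dumps({"turn": turn, "trace_id": trace_id, "blocks": blocks},
                      sort_keys=True)


if __name__ == "__main__":
    block = ledger_block("tool_output", "customer jane@example.com failed login",
                         "tool://kubectl_describe_pod", "pii", 12)
    print(ledger_line(18, "trace-abc", [block]))
```

The hash lets two turns be compared for identical content, and the shared `trace_id` lets a parent ledger reference the child ledger slices it merged.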

Track these series per workflow and review them weekly, not only during incidents, because slow drift in retrieval tokens per turn is easier to fix before it becomes a mandatory compaction spiral.

| Metric | Formula / source | Healthy signal |
| --- | --- | --- |
| Cache hit rate | cache_read / (cache_read + cache_create) | stable on repeated prefix |
| Miss after idle | misses where idle_gap_sec > TTL | near zero if sessions continuous |
| Retrieval tokens / turn | sum retrieval bucket | flat or falling with JIT |
| Eviction count | evicted blocks per turn | rises under pressure, not always zero |
| Compaction savings | tokens before - after | positive when logs verbose |

Alert on prefix drift: sudden drop in cache hits with unchanged task shape, which often means someone injected a dynamic header above the stable prefix. Dashboards should segment by task class (debug, review, implement) because optimal retrieval budgets differ: debug may tolerate large logs briefly, while review should cap logs and emphasize rubric injection.
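The cache hit rate formula and the prefix drift alert can both be computed from data the harness already has. This is a minimal sketch: the function names are hypothetical, and `cache_read` and `cache_create` stand for whatever cached-read and cache-write token counters your provider reports.

```python
import hashlib


def cache_hit_rate(cache_read: int, cache_create: int) -> float:
    """cache_read / (cache_read + cache_create), or 0.0 when nothing was cached."""
    total = cache_read + cache_create
    return cache_read / total if total else 0.0


def prefix_hash(prefix: str) -> str:
    """Hash the exact bytes of the stable prefix for turn-to-turn comparison."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def prefix_drift_turns(prefixes: list[str]) -> list[int]:
    """Return 1-based turn numbers whose prefix hash differs from the previous turn."""
    hashes = [prefix_hash(p) for p in prefixes]
    return [i + 1 for i in range(1, len(hashes)) if hashes[i] != hashes[i - 1]]


if __name__ == "__main__":
    turns = ["rules v1", "rules v1", "2026-05-20T10:00Z rules v1"]
    print(cache_hit_rate(9000, 1000), prefix_drift_turns(turns))
```

A drift turn that coincides with a drop in hit rate points at a prefix edit rather than a TTL expiry, which narrows the incident search considerably.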

Bloat hunts follow a consistent order. Sort ledger blocks by tokens descending. Flag blocks without load_bearing linkage or active tool dependency. Check for duplicate retrieval on the same source. Check for tool logs older than the last successful command. Check for skills or rules loaded but not referenced in the last three turns.
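The bloat hunt checklist translates into a single pass over ledger blocks. The sketch below assumes a hypothetical block shape (`id`, `kind`, `tokens`, `injected_turn`, optional `source`, `active_tool_dependency`, and `last_referenced_turn`) and hypothetical reason labels; adapt the field names to your own ledger.

```python
def bloat_report(blocks, current_turn, load_bearing_ids, last_success_turn):
    """Walk blocks largest-first and return (id, tokens, reasons) for suspects."""
    findings = []
    seen_sources = set()
    for block in sorted(blocks, key=lambda b: b["tokens"], reverse=True):
        reasons = []
        linked = block["id"] in load_bearing_ids
        if not linked and not block.get("active_tool_dependency", False):
            reasons.append("no_load_bearing_link")
        source = block.get("source")
        if block["kind"] == "retrieval" and source is not None:
            if source in seen_sources:
                reasons.append("duplicate_retrieval")
            seen_sources.add(source)
        if block["kind"] == "tool_output" and block["injected_turn"] < last_success_turn:
            reasons.append("log_older_than_last_success")
        if block["kind"] in {"skill", "rule"}:
            last_ref = block.get("last_referenced_turn", block["injected_turn"])
            if current_turn - last_ref > 3:
                reasons.append("unreferenced_3_turns")
        if reasons:
            findings.append((block["id"], block["tokens"], reasons))
    return findings
```

Because the walk is sorted by token count, the first finding is usually the cheapest fix with the largest payoff.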

Hypothetical scenario: Turn 25 is slow and expensive. The ledger shows 19,000 tokens of tool output labeled fresh: true but the commands succeeded ten turns ago. The fix is not a better model but an orchestration freshness bug that never flipped resolved_errors: true. Add a unit test that simulates a resolved failure and asserts the freshness flag clears on the next turn.
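The unit test suggested above might look like the following sketch. The `ToolLog` record and `refresh_flags` helper are hypothetical stand-ins for whatever your harness uses to track tool output; the point is only that a later success for the same command must clear `fresh` and set `resolved_errors`.

```python
from dataclasses import dataclass


@dataclass
class ToolLog:
    command: str
    tokens: int
    failed: bool
    fresh: bool = True
    resolved_errors: bool = False


def refresh_flags(logs, latest_command, latest_failed):
    """When a previously failing command now succeeds, mark its old logs resolved."""
    for log in logs:
        if log.command == latest_command and log.failed and not latest_failed:
            log.resolved_errors = True
            log.fresh = False


def test_resolved_failure_clears_freshness():
    logs = [ToolLog("pytest tests/test_deploy.py", 6000, failed=True)]
    refresh_flags(logs, "pytest tests/test_deploy.py", latest_failed=False)
    assert logs[0].resolved_errors is True
    assert logs[0].fresh is False


def test_repeat_failure_keeps_log_fresh():
    logs = [ToolLog("pytest tests/test_deploy.py", 6000, failed=True)]
    refresh_flags(logs, "pytest tests/test_deploy.py", latest_failed=True)
    assert logs[0].fresh is True
```

Running both cases guards against the opposite bug as well, where a still-failing command has its evidence evicted too early.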

The patterns below are production defaults that survived multi-week agent sessions, while the anti-patterns are shortcuts that look fine in demos and fail under week-long sessions with real tool output and human pauses.

| Pattern | When to use | Why it works | Scaling note |
| --- | --- | --- | --- |
| Turn compiler with ledger | any production harness | makes policy explicit and measurable | store ledgers in object storage with retention |
| Hybrid JIT/JIC staging | coding agents | balances latency and budget | tune per task class |
| Load-bearing decision registry | long sessions | prevents compaction amnesia | sync to issue tracker on handoff |
| TTL-aware session pacing | cost-sensitive teams | aligns human pauses with cache economics | document ethical ping policy |
| Narrow child bundles | parallel subtasks | controls fan-out bloat | cap concurrent children |

| Anti-pattern | Why teams pick it | What breaks | Better move |
| --- | --- | --- | --- |
| Infinite chat history | simplest transport | cache miss + attention rot | compaction + handoff |
| Timestamped stable prefix | observability habit | cache never hits | log time outside prefix |
| Retrieval as default filler | feels safer than empty context | noise drowns task frame | lazy retrieval with budget |
| Parent log inheritance for children | easier spawn code | unreviewable child prompts | child task contract only |
| Memory without verification | speed | stale or cross-tenant facts | re-verify against tool/repo |
| Compaction without promotion | token panic | loses acceptance criteria | promote load-bearing items first |

Use this flow when designing or reviewing orchestration policy, and treat each diamond as a configuration knob you can test in simulation before shipping to users.

flowchart TD
A[New turn] --> B{Over budget?}
B -- yes --> C[Evict superseded retrieval + resolved logs]
B -- no --> D{Need new external fact?}
C --> D
D -- yes --> E{Fresh tool answer exists?}
D -- no --> F[Assemble prefix + task + ask]
E -- yes --> F
E -- no --> G[JIT retrieve within cap]
G --> F
F --> H{Idle gap > cache TTL?}
H -- yes --> I[Choose: handoff new session or accept miss]
H -- no --> J[Call model + log ledger]
I --> J

| Question | If yes | If no |
| --- | --- | --- |
| Will this block be needed on the next turn? | keep in evidence | summarize or evict |
| Is it durable beyond the task? | migrate to repo/memory | keep session-local |
| Does it change tool/file permissions? | promote to load-bearing list | treat as narrative |
| Is it already in stable prefix? | do not duplicate in retrieval | inject or refresh |
| Will a child need parent chat? | export evidence card | spawn narrow bundle |

  1. Anthropic’s prompt caching documentation states that the default ephemeral cache has a five-minute lifetime, refreshed when cached content is reused, with optional longer TTL at additional cost. Source: Anthropic Prompt Caching.

  2. OpenAI’s prompt caching guide notes that cache hits require exact prefix matches, recommends static content before variable user content, and reports that caching can reduce latency by up to 80% and input token costs by up to 90% for eligible workloads. Source: OpenAI Prompt Caching.

  3. Anthropic documents a 20-block lookback window when matching cache breakpoints in growing conversations — if your breakpoint drifts too far, earlier cache writes fall out of range and you pay fresh processing. Source: Anthropic Prompt Caching — Structuring your prompt.

  4. Liu et al.’s “Lost in the Middle” work shows that models often under-use information placed in the middle of long contexts, which is why orchestration should keep load-bearing constraints at the start or end of the window, in the task frame, not buried inside verbose tool logs; RECOMP adds that retrieval summaries can preserve quality at a fraction of token cost when compression is harness-owned. Sources: arXiv:2307.03172, arXiv:2310.04408.

Teams new to orchestration often copy chat UI behavior into backend harnesses, which guarantees cache misses and attention rot at scale. The table lists frequent failures; the paragraphs after it explain how to institutionalize fixes so they survive the next hire.

| Mistake | Why It Happens | How to Fix It |
| --- | --- | --- |
| Treating chat history as the orchestration policy | default UI behavior | implement turn compiler + ledger |
| Appending dynamic headers above stable prefix | debugging convenience | log timestamps outside prefix |
| Never evicting retrieved snippets | fear of missing context | staleness + superseded rules |
| Compacting without promoting acceptance criteria | token panic | load_bearing_decisions registry |
| Spawning sub-agents with full parent transcripts | quick copy-paste | child task contract + allowlist |
| Ignoring cache TTL during human review pauses | focus on code not economics | handoff or accept miss explicitly |
| Trusting memory without re-verify | memory feels authoritative | tool/repo freshness checks |
| No telemetry on injected blocks | privacy or effort | token attribution ledger per turn |

Treating chat history as policy is the most expensive mistake because it hides inside “the model forgot.” Replace implicit history with an explicit turn compiler, ledger, and promotion registry. Add CI fixtures that assert eviction runs after resolved failures.

Timestamped stable prefixes feel like observability wins but destroy economics. Log time outside the prefix and correlate with trace IDs instead.

Compaction without promotion is a merge-risk event. Treat missing load-bearing lines in summaries as a build failure, not as acceptable lossy compression.

Parent log inheritance for children turns parallel speedups into unmergeable noise. Use child contracts and schema-validated outputs.

Ignoring TTL during human review should trigger a conscious choice: handoff, accept miss, or shrink prefix. Do not treat the first turn after lunch as a surprise bill.

Your agent’s cache hit rate collapses on turn 30 even though the repository rules did not change, because the only code change appended an ISO timestamp to the system message each turn. What should you change first?

Answer

Move per-turn timestamps out of the stable prefix into the dynamic tail or external logs.

Anthropic and OpenAI both emphasize exact prefix matching for cache hits.

A timestamp in the system block changes the prefix hash every turn, which forces cache creation or uncached processing.

Keep observability without mutating cache-stable bytes.

Hypothetical scenario: Turn 40 still includes a 6,000-token stack trace from a fixed test failure, and the agent keeps citing the old error even though the latest test run passed. Which orchestration rule failed?

Answer

Resolved-error eviction failed.

Tool output from fixed failures should be summarized into a short “was failing, now passing” note or removed.

The model is attending to stale evidence because the harness never marked the log as superseded.

Freshness metadata and resolved_errors: true triggers prevent this.

A team pauses sessions for code review meetings that last 25 minutes while Anthropic ephemeral cache TTL is five minutes. What are two legitimate orchestration responses?

Answer

First, start a new session after review with a structured handoff that promotes load-bearing decisions without replaying the entire transcript.

Second, accept cache miss cost but shrink the stable prefix so reprocessing is cheaper.

Optional heartbeat pings may refresh TTL but should be an explicit policy with cost and rate-limit review, not an accidental loop.

You spawn three sub-agents to refactor separate packages, each child returns a 4,000-token essay, and the parent session exceeds budget on merge. What boundary change helps most?

Answer

Narrow the child output contract to patch summary, test commands run, and open questions — not narrative essays.

Children should not inherit parent retrieval chunks; they should receive allowlisted paths and a token ceiling.

The parent merges structured evidence cards, which eviction logic can rank and drop safely.

Retrieval keeps injecting the same deployment runbook chunk every turn even though the agent already read the live deployment status via a tool. How should orchestration gate retrieval?

Answer

Skip retrieval when a fresh tool answer covers the same intent, logging retrieval_skipped: fresh_tool_cache.

This is lazy retrieval plus tool gating.

It saves budget and reduces contradictory evidence.

Re-open retrieval only if the tool answer is stale or contradictory.

After compaction, the agent forgets it must not commit .pipeline/state.yaml even though the rule was only mentioned in turn 3 conversation prose. What promotion step was skipped?

Answer

The harness compacted without promoting a load-bearing decision into the task frame or registry.

Compaction must lift constraints that affect permissions or merge requirements into durable session structures before deleting raw prose.

Hypothetical scenario: Turn 22 is slow, logs show 0 cache read tokens and 24,000 cache creation tokens, and idle gap was only 2 minutes. What else should you inspect besides TTL expiry?

Answer

Inspect prefix drift, breakpoint placement, and minimum cacheable length.

TTL is not the only miss cause.

A changed tool schema, modified system block, or breakpoint beyond the 20-block lookback can produce expensive turns even with short idle gaps.

Use the ledger to see which block changed first.

Your organization wants observability without exposing customer content in logs. Which ledger fields balance debuggability and privacy?

Answer

Log block kind, token counts, cache hit/miss, rule IDs, retrieval source hashes, and redaction class — not raw customer text.

Attach provenance pointers so reviewers can open authorized sources in a secure environment.

This answers “why was this in context?” without copying PII into telemetry.

Hands-On Exercise: Build A Context Budgeter

You will implement a small turn simulator that applies orchestration policy against a fixed token budget, logs cache hits and misses using a five-minute TTL, and demonstrates improved cache hit rate after eviction-by-staleness. Use .venv/bin/python from the repository root (never bare python3) so results match the repository virtual environment used in CI gates.

Create a working directory and save the harness below as context_budgeter.py, then run all phases from that directory so SQLite paths stay relative and reproducible.

#!/usr/bin/env python3
"""Context budgeter lab: simulates per-turn orchestration policy."""
from __future__ import annotations

import json
import sqlite3
import time
from dataclasses import dataclass, field
from pathlib import Path

DB_PATH = Path("context_budgeter.sqlite")
TURN_BUDGET = 8000
STABLE_PREFIX_TOKENS = 3200
CACHE_TTL_SEC = 300  # Anthropic ephemeral default: 5 minutes
RETRIEVAL_COST = 900
TOOL_LOG_COST = 1100
SUMMARY_COST = 350


@dataclass
class Snippet:
    snippet_id: str
    topic: str
    tokens: int
    captured_at: float
    stale_after_sec: int = 120


@dataclass
class SessionState:
    turn: int = 0
    tokens_used: int = 0
    cache_written_at: float | None = None
    cache_hit: bool = False
    snippets: list[Snippet] = field(default_factory=list)
    tool_log_tokens: int = 0
    summaries: list[str] = field(default_factory=list)


def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS turn_log (
            turn INTEGER,
            action TEXT,
            tokens INTEGER,
            cache_hit INTEGER,
            total_tokens INTEGER,
            note TEXT
        )
        """
    )
    conn.commit()
    return conn


def cache_valid(state: SessionState, now: float) -> bool:
    """True when the prefix cache was written or reused within the TTL."""
    if state.cache_written_at is None:
        return False
    return (now - state.cache_written_at) <= CACHE_TTL_SEC


def evict_stale_snippets(state: SessionState, now: float) -> int:
    """Drop snippets older than their stale_after_sec; return tokens freed."""
    kept: list[Snippet] = []
    freed = 0
    for snip in state.snippets:
        age = now - snip.captured_at
        if age > snip.stale_after_sec:
            freed += snip.tokens
            continue
        kept.append(snip)
    state.snippets = kept
    return freed


def simulate_turn(
    conn: sqlite3.Connection,
    state: SessionState,
    *,
    now: float,
    action: str,
    inject_snippet: Snippet | None = None,
    add_tool_log: bool = False,
    compact: bool = False,
    idle_gap_sec: float = 0,
) -> None:
    state.turn += 1
    state.tokens_used = 0
    state.cache_hit = False
    note_parts: list[str] = []
    if idle_gap_sec:
        now += idle_gap_sec
        note_parts.append(f"idle_gap={idle_gap_sec}s")
    if cache_valid(state, now):
        # Cached prefix reads are modeled at 10% of the uncached cost.
        state.tokens_used += int(STABLE_PREFIX_TOKENS * 0.1)
        state.cache_hit = True
        # Reuse refreshes the ephemeral TTL, so restart the clock on every hit.
        state.cache_written_at = now
        note_parts.append("cache_hit")
    else:
        state.tokens_used += STABLE_PREFIX_TOKENS
        state.cache_written_at = now
        note_parts.append("cache_miss")
    freed = evict_stale_snippets(state, now)
    if freed:
        note_parts.append(f"evicted_stale={freed}")
    if compact and state.tool_log_tokens:
        state.tokens_used += SUMMARY_COST
        state.tool_log_tokens = 0
        state.summaries.append(f"summary@turn{state.turn}")
        note_parts.append("compacted_tool_log")
    if add_tool_log:
        state.tool_log_tokens = TOOL_LOG_COST
        state.tokens_used += TOOL_LOG_COST
        note_parts.append("tool_log")
    if inject_snippet:
        # Append only; the loop below charges every retained snippet exactly once.
        state.snippets.append(inject_snippet)
        note_parts.append(f"inject:{inject_snippet.snippet_id}")
    for snip in state.snippets:
        state.tokens_used += snip.tokens
    over_budget = state.tokens_used > TURN_BUDGET
    if over_budget:
        note_parts.append("OVER_BUDGET")
    conn.execute(
        "INSERT INTO turn_log (turn, action, tokens, cache_hit, total_tokens, note) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            state.turn,
            action,
            state.tokens_used,
            int(state.cache_hit),
            state.tokens_used,
            ";".join(note_parts),
        ),
    )
    conn.commit()


def report(conn: sqlite3.Connection) -> None:
    rows = conn.execute(
        "SELECT turn, action, tokens, cache_hit, note FROM turn_log ORDER BY turn"
    ).fetchall()
    hits = sum(1 for row in rows if row[3])
    print("turn | action | tokens | cache_hit | note")
    for turn, action, tokens, cache_hit, note in rows:
        print(f"{turn:4} | {action:16} | {tokens:6} | {cache_hit:9} | {note}")
    rate = hits / len(rows) if rows else 0.0
    print(f"cache_hit_rate={rate:.2f} ({hits}/{len(rows)})")


def main() -> None:
    # Start from an empty log so reruns are reproducible.
    if DB_PATH.exists():
        DB_PATH.unlink()
    conn = init_db()
    state = SessionState()
    t0 = time.time()
    # Phase A: prime the cache, then inject a snippet that is already stale
    simulate_turn(conn, state, now=t0, action="prime", add_tool_log=True)
    simulate_turn(
        conn,
        state,
        now=t0 + 30,
        action="retrieve_old",
        inject_snippet=Snippet("r1", "deploy", RETRIEVAL_COST, captured_at=t0 - 200),
        add_tool_log=True,
    )
    simulate_turn(conn, state, now=t0 + 60, action="followup", add_tool_log=True)
    # Phase B: compact the tool log, then inject a fresh snippet
    simulate_turn(conn, state, now=t0 + 90, action="compact", compact=True)
    simulate_turn(
        conn,
        state,
        now=t0 + 120,
        action="retrieve_fresh",
        inject_snippet=Snippet("r2", "deploy", RETRIEVAL_COST, captured_at=t0 + 120),
    )
    simulate_turn(conn, state, now=t0 + 150, action="steady", idle_gap_sec=0)
    # Phase C: TTL miss after an idle gap longer than 5 minutes
    simulate_turn(conn, state, now=t0 + 180, action="pre_idle", idle_gap_sec=0)
    simulate_turn(conn, state, now=t0 + 200, action="post_idle", idle_gap_sec=400)
    report(conn)
    print(json.dumps({"db": str(DB_PATH), "ttl_sec": CACHE_TTL_SEC}, indent=2))


if __name__ == "__main__":
    main()
  • Run .venv/bin/python context_budgeter.py from your lab directory.
  • Capture the printed table and cache_hit_rate from phase A turns.
  • Identify which turns are cache misses and which notes explain them.
  • Record total tokens per turn in a scratch file.
  • Re-run the script after reading how evict_stale_snippets uses stale_after_sec.
  • Confirm turns where evicted_stale appears in the note column.
  • Compare token totals before and after eviction turns.
  • Write one sentence on how eviction prevented OVER_BUDGET if applicable.
  • Modify only the final turn’s idle_gap_sec to 120 and rerun.
  • Modify it to 400 again and rerun.
  • Tabulate cache hit rate for both idle gaps.
  • Relate results to Anthropic’s five-minute ephemeral TTL documentation.
  • Add a skip_retrieval_if_tool_log flag to simulate_turn and short-circuit injection when true.
  • Run a three-turn scenario where turn 2 sets the flag and retrieval would have duplicated tool knowledge.
  • Log retrieval_skipped in the note field.
  • Compare token spend against the duplicate retrieval run.

Part E: Deliverable — Three-Turn Improvement Trace

  • Produce a three-turn trace (turns you choose) where eviction-by-staleness yields a higher cache hit rate or lower tokens than a baseline without eviction.
  • Paste the turn_log rows and computed cache_hit_rate.
  • Add a five-line “policy README” describing JIT retrieval, eviction order, and TTL handling.
Solution sketch (policy README + sample interpretation)

Policy README (example):

  1. Keep STABLE_PREFIX_TOKENS byte-stable; never append per-turn clocks to the prefix.
  2. Evict retrieval snippets when now - captured_at > stale_after_sec before adding new snippets.
  3. Compact resolved tool logs into summaries once failures are fixed.
  4. Skip retrieval when a fresh tool log already answers the topic.
  5. After idle gaps greater than 300 seconds, expect cache misses and optionally start a handoff session.

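In the provided script, evicted_stale notes appear on the followup turn, because snippet r1 is already older than its stale_after_sec when it is injected, and on the post_idle turn, where r2 ages out during the idle gap. The followup turn's token total drops because r1 is no longer charged.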

The final post_idle turn should miss cache because idle_gap_sec=400 exceeds CACHE_TTL_SEC=300.

If your three-turn improvement trace does not beat baseline, tighten stale_after_sec or lower RETRIEVAL_COST until eviction frees enough budget for steady cache hits on turns 2-4.

  • SQLite log contains at least six turns with action, tokens, cache_hit, and note fields.
  • You recorded cache hit rate for baseline and improved three-turn traces.
  • You demonstrated stale snippet eviction in notes (evicted_stale=...).
  • You explained one TTL miss using the five-minute reference.
  • You implemented or documented skip_retrieval_if_tool_log behavior.
  • Your policy README lists eviction order under budget pressure.

The Context arc ends here.

Continue to Harness Fundamentals — Layers and System of Record (Module 3.1 in the AI Engineering Foundations index), where prompt and context policies become durable gates, observability contracts, and team-wide harness mechanics instead of per-session improvisation.