@changkun
Last active April 17, 2026 09:57

Agent harness design: trade-off analysis

An agent harness is the runtime that sits between a language model and the world — it manages the conversation loop, dispatches tool calls, enforces context and trust boundaries, persists session state, and shepherds work across failures. "Agent harness" is not a single pattern; it is a design space with several largely independent axes. Choices on one axis constrain others, and the sharpest failures in production come from the interactions rather than from any single dimension in isolation.

A harness also does not exist in isolation. It depends on — and is co-designed with — a broader ecosystem: identity and delegation systems, storage substrates, proxies and gateways, network policy enforcement, observability pipelines, evaluation infrastructure, model lifecycle tooling, supply-chain controls, policy engines and approval flows, cost and quota systems, long-term memory stores, and compliance mechanisms. A harness that wins on its internal axes but has no eval suite, no policy engine, and no cost ceiling is still a fragile production system.

This document is organized around a four-primitive framing introduced in Part 0: the ecosystem has converged on four separable primitives — sandbox fabric, persistence fabric, identity fabric, and agent harness — connected by a small number of contractual touch points. The 33 design axes that follow each own (or cross) one of these primitives. Part I (§§1–21) and Part II (§§22–33) are retained as a secondary organization by risk timing — runtime-internal concerns versus ecosystem-integration concerns — but the primary organizing concept is primitive ownership. Each section is tagged with its owning primitive and its relevant touch points. This document is intended as a design reference, not as justification for a specific implementation.


Part 0 — The four-primitive framing

The agent-platform ecosystem has converged — across E2B, Daytona, Modal, Namespace, Blaxel, Fly, Replit, Anthropic, OpenAI, LangGraph, Google ADK, and the dedicated identity vendors (Okta, Auth0, Keycloak, AWS IAM, GCP IAM, SPIFFE/SPIRE, Cedar, Oso, Permit.io) — on four separable primitives, not three. Most observable failures happen not inside a primitive but at the contracts between them. Framing the design space by ownership and touch points sharpens which failures belong to which team and which interfaces need the most scrutiny. Identity is included as a first-class primitive — not a touch point — because it is the substrate on which tenancy, resource ownership, isolation guarantees, sharing semantics, and audit attribution are all built. Treating identity as a cross-cutting concern rather than as a substrate produces the ambient-authority and cross-tenant-bleed failures documented throughout Part II.

The four primitives

| Primitive | What it owns | Consumed as | Shipping examples |
| --- | --- | --- | --- |
| Sandbox fabric | Isolation runtime (gVisor, Kata, Firecracker), pod lifecycle, gRPC / FS daemon, bootstrap spec, cold-start pool, network / egress policy | Capability request: "I need gVisor + 4 GB + egress to npm" | E2B [115], Daytona [59], Modal [116], Namespace [71], Blaxel, Fly Sprites [61] |
| Persistence fabric | Storage substrate (PVC, CoW, content-addressed, snapshot), durable event log, retention / TTL, tamper-evidence, cross-zone availability, snapshot / branch / fork, artifact blob store, memory backend | Durable-state contract: append, read, snapshot, fork | Postgres event logs, Neon branching [136], S3, Pinecone [103] / pgvector [104] / Chroma [105], Firecracker snapshots [69] |
| Identity fabric | IdP + token broker + ABAC/RBAC engine; resolves callers to (user_id, tenant_id, agent_id, workload_id, scopes); substrate for tenancy, resource ownership, isolation guarantees, sharing semantics, audit attribution, cost chargeback | Identity resolution: "who is this caller, what tenant, what attributes?" | Okta (incl. Okta for AI Agents [153]), Auth0 (incl. agent-identity work [68]), Keycloak [183], AWS IAM [184], GCP IAM [185], SPIFFE/SPIRE [66], Cedar [101] / Cedar-Agent + OPAL, Oso [186], Permit.io [187] |
| Agent harness | Session loop, context management, tool dispatch, HITL policy, LLM client, MCP handling, durability model on top of persistence fabric | Session API (create, resume, interrupt, fork) | Claude Agent SDK, LangGraph [119], OpenAI Agents SDK [122], Google ADK |

Touch points — the contracts between primitives

Touch points are where production platforms actually fail. Each touch point is a contract that crosses a primitive boundary; if a contract is ambiguous, failures concentrate at that seam.

  • Capability request (harness ↔ sandbox). "Give me an environment matching this spec." Covers isolation class, resource shape, network policy, bootstrap inputs.
  • Durable state (harness ↔ persistence). Append-event, read-session, snapshot, fork, retention. Covers session log, workspace, memory, artifacts.
  • Identity resolution (harness / sandbox / persistence ↔ identity). "Who is this caller? What tenant? What scopes, attributes, delegation chain?" Every other contract relies on this one having a correct, auditable answer.
  • Audit event stream (cross-cutting — all four primitives emit to the audit sink). Covers observability, tamper-evidence, PII handling, compliance. Every audit event carries an identity-fabric attribution.
  • Egress / policy (sandbox ↔ external). Network controls, credential injection, domain allowlists.
  • Transport (harness ↔ client). Session events, SSE / WebSocket / webhook, steering.
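The contracts above lend themselves to small typed request objects. A minimal Python sketch, with hypothetical names throughout (CapabilityRequest, DurableStateEvent, IdentityContext, and every field are illustrative, not any vendor's API):

```python
from dataclasses import dataclass

# Hypothetical sketch of three touch-point contracts as typed request objects.
# Every name and field here is illustrative, not a shipping API.

@dataclass(frozen=True)
class IdentityContext:
    """Identity-resolution result every other contract relies on."""
    user_id: str
    tenant_id: str
    agent_id: str
    scopes: frozenset[str]

@dataclass(frozen=True)
class CapabilityRequest:
    """Harness -> sandbox fabric: give me an environment matching this spec."""
    isolation_class: str                 # e.g. "gvisor", "kata", "firecracker"
    memory_gb: int
    egress_allowlist: tuple[str, ...]    # domain allowlist for the pod
    bootstrap_ref: str                   # pinned bootstrap spec (content hash)

@dataclass(frozen=True)
class DurableStateEvent:
    """Harness -> persistence fabric: append-only and identity-attributed."""
    session_id: str
    kind: str                            # "user_msg" | "tool_call" | "tool_result"
    payload: str
    identity: IdentityContext            # attribution travels with every event

req = CapabilityRequest("gvisor", 4, ("registry.npmjs.org",), "sha256:...")
print(req.isolation_class, req.memory_gb)
```

Note that the identity context rides on every durable event, which is how the audit-event contract gets its identity-fabric attribution without a separate lookup.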

Axis classification by owning primitive + touch points

| Primitive | Axes |
| --- | --- |
| Sandbox fabric | §9 Security (partly), §16 Sandbox lifecycle, §25 Network policy, §29 Supply chain |
| Persistence fabric | §3 Session durability (partly), §4 Workspace persistence, §17 Session branching, §21 Artifact surface, §23 Storage substrate, §32 Long-term memory (+ identity), §33 Audit (+ identity) |
| Identity fabric | §22 Identity, authorization, and delegation (tenancy model, resource-identity association, isolation, sharing, lifecycle cascades, ABAC/RBAC substrate). Consumed by §3, §11, §31, §32, §33 as first-class dependencies. |
| Agent harness | §2 Agent lifecycle, §3 Session durability (partly — step-log contract), §5 External tool connections / MCP, §7 Context management, §10 Model capability, §12 Execution mode, §13 Scheduler, §14 Multi-agent composition, §15 Tool dispatch, §18 Interruption, §19 Agent authoring, §20 Tool registry, §27 Evaluation, §28 Model lifecycle, §30 Policy / HITL |
| Touch points | §1 Placement (harness ↔ sandbox ↔ trust plane), §6 Concurrency (harness + resource-level + external), §8 Operational failure modes (cross-cutting), §11 Transport (harness ↔ client, maps to identity), §24 Proxy / gateway topology (harness ↔ external), §26 Observability (cross-cutting), §31 Cost / admission (harness ↔ gateway, attributes via identity) |

How to navigate this document

  • If you are designing the sandbox fabric — start with §16 (lifecycle), §25 (network policy), §29 (supply chain), then §9 (security scope). Touch-point sections §1 (placement) and §24 (gateways) tell you what the harness expects.
  • If you are designing the persistence fabric — start with §23 (storage substrate), §4 (workspace), §3 (session durability), then §17 (branching), §21 (artifacts), §32 (memory), §33 (audit). Touch-point sections §8 (operational) and §26 (observability) tell you what the harness and the compliance regime expect. Read §22 for the tenancy and ownership model every persistence-fabric resource is attributed to.
  • If you are designing the identity fabric — start with §22 (full identity-fabric design). Then read §3 (sessions are identity-owned), §32 (memory partitioning), §33 (right-to-erasure cascade), §31 (per-tenant cost attribution), and §11 (transport-auth-to-canonical-user mapping). The identity fabric is the substrate the other three consume; it is not itself a consumer of them.
  • If you are designing the harness — start with §2 (agent lifecycle), §5 (MCP), §7 (context), §10 (model), §12–§15 (execution, scheduler, composition, dispatch), §18–§20 (interruption, authoring, registry), then §27–§30 (eval, model lifecycle, policy). Expect to consume the other three primitives through their touch-point contracts.
  • If you care about the contracts between primitives — §1, §6, §8, §11, §22, §24, §26, §31 are where the cross-primitive failures live.

Part I — Runtime design axes

1. Harness placement relative to the sandbox

Primitive: Touch point (harness ↔ sandbox ↔ trust plane). Defines the secret-reachability contract.

The harness has to reach two things that usually live on opposite sides of a trust boundary: high-value secrets (LLM keys, OAuth tokens, platform credentials) and the agent's execution environment (a shell, a filesystem, package installers, browser automation). Where the harness sits relative to the sandbox determines how those two reachability sets compose.

The three canonical placements

```mermaid
graph TB
    classDef trust fill:#f9d6d6,stroke:#c0392b
    classDef control fill:#d6eaf8,stroke:#2980b9
    classDef exec fill:#d5f5e3,stroke:#27ae60

    subgraph A["Full outside"]
        direction TB
        subgraph A_trust["Trust plane"]
            A_vault[Credential vault]
            A_proxy[Egress policy]
        end
        subgraph A_control["Control plane"]
            A_harness[Harness / Runner]
            A_session[Session store]
            A_mcp[MCP pool]
        end
        subgraph A_exec["Execution plane"]
            A_sandbox[Sandbox pod]
        end
        class A_trust trust
        class A_control control
        class A_exec exec
        A_harness -->|gRPC| A_sandbox
        A_vault -.->|keys| A_harness
    end

    subgraph B["Hybrid"]
        direction TB
        subgraph B_trust["Trust plane (external)"]
            B_proxy[Proxy + vault]
        end
        subgraph B_combined["Control + Execution (colocated)"]
            B_harness[Harness]
            B_sandbox[Sandbox]
        end
        class B_trust trust
        class B_combined exec
        B_proxy -.->|inject creds| B_harness
        B_harness -->|local call| B_sandbox
    end

    subgraph C["Full inside + external proxy"]
        direction TB
        subgraph C_trust["Trust plane (external)"]
            C_proxy[Proxy]
        end
        subgraph C_inside["Single container"]
            C_all[Harness + Sandbox]
        end
        class C_trust trust
        class C_inside exec
        C_proxy -.->|sole egress| C_all
    end
```

Full outside

The harness runs in a trusted control plane. The sandbox is a credential-free execution environment that only accepts commands (typically over gRPC) and returns results. Credentials are unreachable from generated code.

Strengths. The blast radius of compromised code or prompt injection is bounded to whatever the sandbox itself can do — read files in /workspace, run user code, make whatever outbound calls the egress policy permits. High-value secrets stay entirely outside. Palo Alto Unit 42 [13] has shown that even Firecracker microVMs advertised as "complete isolation" can be bypassed via DNS tunneling, which is the strongest argument for keeping credentials out of the sandbox boundary rather than relying on isolation guarantees. Decoupling also produces independent failure domains: a harness crash doesn't kill the sandbox, and a sandbox crash doesn't lose session history. Anthropic [14] reports that splitting the "brain" from the "hands" reduced p50 time-to-first-token by roughly 60% and p95 by more than 90%, mostly by keeping container provisioning off the critical path for sandbox-optional workloads.

Costs. Every tool call is a network round-trip over gRPC. Persistent multiplexed connections put this in the low-millisecond range but it will never match an in-process syscall. Browser Use [15] argues this cost is "noise compared to LLM response times" for most workloads; LangChain [16] calls it out as the defining trade-off of the sandbox-as-tool pattern. Cold-start varies widely by provider — 2026 benchmarks put Daytona at ~27–90 ms, Blaxel ~25 ms resume, E2B ~150–200 ms, Modal 2–5 s, with Firecracker-snapshot-restore as low as 28 ms in production reports [17] [129] [130]. No serious provider cold-pulls Docker images on the critical path, so "10–20 s Docker pull" is a reference for bare-Kubernetes deployments only. A distinct failure mode at the container layer is runc CVE-2025-31133 / CVE-2025-52881 (Nov 2025) [131] — container-level escapes independent of the proxy boundary, relevant especially to the "full inside" threat model.

There is also a cognitive tax on the model itself. When multiple execution environments coexist (sandbox exec, MCP tools, web fetch, a code interpreter), the model has to track which environment holds which state. Claude's own documentation warns that models sometimes confuse execution environments, assume state is shared across them, or pick the wrong tool for an operation.

Operationally, full-outside trades one deployable unit for many: durable session log, sandbox orchestrator, gRPC fleet, credential vault, MCP connection pool, egress proxy, and audit pipeline, plus the tooling to debug live incidents without entering user containers. Half-built control planes are often more dangerous than simple colocated designs.

Hybrid (colocated harness, external trust plane)

Harness and sandbox live in the same process or container; credentials arrive via an external proxy that injects them at request time and enforces egress. This is the configuration documented in Anthropic's Claude Code secure deployment guide [18] (Agent SDK applications inside a sandbox/container/VM with --network none and a Unix-socket proxy to the host) and by Daytona [59].

Strengths. Inner-loop latency collapses to in-process calls. The trust plane — the load-bearing security piece — is still non-bypassable because the network path is shut off except via the proxy.

Costs. Non-bypassable enforcement is harder than it looks. Not all runtimes respect HTTP_PROXY/HTTPS_PROXY (Node.js fetch() famously does not), TLS prevents content inspection of HTTPS traffic, and a bypass of the egress boundary exposes the credentials the proxy was injecting. ARMO [56] contrasts application-layer guardrails (bypassable via prompt injection) with kernel-level enforcement such as eBPF at 1–2.5% CPU overhead — "a prompt injection can manipulate the agent's behavior, but it can't override kernel-level restrictions." The n8n sandbox escape (CVE-2026-25049, CVSS 10.0) [57] illustrates what this looks like in the wild.

Full inside

Harness and sandbox collapse into a single container with a proxy as the only egress. This is Fly.io's "Sprites" model [61] and the common single-user CLI deployment pattern.

Strengths. Simplest thing that works, one deployable artifact, fastest tool loop. Fly.io's argument — "the age of sandboxes is over" for persistent single-tenant workloads — is coherent when the threat model doesn't include untrusted code execution.

Costs. Weakest isolation if the proxy is bypassable. Unsuited to multi-tenancy: one user's compromised session can reach another's data.

The separable concerns

Two design decisions are easy to conflate but shouldn't be: trust-boundary placement and harness physical placement. What the security model actually requires is that credentials and policy enforcement sit outside the agent's reachable boundary. That can be achieved with an external proxy injecting credentials into a colocated harness, not only by moving the harness itself out. Vercel [19] formalizes this as five isolation levels, with "separated compute with secret injection proxy" as the default production recommendation.

A scoring heuristic

The honest framing is that placement depends on tenancy, threat model, latency profile, and operational capacity — not a universal rule.

| Signal | Pushes toward |
| --- | --- |
| Shared multi-tenant control plane | Outside / hybrid |
| Untrusted input + high-privilege writes + high-value secrets | Outside / hybrid |
| TTFT is a hard metric; many sessions skip sandbox | Outside |
| Team can operate proxy, audit, replay, orchestrator | Outside / hybrid |
| Single-user or per-tenant isolated deployment | Inside / hybrid |
| Primarily same-directory file ops, no high-value secrets | Inside |
| Inner-loop latency matters more than TTFT | Inside / hybrid |
| Small platform team, tight complexity budget | Inside / hybrid |

If outside signals dominate by a clear margin, go outside. If inside signals dominate, colocate. Otherwise, hybrid.
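The heuristic can be read as a majority-with-margin score. A toy sketch — the signal names and the margin of two are assumptions of this sketch, not a published rubric:

```python
# Toy scoring sketch of the placement heuristic; signal names and the
# margin threshold are illustrative assumptions, not a published rubric.

SIGNALS = {
    "multi_tenant_control_plane": "outside",
    "untrusted_input_high_privilege": "outside",
    "ttft_hard_metric": "outside",
    "can_operate_control_plane": "outside",
    "single_user_deployment": "inside",
    "same_directory_file_ops": "inside",
    "inner_loop_latency_dominant": "inside",
    "small_platform_team": "inside",
}

def placement(active_signals: set[str], margin: int = 2) -> str:
    """Outside if outside-signals dominate by a clear margin, inside if
    inside-signals dominate, hybrid otherwise."""
    outside = sum(1 for s in active_signals if SIGNALS.get(s) == "outside")
    inside = sum(1 for s in active_signals if SIGNALS.get(s) == "inside")
    if outside - inside >= margin:
        return "outside"
    if inside - outside >= margin:
        return "inside"
    return "hybrid"

print(placement({"multi_tenant_control_plane",
                 "untrusted_input_high_privilege",
                 "ttft_hard_metric"}))   # outside
```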


2. Agent instance lifecycle: singleton, versioned, or per-request

Primitive: Harness. Touch points: §27 Evaluation (gates rollout), §26 Observability (drift detection).

A harness can instantiate an agent — system prompt, tool set, model binding — once at process start and share the instance across sessions, or it can resolve the agent at request time from a registry of versions. The choice is independent of placement.

Singleton

Runner and agent are application-lifetime singletons; sessions share them and are isolated only by a session_id. Zero per-request allocation, simple concurrency, one deployable.

The cost is that updating one agent's prompt requires redeploying the whole runtime. There is no per-agent canary, no gradual rollout, no A/B test of prompts. The singleton pattern turns any behavioral change into an all-or-nothing event: a bad prompt regresses 100% of sessions for that agent type until rollback completes.

OpenAI's GPT-4o sycophancy rollback [20] is the textbook illustration: the update rolled out on April 25, 2025, rollback began April 28, and the full revert took roughly a day. ZenML's analysis of 1,200+ production LLM deployments [21] found prompt updates to be the leading cause of unexpected production behavior — ahead of model version changes and infrastructure failures. Deepchecks [24] documents how a three-word prompt change ("be more empathetic and engaging") weakened content filters enough to allow policy-violating output.

Runtime prompt/model registry

Industry practice has converged on runtime registries that resolve prompt versions per request. MLflow's prompt registry [22], Braintrust, and Portkey [23] all support this model, and the Portkey pattern documents canary rollouts at 1–5% of traffic with automatic rollback on quality regressions. This is the pattern to reach for when agent types multiply, when prompt quality is measured continuously, or when any single agent's traffic is large enough that "all at once" is not an acceptable rollout granularity.

The registry is not free. It introduces a resolution hop on every request (cacheable), a consistency story across replicas, and a UX for authoring, reviewing, and rolling back prompt versions. The mental model for operators also shifts: behavior depends on which prompt version the session resolved, which is one more variable to carry in incident analysis.
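The resolution hop can be sketched as deterministic hash bucketing against a registry entry. This is an illustrative sketch under assumed names (PROMPT_REGISTRY, resolve_prompt); MLflow, Braintrust, and Portkey each expose their own APIs for this:

```python
import hashlib

# Hypothetical registry entry: stable version, canary version, canary share.
PROMPT_REGISTRY = {
    "coding-agent": {"stable": "v42", "canary": "v43", "canary_pct": 10},
}

def resolve_prompt(agent_type: str, session_id: str) -> str:
    """Resolve a prompt version per request with a percentage canary.
    Deterministic bucketing means a session always resolves to the same
    version, and incident analysis can recompute the bucket from the id."""
    entry = PROMPT_REGISTRY[agent_type]
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return entry["canary"] if bucket < entry["canary_pct"] else entry["stable"]

versions = {resolve_prompt("coding-agent", f"session-{i}") for i in range(1000)}
print(sorted(versions))   # both stable and canary appear across sessions
```

Rollback in this model is a registry write (set canary_pct to 0), not a redeploy, which is the operational property the singleton pattern lacks.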

Per-request instantiation

The extreme — build a new agent per request — is almost never worth it for stateless text agents but reappears in agent-factory patterns (each session gets a customized tool set derived from user metadata). It trades allocation cost for runtime flexibility.

When pressure shows

```mermaid
graph LR
    subgraph "Single binary"
        B[Control plane v42]
        B --- CA[Coding agent v42]
        B --- SA[Slackbot agent v42]
        B --- MA[Monitoring agent v42]
    end

    subgraph "Registry-resolved"
        B2[Control plane v42]
        B2 --- CA2[Coding agent v43<br/>canary 10%]
        B2 --- SA2[Slackbot agent v41<br/>stable]
        B2 --- MA2[Monitoring agent v42<br/>stable]
    end

    style CA fill:#f9d6d6,stroke:#c0392b
    style SA fill:#f9d6d6,stroke:#c0392b
    style MA fill:#f9d6d6,stroke:#c0392b
    style CA2 fill:#fdebd0,stroke:#e67e22
    style SA2 fill:#d5f5e3,stroke:#27ae60
    style MA2 fill:#d5f5e3,stroke:#27ae60
```

The singleton is fine at 1–2 agent types and starts hurting past three or four. The first time a prompt regression in one agent forces a rollback that also reverts a critical fix in another, the next design conversation is about versioning.


3. Session state durability: intent log, step log, or hybrid

Primitive: Persistence fabric (event store) + Harness (step-log contract). Touch points: durable-state contract (harness ↔ persistence), §22 Identity (sessions are identity-owned resources — tenancy, ACL, sharing, and right-to-erasure cascade follow §22), §33 Audit (tamper-evidence).

A harness has to persist enough state to recover from any of: harness crash, replica reschedule, sandbox pod loss, or the user resuming a session days later. There are three common choices and they differ in what they record, not just where. Independent of the choice, every session is an identity-owned resource: each log entry, checkpoint, and derived artifact carries an owner, a tenant, and an ACL sourced from the identity fabric (§22). The durability contract captures what happened; the identity contract captures for whom, against whose tenancy, under what scopes.

Append-only event log (intent)

Every user message, model response, tool call, and tool result is written to a durable store. The session is the sole recovery artifact: any replica can serve any session by reading from the log. Clean replay, clean audit, transport-agnostic.

This model's failure mode is the gap between side effect and commit. If the harness crashes after a tool executed but before its result was persisted, recovery replays the pre-crash state, and the agent re-issues a non-idempotent operation: a second git commit, a duplicate ticket, a repeated payment. The session store and the external world diverge silently.

```mermaid
sequenceDiagram
    participant LLM
    participant R as Runner
    participant SB as Sandbox
    participant SS as Session store

    R->>LLM: prompt
    LLM-->>R: tool_use: sandbox_exec("git commit -m 'feat'")
    R->>SB: gRPC Exec("git commit ...")
    SB-->>R: OK (commit created)
    Note over R: ✕ Runner crashes here
    Note over SS: Event NOT persisted

    Note over R,SS: Recovery
    R->>SS: GetSession
    SS-->>R: last known state (pre-commit)
    R->>LLM: replay from last checkpoint
    LLM-->>R: tool_use: sandbox_exec("git commit -m 'feat'")
    R->>SB: gRPC Exec("git commit ...")
    Note over SB: ✕ Duplicate commit or error:<br/>nothing to commit, working tree clean
```

Retries and replay are common enough that these operations have to be designed for, not treated as edge cases [25].

Durable step log (intent + commit boundary)

Frameworks like Restate [26], Temporal [27], and Inngest [28] journal each step boundary before the effect executes, then record its result. Replay skips any step whose commit is durable. The model formalizes the distinction between "we asked to do X" and "X completed," and makes exactly-once idempotency a property of step definitions rather than hope. Maxim Fateev (Temporal CEO) has argued explicitly that probabilistic LLM behavior makes naive retry logic insufficient: retries on a model call can produce different outputs, so the journal has to record which output was committed, not just that one was attempted [132]. By 2025 durable execution had crossed the early-majority line: AWS Durable Functions, Cloudflare Workflows (GA), and Vercel Workflow DevKit all shipped with AI-agent use cases as primary framing [133] — the pattern is no longer framework-specific.

The step log does not make external effects magically idempotent. It still requires idempotency keys or compensating logic at the boundary — but the journal gives you the hook to put them on.
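The step-boundary idea can be sketched with an in-memory journal. The names (durable_step, journal) are illustrative; real engines such as Restate, Temporal, and Inngest make the journal durable and transactional:

```python
# Minimal sketch of a step-log boundary with an idempotency key. The journal
# here is an in-memory dict; a real durable-execution engine persists it.

journal: dict[str, object] = {}   # step_key -> committed result

def durable_step(step_key: str, effect):
    """Journal the result at the commit boundary; replay skips any step
    whose commit is already durable."""
    if step_key in journal:          # replay path: effect already committed
        return journal[step_key]
    result = effect()                # the non-idempotent side effect
    journal[step_key] = result       # commit boundary
    return result

calls = []
def create_ticket():
    calls.append("ticket")
    return "TICKET-123"

# First run executes the effect; a crash-replay with the same key skips it.
first = durable_step("session-42/step-7/create_ticket", create_ticket)
replay = durable_step("session-42/step-7/create_ticket", create_ticket)
print(first, replay, len(calls))   # TICKET-123 TICKET-123 1
```

Note the sketch has exactly the residual gap the text describes: a crash between effect() and the journal write still duplicates the effect, which is why the step key also needs to reach the external API as an idempotency key.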

Filesystem-as-state

Zhou et al.'s InfiAgent [29] formalizes an extreme position: working context (open files, partial plans, execution checkpoints) "disappears as soon as the context window resets or a process is interrupted," and the filesystem should be the sole authoritative record of task state. The event log becomes a secondary log of intent; the filesystem holds the truth. This pushes the design toward an opinionated convention (e.g., progress.md, plan.md, journal.md) that the agent is trained or instructed to maintain.

The reconciliation gap

Whichever durability model is chosen, there is a question that doesn't answer itself: if the sandbox has mutable state (files, packages, processes) outside the durable log, how does the harness reconcile the two on recovery? For idempotent operations (overwriting a file with known content) it doesn't matter. For non-idempotent operations (appending to a file, incrementing a counter, creating a commit, calling an external API with side effects), the harness needs either a step-log boundary or an idempotency key threaded through the tool.


4. Workspace persistence: PVCs, snapshots, and checkpoints

Primitive: Persistence fabric. Consumed by: Harness (via durable-state contract); picked by: §16 Sandbox lifecycle.

The sandbox is mutable state that outlives individual tool calls. How to persist it across pod lifecycle events is an independent design axis.

PersistentVolumeClaim (PVC)

The common Kubernetes pattern. A PVC is mounted at /workspace; when the pod dies, a replacement mounts the same PVC.

The failure modes are well-documented. PVCs (especially ReadWriteOnce) are zone-scoped in most cloud providers: if the pod reschedules to a different availability zone (node failure, capacity rebalancing, spot eviction), the PVC cannot follow and recovery falls back to full re-bootstrap. Rack2Cloud [30] documents this as a common day-2 failure: "Pod restarts on a different node after a failure. New node can't mount the PVC. Pod stays in ContainerCreating." Kubernetes issue #121436 [31] shows PVC binding annotations persisting after failed scheduling, preventing rescheduling to alternative nodes. RWO also creates contention during crash recovery: an old pod terminating blocks a new one from mounting until the grace period elapses, adding seconds to minutes of unpredictable recovery latency.

When the pod is deliberately destroyed (teardown policy, session timeout, scale-down) the PVC is often deleted with it. Any uncommitted work is permanently lost. Gitpod issue #9544 [32] documents total data loss when a workspace timeout fired before the final sync completed — a user lost roughly two hours of work. A fresher failure mode is Karpenter issue #2777 [134] (2026): Karpenter injects PVC zone into pod NodeAffinity, which can leave pods indefinitely Pending on node deletion because the only viable node has been removed.

Snapshot / checkpoint

Replit takes a different approach: checkpoint the complete state — workspace, conversation, environment — on a cadence, and restore from checkpoint on crash. Replit's December 2025 snapshot-engine deep-dive [135] documents the architecture as copy-on-write filesystem plus Neon database branching [136], explicitly capturing AI conversation context alongside filesystem state; they report recovering from OOM crashes that occurred "roughly once an hour" without data loss [33]. The cost is snapshot frequency versus storage overhead, and the operational complexity of a checkpoint pipeline.

Reproducible bootstrap as recovery

A third option: treat the sandbox as fully disposable and rely on a deterministic bootstrap spec (pinned commit SHAs, locked package versions, side-effect-free setup) to reconstruct state from scratch. This only works if the bootstrap is deterministic. A git clone of a floating main branch, an npm install without a lockfile, or a setup script that mutates remote state makes "resume" an ambiguous operation that drifts over time.
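Those failure modes can be linted for mechanically. A hedged sketch with a made-up spec shape and rules (a real checker would parse the platform's actual bootstrap format):

```python
import re

# Illustrative determinism lint for a bootstrap spec; the spec shape and the
# two rules below are assumptions of this sketch, not a real spec format.

def bootstrap_is_deterministic(spec: dict) -> list[str]:
    """Return a list of determinism violations; empty means the spec passes."""
    problems = []
    for repo in spec.get("repos", []):
        # A 40-hex-char ref is a pinned commit SHA; branch names float.
        if not re.fullmatch(r"[0-9a-f]{40}", repo.get("ref", "")):
            problems.append(f"unpinned ref for {repo.get('url')}")
    for step in spec.get("setup", []):
        # npm ci enforces the lockfile; bare npm install resolves floating ranges.
        if "npm install" in step:
            problems.append(f"npm install without lockfile enforcement: {step}")
    return problems

spec = {
    "repos": [{"url": "github.com/acme/api", "ref": "main"}],
    "setup": ["npm install"],
}
print(bootstrap_is_deterministic(spec))   # two violations reported
```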

The honest statement

Workspace persistence is partial durability, not full recovery. Any design that claims "sessions survive indefinitely" needs to be explicit about which failure modes it covers (pod restart in same zone) and which it doesn't (zone failure, teardown before sync, preemption of a hibernated node).


5. External tool connections: MCP and beyond

Primitive: Harness. Touch points: §22 Identity (token lifecycle), §24 Gateway (MCP gateway).

Modern harnesses integrate external tools via stateful protocols, most notably MCP (Model Context Protocol). MCP connections carry negotiated capabilities, cursor positions, and server-side context — they are not stateless HTTP calls.

In-memory, per-replica pooling

The simplest design: each control-plane replica holds MCP connections in memory, keyed by (session_id, server_url), established lazily on first use. No distributed state, session-scoped lifecycle matches MCP semantics.

The scaling problems are well-known. The MCP 2026 roadmap [34] explicitly names stateful sessions as the primary bottleneck: "stateful sessions fight with load balancers, horizontal scaling requires workarounds." Without sticky routing, every replica restart, scale-down, or rebalance triggers mass reconnection. Claude Code issue #30224 [35] documents this: "When an SSE-based MCP server restarts, all active Claude Code sessions become stale. The tools remain listed but every call fails silently." MCP Python SDK issue #520 [36] shows sessions lost in multi-worker Kubernetes environments because "SSE sessions are created in one worker process, but subsequent requests may be routed to different worker processes where the session state is not shared."

```mermaid
sequenceDiagram
    participant LB as Load balancer
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant MCP as MCP servers

    Note over R1: Holds MCP connections<br/>for sessions A, B, C

    Note over R1: ✕ Replica 1 crashes

    R2->>MCP: Reconnect session A (GitHub)
    R2->>MCP: Reconnect session A (Jira)
    R2->>MCP: Reconnect session B (GitHub)
    R2->>MCP: Reconnect session B (Slack)
    R2->>MCP: Reconnect session C (GitHub)
    R2->>MCP: Reconnect session C (Jira)
    Note over R2,MCP: N sessions × M servers = N×M simultaneous reconnects<br/>Server-side state (cursors, capabilities) lost
```

Token lifecycle and refresh races

OAuth tokens introduce concurrency problems that pure session-scoped pooling doesn't address. Claude Code issue #27933 [37] shows this in production: multiple concurrent CLI processes race on refreshing a single-use OAuth refresh token, and the loser gets a 404 with no automatic recovery. Users with 5–12 concurrent sessions [38] report forced re-authentication multiple times per day.

A pool design needs explicit answers to:

  • What happens if a token is refreshed while a connection is active — does the connection carry the old token or is it swapped?
  • What happens if a token is revoked mid-session — does the error surface to the model as "retry" or "stop trying"?
  • What happens when multiple MCP servers require different OAuth flows for the same user — how is consent orchestrated?
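A common answer to the refresh race is single-flight refresh: at most one refresh per user at a time, with concurrent callers reusing the winner's result instead of re-redeeming the single-use refresh token. A minimal in-process sketch (TokenBroker is a hypothetical name; multi-process deployments need a distributed lock or a central broker):

```python
import threading
import time

# Single-flight refresh sketch for a single-use OAuth refresh token.
# TokenBroker and its methods are illustrative names, not a real library.

class TokenBroker:
    def __init__(self):
        self._lock = threading.Lock()
        self._token = ("access-0", "refresh-0")   # (access, refresh)
        self._refreshes = 0

    def _do_refresh(self, refresh_token: str) -> tuple[str, str]:
        # Stand-in for the real OAuth call; consumes the single-use token.
        self._refreshes += 1
        time.sleep(0.01)
        return (f"access-{self._refreshes}", f"refresh-{self._refreshes}")

    def get_access_token(self, expired: str) -> str:
        with self._lock:
            access, refresh = self._token
            if access != expired:          # another caller already refreshed
                return access              # reuse the winner's result
            self._token = self._do_refresh(refresh)
            return self._token[0]

broker = TokenBroker()
threads = [threading.Thread(target=broker.get_access_token, args=("access-0",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(broker._refreshes)   # 1 — the single-use refresh token was burned once
```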

What the ecosystem is converging on

Session affinity (sticky routing via load balancer or consistent hashing), reconnect backoff with jitter, admission control at the pool level, and explicit token lifecycle management are all load-bearing. The MCP spec deprecated HTTP+SSE on 2025-03-26 in favor of Streamable HTTP [137]; major vendors (Atlassian Rovo, Keboola) are sunsetting SSE endpoints by Apr–Jun 2026. Resumable streams use Mcp-Session-Id + Last-Event-ID for lossless cursor replay on reconnect [138], which is the concrete mechanism the "SSE → harness replays from cursor" row relies on. MCP spec 2026-03-15 also mandates RFC 8707 resource indicators [139] to prevent token-mis-redemption attacks, adding an explicit answer to the "token is refreshed while connection is active" question.
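The cursor-replay mechanism combines naturally with jittered backoff, so that a replica crash does not turn N×M reconnects into a thundering herd. A sketch with stub transport functions (this is not an MCP client library; only the two header names, Mcp-Session-Id and Last-Event-ID, come from the mechanism described above):

```python
import random

def backoff_schedule(attempts: int, base: float = 0.5, cap: float = 30.0):
    """Exponential backoff with full jitter: spreads simultaneous
    reconnects after a replica crash instead of synchronizing them."""
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** n))

def reconnect(session_id: str, last_event_id: str, connect):
    """Resume a stream from its cursor; `connect` is a stub transport hook."""
    for delay in backoff_schedule(attempts=6):
        # time.sleep(delay) in a real client; elided here
        stream = connect(headers={
            "Mcp-Session-Id": session_id,     # resume the same session
            "Last-Event-ID": last_event_id,   # server replays from cursor
        })
        if stream is not None:
            return stream
    raise ConnectionError(f"could not resume session {session_id}")

attempts = {"n": 0}
def flaky_connect(headers):
    attempts["n"] += 1
    return None if attempts["n"] < 3 else {"resumed_from": headers["Last-Event-ID"]}

result = reconnect("sess-1", "evt-41", flaky_connect)
print(result)   # resumes from evt-41 on the third attempt
```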


6. Concurrency: session-level vs resource-level coordination

Primitive: Touch point (harness + resource-level + external systems). Touch points: §30 Policy (coordination at the external resource).

Session-level locking prevents two concurrent Run() calls on the same session. It says nothing about concurrent operations on the same external resource from different sessions of the same agent.

The gap

graph TB
    classDef locked fill:#d5f5e3,stroke:#27ae60
    classDef unlocked fill:#f9d6d6,stroke:#c0392b

    A[Agent: acme-api] --> S1[Session 1<br/>sandbox A]
    A --> S2[Session 2<br/>sandbox B]
    A --> S3[Scheduled tick<br/>sandbox C]

    S1 -->|git push main| GIT[git repo: acme/api]
    S2 -->|git push main| GIT
    S3 -->|create ticket| JIRA[Jira: ACME project]
    S1 -->|create ticket| JIRA

    class S1,S2,S3 locked
    class GIT,JIRA unlocked

    style GIT fill:#f9d6d6,stroke:#c0392b
    style JIRA fill:#f9d6d6,stroke:#c0392b

Two coding sessions for the same agent can produce conflicting commits. Two conversational sessions can post duplicate responses. A user-initiated session and a scheduled tick for a hybrid agent can target the same external state without either knowing about the other.

Ogenrwot & Businge [39] analyzed 142,652 AI-generated PRs and found a 27.67% textual merge-conflict rate — more than one in four AI-generated pull requests produces conflicts, with 540 average conflicting lines per PR; the 2026 AgenticFlict extension [39] grew this corpus to 932,791 agentic PRs with conflict-aware metadata. Geng & Neubig (CMU) [40] showed that worktree isolation substantially outperforms shared-workspace approaches (59.1% vs 56.1% on Commit0-Lite) and that soft isolation via instruction-level constraints alone actually degrades performance below single-agent baselines. MultiAgentBench (ACL 2025) [140] formalizes coordination-protocol evaluation across star/chain/tree/graph topologies and is a stronger anchor for the "agent-level coordination" question than MAST [41] alone.

Options

  • Worktree-per-session. Give each session a dedicated branch/worktree; merge via PR. Geng & Neubig's result is strong support for this.
  • Resource-level distributed locks. Before a session does a git push to a shared branch, acquire a lock keyed on the branch. Works for well-typed operations; it doesn't help when the conflict surface is fuzzy.
  • Optimistic concurrency with reconciliation. Let collisions happen and resolve on the external system (merge commits, ticket dedup). Shifts the complexity to the recovery path.
  • Agent-level coordination. The MAST paper [41] formalizes inter-agent misalignment and coordination failure as first-class categories in multi-agent systems — reinforcing that these conflicts are structural, not incidental, and need a coordination mechanism at the agent level, not just the session level.

The choice depends on how strongly the workload's external effects conflict, and on whether the downstream systems have their own concurrency controls (git's ref updates, most databases' row locks) that can be relied on.
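For the resource-level lock option, the usual Postgres shape is an advisory lock keyed on a hash of the resource identifier. A minimal sketch of the key derivation (the `git:owner/repo#branch` naming is illustrative, not a standard):

```python
import hashlib

def advisory_lock_key(resource: str) -> int:
    """Derive a stable signed 64-bit key for Postgres advisory locks
    from a resource identifier. Sketch; the naming scheme is illustrative."""
    digest = hashlib.sha256(resource.encode("utf-8")).digest()
    # Postgres advisory locks take a signed bigint.
    return int.from_bytes(digest[:8], "big", signed=True)

# A session would then wrap its push-adjacent bookkeeping in:
#   SELECT pg_advisory_xact_lock(%s);
# using advisory_lock_key("git:acme/api#main"), so two sessions targeting
# the same branch serialize while sessions on other branches proceed.
```

Using the transaction-scoped `pg_advisory_xact_lock` (rather than session-scoped locks) avoids the connection-pooler release hazard discussed in §13.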


7. Context management: compaction, externalization, and agent self-documentation

Primitive: Harness. Touch points: §32 Long-term memory (persistence fabric), §26 Observability (compaction events).

A long-running agent hits the context window. Two questions: what gets preserved, and who decides.

Automatic compaction

Summarize older turns into a compressed representation. Keeps sessions running past the hard limit. The loss is invisible — summarization discards detail, and the agent doesn't know what it lost.

Anthropic's own cookbook [42] measured this directly: compaction preserves 3/3 high-level facts but 0/3 obscure specifics. A two-week empirical test across coding tools [43] found 23 context loss events in Claude Code alone, including the AI suggesting contradictory decisions after compaction (e.g., recommending Redux despite prior explicit rejection). Lindenbauer et al. (JetBrains / TU Munich) [44] showed that simple observation masking — replacing old tool outputs with placeholders — outperforms LLM summarization in 4 of 5 settings, and that summaries cause "trajectory elongation" (agents persist 13–15% longer than optimal because summaries mask failure signals). ACON [141] reports −25% peak tokens on AppWorld without accuracy loss and −54.5% on 8-objective QA while surpassing the no-compression baseline — a sharper empirical anchor than the 3/3 vs 0/3 figure for what compaction can preserve when done well. LoCoBench-Agent [142] quantifies the compaction cliff: accuracy drops below 50% once compression exceeds 5× across four tested models.
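The observation-masking result is mechanically simple. A minimal sketch, assuming a flat `{'role', 'content'}` message list (the shape is an assumption, not any specific SDK's type):

```python
def mask_old_observations(messages, keep_last=3,
                          placeholder="[tool output elided]"):
    """Observation masking sketch: replace all but the most recent tool
    outputs with a placeholder instead of LLM-summarizing the history.
    Unlike summarization, the loss is explicit and deterministic."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_mask = set(tool_indices[:-keep_last]) if keep_last else set(tool_indices)
    # Return new message dicts; the original history stays intact on disk.
    return [
        {**m, "content": placeholder} if i in to_mask else m
        for i, m in enumerate(messages)
    ]
```

The property worth noting: failure signals in recent tool outputs survive untouched, which is exactly what summaries mask and what drives the trajectory-elongation effect above.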

Externalization to the filesystem

progress.md, plan.md, or a structured journal. The agent writes what it wants to remember; compaction reloads from the file. Strength: the agent controls what persists. Weakness: the agent's self-documentation is uneven. Cognition [45] reports that Sonnet 4.5 in Devin frequently writes summaries (CHANGELOG.md, SUMMARY.md) without prompting, but the summaries lack comprehensiveness — "the model didn't know what it didn't know." In some cases "the agent spent more tokens writing summaries than actually solving the problem."

Context-window growth vs. compaction pressure

Bigger context windows reduce compaction frequency but do not eliminate the degradation-with-length problem. Du et al. [46] showed a 24.2% accuracy drop on MMLU for Llama-3.1-8B as input length increases, even with perfect retrieval. Hong et al. (Chroma) [47] confirmed this across 18 LLMs and 194,480 calls: performance degrades "even under minimal conditions" as context grows. The implication is that "just fit everything in" is not a robust strategy for persistent agents.

Steady-state growth for scheduled agents

Persistent scheduled sessions (a monitoring agent running every five minutes with session_mode: persistent) accumulate events indefinitely. The design questions this raises — when compaction fires, what steady-state event count looks like, when to rotate sessions — are separate from the turn-by-turn compaction question and tend to go unaddressed until the storage or context-size signal becomes a problem.

The honest statement

Context management is convention-driven in every production harness the ecosystem has documented. The quality of long-session behavior depends on:

  1. The model's ability to write a good summary when asked.
  2. The model's ability to interpret its own (or another model's) summary on reload.
  3. The harness's willingness to enforce an externalization pattern rather than hoping for it.

None of these are mechanical guarantees.


8. Operational failure modes that are usually under-scoped

Primitive: Touch point (cross-cutting). Touch points: §24 Gateway (SPOF), §31 Cost (runaway enforcement), §26 Observability (detection).

The design axes above are the headline architectural choices. Several failure modes tend to be documented-in-principle but under-addressed in practice. They are not exotic — each has public post-incident evidence.

graph TB
    classDef risk fill:#f9d6d6,stroke:#c0392b

    GW[LLM gateway down] -->|all sessions stall| STALL[No fallback or circuit breaker]
    K8S[K8s capacity exhausted] -->|session creation hangs| HANG[No admission control or timeout]
    LOOP[Runaway agent loop] -->|unbounded LLM spend| COST[No hard timeout for request-driven sessions]
    GROWTH[Event log growth] -->|Postgres bloat| PERF[No storage compaction or partitioning]

    class GW,K8S,LOOP,GROWTH risk

LLM gateway as single point of failure

A single gateway in front of model inference is a single point of failure for the entire platform. Langfuse [48] experienced a 3-hour 7-minute outage when their gateway layer (Cloudflare) failed — most traces during the incident "never reached ingestion endpoints and are permanently lost." OpenAI's own routing layer [49] hit a cascading failure when a buffer allocation bug caused nodes to exhaust memory under load, with insufficient remaining capacity to self-heal. Every active session stalls simultaneously; there is usually no circuit breaker, fallback route, or graceful degradation designed in from day one.

Sandbox scheduling exhaustion

Session creation that calls the Kubernetes API has to contend with the cluster's ability to actually schedule the pod. If resource quota, node capacity, or pending scale-up prevents admission, the common failure is a hang with no client-visible error — the SSE stream simply never starts. Admission control at the harness layer, with explicit capacity signals and clear rejection semantics, is what closes this.

Runaway request-driven sessions

Scheduled agents typically have a timeout_per_tick. Request-driven sessions often do not. A coding agent stuck in a test-fix loop consumes tokens indefinitely. A LangChain A2A pipeline [50] entered an infinite loop between two agents for 11 days, generating a $47,000 bill — "neither agent had a budget ceiling." A Claude Code recursion incident [51] consumed 1.67 billion tokens in 5 hours ($16,000–$50,000).

The minimum controls are hard wall-clock timeouts per session, token budgets enforced at the harness layer rather than the billing system, and admission control that can reject new sessions under cost pressure. Budget alerts are not budget enforcement.
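"Enforced at the harness layer" means the budget check sits in the agent loop itself, where exceeding it stops the next model call. A minimal sketch (limits illustrative):

```python
import time

class BudgetExceeded(Exception):
    pass

class SessionBudget:
    """Hard per-session ceilings checked in the harness loop itself.
    Raising here halts the agent loop before the next model request --
    which is what distinguishes enforcement from billing alerts."""

    def __init__(self, max_tokens: int, max_wall_seconds: float):
        self.max_tokens = max_tokens
        self.deadline = time.monotonic() + max_wall_seconds
        self.spent = 0

    def charge(self, tokens: int) -> None:
        # Call before every model request with the previous turn's usage.
        self.spent += tokens
        if self.spent > self.max_tokens:
            raise BudgetExceeded(f"token budget exhausted: {self.spent}")
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("wall-clock timeout")
```

Either of the two incidents above would have been bounded by this check, at the cost of choosing a ceiling up front.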

Scheduler retry is not exactly-once side effects

An agent-level distributed lock prevents two replicas from firing the same tick concurrently. It does not guarantee exactly-once external effects. If a tick posts to Slack, opens a Jira ticket, or writes to a database and the replica crashes before persisting the tick outcome, recovery can replay the tick and duplicate the side effect. Locking prevents overlap; it does not make scheduled actions idempotent.
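Idempotency has to be added explicitly. A minimal sketch, assuming the harness can derive a deterministic key per (tick, side effect); a production version persists the key set transactionally alongside the tick outcome rather than in memory:

```python
class EffectLog:
    """Idempotency sketch: record a deterministic key for each completed
    side effect so a replayed tick skips what already ran. The in-memory
    set is illustrative; a real one lives in the same store (and ideally
    the same transaction) as the tick outcome."""

    def __init__(self):
        self._done = set()

    def run_once(self, key: str, effect) -> bool:
        if key in self._done:
            return False  # replayed tick: side effect already happened
        effect()
        self._done.add(key)
        return True

def tick_effect_key(agent_id: str, scheduled_for: str, step: str) -> str:
    # Deterministic across replays: same tick + same step -> same key.
    return f"{agent_id}:{scheduled_for}:{step}"
```

Where the downstream system accepts an idempotency header, the same key can be passed through so deduplication happens at the receiver even if the harness-side record is lost.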

Session store growth

An append-only event log has no compaction at the storage layer (compaction in this context is an LLM context-management feature, not a storage feature). High-throughput agents generating many events per turn will bloat the backing table. Azguards [52] measured LangGraph's append-only Postgres checkpointing at modest scale (100 concurrent agents): roughly 150 MB/sec of WAL generation, 3–5 second replication lag (vs <100 ms with optimization). Their Pointer State Pattern reduced checkpoint size by 99.8%. Partitioning, archival, or storage-level TTL is load-bearing, not optional.
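The Pointer State Pattern's core move is content-addressed externalization: large payloads leave the event log and a reference stays behind. A minimal sketch, assuming a generic key-to-bytes blob store (the event shape and `blob_store` interface are illustrative, not Azguards' implementation):

```python
import hashlib
import json

def externalize_event(event: dict, blob_store: dict,
                      inline_limit: int = 4096) -> dict:
    """Pointer-state sketch: keep large payloads out of the append-only
    event log by storing a content-addressed reference instead.
    blob_store is any key -> bytes mapping (object storage in production)."""
    payload = json.dumps(event["content"]).encode("utf-8")
    if len(payload) <= inline_limit:
        return event  # small payloads stay inline in the log row
    key = hashlib.sha256(payload).hexdigest()
    blob_store[key] = payload
    return {**event, "content": None, "content_ref": key}
```

Content addressing also deduplicates repeated tool outputs for free, which is common in retry-heavy agent traces.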


9. Security: what isolation solves and what it doesn't

Primitive: Sandbox fabric (isolation) + Harness (policy); cross-cut. Touch points: §25 Network policy (non-bypassable egress), §22 Identity (credential lifecycle), §30 Policy (overprivileged actions).

"Harness outside the sandbox" is an important security property but a narrow one. It solves secret reachability: generated code cannot read high-value credentials. It does not, by itself, solve:

graph LR
    classDef safe fill:#d5f5e3,stroke:#27ae60
    classDef partial fill:#fdebd0,stroke:#e67e22
    classDef unsafe fill:#f9d6d6,stroke:#c0392b

    subgraph "Isolation coverage"
        SR[Secret reachability]
        FD[Fault domain isolation]
    end

    subgraph "Requires additional controls"
        PI[Prompt injection]
        OP[Overprivileged actions]
        NB[Non-bypassable enforcement]
        CL[Credential lifecycle edge cases]
    end

    class SR,FD safe
    class PI,OP,NB,CL partial

Prompt injection

The model may follow malicious instructions embedded in web pages, files, repository content, or MCP tool responses. Isolation does not prevent the agent from taking harmful actions through its legitimate tool interface — it only prevents credential theft. Willison [53] names this the "lethal trifecta": private data access + untrusted content exposure + external communication ability. OWASP [54] ranks prompt injection as the #1 LLM vulnerability for 2025, noting that "it is unclear if there are fool-proof methods of prevention." Security audits [55] found prompt injection in 73% of production AI deployments assessed in 2025, with attack success rates exceeding 85% against state-of-the-art defenses when adaptive strategies are employed. Q4 2025 Wiz Research data [143] shows 340% YoY increase in documented prompt-injection attempts against enterprise AI and 190% increase in successful attacks (data exfiltration or unauthorized action); indirect prompt injection now accounts for >55% of observed attacks with 20–30% higher success rates than direct injection — relevant because indirect is exactly what MCP tool responses and web-fetch channels expose. Concrete production CVEs: Microsoft Copilot CVSS 9.3, GitHub Copilot CVSS 9.6, Cursor IDE CVSS 9.8 [143] — all 2025–2026, all exploited. SandboxEscapeBench [144] quantifies the defense gap: kernel mechanisms (capabilities, seccomp, MAC) block 67.57% of privilege escalations vs only 21.62% for namespaces/cgroups alone — concrete support for the eBPF/kernel-enforcement argument.

Overprivileged allowed actions

Even with domain allowlists, broad permissions on an allowed domain (e.g., write access to a production database via an allowed MCP server) let prompt injection cause damage through legitimate channels. Short-lived credentials reduce the replay window; they do not reduce the permissions within that window.

Non-bypassable enforcement

True egress enforcement requires network-level controls (NetworkPolicy, private subnet with no internet gateway, Unix socket as sole egress), not just configuration. Anthropic's secure deployment guide [18] explicitly warns: "Not all programs respect HTTP_PROXY/HTTPS_PROXY. Node.js fetch() ignores these variables by default." ARMO [56] contrasts application-layer guardrails with kernel-level enforcement (eBPF, 1–2.5% CPU overhead), and the n8n sandbox escape (CVE-2026-25049, CVSS 10.0) [57] shows the consequence of treating the sandbox boundary as sufficient without that enforcement layer.

Credential lifecycle edge cases

The credential happy path is easy; the edge cases determine behavior under failure.

  • Token refreshed while an MCP connection is active — does the connection use the old or new token?
  • Token revoked by user while a session is in-flight — how does the error propagate to the agent, and can it distinguish "retry" from "stop"?
  • Token near expiry when a scheduled agent tick fires — who is responsible for proactive refresh?
  • Multiple MCP servers requiring different OAuth flows for the same user — how is consent orchestrated?

These are design questions, not operational accidents. They need answers before the first incident.


10. Model capability as a load-bearing assumption

Primitive: Harness. Touch points: §24 LLM gateway (routing), §27 Evaluation (gate on swap), §28 Model lifecycle (training pipeline).

The cleanest abstraction a harness can offer is "the model is swappable via gateway configuration." It is true at the protocol level and misleading at the behavior level. The API surface — tool-use JSON, system prompt, multi-turn conversation — is necessary but not sufficient to describe what the agent can actually do.

graph TB
    classDef stable fill:#d5f5e3,stroke:#27ae60
    classDef unstable fill:#f9d6d6,stroke:#c0392b
    classDef drift fill:#fdebd0,stroke:#e67e22

    MODEL[Model change<br/>provider swap · upgrade · deprecation]

    MODEL --> TC[Tool calling behavior]
    MODEL --> PP[Prompt effectiveness]
    MODEL --> CW[Context window / compaction]
    MODEL --> MR[Multi-step reasoning]
    MODEL --> EE[Environment disambiguation]
    MODEL --> TF[Tool schema interpretation]

    class MODEL unstable
    class TC,PP,CW,MR,EE,TF drift

Tool calling behavior varies

The Berkeley Function Calling Leaderboard has moved to BFCL V4 Agentic [3], which adds multi-hop search, memory management, and format sensitivity — the V3-era Claude 3.5 Sonnet 90.2% / GPT-4o 83.6% figures are stale as a capability ranking; model ordering on V4 differs materially from V3 and the benchmark structure itself has changed, so numbers quoted here should be labelled with their BFCL version. ToolACE [4] shows the gap widens on complex call patterns (parallel, nested, multi-step). ACEBench [5] demonstrates that different models have different failure patterns, not just different accuracy. One model reliably calls a shell tool with correct syntax; another hallucinates flags, wraps commands in unnecessary subshells, or fails to escape arguments. Same schema, different behavior.

Prompt sensitivity is model-specific

Zhu et al. [6] found prompt variations cause performance swings of 12–85 percentage points depending on model. Sclar et al. [7] showed up to 76 accuracy points difference from formatting changes alone in few-shot settings. A prompt tuned for one model may produce verbose preamble on another, or be silently ignored by a third with weaker instruction following.

Multi-step reasoning degrades differently

Shi et al. [8] showed all top LLMs exhibit an average 39% performance drop in multi-turn versus single-turn conversations. Jain et al. [9] formalized "agent drift" with projected 42% reduction in task success and 3.2× increase in human intervention for long-running agents across LangGraph/AutoGen/CrewAI deployments — identifying semantic drift (deviation from intent), coordination drift, and behavioral drift as distinct failure modes. "Drift No More?" [145] evaluates drift on τ-bench across open-weight models and provides a baseline against which real deployments can compare. Bhatt et al. [10] showed goal drift is influenced by pattern-matching behaviors deeper in the context window. A harness that depends on 20+ turn coherence is implicitly depending on a specific model's multi-turn capability.

Versions of the same model are not the same model

Chen, Zaharia & Zou [1] showed GPT-4 dropped from 84% to 51% accuracy on prime number identification between March and June 2023 API versions — same endpoint, same prompts. Jiang et al. [2] showed that even 0.3% accuracy degradation can be statistically significant but goes undetected without per-sample comparison. Shi et al. [11] catalog 15 hidden failure modes in LLM systems — incorrect tool invocation, version drift, cost-driven performance collapse — all of which go undetected without structured evaluation. Xu et al. [12] showed model updates cause "negative flips" (previously correct instances become incorrect) even when aggregate metrics improve.

What a harness needs to carry

  • Model-conditional prompts and tool schemas. Parameter descriptions, examples, and prompt variants that are selected per-model, not written to a lowest-common denominator.
  • Eval gates on model change. An automated suite that runs when a model is swapped or upgraded, defining "this model works for this agent" in measurable terms.
  • Graceful degradation path. Shadow mode, gradual migration, per-session model pinning, or fallback routing so a model change is not an all-at-once, all-agents event.
  • Per-model drift observability. Tool call success rate, prompt compliance, compaction quality, multi-step plan coherence — measured and compared across model versions rather than surfaced as "the agent seems worse now."

Without these, the model is effectively pinned by implicit behavior assumptions even when the gateway config claims otherwise.
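An eval gate that catches Xu et al.'s negative flips must compare per-sample, not in aggregate. A minimal sketch, assuming eval results are available as `{case_id: passed}` mappings (the shape and threshold are illustrative):

```python
def negative_flips(incumbent: dict, candidate: dict) -> list:
    """Cases the incumbent model passed but the candidate fails.
    Aggregate accuracy can improve while these still occur."""
    return [cid for cid, ok in incumbent.items()
            if ok and not candidate.get(cid, False)]

def gate_model_swap(incumbent: dict, candidate: dict,
                    max_flip_rate: float = 0.01):
    """Refuse a model swap whose per-sample regressions exceed the
    tolerance, regardless of aggregate score. Sketch; threshold is
    a policy choice, not a recommendation."""
    flips = negative_flips(incumbent, candidate)
    rate = len(flips) / max(len(incumbent), 1)
    return rate <= max_flip_rate, flips
```

In the example below the candidate has the same aggregate accuracy (3/4) as the incumbent but still fails the gate, because one previously-passing case regressed.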


11. Transport adaptation and client interaction

Primitive: Touch point (harness ↔ client). Touch points: §22 Identity (transport-specific auth → canonical user), §3 Session durability (at-least-once delivery).

The harness speaks an internal API (sessions, events, tool calls). The outside world speaks many protocols (HTTP/REST, Slack events, email, webhooks, A2A [112], MCP, message queues). Something has to map between them. That something is usually called the transport layer or adapter layer, and its design is a first-class axis.

Options

Single-transport harness. The harness exposes exactly one inbound protocol (typically HTTP). Everything else is the caller's problem. Simplest to deploy; forces every integration partner to build an HTTP client; no Slack, email, or A2A without an external translator.

Pluggable transport adapters. The harness core is transport-agnostic. Adapters translate external events into session operations: a Slack event becomes resume(session) on the session mapped from thread_ts; an A2A task becomes create(session) with the A2A task_id as the external correlation. Strong for heterogeneous integrations; requires a stable core session-operations contract.

Transport-native harnesses. A separate harness binary per transport, each with its own session model. Works for small surfaces; duplicates the hard parts (durability, recovery, policy) per transport.

Client observation patterns

Independently of how events arrive, clients have to observe session progress. Four common patterns:

| Pattern | Latency | Reconnect | Failure mode |
|---|---|---|---|
| SSE streaming | Low | Client reopens; harness replays from cursor | Intermediate proxies drop long-held connections |
| WebSocket | Low, bidirectional | Manual | More moving parts than SSE |
| Polling with cursor | High (poll interval) | Trivial | Wasted calls when idle |
| Webhook callbacks | Medium | Harness retries | At-least-once delivery; caller needs idempotency |

SSE [111] is the common default for browser-side clients; webhooks for server-to-server integrations; polling as a fallback when SSE proxies misbehave. Harness-to-transport multiplexing (e.g., same session visible via both SSE to the UI and webhook to a back-office system) is a common but non-trivial requirement.

Failure modes

  • ID mapping loss. The transport's native ID (Slack thread_ts, A2A task_id, email Message-ID) must map deterministically to the harness's session_id. Loss or ambiguity produces either duplicate sessions or cross-thread bleed.
  • At-least-once vs exactly-once. Webhooks retry. If the adapter writes to the session on each retry, duplicate events appear unless the adapter or session store deduplicates.
  • Out-of-order events. Slack and email can deliver events out of order. The adapter must either enforce ordering (with latency cost) or the harness must accept unordered events (with state-reconstruction cost).
  • Transport-specific auth. Each transport carries its own identity model (Slack user ID, email sender domain, A2A agent card). Mapping to the harness's canonical user identity is a contract with the identity fabric (§22), not a transport-internal concern. Every adapter performs a bounded "trust this transport assertion, ask the identity fabric to resolve it to (user_id, tenant_id, scopes)" step before any session operation. If this contract is implicit or unverified, a transport-layer compromise (Slack token leak, spoofed email header) directly grants harness authority. Slack-to-canonical and email-to-canonical mappings must be stored in the identity fabric, not in adapter-local tables, so revocation and right-to-erasure cascade correctly.
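The deterministic ID-mapping requirement has a simple canonical form: derive the session ID from the transport's native ID via a namespaced UUID, so the same Slack `thread_ts` always resolves to the same session and webhook retries converge instead of forking duplicates. A minimal sketch (the namespace string is illustrative):

```python
import uuid

def session_id_for(transport: str, tenant_id: str, native_id: str) -> str:
    """Deterministic transport-ID -> session-ID mapping (sketch).
    uuid5 gives a stable, collision-resistant derivation, so no lookup
    table is needed on the create path and retries are naturally
    idempotent at the session-identity level."""
    ns = uuid.uuid5(uuid.NAMESPACE_URL, f"harness:{tenant_id}:{transport}")
    return str(uuid.uuid5(ns, native_id))
```

Scoping the namespace by tenant and transport prevents cross-tenant and cross-transport collisions even when native IDs (e.g. Slack timestamps) repeat.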

When each applies

Pluggable adapters pay off as soon as two transports exist; before that, the abstraction is overhead. Single-transport is the right default for internal-only tools. Transport-native is a sign of organizational divergence more than a design choice.


12. Execution mode and session mode

Primitive: Harness. Touch points: §13 Scheduler (scheduled mode), §7 Context management (persistent sessions), §6 Concurrency (hybrid mode).

A harness may drive sessions four fundamentally different ways, and a given agent may be any of them:

| Mode | Trigger | Typical use |
|---|---|---|
| Request-driven | User or external system calls the harness | Classic chat, coding agent on a ticket |
| Scheduled | Cron/interval fires a tick | Monitoring, periodic reports, digest generation |
| Hybrid | Both of the above | A support agent that answers on demand and proactively checks ticket staleness nightly |
| Event-driven | External event (webhook, queue message) fires | Reactive agents: "when this PR is opened, review it" |

Why execution mode is a design axis

Request-driven is the simplest: the harness processes work when a client asks. Scheduled is harder — there's no client to call /resume, so the platform must self-heal on crash. Hybrid is harder still: a user-initiated session and a scheduled tick for the same agent can collide on external state (§6). Event-driven adds ingestion, filtering, and dedup as new concerns.

Scheduled and event-driven modes are what "autonomous agents" usually means in production. They need:

  • A place to store "what's scheduled for when" (agent scheduler state).
  • Agent-level locking to prevent duplicate tick execution across replicas.
  • Tick timeouts distinct from session idle timeouts.
  • Circuit breakers on consecutive failures.
  • Explicit recovery policy for in-flight ticks at harness restart.

Session mode for scheduled agents

A scheduled agent further picks between:

Persistent sessions. One session is created on first tick and reused indefinitely. The agent accumulates context: a monitoring agent remembers last week's baseline and this week's drift. Requires context management (§7) because events accumulate without bound; the progress.md or equivalent summary pattern becomes load-bearing.

Ephemeral sessions. Each tick creates a fresh session. No accumulated context; the tick prompt must be self-contained. Ideal for stateless periodic jobs (scan a repo, send a digest). Old sessions age out aggressively.

The choice is a trade-off between continuity (persistent) and statelessness (ephemeral), not a technical constraint. Persistent is correct when the agent's value comes from remembering; ephemeral is correct when each tick is independent.

Failure modes

  • Runaway schedule. A one-minute interval on an agent whose tick takes two minutes produces unbounded concurrency unless the lock prevents it.
  • Replay of side-effectful ticks. A tick that posts to Slack and then crashes before recording "done" replays on recovery and double-posts. Same class of problem as §3; scheduler recovery is where it surfaces first.
  • Drift between schedule and reality. An agent scheduled "every five minutes" drifts when ticks run longer than the interval. Interval-from-start vs interval-from-end semantics matter.
  • Event-driven storms. A webhook firehose (dependabot, repo notifications) can drive orders of magnitude more ticks than cron. Admission control (§8) at the event layer is necessary.
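The interval-from-start vs interval-from-end distinction can be made concrete. A minimal sketch (times in seconds; the policy names are illustrative):

```python
import math

def next_fire(start: float, duration: float, interval: float,
              from_end: bool) -> float:
    """Next fire time for a tick that started at `start` and ran for
    `duration`. from_end=False keeps a fixed grid and skips slots the
    long tick consumed; from_end=True measures the interval from tick
    completion, so the schedule drifts but ticks never overlap."""
    if from_end:
        return start + duration + interval
    # Fixed grid: skip past every grid point the running tick covered.
    slots_consumed = math.floor(duration / interval) + 1
    return start + slots_consumed * interval
```

For the five-minute interval with a seven-minute tick: interval-from-start fires next at t=10min (the t=5min slot is skipped), interval-from-end at t=12min. Neither matches the naive "every five minutes" expectation, which is the point.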

13. Scheduler design

Primitive: Harness. Touch points: §3 Session durability (tick replay), §8 Operational (circuit breaker), §31 Cost (runaway bound).

If execution mode (§12) includes scheduled or event-driven, the scheduler itself is a subsystem with its own design axes. Treating "we'll just use cron" or "we'll use Temporal" as a single decision obscures trade-offs that surface at scale.

Placement: in-process vs external

In-process scheduler. A priority queue of next-fire-times inside the harness, a goroutine that sleeps until the next fire. Shares the harness's singletons, session store, and sandbox manager. Simplest deploy; scheduler state must survive harness restart (usually via Postgres).

External scheduler (Kubernetes CronJobs). One CronJob per scheduled agent. Familiar to Kubernetes operators; rigid (no dynamic schedules, one pod per fire, no cross-tick state without external storage); poor fit for high-frequency or many-agent workloads.

Durable execution framework (Temporal [27], Inngest [28], Restate [26]). Step-journaling frameworks that treat each scheduled fire as a durable workflow execution. Strong fit when ticks need multi-step choreography with external side effects. Adds an operational dependency and a programming model.

A 13-hour AWS Cost Explorer outage (December 2025) was caused by an autonomous AI agent with broad operator permissions and no circuit breaker [146] — concrete production evidence that the circuit-breaker bullet is load-bearing, not theoretical.

Coordination across replicas

graph LR
    classDef ok fill:#d5f5e3,stroke:#27ae60
    classDef risk fill:#f9d6d6,stroke:#c0392b

    R1[Replica 1] -->|tick fire: agent-A| LOCK{Agent lock}
    R2[Replica 2] -->|tick fire: agent-A| LOCK
    R3[Replica 3] -->|tick fire: agent-A| LOCK
    LOCK -->|only one wins| EXEC[Execute tick]

    class EXEC ok
    class LOCK risk

Multiple replicas each run their own scheduler loop. Without coordination, every replica fires every tick. Options:

  • Agent-level distributed lock (Postgres advisory, Redis). Acquired before Run(); released after. Work distributes naturally — whichever replica acquires wins. Simple and correct; doesn't prevent duplicate side effects on crash mid-tick. Two sharp gotchas: Kleppmann's well-known Redlock safety critique under network partitions [147] means Redis-based locks need careful fencing tokens; and with PgBouncer transaction pooling, pg_advisory_lock releases on connection return, so session-scoped callers must use pg_advisory_xact_lock or a session pool.
  • Shard assignment. Consistent-hash agents across replicas; only the owning replica fires. Lower coordination cost per tick; rebalance complexity on replica churn.
  • Single scheduler leader. One replica is elected leader and schedules; others execute. Reduces duplicate-fire risk to zero but creates a leader dependency.

Protecting against runaway behavior

The mechanisms most production schedulers end up needing:

  • Jitter. Random offset within some fraction of the interval to prevent thundering herd when many agents share a schedule.
  • Per-tick timeout. Explicit wall-clock cap on Run() for a tick; hard-kills on exceed.
  • Exponential backoff. Failed ticks don't retry immediately; backoff grows with consecutive failures.
  • Circuit breaker. After N consecutive failures, the agent is auto-disabled and requires operator re-enable. Prevents a broken agent from burning tokens indefinitely.
  • Self-healing on restart. At startup, detect ticks whose last_tick_end IS NULL and replay; combine with §3 step-log semantics if exactly-once execution is required.
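The backoff and circuit-breaker mechanics above fit in a few lines. A minimal sketch (constants and thresholds are illustrative policy choices):

```python
import random

def next_retry_delay(base_interval: float, consecutive_failures: int,
                     max_backoff: float = 3600.0,
                     jitter_frac: float = 0.1) -> float:
    """Exponential backoff with jitter: healthy agents (0 failures) fire
    near base_interval; failing agents back off up to max_backoff, with
    random jitter to break thundering-herd alignment."""
    backoff = min(base_interval * (2 ** consecutive_failures), max_backoff)
    return backoff * (1 + random.uniform(-jitter_frac, jitter_frac))

class TickCircuitBreaker:
    """After N consecutive failures the agent is disabled until an
    operator re-enables it; any success resets the counter."""

    def __init__(self, max_consecutive_failures: int = 5):
        self.max_failures = max_consecutive_failures
        self.failures = 0
        self.disabled = False

    def record(self, succeeded: bool) -> None:
        if succeeded:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.disabled = True  # requires explicit operator re-enable
```

The breaker's reset-on-success is deliberate: intermittent failures back off but keep running, while a persistently broken agent hits the hard stop instead of burning tokens indefinitely.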

Failure modes

  • Lock held by a dead replica. If a replica crashes while holding an agent lock, the lock must have a TTL or heartbeat to be reclaimable. Locks without expiry stall the agent indefinitely.
  • Clock skew. Distributed schedulers that use wall-clock times across replicas need NTP discipline; skew produces double-fires or missed fires.
  • Tight schedules with long ticks. A five-minute interval with a seven-minute tick produces either overlapping ticks or stretched intervals depending on semantics. Neither is usually what the operator intended.

14. Multi-agent composition and delegation

Primitive: Harness. Touch points: §22 Identity (A2A identity propagation), §6 Concurrency (pipeline races on shared state).

A single agent is the simple case. Real workloads often compose: a coding agent that chains to a review agent; a supervisor that delegates to workers; a conversational agent that hands off to a specialist via A2A [112]. How composition works is a design axis.

Composition shapes

graph TB
    classDef primary fill:#d6eaf8,stroke:#2980b9
    classDef sub fill:#fdebd0,stroke:#e67e22

    subgraph "Sequential pipeline"
        S1[Agent A] --> S2[Agent B] --> S3[Agent C]
    end

    subgraph "Supervisor / workers"
        SU[Supervisor] --> W1[Worker 1]
        SU --> W2[Worker 2]
        SU --> W3[Worker 3]
    end

    subgraph "Delegation / A2A"
        D1[Agent A] -->|"A2A task"| D2[Agent B<br/>different platform]
        D2 -->|"result"| D1
    end

    class S1,SU,D1 primary
    class S2,S3,W1,W2,W3,D2 sub

Sequential pipeline. Fixed ordering; each stage's output feeds the next. ADK's SequentialAgent, LangGraph's graph primitives, CrewAI's task pipelines [113] all implement this shape.

Supervisor/workers. A coordinating agent decides which worker to invoke for each subtask. LangGraph supervisor/swarm packages [119] and the OpenAI Agents SDK [122] are the current durable implementations; OpenAI Swarm [114] was explicitly experimental and is superseded by the Agents SDK. Flexible; harder to reason about because routing is dynamic. MAFBench [148] shows framework-level design choices alone can cause >100× latency variance and coordination success swinging from 90% to 30% depending on shape, and arXiv:2604.02460 [149] finds that single-agent LLMs outperform multi-agent systems on multi-hop reasoning under equal thinking-token budgets — a counter-cite for when not to compose.

Peer delegation (A2A). Agents on different platforms hand tasks to each other via a protocol (A2A [112], MCP-agent, ACP). Requires identity propagation (§22), trust boundaries between platforms, and agreement on task/result schema.

Parallel fan-out. Multiple agents work on subtasks in parallel; a reducer combines results. Cuts latency; multiplies cost; introduces result-merging complexity.

State sharing

Orthogonal to shape: do composed agents share state?

Shared session and sandbox. All agents in the pipeline operate on the same session, same workspace. Cheap; composable; races on external effects (§6) and prompt contamination.

Isolated sub-sessions with shared parent. Each sub-agent gets its own session; the parent coordinates. Safer; more plumbing.

Isolated sub-sessions, separate sandboxes. Full worktree isolation. Geng & Neubig [40] showed this substantially outperforms shared-workspace approaches. Highest cost; strongest safety.

Failure modes

  • Prompt pollution. A supervisor's instructions leak into sub-agents' contexts; a sub-agent's failure mode propagates to siblings. Mitigated by explicit prompt boundaries.
  • Loops. Supervisor calls Worker A which calls Worker B which calls Supervisor. Without hop limits, this is the $47K loop [50].
  • Identity confusion across A2A. When Agent A on platform X delegates to Agent B on platform Y, whose user identity does B see? Poorly specified identity propagation has produced real incidents.
  • MAST failure modes [41]. "Inter-agent misalignment" and "coordination failure" are top failure categories in multi-agent systems; neither is solved by composition frameworks alone.
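The hop-limit mitigation for delegation loops can be sketched as a small guard in the dispatch path. Everything here is illustrative (hypothetical names, not any framework's actual API): a delegation call carries the chain of agents it has already crossed, and refuses to proceed past a depth cap or into a cycle.

```python
class HopLimitExceeded(Exception):
    pass

MAX_HOPS = 8  # assumption: a task should never cross more than 8 agent boundaries

def delegate(task, target_agent, run, hop_chain=()):
    """Dispatch `task` to `target_agent`, carrying the chain of prior agents.

    `run` is the caller-supplied function that actually executes an agent;
    it must pass `hop_chain` back into any further `delegate` calls.
    """
    if len(hop_chain) >= MAX_HOPS:
        raise HopLimitExceeded("hop chain too deep: " + " -> ".join(hop_chain))
    if target_agent in hop_chain:
        # A direct cycle (Supervisor -> Worker -> Supervisor) is almost
        # always a coordination bug, not intentional recursion.
        raise HopLimitExceeded("cycle: " + " -> ".join(hop_chain + (target_agent,)))
    return run(target_agent, task, hop_chain + (target_agent,))
```

The chain doubles as an audit trail: when the limit fires, the exception message names every agent that participated in the loop.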

15. Tool dispatch by statefulness

Primitive: Harness. Touch points: §5 External connections (stateful-connection class), §6 Concurrency (stateful-resource coordination), §18 Interruption (cancellation-aware dispatch).

§5 covers MCP as one example of stateful external tools. The broader design axis is that tools come in three statefulness classes, each with a different dispatch shape.

The three classes

| Class | State location | Dispatch shape | Lifecycle |
|---|---|---|---|
| Stateless | None | Direct function call; any replica, any time | No setup/teardown |
| Stateful-connection | Client-held (harness side) | Session-scoped pool; connection survives across tool calls | Lazy open, session-scoped close |
| Stateful-resource | Server-held (external) | Call mediated by a coordinator that tracks the resource | External lifecycle |

Stateless tools. Web fetch, sandbox exec (given sandbox ID), simple REST APIs. The tool function looks up any needed handle from context and calls. No persistent state on the harness side. Scales trivially.

Stateful-connection tools. MCP (§5), WebSocket APIs, gRPC streams. A connection carries negotiated capabilities, cursors, and server-side context. The harness holds the connection for the session's lifetime; the tool function delegates to a pool.

Stateful-resource tools. Databases with session state, distributed locks, rate-limited APIs with per-key state, long-running jobs whose status is held externally. The harness doesn't hold the state — the external system does — but the harness has to correlate calls to the same resource across turns, handle partial failures, and coordinate with concurrent sessions (§6).
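A dispatcher that honours the three classes might look like the following sketch; the class names and routing are illustrative, not any harness's actual design. Stateless tools are called directly, stateful-connection tools go through a session-scoped pool, and stateful-resource tools record a correlation handle so later turns can find the external state again.

```python
from enum import Enum

class Statefulness(Enum):
    STATELESS = "stateless"
    CONNECTION = "stateful-connection"
    RESOURCE = "stateful-resource"

class Dispatcher:
    def __init__(self):
        self.pools = {}      # (session_id, tool) -> live connection
        self.resources = {}  # (session_id, tool) -> external resource ID

    def dispatch(self, tool, args, session_id):
        if tool.kind is Statefulness.STATELESS:
            return tool.fn(args)  # any replica, any time; nothing to track
        if tool.kind is Statefulness.CONNECTION:
            key = (session_id, tool.name)
            if key not in self.pools:
                self.pools[key] = tool.connect()  # lazy open, reused all session
            return tool.fn(args, self.pools[key])
        # Stateful-resource: the external system holds the state; we keep the
        # correlation ID so "is the deploy I kicked off finished?" is answerable.
        result = tool.fn(args)
        self.resources[(session_id, tool.name)] = result.get("resource_id")
        return result
```

The point of the sketch is the asymmetry: only the middle branch holds harness-side state, and only the last branch writes a correlation record that must survive harness restart.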

Why the classification matters

Each class needs a different answer to the same set of operational questions:

| Question | Stateless | Stateful-connection | Stateful-resource |
|---|---|---|---|
| Where's the state? | Nowhere | Harness pool | External system |
| What happens on harness restart? | Nothing | Reconnect | Rediscover |
| What happens on external system restart? | Next call sees fresh state | Lose pool, reconnect | Job may be lost or duplicated |
| Affinity requirements? | None | Session → replica helpful | Depends on external system |
| Backpressure | Gateway rate limits | Pool size limits | External system's limits |

A harness that dispatches all tools through a single mechanism has implicitly chosen the class of the dominant tool. Systems dominated by MCP look like the §5 analysis; systems dominated by stateless tools look simpler but still have stateful tools they haven't thought through.

Failure modes

  • Treating stateful-resource as stateless. Dispatch a long-running job, forget the job ID, lose correlation. Visible as "agent said it kicked off a deploy but has no idea if it finished."
  • Treating stateless as stateful. Pool connections to stateless HTTP APIs unnecessarily; pay complexity cost without any benefit.
  • Mixed-class tools in a single logical operation. An action composed of a stateless fetch, a stateful-connection MCP call, and a stateful-resource DB write has three different failure and recovery modes in one tool call.

16. Sandbox lifecycle and bootstrap determinism

Primitive: Sandbox fabric. Consumed by: Harness (via capability request). Touch points: §23 Storage substrate (picked by lifecycle mode), §4 Workspace persistence (recovery fidelity).

The sandbox is an execution environment with a lifecycle independent of both the session and the harness. Its lifecycle mode is a design axis; so is the determinism of its initial bootstrap.

Lifecycle modes

Always-on per session. The sandbox is created at session start and stays live until session end or idle timeout. Fastest tool latency (no provisioning hops); highest cost (idle pods accrue compute charges during LLM wait).

Hibernation / pause-resume. The sandbox is paused between turns (or between scheduler ticks). Pay only for storage when paused; cold-start penalty on resume (tens of ms for Firecracker snapshots, seconds for Kubernetes pause/unpause). Good fit for bursty interaction patterns.

Lazy provisioning. No sandbox exists until the first sandbox tool call. Appropriate for agents that often don't need a sandbox (conversational agents, MCP-only monitoring). Cold-start cost is paid by the first tool call that triggers it.

Pre-warmed pool. A pool of N ready pods absorbs burst demand for ephemeral sessions. Trades off idle cost against p99 session-start latency. E2B [115], Blaxel, Modal [116] all offer this as a managed product. GKE Pod Snapshots + the Kubernetes agent-sandbox CRD (SIG, Nov 2025) [150] are a production-grade answer that combines snapshot resume with Kata/gVisor isolation; OSDI '25 "Fork in the Road" [151] analyzes cold-start latency optimizations in production serverless systems and is directly applicable.

Per-session ephemeral. Fresh pod per session, destroyed at end. Strongest isolation between sessions; worst cold-start. Paired with §4's reproducible-bootstrap option, this is a coherent minimalist design.

Bootstrap as recovery boundary

Every non-always-on mode implicitly depends on bootstrap reproducibility. If the sandbox can be destroyed and recreated, recovery fidelity is bounded by what the bootstrap spec can reconstruct.

Determinism comes in degrees:

| Level | Example | Recovery fidelity |
|---|---|---|
| Pinned | git checkout <sha>, lockfile-pinned npm ci | High — identical workspace each time |
| Versioned | git checkout main, npm install with lockfile | Medium — drifts as upstream main moves |
| Floating | git clone main, npm install (no lockfile) | Low — whatever upstream looks like right now |
| Side-effectful | Setup script that registers with a remote service, creates credentials, or mutates shared state | Not reproducible — re-running changes external state |

Recovery after sandbox loss replays bootstrap. If bootstrap is non-deterministic, "resume" is an ambiguous operation: the agent sees a workspace that is similar to what it had, not the same. Over long-lived sessions, this drift accumulates silently.
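As a sketch, a loader could classify a declarative bootstrap spec into the levels above before deciding whether "resume" is well-defined. The spec schema here is hypothetical; the classification rules mirror the table (a 40-hex ref plus a lockfile-exact install is pinned, side effects trump everything).

```python
import re

def determinism_level(spec):
    """Classify a bootstrap spec: {"ref": str, "install": str, "side_effects": bool}.

    Hypothetical schema for illustration; a real loader would validate far more.
    """
    if spec.get("side_effects"):
        return "side-effectful"  # re-running mutates external state; never replayable
    pinned_ref = bool(re.fullmatch(r"[0-9a-f]{40}", spec["ref"]))  # full commit SHA
    locked = spec["install"] == "npm ci"  # lockfile-exact install, fails on drift
    if pinned_ref and locked:
        return "pinned"
    if locked:
        return "versioned"  # exact deps, but the ref itself can move
    return "floating"
```

A harness could refuse hibernation or lazy provisioning for anything below "versioned", since recovery after sandbox loss would silently produce a different workspace.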

Failure modes

  • Cold-start under load. Provisioning latency gets worse when the cluster is under pressure. A pool sized for average load misses burst.
  • PVC detach race on resume. Hibernated pods that resume to a different node hit the §4 PVC zone-scoping problem.
  • Bootstrap that mutates remote state. A bootstrap that creates credentials, registers webhooks, or posts "starting" to a channel re-runs those side effects on every recovery.
  • Snapshot staleness vs bootstrap determinism. Resume-from-snapshot bypasses bootstrap entirely; the determinism table above applies only to fresh boots. Snapshot-based recovery inherits whatever the snapshot captured, including expired credentials, stale DNS caches, and now-invalid session tokens. A distinct failure mode from non-reproducible bootstrap.
  • Pool exhaustion. Pre-warmed pools without admission control collapse under unexpected demand; pods that go to the pool unwashed from a prior session leak state.

17. Session branching and fork

Primitive: Persistence fabric (branch / fork substrate). Consumed by: Harness. Touch points: §23 Storage (CoW / snapshots make forking cheap), §4 Workspace persistence (branching needs substrate).

Section 3 treated the session as a linear append-only log. Production harnesses have moved past this: Claude Code [117], Cursor [118], Replit Agent [33], LangGraph [119], and OpenAI's Codex CLI [120] all ship some form of rewind, checkpoint, or fork — turning the session from a list into a tree (or at least a log with a movable head). That changes the data model, and it creates trade-offs that are independent of plain durability.

Options

Linear-only. The trajectory is immutable; to back out of a bad turn, users start over. Simplest; hostile to the user when the agent takes a wrong turn after 30 useful steps.

Rewind / truncate. Jump back to an earlier checkpoint; discard the tail. Claude Code's /rewind and Cursor's per-edit checkpointing implement this. Workspace rollback is coupled to the rewind — the filesystem is restored to match. Cheap; discards whatever useful information the tail contained, which the next attempt might have benefited from.

Forkable tree. Multiple live branches from any checkpoint, all preserved. LangGraph's time-travel model treats checkpoints as a DAG with explicit branch IDs [119]; Claude Code's /fork spawns a child session from a shared history point [117]. Maximum flexibility; requires a branch manager and a story for reconciling divergent sandbox/workspace state.
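The data-model shift from list to tree is small on the conversation side. A minimal sketch (hypothetical structures, not LangGraph's or Claude Code's actual model): checkpoints carry parent pointers, and a branch is just a movable head into shared history.

```python
import itertools

class SessionTree:
    def __init__(self):
        self._ids = itertools.count()
        self.events = {}  # checkpoint_id -> (parent_id, payload)
        self.heads = {}   # branch_name -> checkpoint_id

    def append(self, branch, payload):
        cid = next(self._ids)
        self.events[cid] = (self.heads.get(branch), payload)
        self.heads[branch] = cid
        return cid

    def fork(self, from_checkpoint, new_branch):
        # Conversation fork is cheap: a new head pointing at shared history.
        # Workspace fork is the expensive part and lives in the storage substrate.
        self.heads[new_branch] = from_checkpoint

    def history(self, branch):
        cid, out = self.heads[branch], []
        while cid is not None:
            parent, payload = self.events[cid]
            out.append(payload)
            cid = parent
        return list(reversed(out))
```

Note what is absent: nothing here branches the sandbox or workspace. That asymmetry is exactly the workspace-branching problem discussed next.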

The workspace-branching problem

graph LR
    classDef easy fill:#d5f5e3,stroke:#27ae60
    classDef hard fill:#f9d6d6,stroke:#c0392b

    CK[Checkpoint]
    CK --> B1[Branch A: conversation]
    CK --> B2[Branch B: conversation]
    CK --> W[Workspace / sandbox state]
    W --> C{Shared or cloned?}
    C -->|shared| RACE[Branches interfere]
    C -->|cloned| COST[Clone cost per fork]

    class B1,B2 easy
    class RACE,COST hard

Branching conversation state is cheap — it's just a tree of event log IDs. Branching workspace and sandbox state is not. The substrate choices from §23 matter: CoW overlays and content-addressed snapshots make forking cheap; PVCs don't. Forks that share the workspace risk interference; forks that clone the workspace pay the clone cost (which is bounded by the storage substrate, not the harness).

Failure modes

  • External state divergence. Branches diverge from external state (git remotes, issue trackers, MCP server-side cursors) that can't be branched. An agent running on branch A may push a commit that branch B then tries to push and conflicts with.
  • Merge-back is genuinely unsolved. Conversation merge is not semantically meaningful in most cases; file merge is, but the agent has to be the thing that performs it, not the harness.
  • Orphan branches as cost leaks. Every abandoned branch holds workspace, sandbox, and session log resources until retention policy cleans it up. Without a branch TTL, orphan accumulation is real.
  • Checkpoint granularity. Too fine (one per edit) produces snapshot storms; too coarse (one per user turn) loses intra-turn recovery points.

When each applies

Rewind alone is enough for interactive coding. Forkable trees pay off when users want to explore alternative approaches in parallel, when evaluation pipelines need to replay a session with a changed prompt, or when the harness is itself an experimentation platform. Linear-only is defensible for strictly transactional workloads (a digest agent, a single-shot classifier).


18. Interruption, cancellation, and steering semantics

Primitive: Harness. Touch points: §30 Policy / HITL (pause enables approval), §15 Tool dispatch (cancellation must propagate).

The harness is driving a long-running loop. The user or an approval system sometimes needs to redirect it mid-flight — pause for approval, inject new information, cancel entirely, or rewind to a point and re-run with a nudge. These are runtime mechanics, separate from policy (which determines when an interrupt should happen). Conflating them produces both weak policy enforcement and weak user control.

Options

Hard cancel only. kill(session) with no state preservation; workspace and sandbox torn down on cancel. Brutal; simple; makes every user correction a context-losing event.

Cooperative pause + resume. The harness checks an interrupt flag at step boundaries (between LLM calls, tool calls, sandbox exec); on interrupt, serializes state, awaits a resume command. LangGraph's interrupt() implements this as a first-class graph primitive [121]; OpenAI Agents SDK exposes RunHooks / AgentHooks as pause points [122]. Requires every long-running step to be cancellation-aware — in practice, this means the tool dispatch layer has to cooperate.

Queued steering. User input is appended to the next turn's context without pausing the current step. Devin's "Ask" mode [123] does this: the user can ask questions or redirect without interrupting the task. Fast; the current step may still commit a stale decision before it sees the steering message.

Resume-at-point. A variant of fork (§17) where the user sends a new instruction and rewinds to an earlier point; the harness replays forward from there with the new instruction in context.
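Cooperative pause can be sketched in a few lines: the loop checks an interrupt flag only at step boundaries, records its position on pause, and re-enters from the saved step on resume. This is a hedged illustration with hypothetical names, not LangGraph's interrupt() or any SDK's actual mechanism.

```python
import threading

class CooperativeLoop:
    def __init__(self, steps):
        self.steps = steps                 # each step: one LLM call or tool call
        self.interrupt = threading.Event() # set by user / approval system
        self.saved = None                  # serialized position on pause

    def run(self, start=0):
        for i in range(start, len(self.steps)):
            if self.interrupt.is_set():
                self.saved = i             # checkpoint: resume re-enters here
                return "paused"
            self.steps[i]()                # step must itself be cancellation-aware
        return "done"

    def resume(self):
        self.interrupt.clear()
        return self.run(start=self.saved)
```

The comment on the step call is the catch: the flag check between steps is the easy part, while making each individual step honour cancellation is the cascade described below.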

The cancellation-awareness cascade

graph TB
    classDef aware fill:#d5f5e3,stroke:#27ae60
    classDef blind fill:#f9d6d6,stroke:#c0392b

    U[User cancels]
    U --> H[Harness sees flag]
    H --> L[LLM call: respects context deadline]
    H --> T[Tool call: propagates to tool]
    T --> S[Sandbox exec: respects SIGTERM]
    T --> M[MCP call: respects RPC cancel]

    class H,L aware
    class T,S,M blind

Cancellation is only as effective as its most-blind hop. An LLM call that doesn't honour its context deadline holds the session for tens of seconds past cancel. A sandbox command that ignores SIGTERM holds for the grace period. An MCP call whose server doesn't support cancellation either blocks until the RPC times out or is orphaned.

Failure modes

  • Stale steering. User message queued for the next turn becomes meaningless when the current step ships a decision the user was trying to redirect.
  • Leaked resources on hard cancel. Cancel that drops the session without unwinding leaves orphaned sandbox pods, MCP connections, and temp files.
  • Interrupt-handler as attack surface. An interrupt flag that any caller can set (not authenticated) lets a compromised tool or MCP server DoS the harness.
  • Approval-gate latency. Cooperative pause that waits synchronously for human approval holds compute for the entire wait; for overnight approvals, this is a cost and failure-mode problem (long TCP connections, LLM keep-alive, etc.).

Why a separate axis

Policy and HITL (§30) decide what requires approval. Interruption is the runtime verb that lets the approval happen. Systems that bolt approval onto a harness without a first-class pause primitive end up with either (a) hard-cancel-and-restart on every approval (expensive, user-hostile), or (b) policy checks that only fire at step boundaries convenient for the harness, not the ones the policy actually wanted.


19. Agent authoring model

Primitive: Harness. Touch points: §14 Multi-agent composition (authored capability surface), §9 Security (plugin trust boundary).

How an agent is authored, packaged, and discovered at build time. Section 2 covered the runtime lifecycle of an agent instance (singleton, registry, per-request). This is the question one layer earlier: what is the unit of agent definition, and how do end-users add new ones?

Options

Monolithic definition. One prompt, one tool list, one model binding — all chosen at deploy time. The harness API exposes "run an agent"; there is no build-time surface for adding capabilities. Simple; no user-extensibility; every new agent type is a code change.

Declarative capability bundles. Agents (or capabilities) are files on disk with a manifest the harness auto-discovers. Claude Code plugins [124] distinguish skills (auto-discovered, on-demand capability bundles), subagents (custom prompt + tool subset + model, spawned via an Agent tool), and slash commands (manual invocation), all packaged as plugins. CrewAI defines crews and agents as YAML/Python declarative objects [113]. The harness parses, validates, and dispatches to these at runtime.

Runtime-composable. Agents are code objects; composition is programmatic. OpenAI Agents SDK models delegation as handoffs — agent-to-agent transitions rendered as tools (transfer_to_refund_agent) with input filters [125]. LangGraph builds agents as node graphs at runtime. Mastra registers tools on Agent objects via createTool [126]. Maximum flexibility; only accessible to developers who can write code in the harness's language.
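To make the declarative option concrete, a capability-bundle manifest might look like the following. This is a hypothetical schema for illustration only, not Claude Code's or CrewAI's actual format; it shows the two properties the text emphasises: a tool-subset grant, and admin-only fields kept out of the author's hands.

```yaml
# Hypothetical subagent manifest, auto-discovered from a plugin directory.
name: review-pr
kind: subagent
version: 1
model: default          # resolved by the harness, not pinned by the author
prompt: prompts/review.md
tools:                  # subset grant: the plugin gets no broader tool access
  - github.get_pull_request
  - github.list_files
# Deliberately absent: hooks, mcpServers, permissionMode. Those widen the
# trust boundary and so stay admin-only, not plugin-author-settable.
```

A strict, versioned loader that rejects unknown fields is what keeps manifest schema drift (see the failure modes below) from failing silently.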

Trade-off axes

| Axis | Monolithic | Declarative | Runtime-composable |
|---|---|---|---|
| User extensibility | None | High (file drops) | High (code) |
| Validation surface | At deploy | At load | At runtime |
| Migration cost between harnesses | N/A | High (custom manifests) | Very high (custom APIs) |
| Trust boundary for user-authored capabilities | N/A | Plugin sandboxing required | Full harness code privilege |

Failure modes

  • Plugin trust boundary. A user-authored plugin that brings its own tools and prompts inherits what trust level? Claude Code deliberately forbids hooks, mcpServers, and permissionMode in plugin subagents because a plugin author is not necessarily a harness admin.
  • Manifest schema drift. Declarative formats evolve; old manifests break silently unless the loader is strict and versioned.
  • Discovery collisions. Two plugins define a /deploy slash command or a review-pr subagent with the same name. First-writer-wins is a footgun.
  • Portability tax. Porting a Claude Code subagent to CrewAI or ADK is a rewrite; the authoring format is the lock-in, not the model.

Why a separate axis

§2 is about instances; §14 (multi-agent composition) is about how they compose at runtime. Neither addresses the build-time authoring surface, which is where user-extensibility lives. A harness with strong instance-lifecycle (§2) and strong composition (§14) but no authoring model has no way for end-users to add capabilities without modifying the harness itself.


20. Tool registry, discovery, and selection

Primitive: Harness. Touch points: §10 Model capability (schema bloat hits model), §9 Security (poisoned tool descriptions), §24 Gateway (aggregation).

How the model learns a tool exists. Orthogonal to MCP as transport (§5) and to tool statefulness (§15). Every harness has to answer: which tools does the model see in any given turn, and how were they selected?

Options

Static flat list. All tools registered at agent creation, sent to the model in every system prompt. Simple; runs out at roughly 30–50 tools because of context cost and model attention (BFCL and ACEBench both show tool-call accuracy degrading as the tool list grows).

Dynamic tool-RAG. Tool schemas are embedded; per turn, the harness retrieves the top-k most relevant to the user's current message. Scales to thousands of tools at the cost of retrieval-layer error modes (wrong tool retrieved, right tool not retrieved).

Hierarchical namespacing with mount/unmount. The agent sees a pruned subset per phase: during "understand the problem," it sees read-only tools; during "make a fix," it sees write tools. Requires the harness to track phase and swap tool lists explicitly.

Model-driven tool discovery. An agent tool like list_available_tools(category) lets the model itself ask for what's available. The ceiling moves up but the model's discovery quality varies by capability (§10).
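The tool-RAG option reduces to an index over tool descriptions plus a per-turn top-k query. In this sketch, a toy bag-of-words cosine similarity stands in for a real embedding model; the catalog API and all names are illustrative.

```python
from collections import Counter
import math

def embed(text):
    # Stand-in for an embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToolCatalog:
    def __init__(self, tools):
        # tools: {name: description}; embedded once at registration time
        self.index = {name: embed(desc) for name, desc in tools.items()}

    def select(self, user_message, k=3):
        q = embed(user_message)
        ranked = sorted(self.index, key=lambda n: cosine(q, self.index[n]), reverse=True)
        return ranked[:k]  # only these k schemas reach the model's context
```

The retrieval-layer error modes named above live entirely in select(): a poorly written description scores low against every query, which is exactly the "tool exists but is never called" failure mode.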

Selection versus transport

graph LR
    classDef transport fill:#d6eaf8,stroke:#2980b9
    classDef registry fill:#fdebd0,stroke:#e67e22

    subgraph "§5 Transport"
        MCP[MCP servers]
        HTTP[HTTP APIs]
        LOCAL[Local functions]
    end

    subgraph "§20 Registry"
        CAT[Tool catalog<br/>names, schemas, descriptions]
        SEL[Per-turn selection<br/>static / RAG / namespaced]
    end

    MCP --> CAT
    HTTP --> CAT
    LOCAL --> CAT
    CAT --> SEL
    SEL --> MODEL[Model context]

    class MCP,HTTP,LOCAL transport
    class CAT,SEL registry

The gateway (§24) aggregates transport; the registry curates what the model sees. A harness can have a perfect MCP gateway and still fail at selection because it presents all 400 aggregated tools to every turn.

Ecosystem

Claude Code auto-discovers skills and commands from plugin directories with manifest conventions [124]. OpenAI Agents SDK distinguishes function tools, hosted tools, and MCP tools with different schema and guardrail pipelines [125]. Arcade [77] and Composio [78] offer hosted tool catalogs with search and discovery as products. Mastra registers tools per-agent with explicit inputSchema / outputSchema [126].

Failure modes

  • Tool-name collisions. Two MCP servers expose list_issues; the model disambiguates by guessing. Mitigated by forced namespacing (github.list_issues, linear.list_issues).
  • Schema bloat. 400 tool descriptions in the system prompt dominate the cost of every turn. Tool-RAG or mount/unmount is the scaling answer.
  • Tools that exist but the model never calls. Description is too abstract; model can't pattern-match. Visible only via selection-rate observability (§26).
  • Tool-poisoning via description. Invariant Labs' mcp-injection-experiments showed that malicious tool descriptions at the catalog layer can inject instructions into the model through the system prompt, bypassing user prompts.

21. Artifact surface

Primitive: Persistence fabric (blob store). Consumed by: Harness. Touch points: §4 Workspace persistence (TTL race), §33 Audit (lineage / right-to-erasure).

An artifact is a generated file, image, build output, or structured result that is a user-facing deliverable — distinct from the workspace (§4), which is internal state. Replit's agent markets CSVs, PDFs, slide decks, and Markdown as first-class deliverables alongside the built app [33]; OpenHands exposes /download_files and event-stream file-artifact capture [127]; OpenAI's Responses API returns images, files, and structured outputs as attached artifacts distinct from message content [128]; E2B sandboxes provide a downloads API for files produced during execution [115].

Options

In-workspace-only. All outputs live under /workspace; the client must know which paths are deliverables and fetch them directly. Cheap; ambiguous — the client has no structured signal for "what's a result versus scratch."

Artifact events on the session log. The agent emits an explicit artifact event (blob ID + MIME type + filename + description). The client subscribes. Clean separation; requires a blob store and a retention policy independent of workspace. Retention of artifacts often outlasts workspace (the workspace is torn down; the PDF the agent produced is still the user's deliverable).

Typed artifact contract. The harness defines a taxonomy (image, file, table, structured object, chart) with per-type renderers at the client. Enables richer UIs — a chart artifact is rendered, not just downloaded. Couples the harness to artifact taxonomy; adds a schema surface to keep stable.
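An artifact event along the lines the options describe might carry the following fields. This is a sketch with hypothetical field names, combining the blob-ID/MIME/description payload from the event option with the lineage fields motivated next.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ArtifactEvent:
    blob_id: str        # content address in the blob store
    mime_type: str      # drives the client's per-type renderer
    filename: str
    description: str
    # Lineage: enough to reproduce, audit, or selectively erase the artifact
    # after the workspace that produced it is gone.
    session_id: str = ""
    checkpoint_id: str = ""
    tool_call_id: str = ""

evt = ArtifactEvent(
    blob_id="sha256:placeholder", mime_type="application/pdf",
    filename="report.pdf", description="Quarterly summary",
    session_id="ses_abc", checkpoint_id="ckpt_42", tool_call_id="call_9",
)
```

Freezing the dataclass reflects the append-only nature of the session log: an emitted artifact event is a fact about what was produced, not mutable state.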

Lineage and reproducibility

graph LR
    classDef art fill:#d5f5e3,stroke:#27ae60
    classDef meta fill:#fdebd0,stroke:#e67e22

    ART[Artifact: report.pdf]
    ART --> CK[Checkpoint §17 ckpt_42]
    ART --> SESS[Session §3 ses_abc]
    ART --> TOOL[Tool call: pandoc_convert]
    ART --> IN[Input: draft.md]

    class ART art
    class CK,SESS,TOOL,IN meta

A serious artifact system records which checkpoint produced the artifact, from which inputs, via which tool call. That's the audit trail (§33) and the reproducibility story. Without it, an artifact is an opaque blob whose provenance is lost as soon as the session ages out.

Failure modes

  • Workspace TTL racing artifact upload. Gitpod issue #9544 [32] is the classic: workspace timeout fires before the final sync completes and the user's deliverables are gone. Artifact pipelines that piggy-back on workspace storage inherit this race.
  • Unbounded artifact storage. Without retention policy, generated artifacts accumulate. At per-user scale this is a cost leak; at multi-tenant scale it's also a compliance exposure (§33).
  • Artifact without attribution. An artifact with no lineage metadata cannot be reproduced, audited, or selectively deleted under right-to-erasure.
  • Client-server artifact divergence. Typed artifact contracts that evolve independently of client renderers produce "unknown artifact type" user-facing errors.

Part II — Ecosystem around the harness

The axes of Part I (§§1–21) describe choices that live inside the harness runtime. Part II covers the surrounding systems the harness integrates with. These are independently versioned, usually owned by different teams, and usually the place where production AI platforms actually fail — not inside the harness itself, but at the seams between the harness and the systems it depends on.


22. Identity, authorization, and delegation

Primitive: Identity fabric (first-class — IdP + token broker + ABAC/RBAC engine). Consumed by: all three other primitives. Touch points: §3 Session durability (sessions are identity-owned), §11 Transport (canonical-user mapping), §30 Policy (caller attributes feed decisions), §31 Cost (tenant attribution), §32 Memory (per-tenant partitioning), §33 Audit (attribution + erasure cascade), §9 Security (ambient authority).

The identity fabric is the substrate on which tenancy, resource ownership, isolation guarantees, sharing semantics, audit attribution, and cost chargeback are all built. It resolves every caller into a tuple (user_id, tenant_id, agent_id, workload_id, scopes, attributes) that the other three primitives consume on every operation. This section has two slices: the authorization slice (tokens, delegation, OBO, SPIFFE) covered first, and the substrate slice (tenancy, resource-identity association, isolation, sharing, lifecycle, ABAC/RBAC engine) covered second. Systems that build only the first end up with over-scoped tokens, ambient authority, and unclear audit trails; systems that build only the second cannot express "a user acting on behalf of another tenant's shared resource." Both are required.

Identity layers (authorization slice)

The harness has three identities to reconcile on every request: the user whose intent is being served, the agent acting on the user's behalf, and the workload (the process or pod making the actual call).

graph TB
    classDef user fill:#d6eaf8,stroke:#2980b9
    classDef agent fill:#fdebd0,stroke:#e67e22
    classDef workload fill:#d5f5e3,stroke:#27ae60

    U[User identity<br/>human, SSO, MFA]
    A[Agent identity<br/>name, version, capabilities]
    W[Workload identity<br/>pod, process, SPIFFE ID]

    U -->|"on-behalf-of"| A
    A -->|"runs as"| W
    W -->|"calls"| EXT[External systems]

    class U user
    class A agent
    class W workload

Authorization options

Bearer tokens with broad OAuth scopes. The default. The user authorizes the agent once with broad scopes (repo, email:send), the agent reuses the same token for every action for the lifetime of the session. Simple; poor blast-radius control. GitHub's fine-grained PATs [62] and Google's one-time consent screens are the response to this pattern.

Short-lived, narrowly scoped capability tokens. Each tool call mints a token with exactly the permissions needed and a short TTL (minutes). Requires a token broker in the trust plane and tools that accept scoped tokens. Much better blast radius, much more plumbing. Biscuit [63] and Macaroons [64] are the formal versions; practical systems tend to use short-lived OAuth JWTs with tight aud and scope.

On-behalf-of (OBO) flows. The user delegates to the agent, the agent delegates to a specific downstream service, the downstream service sees a token with both identities attested. Required when the user's authorization must propagate end-to-end (e.g., a user-scoped query against a data warehouse). RFC 8693 [65] formalizes token exchange; Microsoft's Entra OBO flow is the most common implementation. IETF draft-oauth-ai-agents-on-behalf-of-user-00 [152] extends OAuth 2.0 to encode user + agent + client identity in a single delegated token — the standards-track answer to the "agent as first-class identity" problem. Okta for AI Agents (Sept 2025 EA) [153] and its Cross-App Access primitive make the pattern a shipping product, not just research.

Workload identity (SPIFFE/SPIRE). The pod proves what it is to the trust plane without a shared secret. Essential when the harness runs in Kubernetes and needs to distinguish "real harness replica" from "something else in the cluster" [66]. Orthogonal to user identity — both are needed.
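For the OBO option, the shape of an RFC 8693 token-exchange request is worth seeing concretely. The sketch below builds the form parameters the spec defines (the grant-type and token-type URNs are RFC 8693's actual values); the helper itself, and the endpoint and token values it would be sent to, are placeholders.

```python
def token_exchange_params(subject_token, actor_token, audience, scope):
    """Form body for an RFC 8693 token exchange: user delegation + agent actor."""
    return {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": subject_token,   # the user's delegation to the agent
        "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "actor_token": actor_token,       # the agent's own attested identity
        "actor_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "audience": audience,             # tight aud: one downstream service only
        "scope": scope,                   # exactly the permissions this call needs
    }
```

The downstream service then sees a token whose claims attest both the user (subject) and the agent (actor), which is the end-to-end propagation the OBO paragraph requires.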

Tenant model

The tenancy model is the coarsest resource-ownership dimension. Every resource in the persistence fabric (sessions, workspaces, memory, artifacts, audit entries, cost records) either has a tenant attribute or is implicitly in a single tenant. Choices:

  • Flat user (no tenants). Each user is their own island. Simplest; works for single-org consumer tools and dev sandboxes. Breaks the moment a user needs to share a session with a colleague or a billing admin needs to roll up costs.
  • User + team. Users belong to one or more teams; resources can be owned by a user or by a team. Fits small orgs, SaaS tiers. Adequate for most B2B early-stage products. Identity fabric resolves "does Alice have access to team-X's sessions?" via team membership.
  • User + org + project. Three-level hierarchy: user belongs to org, org contains projects, resources live in projects. The default enterprise model (Stripe, Linear, GitHub orgs). Supports per-project isolation inside a shared org plus org-level billing.
  • Hierarchical ABAC attributes. No fixed levels; tenancy is an emergent property of attribute rules ("user has dept=legal and resource has classification=privileged"). AWS IAM's [184] resource tags and condition keys, GCP IAM's [185] resource hierarchy, and Cedar [101] / Cedar-Agent all natively support this. Maximum flexibility; steep policy-authoring and audit cost.

Trade-off. Isolation granularity trades directly against sharing flexibility. Flat-user is trivial to isolate but cannot share at all. Three-level hierarchies are clear to reason about but calcify when the real org shape is a matrix. ABAC handles any topology but makes "who can see what?" undecidable by inspection — only the policy engine knows.

Resource-identity association

Every persistence-fabric resource has to decide how identity associates with it. The design space:

  • Owner-only. One user is the owner, no one else sees the resource. The simplest model; drives most individual-user session stores.
  • Owner + ACL. Owner plus an explicit access-control list (user IDs or group IDs with roles: read, write, admin). Google Docs–style sharing.
  • Attribute-based (ABAC). No explicit list; access is computed by a policy against caller attributes and resource tags. Sessions tagged {tenant: acme, project: alpha, classification: internal} are visible to callers with matching attributes. Scales better than ACLs for large orgs; opaque without tooling.
  • Per-tenant logical partitioning. All resources share a schema but carry a tenant_id column (or Pinecone-style namespace, or pgvector row with a tenant filter). Every query is rewritten to filter on tenant. Cheap; one SQL bug away from a cross-tenant data breach.
  • Per-tenant physical isolation. Each tenant gets a separate database, schema, or cluster. Strongest isolation; highest operational cost; harder to roll up cross-tenant analytics or shared admin views.

Each resource class (sessions, workspaces, memory indexes, artifacts, audit log, cost records) makes this choice independently. A common production shape is: sessions and workspaces as owner + ACL; memory indexes as per-tenant namespace (logical); audit log as a shared physical store with tenant-tagged rows; artifacts in a single object store with per-object ACLs enforced at the gateway.
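The load-bearing invariant in logical partitioning is that the tenant filter is derived from the verified caller, never from request input. A minimal sketch of that guard (names such as `scoped_query` are illustrative, not from any library; real code would use bound parameters, not string interpolation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Caller:
    user_id: str
    tenant_id: str   # from a verified JWT claim, never from a request header

class TenantScopeError(Exception):
    pass

def scoped_query(caller: Caller, table: str, filters: dict) -> str:
    """Build a query whose tenant filter comes from the verified caller.
    Caller-supplied filters may not name a tenant at all."""
    if "tenant_id" in filters:
        raise TenantScopeError("tenant_id comes from identity, not from filters")
    clauses = [f"tenant_id = '{caller.tenant_id}'"]  # real code: bind parameters
    clauses += [f"{k} = '{v}'" for k, v in sorted(filters.items())]
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)
```

The point is structural: there is no code path that produces a query without the tenant clause, so "one SQL bug away" becomes "one bypass of the wrapper away" — a much smaller surface to review.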

Failure modes.

  • Cross-tenant retrieval bleed-over. Vector memory, RAG indexes, and embeddings stores are the highest-risk category. A top-k query against a shared index without an enforced tenant_id filter returns another tenant's data. Pinecone namespaces [103] and pgvector row-level security [104] are the standard mitigations; either the query is enforced to include the filter, or the index itself is per-tenant.
  • Implicit tenancy from request headers. If the tenant ID is read from an HTTP header without cryptographic binding (JWT claim, mTLS client cert), a header-spoof reaches any tenant.
  • Audit entries without owner. An event log that records "session X did Y" without a verified identity tuple is worthless for forensic analysis.
  • Artifacts detached from owner. An artifact uploaded by a session needs to inherit the session's owner and ACL. Systems that skip this produce orphan artifacts that survive user deletion.
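The retrieval bleed-over failure can be made structurally impossible by keying the index on the verified tenant rather than trusting each query to carry a filter. A toy sketch of the namespace pattern (the `NamespacedIndex` class is hypothetical; production systems would use Pinecone namespaces or pgvector row-level security):

```python
class NamespacedIndex:
    """Toy vector index keyed by tenant namespace, mimicking the Pinecone-style
    mitigation: the harness passes the verified tenant_id, so a query cannot
    omit the namespace or name another tenant's."""
    def __init__(self):
        self._ns = {}   # tenant_id -> list of (item_id, vector)

    def upsert(self, tenant_id, item_id, vector):
        self._ns.setdefault(tenant_id, []).append((item_id, vector))

    def query(self, tenant_id, vector, k=3):
        def sqdist(v):
            return sum((a - b) ** 2 for a, b in zip(v, vector))
        hits = sorted(self._ns.get(tenant_id, []), key=lambda p: sqdist(p[1]))
        return [item_id for item_id, _ in hits[:k]]
```

Because the namespace is a positional argument supplied by the harness, a prompt-injected query string has no channel through which to reach another tenant's vectors.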

Isolation guarantees

Given a chosen tenant model, isolation lives on a spectrum:

| Tier | Mechanism | Blast radius on bug | Cost | Fit |
|---|---|---|---|---|
| Logical | Shared DB + tenant column + enforced filter | One missing WHERE clause leaks everything | Lowest | Dev, internal tools, low-stakes SaaS |
| Namespace | Separate schemas / keyspaces / Pinecone namespaces per tenant | Misrouted query hits wrong tenant | Low | Most B2B SaaS |
| Physical | Separate DBs / clusters / accounts per tenant | Bug stays in one tenant | High | Regulated (HIPAA, PCI), large enterprise, sovereign-cloud |

A harness serving a mix of tenants often uses different tiers per resource class: audit logs physically sharded for regulatory reasons, sessions logically partitioned for cost, high-sensitivity memory physically separated per tenant. The cost/isolation trade-off is explicit and per-resource.

Sharing semantics

Real products need sharing. The design questions:

  • Cross-user session share. Can Alice share a running session with Bob on her team (read-only live view; collaborative editing; takeover)? Implemented as an ACL entry or an attribute grant; requires the harness to accept multiple concurrent viewers (§11 transport) without privilege escalation.
  • Cross-tenant catalogs. A shared marketplace of agents, tools, or prompts readable by all tenants, writable only by a platform tenant. Implemented as a "public" visibility flag or a platform-tenant-owned resource with a universal read grant.
  • Delegation chains. Alice grants Bob access to session X for one week; Bob invokes an automated agent on the session. Whose identity does the agent run as — Alice's, Bob's, or a third delegated identity? Systems that collapse the chain lose attribution and cascade revocation.
  • Invite / accept model. Grants pending on the invitee accepting. Avoids accidental exposure; adds one more state machine.
  • Revocation semantics. When a grant is revoked, do in-flight sessions terminate, continue read-only, or finish their current turn? Revocation latency is a design choice (synchronous hard-kill vs. token-TTL expiry).

Sharing semantics cut across every resource class. A platform that gets them right for sessions but not for memory produces "I revoked Bob but his session can still query my memory" — a cross-tenant leak with a paper trail.
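The revocation choices above reduce to a small state check the harness can run at each turn boundary. A hedged sketch with illustrative names (the mode taxonomy mirrors the three options in the revocation bullet):

```python
from dataclasses import dataclass
from enum import Enum

class OnRevoke(Enum):
    HARD_KILL = "hard_kill"      # synchronous kill of in-flight sessions
    READ_ONLY = "read_only"      # keep the live view, drop writes
    FINISH_TURN = "finish_turn"  # current turn completes, then the session stops

@dataclass
class Grant:
    grantee: str
    expires_at: float   # epoch seconds; the token-TTL expiry path
    revoked: bool = False

def access_at_turn_boundary(grant: Grant, now: float, mode: OnRevoke) -> str:
    """What a shared session may do when it reaches a turn boundary."""
    if not grant.revoked and now < grant.expires_at:
        return "full"
    if grant.revoked and mode is OnRevoke.READ_ONLY:
        return "read_only"
    if grant.revoked and mode is OnRevoke.FINISH_TURN:
        return "finish_then_stop"
    return "terminated"   # hard-kill, or the grant simply expired
```

Checking at turn boundaries rather than per token keeps revocation latency bounded by turn length — the same bound §18 relies on for interruption.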

Identity lifecycle and cascade

Identity is not static. The cascades matter more than the happy path:

  • User deletion → resource cascade. Deleting a user must cascade to their sessions, owned workspaces, memory entries, artifacts, cost records, and audit entries — modulated by retention rules (§33: audit entries may be required to persist for N years under tamper-evident chaining; erasure becomes tombstone + proof of deletion).
  • Tenant migration / re-parenting. Acme is acquired by Globex; all Acme resources must re-parent under Globex's tenant. Rare but load-bearing when it happens; hard to retrofit if tenancy was implicit.
  • User re-authentication after revocation. After a token is revoked, in-flight operations need to either halt at the next boundary or re-authenticate. Long-running scheduled agents (§12) are the hard case.
  • Agent decommission. An agent version is retired; its workload identity should no longer mint new tokens, but sessions still referencing it must resolve for audit.
  • Right-to-erasure scope (§33). GDPR erasure is a graph traversal from a user to every resource the identity fabric has attributed to them. Without a canonical ownership graph, erasure is incomplete; with one, it is a batch job.
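Erasure-as-graph-traversal reduces to a BFS once a canonical ownership graph exists. A minimal sketch (the graph shape and class names are assumptions for illustration):

```python
from collections import deque

def erasure_plan(owner_graph, user, retained_classes=frozenset({"audit"})):
    """BFS the ownership graph from a user; return (delete, tombstone) sets.
    owner_graph maps node -> list of (child_id, resource_class). Classes in
    retained_classes (e.g. tamper-evident audit chains, per §33) are
    tombstoned with proof of deletion rather than erased outright."""
    delete, tombstone = set(), set()
    seen, queue = {user}, deque([user])
    while queue:
        node = queue.popleft()
        for child, cls in owner_graph.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            (tombstone if cls in retained_classes else delete).add(child)
            queue.append(child)
    return delete, tombstone
```

With this in place, GDPR erasure is a batch job over the plan; without the graph, every resource class needs its own ad-hoc sweep.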

Identity as audit and cost primitive

Every audit event carries a caller identity tuple; every cost record requires tenant attribution. These are not separate concerns from identity — they are the identity fabric's highest-volume consumers. If an audit entry is missing (user_id, tenant_id, agent_id, workload_id) it is forensically useless; if a cost record lacks tenant_id it cannot be billed. §26 (observability) and §31 (cost) both depend on the identity fabric emitting consistent tuples on every call.
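A cheap way to keep the tuples consistent is to validate them at emission time rather than discovering gaps during an investigation. A sketch (field names follow the tuple in the text; the function name is illustrative):

```python
REQUIRED_IDENTITY_FIELDS = ("user_id", "tenant_id", "agent_id", "workload_id")

def missing_identity_fields(event: dict) -> list:
    """Identity fields absent or empty on an audit/cost event. The event is
    forensically usable (and billable) only when this list is empty."""
    return [f for f in REQUIRED_IDENTITY_FIELDS if not event.get(f)]
```

Rejecting (or at least alerting on) events with a non-empty result turns "forensically useless" into a build-time property instead of an incident-time discovery.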

ABAC / RBAC substrate

The policy engine that evaluates "can this caller take this action on this resource?" is the shipping surface of the identity fabric. Options:

  • OPA / Rego [100]. Kubernetes-native, general-purpose, data-driven. Widely deployed; Rego has a learning curve.
  • AWS Cedar [101]. Typed, analyzable, verification-friendly. Cedar-Agent extends Cedar with streaming policy updates via OPAL for agent use cases.
  • Oso [186]. Authorization-as-a-library with an opinionated resource-relationship model; positioned as a ReBAC (relationship-based) alternative to OPA for app-level authorization.
  • Permit.io [187]. Authorization-as-a-service that composes OPA, Cedar, or Zanzibar-style backends behind a unified API.

This is distinct from §30 (policy / HITL for agent actions like "can the agent send this email?"). §22's policy is about resource access by identities; §30's policy is about action risk by agents. A production system runs both engines — sometimes collapsed into one Rego bundle, sometimes deliberately separated so the identity team and the agent-safety team each own their policy surface.
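The evaluation core all of these engines share can be illustrated with a toy attribute-matcher — this is not Rego or Cedar, just the shape of the decision they make:

```python
def abac_allows(policy, caller_attrs, resource_tags, action):
    """Toy ABAC core: allow when some rule lists the action and all of its
    required caller attributes and resource tags match. Real engines (OPA/Rego,
    Cedar) layer deny rules, hierarchies, and static analysis on this shape."""
    for rule in policy:
        if action not in rule["actions"]:
            continue
        caller_ok = all(caller_attrs.get(k) == v for k, v in rule["caller"].items())
        resource_ok = all(resource_tags.get(k) == v for k, v in rule["resource"].items())
        if caller_ok and resource_ok:
            return True
    return False
```

Even this toy shows why "who can see what?" is undecidable by inspection: the answer depends on the joint state of policy, caller attributes, and resource tags, none of which is visible from the resource alone.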

Failure modes

  • Ambient authority. A single long-lived token in the environment is used by every tool call. Prompt injection can trigger any action the token permits. Invariant Labs' May 2025 disclosure of the GitHub MCP "one-PAT-for-all-repos" pattern [154] is the canonical real-world example: cross-repository token scope creep turned a reasonable-looking PAT into a cross-tenant exfiltration primitive.
  • Consent fatigue. Requiring re-consent per action drives users to grant broad scopes just to avoid the dialog. Salesforce's Agentforce identity model [67] and Auth0's agent-identity work [68] explicitly address this: the agent is a first-class identity, not a proxy for the user.
  • Agent-to-agent delegation. When agent A calls agent B, whose identity does B see? If A passes through the user's token, B is now operating with the user's full authority for its entire lifetime.
  • Revocation latency. A user revoking consent doesn't invalidate in-flight tokens. Sessions can continue running on revoked authority until the token expires.
  • Silent tenant coupling. Resources that inherit tenancy implicitly (from the caller's default tenant at creation time) become unshareable without re-parenting. Make tenancy explicit at resource creation.
  • ACL / ABAC drift. Tenant or org re-org changes membership; resources retain old ACLs or tags. Periodic reconciliation required.

When each applies

Short-lived capability tokens are worth the cost when the agent's allowed actions are high-privilege or irreversible. OBO is load-bearing when downstream authorization depends on the user (data access, per-tenant ABAC). SPIFFE is standard for any multi-replica, multi-service harness. Broad OAuth scopes are acceptable for single-user, low-privilege deployments and almost nowhere else. Flat-user tenancy fits dev and single-org tools; user + team for most SaaS; org + project for enterprise; ABAC when the org shape is not a tree. Logical partitioning is the default; namespace isolation when cross-tenant bugs are unacceptable; physical isolation when compliance demands it.


23. Storage substrate beyond PVCs

Primitive: Persistence fabric. Consumed by: Harness (via workspace contract), Sandbox fabric (via volume mount). Touch points: §4 Workspace persistence (one substrate for workspace), §17 Branching (CoW enables cheap fork).

Section 4 covered PVCs as one workspace-persistence strategy. The substrate choice is a design axis in its own right, and the ecosystem has moved well past "one PVC per session."

Options

PVC (ReadWriteOnce, zone-scoped). Covered in §4. Simple; zone-scoped; slow teardown race; no native snapshot.

Copy-on-write overlay (overlayfs, btrfs, ZFS). A base image is shared read-only across sessions; per-session writes go into an overlay. Cold start is fast because the base is already cached on the node. Recovery is fast because the overlay is small. The cost is that the node must have the base; cross-node migration requires lazy fetching.

Content-addressed snapshots. The workspace is checkpointed to content-addressed storage (essentially restic or zfs send piped to object storage). Snapshots compose with idempotency — identical files across sessions dedupe. Firecracker snapshots [69] take this to the microVM level: the entire VM (memory + disk) serializes to object storage and restores in tens to hundreds of milliseconds. Firecracker + REAP prefetching (USENIX ATC 2024) [155] reports 1.04–9.7× invocation speedup over baseline snapshots and 61–96% memory footprint reduction, with concrete numbers to plan capacity against.

Lazy-clone volumes. The volume presents immediately but blocks on page fetches for files not yet pulled. Depot [70], Namespace Labs [71], and the OCI-standard Nydus/stargz-snapshotter [156] (containerd-native) report cold-start in the hundreds of milliseconds even for multi-gigabyte base images. Strong fit for per-session fresh clones of large repos. A counterpoint worth knowing: Microsoft's 2025 LLM container cold-start analysis [157] documents a 14 GB Python/CUDA stack ballooning to 900 s when FUSE-streamed — sub-second claims are highly workload-dependent.

Virtualized filesystems (FUSE / object-store-as-FS). s3fs, gcsfuse, or a custom FUSE layer presents object storage as a POSIX filesystem. Durable by construction; latency-per-call depends on the object store. Strong fit for read-heavy agents with minimal writes.

Ephemeral disk + git as state. No durable workspace at all. Each session clones fresh from git, works on a branch, pushes on completion. The filesystem is scratch; the durable state lives in the git host. Maximum operational simplicity, no workspace persistence problem — but no state persists between turns unless committed.

Trade-off axes

```mermaid
graph LR
    classDef fast fill:#d5f5e3,stroke:#27ae60
    classDef slow fill:#f9d6d6,stroke:#c0392b

    subgraph "Cold start"
        A1[Firecracker snapshot: tens of ms]
        A2[Lazy clone: hundreds of ms]
        A3[Overlay CoW: seconds]
        A4[PVC mount: seconds to minutes]
        A5[Git clone from scratch: minutes]
    end

    class A1,A2 fast
    class A4,A5 slow
```

Cold-start cost, durability guarantee, blast radius on node failure, and cost of snapshot frequency form a four-axis trade-off. No substrate is best on all four.

Failure modes

  • Node-pinned state. CoW overlays tie the session to the node that has the base cached. Rescheduling loses the advantage.
  • Snapshot pipeline lag. If snapshot cadence is every five minutes and the pod dies three minutes after the last snapshot, three minutes of work is lost regardless.
  • Object-store consistency. FUSE-backed filesystems inherit the underlying store's consistency model. S3's strong read-after-write is recent; older integrations still assume eventual consistency.
  • Cost of full snapshots. Firecracker snapshots serialize memory; a 4 GB VM produces a 4 GB snapshot. At session scale, storage cost is non-trivial.
  • Snapshot memory-state leakage. Restoring the same snapshot across tenants leaks in-memory secrets and PRNG state (documented explicitly in Firecracker's snapshot-support docs). Multi-tenant snapshot reuse requires per-tenant zeroing or re-keying.
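The snapshot-lag failure mode has simple arithmetic worth internalizing: loss is bounded by the cadence and averages half of it. A sketch (function names are illustrative):

```python
def work_lost_s(cadence_s: float, failure_offset_s: float) -> float:
    """Work lost when the workload dies failure_offset_s after the last
    snapshot; never more than one snapshot interval."""
    return min(failure_offset_s, cadence_s)

def expected_loss_s(cadence_s: float) -> float:
    """Expected loss if the failure time is uniform within the interval."""
    return cadence_s / 2.0
```

At a five-minute cadence, the example above (death three minutes after the last snapshot) loses 180 s of work; halving the cadence halves both the bound and the expectation, at the price of doubled snapshot traffic.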

24. Proxy and gateway topology

Primitive: Touch point (harness ↔ external systems). Touch points: §8 Operational (gateway SPOF), §31 Cost (gateway enforcement), §25 Network policy (egress is non-bypassable).

The harness talks to at least four external system classes: LLM providers, tool servers (MCP and otherwise), user-facing transports (HTTP, Slack, email), and infrastructure services (databases, object stores). Each class typically sits behind a dedicated gateway, and the topology of those gateways is a design decision in its own right.

The four gateway classes

```mermaid
graph TB
    classDef harness fill:#d6eaf8,stroke:#2980b9
    classDef gateway fill:#fdebd0,stroke:#e67e22
    classDef external fill:#f9d6d6,stroke:#c0392b

    H[Harness]

    H --> IG[Ingress API gateway<br/>auth, rate limit, shape]
    H --> LG[LLM gateway<br/>routing, cache, fallback, budget]
    H --> TG[Tool / MCP gateway<br/>aggregation, auth, catalog]
    H --> EP[Egress proxy<br/>credential injection, policy]

    IG --> USER[Users / transports]
    LG --> LLMS[LLM providers]
    TG --> MCPS[MCP servers]
    EP --> EXT[Arbitrary internet]

    class H harness
    class IG,LG,TG,EP gateway
    class USER,LLMS,MCPS,EXT external
```

LLM gateway

Purpose: routing across providers, model version pinning, response caching, prompt caching, retry on transient failure, automatic fallback on provider outage, per-key budget enforcement, request/response logging.

Ecosystem: LiteLLM [72], Portkey [73], Kong AI Gateway [74], Cloudflare AI Gateway [75], Helicone [76]. OpenAI-compatible interfaces have become the lingua franca (covered in §10).

Failure modes: gateway as SPOF (covered in §8); cache-key pollution when model behavior depends on implicit context; fallback routing that changes tool-call syntax mid-session.

Tool / MCP gateway

Purpose: aggregate many MCP servers behind one endpoint, unify auth, provide a catalog/discovery layer, enforce per-tool policy. Without a gateway, each session holds N direct connections; with one, N connections are multiplexed.

Ecosystem: Arcade [77], Composio [78], Cloudflare MCP gateway, Pomerium for MCP [79]. The MCP 2026 roadmap [34] treats gateway patterns as first-class.

Failure modes: gateway becomes a trust-sensitive component (it now holds OAuth tokens for many servers); gateway caches and tool-state cursors create the same staleness problems as direct MCP pooling; single-vendor lock-in on the tool catalog.

Egress proxy

Purpose: the only network path out of the sandbox. Injects credentials at request time, enforces domain allowlists, logs egress for audit, terminates TLS for content inspection (where policy permits).

Ecosystem: Squid and Envoy remain the workhorses; Pomerium and Cloudflare Zero Trust layer identity on top; mitmproxy for audit-focused deployments.

Failure modes: programs that ignore HTTP_PROXY (covered in §9); TLS cert pinning that defeats inspection; DNS tunneling that bypasses HTTP-layer controls (the Unit 42 finding [13]).

Ingress API gateway

Purpose: authenticate inbound requests, rate-limit per tenant, shape traffic by priority, enforce request-size limits.

Ecosystem: Kong, Envoy, Ambassador, cloud provider offerings. Increasingly augmented with AI-specific shaping (token-aware rate limits, prompt-size limits).

The compositional question

Each gateway class can be a separate service or collapsed with a neighbor. Common collapses: LLM gateway + egress proxy (both see external HTTP); Tool gateway + egress proxy (all tool calls become egress); ingress gateway + LLM gateway (user-facing proxy also shapes model calls). Each collapse simplifies deployment and loses a separation of concerns. The separations matter most in high-trust-boundary environments.


25. Network policy and egress enforcement

Primitive: Sandbox fabric. Touch points: §9 Security (non-bypassable enforcement), §1 Placement (makes egress real), §24 Gateway (egress proxy).

Proxies are configuration-level controls. Network policy is the substrate that makes them non-bypassable. The two together form a defense-in-depth; neither is sufficient alone.

Enforcement layers

| Layer | Mechanism | Enforcement point | Bypass vector |
|---|---|---|---|
| HTTP client | HTTP_PROXY env var | Application | Any runtime that ignores the env var |
| iptables / nftables | Kernel packet filter | Node | Root in the sandbox (shouldn't exist, but) |
| Kubernetes NetworkPolicy | CNI plugin | Pod network | Misconfigured default-allow |
| Service mesh (Istio, Linkerd) | mTLS + L7 policy | Sidecar | Sidecar bypass if not enforced |
| eBPF (Cilium, Tetragon) | Kernel syscalls | Syscall boundary | Kernel vulnerability |
| Private subnet + NAT | Network topology | Route table | Misconfigured route |

Options

K8s NetworkPolicy + default-deny. Baseline. Namespace-level pod egress restricted to named destinations. Works, but is L3/L4 only — it can permit "egress to 10.0.1.0/24" but cannot enforce "only this HTTP path."

eBPF L7 policy (Cilium). L7-aware policy at the kernel level: "can call GET /api/v1/users but not DELETE." ARMO [56] measures eBPF enforcement at 1–2.5% CPU overhead. Strong fit when application-layer controls are bypassable by prompt injection.

Service mesh with L7 policy. Istio, Linkerd, Consul. mTLS between services by default, L7 policy per-service. Mesh sidecars are themselves trust-sensitive; bypass modes include pod-spec escapes and hostNetwork: true.

Private subnet + sole egress proxy. Network topology guarantees there is no path except through the proxy. Strongest enforcement; requires cloud-level design and doesn't move with the pod.

DNS policy. Separate layer from HTTP. An allowlist of HTTP hosts is bypassable if DNS resolves to attacker-controlled IPs or if DNS-over-HTTPS is permitted. DNS tunneling [13] is a documented exfiltration channel even through "isolated" sandboxes.

The non-bypassability question

"Egress is controlled" is a spectrum, not a binary. The right question to ask is: what is the minimum capability needed to bypass? If the answer is "any unprivileged process in the sandbox," controls are bypassable. If the answer is "a kernel exploit," controls are probably sufficient for non-state-actor threat models.


26. Observability and telemetry

Primitive: Touch point (cross-cutting — all primitives emit audit / telemetry). Touch points: §27 Evaluation (traces feed eval), §33 Audit (PII in traces), §31 Cost (per-tenant attribution).

Agent observability has two regimes: infrastructure observability (CPU, memory, pod health, latency) is the same as any Kubernetes service and covered by existing tools. Agent-behavior observability — what the model decided, why, and whether it was good — is AI-specific and requires dedicated infrastructure.

Infrastructure layer

Prometheus + Grafana for metrics; OpenTelemetry for traces; Loki/Elastic for logs. Standard Kubernetes ops. The AI-specific twist: cost attribution. Token spend per agent, per session, per user, per tenant. This is derivable from the LLM gateway's logs but only if the gateway emits the identity tags.

Agent-behavior layer

OpenTelemetry has published GenAI semantic conventions [80] — still marked experimental as of 2026, with standard span names of the form gen_ai.{operation} (e.g., chat, execute_tool) [158]. A GenAI agent-and-framework sub-spec [159] was added in 2025 specifically to cover multi-agent trajectories. Vendor implementations: Langfuse [81] (with OTLP endpoint since Feb 2025 [160]), LangSmith [82], Arize Phoenix [83], Helicone [76], Braintrust [84]. Datadog v1.37 (2025) added native OTel GenAI support. Each captures:

  • The full conversation (user turns, model responses, tool calls, tool results).
  • Per-span latency and cost.
  • Trace correlation across multi-agent and multi-tool flows.
  • Replay: re-run a historical session against a different prompt or model.

Trade-offs

Sampling vs. full capture. Full capture of every prompt and response produces hundreds of gigabytes per day at modest scale. Sampling (1–10%) is cheaper but loses the rare failure the operator actually wants to see. The common compromise: full capture of errors and flagged sessions, sampled capture of successful ones.
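The common compromise fits in a few lines. A sketch with assumed field names:

```python
import random

def should_capture(outcome: str, flagged: bool, sample_rate: float,
                   rng: random.Random) -> bool:
    """Full capture for errors and flagged sessions; sampled for successes."""
    if outcome == "error" or flagged:
        return True
    return rng.random() < sample_rate
```

The flag channel matters as much as the error channel: it lets an operator (or a quality monitor) promote an interesting-but-successful session to full capture after the fact of routing, which pure outcome-based sampling cannot do.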

PII in traces. Traces contain prompts. Prompts contain user data. Trace storage becomes a data-residency and right-to-erasure concern (covered in §33). Automated PII redaction is available (Presidio, vendor-specific) but is imperfect and adds latency.

Cardinality. Tagging spans with session_id, user_id, tool_name, model_version produces high-cardinality metrics. Prometheus will choke; dedicated trace stores handle it at higher cost.

Trace lag. Exporting traces synchronously blocks the harness; asynchronously risks loss on crash. Most vendors default to buffered async with a bounded buffer.

What operators actually need

  • Tool call success rate per agent and per tool, trended over time. Primary signal for model drift (§10).
  • Turn count per task as a proxy for agent efficiency.
  • Cost per resolved task, not just cost per call.
  • Compaction events and the fraction of session preserved (§7).
  • Failure-mode taxonomy: tool errors, context overflow, permission denied, user abandonment. A per-failure-type trend line is more useful than aggregate SLO.

27. Evaluation infrastructure

Primitive: Harness (eval suite gates rollout). Touch points: §26 Observability (trace capture), §2 Agent lifecycle (gate enables canary), §10 Model capability (validates swap), §28 Model lifecycle (gates fine-tune).

An eval suite is to an agent what a test suite is to a library. Without it, every prompt change and model swap is a gamble. With it, regressions are caught before they reach users.

Eval types

Unit-level (tool call correctness). Does the agent call the right tool with the right arguments for a known task? BFCL [3] is the standard benchmark; in-house versions mirror its methodology for custom tools.

Trajectory-level (multi-turn). Can the agent complete a multi-step task? SWE-Bench [85] for coding — note that SWE-Bench Verified scores now exceed 93% on top models, so Scale's harder SWE-Bench Pro [161] (which drops top models to ~23%) is the current signal-bearing benchmark. OSWorld [86] for computer use (OSWorld-Verified was released July 2025 with evaluation-time fixes). WebArena [87] for web tasks. τ²-Bench [162] supersedes τ-Bench [88] with dual-control and telecom/banking domains. METR's [89] "time horizons" framing now has concrete 2025 numbers: GPT-5 at ~2h17m 50%-horizon, doubling roughly every 4 months in 2024–2025.

Quality judgment (LLM-as-judge). An LLM scores outputs against a rubric. Cheap, scales; biased toward verbose outputs and struggles with correctness judgments [90]. The IJCNLP 2025 position-bias study [163] and the CALM framework [164] empirically quantify judge-model choice as the dominant bias factor — direct support for cross-validating with human review. Practical systems use LLM-judge for style/tone and automated metrics for correctness.
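Position bias specifically is cheap to detect: run the pairwise judge in both orders and accept only verdicts that survive the swap. A sketch assuming a `judge(x, y)` callable that returns "first" or "second":

```python
def position_robust_verdict(judge, answer_a, answer_b):
    """Run a pairwise judge in both orders and keep only verdicts that survive
    the swap; disagreements go to human review."""
    v1 = judge(answer_a, answer_b)            # A presented first
    v2 = judge(answer_b, answer_a)            # B presented first
    unswapped = "first" if v2 == "second" else "second"
    return ("agree", v1) if v1 == unswapped else ("disagree", None)
```

This doubles judge cost but converts the dominant bias factor into a measurable disagreement rate, which is itself a useful health metric for the judge model.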

A/B testing in production. Route a fraction of traffic to variant; compare quality metrics. Ground truth for real-user behavior; expensive and slow; needs a statistical framework to avoid false positives (Optimizely, Statsig).

Shadow mode. Run a new prompt/model alongside production, log both outputs, compare offline. Risk-free for the user; expensive (doubles inference cost) and doesn't capture user-facing signal (click-through, task abandonment).

Integration with CI/CD

```mermaid
graph LR
    classDef gate fill:#fdebd0,stroke:#e67e22

    DEV[Prompt / tool change] --> OFFLINE[Offline eval suite]
    OFFLINE --> GATE1{Passes?}
    GATE1 -->|yes| SHADOW[Shadow deploy]
    SHADOW --> COMPARE[Compare vs. baseline]
    COMPARE --> GATE2{No regression?}
    GATE2 -->|yes| CANARY[Canary: 1-5%]
    CANARY --> GATE3{Quality holds?}
    GATE3 -->|yes| FULL[Full rollout]

    class GATE1,GATE2,GATE3 gate
```

Ecosystem

LangSmith, Braintrust, Promptfoo [91], Inspect AI [92], Patronus [93], Ragas [94] for RAG specifically. Vendor-neutral patterns: golden dataset in git, eval runner in CI, result published to a dashboard, merge gated on threshold.
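The merge-gate pattern reduces to comparing candidate scores against the golden baseline. A minimal sketch with an assumed absolute-regression threshold (names are illustrative):

```python
def gate_merge(baseline: dict, candidate: dict, max_regression: float = 0.02):
    """Block the merge when any metric drops more than max_regression
    (absolute) below the golden baseline; missing metrics count as 0."""
    regressions = {
        m: (baseline[m], candidate.get(m, 0.0))
        for m in baseline
        if baseline[m] - candidate.get(m, 0.0) > max_regression
    }
    return len(regressions) == 0, regressions
```

Returning the regressing metrics (not just a boolean) is what makes the gate actionable in CI output.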

Failure modes

  • Eval overfitting. Prompts tuned to pass the eval suite stop generalizing to real users. Mitigated by holding out an unseen set and rotating it.
  • Static evals for dynamic tasks. A trajectory eval run against a frozen environment (e.g., a snapshot repo) doesn't catch changes in live dependencies.
  • LLM-judge bias. The judge model's failures (verbosity bias, style bias, self-preference) leak into scoring. Cross-validated by periodic human review.
  • No trajectory replay on real sessions. The most useful eval is replaying last week's production sessions with a proposed change. This requires trace infrastructure (§26) to actually capture trajectories, which many systems skip. AgentRR [165] formalizes trajectory record-and-replay as a research area and provides a concrete protocol.
  • Benchmark contamination. Berkeley RDI [166] showed top agent benchmarks can be gamed via training-set memorization; periodic benchmark refresh or held-out private sets are a defense.

28. Model lifecycle: fine-tuning, distillation, specialization

Primitive: Harness (model routing) + external training pipeline. Touch points: §10 Model capability (base caps fine-tune), §26 Observability (trace data feeds training), §33 Audit (lawful basis for training data).

Most agents run on a foundation model via API. The capability ceiling and the cost floor are both set by that model. Training, fine-tuning, and distillation are the levers that change either one.

Options

No training (base model + prompt engineering). The default. Lowest cost, fastest iteration, maximum dependence on provider behavior (§10). Appropriate when the agent's tasks are well-covered by the base model's pretraining distribution.

Supervised fine-tuning (SFT) on task data. Curate examples of successful completions; fine-tune a smaller model to match. Reduces cost and often latency; requires labeling infrastructure and data curation. Predibase [95], Together [96], OpenAI, and Anthropic all offer managed SFT.

Preference-based tuning (DPO, KTO, RLHF). Fine-tune on pairs of (better, worse) completions. Often described as more sample-efficient than SFT for style and judgment, but empirical comparisons are contested: Xu et al. (ICML 2024) [167] show PPO outperforming DPO on reasoning and code tasks, reversing DPO's original sample-efficiency claim, with gains up to +6 pp on HumanEval. Far more complex pipeline than SFT. Session logs with user feedback (thumbs up/down, task-accepted signals) are the usual data source. Self-distillation fine-tuning (SDFT) [168] reports consistent gains over vanilla SFT specifically for agent trajectories.

Distillation. Train a small model to match a large model's behavior on in-distribution tasks. Routine at scale; cuts per-call cost by 5–50×; requires throwing more inference at the large model for data generation. Open-source distillation recipes are mature.

Per-agent adapters (LoRA, QLoRA). Instead of fine-tuning a full model, train a small adapter for each agent type. Enables per-agent specialization without per-agent full-model costs. Serving infrastructure (vLLM [97], TGI) supports multi-adapter routing.

Continuous pretraining on domain corpus. Adapt the base model to a domain (legal, medical, internal codebase) before fine-tuning. Highest cost; largest capability lift when the domain is under-represented in pretraining.

The data loop

```mermaid
graph LR
    classDef data fill:#d6eaf8,stroke:#2980b9
    classDef model fill:#fdebd0,stroke:#e67e22
    classDef eval fill:#d5f5e3,stroke:#27ae60

    SESS[Session traces] --> CURATE[Curation<br/>PII scrub, dedup, rating]
    CURATE --> DATA[Training dataset]
    DATA --> TRAIN[Fine-tune / adapter]
    TRAIN --> MODEL[Candidate model]
    MODEL --> EVAL[Eval suite §27]
    EVAL -->|passes| DEPLOY[Canary deploy]
    DEPLOY --> SESS

    class SESS,CURATE,DATA data
    class TRAIN,MODEL,DEPLOY model
    class EVAL eval
```

The loop is virtuous when every piece is in place; when one is missing, the degradation is just as self-reinforcing. Without curation, training drifts toward whatever the model already did. Without eval gates, bad data silently degrades the model. Without trace infrastructure (§26), there's no dataset to curate.
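The curation stage of the loop can be sketched as a scrub-filter-dedup pass. The regex here is a deliberately naive stand-in for a real scrubber like Presidio, and the record shape is an assumption:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # naive PII pattern, illustration only

def curate(traces, min_rating=1):
    """Scrub obvious PII, drop low-rated trajectories, dedup by content hash.
    Real pipelines use dedicated scrubbers and semantic (not just exact) dedup."""
    seen, out = set(), []
    for t in traces:
        text = EMAIL_RE.sub("<email>", t["text"])
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen or t.get("rating", 0) < min_rating:
            continue
        seen.add(digest)
        out.append({**t, "text": text})
    return out
```

Scrubbing before hashing matters: two traces that differ only in embedded PII dedupe to one record, and no raw PII ever reaches the training dataset.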

Trade-offs

  • Data privacy. Session logs contain user data. Using them for training requires opt-in, per-region handling, and scrubbing. EU AI Act and GDPR (§33) apply directly.
  • Model staleness. A fine-tuned model ages. The base model it was derived from improves; the fine-tune stays on the older base unless retrained.
  • Capability coupling. A fine-tune inherits the base model's failure modes. Distilling from GPT-4o to a smaller model inherits its tool-call quirks.
  • Vendor lock-in. Fine-tuned models on a provider's API cannot be ported. Adapters on open-source bases can, at the cost of self-hosting.
  • Evaluation overhead. Every fine-tune requires a full eval pass. Teams without §27 in place cannot safely ship fine-tunes.

When to reach for it

Fine-tuning is worth the overhead when: (a) the base model is near but not quite capable enough for the task, (b) inference cost is a hard constraint, (c) the task has enough labeled examples (thousands to tens of thousands), and (d) the team has eval infrastructure to validate the result. Before any of those, prompt engineering and tool design usually have more headroom.


29. Supply chain and package governance

Primitive: Sandbox fabric (package install occurs inside the sandbox; allowlists at the sandbox-fabric proxy). Touch points: §9 Security (compromised pkg → sandbox escape), §25 Network policy (registry proxy as sole egress).

Agents install packages, pull container images, and fetch dependencies. Every one of these is a supply-chain attack surface. Between 2021 and 2025, compromised npm and PyPI packages have caused multiple widely-publicized incidents; an agent that runs npm install on whatever the model asks for is an unusually permissive attack vector.

Options

Registry proxy with allowlist. Artifactory, Verdaccio, Sonatype Nexus. Sandbox npm install resolves through the proxy; the proxy serves a curated subset of public packages plus private ones. Supply-chain controls attach to the proxy (signing, scanning, vulnerability gates).

Malicious package detection. Socket.dev [98], Snyk, Phylum. Pre-install scanning for typosquats, install-script exfiltration, and known IoCs. Integrates as a registry proxy or npm/pip wrapper.

SBOM generation. Software Bill of Materials recording everything installed in the sandbox. Syft, Trivy [99]. Required for SOC2/FedRAMP; useful in incident response.

Container image scanning. Trivy, Grype, Clair. Scan the sandbox base image for CVEs before deployment. Continuous as new CVEs are disclosed.

Reproducible builds. Locked package versions (lockfiles), pinned base images (digest not tag), deterministic build scripts. Makes the "what did the agent actually install" question answerable.

The agent-specific risk

A general developer running npm install foo trusts that foo is what they think it is. An agent running npm install foo may have been prompt-injected into installing a typosquat, or — categorically worse — may have hallucinated the package name itself. Spracklen et al. (USENIX Security 2025) [169] found that 19.7% of LLM-recommended packages do not exist, with 205,474 unique hallucinated names across 576k samples; open-source models hallucinate at 21.7% vs 5.2% for commercial ones. The attack — register the hallucinated name, wait for an agent to install it — is distinct enough from typosquatting to deserve its own term: slopsquatting. The Shai-Hulud npm worm (CISA alert, Sep 23 2025) [170] and Shai-Hulud 2.0 (Nov 2025, Unit 42) operationalized this at scale: a self-replicating worm that used stolen npm tokens to republish infected packages, affecting >180 packages with 2.6B combined weekly downloads in the first wave and 25,000+ malicious repos in the second.

Mitigation: route all package installs through a curated allowlist; force explicit human approval for any package not on the list; log every install in a tamper-evident store (§33). For slopsquatting specifically, verify package existence and provenance before the model sees the name-to-install in its output.
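The allowlist-plus-existence mitigation can be sketched as a pure gate function that runs before the package manager is ever invoked. This is an illustrative sketch, not a specific product's API: `allowlist` and `registry` stand in for a curated proxy catalog and a real registry-index lookup.

```python
# Sketch of a package-install gate: check the curated allowlist first, then
# verify the name exists upstream, so a hallucinated ("slopsquatted") name
# never reaches npm/pip. Sets stand in for a registry-proxy API.

def gate_install(pkg: str, allowlist: set[str], registry: set[str]) -> str:
    """Return 'allow', 'needs-approval', or 'deny' for a proposed install."""
    if pkg in allowlist:
        return "allow"            # curated package: install via the proxy
    if pkg not in registry:
        return "deny"             # name does not exist upstream: likely slopsquat
    return "needs-approval"       # real but uncurated: route to human approval

allowlist = {"requests", "numpy"}
registry = {"requests", "numpy", "left-pad"}

assert gate_install("requests", allowlist, registry) == "allow"
assert gate_install("left-pad", allowlist, registry) == "needs-approval"
assert gate_install("reqeusts", allowlist, registry) == "deny"   # typo/hallucination
```

The deny-before-review ordering matters: a nonexistent name is rejected outright rather than queued, so reviewers only ever see real packages.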

Failure modes

  • Transitive compromise. The agent installs an allowlisted package; that package depends on a newly-compromised upstream. Allowlisting direct dependencies alone is insufficient.
  • Build-time code execution. npm install runs install scripts. Scanning the package source doesn't catch what the install script does at runtime. The sandbox is the last line of defense.
  • Stale allowlist. An allowlist that isn't maintained either denies legitimate work or permits dependencies long after they've been deprecated.

30. Policy, approvals, and human-in-the-loop

Primitive: Harness (policy engine is called per tool dispatch). Touch points: §18 Interruption (pause enables approval), §22 Identity (caller identity for decisions), §9 Security (enforcement arm).

An agent with autonomy has actions. A production system needs rules about which actions it can take unilaterally, which require approval, and which are forbidden. This is a policy-engine and workflow problem, not a model-capability problem.

Options

Hard-coded guardrails. The harness checks, on every tool call, "is this action allowed?" Simple; brittle; policy changes require code deploys. Any non-trivial system outgrows this within weeks.

Policy engine (OPA/Rego, Cedar, Kyverno). Externalize policy as declarative rules. OPA [100] is the Kubernetes-native default; AWS Cedar [101] is newer and has better type-safety; Kyverno is Kubernetes-focused. The harness calls out to the engine per action; the engine returns allow/deny/approve-required.

Per-action approval gates. Actions classified as high-risk (production database write, external email send, payment) require explicit human approval before execution. Implemented as a workflow queue with UI, notifications, and timeout.

Risk-scored approval. Each action is scored for risk (model confidence, action reversibility, blast radius); high scores auto-approve, medium routes to human, low auto-denies. Portkey's guardrails [73] and Nemo Guardrails [102] implement variants.

Capability scoping. The agent is granted a time-bounded capability set at session start. Actions outside the set are denied at the policy layer. Composes with §22's capability tokens.
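The risk-scored variant above can be sketched as a small pure function; the scoring fields and thresholds here are assumptions for illustration, not Portkey's or Nemo Guardrails' actual schema.

```python
# Illustrative risk-scored approval router: score an action on reversibility
# and blast radius, then map the score to allow / review / deny bands.
from dataclasses import dataclass

@dataclass
class Action:
    reversible: bool
    blast_radius: int     # 0 = sandbox-local .. 3 = production/external
    on_allowlist: bool

def route(action: Action) -> str:
    # Irreversibility is weighted as heavily as two blast-radius levels.
    score = action.blast_radius + (0 if action.reversible else 2)
    if action.on_allowlist and score <= 1:
        return "auto-approve"
    if score >= 4:
        return "deny"
    return "human-review"

assert route(Action(reversible=True,  blast_radius=0, on_allowlist=True))  == "auto-approve"
assert route(Action(reversible=True,  blast_radius=2, on_allowlist=True))  == "human-review"
assert route(Action(reversible=False, blast_radius=3, on_allowlist=False)) == "deny"
```

Keeping the decision a pure function of the action's attributes is what lets the same policy run in the harness and, non-bypassably, at an external policy engine.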

Approval workflow structure

graph TB
    classDef low fill:#d5f5e3,stroke:#27ae60
    classDef med fill:#fdebd0,stroke:#e67e22
    classDef high fill:#f9d6d6,stroke:#c0392b

    ACTION[Agent proposes action] --> POLICY{Policy engine}
    POLICY -->|allow| EXEC[Execute]
    POLICY -->|deny| ERR[Surface error to agent]
    POLICY -->|review| QUEUE[Approval queue]
    QUEUE --> HUMAN[Human reviews]
    HUMAN -->|approve| EXEC
    HUMAN -->|reject| ERR
    QUEUE -->|timeout| REJECT[Auto-reject]

    class EXEC low
    class QUEUE,HUMAN med
    class REJECT,ERR high

Trade-offs

Gate fatigue. Every gate that fires is latency the user sees and attention the reviewer spends. A system that gates too aggressively trains reviewers to click-through without reading.

Gate bypass. If gates live in the harness and the harness is compromised (or prompt-injected into skipping a call), gates are bypassable. Non-bypassable gates live in the trust plane (§1) or at the target system.

Policy drift. Policies written once and not maintained become either too permissive (drift toward user convenience) or too restrictive (drift toward CYA). Needs periodic review cadence.

Cold-path approvers. A gate that requires human approval with no on-call schedule becomes a blocker when the approver is asleep. Either the gate is 24/7-staffed or the workflow has a deterministic auto-approve or escalation path.
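The deterministic escalation path can be made explicit as a state function over elapsed wait time. The durations below are illustrative defaults, not a recommendation; the key property is that the terminal state for an unanswered high-risk gate is auto-reject, never auto-approve.

```python
# Sketch of a deterministic timeout path for a pending approval gate:
# pending -> escalated (page secondary approver) -> auto-reject.
def gate_state(waited_s: float,
               escalate_after_s: float = 900,
               reject_after_s: float = 3600) -> str:
    """Pure transition function for an approval with no human response yet."""
    if waited_s >= reject_after_s:
        return "auto-reject"      # fail closed: never auto-approve on timeout
    if waited_s >= escalate_after_s:
        return "escalated"        # notify the secondary / on-call approver
    return "pending"

assert gate_state(60) == "pending"
assert gate_state(1200) == "escalated"
assert gate_state(4000) == "auto-reject"
```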

Ecosystem

OPA and Rego for general policy; Cedar for typed policy; Pomerium for identity-aware proxying; Portkey and Nemo Guardrails for LLM-specific content filters; Permit.io for authorization-as-a-service. HITL workflows are usually hand-rolled on top of existing ticketing (Linear, Jira, Slack workflows).


31. Cost, quotas, and admission control

Primitive: Touch point (harness ↔ LLM gateway) with an identity-fabric dependency. Touch points: §22 Identity (budget attribution depends on the identity fabric; per-tenant caps require tenant resolution on every LLM call), §24 Gateway (enforcement point), §8 Operational (budget before runaway), §26 Observability (per-tenant attribution).

LLM inference is expensive enough that cost is a first-class operational concern, not a billing-system afterthought. A runaway agent can produce a five-figure bill in hours (§8). Controls need to live at the harness layer. Every control below that mentions "per-tenant" or "per-user" is a consumer of the identity fabric (§22): without a verified (user_id, tenant_id) on every LLM call the gateway cannot enforce the right cap and the cost record cannot be billed.

Controls

Per-session token budget. Each session has a hard cap on tokens. Exceeding the cap terminates the session with a structured error the agent can observe. No soft retries.
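A minimal sketch of that hard cap, assuming the harness charges the budget synchronously before each LLM request; the class and field names are illustrative.

```python
# Per-session token budget: a hard cap enforced synchronously in the harness,
# surfacing a structured error the agent loop can observe. No soft retries.
class BudgetExceeded(Exception):
    def __init__(self, used: int, cap: int):
        super().__init__(f"session token budget exceeded: {used}/{cap}")
        self.used, self.cap = used, cap

class SessionBudget:
    def __init__(self, cap_tokens: int):
        self.cap, self.used = cap_tokens, 0

    def charge(self, tokens: int) -> None:
        """Call before each LLM request with the estimated token count."""
        if self.used + tokens > self.cap:
            # Terminate the session; the structured error is observable state.
            raise BudgetExceeded(self.used + tokens, self.cap)
        self.used += tokens

budget = SessionBudget(cap_tokens=100_000)
budget.charge(60_000)
try:
    budget.charge(50_000)
except BudgetExceeded as e:
    assert e.cap == 100_000
```

Charging before the call (with an estimate) rather than after (with the actual count) is what makes this enforcement instead of an alert.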

Per-agent token budget. Aggregate across all sessions of an agent type, rolled over per period (hour, day, month). Prevents a single agent from consuming shared budget.

Per-tenant quotas. Per-org/per-team budgets enforced at the gateway or API layer. Standard for multi-tenant platforms.

Per-user rate limits. Tokens-per-minute and requests-per-minute enforced at ingress. Blocks runaway clients before they reach the model.

Admission control under pressure. When gateway queue depth or downstream error rate exceeds a threshold, reject new sessions with explicit "try again later" rather than accepting and timing out.

Prompt caching. Identical prompt prefixes are cached at the LLM level (Anthropic, OpenAI, Gemini all support this). Anthropic prices cache hits at 10% of input cost with 1.25× write for 5-min TTL and 2× for 1-hour TTL [171]. "Don't Break the Cache" [172] evaluates prompt caching across long-horizon agentic tasks with 500+ evaluations and finds statistically significant cost reductions across all four tested models; ProjectDiscovery reports production savings of 59–70% [173]. Requires prompt stability across sessions; interacts with §7 compaction.
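Because providers cache on an exact prefix match, the prompt-assembly order determines the hit rate: stable content (system prompt, tool schemas) must precede anything volatile (session IDs, timestamps). A minimal sketch of that discipline, with made-up message content:

```python
# Cache-friendly prompt assembly: keep the stable prefix byte-identical
# across sessions and push volatile fields into the suffix.
def build_messages(system: str, tools: str, session_id: str, user_msg: str):
    stable_prefix = [            # identical across sessions -> cacheable
        {"role": "system", "content": system},
        {"role": "system", "content": tools},
    ]
    volatile_suffix = [          # differs per session -> after the cached prefix
        {"role": "user", "content": f"[session {session_id}] {user_msg}"},
    ]
    return stable_prefix + volatile_suffix

a = build_messages("You are a coding agent.", "tools: bash, edit", "s-1", "hi")
b = build_messages("You are a coding agent.", "tools: bash, edit", "s-2", "hi")
assert a[:2] == b[:2]       # shared prefix survives across sessions
assert a[2] != b[2]
```

A timestamp interpolated into `system` would make `a[:2] == b[:2]` false, which is exactly the cache-breaking prefix drift called out in the failure modes below.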

Model routing by cost. Route "easy" tasks to a cheap model; escalate to expensive models only when needed. Implemented at the LLM gateway (§24); requires classification of task difficulty. RouteLLM [174] reports 85% cost reduction on MT-Bench versus GPT-4-only; Cascade Routing (ICLR 2025) [175] unifies routing and cascading with +4% quality at 80% relative cost improvement.
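The routing shape, under the assumption of a cheap tier, an expensive tier, and some difficulty heuristic (all placeholders here), looks like this; note the single escalation rather than a retry loop, which is the failure mode flagged below.

```python
# Minimal cost router: try the cheap model unless the task looks hard;
# escalate to the expensive model at most once if the cheap attempt fails.
from typing import Callable, Optional

def run(task: str,
        cheap: Callable[[str], Optional[str]],
        expensive: Callable[[str], str],
        looks_hard: Callable[[str], bool]) -> str:
    if looks_hard(task):
        return expensive(task)        # skip the cheap tier for known-hard tasks
    out = cheap(task)
    return out if out is not None else expensive(task)   # one escalation, no loop

# Toy stand-ins: the cheap model "fails" (returns None) on long inputs.
cheap = lambda t: "ok" if len(t) < 20 else None
expensive = lambda t: "ok(expensive)"
looks_hard = lambda t: "prove" in t

assert run("format this", cheap, expensive, looks_hard) == "ok"
assert run("prove this theorem", cheap, expensive, looks_hard) == "ok(expensive)"
assert run("x" * 30, cheap, expensive, looks_hard) == "ok(expensive)"
```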

Spot / preemptible for batch. Scheduled agents with tolerant completion deadlines run on preemptible compute. Halves compute cost; adds preemption as a failure mode.

Enforcement layers

Budgets enforced at the billing system alone are alerts, not enforcement — they notify after the spend has already occurred. Enforcement has to happen synchronously with the call, which means at the gateway or in the harness. The harness has the context (session, agent, user) to make per-call decisions; the gateway has the vantage to enforce aggregate caps.

Chargeback and showback

Multi-tenant platforms need to attribute cost back to the tenant that caused it. Requires that every LLM call is tagged with tenant/agent/session at the gateway layer (§24) and that those tags land in the cost reporting pipeline (§26). Systems without this end up with a big undifferentiated LLM bill and no way to argue cost with the tenants producing it.

Failure modes

  • Budget as alert, not enforcement. The $47,000 LangChain loop [50] and the $50,000 Claude Code recursion [51] both happened because budgets existed but fired after the spend.
  • Cache poisoning. Cached prompt prefixes that include sensitive data leak across sessions. Requires per-tenant cache partitioning.
  • Cache-breaking prefix drift. Agents whose system prompts include timestamps, session IDs, or git SHAs silently invalidate cache on every call — common enough to be worth calling out as a distinct category.
  • Model routing that skips the expensive model when it is needed. A cost-router that misclassifies "hard" tasks as easy produces worse outputs and sometimes infinite retries. The retry then costs more than the direct expensive call.

32. Long-term memory and knowledge stores

Primitive: Persistence fabric (memory backend) with identity-fabric dependency. Consumed by: Harness (loaded into context). Touch points: §7 Context management (feeds compaction policy), §22 Identity (per-tenant partitioning, ownership, right-to-erasure cascade), §33 Audit (PII in memory, right-to-erasure cascade).

Sessions are bounded by compaction (§7). Agents that outlast sessions need durable, queryable memory that the harness can load into context. This is a separate system from the session log; the session log is append-only intent, memory is curated and retrievable.

Memory types

graph TB
    classDef short fill:#d6eaf8,stroke:#2980b9
    classDef long fill:#fdebd0,stroke:#e67e22
    classDef external fill:#d5f5e3,stroke:#27ae60

    subgraph "Short-term"
        ST[Session context<br/>within one conversation]
    end

    subgraph "Long-term"
        EM[Episodic memory<br/>past interactions]
        SM[Semantic memory<br/>facts about user / domain]
        PM[Procedural memory<br/>learned workflows]
    end

    subgraph "External knowledge"
        RAG[RAG indexes<br/>documentation, tickets, code]
        KG[Knowledge graphs<br/>entities, relationships]
    end

    class ST short
    class EM,SM,PM long
    class RAG,KG external

Options

Vector store + embedding retrieval. Pinecone [103], Weaviate, pgvector [104], Chroma [105], LanceDB. Each interaction or document is embedded and indexed; queries retrieve top-k by cosine similarity. Simple, scales, known failure modes (retrieval quality degrades with corpus size, embeddings lag model upgrades).

Memory framework. Letta [176] (the successor company to MemGPT [106]), Zep [107], and Mem0 [177] are opinionated layers on top of a vector store: automatic summarization of old messages, entity extraction, per-user memory partitioning. Faster path to "something works"; locks you into the framework's memory model. Benchmark numbers are contested: Mem0 [177] reports +26% LLM-as-Judge over OpenAI memory on LOCOMO [178] with 91% lower p95 latency and 90%+ token-cost reduction; Letta's counter-benchmark [179] reports 74.0% LoCoMo with GPT-4o-mini vs Mem0's reported 68.5%, with disputed methodology; Zep's temporal knowledge graph reaches 63.8% on LongMemEval vs Mem0's 49.0% — evidence that KG-based memory wins on temporal reasoning specifically. Treat cross-vendor memory benchmarks as contested until independently reproduced.

Knowledge graph. Entities and relationships instead of flat text. Better for compositional queries ("who reports to Alice who worked on project X") but harder to populate automatically.

RAG over existing corpora. Index the team's documentation, tickets, and code; retrieve relevant chunks into context per query. The most common pattern for domain-specialized agents.

Curated memory (pinned facts). High-value facts written by the agent or user and pinned into every session's context. Low-volume, high-signal. Usually a small markdown file per user or per project.

Trade-offs

  • Retrieval quality vs. recall. Top-k retrieval misses context the model needed. Hybrid search (vector + keyword) is the common upgrade.
  • Embedding model coupling. Changing the embedding model invalidates the existing index. Reindexing at scale is expensive.
  • Privacy and isolation. Memory must be per-user or per-tenant by default. Bleed-over (user A's memory surfacing in user B's session) is a data-breach class failure. Pinecone namespaces, pgvector row-level security, or logical per-tenant indexes are the common patterns.
  • Memory staleness. Facts about a user (job title, current project) age. Without a refresh mechanism, memory drifts from reality.
  • Memory as tool vs. ambient. Load all relevant memory into every prompt (ambient, simple, expensive) or expose memory as a queryable tool the agent chooses when to call (on-demand, cheaper, requires the agent to know when to query). Ambient is simpler to reason about; tool-based scales better.

Failure modes

  • Retrieval that poisons context. A misretrieved document injected into the prompt changes the agent's behavior in subtle ways. Hard to debug because the trace shows retrieval "succeeded."
  • Memory injection. A user's memory is written by the agent based on user messages. A malicious user can craft messages that plant persistent misinformation in memory (a prompt-injection analogue for memory systems).
  • Consolidation loss. Systems that periodically summarize memory (Letta, MemGPT) drop detail. Same invisible-loss problem as §7 compaction, at a different timescale.
  • Cross-tenant memory bleed-over. Retrieval against a shared vector store without per-tenant namespace isolation surfaces another tenant's data. A top-k query against a shared index without an enforced tenant filter is a data-breach class failure. The identity fabric (§22) must either partition the index physically (per-tenant namespace, Pinecone namespaces, pgvector row-level security, separate pgvector schemas) or attribute every retrieved chunk so the harness can filter post-query — partition-at-write-time is strongly preferred, since filter-at-read-time is one SQL bug away from a leak.
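The partition-at-write-time pattern can be sketched with a toy in-memory store; the class and distance function are illustrative stand-ins for Pinecone namespaces or pgvector row-level security.

```python
# Partition-at-write-time: every vector is written into its tenant's
# namespace, so a top-k query structurally cannot see another tenant's data.
from collections import defaultdict

class TenantMemory:
    def __init__(self):
        self._ns = defaultdict(list)           # tenant_id -> [(vector, text)]

    def write(self, tenant_id: str, vector, text: str) -> None:
        self._ns[tenant_id].append((vector, text))

    def query(self, tenant_id: str, vector, k: int = 3):
        # Similarity search runs ONLY inside the caller's namespace;
        # there is no shared index to mis-filter at read time.
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
        hits = sorted(self._ns[tenant_id], key=lambda vt: dist(vt[0], vector))
        return [text for _, text in hits[:k]]

mem = TenantMemory()
mem.write("acme", (1.0, 0.0), "acme roadmap")
mem.write("globex", (1.0, 0.0), "globex secrets")

assert mem.query("acme", (1.0, 0.0)) == ["acme roadmap"]   # no cross-tenant bleed
```

Contrast with the filter-at-read alternative: a single shared list plus a `WHERE tenant_id = ?` clause, where omitting the clause once is the breach.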

33. Audit, compliance, and data residency

Primitive: Persistence fabric (tamper-evident log, retention) with identity-fabric dependency. Touch points: audit event stream (all four primitives emit), §22 Identity (every audit record is identity-attributed; right-to-erasure cascades follow identity-ownership graphs defined in §22), §26 Observability (shapes capture policy), §3 Session durability (training-data lawful basis), §32 Memory (right-to-erasure cascade).

Tamper-evident audit records are identity-attributed: every entry carries the (user_id, tenant_id, agent_id, workload_id) tuple resolved by the identity fabric (§22). Right-to-erasure cascades are graph traversals over the identity-ownership graph that §22 maintains — without that graph, "delete all of Alice's data" is a best-effort search instead of a verifiable operation. Retention and tamper-evidence can make strict erasure impossible (cryptographic chaining means removing a link breaks the chain); the usual compromise is a tombstone plus a signed proof-of-deletion, with the original payload purged while the chain integrity holds.

Agents handle data. In regulated contexts (healthcare, finance, EU users, enterprise customers under SOC2/ISO 27001 commitments), how that data is handled is a compliance question with teeth.

Regulatory context

  • GDPR (EU users). Right to erasure, data minimization, lawful basis for processing, data residency within the EU. Applies to session logs and memory.
  • EU AI Act. Entered into force August 2024; GPAI (general-purpose AI) obligations became enforceable 2 August 2025, with Commission enforcement actions delayed to 2 August 2026 [108]. The operational compliance artifact is the GPAI Code of Practice (published 10 July 2025) [180], covering Transparency, Copyright, and Safety & Security chapters. High-risk AI systems have risk-management, transparency, and human-oversight obligations.
  • NIST AI RMF. Voluntary US framework; increasingly expected in enterprise RFPs [109]. NIST AI 600-1 (GenAI Profile, July 2024) [181] adds 12 enumerated risk categories, and the CSA Agentic AI Profile v1 (2025) [182] explicitly covers the agent-specific gaps the base RMF misses.
  • ISO 42001. AI management system standard; auditable certification available [110]. A CSA 2025 survey reports 76% of enterprises planning to pursue certification, with Microsoft and Cornerstone among 2024–25 certified adopters.
  • SOC 2 Type II. Controls over confidentiality, availability, processing integrity. The default enterprise compliance bar in the US.
  • HIPAA (US healthcare). BAAs required; PHI handling rules; audit-log retention.

Controls

Tamper-evident session logs. Append-only with cryptographic chaining (Merkle-tree log, transparency log like Sigstore Rekor). Required for audit trails that hold up in dispute. Most event-log implementations are append-only but not tamper-evident.
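The chaining-plus-tombstone compromise described above can be sketched in a few lines with standard hashing; this is a minimal illustration, not a Merkle tree or a Rekor client.

```python
# Hash-chained append-only log: each entry commits to the previous chain
# hash, so edits or reordering break verification. Erasure purges the
# payload but keeps its hash (a tombstone), preserving chain integrity.
import hashlib, json

def _h(prev: str, payload_hash: str) -> str:
    return hashlib.sha256((prev + payload_hash).encode()).hexdigest()

class AuditLog:
    def __init__(self):
        self.entries = []            # [payload | None, payload_hash, chain_hash]
        self.head = "0" * 64

    def append(self, record: dict) -> None:
        payload_hash = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.head = _h(self.head, payload_hash)
        self.entries.append([record, payload_hash, self.head])

    def erase(self, i: int) -> None:
        self.entries[i][0] = None    # tombstone: payload gone, hash retained

    def verify(self) -> bool:
        h = "0" * 64
        for _, payload_hash, chain_hash in self.entries:
            h = _h(h, payload_hash)
            if h != chain_hash:
                return False
        return True

log = AuditLog()
log.append({"user_id": "alice", "action": "tool_call"})
log.append({"user_id": "bob", "action": "llm_call"})
log.erase(0)                         # right-to-erasure: payload purged...
assert log.verify()                  # ...chain integrity still holds
log.entries[1][1] = "f" * 64
assert not log.verify()              # tampering breaks the chain
```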

PII redaction in logs and traces. Presidio, cloud-provider DLP, inline regex. Applied before storage, not at query time. Partial solutions leak through rare formats; needs periodic red-teaming.

Per-region deployment. EU data stays in EU-region infrastructure end-to-end: harness, LLM gateway, sandbox, session store, memory store. LLM providers have regional endpoints (OpenAI EU, Azure OpenAI regions, Anthropic AWS Bedrock regions).

Right-to-erasure pipelines. Deleting a user deletes their sessions, memory, training data derivatives, and traces. Straightforward for session stores; painful for trained models (data is effectively baked in).

Data processing agreements. Every external service (LLM providers, MCP hosts, observability vendors) processing user data needs a DPA. Vendors that don't offer DPAs don't pass enterprise procurement.

Audit-log retention. Typically 1–7 years depending on jurisdiction and regulation. Storage cost is non-trivial at scale; partitioning and archival to cold storage (S3 Glacier) is standard.

The training-data question

Using session logs to fine-tune (§28) requires a lawful basis under GDPR and explicit opt-in in most cases. Providers that include "we may use your data for training" in their terms (ChatGPT free tier, early Copilot) are not usable for enterprise data without an opt-out or enterprise agreement. Most enterprise LLM APIs now default to no-training; worth verifying per provider, per tier.

Trade-offs

  • Observability (§26) vs. privacy (§33). Full trace capture conflicts with data minimization. Regulated workloads require either heavy redaction or scoped capture (errors only, no successful request bodies).
  • Long-term memory (§32) vs. right to erasure. A user's memory is data about the user; deletion must cascade to memory stores. Framework-managed memory (Letta, Mem0) may or may not expose a delete API.
  • Per-region deployment vs. model availability. The best models may not be available in every region. EU deployments may be stuck on older Azure OpenAI versions.
  • Cross-border fine-tune data residency. Fine-tuning on EU user traces using a US-region training cluster is a GDPR transfer-mechanism problem distinct from inference residency, and needs to be assessed independently.
  • Audit completeness vs. cost. Capturing every prompt, response, tool call, and intermediate state with 7-year retention is expensive. Most production systems capture a structured subset (decisions, not deliberation).

Interactions and timeline

The trade-offs are not independent. The sharpest production failures come from interactions among them, and the interactions cross the Part I / Part II boundary as often as they stay within it. Three views below: core runtime interactions (Part I), ecosystem interactions and bridges to runtime (Part II ↔ Part I), and a timeline of when each surfaces in production.

Part I — core runtime interactions (clustered by primitive)

Nodes are coloured by owning primitive: harness (blue), persistence fabric (green), sandbox fabric (orange), identity fabric (pink), touch points (red). The shape of the cross-primitive edges is the design question. Identity does not surface as its own axis in Part I (it is covered by §22 in Part II), but Part II's cross-bridge graph below shows how it flows into Part I.

graph LR
    classDef harness fill:#d6eaf8,stroke:#2980b9
    classDef persistence fill:#d5f5e3,stroke:#27ae60
    classDef sandbox fill:#fdebd0,stroke:#e67e22
    classDef identity fill:#fadbd8,stroke:#b03a5b
    classDef touchpoint fill:#f9d6d6,stroke:#c0392b

    T1["1. Harness<br/>placement"]
    T2["2. Agent<br/>lifecycle"]
    T3["3. Session<br/>durability"]
    T4["4. Workspace<br/>persistence"]
    T5["5. External<br/>connections"]
    T6["6. Concurrency"]
    T7["7. Context<br/>management"]
    T8["8. Operational<br/>failure modes"]
    T9["9. Security<br/>scope"]
    T10["10. Model<br/>capability"]
    T11["11. Transport"]
    T12["12. Execution<br/>mode"]
    T13["13. Scheduler"]
    T14["14. Multi-agent<br/>composition"]
    T15["15. Tool dispatch<br/>statefulness"]
    T16["16. Sandbox<br/>lifecycle"]
    T17["17. Session<br/>branching"]
    T18["18. Interruption /<br/>steering"]
    T19["19. Agent<br/>authoring"]
    T20["20. Tool registry /<br/>selection"]
    T21["21. Artifact<br/>surface"]

    %% Model capability propagates
    T10 -->|"prompt/tool drift"| T2
    T10 -->|"compaction quality"| T7
    T10 -->|"env disambiguation"| T1
    T10 -->|"recovery prompt"| T3
    T2 -->|"all-or-nothing rollout"| T10

    %% Recovery chain
    T3 -->|"non-idempotent replay"| T4
    T4 -->|"lost workspace"| T3
    T16 -->|"teardown-resume<br/>depends on bootstrap"| T4
    T16 -->|"non-reproducible<br/>bootstrap"| T3

    %% Connection lifecycle
    T5 -->|"reconnect loses state"| T6
    T8 -->|"no circuit breaker"| T5
    T15 -->|"generalizes"| T5

    %% Placement effects
    T1 -->|"inner-loop latency"| T7
    T1 -->|"multi-env confusion"| T10

    %% Security
    T9 -->|"prompt injection"| T6

    %% Flow axes
    T11 -->|"at-least-once<br/>delivery"| T3
    T12 -->|"persistent mode<br/>drives compaction"| T7
    T12 -->|"hybrid mode<br/>overlap"| T6
    T12 -->|"scheduled mode<br/>needs scheduler"| T13
    T13 -->|"tick replay<br/>side effects"| T3
    T13 -->|"runaway ticks /<br/>circuit breaker"| T8
    T14 -->|"pipeline races<br/>on shared state"| T6
    T14 -->|"loop / hop<br/>amplifies cost"| T8
    T15 -->|"stateful-resource<br/>coordination"| T6

    %% New primitive axes
    T17 -->|"fork diverges<br/>from external state"| T6
    T17 -->|"branching needs<br/>CoW or snapshots"| T4
    T18 -->|"cancellation-aware<br/>tool dispatch"| T15
    T18 -->|"pause enables<br/>HITL gates"| T9
    T19 -->|"authored capability<br/>surface"| T14
    T19 -->|"plugin trust<br/>boundary"| T9
    T20 -->|"schema bloat<br/>hits model"| T10
    T20 -->|"poisoned tool<br/>descriptions"| T9
    T21 -->|"retention vs<br/>workspace TTL"| T4

    %% Primitive ownership: harness / persistence / sandbox / touch-point
    class T2,T5,T7,T10,T12,T13,T14,T15,T18,T19,T20 harness
    class T3,T4,T17,T21 persistence
    class T9,T16 sandbox
    class T1,T6,T8,T11 touchpoint

Part II ↔ Part I — ecosystem interactions and bridges (clustered by primitive)

Five-colour scheme: harness (blue), persistence fabric (green), sandbox fabric (orange), identity fabric (pink), touch points (red). This view emphasises how Part II axes plug into the four primitives; the load-bearing edges are the cross-primitive ones. Identity fabric (§22) is the substrate other primitives consume — edges from E22 flow into persistence (owner tag), sandbox (workload id), harness (OBO / user / agent), and the audit sink.

graph LR
    classDef harness fill:#d6eaf8,stroke:#2980b9
    classDef persistence fill:#d5f5e3,stroke:#27ae60
    classDef sandbox fill:#fdebd0,stroke:#e67e22
    classDef identity fill:#fadbd8,stroke:#b03a5b
    classDef touchpoint fill:#f9d6d6,stroke:#c0392b

    %% Part I anchors (subset)
    P1_1["1. Placement"]
    P1_2["2. Agent lifecycle"]
    P1_3["3. Session durability"]
    P1_4["4. Workspace"]
    P1_6["6. Concurrency"]
    P1_7["7. Context"]
    P1_8["8. Operational"]
    P1_9["9. Security"]
    P1_10["10. Model capability"]
    P1_11["11. Transport"]
    P1_14["14. Multi-agent"]
    P1_16["16. Sandbox lifecycle"]

    %% Part II
    E22["22. Identity fabric<br/>(IdP + broker + ABAC)"]
    E23["23. Storage"]
    E24["24. Gateways"]
    E25["25. Network policy"]
    E26["26. Observability"]
    E27["27. Evaluation"]
    E28["28. Model lifecycle"]
    E29["29. Supply chain"]
    E30["30. Policy / HITL"]
    E31["31. Cost / admission"]
    E32["32. Long-term memory"]
    E33["33. Audit / compliance"]

    %% Identity fabric as substrate — edges outward into the other primitives
    E22 -->|"owner tag on<br/>every resource"| P1_3
    E22 -->|"tenant partition"| E32
    E22 -->|"workload id<br/>(SPIFFE)"| P1_16
    E22 -->|"OBO / user /<br/>agent tuple"| P1_14
    E22 -->|"caller attrs<br/>for decisions"| E30
    E22 -->|"tenant attribution<br/>for budget"| E31
    E22 -->|"audit attribution +<br/>erasure graph"| E33
    E22 -->|"ambient authority<br/>if missing"| P1_9
    P1_11 -->|"transport auth →<br/>canonical user"| E22
    P1_14 -->|"A2A identity<br/>propagation"| E22

    %% Infra substrate
    E23 -->|"one substrate<br/>for workspace"| P1_4
    E24 -->|"gateway SPOF"| P1_8
    E25 -->|"non-bypassable<br/>enforcement"| P1_9
    E25 -->|"makes egress<br/>real"| P1_1

    %% Quality & feedback loop
    E26 -->|"traces feed"| E27
    E26 -->|"traces feed"| E28
    E27 -->|"eval gate enables<br/>canary"| P1_2
    E27 -->|"validates<br/>model swap"| P1_10
    E28 -->|"fine-tune shifts<br/>capability"| P1_10
    P1_10 -->|"base capability<br/>caps fine-tune"| E28

    %% Governance
    E29 -->|"compromised pkg →<br/>sandbox escape"| P1_9
    E30 -->|"gates at external<br/>resource"| P1_6
    E30 -->|"enforcement arm<br/>of security"| P1_9
    E31 -->|"budget before<br/>runaway"| P1_8
    E32 -->|"loaded into<br/>context"| P1_7
    E32 -->|"PII in memory"| E33
    E26 -->|"PII in traces"| E33
    E33 -->|"shapes capture<br/>policy"| E26
    P1_3 -->|"training data<br/>needs lawful basis"| E33

    %% Sandbox & storage
    P1_16 -->|"lifecycle mode<br/>picks substrate"| E23

    %% Primitive ownership: harness / persistence / sandbox / identity / touch-point
    class P1_2,P1_7,P1_10,P1_14,E27,E28,E30 harness
    class P1_3,P1_4,E23,E32,E33 persistence
    class P1_9,P1_16,E25,E29 sandbox
    class E22 identity
    class P1_1,P1_6,P1_8,P1_11,E24,E26,E31 touchpoint

The compounding clusters

Four interaction clusters produce the highest-impact failure modes:

  1. Crash-recovery chain (§3 + §4 + §13 + §16). Session durability, workspace persistence, scheduler recovery, and sandbox lifecycle all replay on crash. A non-idempotent tool call at the intersection of any two produces duplicate side effects. Idempotency keys or a step-log layer sits on top of all four.

  2. Model behavior chain (§10 + §2 + §7 + §27 + §28). Model capability shifts — swaps, upgrades, fine-tunes — propagate through agent lifecycle (all-at-once rollout), compaction quality, and trajectory-level evals. Without the eval gate, every change in this chain is a gamble.

  3. Trust-and-policy chain (§1 + §9 + §22 + §24 + §25 + §30). Placement sets the secret reachability boundary; identity fabric (§22) resolves who the caller is, what tenant they belong to, and what attributes gate access; gateways mediate access; network policy makes enforcement non-bypassable; policy engines decide per-action allow/deny. Each layer alone is insufficient.

  4. Cost-and-observability chain (§8 + §22 + §26 + §31 + §33). Runaway cost is detected by observability, enforced by admission control, attributed by the identity fabric, and constrained by compliance (what can be captured). Systems that wire budgets only to the billing system without synchronous enforcement — or without tenant attribution on every LLM call — recreate the $47K / $50K loops on a schedule and cannot say which tenant caused them.

  5. Identity-cascade chain (§22 + §3 + §32 + §33 + §31). Every identity-owned resource — sessions, workspaces, memory entries, artifacts, audit records, cost records — must cascade correctly under user deletion, tenant re-parenting, and revocation. A chain that leaks at any node (memory that outlives user deletion, audit entries without an identity tuple, cost records that cannot be rebilled on re-parenting) is a compliance liability that appears only on the first DPIA.

When each bites

gantt
    title When trade-offs tend to surface
    dateFormat X
    axisFormat %s
    todayMarker off

    section Immediate
    §1 Inner-loop latency (placement)                        :active, 0, 2
    §8 Gateway SPOF / scheduling exhaustion                  :active, 0, 2
    §11 Transport ID mapping / reconnect                     :active, 0, 2
    §24 Gateway setup and limits                             :active, 0, 2

    section First incidents
    §3 Mid-turn crash → non-idempotent replay                :crit, 1, 3
    §4 AZ failure → PVC unreachable                          :crit, 1, 3
    §5 Replica restart → MCP reconnect storm                 :crit, 1, 3
    §6 Duplicate commits / tickets                           :crit, 2, 4
    §12 Runaway schedule / replay of side-effectful ticks    :crit, 1, 3
    §13 Tick lock stuck / overlap                            :crit, 1, 3
    §16 Cold-start under load / non-reproducible resume      :crit, 2, 4
    §18 Stale steering / resource leak on hard cancel        :crit, 2, 4
    §22 Ambient authority / token revocation race            :crit, 2, 4
    §25 First enforcement bypass attempt                     :crit, 2, 4
    §29 First compromised package / slopsquat                :crit, 2, 4
    §31 Budget-as-alert (not enforcement) — first runaway    :crit, 1, 3

    section Growth phase
    §2 Singleton deploy pain (>3 agent types)                :2, 5
    §9 First adversarial input → prompt injection            :3, 5
    §10 Model swap / deprecation → silent regression         :3, 6
    §14 First multi-agent pipeline conflict                  :3, 5
    §15 First stateful-resource tool introduced              :3, 5
    §17 First bad-turn fork / workspace divergence           :3, 5
    §19 Plugin/authoring portability tax                     :3, 6
    §20 Tool-name collision or schema bloat                  :3, 5
    §21 Artifact lost to workspace TTL race                  :3, 5
    §26 Observability cost / cardinality blowup              :3, 6
    §27 Eval overfitting or missing trajectory replay        :3, 6
    §30 Gate fatigue / gate bypass via prompt injection      :3, 5
    §32 First memory bleed-over or retrieval poisoning       :3, 5

    section Long-term
    §7 Context quality decay (invisible)                     :4, 7
    §23 Storage substrate cost / snapshot bloat              :4, 7
    §28 Fine-tune staleness / vendor lock-in                 :4, 7
    §33 Regulated customer / DPIA / right-to-erasure         :5, 7

Summary table

Primitive legend: H = agent harness, P = persistence fabric, S = sandbox fabric, I = identity fabric, T = touch point (cross-primitive contract). Hybrid entries list multiple tags (e.g., P+H means the axis owns a persistence contract that the harness depends on; X+I means the axis's primary primitive is X with a first-class identity-fabric dependency).

Part I — Runtime design axes

| # | Primitive | Design axis | Main options | Primary risk | When it typically surfaces |
|---|---|---|---|---|---|
| 1 | T | Harness placement | Outside / hybrid / inside | Inner-loop latency vs isolation vs operational complexity | Immediate for latency-sensitive agents |
| 2 | H | Agent lifecycle | Singleton / registry / per-request | All-or-nothing rollout vs resolution cost | When agent types exceed 3–4 |
| 3 | P+H | Session durability | Intent log / step log / filesystem | Non-idempotent replay on crash | First production crash during a side-effectful operation |
| 4 | P | Workspace persistence | PVC / snapshot / reproducible bootstrap | Zone limits, teardown loss, drift | First AZ failure, preemption, or timeout race |
| 5 | H | External connections | In-memory pool / sticky routing / session service | Reconnect storms, token races, lost server state | First replica restart under load |
| 6 | T | Concurrency | Session lock / resource lock / worktree / reconciliation | Conflicting external effects | First duplicate commit or ticket |
| 7 | H | Context management | Automatic compaction / externalization / large context | Silent quality decay | Gradual, invisible |
| 8 | T | Operational failure modes | Circuit breakers, timeouts, admission control, storage compaction | Cascading outage, runaway cost | First major outage or budget incident |
| 9 | S+H | Security scope | Isolation / policy / kernel enforcement / guardrails | Prompt injection, overprivileged actions, bypassable egress | First adversarial input |
| 10 | H | Model capability | Pinned / versioned / gateway-abstracted | Silent behavior drift on swap or upgrade | First model swap or provider deprecation |
| 11 | T | Transport and client interaction | Single / pluggable adapters; SSE / polling / webhook | ID mapping loss, dup events, reconnect gaps | First non-HTTP transport integration |
| 12 | H | Execution mode | Request-driven / scheduled / hybrid / event-driven; persistent vs ephemeral session | Runaway schedules, replay of side-effectful ticks | First scheduled or event-driven agent |
| 13 | H | Scheduler design | In-process / K8s Cron / durable-execution; agent lock / shard / leader | Lock stuck on dead replica, clock skew, overlapping ticks | Scheduled workload under load |
| 14 | H | Multi-agent composition | Sequential / supervisor / peer (A2A) / parallel; shared vs isolated state | Loops, identity confusion, coordination failure | First pipeline or cross-platform delegation |
| 15 | H | Tool dispatch by statefulness | Stateless / stateful-connection / stateful-resource | Misclassified tool dispatch, lost correlation | First stateful-resource tool introduced |
| 16 | S | Sandbox lifecycle and bootstrap | Always-on / hibernation / lazy / pool; pinned vs floating bootstrap | Cold-start under load, non-reproducible resume | First teardown-and-resume cycle |
| 17 | P | Session branching and fork | Linear-only / rewind / forkable tree | External state divergence, orphan branches, merge-back unsolved | First time a user needs to undo a bad turn |
| 18 | H | Interruption and steering | Hard cancel / cooperative pause / queued steering / resume-at-point | Stale steering, leaked resources on hard cancel, approval-gate latency | First HITL approval or user redirect |
| 19 | H | Agent authoring model | Monolithic / declarative bundles / runtime-composable | Plugin trust boundary, manifest drift, discovery collisions, portability tax | First user-authored capability or plugin |
| 20 | H | Tool registry and selection | Static list / tool-RAG / hierarchical namespacing / model-driven discovery | Name collisions, schema bloat, unselected tools, poisoned descriptions | When tool count exceeds ~30 |
| 21 | P | Artifact surface | In-workspace-only / artifact events / typed contract | Workspace-TTL race, unbounded storage, missing attribution | First user-facing deliverable produced |

Part II — Ecosystem around the harness

| # | Primitive | Design axis | Main options | Primary risk | When it typically surfaces |
|---|---|---|---|---|---|
| 22 | I | Identity fabric (auth + tenancy + ABAC) | OAuth scopes / capability tokens / OBO / SPIFFE; flat / team / org+project / ABAC tenancy; owner / ACL / ABAC / logical / physical isolation; ABAC via OPA / Cedar / Oso / Permit.io | Ambient authority, over-scoped tokens, revocation latency, cross-tenant bleed-over, implicit tenancy, erasure-graph gaps | First multi-tenant or high-privilege action |
| 23 | P | Storage substrate | PVC / CoW overlay / snapshot / lazy-clone / FUSE / ephemeral | Node-pinned state, snapshot lag, cost of full snapshots | Cold-start or cross-zone recovery |
| 24 | T | Proxy and gateway topology | LLM gw / tool gw / egress proxy / ingress gw; collapsed or separated | Gateway SPOF, cache poisoning, gateway-held tokens | First provider outage or multi-tenant cost attribution |
| 25 | S | Network policy and egress | NetworkPolicy / eBPF L7 / service mesh / private subnet + proxy | Bypassable enforcement, DNS tunneling, TLS opacity | First adversarial exfiltration attempt |
| 26 | T | Observability and telemetry | OTel GenAI / vendor (Langfuse, Langsmith, Arize, Helicone) | Trace PII leak, sampling blind spots, cardinality blowup | First incident review that needs a replay |
| 27 | H | Evaluation infrastructure | Unit / trajectory / LLM-judge / A/B / shadow; CI gates | Overfitting, static evals for dynamic tasks, LLM-judge bias | Before every prompt or model change |
| 28 | H | Model lifecycle | Base + prompt / SFT / DPO / distillation / LoRA / CPT | Data privacy, staleness, capability coupling, vendor lock-in | When base model capability is near-insufficient |
| 29 | S | Supply chain | Registry proxy / allowlist / SBOM / CVE scan / reproducible builds | Slopsquat / typosquat / worm, install-script RCE, transitive compromise | First slopsquat or compromised dependency |
| 30 | H | Policy and HITL | Hard-coded / OPA / Cedar / Kyverno; per-action / risk-scored | Gate fatigue, gate bypass via prompt injection, policy drift | First high-privilege action requiring approval |
| 31 | T+I | Cost and admission | Per-session / per-agent / per-tenant budget; gateway enforcement | Budget-as-alert (not enforcement), cache-breaking prefix drift, router misclassification, untagged tenant on LLM call | First five-figure runaway |
| 32 | P+I | Long-term memory | Vector store / memory framework / KG / RAG / curated | Retrieval poisoning, memory injection, cross-tenant bleed-over, per-tenant partition missing | First persistent-memory agent |
| 33 | P+I | Audit and compliance | Tamper-evident log / PII redaction / per-region / DPA; identity-attributed audit; right-to-erasure graph traversal | Training-data lawful basis, right-to-erasure cascade, cross-border fine-tune, audit entry missing identity tuple | First regulated customer or DPIA |
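Axis 31's primary risk — budget-as-alert rather than budget-as-enforcement — comes down to where the check runs. A minimal sketch of the enforcement posture, assuming hypothetical names (this is not any particular gateway's API): the gateway reserves estimated tokens at admission and rejects the call before it reaches the provider, then reconciles the reservation against provider-reported usage.

```python
class TenantBudget:
    """Per-tenant token budget enforced at admission (axis 31 sketch).

    The key property: `admit` runs *before* the LLM call and can say no.
    A budget that only fires an alert after usage is tallied cannot stop
    a runaway loop mid-flight.
    """

    def __init__(self, ceiling_tokens: int):
        self.ceiling = ceiling_tokens
        self.spent = 0      # settled usage, as reported by the provider
        self.reserved = 0   # in-flight reservations not yet settled

    def admit(self, estimated_tokens: int) -> bool:
        """Reserve headroom up front; refuse if the ceiling would be crossed."""
        if self.spent + self.reserved + estimated_tokens > self.ceiling:
            return False
        self.reserved += estimated_tokens
        return True

    def settle(self, estimated_tokens: int, actual_tokens: int) -> None:
        """Replace the reservation with the provider-reported usage."""
        self.reserved -= estimated_tokens
        self.spent += actual_tokens


budget = TenantBudget(ceiling_tokens=10_000)
assert budget.admit(4_000)              # first call admitted
budget.settle(4_000, actual_tokens=3_500)
assert budget.admit(6_000)              # 3_500 spent + 6_000 reserved fits
assert not budget.admit(2_000)          # would exceed: rejected, not alerted
```

Reserving the estimate (rather than checking spend alone) closes the race where many concurrent calls each pass a spend-only check and collectively blow through the ceiling — the shape of the five-figure runaways the table warns about.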

References

[1] L. Chen, M. Zaharia, J. Zou. "How is ChatGPT's Behavior Changing over Time?" Harvard Data Science Review, 2023. arXiv:2307.09009

[2] Y. Jiang et al. "When LLMs Get Significantly Worse: A Statistical Approach to Detect Model Degradations." 2026. arXiv:2602.10144

[3] F. Yan et al. "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models." ICML, 2025. OpenReview

[4] "ToolACE: Winning the Points of LLM Function Calling." ICLR, 2025. Paper

[5] "ACEBench: A Comprehensive Evaluation of LLM Tool Usage." Findings of EMNLP, 2025. ACL Anthology

[6] Z. Zhu et al. "What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering." 2024. arXiv:2406.12334

[7] M. Sclar et al. "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design." 2023. arXiv:2310.11324

[8] F. Shi et al. "LLMs Get Lost In Multi-Turn Conversation." 2025. arXiv:2505.06120

[9] N. Jain et al. "Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions." 2026. arXiv:2601.04170

[10] S. Bhatt et al. "Evaluating Goal Drift in Language Model Agents." 2025. arXiv:2505.02709

[11] Z. Shi et al. "Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications." 2025. arXiv:2511.19933

[12] Z. Xu et al. "MUSCLE: A Model Update Strategy for Compatible LLM Evolution." 2024. arXiv:2407.09435

[13] Palo Alto Networks Unit 42. "Cracks in the Bedrock: Escaping the AWS AgentCore Sandbox." 2026. Unit 42

[14] Anthropic Engineering. "Managed Agents." 2026. Anthropic

[15] Browser Use Engineering. "How We Built Secure, Scalable Agent Sandbox Infrastructure." 2026. Browser Use

[16] H. Chase. "The Two Patterns by Which Agents Connect Sandboxes." LangChain, 2026. LangChain

[17] Superagent. "AI Code Sandbox Benchmark 2026." Superagent

[18] Anthropic. "Securely Deploying AI Agents." Claude Code Docs. Anthropic

[19] M. Ubl, H. Arora. "Security Boundaries in Agentic Architectures." Vercel, 2026. Vercel

[20] OpenAI. "Sycophancy in GPT-4o: What happened and what we're doing about it." 2025. OpenAI

[21] ZenML. "What 1,200 Production Deployments Reveal About LLMOps in 2025." ZenML

[22] MLflow. "Prompt Registry for LLM and Agent Applications." MLflow

[23] Portkey. "Canary Testing for LLM Apps." 2025. Portkey

[24] Deepchecks. "How Prompt Updates Drive Most Incidents." 2025. Deepchecks

[25] "AI Agent Reliability Report 2025." LLM agent retry rate 15–30%. fast.io

[26] S. Ewen, G. van Dongen, I. Shilman. "Durable AI Loops: Fault Tolerance across Frameworks." Restate, 2025. Restate

[27] C. Davis. "Durable Execution meets AI." Temporal, 2025. Temporal

[28] C. Poly. "Durable Execution: The Key to Harnessing AI Agents in Production." Inngest, 2026. Inngest

[29] Y. Zhou et al. "Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering." 2026. arXiv:2604.08224

[30] Rack2Cloud. "Kubernetes Day 2 Failures: 5 Incidents & the Metrics That Predict Them." 2026. Rack2Cloud

[31] Kubernetes Issue #121436. "PVC Binding Prevents Pod Rescheduling." GitHub

[32] Gitpod Issue #9544. "Data loss when workspace timeout fires before sync completes." 2022. GitHub

[33] Replit. "Checkpoints and Rollbacks." Replit Docs

[34] Model Context Protocol. "The 2026 MCP Roadmap." MCP Blog

[35] Claude Code Issue #30224. "Auto-reconnect SSE MCP servers after server-side restart." GitHub

[36] MCP Python SDK Issue #520. "MCP Server Session Lost in Multi-Worker Environment." GitHub

[37] Claude Code Issue #27933. "OAuth token refresh race condition with multiple concurrent CLI processes." GitHub

[38] Claude Code Issue #24317. "Frequent re-authentication required with multiple concurrent sessions." GitHub

[39] D. Ogenrwot, J. Businge. "AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub." 2026. arXiv:2604.03551

[40] J. Geng, G. Neubig. "Effective Strategies for Asynchronous Software Engineering Agents." CMU, 2026. arXiv:2603.21489

[41] M. Cemri et al. "Why Do Multi-Agent LLM Systems Fail?" NeurIPS 2025 Datasets and Benchmarks Track. OpenReview

[42] Anthropic. "Context Engineering: Memory, Compaction, and Tool Clearing." Claude Cookbook, 2026. Anthropic

[43] "Cursor vs Claude Code vs Windsurf: Which One Handles Context Loss the Worst?" dev.to, 2026. dev.to

[44] L. Lindenbauer et al. "The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management." JetBrains Research / TU Munich, 2025. arXiv:2508.21433

[45] Cognition. "Rebuilding Devin for Claude Sonnet 4.5: Lessons and Challenges." 2025. Cognition

[46] R. Du et al. "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval." 2025. arXiv:2510.05381

[47] N. Hong et al. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma Research, 2025. Chroma

[48] Langfuse. "November 2025 Incident Report." 2025. Langfuse

[49] OpenAI. "Routing Layer Memory Exhaustion Incident." OpenAI Status

[50] DEV Community. "The $47,000 Agent Loop." 2025. dev.to

[51] MindStudio. "AI Agent Token Budget Management." 2025. MindStudio

[52] Azguards. "The Checkpoint Bloat: Mitigating Write-Amplification in LangGraph Postgres Savers." 2025. Azguards

[53] S. Willison. "The Lethal Trifecta for AI Agents." 2025. simonwillison.net

[54] OWASP. "LLM01:2025 — Prompt Injection." OWASP

[55] Obsidian Security, SwarmSignal, Airia. "Prompt Injection in 73% of Production AI Deployments." 2025. Obsidian

[56] ARMO Security. "AI Agent Sandboxing: Kubernetes-Native Enforcement." 2025. ARMO

[57] n8n CVE-2026-25049. Sandbox escape, CVSS 10.0. cyberdesserts

[58] OpenAI. "Agents SDK Sandboxes." 2026. OpenAI

[59] Daytona. "Architecture Documentation." Daytona

[60] A. Luzzardi. "Agent Harness: Inside vs Outside the Sandbox." Mendral, 2026. Mendral

[61] Fly.io. "Sprites: Persistent MicroVMs for AI Agents." 2026. SDxCentral

[62] GitHub. "Fine-grained personal access tokens." GitHub Docs

[63] C. Brunel et al. "Biscuit: Decentralized Authorization with Attenuable Tokens." biscuitsec.org

[64] A. Birgisson et al. "Macaroons: Cookies with Contextual Caveats for Decentralized Authorization in the Cloud." NDSS 2014. Google Research

[65] IETF RFC 8693. "OAuth 2.0 Token Exchange." RFC Editor

[66] SPIFFE/SPIRE. "Secure Production Identity Framework for Everyone." spiffe.io

[67] Salesforce. "Agentforce Trust Layer and Agent Identity." 2025. Salesforce

[68] Auth0. "Identity for AI Agents." 2026. Auth0

[69] AWS. "Firecracker Snapshotting." Firecracker Docs

[70] Depot. "Lazy Container Pulls with Stargz and eStargz." 2025. Depot

[71] Namespace Labs. "Sub-second Cold Starts for Sandboxed Agents." 2026. Namespace

[72] LiteLLM. "Unified LLM Gateway." docs.litellm.ai

[73] Portkey. "AI Gateway and Guardrails." portkey.ai

[74] Kong. "AI Gateway." konghq.com

[75] Cloudflare. "AI Gateway." Cloudflare Docs

[76] Helicone. "LLM Observability and Gateway." helicone.ai

[77] Arcade. "Agent Tool Platform." arcade.dev

[78] Composio. "Tool Integration Platform for AI Agents." composio.dev

[79] Pomerium. "Model Context Protocol (MCP) Gateway." 2026. Pomerium

[80] OpenTelemetry. "Semantic Conventions for Generative AI." OpenTelemetry

[81] Langfuse. "Open-source LLM Engineering Platform." langfuse.com

[82] LangSmith. "LLM Application Development, Monitoring, and Evaluation." smith.langchain.com

[83] Arize. "Phoenix: Open-source ML Observability." arize.com/phoenix

[84] Braintrust. "LLM Evaluation and Observability." braintrust.dev

[85] C. Jimenez et al. "SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024. arXiv:2310.06770

[86] T. Xie et al. "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." NeurIPS 2024. arXiv:2404.07972

[87] S. Zhou et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024. arXiv:2307.13854

[88] K. Yao et al. "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." 2024. arXiv:2406.12045

[89] METR. "Measuring AI Ability to Complete Long Tasks." 2025. metr.org

[90] A. Panickssery et al. "LLM Evaluators Recognize and Favor Their Own Generations." 2024. arXiv:2404.13076

[91] Promptfoo. "Open-source LLM Testing and Evaluation." promptfoo.dev

[92] UK AI Safety Institute. "Inspect: A Framework for Large Language Model Evaluations." inspect.aisi.org.uk

[93] Patronus AI. "Automated Evaluation for LLMs." patronus.ai

[94] Ragas. "Evaluation Framework for Retrieval-Augmented Generation." docs.ragas.io

[95] Predibase. "Fine-tuning Platform for Open-source LLMs." predibase.com

[96] Together AI. "Fine-tuning API." together.ai

[97] W. Kwon et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)." SOSP 2023. arXiv:2309.06180

[98] Socket. "Socket.dev: Supply Chain Security for Open Source." socket.dev

[99] Aqua Security. "Trivy: Vulnerability Scanner." trivy.dev

[100] CNCF. "Open Policy Agent (OPA)." openpolicyagent.org

[101] AWS. "Cedar Policy Language." cedarpolicy.com

[102] NVIDIA. "NeMo Guardrails: Toolkit for Programmable LLM Guardrails." github.com/NVIDIA/NeMo-Guardrails

[103] Pinecone. "Managed Vector Database for AI." pinecone.io

[104] pgvector. "Open-source Vector Similarity Search for Postgres." github.com/pgvector/pgvector

[105] Chroma. "Open-source Embedding Database." trychroma.com

[106] C. Packer et al. "MemGPT: Towards LLMs as Operating Systems." 2023. arXiv:2310.08560

[107] Zep. "Temporal Knowledge Graph for Agent Memory." getzep.com

[108] European Parliament. "EU Artificial Intelligence Act." 2024. artificialintelligenceact.eu

[109] NIST. "AI Risk Management Framework (AI RMF 1.0)." 2023. NIST

[110] ISO. "ISO/IEC 42001:2023 — AI Management Systems." ISO

[111] WHATWG. "Server-sent events." HTML Living Standard. html.spec.whatwg.org

[112] Google, Microsoft et al. "Agent-to-Agent (A2A) Protocol." 2025. a2aprotocol.org

[113] CrewAI. "Multi-Agent Framework." crewai.com

[114] OpenAI. "Swarm: Educational Framework for Multi-Agent Orchestration." github.com/openai/swarm

[115] E2B. "Secure Sandboxes for AI Code Execution." e2b.dev

[116] Modal. "Serverless Compute for AI Workloads." modal.com

[117] Anthropic. "Claude Code: Checkpointing and Session Forks." Claude Code Docs. code.claude.com

[118] Cursor. "Checkpoints in the Agent Workflow." 2025. stevekinney.com notes

[119] LangChain. "LangGraph Time Travel and Branching." 2025. LangChain Docs

[120] OpenAI. "Codex CLI /rewind Proposal." Issue #11626, 2025. GitHub

[121] LangChain. "LangGraph Interrupts." 2025. LangChain Docs

[122] OpenAI. "Agents SDK: Lifecycle Hooks and Handoffs." 2025. openai.github.io

[123] Cognition. "Devin Can Now Manage Devins." 2025. Cognition

[124] Anthropic. "Claude Code Plugins and Subagents Reference." code.claude.com

[125] OpenAI. "Agents SDK: Handoffs." 2025. openai.github.io

[126] Mastra. "Agents and Tools." mastra.ai

[127] OpenHands. "Runtime Architecture and File Operations." docs.openhands.dev

[128] OpenAI. "Responses API — Structured Outputs and Attached Artifacts." 2025. platform.openai.com

[129] Northflank. "Daytona vs E2B: AI Code Execution Sandbox Benchmarks." 2026. northflank.com

[130] "How I Built Sandboxes That Boot in 28 ms with Firecracker Snapshots." dev.to, 2025. dev.to

[131] runc. "CVE-2025-31133 / CVE-2025-52881: Container Escape via Namespace Handling." November 2025. GitHub Security Advisories

[132] WorkOS. "Maxim Fateev on Durable Execution for AI Agents." 2025. workos.com

[133] Golem. "The Emerging Landscape of Durable Computing." 2025. golem.cloud

[134] Karpenter Issue #2777. "PVC Zone Injection Prevents Pod Rescheduling." 2026. GitHub

[135] Replit. "Inside Replit's Snapshot Engine." December 2025. blog.replit.com

[136] Neon. "Replit App History Powered by Neon Branches." 2025. neon.com

[137] Model Context Protocol. "MCP Transport Future: Streamable HTTP Replaces SSE." December 2025. blog.modelcontextprotocol.io

[138] Model Context Protocol. "Transports — Mcp-Session-Id and Last-Event-ID." Spec 2025-11-25. modelcontextprotocol.io

[139] IETF RFC 8707. "Resource Indicators for OAuth 2.0." RFC Editor

[140] "MultiAgentBench: Evaluating Coordination Protocols in Multi-Agent LLM Systems." ACL 2025. ACL Anthology; arXiv:2503.01935

[141] "ACON: Context Compaction for Long-Horizon Agentic Tasks." October 2025. arXiv:2510.00615

[142] "LoCoBench-Agent: Evaluating Compaction Cliffs in Long-Context Agents." November 2025. arXiv:2511.13998

[143] Wiz Research / SwarmSignal. "AI Agent Security Report 2026." 2026. swarmsignal.net; sqmagazine.co.uk

[144] "SandboxEscapeBench: Quantifying Kernel vs Namespace Defenses." March 2026. arXiv:2603.02277

[145] "Drift No More? Evaluating Agent Drift on τ-Bench." October 2025. arXiv:2510.07777

[146] "AWS Outage and the Kiro AI Bot: A Post-Mortem." December 2025. singhajit.com; InfoQ

[147] M. Kleppmann. "How to do Distributed Locking." 2016. martin.kleppmann.com

[148] "MAFBench: Benchmarking Multi-Agent Framework Design Choices." 2026. arXiv:2602.03128

[149] "Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets." April 2026. arXiv:2604.02460

[150] Kubernetes SIG / Google. "GKE Pod Snapshots and agent-sandbox CRD." November 2025. GitHub; Google Cloud Blog

[151] Y. Chai et al. "Fork in the Road: Reflections and Optimizations for Cold-Start Latency in Production Serverless Systems." OSDI 2025. Tsinghua MADSys

[152] IETF. "OAuth 2.0 On-Behalf-Of User for AI Agents." draft-oauth-ai-agents-on-behalf-of-user-00, 2025. IETF Datatracker

[153] Okta. "Okta for AI Agents — Early Access." September 2025. Okta

[154] Invariant Labs. "GitHub MCP: Ambient Authority and Cross-Repo Token Scope Creep." May 2025. invariantlabs.ai

[155] "Firecracker + REAP: Prefetching for Snapshot-Based MicroVM Starts." USENIX ATC 2024. USENIX

[156] Nydus / stargz-snapshotter. "Lazy-Load OCI Images (containerd)." GitHub; GitHub

[157] Microsoft. "Dissecting LLM Container Cold-Start: Where the Time Actually Goes." 2025. techcommunity.microsoft.com

[158] OpenTelemetry. "Generative AI Spans." opentelemetry.io

[159] OpenTelemetry. "Generative AI Agent and Framework Spans." 2025. opentelemetry.io

[160] Langfuse. "OpenTelemetry OTLP Endpoint." February 2025. langfuse.com

[161] Scale. "SWE-Bench Pro: A Harder Benchmark for Coding Agents." 2025. scale.com

[162] "τ²-Bench: Dual-Control Benchmark with Telecom and Banking Domains." June 2025. arXiv:2506.07982

[163] "Position Bias in LLM Judges: An Empirical Study." IJCNLP 2025. ACL Anthology

[164] "CALM: Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge." 2024. arXiv:2410.02736

[165] "AgentRR: Trajectory Record-and-Replay for Agent Evaluation." 2025. arXiv:2505.17716

[166] Berkeley RDI. "Trustworthy Benchmarks for Agents." 2025. rdi.berkeley.edu

[167] S. Xu et al. "Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study." ICML 2024. arXiv:2404.10719

[168] "Self-Distillation Fine-Tuning for Agent Trajectories." October 2025. arXiv:2510.00482

[169] J. Spracklen et al. "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code-Generating LLMs." USENIX Security 2025. USENIX

[170] CISA. "Widespread Supply Chain Compromise Impacting npm Ecosystem (Shai-Hulud)." September 2025. CISA; Unit 42. "Shai-Hulud npm Supply-Chain Attack." Unit 42

[171] Anthropic. "Pricing, Including Prompt Cache." platform.claude.com

[172] "Don't Break the Cache: Evaluating Prompt Caching for Long-Horizon Agentic Tasks." January 2026. arXiv:2601.06007

[173] ProjectDiscovery. "How We Cut LLM Cost with Prompt Caching." 2025. projectdiscovery.io

[174] "RouteLLM: Learning to Route LLMs with Preference Data." 2024. arXiv:2406.18665

[175] "Cascade Routing: Unified Model Routing and Cascading for LLMs." ICLR 2025. OpenReview

[176] Letta. "Stateful Agents with Memory as a First-Class Primitive." letta.com

[177] "Mem0: Memory for the AI Era." April 2025. arXiv:2504.19413

[178] Snap Research. "LOCOMO: Long-Term Conversational Memory Benchmark." snap-research.github.io

[179] Letta. "Benchmarking AI Agent Memory." 2025. letta.com

[180] European Commission. "GPAI Code of Practice." July 2025. code-of-practice.ai; EU Digital Strategy

[181] NIST. "AI 600-1 — Generative AI Profile." July 2024. NIST

[182] Cloud Security Alliance. "Agentic AI Profile for NIST AI RMF v1." 2025. CSA Labs

[183] Keycloak. "Open Source Identity and Access Management." keycloak.org

[184] AWS. "IAM Identity Federation and Attribute-Based Access Control." AWS Docs

[185] Google Cloud. "IAM Overview — Resource Hierarchy and Conditions." Google Cloud Docs

[186] Oso. "Authorization as a Library — Relationship-Based Access Control." osohq.com

[187] Permit.io. "Authorization-as-a-Service for Fine-Grained Access Control." permit.io
