AI Context for Prod, Optimized by AIs in Prod

A few months ago we wrote about shipping ACE, the agentic context-engineering loop from Stanford / SambaNova that gives Annie (our AI SRE agent) institutional memory by agentically curating cheatsheets from past runs. It worked. Clients liked watching Annie "grow up" on their stack.

But agentic curation of what gets remembered is only half of optimizing AI context for prod. The other half is agentic indexing: how that memory is organized, surfaced, and retrieved. ACE-as-shipped left that axis unattended. Production context is the load-bearing thing we build, and we want it optimized by the AIs that use it. This post is about the five things we added on top of base ACE to close that gap: (1) a fixed set of memory items always presented to the agent, (2) per-query retrieval over the rest of the memory store, (3) an agent-optimized index of that store, (4) the ability for the agent to query the store mid-run, and (5) tried-and-true freshness mechanisms on top. The cover diagram above shows all five wrapped around the original ACE loop, numbered the same way.

Base ACE in one diagram

Annie runs an investigation. A Reflector reads the run and extracts what was learned. A Curator translates that into ADD / UPDATE / DELETE operations against a per-agent playbook of cheatsheets. Next investigation, the playbook is injected as part of the system prompt. The agent now has whatever institutional knowledge the loop has produced so far.
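To make the loop concrete, here is a minimal sketch of a playbook taking Curator operations and being rendered for the next run. It is an illustration only: `CuratorOp`, `apply_ops`, `render_playbook`, and the dict-of-strings playbook are hypothetical names and shapes, not ACE's or Annie's actual data model.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class CuratorOp:
    """One curation step. Field names are illustrative assumptions."""
    kind: Literal["ADD", "UPDATE", "DELETE"]
    cheatsheet_id: str
    content: Optional[str] = None  # needed for ADD and UPDATE


def apply_ops(playbook: dict[str, str], ops: list[CuratorOp]) -> dict[str, str]:
    """Return a new playbook with the operations applied in order."""
    updated = dict(playbook)  # leave the caller's playbook untouched
    for op in ops:
        if op.kind == "DELETE":
            updated.pop(op.cheatsheet_id, None)
            continue
        if op.content is None:
            raise ValueError(f"{op.kind} needs content: {op.cheatsheet_id}")
        if op.kind == "ADD" and op.cheatsheet_id in updated:
            raise ValueError(f"ADD of existing id: {op.cheatsheet_id}")
        if op.kind == "UPDATE" and op.cheatsheet_id not in updated:
            raise KeyError(f"UPDATE of unknown id: {op.cheatsheet_id}")
        updated[op.cheatsheet_id] = op.content
    return updated


def render_playbook(playbook: dict[str, str]) -> str:
    """Flatten the playbook into the text injected into the system prompt."""
    return "\n\n".join(f"[{cid}]\n{body}" for cid, body in playbook.items())
```

In this sketch the whole playbook is rendered every run, which is exactly the behavior the rest of the post moves away from.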

We mostly kept the curation engine intact. What we changed is what each cheatsheet carries and what happens after the playbook is curated: how it gets into the next run.

Three failure modes, three places the literature points

Part of how we evaluate Annie's quality is going through traces weekly with the SRE team and holding a recurring sync with each client. It's the discipline PostHog landed on for agents-in-the-wild, for similar reasons. The same three complaints kept coming up.

Annie trusts cheatsheets that are no longer true. A pattern captured two weeks ago still gets followed today, after the relevant deployment has changed twice. ACE's playbook carries a timestamp on each entry and a helpful / harmful counter (the agent tags each cheatsheet it used as helpful or harmful at the end of every run, and those tags accumulate per cheatsheet over time). Annie does react to those, but it isn't enough. The original ACE paper is evaluated on tasks where the environment to be learned is static; the design isn't asked to handle a stack that mutates underneath the memory. Plenty of memory constructs in the literature are built for exactly that: Zep / Graphiti attaches bitemporal validity to facts, MemoryBank borrows Ebbinghaus-style reinforce-on-recall, and the STALE benchmark reports that the best model in its suite scores only 55.2% at inferring memory age, meaning we have to show the model staleness rather than hope it figures it out. We needed to bring this discipline in.

The playbook size is fixed. Our Curator caps each agent's playbook so the system prompt doesn't grow without bound. Past the cap, every ADD requires a DELETE elsewhere, and the ceiling was forcing the Curator to drop knowledge worth keeping. Lifting the cap on the store without lifting it on the prompt requires retrieval.

Information she has, but doesn't find. Even when the cap isn't binding, the playbook is still injected as one long blob in section order. There's no signal to the agent about which entries are likely relevant to the current alert. She would routinely answer questions without using an answer that was sitting in her context, just unsurfaced.

And the literature is brutal on the cost of having stuff sit in context that isn't relevant. Levy, Jacoby, and Goldberg (Same Task, More Tokens) show accuracy drops monotonically with input length even when the added tokens are irrelevant padding: pure attention-budget cost, no content confusion. LongMemEval reports GPT-4o at 91.8% with only the relevant context vs 60.6% with the full 115K history. Lost in the Middle shows the U-shaped attention curve: models recall information well when it sits at the start or end of the context, and badly when it sits in the middle, regardless of how relevant it is. And this hasn't gotten better at the frontier. Chroma's 2025 Context Rot report ran the same experiment on Claude 4 Opus, GPT-4.1, Gemini 2.5 Pro and 15 others and concluded "models do not use their context uniformly; their performance grows increasingly unreliable as input length grows, even on simple tasks." A February 2026 paper, Long Context, Less Focus, supplies the theoretical reason: attention dilution as a structural limit of soft-attention transformers, not a model-generation artifact to be patched out. The "but our cheatsheets are relevant" defense doesn't survive Levy et al., and the "but frontier models handle long context now" defense doesn't survive the 2025/2026 evidence. Past a threshold, more tokens is a tax.

None of these are bugs in ACE. They're the natural ceiling of treating context as a static artifact to dump in.

ACE's authors have already moved past ACE

Two recent papers from the same line of authors are worth sitting with.

PEEK is ACE's authors revisiting the problem and building, essentially, the opposite of ACE. The result is a constant-sized ~1024-token "context map" of orientation knowledge: what's in the corpus, how it's organized, useful entities, schemas. A Distiller tags items, a Cartographer decides ADD/DELETE/REPLACE, an Evictor enforces the bound. The same authors who told us to grow contexts incrementally now tell us to keep them small, curated for orientation, separately retrievable.

MCE goes further: agents engineer not just the content of context but the machinery that produces it. They write and execute code to maintain a SKILL.md (a learnable methodology document) plus the surrounding retrieval policy. Meta-Harness (same author, with Chelsea Finn's group) compares these head-to-head on classification benchmarks: 48.6% accuracy at 11.4K tokens vs ACE's 40.9% at 50.8K. 4.5× fewer tokens, +7.7 points.

Both ideas are excellent. Neither ships to a production SRE agent as-is. Letting an agent write and execute arbitrary code to optimize its own context is a security non-starter, and the optimization loops require labelled training sets and budget we don't have. What we wanted was the posture of these papers (context as a designable surface, not a dump) translated into a system where every change is something a human-reviewed deploy can vouch for.

Five additions, in order

We built five things, in this order:

1. A fixed set of memory items always presented to the agent (`is_main`).

2. Per-query retrieval over the rest of the memory store, scored against tags and descriptions the Curator writes alongside the content.

3. An agent-optimized index of the store, what we call the org/query guide.

4. The ability for the agent to query the store mid-run, for what the retriever didn't pick.

5. Tried-and-true memory freshness mechanisms layered on top of retrieval.

Each gets its own section below.

Across the diagrams in this post: cyan = data structures inherited from base ACE. Terracotta = added by this work. Beige = the agents themselves and cheatsheet content bodies.

Always-presented memory items (`is_main`)

PEEK's central claim is that orientation knowledge (what is this corpus, how is it organized, what entities exist) compresses to a small budget and earns its keep on every single run regardless of query. That maps cleanly onto a piece of Annie's memory: the company-level facts about infrastructure, escalation paths, naming conventions. These are query-independent. There's no scenario in which they're irrelevant.

We give them a flag: `is_main = true`. A small number of cheatsheets per agent are marked main, and those are injected every run, query or no query.

Per-query retrieval for the rest

The rest of the store gets retrieved by relevance to the current run. The query is the raw alert (for RCA) or the user message (for chat). To make retrieval actually work, each cheatsheet now arrives at write time with an agent-optimized description and a list of long-phrase tags. These are the retrieval handles MCE and PEEK both push for. Items shouldn't just be content; they should carry their hooks.
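As a sketch of what "carrying their hooks" could look like, the record below keeps the retrieval handles next to the content. All field names other than `is_main` are assumptions for illustration, and indexing only the handles (not the body) is one possible design, not a statement of how Annie's retriever is built.

```python
from dataclasses import dataclass, field


@dataclass
class Cheatsheet:
    """A memory item plus its Curator-written retrieval handles (hypothetical schema)."""
    cheatsheet_id: str
    content: str                                    # the learned pattern itself
    description: str                                # agent-optimized summary, written at write time
    tags: list[str] = field(default_factory=list)   # long-phrase tags
    is_main: bool = False                           # always-presented orientation item
    helpful: int = 0                                # accumulated helpful marks
    harmful: int = 0                                # accumulated harmful marks

    def retrieval_text(self) -> str:
        """Text handed to the retriever: description and tags, not the body."""
        return " | ".join([self.description, *self.tags])
```

For example, a cheatsheet described as "Checkout pods OOM after deploy" with the tags "checkout service oom kill" and "memory limit regression" would be indexed on those three phrases, whatever its body says.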

Scoring is hybrid: Reciprocal Rank Fusion over semantic embeddings + Postgres tsvector keyword match, multiplied by a freshness factor.
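A minimal sketch of that scoring shape, assuming the two rankings have already been produced elsewhere (one from embedding similarity, one from keyword match). The function names are hypothetical, `k = 60` is the constant conventionally used with Reciprocal Rank Fusion rather than a value from our system, and the freshness factor is taken as a given number per cheatsheet.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def hybrid_rank(
    semantic: list[str],
    keyword: list[str],
    freshness: dict[str, float],
    top_k: int = 5,
    k: int = 60,
) -> list[str]:
    """Fuse both rankings, scale by freshness, return the best top_k ids."""
    fused = rrf_scores([semantic, keyword], k)
    # Items without a freshness entry are treated as fully fresh (factor 1.0).
    final = {doc_id: s * freshness.get(doc_id, 1.0) for doc_id, s in fused.items()}
    # Sort by score descending; break ties by id so the output is deterministic.
    return sorted(final, key=lambda d: (-final[d], d))[:top_k]
```

Because RRF works on ranks rather than raw scores, the embedding similarity and the keyword rank never need to be put on a common scale, which is a large part of why it is a convenient way to combine the two.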

We took the design straight from Mem0, which is the production-validated reference point here: hybrid scoring, LongMemEval 94.4%, 91% lower latency than full-context baselines. Industry has converged on hybrid for a reason.

The store is no longer capped. Per run, the agent still only sees a small number of cheatsheets: the hot tier (the `is_main` items) plus the retrieved top-K.
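The per-run bundle can be sketched as "all main items, then the best retrieved items that are not already main". This is an illustration under assumptions: cheatsheets are plain dicts with hypothetical `id` and `is_main` keys, and `ranked_ids` is whatever order the retriever produced.

```python
def build_bundle(store: list[dict], ranked_ids: list[str], top_k: int = 5) -> list[dict]:
    """Return the cheatsheets shown to the agent for one run."""
    by_id = {c["id"]: c for c in store}
    # Always-presented tier: injected regardless of the query.
    main = [c for c in store if c.get("is_main")]
    main_ids = {c["id"] for c in main}
    # Retrieved tier: best-ranked items first, skipping anything already in main.
    retrieved = [by_id[i] for i in ranked_ids if i in by_id and i not in main_ids]
    return main + retrieved[:top_k]
```

The store can grow without changing the bundle size: the bundle is bounded by the number of main items plus `top_k`.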

An aside, in observability terms. What a Curator-written tag is, structurally, is just an index dimension on a memory event. Your monitoring stack indexes events on dimensions you decided would matter (service, env, region, http_status), and that index is what makes ad-hoc filtering possible. Pull the index, the data is still there, but you can't find anything fast. The Curator is doing the same job on memory: it picks the dimensions, writes them onto each cheatsheet at write-time, and the retriever queries on them later. Different optimizer (an agent, not a human SRE), different unit (a learned pattern, not a request), same primitive: you can't find what you didn't index.

An agent-optimized index of the store

PEEK injects a "context roadmap." MCE maintains a SKILL.md. Both make the same bet: agents need an index of what's in memory and how to query it, separate from the memory itself.

We let the Curator maintain an org/query guide, a small character-bounded document prepended every run. It documents the tag taxonomy, which sections are dense vs sparse, vocabulary conventions, and how to query. It's O(1) regardless of how much the store grows.
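One way to keep such a guide bounded is to reject any Curator rewrite that exceeds a character budget, then place the accepted guide ahead of the cheatsheets. The sketch below assumes that policy; the 2000-character limit, the function names, and the section headings are all hypothetical.

```python
GUIDE_CHAR_LIMIT = 2000  # hypothetical budget; the post does not state the real bound


def accept_guide(candidate: str, limit: int = GUIDE_CHAR_LIMIT) -> str:
    """Accept a Curator-written guide only if it fits the character budget."""
    if len(candidate) > limit:
        raise ValueError(f"guide is {len(candidate)} chars, limit is {limit}")
    return candidate


def build_memory_section(guide: str, cheatsheets: list[str]) -> str:
    """Prepend the guide so the agent reads the index before the content."""
    parts = ["## Memory guide", guide, "## Cheatsheets", *cheatsheets]
    return "\n\n".join(parts)
```

Rejecting rather than truncating is a design choice in this sketch: it forces the writer to compress the guide instead of silently losing its tail.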

After a few weeks in production, the guides started picking up patterns we hadn't prompted for. One client's guide opens with a project-boundary warning: the AWS account hosts two unrelated projects, and the Curator caught Annie repeatedly conflating them. Another guide, already deep into double-digit versions, documents section densities and Kafka cluster anchors so future-Annie searches with the right terms. I clicked through expecting boilerplate the first time and found the guide had organized itself around boundaries we hadn't taught it.

The agent gets to query the store mid-run

Top-K is the wrong answer when the agent realizes mid-investigation that it needs something the retriever didn't pick. Anthropic's production memory pattern treats memory as a tool the agent calls on demand, and reports +39% on agentic search and −84% tokens on long workflows.

We expose a memory search tool to the master agent. Same backend as the retriever, one source of truth, two callers. With the org/query guide above already prepended every run, the agent has the vocabulary to write good queries with. The Curator also calls it at write time to dedup before ADD (find near-duplicates, UPDATE them instead of inserting a near-copy).
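The "one backend, two callers" idea, including dedup-before-ADD, can be sketched as follows. This is a toy under stated assumptions: the store maps ids to descriptions, `difflib.SequenceMatcher` stands in for the real hybrid scorer, and the 0.85 threshold and function names are hypothetical.

```python
from difflib import SequenceMatcher


def search_memory(store: dict[str, str], query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Single search backend, callable by the agent as a tool or by the Curator."""
    scored = [
        (cid, SequenceMatcher(None, query.lower(), desc.lower()).ratio())
        for cid, desc in store.items()
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # best match first, ties by id
    return scored[:top_k]


def curator_add(store: dict[str, str], new_id: str, description: str,
                dedup_threshold: float = 0.85) -> str:
    """Search before writing: UPDATE a near-duplicate instead of inserting a near-copy."""
    hits = search_memory(store, description, top_k=1)
    if hits and hits[0][1] >= dedup_threshold:
        existing_id = hits[0][0]
        store[existing_id] = description
        return f"UPDATE {existing_id}"
    store[new_id] = description
    return f"ADD {new_id}"
```

Because both callers go through `search_memory`, whatever the agent can find mid-run is exactly what the Curator checks against at write time.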

In practice, the master agent rarely reaches for it. The optimistic read is that the retrieved bundle is mostly enough. The cautious read is that we haven't trained the agent well enough to know when to reach. Probably some of both.

Memory freshness, three tried-and-true mechanisms

The natural first attempt at handling outdated memory is to stamp each cheatsheet with a create/update date and trust the agent to discount older ones from context. That's roughly what ACE-as-shipped already did, and it's what wasn't working. The STALE benchmark cited earlier explains part of it: the best model in the suite scores 55.2% at inferring memory age, so passive timestamps don't reliably get discounted. The rest is that "is this stale" isn't actually one question; it's three, and each one is a different lever from a different memory paper:

Anchor	Lever
MemoryBank	`last_validated_at` is bumped on every helpful mark. Useful cheatsheets stay fresh; ignored ones decay.
Zep	`valid_until`, an optional explicit expiry on factual cheatsheets. Drives a hard `expiry_gate` in the retrieval score.
STALE benchmark	Inline staleness labels in the prompt ("last confirmed 12 days ago"). The model can't reliably infer age, so it has to see it.

All three are layered on top of the retrieval scoring. A stale cheatsheet doesn't just rank lower; the agent gets told it's stale in plain text next to the content.
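The three levers can be sketched together. The field names `last_validated_at` and `valid_until` and the `expiry_gate` come from the table above; everything else is an assumption for illustration, in particular the exponential decay and its 30-day half-life, which stand in for whatever decay curve a real system would choose.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HALF_LIFE_DAYS = 30.0  # hypothetical decay half-life, not a value from the post


def expiry_gate(valid_until: Optional[datetime], now: datetime) -> float:
    """Hard gate: 0.0 once an explicit expiry has passed, otherwise 1.0."""
    return 0.0 if valid_until is not None and now >= valid_until else 1.0


def freshness_factor(last_validated_at: datetime, valid_until: Optional[datetime],
                     now: datetime) -> float:
    """Multiplier for the retrieval score: time decay times the expiry gate."""
    age_days = max((now - last_validated_at).total_seconds() / 86400.0, 0.0)
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    return decay * expiry_gate(valid_until, now)


def mark_helpful(cheatsheet: dict, now: datetime) -> None:
    """Reinforce on use: a helpful mark resets the validation clock."""
    cheatsheet["helpful"] = cheatsheet.get("helpful", 0) + 1
    cheatsheet["last_validated_at"] = now


def staleness_label(last_validated_at: datetime, now: datetime) -> str:
    """Inline label rendered next to the cheatsheet content in the prompt."""
    days = max((now - last_validated_at).days, 0)
    unit = "day" if days == 1 else "days"
    return f"last confirmed {days} {unit} ago"
```

The gate and the decay act on ranking, while the label acts on the model's reading of the content, so a cheatsheet that still makes it into the bundle arrives with its age attached.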

A few weeks in: prompt cut in half, harmful rate down six-fold

The stores grew past the old ceiling. Sixteen clients had a chat in the last week. Several of them now have master-agent stores roughly twice the old cap. Before retrieval, every one of these would have been clamped.

Harmful marks per execution fell sharply. Matched 12-day windows on either side of the rollout:

Metric	OLD ACE	NEW ACE
Helpful marks / execution	5.63	4.26
Harmful marks / execution	0.30	0.05

Helpful marks per execution fell about 24% (5.63 to 4.26). Harmful marks per execution dropped far more: a six-fold reduction (0.30 to 0.05). The agent is getting hurt by stale or misapplied cheatsheets meaningfully less often than before.
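The arithmetic behind those readings, worked from the table's four numbers only. The helper names are hypothetical, and the harmful share of all marks is a derived figure, not one we measured separately.

```python
OLD = {"helpful": 5.63, "harmful": 0.30}  # marks per execution, before rollout
NEW = {"helpful": 4.26, "harmful": 0.05}  # marks per execution, after rollout


def fold_reduction(before: float, after: float) -> float:
    """How many times smaller the new value is."""
    return before / after


def relative_change(before: float, after: float) -> float:
    """Signed fractional change from before to after."""
    return (after - before) / before


def harmful_share(marks: dict[str, float]) -> float:
    """Fraction of all marks in an execution that are harmful."""
    return marks["harmful"] / (marks["helpful"] + marks["harmful"])


harmful_fold = fold_reduction(OLD["harmful"], NEW["harmful"])    # about 6x
helpful_delta = relative_change(OLD["helpful"], NEW["helpful"])  # about -24%
share_before, share_after = harmful_share(OLD), harmful_share(NEW)
```

On these numbers the harmful share of all marks goes from roughly 5.1% to roughly 1.2%, so the drop in harmful marks is not just a side effect of fewer marks overall.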

Prompt size: about half, drawn from a much bigger store. The cheatsheets section is now roughly half the size it used to be, even though the underlying memory it's drawn from is several times larger than the old capped set. Each retrieved cheatsheet now carries its own retrieval and freshness metadata, and an org/query guide gets prepended every run, so we spent some of the savings on navigation aids the agent needs to use the smaller content set well. The net is meaningfully smaller per-run, against a meaningfully bigger memory.

No memory-centered failure modes since rollout. This is the signal that matters most. PostHog made the same argument for agents in messy environments: evals miss the failures that matter, and the highest-signal practice is staring at production traces every week. We go through traces weekly with the SRE team and have a recurring sync with each design-partner client. No memory-centered failure mode has surfaced since shipping. That's the result we're putting the most weight on.

The bet

We're not claiming a benchmark beat. We don't have one to claim. Our agent runs in the wild, where the eval problem is genuinely hard, security constraints rule out the most unhinged research patterns (no agent writing and executing its own context-optimization code), and the cadence of useful iteration is set by what clients say in the next sync, not by a leaderboard. What we do have is a construct: a posture toward agent memory borrowed from the most interesting recent papers, translated into something a normal engineering team can deploy, review, and roll back.

And the shape we picked is one practitioners would recognize on sight. Not unhinged context-engineering machinery an agent writes for itself, but the same indexing-and-retrieval discipline an SRE already applies to production events. Different optimizer, same primitives. We toned the research down into a vocabulary that fits how production context is already built. That alignment matters more to us than a benchmark would.

The curation engine ACE pioneered is still the right engine for accumulating institutional knowledge. The right use-path is a small always-on orientation tier, per-query retrieval scored against retrieval-handles the Curator writes alongside the content, an agent-optimized guide that documents the store's shape, freshness gates the model can see, and a search tool for the long tail.

A few weeks of production says the harmful-rate has collapsed and the stores keep growing the way we hoped. The Curator wrote the AWS-account-with-two-projects boundary warning into the guide on its own. That wasn't prompted by us, and it's the part I keep watching.