
The tool-call context window that quietly crippled our agent

Published 22 June 2026 · Updated 22 June 2026

Autonomous agents accumulate tool results until the context fills up and the system instructions are silently truncated. We fixed it by capping the working memory. Here is the exact threshold we chose.

We deployed a procurement agent that negotiated with three vendor APIs sequentially. It worked flawlessly in staging. In production, on day two, it started ignoring its output format constraints and returning unparseable JSON to the orchestrator. The culprit was not a prompt issue. It was context window economics.

The silent truncation trap

Each vendor API returned roughly 4,000 tokens of structured data. After three consecutive tool calls, the agent was sitting on 12,000 tokens of raw results. On its own that is under a tenth of a 128k-token window, but nothing was ever removed: add the base system prompt, the scratchpad, and a conversation history that grew with every step, and we were brushing against the 128k ceiling.

Agent stacks rarely fail gracefully at the context limit. Some APIs reject an oversized request outright; in our stack, the message history was silently truncated from the top. The system instructions that enforced the strict JSON schema were dropped, while the massive tool payloads in the middle were happily retained. The agent simply forgot its rules.

This is the tool-call context trap. Stateless APIs treat every call as independent, but an agent loop is stateful by design. Every tool result you append to the message history is a one-way ratchet. You consume context budget you never get back, and the model degrades well before hard failure.
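The ratchet is easy to see with a little bookkeeping. The sketch below is illustrative, not our production code: the 128,000-token window and the 4,000-token results come from the figures above, while the 2,000-token system prompt and the function name are assumptions made for the example.

```python
CONTEXT_WINDOW = 128_000  # tokens; the ceiling quoted above


def utilisation_after_each_call(base_tokens, result_tokens, calls):
    """Return context utilisation (0..1) after each appended tool result.

    Nothing is ever removed from the history, so the series only grows.
    """
    used = base_tokens
    series = []
    for _ in range(calls):
        used += result_tokens  # append-only: the one-way ratchet
        series.append(used / CONTEXT_WINDOW)
    return series


# Assumed 2,000-token system prompt; 4,000-token results as in the text.
series = utilisation_after_each_call(base_tokens=2_000, result_tokens=4_000, calls=3)
print([round(u, 3) for u in series])
```

The point of the sketch is the shape of the series, not the specific values: without an explicit eviction step, utilisation can only go up.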

Measuring the compliance cliff

We measured the output quality drop carefully. At 60 percent context utilisation, schema compliance sat at 99 percent. At 85 percent utilisation, compliance fell to 72 percent. The model started omitting required keys and hallucinating fallback values. It was not an outage, but it was quiet data corruption.
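A measurement like this can be approximated by logging context utilisation alongside every model output and grouping compliance by utilisation band. The following is a minimal sketch under stated assumptions: the required keys are a hypothetical schema, not our real one, and the ten-point bands are an arbitrary choice.

```python
import json
from collections import defaultdict

REQUIRED_KEYS = {"vendor", "price", "currency"}  # hypothetical schema


def is_compliant(raw_output):
    """True if the output parses as a JSON object with every required key."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)


def compliance_by_bucket(samples):
    """Group (utilisation, raw_output) pairs into ten-point utilisation bands.

    Returns {band_start_percent: compliant_fraction}, sorted by band.
    """
    counts = defaultdict(lambda: [0, 0])  # band -> [compliant, total]
    for utilisation, raw_output in samples:
        band = (round(utilisation * 100) // 10) * 10
        counts[band][0] += is_compliant(raw_output)
        counts[band][1] += 1
    return {band: ok / total for band, (ok, total) in sorted(counts.items())}


samples = [
    (0.61, '{"vendor": "a", "price": 10, "currency": "EUR"}'),
    (0.88, '{"vendor": "a"}'),  # missing required keys
]
print(compliance_by_bucket(samples))
```

Checking only for parseability and required keys will miss hallucinated fallback values, so a real harness would also need to validate the values themselves.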

The naive fix is summarisation. You run a secondary LLM call to compress tool results before feeding them back into the loop. We tried this. It added 1.4 seconds of latency per step and cost 0.003 EUR per summarisation. On a ten-step agent run, that adds up to 14 seconds of extra delay and 0.03 EUR per run, which becomes real money at production volume.
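The overhead is simple arithmetic, sketched here so it can be re-run with other step counts. The per-step latency and cost are the figures quoted above; the daily run count is a hypothetical volume chosen only to show how the cost scales.

```python
LATENCY_PER_SUMMARY_S = 1.4     # seconds added per step, from the text
COST_PER_SUMMARY_EUR = 0.003    # EUR per summarisation call, from the text


def summarisation_overhead(steps):
    """Return (extra latency in seconds, extra cost in EUR) for one agent run."""
    return steps * LATENCY_PER_SUMMARY_S, steps * COST_PER_SUMMARY_EUR


latency_s, cost_eur = summarisation_overhead(steps=10)
print(f"per run: {latency_s:.1f} s, {cost_eur:.3f} EUR")

HYPOTHETICAL_RUNS_PER_DAY = 10_000  # assumption for illustration only
print(f"per day: {cost_eur * HYPOTHETICAL_RUNS_PER_DAY:.2f} EUR")
```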

Capping the working memory

Instead, we implemented a hard context budget at the orchestrator level. The agent loop now tracks the cumulative token count of tool results, which is what we call working memory. When working memory exceeds 40 percent of the model context window, the oldest tool results are truncated to just their schema headers and a status code.

This means the agent remembers that vendor API one returned a 200 OK with a price field, but it forgets the exact 4,000-token payload. We traded raw recall for instruction adherence. It is a deliberate trade-off. An agent that forgets its rules is more dangerous than an agent that forgets its data.
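The compaction rule can be sketched as follows. This is a minimal illustration, not our orchestrator code: the `ToolResult` type, the four-characters-per-token estimate, the stub format, and the choice to never compact the most recent result are all assumptions made for the example. A real implementation should count tokens with the tokenizer that matches the model.

```python
from dataclasses import dataclass

CONTEXT_WINDOW = 128_000  # tokens
BUDGET_FRACTION = 0.40    # working-memory share of the window


@dataclass
class ToolResult:
    tool: str
    status: int
    field_names: list     # top-level fields, i.e. the "schema header"
    payload: str          # raw body returned by the tool
    compacted: bool = False


def estimate_tokens(text):
    """Rough heuristic (about 4 characters per token), not a real tokenizer."""
    return max(1, len(text) // 4)


def render(result):
    """Text that is actually placed into the message history."""
    if result.compacted:
        names = ",".join(result.field_names)
        return f"[{result.tool}] status={result.status} fields={names}"
    return result.payload


def enforce_budget(results, window=CONTEXT_WINDOW, fraction=BUDGET_FRACTION):
    """Compact the oldest results in place until working memory fits the budget.

    The most recent result is never compacted (an assumption of this sketch),
    so the agent can always read what it just fetched. Returns the token total.
    """
    budget = int(window * fraction)
    total = sum(estimate_tokens(render(r)) for r in results)
    for result in results[:-1]:  # oldest first
        if total <= budget:
            break
        if result.compacted:
            continue
        before = estimate_tokens(render(result))
        result.compacted = True
        total += estimate_tokens(render(result)) - before
    return total
```

Calling `enforce_budget` before every model call keeps the rule deterministic: the same history always yields the same compaction, which makes failures reproducible in a way that model-side forgetting is not.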

We set the 40 percent threshold empirically. Below that, we saw zero degradation in downstream task accuracy. Above it, the compliance curve fell off a cliff. Your exact threshold depends on your prompt density, but the pattern is general: unbounded tool context will eventually break your agent.

Monitor your agent loop for cumulative tool token growth. Set a hard budget for working memory, and define a deterministic truncation strategy before the model decides what to forget for you. A constrained agent is a reliable agent.
