The token burst that crippled our agent loop
Published · Updated
Our multi-step agent hit a 94-token retry that bloated to 3,100 tokens per run. The culprit was not the LLM, but an unbounded context accumulation strategy we forgot to cap.
We deployed a customer-support agent that iteratively refined database queries. It worked flawlessly in staging, averaging 400 tokens per turn across three steps. In production, one edge case pushed it into a retry spiral that consumed 3,100 tokens per run and spiked costs by 340% overnight.
The failing pattern
The agent appended every failed SQL attempt and its error output to the prompt before asking the model to try again. The logic assumed the model would naturally correct itself with more context. In reality, the growing error history just gave it more noise to overfit to, producing increasingly bizarre queries.
A single malformed date filter caused the initial failure. By retry three, the prompt contained three distinct stack traces. The model latched onto the wrong schema assumptions buried in those traces, ignoring the original user question entirely. It was not reasoning; it was hallucinating from its own error log.
Why truncation was not enough
Our first fix was a naive sliding window that truncated the error history to the last 1,000 tokens. This stopped the token burst, but it created a worse problem. The model lost the specific column constraint that caused the original error, so it repeatedly generated the exact same flawed query.
The truncation destroyed the causal link between the error and the fix. Without the full context of why the query failed, the model had no signal to correct its syntax. We traded a cost explosion for an infinite retry loop that hit our max iteration guardrail and failed silently.
The state diff solution
We replaced the raw error accumulation with a state diff mechanism. Instead of appending full stack traces, a lightweight script extracted just the specific Postgres error code and the offending column name. This summary replaced the verbose logs, capping the retry context at roughly 80 tokens per attempt.
The model now received a precise signal: error 42703 on column discharge_date. It corrected the schema reference immediately. Average tokens per run dropped back to 460, and our retry success rate actually climbed from 62% to 88%. Less context meant less confusion.
The hard limit backstop
We also added a hard cap of two retries for any iterative tool call. If the agent cannot resolve the error in two attempts, it halts and returns a structured failure payload to the orchestrator. This prevents unbounded token growth regardless of the summarisation logic.
Unbounded context accumulation is a hidden cost multiplier in agent systems. If your agent appends raw errors or tool outputs on retry, cap the context growth and summarise the failure signal. A precise error digest beats a verbose stack trace every time, for both the model and your margin.
Working on a project where these methods apply?