
Cheaper to roll back: checkpointing in multi-step agent loops


When an LLM tool call fails halfway through a ten-step agent run, re-running from scratch can nearly double latency and cost. Checkpointing state at each step cut our retry spend by 62 percent.

Autonomous agents executing long tool chains face an uncomfortable reality: LLM tool calls fail. Network timeouts, API rate limits, and malformed outputs routinely break runs halfway through. The default response is to retry the entire sequence from step one. This is expensive and slow.

We tracked a procurement agent over three weeks. It averaged 8.4 tool calls per run. The failure rate at any individual step was 3.1 percent. Assuming failures are independent across steps, that compounds to nearly a quarter of runs failing somewhere in the chain. Naive retries meant re-executing steps that had already succeeded, paying for their tokens and API calls twice.
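The compounding can be checked in a few lines. This is a minimal sketch that assumes step failures are independent and treats the average of 8.4 calls as the chain length; the function name is illustrative, not from our codebase.

```python
def run_failure_probability(step_failure_rate: float, steps: float) -> float:
    """Chance that at least one step in the chain fails.

    Assumes every step fails independently with the same probability.
    """
    return 1.0 - (1.0 - step_failure_rate) ** steps


# Figures from the procurement agent: 3.1 percent per step, 8.4 steps per run.
p_run_fails = run_failure_probability(0.031, 8.4)
print(f"Share of runs with at least one failure: {p_run_fails:.1%}")
```

The result lands a little under one in four, which is where the "nearly a quarter" figure comes from.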

The cost of stateless retries

The core problem is statelessness. If step seven fails, you lose the context of steps one through six. Re-prompting the LLM to redo those steps burns tokens. It also increases latency, as downstream APIs must process duplicate requests. In our case, a failed run cost 1.8 times as much as a successful one.

We introduced checkpointing. After every successful tool call, the agent writes its accumulated state, including tool outputs and the shortened prompt, to a temporary store. Redis works fine for this. When a step fails, the agent catches the exception and reloads the last valid checkpoint, not the initial prompt.
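To make the difference concrete, the sketch below counts tool-call attempts for a ten-step run that fails at step seven, once with a full restart and once with a resume from the last checkpoint. It assumes every step costs the same and that the retry succeeds first time, so it is a rough model rather than a reproduction of our measured costs. Both function names are hypothetical.

```python
def attempts_with_restart(total_steps: int, failed_step: int) -> int:
    """Tool-call attempts when a failure restarts the run from step one."""
    completed_before_failure = failed_step - 1
    failed_attempt = 1
    # The whole chain is executed again after the failure.
    return completed_before_failure + failed_attempt + total_steps


def attempts_with_checkpoint(total_steps: int, failed_step: int) -> int:
    """Tool-call attempts when the run resumes from the last checkpoint."""
    completed_before_failure = failed_step - 1
    failed_attempt = 1
    # Only the failed step and the steps after it are executed on resume.
    remaining = total_steps - failed_step + 1
    return completed_before_failure + failed_attempt + remaining


restart = attempts_with_restart(total_steps=10, failed_step=7)
resume = attempts_with_checkpoint(total_steps=10, failed_step=7)
print(f"restart: {restart} attempts, resume: {resume} attempts")
```

Under these assumptions the restart makes 17 attempts against 11 for the resume. The later the failure occurs in the chain, the wider the gap becomes.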

Implementing the checkpoint loop

The implementation is straightforward. Wrap each tool call inside the execution loop in a try-except block, so that a failure does not abandon the loop. On success, append the tool response to a state array and save it. On failure, fetch the latest state array and re-prompt the LLM with only the remaining task context. The LLM never sees the failed step.
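The loop described above can be sketched as follows. This is an illustration under stated assumptions, not our production code: a dict-backed `InMemoryStore` stands in for Redis so the example runs without a server, a fixed list of callables stands in for the LLM choosing the next tool call, and all names (`run_with_checkpoints`, `load_checkpoint`, the three example tools) are hypothetical. Tool outputs are assumed to be JSON-serialisable.

```python
import json
from typing import Callable, Dict, List, Optional

State = List[dict]
Tool = Callable[[State], object]


class InMemoryStore:
    """Dict-backed stand-in for Redis so the sketch runs without a server."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def load_checkpoint(store: InMemoryStore, key: str) -> State:
    """Return the last saved state, or an empty state if none exists."""
    saved = store.get(key)
    return json.loads(saved) if saved else []


def run_with_checkpoints(run_id: str, tools: List[Tool],
                         store: InMemoryStore, max_retries: int = 2) -> State:
    key = f"checkpoint:{run_id}"
    state = load_checkpoint(store, key)
    retries = 0
    while len(state) < len(tools):
        step = len(state)  # the next step is the first one without a saved output
        try:
            output = tools[step](state)
        except Exception:
            retries += 1
            if retries > max_retries:
                raise
            # Resume from the last valid checkpoint, not from step one.
            state = load_checkpoint(store, key)
            continue
        state.append({"step": step, "output": output})
        store.set(key, json.dumps(state))  # checkpoint after every success
        retries = 0
    return state


# Example run: the third tool times out once, then succeeds.
calls = {"find_supplier": 0, "get_quote": 0, "place_order": 0}


def find_supplier(state: State) -> str:
    calls["find_supplier"] += 1
    return "supplier-17"


def get_quote(state: State) -> float:
    calls["get_quote"] += 1
    return 42.5


def place_order(state: State) -> str:
    calls["place_order"] += 1
    if calls["place_order"] == 1:
        raise TimeoutError("simulated network timeout")
    return "order-accepted"


final_state = run_with_checkpoints(
    "run-001", [find_supplier, get_quote, place_order], InMemoryStore())
print(calls)
```

In this run the first two tools are called once each and only the failing tool is called twice. The retry counter resets after each success, so the retry budget applies per step rather than per run; whether that is the right policy depends on the agent.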

There is a trade-off. Checkpoint state takes up memory, and you must handle cache invalidation. We set a 30-minute TTL on checkpoints. You also need idempotent tool APIs. If step three writes to a database, retrying from step three must not duplicate that write. Attach a transaction ID to each write so the receiving system can detect and skip a duplicate.
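One way to get idempotent writes is to derive the transaction ID deterministically from the run and the step, so a retried step presents the same ID as its first attempt. The sketch below assumes that scheme; `idempotency_key` and `OrdersTable` are hypothetical names, and the class is an in-memory stand-in for a database table with a unique constraint on the transaction ID.

```python
import hashlib
from typing import Dict


def idempotency_key(run_id: str, step: int, tool_name: str) -> str:
    """Same run, step and tool always produce the same transaction ID."""
    raw = f"{run_id}:{step}:{tool_name}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class OrdersTable:
    """In-memory stand-in for a table keyed uniquely by transaction ID."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}

    def insert_once(self, txn_id: str, row: dict) -> bool:
        """Insert the row unless this transaction ID was already written."""
        if txn_id in self.rows:
            return False  # duplicate from a retry: skip the write
        self.rows[txn_id] = row
        return True


table = OrdersTable()
txn_id = idempotency_key("run-001", 3, "place_order")

first_write = table.insert_once(txn_id, {"item": "widgets", "qty": 100})
# A retry of step three recomputes the same key and is rejected.
retry_write = table.insert_once(
    idempotency_key("run-001", 3, "place_order"), {"item": "widgets", "qty": 100})
print(first_write, retry_write, len(table.rows))
```

For the TTL half of the trade-off, the redis-py client accepts an expiry in seconds on write, for example `client.set(key, payload, ex=1800)` for a 30-minute lifetime.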

Measuring the rollback dividend

The numbers from our production deployment were clear. Checkpointing reduced our average retry cost by 62 percent. Latency on recovered runs dropped from 14 seconds to 3.1 seconds, because we skipped re-calling four successful upstream APIs. The Redis overhead was negligible compared to LLM token costs.

We also observed an unexpected benefit. Shorter prompts on retry improved success rates. When we re-prompted with the full history, the LLM occasionally got confused by the length. Resuming with a compact state summary yielded cleaner tool calls, dropping step-level failures from 3.1 percent to 2.2 percent.

Not every agent needs this. If your agent averages two steps and failures are rare, the engineering overhead is not justified. But for any agent orchestrating four or more dependent API calls, checkpointing is a basic reliability requirement, not an optimisation.

Stop re-running successful steps. Identify your longest agent chains, instrument the per-step failure rates, and calculate what a full restart costs you in tokens and latency. Add a simple state save after each tool call, and retry from the last known good state. Your latency and your invoice will thank you.
