Process Automation · Reliability · Circuit Breakers · Observability

The circuit breaker that saved our automation from an API rate-limit cascade


When a downstream API tightened rate limits, our deterministic pipeline retried itself into a 4-hour backlog. A per-key circuit breaker cut recovery time from hours to 90 seconds.

Our invoice processing pipeline handles 12,000 documents daily through a deterministic sequence of OCR, classification, and ERP posting. It ran smoothly for months until the accounting API provider silently tightened its rate limits from 600 to 200 requests per minute.

The retry cascade

The pipeline's default retry logic treated every 429 response as a transient failure. It backed off for 60 seconds, then dumped its entire queued backlog at the API again. Within ten minutes, all 16 worker threads were stuck in exponential backoff loops.

By the time alerts fired, the queue depth had grown from 200 to 8,400 documents. Processing latency spiked from an average of 4 seconds per document to over 20 minutes. The automation was technically running, but entirely stuck in retry cycles.

Retries without circuit breakers are a denial-of-service attack on your own infrastructure. When a downstream service degrades, immediate retries compound the failure. You need a mechanism that stops your pipeline from hammering a service that cannot accept traffic.

Implementing the breaker

We implemented a per-API-key circuit breaker with a simple state machine. After 5 consecutive 429 responses on a single key, the breaker trips to open state. In open state, the pipeline immediately rejects the request locally without calling the API, returning a controlled failure.

The breaker stays open for 30 seconds, then transitions to half-open. It allows exactly one test request through. If that request succeeds, the breaker closes and the pipeline resumes normal throughput. If it fails, the open timer resets for another 30 seconds.
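The state machine described above can be sketched in a few dozen lines. This is a minimal illustration rather than our production code: the class and method names (`CircuitBreaker`, `before_call`, `record`) are hypothetical, any non-429 response is assumed to count as success, and the sketch is not thread-safe on its own.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised locally when a request is rejected without calling the API."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def before_call(self):
        """Call before each request; raises CircuitOpenError if it is rejected."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("breaker is open")
            # Open window elapsed: allow exactly one test request.
            self.state = State.HALF_OPEN
            self.probe_in_flight = False
        if self.state is State.HALF_OPEN:
            if self.probe_in_flight:
                raise CircuitOpenError("test request already in flight")
            self.probe_in_flight = True

    def record(self, status_code):
        """Call after each response with its HTTP status code."""
        if self.state is State.OPEN:
            return  # late response to a request sent before the trip
        if status_code == 429:
            self.consecutive_failures += 1
            failed_probe = self.state is State.HALF_OPEN
            if failed_probe or self.consecutive_failures >= self.failure_threshold:
                self.state = State.OPEN
                self.opened_at = self.clock()  # restart the open timer
                self.probe_in_flight = False
        else:
            self.consecutive_failures = 0
            if self.state is State.HALF_OPEN:
                self.state = State.CLOSED
                self.probe_in_flight = False
```

A caller wraps each API request with `before_call()` and `record(status)`, and treats `CircuitOpenError` as the controlled local failure.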

This pattern changed our failure dynamics completely. Instead of 16 workers accumulating 429 errors and backing off independently, the breaker coordinated them. When the API limit was hit, the breaker tripped within 2 seconds, and every worker stopped sending requests on that key at once.
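The coordination comes from all workers consulting one shared breaker per API key instead of each tracking its own failures. The sketch below illustrates only that sharing. The names (`SharedKeyBreaker`, `BreakerRegistry`, `run_workers`) are hypothetical, the open timer and half-open state are left out for brevity, and the 429 response is simulated rather than coming from a real API.

```python
import threading


class SharedKeyBreaker:
    """Failure counter shared by every worker that uses one API key."""

    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.consecutive_429s = 0
        self.is_open = False
        self._lock = threading.Lock()

    def allows_request(self):
        with self._lock:
            return not self.is_open

    def record(self, status_code):
        with self._lock:
            if status_code == 429:
                self.consecutive_429s += 1
                if self.consecutive_429s >= self.failure_threshold:
                    self.is_open = True
            else:
                self.consecutive_429s = 0


class BreakerRegistry:
    """Hands out exactly one breaker per API key, safely across threads."""

    def __init__(self):
        self._breakers = {}
        self._lock = threading.Lock()

    def for_key(self, api_key):
        with self._lock:
            if api_key not in self._breakers:
                self._breakers[api_key] = SharedKeyBreaker()
            return self._breakers[api_key]


def run_workers(registry, api_key, worker_count=16):
    """Each worker attempts one request; the API is simulated as always returning 429."""

    def worker():
        breaker = registry.for_key(api_key)
        if breaker.allows_request():
            breaker.record(429)  # simulated rate-limit response

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because the count is shared, five 429 responses from any mix of workers open the breaker for that key, while breakers for other keys stay closed.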

The queue still built up during the 30-second open windows, but the ERP posting stage recovered within 90 seconds of the rate limit resetting. Compare that to the previous 4 hours of exponential backoff thrashing across independent worker threads.

Observability and tuning

We instrumented the breaker state changes as Prometheus counters. Tracking open-to-half-open transitions gave us a direct measure of downstream API pressure. A PagerDuty alert now fires when breaker trips reach 3 per minute, well before queue depth spikes.
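The counting and alert threshold can be illustrated without any monitoring stack. The sketch below uses only the standard library as a stand-in: in a real deployment the counters would be exported through a Prometheus client library and the threshold expressed as an alerting rule. `BreakerMetrics` and its methods are hypothetical names, and counting every transition into the open state as a "trip" is an assumption.

```python
import time
from collections import Counter, deque


class BreakerMetrics:
    def __init__(self, alert_trips_per_minute=3, clock=time.monotonic):
        # (api_key, from_state, to_state) -> number of transitions seen
        self.transitions = Counter()
        self.alert_trips_per_minute = alert_trips_per_minute
        self.clock = clock  # injectable so tests can control time
        self._recent_trips = deque()  # timestamps of transitions into "open"

    def on_transition(self, api_key, from_state, to_state):
        """Call from the breaker whenever its state changes."""
        self.transitions[(api_key, from_state, to_state)] += 1
        if to_state == "open":
            self._recent_trips.append(self.clock())

    def trips_last_minute(self):
        """Number of trips in the trailing 60 seconds."""
        cutoff = self.clock() - 60.0
        while self._recent_trips and self._recent_trips[0] <= cutoff:
            self._recent_trips.popleft()
        return len(self._recent_trips)

    def should_alert(self):
        return self.trips_last_minute() >= self.alert_trips_per_minute
```

Keeping the from and to states in the counter key makes it possible to separate first trips (closed to open) from failed recovery tests (half-open to open), which signal different levels of downstream pressure.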

Circuit breakers add a small fixed latency cost in half-open state while testing recovery, and they require tuning the failure threshold and reset timer to your specific downstream service. We found 5 failures and 30 seconds optimal for a rate-limited HTTP API.

If your deterministic pipeline relies on external APIs without circuit breakers, you are one silent rate limit change away from a multi-hour queue backlog. Add per-endpoint breakers with observability, tune them to your actual throughput limits, and test them by deliberately throttling your staging API.
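One way to rehearse the throttling test before touching a real staging environment is a stub that enforces a per-minute budget. The sketch below is an illustration under assumptions: `ThrottledStubApi` and `count_calls_until_trip` are hypothetical names, and the fixed one-minute window is a guess at how a provider might enforce its limit, not a description of any particular API.

```python
import time


class ThrottledStubApi:
    """Stand-in for a staging API: returns 429 once the per-minute budget is spent."""

    def __init__(self, limit_per_minute=200, clock=time.monotonic):
        self.limit_per_minute = limit_per_minute
        self.clock = clock  # injectable so tests can control time
        self._window_start = clock()
        self._used = 0

    def post(self, document_id):
        now = self.clock()
        if now - self._window_start >= 60.0:
            # A new fixed window begins: reset the budget.
            self._window_start = now
            self._used = 0
        if self._used >= self.limit_per_minute:
            return 429
        self._used += 1
        return 200


def count_calls_until_trip(api, failure_threshold=5, max_calls=10_000):
    """Send requests until `failure_threshold` consecutive 429s; return calls made."""
    consecutive = 0
    for calls in range(1, max_calls + 1):
        if api.post(calls) == 429:
            consecutive += 1
            if consecutive >= failure_threshold:
                return calls
        else:
            consecutive = 0
    return None  # the threshold was never reached
```

With the stub's limit lowered well below production throughput, a test can assert that the breaker trips after the expected number of calls and that no further requests reach the API while it is open.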

Circuit breakers are not optional safety nets for process automation. They are the only mechanism that lets a pipeline fail fast, protect its downstream dependencies, and self-recover in seconds rather than hours. Instrument the state transitions, and you turn a silent cascade into a visible, manageable event.

