The idempotency key saved our automation from a queue replay
When our message broker replayed 12,000 events overnight, our workers created a duplicate record for every one of them. Here is how idempotency keys turned the next replay into a minor log entry, and the implementation pattern we now mandate.
Process automation pipelines break in predictable ways. The most damaging failure is not a crash, but a silent duplicate. An orchestration queue retries a message, the downstream API accepts it twice, and you are suddenly reconciling double bookings or duplicate invoices. We learned this the hard way.
The overnight replay
Our RabbitMQ cluster lost a node at 02:14 on a Tuesday. The failover succeeded, but the rebalancing re-delivered 12,000 messages that had already been processed. Without explicit idempotency controls, our CRM integration faithfully created 12,000 duplicate contact records before anyone noticed the alert.
Cleaning the CRM took 6 engineer-hours and a data rollback. The actual business cost was minor because it was an internal staging environment. Had this hit the live billing pipeline, the exposure would have been significant. The root cause was simple: our workers treated every message as novel.
Why deduplication queues are not enough
Brokers can deduplicate for you, up to a point. AWS SQS FIFO queues drop messages that repeat a message deduplication ID within a 5-minute window; RabbitMQ has no built-in equivalent and relies on plugins. This works for network retries but fails for broader replay scenarios. Our broker replay happened 4 hours after original processing. Any deduplication window of that kind would have long expired.
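To make the window problem concrete, here is a toy model of time-windowed deduplication. It is a sketch, not how any broker implements it: the `WindowedDeduplicator` class, the `evt-1` ID, and the timestamps are all invented for illustration.

```python
from datetime import datetime, timedelta


class WindowedDeduplicator:
    """Toy model of transport-level deduplication with a fixed time window."""

    def __init__(self, window: timedelta):
        self.window = window
        self._seen = {}  # dedup ID -> time the ID was last accepted

    def accept(self, dedup_id: str, now: datetime) -> bool:
        """Return True if the message is let through, False if dropped as a duplicate."""
        last_accepted = self._seen.get(dedup_id)
        if last_accepted is not None and now - last_accepted < self.window:
            return False
        self._seen[dedup_id] = now
        return True


dedup = WindowedDeduplicator(window=timedelta(minutes=5))
t0 = datetime(2024, 1, 2, 22, 14)  # arbitrary timestamp, for illustration only

first = dedup.accept("evt-1", t0)                                  # original delivery
network_retry = dedup.accept("evt-1", t0 + timedelta(seconds=30))  # inside the window
late_replay = dedup.accept("evt-1", t0 + timedelta(hours=4))       # outside the window

print(first, network_retry, late_replay)
```

The retry 30 seconds later is dropped, but the replay 4 hours later passes straight through to the consumer.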
Relying on infrastructure deduplication conflates transport guarantees with business guarantees. Even if the broker delivers each message exactly once, the consumer can fail after the external side effect and before the acknowledgement. The broker will then rightly redeliver, and the side effect runs a second time. You must handle application-level idempotency yourself.
The pattern we now mandate
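We implemented an idempotency store in Redis with a TTL. Before executing any external side effect, the worker attempts an atomic set-if-absent using a deterministic key derived from the event payload. This can be SETNX, but SET with the NX and EX options is preferable because it writes the key and its expiry in one step; SETNX followed by a separate EXPIRE can leave a key that never expires if the worker dies in between. If the key already exists, the worker acks the message and skips processing. The TTL is set to 72 hours to cover weekend replays.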
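A minimal sketch of that check-then-act step follows. To keep it runnable without a server it uses `InMemoryStore`, a hypothetical stand-in that mimics only the `set(name, value, nx=True, ex=seconds)` call of the redis-py client (which returns `True` when the key is written and `None` when NX blocks the write). The `handle_message` helper and the `idem:demo-key` key are likewise invented here, not our production code.

```python
import time

IDEMPOTENCY_TTL_SECONDS = 72 * 60 * 60  # 72 hours, as described above


class InMemoryStore:
    """Stand-in for a Redis client so the sketch runs without a server.

    Mimics only set(name, value, nx=..., ex=...): returns True when the key
    was written and None when NX blocked the write.
    """

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, name, value, nx=False, ex=None):
        now = time.monotonic()
        entry = self._data.get(name)
        exists = entry is not None and (entry[1] is None or entry[1] > now)
        if nx and exists:
            return None
        expiry = now + ex if ex is not None else None
        self._data[name] = (value, expiry)
        return True


def handle_message(store, key, side_effect):
    """Run side_effect at most once per key; return 'processed' or 'skipped'."""
    claimed = store.set(key, "1", nx=True, ex=IDEMPOTENCY_TTL_SECONDS)
    if not claimed:
        return "skipped"  # key already present: ack the message and do nothing
    side_effect()
    return "processed"


store = InMemoryStore()
created = []
results = [
    handle_message(store, "idem:demo-key", lambda: created.append("contact"))
    for _ in range(3)  # the same message delivered three times
]
print(results, created)
```

One consequence of claiming the key before acting: if a worker dies between the claim and the side effect, the redelivered message is skipped. The sketch does not address that case.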
The key derivation is critical. We hash a composite of the tenant ID, the action type, and the unique business identifier from the payload. A generic message ID is useless; two messages requesting the same action must map to the same key. This requires the upstream producer to include that stable business identifier.
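As an illustration, the derivation could look something like this. The function name, the `idem:` prefix, the choice of SHA-256, and the length-prefix encoding are assumptions made for this sketch; only the three inputs come from the description above.

```python
import hashlib


def idempotency_key(tenant_id: str, action_type: str, business_id: str) -> str:
    """Derive a deterministic key from the fields that define the action."""
    # Length-prefix each part so ("ab", "c", ...) and ("a", "bc", ...)
    # cannot produce the same composite string.
    parts = (tenant_id, action_type, business_id)
    composite = "".join(f"{len(part)}:{part}" for part in parts)
    digest = hashlib.sha256(composite.encode("utf-8")).hexdigest()
    return f"idem:{digest}"


# Hypothetical values: two deliveries of the same request map to one key.
key_a = idempotency_key("tenant-42", "create_contact", "crm-ref-1001")
key_b = idempotency_key("tenant-42", "create_contact", "crm-ref-1001")
key_c = idempotency_key("tenant-42", "create_contact", "crm-ref-1002")
print(key_a == key_b, key_a == key_c)
```

Note that no message ID, timestamp, or delivery metadata goes into the hash, so a replayed message with fresh transport headers still lands on the same key.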
The trade-off: storage and key design
Idempotency stores have a cost. At scale, storing millions of keys in Redis consumes memory. We accept this because the cost of a single duplicate billing run far exceeds the infrastructure cost. We keep TTLs tight. 72 hours covers most operational replays while preventing unbounded key growth.
Key design is where most implementations fail. If you hash the entire payload, any changed field creates a new key, defeating the purpose. If you hash only the event ID, you cannot deduplicate across different event types that trigger the same downstream action. You must identify what makes an action unique in business terms.
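The three strategies can be compared side by side. Everything in this sketch is hypothetical: the event fields (`event_id`, `event_type`, `sent_at`, `invoice_id`), their values, and the helper names are made up to show the failure modes, not taken from our schema.

```python
import hashlib
import json


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def key_full_payload(event: dict) -> str:
    """Hash everything: any changed field yields a new key."""
    return _sha(json.dumps(event, sort_keys=True))


def key_event_id(event: dict) -> str:
    """Hash only the event ID: blind to other events requesting the same action."""
    return _sha(event["event_id"])


def key_business(event: dict) -> str:
    """Hash what makes the action unique in business terms."""
    return _sha(json.dumps([event["tenant_id"], event["action"], event["invoice_id"]]))


original = {
    "event_id": "e-1",
    "event_type": "order.paid",
    "tenant_id": "t-42",
    "action": "create_invoice",
    "invoice_id": "inv-1001",
    "sent_at": "2024-01-02T02:14:00Z",
}
# The same event, re-stamped by the broker on replay.
replayed = {**original, "sent_at": "2024-01-02T06:14:00Z"}
# A different event type that triggers the same downstream action.
sibling = {**original, "event_id": "e-2", "event_type": "subscription.renewed"}

for name, fn in [("full payload", key_full_payload),
                 ("event id", key_event_id),
                 ("business key", key_business)]:
    catches_replay = fn(original) == fn(replayed)
    catches_sibling = fn(original) == fn(sibling)
    print(f"{name}: replay deduplicated={catches_replay}, sibling deduplicated={catches_sibling}")
```

Only the business key catches both the re-stamped replay and the sibling event.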
Measuring the safety net
After deploying idempotency keys across our production pipelines, we tracked the skip rate. During a planned Kafka migration that required a topic replay, our workers processed 340,000 events. The idempotency store recorded 338,400 skips. Only 1,600 events were new or had expired keys. Zero duplicate actions occurred.
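The arithmetic behind those figures can be checked with a small tally. The `summarize` helper and the outcome labels are hypothetical; the counts are the ones reported above.

```python
from collections import Counter


def summarize(outcomes: Counter):
    """Return (total, skipped, executed, skip_rate) from a tally of worker outcomes."""
    total = sum(outcomes.values())
    skipped = outcomes["skipped"]
    executed = total - skipped
    rate = skipped / total if total else 0.0
    return total, skipped, executed, rate


# Counts from the replay described above.
replay = Counter({"skipped": 338_400, "processed": 1_600})
total, skipped, executed, rate = summarize(replay)
print(f"{total} events, {skipped} skips, {executed} executed, skip rate {rate:.2%}")
# prints: 340000 events, 338400 skips, 1600 executed, skip rate 99.53%
```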
We also observed a 15% reduction in processing time during replay scenarios because skipping the external API call is orders of magnitude faster than executing it. The idempotency check adds 2 milliseconds per event. The trade-off is unequivocally positive for any pipeline with external side effects.
If your automation pipeline triggers actions you cannot trivially undo, idempotency keys are not optional. Derive the key from the stable business identifier, set a TTL that covers your operational replays, and log every skip. It turns a catastrophic queue replay into a routine log entry with zero manual cleanup.