The rerank precision tax that tripled our RAG latency

Adding a cross-encoder reranker lifted our RAG recall from 68% to 89%, but the synchronous inference call tripled p99 latency. Here is the asymmetric retrieval pattern that cut delay without surrendering quality.

We deployed a Cohere reranker to fix ragged retrieval from our pgvector index. It worked: recall at k=10 jumped from 68% to 89%. But our p99 latency spiked from 340ms to 1120ms. The cross-encoder inference was murdering the user experience, and we had to find a compromise that did not sacrifice the quality we had just bought.

The synchronous rerank trap

Our initial pipeline was straightforward. We fetched 50 candidates from pgvector using HNSW, then passed all 50 to the reranker synchronously before returning the top 10. At 50 documents, the cross-encoder took 600ms on average. The vector search itself took only 80ms. The reranker therefore accounted for about 88% of the 680ms the two steps took on average, and users noticed the lag immediately.
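The 88% figure follows from the two averages quoted above, treating the vector search and the rerank as strictly sequential steps. A minimal sketch of that arithmetic (the helper name `rerank_share` is ours, purely for illustration):

```python
def rerank_share(vector_ms: float, rerank_ms: float) -> float:
    """Fraction of the sequential request time spent in the reranker."""
    return rerank_ms / (vector_ms + rerank_ms)


# Averages quoted in the text: 80ms vector search, 600ms rerank.
share = rerank_share(vector_ms=80, rerank_ms=600)
print(f"{share:.0%}")  # 88%
```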

The problem compounded under load. At 40 concurrent requests, our rerank service saturated its GPU allocation, and queue times pushed p99 past 2.4 seconds. We had bought precision at a steep interactivity cost. The retrieval was accurate, but the application felt broken. We needed to break the synchronous dependency.

The asymmetric retrieval pattern

We split the retrieval into two tiers. The first request fetches 150 candidates from pgvector and returns the top 15 immediately to the user, relying on dense vector similarity alone. This gives us a 90ms response. The UI renders these preliminary results right away, giving the user something to read.

Simultaneously, we fire an asynchronous rerank job for those same 150 candidates. By the time the user has scanned the first two preliminary results, the reranked top 10 arrives via a streaming update. We cache the reranked results in Redis with a 6-hour TTL. Subsequent queries for similar intents hit the cache directly, bypassing the reranker entirely.
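The two-tier flow can be sketched with asyncio. This is an illustration under stated assumptions, not our production code: `dense_search` and `rerank` are placeholders for the pgvector query and the cross-encoder call, an in-process dict stands in for Redis, and `cache_key` uses simple query normalisation as a stand-in for matching similar intents.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional

CACHE_TTL_S = 6 * 60 * 60  # the 6-hour TTL described above
# Hypothetical in-process stand-in for Redis: key -> (expiry, reranked chunks).
_cache: dict[str, tuple[float, list[str]]] = {}


def cache_key(query: str) -> str:
    """Assumed normalisation; a real system might match intents differently."""
    return " ".join(query.lower().split())


async def dense_search(query: str, limit: int) -> list[str]:
    """Placeholder for the pgvector HNSW query, best match first."""
    await asyncio.sleep(0.01)
    return [f"chunk-{i}" for i in range(limit)]


async def rerank(query: str, chunks: list[str], top_n: int) -> list[str]:
    """Placeholder for the cross-encoder; reverses the order to mimic a reshuffle."""
    await asyncio.sleep(0.05)
    return chunks[::-1][:top_n]


async def retrieve(
    query: str,
    on_update: Callable[[list[str]], Awaitable[None]],
) -> tuple[list[str], Optional[asyncio.Task]]:
    """Return results right away, plus the background rerank task if one started."""
    key = cache_key(query)
    hit = _cache.get(key)
    if hit is not None and hit[0] > time.monotonic():
        return hit[1], None  # cache hit: already reranked, no inference call

    candidates = await dense_search(query, limit=150)

    async def rerank_and_publish() -> None:
        top = await rerank(query, candidates, top_n=10)
        _cache[key] = (time.monotonic() + CACHE_TTL_S, top)
        await on_update(top)  # e.g. push over the streaming channel

    # The caller keeps the task reference so it is not garbage-collected early.
    task = asyncio.create_task(rerank_and_publish())
    return candidates[:15], task


async def demo() -> None:
    async def show(top: list[str]) -> None:
        print("reranked:", top[:3])

    prelim, task = await retrieve("How do I rotate API keys?", show)
    print("preliminary:", prelim[:3])
    if task is not None:
        await task


if __name__ == "__main__":
    asyncio.run(demo())
```

The important design choice is that `retrieve` never awaits the reranker: the dense top 15 go back as soon as the vector search finishes, and the reranked top 10 arrive later through the `on_update` callback.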

We also reduced the reranker input. Instead of sending 50 full documents, we send only the 150-word chunk stored alongside the embedding. This trimmed the cross-encoder payload by 40%, cutting average rerank time from 600ms to 380ms. The model needs text, not metadata, to score relevance.
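A minimal sketch of building the reranker payload from the stored chunk alone. The row shape (`id`, `chunk`, `metadata`) is a hypothetical schema for illustration, and the word cap is only a guard in case a stored chunk runs longer than the 150 words we expect.

```python
from typing import TypedDict

MAX_CHUNK_WORDS = 150  # chunk size stored alongside each embedding


class Row(TypedDict):
    """Hypothetical shape of a candidate row returned by the vector search."""
    id: int
    chunk: str
    metadata: dict


def rerank_inputs(rows: list[Row], max_words: int = MAX_CHUNK_WORDS) -> list[str]:
    """Return only the chunk text for each row, capped at max_words words."""
    inputs = []
    for row in rows:
        words = row["chunk"].split()
        inputs.append(" ".join(words[:max_words]))  # metadata is never sent
    return inputs


rows: list[Row] = [
    {"id": 1, "chunk": "rotate keys from the admin console", "metadata": {"source": "kb"}},
    {"id": 2, "chunk": "word " * 200, "metadata": {"source": "wiki"}},
]
payload = rerank_inputs(rows)
print([len(text.split()) for text in payload])  # [6, 150]
```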

Measuring the trade-off

The results were clear. End-to-end p99 dropped from 1120ms to 410ms on cache misses and to 140ms on cache hits. Our cache hit rate sits at 34% for production traffic. Recall at k=10 held steady at 88.6%, a negligible 0.4-point dip from the 89% synchronous baseline. Users stopped complaining about lag, and our GPU costs fell by 22% thanks to fewer redundant rerank calls.
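For readers who want to reproduce the recall comparison on their own data, here is a minimal sketch of recall at k for a single query. It assumes the standard definition (the share of labelled relevant documents that appear in the top k); the document IDs below are made up, and in practice the value is averaged over an evaluation set.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Share of the relevant documents that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


# Illustrative IDs only: two of the four relevant documents are in the top 3.
retrieved_ids = ["a", "b", "c", "d", "e"]
relevant_ids = {"a", "c", "y", "z"}
print(recall_at_k(retrieved_ids, relevant_ids, k=3))  # 0.5
```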

We track this with a simple dashboard that graphs dense-only recall against reranked recall each week. If the gap widens by more than 5 points, we tune the HNSW ef_search parameter. If latency creeps up, we check the cache eviction rate. The monitoring cost is minimal, but the visibility is essential for catching embedding drift before it erodes the cache's usefulness.
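The weekly check can be expressed as a small rule function. This sketch makes two assumptions: it reads the 5-point rule as growth of the recall gap relative to a recorded baseline gap, and it takes the latency threshold as a caller-supplied budget, since no fixed number is given above. All names are illustrative.

```python
def weekly_actions(
    dense_recall: float,      # percent, dense-only recall at k=10
    reranked_recall: float,   # percent, reranked recall at k=10
    baseline_gap: float,      # percentage points, gap recorded at rollout
    p99_ms: float,
    p99_budget_ms: float,     # hypothetical budget chosen by the team
    max_widening: float = 5.0,
) -> list[str]:
    """Return the follow-up actions suggested by this week's numbers."""
    actions = []
    widening = (reranked_recall - dense_recall) - baseline_gap
    if widening > max_widening:
        actions.append("tune hnsw.ef_search")
    if p99_ms > p99_budget_ms:
        actions.append("check cache eviction rate")
    return actions


# Dense recall slipped from 68% to 62% while reranked recall held at 89%.
print(weekly_actions(62.0, 89.0, baseline_gap=21.0, p99_ms=410, p99_budget_ms=500))
# ['tune hnsw.ef_search']
```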

Rerankers are powerful but expensive. Decouple the slow cross-encoder from the critical rendering path. Return dense results immediately, stream the reranked results asynchronously, and cache aggressively. You keep the precision gain without forcing users to wait for synchronous inference.

Start by measuring what percentage of your p99 latency the reranker consumes. If it exceeds 50%, move it out of the hot path. Return fast preliminary results, update the UI asynchronously, and track your cache hit rate. Precision is useless if the user abandons the page before it arrives.

