The HNSW probe budget that tanked our RAG latency
Published · Updated
Switching to pgvector HNSW cut our index build time by 80% but spiked RAG p99 latency to 4.2 seconds. The culprit was a single default we failed to override for production queries.
We migrated a 1.8 million row semantic search index from pgvector IVFFlat to HNSW. Build times dropped from 3 hours to 35 minutes, and recall at 10 looked solid in our offline evals. We shipped it on a Friday and watched p99 query latency climb to 4.2 seconds by Monday morning.
The HNSW speed trap
IVFFlat does a fixed number of centroid probes per query. HNSW traverses a graph, and the traversal depth is controlled by the ef_search parameter. If you do not set it explicitly, PostgreSQL falls back to a conservative default of 40.
At 40 probes on our 768-dimensional dataset, the index returned 10 candidate vectors fast enough. But recall was just 72 percent. The downstream LLM then wasted 3.1 seconds hallucinating to fill context gaps that better retrieval would have supplied outright.
Tuning the probe budget
We benchmarked ef_search values between 40 and 400 against our holdout eval set. At ef_search 200, recall hit 91 percent and average query latency settled at 18 milliseconds. Pushing to 400 only yielded 93 percent recall, but p95 latency jumped to 41 milliseconds.
The trade-off curve is sharp. Every doubling of ef_search past 200 added 12 milliseconds of CPU time but recovered less than two percentage points of recall. We set the GUC at the session level before every RAG query, keeping the default low for non-critical lookups.
Setting it at the session level
We wrapped the retrieval call in a transaction that sets the parameter dynamically. For user-facing RAG we execute SET LOCAL hnsw.ef_search = 200. For async background indexing jobs we leave it at the 40 default. This isolated the latency impact to the queries that actually needed it.
The SET LOCAL approach is critical. If you set it globally in postgresql.conf, your write throughput and background maintenance queries suffer. Session-level configuration lets you allocate probe budget strictly where recall justifies the compute cost.
Measuring recall in production
We log the top 10 returned chunk IDs alongside the known relevant set from our weekly eval sample. This gives us a live recall metric per query class. When recall drops below 85 percent for any tenant, we alert and bump their ef_search for the next session.
This operational visibility matters because data distribution shifts. As new documents enter the index, the graph topology changes. A static ef_search that worked at 1.8 million rows degraded noticeably at 2.3 million rows. You need real measurement, not assumptions.
What we learned
HNSW is not a free lunch. The index build is faster and base queries are quicker, but you must actively manage the probe budget per query class. Default parameters will silently sacrifice recall, and your LLM will compensate by burning tokens on guesswork.
Set ef_search per session based on query criticality. Measure recall against a held-out set continuously. Re-benchmark your probe budget every time the index grows by 20 percent. Treat the parameter like a cache TTL: a knob you tune with evidence, not a set-and-forget constant.
Working on a project where these methods apply?