
The embedding drift that broke our semantic search


We swapped an embedding model to cut latency by 40ms. Three weeks later, recall on short queries had silently dropped 18%. Here is the exact measurement that caught it and the reindexing trade-off we chose.

Six months ago we upgraded the embedding model for a client's semantic search pipeline. The goal was simple: cut vectorisation latency by 40 milliseconds per document. The new model passed our offline benchmark suite cleanly. We rolled it out. Three weeks later, support tickets about missing results trickled in.

The silent regression

Against a static test set, our eval suite scored the old model at 95% precision and 88% recall. The new model hit 94% precision and 87% recall. Statistically indistinguishable. But production traffic is not a static test set. The new model mapped short, ambiguous queries into a tighter cluster than the old one did.

Queries with three words or fewer saw a measured 18% recall drop. Longer queries were unaffected. The static test set was weighted toward detailed, multi-word queries because those were easier to label. We had a blind spot baked into our evaluation methodology.

Diagnosing the vector shift

We caught it because we log the cosine similarity distribution of top results per query pattern. The average top-1 similarity for short queries shifted from 0.84 to 0.71. The retrieved documents were technically relevant, just not the specific ones users expected. The geometry of the embedding space had shifted.
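A minimal sketch of that kind of monitoring, assuming each logged search yields a (query text, top-1 cosine similarity) pair. The function names and the two-bucket split are illustrative assumptions, not our production logging code; the three-word cutoff is the one used in this post.

```python
from collections import defaultdict
from statistics import mean


def length_bucket(query: str) -> str:
    """Bucket a query by word count (three words or fewer counts as short)."""
    return "short" if len(query.split()) <= 3 else "long"


def top1_similarity_by_bucket(records):
    """Average top-1 similarity per bucket.

    records: iterable of (query_text, top1_cosine_similarity) pairs.
    """
    buckets = defaultdict(list)
    for query, similarity in records:
        buckets[length_bucket(query)].append(similarity)
    return {bucket: mean(values) for bucket, values in buckets.items()}


# Illustrative values only, not real production data.
logged = [
    ("red shoes", 0.70),
    ("return policy", 0.72),
    ("how do I change the delivery address on an order", 0.85),
]
print(top1_similarity_by_bucket(logged))
```

Tracking this per bucket over time, rather than as one global average, is what makes a shift confined to short queries visible.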

When you change embedding models, you do not just change speed or accuracy. You reshape the entire topology of the vector space. Nearest neighbours shift. Documents that sat at the boundary of two concepts get pulled firmly into one cluster. That boundary movement is where production regressions hide.

The reindexing trade-off

We had two options. Roll back to the old model and reindex 2.3 million documents, costing roughly 14 hours of compute. Or accept the new model and retrain the query expansion layer to compensate for the tighter clustering. Rolling back meant 14 hours of degraded search freshness for a catalogue that updates daily.

We chose to retrain the expansion layer. It took two days of engineering work and a full regression run. We added 1,200 short-query examples to the eval set, weighted to match the production query distribution we pulled from application logs. Recall on short queries recovered to 85%.
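One way to weight an eval set toward production traffic is per-bucket importance weights: production share divided by eval-set share. The sketch below assumes that reading of "weighted"; the function names and bucket labels are hypothetical.

```python
from collections import Counter


def bucket_weights(eval_buckets, production_buckets):
    """Weight per bucket = production share / eval-set share."""
    eval_counts = Counter(eval_buckets)
    prod_counts = Counter(production_buckets)
    eval_total = sum(eval_counts.values())
    prod_total = sum(prod_counts.values())
    weights = {}
    for bucket, count in eval_counts.items():
        eval_share = count / eval_total
        prod_share = prod_counts.get(bucket, 0) / prod_total
        weights[bucket] = prod_share / eval_share
    return weights


def weighted_recall(results, weights):
    """Recall where each example counts in proportion to its bucket weight.

    results: list of (bucket, hit) pairs; hit is True when the expected
    document was retrieved for that example.
    """
    total = sum(weights[bucket] for bucket, _ in results)
    hits = sum(weights[bucket] for bucket, hit in results if hit)
    return hits / total


# Illustrative: the eval set is mostly long queries, production mostly short.
weights = bucket_weights(
    eval_buckets=["short"] + ["long"] * 3,
    production_buckets=["short"] * 3 + ["long"],
)
results = [("short", False), ("long", True), ("long", True), ("long", True)]
print(weighted_recall(results, weights))
```

With these toy inputs the unweighted recall is 3 out of 4, while the weighted figure is much lower because the single missed example sits in the bucket that dominates production traffic.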

What we changed

Now every embedding model swap requires a distribution shift report. We embed the same 10,000 sample documents with both the old and the new model and compute the pairwise cosine distances between those documents within each vector space. If the average pairwise distance shifts by more than 0.05 between the two spaces, we flag it for manual review before deployment.
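A sketch of such a report, under stated assumptions: vectors are plain lists of floats, none is all zeros, and pairs are randomly sampled rather than enumerated exhaustively (10,000 documents give roughly 50 million pairs). The function names, the pair count and the seed are illustrative; only the 0.05 threshold comes from the process described above.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(vectors, pairs):
    """Average cosine distance over the given (i, j) index pairs."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs)
    return total / len(pairs)


def distribution_shift_report(old_vectors, new_vectors,
                              n_pairs=1000, threshold=0.05, seed=0):
    """Compare average pairwise distance of the same documents in two spaces.

    old_vectors[i] and new_vectors[i] must embed the same document.
    Distances are only ever computed within one space, so the two models
    may have different dimensionality.
    """
    if len(old_vectors) != len(new_vectors):
        raise ValueError("both models must embed the same documents")
    rng = random.Random(seed)
    n = len(old_vectors)
    # Sample the same document pairs for both spaces.
    pairs = [tuple(rng.sample(range(n), 2)) for _ in range(n_pairs)]
    old_mean = mean_pairwise_distance(old_vectors, pairs)
    new_mean = mean_pairwise_distance(new_vectors, pairs)
    shift = abs(new_mean - old_mean)
    return {
        "old_mean": old_mean,
        "new_mean": new_mean,
        "shift": shift,
        "needs_review": shift > threshold,
    }
```

At real corpus sizes a vectorised library would replace the pure-Python loops; the logic stays the same.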

We also split our eval set by query length. Short queries, long queries, and entity-heavy queries get separate recall metrics. A single aggregate number hides the failure modes that actually reach users. You have to segment your measurements the same way your users segment their intent.
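A sketch of segmented recall, assuming each eval example carries the query, the set of relevant document IDs and the list of retrieved IDs. How a query gets classified as entity-heavy is not specified here, so the `entity_terms` lookup below is a placeholder assumption, as are the function names.

```python
from collections import defaultdict


def segment(query, entity_terms=frozenset()):
    """Assign a query to one segment.

    Hypothetical rule: any word found in entity_terms makes the query
    entity-heavy; otherwise three words or fewer is short, the rest long.
    """
    words = query.lower().split()
    if any(word in entity_terms for word in words):
        return "entity-heavy"
    return "short" if len(words) <= 3 else "long"


def recall_by_segment(examples, entity_terms=frozenset()):
    """Recall per segment.

    examples: iterable of (query, relevant_ids, retrieved_ids).
    """
    found = defaultdict(int)
    expected = defaultdict(int)
    for query, relevant, retrieved in examples:
        seg = segment(query, entity_terms)
        expected[seg] += len(relevant)
        found[seg] += len(set(relevant) & set(retrieved))
    return {seg: found[seg] / expected[seg] for seg in expected if expected[seg]}


# Illustrative examples only.
examples = [
    ("red shoes", {"d1", "d2"}, ["d1"]),
    ("how to return a damaged item bought online", {"d3"}, ["d3", "d4"]),
    ("acme warranty", {"d5"}, []),
]
print(recall_by_segment(examples, entity_terms={"acme"}))
```

Reporting one number per segment means a regression in a single segment cannot be averaged away by the others.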

Swapping embedding models is not a drop-in operation. Measure the vector space topology, segment your evaluations by query type, and never trust a single aggregate recall number to protect your production search experience.

The pragmatic takeaway: before you switch an embedding model, embed a fixed sample of your production corpus with both models and compare the neighbourhood structures. If the nearest neighbours for your most common query types change by more than 10%, expect visible regressions and budget for query-side compensation work.
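One way to make that comparison concrete is to measure, per query, what fraction of the top-k retrieved documents differs between the two models. The sketch below takes that reading of "change by more than 10%"; the function names and the brute-force search are illustrative assumptions, and a real index would supply the neighbour lists instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_neighbours(query_vec, doc_vecs, k):
    """Indices of the k documents most similar to the query (brute force)."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return set(ranked[:k])


def neighbour_change_rate(old_queries, old_docs, new_queries, new_docs, k=10):
    """Mean fraction of top-k neighbours that differ between two models.

    Index i must refer to the same query (or document) in both models.
    Each query is only compared with documents embedded by the same model.
    """
    size = min(k, len(old_docs))
    changes = []
    for q_old, q_new in zip(old_queries, new_queries):
        old_set = top_k_neighbours(q_old, old_docs, size)
        new_set = top_k_neighbours(q_new, new_docs, size)
        changes.append(1.0 - len(old_set & new_set) / size)
    return sum(changes) / len(changes)


def expect_regressions(change_rate, threshold=0.10):
    """True when the change rate exceeds the 10% rule of thumb."""
    return change_rate > threshold
```

Running this separately for each common query type, not over all queries at once, keeps it consistent with the segmented evaluation described earlier.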
