The grounded assistant trade-off: precision at 96 %, recall at 41 %

Pushing citation strictness to 96 % precision in a grounded assistant cut useful recall to 41 %. Here is how we calibrate the threshold and why precision in the mid-80s is the pragmatic ceiling for most enterprise copilots.

Grounded assistants are supposed to cite their sources. In practice, forcing an LLM to attach a source to every claim creates a brutal precision versus recall trade-off that most teams discover only after deployment. We ran into this building a policy copilot for a logistics firm, and the numbers were sobering.

The naive approach and its failures

The initial prompt was simple: answer the question using only the provided documents, and cite the document ID for each sentence. In offline tests, answers looked solid. In production, users started complaining that the assistant was refusing half their queries.

The LLM had become so conservative about citing that it would only answer when the retrieved chunk matched the query nearly word for word. We measured a citation precision of 96 %, meaning almost every cited source was truly relevant. But recall sat at 41 %: the system left 59 % of the valid, applicable policy information unused.
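To make the two numbers concrete, here is a minimal sketch of how citation precision and recall can be computed. It assumes each query has a set of cited document IDs and a set of IDs labelled as valid support. The `CitationResult` type, the document IDs, and the micro-averaging choice are illustrative assumptions, not the exact scoring used on the project.

```python
from dataclasses import dataclass


@dataclass
class CitationResult:
    cited: set[str]     # document IDs the assistant cited (hypothetical IDs)
    relevant: set[str]  # document IDs labelled as valid support


def citation_precision_recall(results: list[CitationResult]) -> tuple[float, float]:
    """Micro-averaged citation precision and recall over all queries."""
    true_pos = sum(len(r.cited & r.relevant) for r in results)
    total_cited = sum(len(r.cited) for r in results)
    total_relevant = sum(len(r.relevant) for r in results)
    # Guard against empty denominators (for example, every query refused).
    precision = true_pos / total_cited if total_cited else 0.0
    recall = true_pos / total_relevant if total_relevant else 0.0
    return precision, recall


results = [
    CitationResult(cited={"P-12"}, relevant={"P-12", "P-31"}),
    CitationResult(cited=set(), relevant={"P-07"}),  # a refusal: nothing cited
    CitationResult(cited={"P-44", "P-45"}, relevant={"P-44"}),
]
precision, recall = citation_precision_recall(results)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A refusal adds nothing to the cited total but still adds to the relevant total, which is why a conservative assistant can score high on precision while recall collapses.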

Users do not read precision metrics. They just see a stubborn assistant that cannot answer straightforward questions because the evidence was phrased differently in the source text. High precision looks excellent in a dashboard but fails the actual job.

Rewriting the grounding prompt for tolerance

We relaxed the citation requirement. Instead of demanding a citation per sentence, we asked for citations per claim and explicitly instructed the model to infer connections when the source used synonyms or indirect language. Precision dropped to 84 %. Recall climbed to 78 %.
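The difference between the two instructions can be sketched as a pair of prompt templates. The wording below is a paraphrase of the description above, not the production prompt, and `build_prompt` along with the sample document ID is a hypothetical helper for illustration.

```python
# Strict variant: one citation per sentence (paraphrased, not the real prompt).
STRICT_PROMPT = (
    "Answer the question using only the provided documents. "
    "Cite the document ID for each sentence."
)

# Tolerant variant: citations per claim, with explicit room for paraphrase.
TOLERANT_PROMPT = (
    "Answer the question using only the provided documents. "
    "Cite the document ID for each claim rather than each sentence. "
    "A source still supports a claim when it uses synonyms or indirect "
    "language; infer the connection and cite it."
)


def build_prompt(instructions: str, documents: dict[str, str], question: str) -> str:
    """Assemble instructions, ID-tagged documents, and the user question."""
    doc_block = "\n".join(f"[{doc_id}] {text}" for doc_id, text in documents.items())
    return f"{instructions}\n\nDocuments:\n{doc_block}\n\nQuestion: {question}"


prompt = build_prompt(
    TOLERANT_PROMPT,
    {"P-12": "Drivers must rest 45 minutes after 4.5 hours of driving."},
    "How long is the mandatory break?",
)
print(prompt)
```

Keeping both variants as named constants makes it straightforward to run the same eval set against each and compare the resulting precision and recall.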

That 84 % precision meant roughly one in six citations was slightly off, pointing to a related but not definitive paragraph. For the client, this was acceptable. Users could still verify the answer against the source with a single click. The alternative, a 96 % precise model that refused roughly half of all queries, was useless.

We also found that chunk size mattered more than retrieval top-k. Switching from 150-token chunks to 350-token chunks reduced context fragmentation. Citations became more accurate because the model had enough surrounding context to anchor its claims properly.
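A minimal sketch of the chunk-size change, assuming fixed-size, non-overlapping chunks. Tokens are approximated here by whitespace-separated words. A real pipeline would count tokens with the model's own tokenizer, and `chunk_by_tokens` is a hypothetical helper rather than the splitter used on the project.

```python
def chunk_by_tokens(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size tokens.

    Tokens are approximated by whitespace splitting (an assumption).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# A synthetic 700-token policy document for illustration.
policy_text = " ".join(f"w{i}" for i in range(700))

small_chunks = chunk_by_tokens(policy_text, 150)
large_chunks = chunk_by_tokens(policy_text, 350)
print(len(small_chunks), len(large_chunks))
```

With larger chunks, a clause and the conditions that qualify it are more likely to land in the same chunk, so the model has one coherent passage to cite instead of several fragments.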

Measuring the threshold in production

We built a simple eval set of 240 query-document pairs labelled by the client's compliance team. Every time we adjusted the grounding prompt, we ran this set. The eval cost was roughly three hours of SME time per sprint, which is cheap compared to fixing a broken user experience.
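A regression run over such an eval set can be sketched as follows. The `EvalCase` shape, the `run_eval` helper, and the `stub_assistant` stand-in are assumptions for illustration. In practice the callable would wrap the real assistant and parse document IDs out of its answer.

```python
from typing import Callable

# (query, document IDs the reviewers labelled as valid support)
EvalCase = tuple[str, set[str]]


def run_eval(cases: list[EvalCase], cite_fn: Callable[[str], set[str]]) -> dict[str, float]:
    """Run every labelled case through cite_fn and summarise the outcome."""
    true_pos = cited = relevant = refusals = 0
    for query, gold in cases:
        predicted = cite_fn(query)
        if not predicted:
            refusals += 1  # no citation at all counts as a refusal here
        true_pos += len(predicted & gold)
        cited += len(predicted)
        relevant += len(gold)
    return {
        "precision": true_pos / cited if cited else 0.0,
        "recall": true_pos / relevant if relevant else 0.0,
        "refusal_rate": refusals / len(cases) if cases else 0.0,
    }


def stub_assistant(query: str) -> set[str]:
    """Hypothetical stand-in for the real assistant call."""
    canned = {
        "rest breaks": {"P-12"},
        "fuel cards": set(),
        "overtime": {"P-44", "P-45"},
    }
    return canned.get(query, set())


cases: list[EvalCase] = [
    ("rest breaks", {"P-12", "P-31"}),
    ("fuel cards", {"P-07"}),
    ("overtime", {"P-44"}),
]
report = run_eval(cases, stub_assistant)
print(report)
```

Tracking the refusal rate next to precision and recall surfaces the failure mode users actually complained about, which a precision figure alone hides.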

The pattern we see across grounded assistants is consistent. Above 88 % precision, recall falls off a cliff. Between 80 % and 88 % precision sits the functional range where users get answers they can verify. Below 80 %, trust erodes because too many citations are spurious.

Calibrate your grounding threshold to the user's verification cost. If the user can quickly check a citation, target 82 % precision and maximise recall. If a bad citation triggers a compliance review, push towards 88 % and accept lower recall. Measure both, decide deliberately, and ignore the vanity of perfect precision.
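One way to turn this rule into a decision procedure is sketched below. It treats the 82 % and 88 % targets as precision floors and picks the configuration with the highest recall that clears the floor. Reading the targets as floors is an assumption, and `OperatingPoint`, `precision_floor`, and `pick_operating_point` are hypothetical names. The two candidate configurations reuse the figures reported earlier in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperatingPoint:
    name: str
    precision: float
    recall: float


def precision_floor(bad_citation_triggers_review: bool) -> float:
    """Map verification cost to a precision floor (assumed reading of the rule)."""
    return 0.88 if bad_citation_triggers_review else 0.82


def pick_operating_point(
    points: list[OperatingPoint], floor: float
) -> Optional[OperatingPoint]:
    """Return the highest-recall configuration that meets the precision floor."""
    eligible = [p for p in points if p.precision >= floor]
    if not eligible:
        return None  # nothing qualifies: revisit the prompt or the retrieval
    return max(eligible, key=lambda p: p.recall)


candidates = [
    OperatingPoint("per-sentence citations", precision=0.96, recall=0.41),
    OperatingPoint("per-claim citations", precision=0.84, recall=0.78),
]

cheap_to_verify = pick_operating_point(candidates, precision_floor(False))
review_on_error = pick_operating_point(candidates, precision_floor(True))
print(cheap_to_verify, review_on_error)
```

With only these two candidates, the stricter floor falls back to the per-sentence configuration, which is a prompt to look for an intermediate variant rather than accept 41 % recall.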
