RAG Grounding And Citations

The setup

RAG evals fail when they collapse retrieval, reasoning, citation, and answer style into one score, because a single number cannot show which stage broke. A bad answer can come from missing documents, poor ranking, unsupported synthesis, or a model inventing a bridge between facts.

Picture this

Figure: good, bad, and ugly paths for grounding evaluation.

Mental model

Evaluate the pipeline in layers: did retrieval find the right evidence, did the answer use only that evidence, did citations support each claim, and did the system abstain when evidence was insufficient?
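
One way to keep these layers apart is to record a separate verdict per layer and report the earliest one that fails. The sketch below is a minimal illustration, not a prescribed implementation: `LayeredResult`, its field names, and `first_failure` are hypothetical names introduced here, and the boolean verdicts are assumed to come from whatever graders (human or automated) you already use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayeredResult:
    """Per-example verdicts, one per pipeline layer (hypothetical schema)."""
    retrieval_found_evidence: bool   # did retrieval return the needed evidence?
    answer_used_only_evidence: bool  # is every claim backed by retrieved text?
    citations_support_claims: bool   # does each citation support its claim?
    abstained_when_needed: bool      # did the system decline when evidence was insufficient?


# Layers in pipeline order, so the earliest failure is reported first.
LAYER_ORDER = [
    ("retrieval", "retrieval_found_evidence"),
    ("faithfulness", "answer_used_only_evidence"),
    ("citation", "citations_support_claims"),
    ("abstention", "abstained_when_needed"),
]


def first_failure(result: LayeredResult) -> Optional[str]:
    """Return the name of the earliest failing layer, or None if all pass."""
    for layer_name, field_name in LAYER_ORDER:
        if not getattr(result, field_name):
            return layer_name
    return None


# Retrieval succeeded, but the answer went beyond the evidence.
example = LayeredResult(
    retrieval_found_evidence=True,
    answer_used_only_evidence=False,
    citations_support_claims=False,
    abstained_when_needed=True,
)
print(first_failure(example))  # faithfulness
```

Reporting the earliest failing layer is a design choice for triage: a citation failure downstream of a faithfulness failure is usually a symptom, so the fix belongs at the earlier layer.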

Good

The good version has separate metrics for retrieval recall, citation precision, answer faithfulness, and refusal quality. It includes unanswerable questions and stale-document traps (cases where an outdated document is retrievable alongside the current one), not just questions where the answer is conveniently in paragraph one.
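
As a rough sketch of what "separate metrics" can look like, the functions below compute retrieval recall and citation precision independently and return them under separate keys rather than averaging them. The function names, the document IDs, and the edge-case conventions for empty inputs are assumptions made for illustration; the per-citation verdicts are assumed to come from a grader that checks each citation against its claim.

```python
from typing import Dict, List, Set


def retrieval_recall(retrieved_ids: List[str], relevant_ids: Set[str]) -> float:
    """Fraction of labeled-relevant documents that retrieval returned."""
    if not relevant_ids:
        return 1.0  # assumption: nothing to find counts as full recall
    hits = relevant_ids.intersection(retrieved_ids)
    return len(hits) / len(relevant_ids)


def citation_precision(citation_verdicts: List[bool]) -> float:
    """Fraction of emitted citations judged to support their claim."""
    if not citation_verdicts:
        return 0.0  # assumption: an answer with no citations earns no credit
    return sum(citation_verdicts) / len(citation_verdicts)


def score_example(
    retrieved_ids: List[str],
    relevant_ids: Set[str],
    citation_verdicts: List[bool],
) -> Dict[str, float]:
    """Report each metric under its own key instead of one blended score."""
    return {
        "retrieval_recall": retrieval_recall(retrieved_ids, relevant_ids),
        "citation_precision": citation_precision(citation_verdicts),
    }


# Retrieval found one of two relevant documents; three of four citations held up.
scores = score_example(
    retrieved_ids=["d1", "d4", "d7"],
    relevant_ids={"d1", "d2"},
    citation_verdicts=[True, True, False, True],
)
print(scores)  # {'retrieval_recall': 0.5, 'citation_precision': 0.75}
```

Keeping the two numbers apart makes the failure legible: here the citations are mostly sound, but retrieval missed half the evidence, which points at the retriever rather than the generator.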

Question	Why it matters
Is retrieval evaluated separately from answer synthesis?	Different failures need different fixes.
Do citations support specific claims?	Citation presence is not citation faithfulness.
Are unanswerable questions included?	Abstention is part of grounded behavior.
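
To make the abstention row concrete, here is a minimal sketch that scores refusal behavior separately on answerable and unanswerable questions. The `Outcome` type, the metric names `abstention_recall` and `over_refusal_rate`, and the sample data are hypothetical; the `answerable` label is assumed to come from dataset annotation, not from the model.

```python
from typing import Dict, List, NamedTuple, Optional


class Outcome(NamedTuple):
    answerable: bool  # label: does the corpus contain enough evidence?
    abstained: bool   # observed: did the system decline to answer?


def refusal_quality(outcomes: List[Outcome]) -> Dict[str, Optional[float]]:
    """Score abstention separately on unanswerable and answerable questions."""
    unanswerable = [o for o in outcomes if not o.answerable]
    answerable = [o for o in outcomes if o.answerable]

    # Share of unanswerable questions where the system correctly declined.
    abstention_recall = (
        sum(o.abstained for o in unanswerable) / len(unanswerable)
        if unanswerable else None
    )
    # Share of answerable questions where the system declined anyway.
    over_refusal_rate = (
        sum(o.abstained for o in answerable) / len(answerable)
        if answerable else None
    )
    return {
        "abstention_recall": abstention_recall,
        "over_refusal_rate": over_refusal_rate,
    }


outcomes = [
    Outcome(answerable=True, abstained=False),
    Outcome(answerable=True, abstained=False),
    Outcome(answerable=True, abstained=True),   # refused despite available evidence
    Outcome(answerable=True, abstained=False),
    Outcome(answerable=False, abstained=True),  # correct refusal
    Outcome(answerable=False, abstained=False), # answered without evidence
]
print(refusal_quality(outcomes))
# {'abstention_recall': 0.5, 'over_refusal_rate': 0.25}
```

Tracking both rates guards against a system that looks grounded only because it refuses everything. A metric returns `None` when its subset is empty, which also flags an eval set that contains no unanswerable questions at all.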

Chapter takeaway

RAG evals should inspect the evidence trail. Otherwise the model can hallucinate with footnotes, which is just fabrication with a confident citation attached.

