Chapter 9
mid
~40 min

RAG Grounding And Citations

Evaluate grounded answers, citation faithfulness, and retrieval misses separately.

Eval Toolkit 2026.05
Observability trace-first
Python 3.11
LangSmith linked
Reviewed 2026-05-17

This chapter helps you avoid 6 common eval-writing mistakes.

The setup

RAG evals fail when they collapse retrieval, reasoning, citation, and answer style into one score. A bad answer can come from missing documents, poor ranking, unsupported synthesis, or a model inventing a bridge between facts.

Picture this

[Figure: good, bad, and ugly paths for grounding evaluation.]

Mental model

Evaluate the pipeline in layers: did retrieval find the right evidence, did the answer use only that evidence, did citations support each claim, and did the system abstain when evidence was insufficient?
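The layered questions above can be sketched as a small triage function that names the first layer to fail instead of returning one blended score. This is a minimal sketch: `LayeredCase`, its fields, and the failure labels are hypothetical names for illustration, and the lists of unsupported claims and unfaithful citations are assumed to come from a reviewer or an upstream check.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredCase:
    """One eval case. All field names here are illustrative assumptions."""
    expected_doc_ids: set[str]        # evidence a correct answer needs
    retrieved_doc_ids: list[str]      # what the retriever returned
    unsupported_claims: list[str] = field(default_factory=list)
    unfaithful_citations: list[str] = field(default_factory=list)
    should_abstain: bool = False      # True when the corpus cannot answer
    did_abstain: bool = False


def diagnose(case: LayeredCase) -> str:
    """Return a label for the first layer that failed, or "ok"."""
    # Unanswerable cases are judged only on whether the system abstained.
    if case.should_abstain:
        return "ok" if case.did_abstain else "abstention_failure"
    # Did retrieval find the right evidence?
    if not case.expected_doc_ids <= set(case.retrieved_doc_ids):
        return "retrieval_miss"
    # Evidence was available, so refusing to answer is its own failure.
    if case.did_abstain:
        return "over_abstention"
    # Did the answer use only that evidence?
    if case.unsupported_claims:
        return "unsupported_claim"
    # Do citations support each claim?
    if case.unfaithful_citations:
        return "citation_mismatch"
    return "ok"


print(diagnose(LayeredCase({"policy-v2"}, ["faq-1", "blog-7"])))
```

Counting these labels across a dataset shows which layer to fix first, which a single pass/fail score cannot do.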

Good

The good version has separate metrics for retrieval recall, citation precision, answer faithfulness, and refusal quality. It includes unanswerable questions and stale-document traps, not just questions where the answer is conveniently in paragraph one.
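As a rough illustration of keeping those metrics apart, the sketch below averages retrieval recall, citation precision, and answer faithfulness independently. The dictionary keys and the `scorecard` helper are assumptions made for this example, and the per-claim and per-citation booleans are assumed to be labels from a reviewer or judge, not something this code computes.

```python
from statistics import mean


def retrieval_recall(expected: set[str], retrieved: list[str], k: int = 5) -> float:
    """Fraction of expected evidence docs that appear in the top-k results."""
    return len(expected & set(retrieved[:k])) / len(expected)


def fraction_true(labels: list[bool]) -> float:
    """Share of True labels, e.g. supported citations or supported claims."""
    return sum(labels) / len(labels)


def scorecard(cases: list[dict], k: int = 5) -> dict[str, float]:
    """Average each metric on its own; never blend them into one number."""
    # Unanswerable cases have no expected evidence, so they are excluded
    # from recall and scored by an abstention metric instead.
    answerable = [c for c in cases if c["expected"]]
    cited = [c for c in cases if c["citation_supported"]]
    claimed = [c for c in cases if c["claims_supported"]]
    return {
        "retrieval_recall": mean(
            [retrieval_recall(c["expected"], c["retrieved"], k) for c in answerable]
        ),
        "citation_precision": mean(
            [fraction_true(c["citation_supported"]) for c in cited]
        ),
        "answer_faithfulness": mean(
            [fraction_true(c["claims_supported"]) for c in claimed]
        ),
    }


cases = [
    {"expected": {"d1", "d2"}, "retrieved": ["d1", "d9", "d2"],
     "citation_supported": [True, False], "claims_supported": [True, True, False]},
    {"expected": {"d3"}, "retrieved": ["d4", "d5"],
     "citation_supported": [True], "claims_supported": [True]},
]
print(scorecard(cases))
```

In this toy data the second case has perfect citations and claims but zero recall, which is exactly the kind of split a single "helpfulness" score would hide.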

Bad

The bad version asks a judge if the final answer is "helpful" and calls the whole pipeline good. It never checks whether each cited source actually states the claim it is attached to. Footnotes are not magical truth seasoning.
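A cheap first pass at the missing check is to compare each claim against the span it cites. The sketch below is a lexical heuristic only, assuming a hypothetical `citation_supports` helper, a tiny hand-picked stopword list, and an arbitrary 0.75 threshold. It misses negation and paraphrase, so treat it as a filter that flags obvious mismatches for a human or an entailment-style judge, not as a faithfulness verdict.

```python
import re

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "or", "the", "to"}


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def citation_supports(claim: str, cited_span: str, threshold: float = 0.75) -> bool:
    """Heuristic: does the cited span contain the claim's key terms?"""
    claim_terms = tokens(claim) - STOPWORDS
    span_terms = tokens(cited_span)
    if not claim_terms:
        return False
    # Every number in the claim must appear in the span; wrong figures
    # are a common way for a citation to look right and be wrong.
    numbers = {t for t in claim_terms if t.isdigit()}
    if not numbers <= span_terms:
        return False
    overlap = len(claim_terms & span_terms) / len(claim_terms)
    return overlap >= threshold


span = "Customers may request refunds within 30 days of purchase."
print(citation_supports("Refunds are available within 30 days", span))
print(citation_supports("Refunds are available within 90 days", span))
```

The second claim carries a citation to a real policy sentence, yet the span does not say 90 days, so presence of a footnote alone would have passed it.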

Ugly

The ugly reality is messy corpora: duplicate docs, stale policies, conflicting sources, and partial permissions. Grounding evals should record document versions and expose source conflicts instead of hiding them behind a smooth answer.
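One way to expose conflicts rather than smooth them over is to log the version of every retrieved document and group sources by what they make a claim about. This is a sketch under assumptions: `SourceRecord`, the `topic` key, and the normalized `value` field are hypothetical, and extracting those values from real documents is the hard part that is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    doc_id: str
    version: str  # revision tag or content hash, recorded at retrieval time
    topic: str    # hypothetical key for what the document makes a claim about
    value: str    # the normalized fact the document asserts


def find_conflicts(sources: list[SourceRecord]) -> dict[str, list[SourceRecord]]:
    """Group sources by topic and keep topics where the sources disagree."""
    by_topic: dict[str, list[SourceRecord]] = defaultdict(list)
    for source in sources:
        by_topic[source.topic].append(source)
    return {
        topic: group
        for topic, group in by_topic.items()
        if len({s.value for s in group}) > 1
    }


sources = [
    SourceRecord("refund-policy", "2023-01", "refund_window", "30 days"),
    SourceRecord("refund-policy", "2025-03", "refund_window", "14 days"),
    SourceRecord("shipping-faq", "2024-11", "carrier", "ground"),
]
conflicts = find_conflicts(sources)
for topic, group in conflicts.items():
    print(topic, [(s.doc_id, s.version, s.value) for s in group])
```

Here the stale 2023 policy and the current one both surface for the same query; the eval record shows the disagreement and the versions involved instead of whichever value the answer happened to pick.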

Artifact to produce

Create a RAG eval sheet with query, expected evidence, retrieved docs, answer claims, cited spans, unsupported claims, and abstention expectation.
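The sheet could look like the following sketch, with one row per query and the columns named after the fields listed above. `EvalRow` and `to_csv` are hypothetical names; list-valued cells are stored as JSON strings so the sheet still opens in a spreadsheet.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class EvalRow:
    query: str
    expected_evidence: list[str]
    retrieved_docs: list[str]
    answer_claims: list[str]
    cited_spans: list[str]
    unsupported_claims: list[str]
    abstention_expected: bool


def to_csv(rows: list[EvalRow]) -> str:
    """Serialize rows to CSV text; list cells become JSON strings."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(EvalRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow({
            key: json.dumps(value) if isinstance(value, list) else value
            for key, value in asdict(row).items()
        })
    return buffer.getvalue()


row = EvalRow(
    query="What is the refund window?",
    expected_evidence=["refund-policy@2025-03"],
    retrieved_docs=["refund-policy@2023-01", "refund-policy@2025-03"],
    answer_claims=["Refunds are available within 14 days."],
    cited_spans=["Refunds may be requested within 14 days of purchase."],
    unsupported_claims=[],
    abstention_expected=False,
)
sheet = to_csv([row])
print(sheet)
```

Keeping `unsupported_claims` as its own column makes the failure visible per row rather than buried in a free-text reviewer note.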

Grounding review

Question | Why it matters
Is retrieval evaluated separately from answer synthesis? | Different failures need different fixes.
Do citations support specific claims? | Citation presence is not citation faithfulness.
Are unanswerable questions included? | Abstention is part of grounded behavior.
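For the abstention question in the review, a minimal scoring sketch is shown below. It assumes each case is a `(should_abstain, did_abstain)` pair and uses two hypothetical metric names: how often the system abstains when it should, and how often it refuses a question the corpus can answer.

```python
def abstention_report(cases: list[tuple[bool, bool]]) -> dict[str, float]:
    """Score abstention from (should_abstain, did_abstain) pairs."""
    on_unanswerable = [did for should, did in cases if should]
    on_answerable = [did for should, did in cases if not should]
    if not on_unanswerable or not on_answerable:
        # Without both kinds of question, neither rate means anything.
        raise ValueError("need both answerable and unanswerable cases")
    return {
        # Share of unanswerable questions where the system abstained.
        "abstention_recall": sum(on_unanswerable) / len(on_unanswerable),
        # Share of answerable questions the system refused anyway.
        "false_refusal_rate": sum(on_answerable) / len(on_answerable),
    }


cases = [
    (True, True),    # unanswerable, abstained
    (True, False),   # unanswerable, answered anyway
    (False, False),  # answerable, answered
    (False, True),   # answerable, refused
    (False, False),
    (False, False),
]
print(abstention_report(cases))
```

Reporting both rates matters because a system that refuses everything gets perfect abstention recall and a useless false refusal rate.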

Chapter takeaway

RAG evals should inspect the evidence trail. Otherwise the model can hallucinate with footnotes, which is just fabrication with a bibliography.

Quiz

  1. Why split retrieval and faithfulness metrics?

  2. Which is the bad version of grounding evaluation?

  3. What should the ugly reality change about your process?