Retrieval and Eval

Robi's retrieval and answers are measured against a labeled golden set because RAG quality is easy to overestimate by feel. A few good demo questions can hide weak retrieval, overconfident answers, and missing refusal behavior.

Retrieval strategy

Robi uses hybrid retrieval, followed by reranking and a gate:

Stage	Purpose
pgvector dense search	Finds semantically similar chunks even when wording differs
PostgreSQL tsvector search	Preserves exact names, project titles, tools, and acronyms
Merge	Combines dense and lexical candidates
Cross-encoder rerank	Scores candidates against the actual question
Retrieval gate	Decides whether context is strong enough to answer

Dense search handles natural language. Keyword search protects proper nouns. Reranking keeps the final context small and relevant.
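
The merge and gate stages can be sketched as follows. This is a minimal illustration, not Robi's actual code: it assumes reciprocal rank fusion (RRF) for the merge, which the table above does not specify, and a hypothetical score threshold (`min_score`) for the gate. The chunk IDs and function names are invented for the example.

```python
from collections import defaultdict


def rrf_merge(dense_ids, lexical_ids, k=60):
    """Fuse two ranked lists of chunk IDs with reciprocal rank fusion.

    Assumption: RRF is one common merge rule; the source does not say
    which rule Robi uses.
    """
    scores = defaultdict(float)
    for ranked in (dense_ids, lexical_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def passes_gate(rerank_scores, min_score=0.5):
    """Hypothetical retrieval gate: answer only if the best reranked
    candidate clears a threshold; otherwise the caller should refuse."""
    return bool(rerank_scores) and max(rerank_scores) >= min_score


# "c1" appears in both lists, so it rises to the top of the fused ranking.
merged = rrf_merge(["c3", "c1", "c7"], ["c1", "c9"])
print(merged)
```

A chunk found by both retrievers accumulates score from each list, which is why hybrid retrieval tends to favor candidates that match both semantically and lexically.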

Metrics

Metric	Result
Recall@1	0.971
Recall@5	0.971
MRR	0.971
Refusal accuracy	1.000
False refusal rate	0.000
Faithfulness	0.943

The golden set has 35 on-topic questions labeled with the source document that should answer them, plus 12 off-topic questions that should be refused.
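
As a rough sketch of how the retrieval metrics can be computed from such a golden set: the code below assumes each on-topic question carries exactly one labeled source document and that a `retrieve` function (hypothetical here) returns a ranked list of document IDs. The toy data is invented and is not Robi's golden set.

```python
def recall_at_k(ranked_docs, expected_doc, k):
    """1.0 if the labeled source document is in the top k results, else 0.0."""
    return 1.0 if expected_doc in ranked_docs[:k] else 0.0


def reciprocal_rank(ranked_docs, expected_doc):
    """1/rank of the labeled document, or 0.0 if it was not retrieved."""
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc == expected_doc:
            return 1.0 / rank
    return 0.0


def evaluate(golden, retrieve, k_values=(1, 5)):
    """Average the per-question metrics over the on-topic golden set."""
    n = len(golden)
    results = {f"recall@{k}": 0.0 for k in k_values}
    results["mrr"] = 0.0
    for question, expected_doc in golden:
        ranked = retrieve(question)
        for k in k_values:
            results[f"recall@{k}"] += recall_at_k(ranked, expected_doc, k) / n
        results["mrr"] += reciprocal_rank(ranked, expected_doc) / n
    return results


# Toy golden set and a fake retriever standing in for the real pipeline.
toy_golden = [("q1", "a"), ("q2", "b")]
toy_rankings = {"q1": ["a", "x"], "q2": ["x", "b"]}
metrics = evaluate(toy_golden, lambda q: toy_rankings[q])
print(metrics)
```

With 35 on-topic questions, Recall@k moves in steps of 1/35 (about 0.029), so a score of 0.971 corresponds to 34 of 35 questions.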

What the eval caught

The eval caught two real bugs during development.

Keyword search returned nothing for natural-language questions. Dense retrieval still returned results, which masked the failure: the hybrid path was weaker than it looked. Fixing the keyword query path made exact-match recall useful again.
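
The exact cause of the keyword bug is not described here. One common cause of this symptom is AND semantics: PostgreSQL's `plainto_tsquery` joins every surviving word with `&`, so a full-sentence question only matches chunks that contain all of its terms. The sketch below assumes that cause and shows one possible workaround, building an OR-joined query string for `to_tsquery`. The function name is hypothetical and this may not be the fix Robi actually used.

```python
import re


def to_or_tsquery(question):
    """Build an OR-joined tsquery string from a natural-language question.

    Assumption: the original failure came from AND semantics, which the
    source does not confirm.
    """
    # Keep only alphanumeric tokens so the result is valid to_tsquery input.
    terms = re.findall(r"[A-Za-z0-9]+", question.lower())
    # De-duplicate while preserving order.
    unique_terms = list(dict.fromkeys(terms))
    return " | ".join(unique_terms)


query = to_or_tsquery("What tools does Robi use for reranking?")
print(query)  # what | tools | does | robi | use | for | reranking
```

The resulting string would be passed as a bound parameter to `to_tsquery`, and results ranked so that chunks matching more terms score higher.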

One chunk could not be retrieved because its body lacked the product name. The content was correct, but the retriever had no lexical clue that the section belonged to the project being asked about. Adding contextual chunk headers fixed it and lifted Recall@1 from 0.943 to 0.971, which corresponds to going from 33 to 34 of the 35 on-topic questions.
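
A contextual chunk header can be as simple as prefixing each chunk with the titles it sits under before indexing. The sketch below is an assumption about the general technique, not Robi's implementation: the header format (`[Document > Section]`) and the function name are invented.

```python
def add_context_header(chunk_text, doc_title, section_title=None):
    """Prefix a chunk with the document (and optional section) title so the
    indexed text carries the project name even when the body never mentions it."""
    header_parts = [doc_title]
    if section_title:
        header_parts.append(section_title)
    header = " > ".join(header_parts)
    return f"[{header}]\n{chunk_text}"


chunk = "The rerank stage scores each candidate against the question."
indexed = add_context_header(chunk, "Robi", "Retrieval strategy")
print(indexed)
```

Because the header becomes part of the indexed text, both the dense embedding and the lexical index see the project name, even though the original body did not contain it.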

Faithfulness check

Faithfulness is checked with an LLM-as-judge pass that asks whether the answer's claims are supported by the retrieved context. The judge's verdict is not treated as ground truth. It is a regression signal that helps catch obvious unsupported claims before they ship.
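
The aggregation around the judge can be sketched independently of any particular model. In the example below, `judge` is a hypothetical callable that would wrap an LLM call in practice; `toy_judge` is a crude string-matching stand-in used only so the sketch runs, and the example data is invented.

```python
def faithfulness_score(examples, judge):
    """Fraction of answers the judge marks as supported by their context.

    `judge` is a hypothetical callable (answer, context) -> bool that would
    wrap an LLM call in practice.
    """
    if not examples:
        return 0.0
    supported = sum(1 for answer, context in examples if judge(answer, context))
    return supported / len(examples)


def toy_judge(answer, context):
    # Stand-in for an LLM judge: every sentence must appear in the context.
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return all(s in context for s in sentences)


context = "Robi uses pgvector for dense search."
examples = [
    ("Robi uses pgvector.", context),  # supported by the context
    ("Robi uses Redis.", context),     # not supported by the context
]
score = faithfulness_score(examples, toy_judge)
print(score)
```

Tracking this score across runs is what makes it a regression signal: a drop points at answers worth reading by hand, even if individual verdicts are noisy.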

Why refusal metrics matter

For a public assistant, a wrong answer is worse than no answer. Refusal accuracy measures whether Robi declines off-topic questions. False refusal rate checks the opposite failure mode: declining on-topic questions it should answer. Both matter because a RAG system needs to be useful and restrained at the same time.
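
Both refusal metrics reduce to simple ratios over the two halves of the golden set. The sketch below assumes refusal accuracy is the fraction of off-topic questions that were refused and false refusal rate is the fraction of on-topic questions that were refused. The function name and the boolean-list input format are hypothetical.

```python
def refusal_metrics(off_topic_refused, on_topic_refused):
    """Compute refusal accuracy and false refusal rate.

    off_topic_refused: list of bools, True if an off-topic question was refused.
    on_topic_refused: list of bools, True if an on-topic question was refused.
    """
    refusal_accuracy = sum(off_topic_refused) / len(off_topic_refused)
    false_refusal_rate = sum(on_topic_refused) / len(on_topic_refused)
    return refusal_accuracy, false_refusal_rate


# 12 off-topic questions all refused, 35 on-topic questions all answered.
acc, frr = refusal_metrics([True] * 12, [False] * 35)
print(f"{acc:.3f} {frr:.3f}")  # 1.000 0.000
```

Reporting the two numbers separately matters because a system that refuses everything would score perfectly on the first metric and fail the second.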