rpmjp/portfolio
rpmjp/projects/robi/retrieval-eval.md
Completed 2026

Robi: production RAG assistant

Retrieval-augmented chatbot with hybrid search, guardrails, eval, and live monitoring.

FastAPI, Python, Postgres, pgvector, Redis, Groq, Prometheus, Grafana, Docker

Retrieval and Eval

Robi is measured because RAG quality is easy to overestimate by feel. A few good demo questions can hide weak retrieval, false confidence, and missing refusal behavior.

Retrieval strategy

Robi uses hybrid retrieval:

| Stage | Purpose |
| --- | --- |
| pgvector dense search | Finds semantically similar chunks even when wording differs |
| PostgreSQL tsvector search | Preserves exact names, project titles, tools, and acronyms |
| Merge | Combines dense and lexical candidates |
| Cross-encoder rerank | Scores candidates against the actual question |
| Retrieval gate | Decides whether context is strong enough to answer |

Dense search handles natural language. Keyword search protects proper nouns. Reranking keeps the final context small and relevant.
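The merge and gate stages can be sketched in a few lines. This is a minimal illustration, not Robi's actual code: the source does not say how candidates are merged, so reciprocal rank fusion (RRF) is swapped in here as one common choice, and the function names, the `k=60` constant, and the `0.5` gate threshold are all assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of chunk ids into one ranking.

    RRF is an assumed merge strategy; k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more credit the higher it ranks in each list.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

def passes_gate(rerank_scores, threshold=0.5):
    """Return True when the best reranked candidate clears the threshold.

    The threshold value is hypothetical and would need tuning on real data.
    """
    return bool(rerank_scores) and max(rerank_scores) >= threshold

dense = ["c3", "c1", "c7"]    # hypothetical ids from dense search
lexical = ["c1", "c9"]        # hypothetical ids from keyword search
merged = reciprocal_rank_fusion([dense, lexical])
```

A chunk that appears in both lists, like `c1` here, rises to the top of the merged ranking because it collects credit from each retriever.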

Metrics

| Metric | Result |
| --- | --- |
| Recall@1 | 0.971 |
| Recall@5 | 0.971 |
| MRR | 0.971 |
| Refusal accuracy | 1.000 |
| False refusals | 0.000 |
| Faithfulness | 0.943 |

The golden set has 35 on-topic questions labeled with the source document that should answer them, plus 12 off-topic questions that should be refused.
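For reference, the ranking metrics can be computed as below. This is a sketch under assumptions: the record shape (expected source plus a ranked list of retrieved sources) and the function names are illustrative, and the sample data is made up. With 35 on-topic questions, a score of 0.971 is what 34 out of 35 rounds to, and 0.943 is what 33 out of 35 rounds to.

```python
def recall_at_k(results, k):
    """Fraction of questions whose expected source appears in the top k."""
    hits = sum(1 for expected, ranked in results if expected in ranked[:k])
    return hits / len(results)

def mean_reciprocal_rank(results):
    """Average of 1/rank of the expected source; a miss contributes 0."""
    total = 0.0
    for expected, ranked in results:
        if expected in ranked:
            total += 1.0 / (ranked.index(expected) + 1)
    return total / len(results)

# Toy data: (expected source, retrieved sources in rank order).
results = [
    ("a.md", ["a.md", "b.md"]),   # hit at rank 1
    ("b.md", ["c.md", "b.md"]),   # hit at rank 2
    ("c.md", ["a.md", "b.md"]),   # miss
]
r1 = recall_at_k(results, 1)
r2 = recall_at_k(results, 2)
mrr = mean_reciprocal_rank(results)
```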

What the eval caught

The eval caught two real bugs during development.

Keyword search returned nothing for natural-language questions. Dense retrieval still worked, but the hybrid path was weaker than it looked. Fixing the keyword query path made exact-match recall useful again.
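The source does not say what the underlying bug was. One common cause of this symptom is that PostgreSQL's `plainto_tsquery` joins every term with AND, so a full question only matches chunks that contain all of its words. A hedged sketch of one possible remedy is to build an OR-joined query from the content words instead. The stop list, the function name, and the `chunks` table with its `id` and `tsv` columns are hypothetical.

```python
import re

# Hypothetical, deliberately small stop list for illustration only.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "was", "what", "which", "how",
    "does", "do", "did", "of", "in", "to", "for", "and", "or",
}

def build_or_tsquery(question):
    """Turn a natural-language question into an OR-joined to_tsquery string."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    terms = []
    for tok in tokens:
        # Keep each content word once, preserving question order.
        if tok not in STOP_WORDS and tok not in terms:
            terms.append(tok)
    return " | ".join(terms)

# Assumed schema: a `chunks` table with a precomputed tsvector column `tsv`.
KEYWORD_SQL = """
SELECT id, ts_rank(tsv, to_tsquery('english', %s)) AS score
FROM chunks
WHERE tsv @@ to_tsquery('english', %s)
ORDER BY score DESC
LIMIT 10
"""

query = build_or_tsquery("What database does Robi use for vector search?")
```

Passing `query` twice as the bound parameters of `KEYWORD_SQL` lets any matching term surface a chunk, while `ts_rank` still favors chunks that match more of them.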

One chunk could not be retrieved because the body lacked the product name. The content was correct, but the retriever had no lexical clue that the section belonged to the project being asked about. Adding contextual chunk headers fixed it and lifted Recall@1 from 0.943 to 0.971.
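A contextual chunk header can be as simple as prepending the document title and section name to each chunk before it is embedded and indexed. The sketch below assumes that shape; the function name, the `>` separator, and the sample text are illustrative rather than taken from Robi.

```python
def with_context_header(doc_title, section, body):
    """Prefix a chunk body with the document and section it came from."""
    header = f"{doc_title} > {section}" if section else doc_title
    return f"{header}\n\n{body}"

# The body alone never mentions the product name; the header supplies it.
chunk = with_context_header("Robi", "Monitoring", "Dashboards track latency.")
```

With the header in place, both the keyword index and the embedding see the project name even when the section body never repeats it.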

Faithfulness check

Faithfulness is checked with an LLM-as-judge pass that asks whether the answer's claims are supported by the retrieved context. The judge's verdict is not treated as ground truth. It is a regression signal that helps catch obvious unsupported claims before they ship.
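The shape of such a judge pass might look like the following. Everything here is an assumption made for illustration: the prompt wording, the one-word verdict format, and the function names are not from the source, and `fake_judge` is a stand-in for a real model call so the sketch runs offline.

```python
JUDGE_TEMPLATE = (
    "You are grading a RAG answer.\n"
    "CONTEXT:\n{context}\n\n"
    "ANSWER:\n{answer}\n\n"
    "Reply with exactly one word: SUPPORTED if every claim in the answer "
    "is backed by the context, otherwise UNSUPPORTED."
)

def build_judge_prompt(answer, contexts):
    """Fill the judge template with the answer and its retrieved chunks."""
    return JUDGE_TEMPLATE.format(context="\n---\n".join(contexts), answer=answer)

def parse_verdict(raw):
    """True only when the judge's reply starts with SUPPORTED."""
    return raw.strip().upper().startswith("SUPPORTED")

def faithfulness(samples, judge):
    """Fraction of (answer, contexts) samples the judge marks as supported."""
    supported = sum(
        parse_verdict(judge(build_judge_prompt(answer, contexts)))
        for answer, contexts in samples
    )
    return supported / len(samples)

def fake_judge(prompt):
    """Stand-in for an LLM call; flags one known-bad toy answer."""
    return "UNSUPPORTED" if "moon" in prompt else "SUPPORTED"

samples = [
    ("Robi uses pgvector.", ["Robi stores embeddings in pgvector."]),
    ("Robi runs on the moon.", ["Robi runs in Docker."]),
]
score = faithfulness(samples, fake_judge)
```

Injecting the judge as a callable keeps the scoring logic testable without a model, which suits a metric meant to run as a regression check.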

Why refusal metrics matter

For a public assistant, wrong answers are worse than no answer. Refusal accuracy measures whether Robi declines off-topic questions. False refusal rate checks the opposite failure mode: declining valid questions it should answer. Both matter because a RAG system needs to be useful and restrained at the same time.
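Both refusal metrics fall out of the same labeled records. The sketch below assumes each record is a `(should_refuse, did_refuse)` pair; that shape, the function name, and the toy data are illustrative.

```python
def refusal_metrics(records):
    """Return (refusal_accuracy, false_refusal_rate).

    records: iterable of (should_refuse, did_refuse) boolean pairs.
    """
    off_topic = [did for should, did in records if should]
    on_topic = [did for should, did in records if not should]
    # Share of off-topic questions that were correctly declined.
    refusal_accuracy = sum(off_topic) / len(off_topic)
    # Share of valid questions that were wrongly declined.
    false_refusal_rate = sum(on_topic) / len(on_topic)
    return refusal_accuracy, false_refusal_rate

# Toy data: 3 off-topic questions (2 refused), 4 on-topic (1 refused).
records = [
    (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True), (False, False),
]
accuracy, false_rate = refusal_metrics(records)
```

Tracking the two numbers separately matters because a system that refuses everything would score perfectly on the first and fail the second.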