Robi: production RAG assistant
Retrieval-augmented chatbot with hybrid search, guardrails, eval, and live monitoring.
Retrieval and Eval
Robi is measured because RAG quality is easy to overestimate by feel. A few good demo questions can hide weak retrieval, false confidence, and missing refusal behavior.
Retrieval strategy
Robi uses hybrid retrieval:
| Stage | Purpose |
|---|---|
| pgvector dense search | Finds semantically similar chunks even when wording differs |
| PostgreSQL tsvector search | Preserves exact names, project titles, tools, and acronyms |
| Merge | Combines dense and lexical candidates |
| Cross-encoder rerank | Scores candidates against the actual question |
| Retrieval gate | Decides whether context is strong enough to answer |
Dense search handles natural language. Keyword search protects proper nouns. Reranking keeps the final context small and relevant.
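The stages in the table can be sketched as a small pipeline. This is a minimal illustration, not Robi's actual code: the merge is assumed to be a de-duplicating union, `toy_score` is a word-overlap stand-in for the cross-encoder, and `top_k` and `min_score` are hypothetical values.

```python
from typing import Callable, Sequence


def merge_candidates(dense: Sequence[str], lexical: Sequence[str]) -> list[str]:
    """Union two ranked candidate lists, keeping first-seen order, dropping duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for chunk in [*dense, *lexical]:
        if chunk not in seen:
            seen.add(chunk)
            merged.append(chunk)
    return merged


def toy_score(question: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: share of question words found in the chunk."""
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


def rerank_and_gate(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float] = toy_score,
    top_k: int = 3,          # hypothetical context size
    min_score: float = 0.5,  # hypothetical gate threshold
) -> list[str]:
    """Return the best top_k chunks, or [] when the gate says context is too weak."""
    scored = sorted(
        ((score(question, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored or scored[0][0] < min_score:
        return []  # caller should refuse instead of answering
    return [chunk for _, chunk in scored[:top_k]]
```

An empty result is the signal to refuse, which keeps the decision to answer separate from the generation step.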
Metrics
| Metric | Result |
|---|---|
| Recall@1 | 0.971 |
| Recall@5 | 0.971 |
| MRR | 0.971 |
| Refusal accuracy | 1.000 |
| False refusal rate | 0.000 |
| Faithfulness | 0.943 |
The golden set has 35 on-topic questions labeled with the source document that should answer them, plus 12 off-topic questions that should be refused.
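For reference, the retrieval metrics can be computed as below. This is a sketch using the standard definitions of Recall@k and MRR, with one labeled source document per question as described above; the function and argument names are illustrative. With 35 on-topic questions, 0.971 is what 34 hits out of 35 rounds to.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of questions whose labeled document appears in the top k results."""
    hits = sum(1 for docs, g in zip(ranked, gold) if g in list(docs)[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Average of 1/rank of the labeled document; a miss contributes 0."""
    total = 0.0
    for docs, g in zip(ranked, gold):
        docs = list(docs)
        if g in docs:
            total += 1.0 / (docs.index(g) + 1)
    return total / len(gold)
```

When every hit lands at rank 1 and the remaining question misses entirely, Recall@1, Recall@5, and MRR all come out equal, which matches the pattern in the table.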
What the eval caught
The eval caught two real bugs during development.
Keyword search returned nothing for natural-language questions. Dense retrieval still worked, but the hybrid path was weaker than it looked. Fixing the keyword query path made exact-match recall useful again.
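The write-up does not say what the fix was. One common cause of this symptom in PostgreSQL is that `plainto_tsquery` joins every term with AND, so a full-sentence question matches only chunks containing all of its words. The sketch below shows one possible remedy under that assumption: build an OR-joined query string for `to_tsquery`. The stopword list, the `chunks` table, and the `tsv` column are hypothetical.

```python
import re

# Small illustrative stopword list (assumption). PostgreSQL's text search
# configuration also removes stop words on its own.
STOPWORDS = {
    "a", "an", "the", "what", "which", "how", "does", "do", "did",
    "is", "are", "of", "for", "in", "to", "with",
}


def or_tsquery(question: str) -> str:
    """Turn a free-form question into an OR-joined to_tsquery string."""
    terms = re.findall(r"[a-z0-9]+", question.lower())
    keep: list[str] = []
    for term in terms:
        if term not in STOPWORDS and term not in keep:
            keep.append(term)
    return " | ".join(keep)


# Hypothetical query: pass or_tsquery(question) as the single parameter.
LEXICAL_SQL = """
SELECT id, ts_rank(tsv, query) AS rank
FROM chunks, to_tsquery('english', %s) AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 20;
"""
```

If `or_tsquery` returns an empty string, the caller should skip the lexical search rather than send an empty query.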
One chunk could not be retrieved because the body lacked the product name. The content was correct, but the retriever had no lexical clue that the section belonged to the project being asked about. Adding contextual chunk headers fixed it and lifted Recall@1 from 0.943 to 0.971.
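A contextual chunk header can be as simple as prefixing each chunk with its document and section titles before it is embedded and indexed. The sketch below assumes that form; the function name, the `>` separator, and the exact layout are illustrative, not taken from Robi.

```python
def with_context_header(doc_title: str, section: str, body: str) -> str:
    """Prefix a chunk with its document and section titles before indexing."""
    header = f"{doc_title} > {section}" if section else doc_title
    return f"{header}\n\n{body.strip()}"
```

Because the header text is indexed along with the body, a query that names the project can match the chunk even when the body never mentions it.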
Faithfulness check
Faithfulness is checked with an LLM-as-judge pass that asks whether the answer's claims are supported by retrieved context. It is not treated as perfect truth. It is a regression signal that helps catch obvious unsupported claims before they ship.
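A judge pass of this kind might look like the sketch below. The prompt wording, the YES/NO protocol, and the `judge` callable (any function that sends a prompt to a model and returns its text) are assumptions for illustration; the actual prompt and model are not described here.

```python
from typing import Callable, Sequence

JUDGE_PROMPT = (
    "Context:\n{context}\n\n"
    "Answer:\n{answer}\n\n"
    "Is every claim in the answer supported by the context? Reply YES or NO."
)


def faithfulness(
    samples: Sequence[tuple[str, str]],  # (answer, retrieved context) pairs
    judge: Callable[[str], str],         # hypothetical LLM call: prompt -> reply text
) -> float:
    """Fraction of answers the judge marks as fully supported by their context."""
    supported = 0
    for answer, context in samples:
        verdict = judge(JUDGE_PROMPT.format(context=context, answer=answer))
        if verdict.strip().upper().startswith("YES"):
            supported += 1
    return supported / len(samples)
```

Passing the judge in as a callable makes it easy to swap models or stub it out in tests, which suits its role as a regression signal.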
Why refusal metrics matter
For a public assistant, wrong answers are worse than no answer. Refusal accuracy measures whether Robi declines off-topic questions. False refusal rate checks the opposite failure mode: declining valid questions it should answer. Both matter because a RAG system needs to be useful and restrained at the same time.
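Both rates reduce to simple counts over the two halves of the golden set. The sketch below assumes each question yields a boolean "did the assistant refuse?" flag; the function and key names are illustrative.

```python
from typing import Sequence


def refusal_metrics(
    off_topic_refused: Sequence[bool],  # one flag per off-topic question
    on_topic_refused: Sequence[bool],   # one flag per on-topic question
) -> dict[str, float]:
    """Refusal accuracy over off-topic questions, false refusal rate over on-topic ones."""
    return {
        "refusal_accuracy": sum(off_topic_refused) / len(off_topic_refused),
        "false_refusal_rate": sum(on_topic_refused) / len(on_topic_refused),
    }
```

Refusing all 12 off-topic questions and none of the 35 on-topic ones gives 1.0 and 0.0, the values reported in the metrics table.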