Robi: production RAG assistant
Retrieval-augmented chatbot with hybrid search, guardrails, eval, and live monitoring.
Challenges
Robi looked small from the outside, but the hard parts were the same ones that show up in larger RAG systems: retrieval quality, refusal behavior, prompt safety, and observability.
1. Making retrieval measurable
Challenge: It is easy to test a chatbot by asking a few questions and deciding it feels good. That does not catch regressions.
Solution: Built a golden set of on-topic questions, each mapped to its source document, plus off-topic questions that should be refused. The eval reports Recall@1, Recall@5, MRR, refusal accuracy, false refusals, and faithfulness.
What I learned: RAG systems need tests that look like product behavior, not just unit tests. Retrieval quality is a contract.
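The retrieval half of such an eval can be sketched in a few lines. This is a minimal illustration, not Robi's actual harness: the golden-set shape (`question` / `source` keys), the `fake_retrieve` function, and the document names are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Sequence


def reciprocal_rank(ranked: Sequence[str], expected: str) -> float:
    """Return 1/rank of the expected document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id == expected:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    golden: List[Dict[str, str]],
    retrieve: Callable[[str], List[str]],
    ks: Sequence[int] = (1, 5),
) -> Dict[str, float]:
    """Average Recall@k and MRR over a golden set with one source per question."""
    totals = {f"recall@{k}": 0.0 for k in ks}
    totals["mrr"] = 0.0
    for case in golden:
        ranked = retrieve(case["question"])
        for k in ks:
            if case["source"] in ranked[:k]:
                totals[f"recall@{k}"] += 1.0
        totals["mrr"] += reciprocal_rank(ranked, case["source"])
    return {name: total / len(golden) for name, total in totals.items()}


# Hypothetical golden set and a canned retriever standing in for the real one.
GOLDEN = [
    {"question": "How do I reset my password?", "source": "account.md"},
    {"question": "What plans are available?", "source": "pricing.md"},
]
CANNED = {
    "How do I reset my password?": ["account.md", "faq.md"],
    "What plans are available?": ["faq.md", "pricing.md"],
}


def fake_retrieve(question: str) -> List[str]:
    return CANNED[question]


report = evaluate_retrieval(GOLDEN, fake_retrieve)
print(report)  # {'recall@1': 0.5, 'recall@5': 1.0, 'mrr': 0.75}
```

Running the same function in CI against a fixed golden set is one way to turn "it feels good" into a number that can fail a build.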
2. Hybrid retrieval was only half working
Challenge: The keyword retrieval path returned nothing for some natural-language questions. Dense search compensated well enough that the bug went unnoticed during casual testing.
Solution: The eval exposed the gap. I fixed the keyword path so dense and lexical retrieval both contribute candidates before reranking.
What I learned: Hybrid search has to be evaluated as a whole pipeline and as separate components. Otherwise one side can silently stop helping.
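One way to check both the whole pipeline and each side separately is sketched below. The write-up does not say how Robi merges candidates, so reciprocal rank fusion (RRF) is used here purely as an assumed example, and `component_coverage` is a hypothetical helper for spotting a retriever that silently returns nothing.

```python
from typing import Dict, List


def rrf_merge(dense: List[str], lexical: List[str], k: int = 60) -> List[str]:
    """Merge two rankings with reciprocal rank fusion (assumed merge strategy)."""
    scores: Dict[str, float] = {}
    for ranking in (dense, lexical):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def component_coverage(results_per_query: List[List[str]]) -> float:
    """Fraction of queries for which one retriever returned any candidate."""
    if not results_per_query:
        return 0.0
    non_empty = sum(1 for results in results_per_query if results)
    return non_empty / len(results_per_query)


# Hypothetical candidate ids from each retriever for a single query.
merged = rrf_merge(dense=["a", "b", "c"], lexical=["b", "d"])
print(merged)  # ['b', 'a', 'd', 'c']

# A lexical path that returns nothing for half the queries shows up here,
# even when the merged results still look reasonable.
lexical_coverage = component_coverage([["b", "d"], []])
print(lexical_coverage)  # 0.5
```

Tracking coverage per component alongside the end-to-end metrics makes a dead retrieval path visible instead of letting the other side mask it.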
3. Context-free chunks were hard to retrieve
Challenge: One section contained the right answer but did not include the product name in the body. The retriever could not reliably connect it to questions about that product.
Solution: Prefix each chunk with its document title and heading before embedding and indexing.
What I learned: Chunk text needs context. A human sees the file and heading around a paragraph; the retriever only sees the chunk.
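The prefixing step might look like the sketch below. The `Chunk` fields, the `" > "` separator, and the sample text are assumptions for illustration; the write-up only states that the title and heading are prepended before embedding and indexing.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str
    heading: str
    body: str


def contextualize(chunk: Chunk) -> str:
    """Prepend document title and heading so the indexed text carries its context."""
    parts = [p.strip() for p in (chunk.doc_title, chunk.heading) if p and p.strip()]
    prefix = " > ".join(parts)
    body = chunk.body.strip()
    return f"{prefix}\n\n{body}" if prefix else body


# Hypothetical chunk whose body never mentions the product name.
chunk = Chunk(
    doc_title="Acme Widget Guide",
    heading="Warranty",
    body="Coverage lasts two years from purchase.",
)
text = contextualize(chunk)
print(text)
# Acme Widget Guide > Warranty
#
# Coverage lasts two years from purchase.
```

The same contextualized string should go to both the embedding model and the keyword index, so both retrievers see the product name.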
4. Refusal threshold tuning
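Challenge: The retrieval gate refuses a question when its retrieval score falls below a threshold. Set the threshold too low and weak matches get through, which raises hallucination risk. Set it too high and the assistant refuses valid questions.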
Solution: Tuned the retrieval gate against both on-topic and off-topic questions, tracking refusal accuracy and false refusals together.
What I learned: Refusal is part of the product. It should be tuned and measured like any other behavior.
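A threshold sweep over labeled questions can be sketched as follows. The scores and labels are made up, and the metric definitions are assumptions: here refusal accuracy is the share of questions where the answer-or-refuse decision matches the label, and the false refusal rate is the share of on-topic questions that were refused.

```python
from typing import Dict, List, Tuple

# (top retrieval score, is the question on-topic?)
Sample = Tuple[float, bool]


def gate_metrics(samples: List[Sample], threshold: float) -> Dict[str, float]:
    """Score a gate that refuses whenever the top retrieval score is below threshold."""
    correct = 0
    false_refusals = 0
    on_topic_total = 0
    for score, is_on_topic in samples:
        refused = score < threshold
        if is_on_topic:
            on_topic_total += 1
            if refused:
                false_refusals += 1
            else:
                correct += 1
        elif refused:
            correct += 1
    return {
        "refusal_accuracy": correct / len(samples),
        "false_refusal_rate": false_refusals / on_topic_total if on_topic_total else 0.0,
    }


# Hypothetical labeled scores: three on-topic questions, two off-topic.
SAMPLES: List[Sample] = [
    (0.82, True),
    (0.64, True),
    (0.41, True),
    (0.38, False),
    (0.22, False),
]

for threshold in (0.2, 0.4, 0.6):
    print(threshold, gate_metrics(SAMPLES, threshold))
```

Reading both numbers side by side is the point: on this toy data the lowest threshold answers every off-topic question, while the highest starts refusing a valid one.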
5. Public deployment needs observability
Challenge: Once Robi is live, failures are not just local exceptions. They can be provider latency, retrieval errors, Redis issues, threshold drift, or unexpected traffic.
Solution: Added Prometheus and Grafana with latency percentiles, per-stage timing, outcomes, refusal rate, cost, and component errors.
What I learned: Monitoring is not extra polish. For an AI service, it is how you know whether the system is still behaving.
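Per-stage timing and component error counts can be illustrated with the standard library alone. This is an in-process stand-in, not Robi's Prometheus setup: the `StageMetrics` class, the stage names, and the nearest-rank percentile are all hypothetical, and in a real deployment these values would be exported as Prometheus histograms and counters instead.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Callable, DefaultDict, Iterator, List


class StageMetrics:
    """Collects per-stage durations and error counts in memory."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.durations: DefaultDict[str, List[float]] = defaultdict(list)
        self.errors: DefaultDict[str, int] = defaultdict(int)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Time one pipeline stage; count and re-raise any exception."""
        start = self._clock()
        try:
            yield
        except Exception:
            self.errors[name] += 1
            raise
        finally:
            self.durations[name].append(self._clock() - start)

    def percentile(self, name: str, pct: float) -> float:
        """Nearest-rank percentile of recorded durations for a stage."""
        values = sorted(self.durations.get(name, []))
        if not values:
            raise ValueError(f"no samples recorded for stage {name!r}")
        rank = max(1, math.ceil(pct / 100 * len(values)))
        return values[rank - 1]


metrics = StageMetrics()

with metrics.stage("retrieval"):
    time.sleep(0.01)  # stand-in for a vector store query

try:
    with metrics.stage("generation"):
        raise TimeoutError("simulated provider timeout")
except TimeoutError:
    pass

print("retrieval p95 (s):", metrics.percentile("retrieval", 95))
print("generation errors:", metrics.errors["generation"])
```

Wrapping each stage separately is what lets a dashboard distinguish a slow provider from a slow retriever when overall latency rises.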