Challenges

Robi looked small from the outside, but the hard parts were the same ones that show up in larger RAG systems: retrieval quality, refusal behavior, prompt safety, and observability.

1. Making retrieval measurable

Challenge: It is easy to test a chatbot by asking a few questions and deciding it feels good. That does not catch regressions.

Solution: Built a golden set with on-topic questions mapped to source documents and off-topic questions that should be refused. The eval reports Recall@1, Recall@5, MRR, refusal accuracy, false refusals, and faithfulness.

What I learned: RAG systems need tests that look like product behavior, not just unit tests. Retrieval quality is a contract, and the eval is what enforces it.
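
A minimal sketch of how the retrieval metrics in such an eval could be computed, assuming each golden case is a dict with a "question" and a single expected "source" document ID, and that retrieve returns a ranked list of document IDs. The field names and function names are illustrative, not Robi's actual code.

```python
from typing import Callable, Dict, List


def recall_at_k(ranked: List[str], relevant: str, k: int) -> float:
    """1.0 if the expected source document appears in the top k results."""
    return 1.0 if relevant in ranked[:k] else 0.0


def reciprocal_rank(ranked: List[str], relevant: str) -> float:
    """1 / rank of the expected document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id == relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden: List[Dict[str, str]],
             retrieve: Callable[[str], List[str]]) -> Dict[str, float]:
    """Average per-question metrics over the on-topic golden set.

    The "question" and "source" keys are assumed names for this sketch.
    """
    totals = {"recall@1": 0.0, "recall@5": 0.0, "mrr": 0.0}
    if not golden:
        return totals
    for case in golden:
        ranked = retrieve(case["question"])
        totals["recall@1"] += recall_at_k(ranked, case["source"], 1)
        totals["recall@5"] += recall_at_k(ranked, case["source"], 5)
        totals["mrr"] += reciprocal_rank(ranked, case["source"])
    return {name: value / len(golden) for name, value in totals.items()}
```

Running this on every change and failing the build when a metric drops below an agreed floor is one way to turn the golden set into a regression test.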

2. Hybrid retrieval was only half working

Challenge: The keyword retrieval path returned nothing for some natural-language questions. Dense search compensated well enough that the bug was not obvious during casual testing.

Solution: The eval exposed the gap. I fixed the keyword path so dense and lexical retrieval both contribute candidates before reranking.

What I learned: Hybrid search has to be evaluated as a whole pipeline and as separate components. Otherwise one side can silently stop helping.
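
One way to make a silent component visible is to record which retriever produced each candidate and to track how often each side returns anything at all. The sketch below assumes both retrievers return lists of chunk IDs. The function names and the "dense" / "lexical" keys are hypothetical.

```python
from typing import Dict, List, Set


def merge_candidates(dense: List[str], lexical: List[str]) -> Dict[str, Set[str]]:
    """Union the two candidate lists, remembering which retriever found each chunk."""
    merged: Dict[str, Set[str]] = {}
    for source, chunk_ids in (("dense", dense), ("lexical", lexical)):
        for chunk_id in chunk_ids:
            merged.setdefault(chunk_id, set()).add(source)
    return merged


def contribution_rate(runs: List[Dict[str, List[str]]], source: str) -> float:
    """Fraction of queries where the given retriever returned at least one candidate."""
    if not runs:
        return 0.0
    return sum(1 for run in runs if run[source]) / len(runs)
```

A lexical contribution rate near zero across the golden set would have flagged the broken keyword path even while the end-to-end answers still looked fine.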

3. Context-free chunks were hard to retrieve

Challenge: One section contained the right answer but did not include the product name in the body. The retriever could not reliably connect it to questions about that product.

Solution: Prefix each chunk with its document title and heading before embedding and indexing.

What I learned: Chunk text needs context. A human sees the file and heading around a paragraph; the retriever only sees the chunk.
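
The prefixing step can be sketched as a small helper applied to each chunk before embedding and indexing. The separator and layout here are assumptions, and the example title and heading are invented for illustration.

```python
def contextualize_chunk(title: str, heading: str, body: str) -> str:
    """Prepend the document title and heading so the chunk carries its own context."""
    parts = [part.strip() for part in (title, heading) if part and part.strip()]
    prefix = " > ".join(parts)
    text = body.strip()
    return f"{prefix}\n\n{text}" if prefix else text


# Hypothetical example: the body never names the product, but the prefix does.
chunk = contextualize_chunk(
    "Acme Router Guide",
    "Resetting the device",
    "Hold the button for ten seconds.",
)
```

The same prefixed text should go to both the embedding model and the keyword index so the two retrieval paths see identical chunks.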

4. Refusal threshold tuning

Challenge: The refusal gate depends on a retrieval-score threshold. Set too low, it lets weakly supported answers through and increases hallucination risk. Set too high, it makes the assistant refuse valid questions.

Solution: Tuned the retrieval gate against both on-topic and off-topic questions, tracking refusal accuracy and false refusals together.

What I learned: Refusal is part of the product. It should be tuned and measured like any other behavior.
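
A threshold sweep over both question sets might look like the sketch below. It assumes the gate refuses when the top retrieval score falls below the threshold, and it defines refusal accuracy as the share of all questions where the answer-or-refuse decision was correct. Both are assumptions for illustration, as are the function names.

```python
from typing import List, Tuple


def gate_metrics(on_topic: List[float], off_topic: List[float],
                 threshold: float) -> Tuple[float, float]:
    """Return (refusal_accuracy, false_refusal_rate) for one threshold.

    Inputs are top retrieval scores per question; a score below the
    threshold means the question is refused (assumed gate behavior).
    """
    false_refusals = sum(1 for score in on_topic if score < threshold)
    correct_refusals = sum(1 for score in off_topic if score < threshold)
    correct = (len(on_topic) - false_refusals) + correct_refusals
    total = len(on_topic) + len(off_topic)
    accuracy = correct / total if total else 0.0
    false_rate = false_refusals / len(on_topic) if on_topic else 0.0
    return accuracy, false_rate


def sweep(on_topic: List[float], off_topic: List[float],
          thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """Evaluate every candidate threshold so both metrics can be compared side by side."""
    return [(t, *gate_metrics(on_topic, off_topic, t)) for t in thresholds]
```

Looking at both numbers per threshold makes the trade-off explicit: two thresholds can have the same accuracy while one of them refuses noticeably more valid questions.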

5. Public deployment needs observability

Challenge: Once Robi is live, failures are not just local exceptions. They can be provider latency, retrieval errors, Redis issues, threshold drift, or unexpected traffic.

Solution: Added Prometheus metrics and Grafana dashboards covering latency percentiles, per-stage timing, outcomes, refusal rate, cost, and component errors.

What I learned: Monitoring is not extra polish. For an AI service, it is how you know whether the system is still behaving.
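
To show the idea behind per-stage timing and component error counts, here is a standard-library stand-in. It is not the Prometheus instrumentation itself; in a real deployment these values would be exported through a metrics client instead of kept in memory. The class and stage names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles
from typing import Dict, Iterator, List


class StageTimer:
    """In-memory per-stage durations and error counts (illustrative only)."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)
        self.errors: Dict[str, int] = defaultdict(int)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Time the wrapped block and count any exception against the stage."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors[name] += 1
            raise
        finally:
            self.samples[name].append(time.perf_counter() - start)

    def percentile(self, name: str, pct: int) -> float:
        """Percentile (1 to 99) of the recorded durations for a stage, in seconds."""
        data = self.samples.get(name, [])
        if len(data) < 2:
            return data[0] if data else 0.0
        return quantiles(data, n=100, method="inclusive")[pct - 1]
```

Wrapping retrieval, reranking, and generation each in their own `with timer.stage(...)` block is what lets a slow provider be told apart from a slow index.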