Robi: production RAG assistant
Retrieval-augmented chatbot with hybrid search, guardrails, eval, and live monitoring.
Monitoring
Robi has monitoring because public AI systems need more than a green deploy. The system tracks whether it is fast, whether it is refusing too much, whether errors are isolated to one component, and how much generation is costing.
Metrics collected
Prometheus scrapes the backend every 15 seconds. The dashboard tracks:
- Request volume
- Request outcome
- p50, p95, and p99 latency
- Per-stage timing
- Refusal rate
- Provider errors
- Retrieval errors
- Cumulative cost
- Errors by component
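To make the percentile entries in the list above concrete, here is a minimal sketch of what p50, p95, and p99 mean on raw samples. It is not Robi's implementation: the page does not say how the percentiles are computed, and Prometheus setups usually estimate them from histogram buckets rather than from raw values. The `percentile` helper and the sample latencies are made up for illustration.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative request latencies in milliseconds (not real measurements).
latencies_ms = [120, 135, 140, 150, 160, 180, 210, 250, 400, 1900]
summary = {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}
```

With these samples the median stays low while p95 and p99 both land on the single slow request, which is why tail percentiles are tracked alongside p50.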
Grafana as code
The Grafana datasource, dashboard, and alert rules are provisioned as code, so rebuilding the stack recreates the same monitoring surface instead of relying on manual dashboard setup.
What I watch
Latency: RAG has multiple stages, so total latency alone is not enough. Per-stage timing shows whether embedding, retrieval, reranking, generation, or provider response time is the bottleneck.
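Per-stage timing of this kind can be sketched with a small context manager. This is an assumption about the shape of the instrumentation, not Robi's actual code: `StageTimer`, its `stage` method, and the stage names are hypothetical, and a real backend would export the durations to Prometheus instead of keeping them in a dict.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timer can be tested deterministically
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        """Name of the stage with the largest accumulated duration."""
        return max(self.durations, key=self.durations.get)

timer = StageTimer()
with timer.stage("retrieval"):
    pass  # hypothetical: query the retriever here
with timer.stage("generation"):
    pass  # hypothetical: call the generation provider here
```

Recording in the `finally` branch means a stage that fails still contributes its timing, so slow failures remain visible.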
Refusal rate: A sudden jump can mean the retrieval threshold is too strict, the corpus changed, or the retriever is failing. A sudden drop can be worse because it may mean Robi is answering when it should refuse.
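Watching for movement in both directions can be sketched as a sliding-window check. The class name, window size, and the 5% and 40% bounds below are illustrative assumptions; the page does not state Robi's actual thresholds or alert rules.

```python
from collections import deque

class RefusalMonitor:
    """Tracks the refusal rate over the most recent `window` requests."""

    def __init__(self, window=100, low=0.05, high=0.40):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically
        self.low = low    # assumed lower bound: below this, refusals look suspiciously rare
        self.high = high  # assumed upper bound: above this, refusals look excessive

    def record(self, refused):
        self.outcomes.append(bool(refused))

    def rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def status(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return "warming_up"  # not enough data to judge yet
        current = self.rate()
        if current > self.high:
            return "too_many_refusals"
        if current < self.low:
            return "too_few_refusals"
        return "ok"
```

The lower bound is the less common check, and it is the one that would catch the case where the assistant starts answering questions it should refuse.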
Cost: Generation calls an external provider, so cost is an operational metric. Tracking cumulative cost makes unexpected traffic visible.
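A cumulative cost counter might look roughly like the sketch below. The per-token prices, the budget, and the `CostTracker` name are all hypothetical; the page does not say which provider Robi uses or how it prices requests.

```python
# Hypothetical prices in USD per million tokens; real values depend on the provider.
PRICE_PER_MILLION = {"input": 0.50, "output": 1.50}

class CostTracker:
    """Accumulates estimated generation cost and flags when a budget is exceeded."""

    def __init__(self, budget_usd):
        self.budget_usd = budget_usd
        self.total_usd = 0.0

    def record(self, input_tokens, output_tokens):
        """Add one request's estimated cost to the running total and return it."""
        cost = (
            input_tokens * PRICE_PER_MILLION["input"]
            + output_tokens * PRICE_PER_MILLION["output"]
        ) / 1_000_000
        self.total_usd += cost
        return cost

    def over_budget(self):
        return self.total_usd > self.budget_usd

tracker = CostTracker(budget_usd=1.00)
request_cost = tracker.record(input_tokens=2000, output_tokens=500)
```

Because the total only ever grows, a sudden change in its slope is the signal to look for: it points at a traffic spike or unusually long generations.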
Component errors: Splitting errors by component makes failures actionable. A provider outage, Redis issue, retrieval failure, and app exception should not all look the same.
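One way to split errors by component is to map exception types to a component label before counting. The exception classes and labels below are invented for this sketch and are not taken from Robi's codebase.

```python
from collections import Counter

# Hypothetical exception types a RAG backend might raise.
class ProviderError(Exception): pass
class RetrievalError(Exception): pass
class CacheError(Exception): pass

# Checked in order; anything unmatched is attributed to the app itself.
COMPONENT_BY_EXCEPTION = [
    (ProviderError, "provider"),
    (RetrievalError, "retrieval"),
    (CacheError, "redis"),
]

errors_by_component = Counter()

def component_for(exc):
    """Return the component label for an exception instance."""
    for exc_type, component in COMPONENT_BY_EXCEPTION:
        if isinstance(exc, exc_type):
            return component
    return "app"

def record_error(exc):
    """Count the error under its component and return the label used."""
    component = component_for(exc)
    errors_by_component[component] += 1
    return component
```

Keeping a catch-all `app` label means an unexpected exception is still counted rather than silently dropped.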