Monitoring

Robi is monitored because a public AI system needs more than a green deploy. Monitoring tracks whether the system is fast, whether it is refusing too often, whether errors are confined to one component, and how much generation costs.

Metrics collected

Prometheus scrapes the backend every 15 seconds. The dashboard tracks:

Request volume
Request outcome
p50, p95, and p99 latency
Per-stage timing
Refusal rate
Provider errors
Retrieval errors
Cumulative cost
Errors by component
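Prometheus collects these series by scraping an HTTP endpoint that returns plain-text samples. The sketch below shows what the backend's output could look like for labelled counters. It is a stdlib-only illustration: in a real service the official `prometheus_client` library would normally do this. The metric names `robi_requests_total`, `robi_errors_total` and `robi_generation_cost_usd_total` are hypothetical.

```python
from collections import defaultdict


def _escape(value: str) -> str:
    """Escape a label value as the Prometheus text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class Counter:
    """A labelled, monotonically increasing counter (illustrative, not prometheus_client)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        # Maps a sorted tuple of (label, value) pairs to its running total.
        self._values: dict[tuple[tuple[str, str], ...], float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self) -> str:
        """Render HELP/TYPE lines plus one sample line per label set."""
        lines = [f"# HELP {self.name} {self.help_text}", f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            labels = ",".join(f'{k}="{_escape(v)}"' for k, v in key)
            lines.append(f"{self.name}{{{labels}}} {value}" if labels else f"{self.name} {value}")
        return "\n".join(lines) + "\n"


# Hypothetical metric names for three of the dashboard series listed above.
requests_total = Counter("robi_requests_total", "Requests by outcome.")
errors_total = Counter("robi_errors_total", "Errors by component.")
cost_total = Counter("robi_generation_cost_usd_total", "Cumulative generation cost in USD.")

if __name__ == "__main__":
    requests_total.inc(outcome="answered")
    requests_total.inc(outcome="answered")
    requests_total.inc(outcome="refused")
    print(requests_total.render(), end="")
```

Running it prints the HELP and TYPE lines followed by `robi_requests_total{outcome="answered"} 2.0` and `robi_requests_total{outcome="refused"} 1.0`. Counters only ever increase; rates, such as requests per second, are derived at query time.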

Grafana as code

The Grafana datasource, dashboard, and alert rules are provisioned as code, so rebuilding the stack recreates the same monitoring surface instead of relying on manual dashboard setup.
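One way to keep a dashboard in code is to generate its JSON model and commit the output to the directory that a file-based dashboard provider watches. The following is a minimal sketch under that assumption. The datasource uid `prometheus`, the dashboard uid, and all metric names are hypothetical, and only a small subset of Grafana's dashboard fields is set.

```python
import json


def panel(panel_id: int, title: str, expr: str, x: int, y: int) -> dict:
    """One time-series panel that runs a single PromQL query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        # Assumed datasource uid; it must match the provisioned Prometheus datasource.
        "datasource": {"type": "prometheus", "uid": "prometheus"},
        "targets": [{"refId": "A", "expr": expr}],
        "gridPos": {"x": x, "y": y, "w": 12, "h": 8},
    }


# Hypothetical metric names; the queries mirror series from the dashboard list.
QUERIES = [
    ("Request volume", "sum(rate(robi_requests_total[5m]))"),
    (
        "Refusal rate",
        'sum(rate(robi_requests_total{outcome="refused"}[5m]))'
        " / sum(rate(robi_requests_total[5m]))",
    ),
    (
        "p95 latency",
        "histogram_quantile(0.95, sum by (le) (rate(robi_request_duration_seconds_bucket[5m])))",
    ),
    ("Errors by component", "sum by (component) (rate(robi_errors_total[5m]))"),
    ("Cumulative cost (USD)", "sum(robi_generation_cost_usd_total)"),
]


def build_dashboard() -> dict:
    """Lay the panels out two per row on Grafana's 24-column grid."""
    panels = [
        panel(i + 1, title, expr, x=(i % 2) * 12, y=(i // 2) * 8)
        for i, (title, expr) in enumerate(QUERIES)
    ]
    return {"uid": "robi-monitoring", "title": "Robi monitoring", "panels": panels}


if __name__ == "__main__":
    # Commit this output where the file-based dashboard provider looks for JSON.
    print(json.dumps(build_dashboard(), indent=2))
```

Generating the JSON rather than hand-editing it keeps panel layout and queries reviewable in diffs, and a fixed `uid` lets a rebuild overwrite the same dashboard instead of creating a duplicate.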

What I watch

Latency: A RAG request passes through several stages, so total latency alone does not show where the time goes. Per-stage timing shows whether embedding, retrieval, reranking, generation, or provider response time is the bottleneck.
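A sketch of per-stage timing, assuming each stage can be wrapped in a context manager. The stage labels are hypothetical. Percentiles here are exact nearest-rank values over in-memory samples. With Prometheus, p50/p95/p99 would usually be estimated from histogram buckets with `histogram_quantile`, which is an approximation.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

# Per-stage samples in seconds, keyed by a stage name such as "retrieval".
durations: dict[str, list[float]] = defaultdict(list)


@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of one pipeline stage, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[stage].append(time.perf_counter() - start)


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def bottleneck(samples: dict[str, list[float]], p: float = 95) -> str:
    """Name the stage whose p-th percentile duration is highest."""
    return max(samples, key=lambda stage: percentile(samples[stage], p))


if __name__ == "__main__":
    for _ in range(20):
        with timed("retrieval"):
            time.sleep(0.001)
        with timed("generation"):
            time.sleep(0.005)
    for stage, values in durations.items():
        print(stage, {p: round(percentile(values, p), 4) for p in (50, 95, 99)})
    print("bottleneck:", bottleneck(durations))
```

Ranking stages by a tail percentile rather than the mean keeps one slow stage from hiding behind many fast requests.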

Refusal rate: A sudden jump can mean the retrieval threshold is too strict, the corpus changed, or the retriever is failing. A sudden drop can be worse because it may mean Robi is answering when it should refuse.
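Because both directions matter, a refusal-rate check should flag drops as well as spikes. The sketch below compares a current window against a baseline window. The 2x `factor` and the 50-request minimum are hypothetical tuning knobs, not Robi's actual alert thresholds. In production, the same logic would more likely live in a Grafana alert rule over a PromQL ratio.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and refusal counts observed over one time window."""
    requests: int
    refusals: int

    @property
    def refusal_rate(self) -> float:
        return self.refusals / self.requests if self.requests else 0.0


def check_refusal_rate(
    baseline: Window, current: Window, factor: float = 2.0, min_requests: int = 50
) -> str:
    """Classify the current refusal rate against a baseline window.

    `factor` and `min_requests` are hypothetical tuning knobs, not Robi's
    actual alert thresholds.
    """
    if baseline.requests < min_requests or current.requests < min_requests:
        return "insufficient_data"
    base, now = baseline.refusal_rate, current.refusal_rate
    if now > base * factor:
        # Threshold too strict, corpus changed, or retriever failing.
        return "spike"
    if now < base / factor:
        # Possibly answering questions that should be refused.
        return "drop"
    return "ok"
```

For example, with a baseline of 100 refusals in 1,000 requests (10%), a window of 50 refusals in 200 requests (25%) is a spike, and 5 in 200 (2.5%) is a drop.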

Cost: Generation runs on an external provider, so cost is an operational metric. Tracking cumulative cost makes unexpected traffic visible.
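A minimal cost tracker, assuming the provider bills per input and output token. The prices below are hypothetical placeholders, not any provider's real rates. The budget check takes the difference between two readings of the cumulative total, which is what PromQL's `increase()` would compute over a cost counter.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; real prices depend on the provider and model.
PRICE_PER_MTOK_INPUT = 0.50
PRICE_PER_MTOK_OUTPUT = 1.50


@dataclass
class CostTracker:
    """Accumulates generation cost, like a Prometheus counter that only goes up."""
    total_usd: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add the cost of one generation call and return that call's cost."""
        cost = (
            input_tokens * PRICE_PER_MTOK_INPUT + output_tokens * PRICE_PER_MTOK_OUTPUT
        ) / 1_000_000
        self.total_usd += cost
        return cost


def spend_exceeds_budget(
    total_then: float, total_now: float, hours: float, budget_per_hour: float
) -> bool:
    """Compare spend between two readings of the cumulative total to an hourly budget."""
    return (total_now - total_then) / hours > budget_per_hour
```

With these placeholder prices, a call with 1,000 input tokens and 500 output tokens costs (1,000 × 0.50 + 500 × 1.50) / 1,000,000 = 0.00125 USD.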

Component errors: Splitting errors by component makes failures actionable. A provider outage, a Redis issue, a retrieval failure, and an app exception should not all look the same.
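A sketch of labelling errors by component at the request boundary. The exception classes are hypothetical stand-ins for whatever the real provider, retrieval, and Redis clients raise. Anything unmatched is counted as an application exception, so an unknown failure mode still shows up instead of disappearing.

```python
from collections import Counter


# Hypothetical exception types standing in for each client library's real errors.
class ProviderError(Exception): ...
class RetrievalError(Exception): ...
class CacheError(Exception): ...  # e.g. a Redis connection failure


# Checked in order; the first matching type decides the component label.
COMPONENT_BY_EXCEPTION: tuple[tuple[type[BaseException], str], ...] = (
    (ProviderError, "provider"),
    (RetrievalError, "retrieval"),
    (CacheError, "redis"),
)

errors_by_component: Counter[str] = Counter()


def component_for(exc: BaseException) -> str:
    for exc_type, component in COMPONENT_BY_EXCEPTION:
        if isinstance(exc, exc_type):
            return component
    return "app"  # anything unclassified counts as an application exception


def record_error(exc: BaseException) -> str:
    """Count the error under its component and return the label used."""
    component = component_for(exc)
    errors_by_component[component] += 1
    return component


def handle(request_fn):
    """Run one request, labelling any failure before re-raising it."""
    try:
        return request_fn()
    except Exception as exc:
        record_error(exc)
        raise
```

Re-raising after recording keeps the existing error handling unchanged; the only addition is a component label on each failure.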
