Robi: production RAG assistant
Retrieval-augmented chatbot with hybrid search, guardrails, eval, and live monitoring.
Robi is a retrieval-augmented chatbot that answers questions about me and my work, grounded in a curated corpus with source citations and layered guardrails. It is live on the About page of this site. The backend runs on my own VPS, fronted by Caddy with automatic HTTPS, deployed by a push-to-main pipeline.
This page documents how it works and how it is measured, because the measurement is the point. A toy chatbot pastes a resume into one LLM call. Robi is built and evaluated like a real retrieval system.
At a glance
| Item | Value |
|---|---|
| Retrieval | Hybrid: pgvector dense + tsvector keyword, reranked |
| Recall@1 | 0.971 |
| Recall@5 | 0.971 |
| MRR | 0.971 |
| Refusal accuracy | 1.000 (off-topic questions correctly declined) |
| False refusals | 0.000 |
| Faithfulness | 0.943 (LLM-as-judge) |
| Guardrail layers | 4 |
| Embedding model | BAAI/bge-small-en-v1.5 (self-hosted) |
| Generation | Llama 3.3 70B via Groq |
| Deployment | Docker Compose on VPS, Caddy HTTPS |
| Monitoring | Prometheus + Grafana |
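The hybrid retrieval row combines a dense ranking and a keyword ranking before reranking. This page does not say how Robi merges the two lists, so the sketch below uses reciprocal rank fusion (RRF) purely as one common option. The function name, the chunk IDs, and the constant `k = 60` are illustrative assumptions, not Robi's actual code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk scores 1 / (k + rank) in every list it appears in;
    the scores are summed and chunks are sorted by total score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the two retrievers (best match first).
dense_hits = ["c1", "c2", "c3"]    # e.g. from a pgvector similarity query
keyword_hits = ["c3", "c1", "c4"]  # e.g. from a tsvector full-text query

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

Chunks that both retrievers rank highly rise to the top, and the fused candidates would then go to the cross-encoder reranker.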
Why I built this
Most portfolio chatbots are demos. They stuff a resume into a prompt and hope the model behaves. That is useful for a quick prototype, but it is not the system you want answering strangers on the internet.
I built Robi to answer a more serious question: what does a small, production RAG system look like when it has to be accurate, refuse questions it cannot ground in its corpus, cite its sources, survive public traffic, and tell me when it is failing?
What makes it different
It refuses when retrieval is weak. The retrieval gate is the main anti-hallucination control. If the reranked context is not strong enough, Robi declines instead of stretching to produce an answer.
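A minimal sketch of such a gate, assuming the reranker yields a per-chunk relevance score where higher is better. The `Chunk` type, the `MIN_RERANK_SCORE` value, and the refusal text are hypothetical; this page does not state Robi's actual threshold or data model.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the real threshold is not given in this document.
MIN_RERANK_SCORE = 0.30

REFUSAL = "I don't have enough information in my sources to answer that."


@dataclass
class Chunk:
    source: str
    text: str
    rerank_score: float  # assumed: higher means more relevant


def retrieval_gate(chunks: list[Chunk], min_score: float = MIN_RERANK_SCORE) -> list[Chunk]:
    """Keep only chunks strong enough to ground an answer.

    An empty result means the caller should refuse.
    """
    return [c for c in chunks if c.rerank_score >= min_score]


def answer(question: str, chunks: list[Chunk]) -> str:
    kept = retrieval_gate(chunks)
    if not kept:
        # Decline before any generation call is made.
        return REFUSAL
    # A real system would call the LLM here with `kept` as context.
    sources = ", ".join(sorted({c.source for c in kept}))
    return f"[answer grounded in: {sources}]"
```

Because the gate runs before generation, a refusal in this sketch costs no LLM call, and the model never sees context that failed the cutoff.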
It is measured with a golden set. Retrieval, refusals, false refusals, and faithfulness are all evaluated. The eval caught real bugs during development and drove changes to chunk headers.
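For reference, the retrieval metrics can be computed along these lines. This is a sketch under assumptions: the golden set is modeled as pairs of (ranked chunk IDs returned, set of expected chunk IDs), and the function names and sample data are made up for illustration. It is not the eval harness used for the numbers on this page.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the expected chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first expected chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden: list[tuple[list[str], set[str]]], ks: tuple[int, ...] = (1, 5)) -> dict[str, float]:
    """Average each metric over all questions in the golden set."""
    n = len(golden)
    report = {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in golden) / n
        for k in ks
    }
    report["mrr"] = sum(reciprocal_rank(r, rel) for r, rel in golden) / n
    return report


# Hypothetical three-question golden set.
golden_set = [
    (["a", "b"], {"a"}),  # expected chunk at rank 1
    (["c", "d"], {"d"}),  # expected chunk at rank 2
    (["e"], {"z"}),       # expected chunk never retrieved
]
report = evaluate(golden_set)
print(report)
```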
It is operated like a service. Prometheus and Grafana track latency, outcomes, refusal rate, cost, and component errors. The dashboard and alert rules are provisioned as code.
Tech stack
| Layer | Technology |
|---|---|
| Backend | Python 3.12, FastAPI |
| Database | PostgreSQL 16 |
| Vector search | pgvector |
| Keyword search | PostgreSQL tsvector |
| Cache / rate limits | Redis |
| Embeddings | BAAI/bge-small-en-v1.5 |
| Reranking | sentence-transformers cross-encoder |
| Generation | Groq, Llama 3.3 70B |
| Monitoring | Prometheus, Grafana |
| Reverse proxy | Caddy |
| Containers | Docker Compose |
| CI/CD | GitHub Actions, auto-deploy to VPS |
Where to go from here
- architecture.md: ingestion flow, live query flow, storage, and deployment shape
- retrieval-eval.md: hybrid search, reranking, golden set metrics, and the bugs the eval caught
- guardrails.md: input guard, retrieval gate, grounding prompt, output handling
- monitoring.md: Prometheus, Grafana, latency, refusal rate, cost, and alerts
- deployment.md: Docker Compose, Caddy HTTPS, GitHub Actions, health checks
- challenges.md: the engineering problems that shaped the system
- api-reference.md: live API shape for ask and health endpoints
- links/live-demo.url: ask Robi on the About page
- links/api-health.url: backend health endpoint