Robi: a production RAG assistant

Robi is a retrieval-augmented chatbot that answers questions about me and my work, grounded in a curated corpus with source citations and layered guardrails. It is live on the About page of this site. The backend runs on my own VPS, fronted by Caddy with automatic HTTPS, deployed by a push-to-main pipeline.

This page documents how it works and how it is measured, because the measurement is the point. A toy chatbot pastes a resume into one LLM call. Robi is built and evaluated like a real retrieval system.

At a glance

Property	Value
Retrieval	Hybrid: pgvector dense + tsvector keyword, reranked
Recall@1	0.971
Recall@5	0.971
MRR	0.971
Refusal accuracy	1.000 (off-topic questions correctly declined)
False refusals	0.000
Faithfulness	0.943 (LLM-as-judge)
Guardrail layers	4
Embedding model	BAAI/bge-small-en-v1.5 (self-hosted)
Generation	Llama 3.3 70B via Groq
Deployment	Docker Compose on VPS, Caddy HTTPS
Monitoring	Prometheus + Grafana

Why I built this

Most portfolio chatbots are demos. They stuff a resume into a prompt and hope the model behaves. That is useful for a quick prototype, but it is not the system you want answering strangers on the internet.

I built Robi to answer a more serious question: what does a small production RAG system look like when it has to be accurate, refuse questions it cannot ground, cite its sources, survive public traffic, and tell me when it is failing?

What makes it different

It refuses when retrieval is weak. The retrieval gate is the main anti-hallucination control. If the reranked context is not strong enough, Robi declines to answer instead of stretching thin context into a guess.
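
A minimal sketch of the gate idea, not Robi's actual code. It assumes each reranked chunk carries a relevance score and that the gate refuses when no chunk clears a threshold. The names `ScoredChunk` and `gate`, and the threshold value 0.5, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ScoredChunk:
    text: str
    score: float  # reranker relevance score; the scale here is an assumption


def gate(chunks: List[ScoredChunk], min_score: float = 0.5) -> Optional[List[ScoredChunk]]:
    """Return the chunks strong enough to ground an answer, or None to refuse."""
    strong = [c for c in chunks if c.score >= min_score]
    # An empty list means nothing cleared the bar, so the caller should decline.
    return strong or None


if __name__ == "__main__":
    weak = [ScoredChunk("unrelated text", 0.12)]
    print("refuse" if gate(weak) is None else "answer")
```

The design choice this illustrates is that the refusal decision happens before generation, so the model is never asked to answer from context that did not pass the bar.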

It is measured with a golden set. Retrieval, refusals, false refusals, and faithfulness are all evaluated. The eval caught real bugs during development and drove changes to chunk headers.
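
For readers unfamiliar with the retrieval metrics, here is a small sketch of how Recall@k and MRR can be computed over a golden set. It assumes each golden question has exactly one expected chunk ID, which may not match how Robi's eval is structured. The function names and the sample data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[str], expected_id: str, k: int) -> float:
    """1.0 if the expected chunk appears in the top k results, else 0.0."""
    return 1.0 if expected_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], expected_id: str) -> float:
    """1 / rank of the expected chunk (1-based), or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == expected_id:
            return 1.0 / rank
    return 0.0


def evaluate(golden: List[Tuple[List[str], str]], k: int) -> Dict[str, float]:
    """Average both metrics over (retrieved_ids, expected_id) pairs."""
    n = len(golden)
    return {
        "recall_at_k": sum(recall_at_k(r, e, k) for r, e in golden) / n,
        "mrr": sum(reciprocal_rank(r, e) for r, e in golden) / n,
    }


if __name__ == "__main__":
    # Hypothetical results: a hit at rank 1, a hit at rank 2, and a miss.
    sample = [(["a", "b", "c"], "a"), (["b", "a", "c"], "a"), (["x", "y", "z"], "a")]
    print(evaluate(sample, k=1))
```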

It is operated like a service. Prometheus and Grafana track latency, outcomes, refusal rate, cost, and component errors. The dashboard and alert rules are provisioned as code.

Tech stack

Layer	Technology
Backend	Python 3.12, FastAPI
Database	PostgreSQL 16
Vector search	pgvector
Keyword search	PostgreSQL tsvector
Cache / rate limits	Redis
Embeddings	BAAI/bge-small-en-v1.5
Reranking	sentence-transformers cross-encoder
Generation	Groq, Llama 3.3 70B
Monitoring	Prometheus, Grafana
Reverse proxy	Caddy
Containers	Docker Compose
CI/CD	GitHub Actions, auto-deploy to VPS
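
The stack runs vector search and keyword search side by side, so their two ranked lists have to be merged before reranking. This page does not say how Robi merges them. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common technique for this step. The function name `rrf_merge`, the chunk IDs, and the constant `k=60` are assumptions, not taken from Robi.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists: each list adds 1 / (k + rank) per item."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]    # hypothetical pgvector results, best first
    keyword_hits = ["b", "d", "a"]  # hypothetical tsvector results, best first
    print(rrf_merge([dense_hits, keyword_hits]))
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine distances and full-text ranks are on different scales. The fused candidates would then go to the cross-encoder for reranking.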

Where to go from here

architecture.md: ingestion flow, live query flow, storage, and deployment shape
retrieval-eval.md: hybrid search, reranking, golden set metrics, and the bugs the eval caught
guardrails.md: input guard, retrieval gate, grounding prompt, output handling
monitoring.md: Prometheus, Grafana, latency, refusal rate, cost, and alerts
deployment.md: Docker Compose, Caddy HTTPS, GitHub Actions, health checks
challenges.md: the engineering problems that shaped the system
api-reference.md: live API shape for ask and health endpoints
links/live-demo.url: ask Robi on the About page
links/api-health.url: backend health endpoint