Completed 2026

Robi: production RAG assistant

Retrieval-augmented chatbot with hybrid search, guardrails, eval, and live monitoring.



Robi is a retrieval-augmented chatbot that answers questions about me and my work, grounded in a curated corpus with source citations and layered guardrails. It is live on the About page of this site. The backend runs on my own VPS, fronted by Caddy with automatic HTTPS, deployed by a push-to-main pipeline.

This page documents how it works and how it is measured, because the measurement is the point. A toy chatbot pastes a resume into one LLM call. Robi is built and evaluated like a real retrieval system.

At a glance

Retrieval           Hybrid: pgvector dense + tsvector keyword, reranked
Recall@1            0.971
Recall@5            0.971
MRR                 0.971
Refusal accuracy    1.000 (off-topic correctly declined)
False refusals      0.000
Faithfulness        0.943 (LLM-as-judge)
Guardrail layers    4
Embedding model     BAAI/bge-small-en-v1.5 (self-hosted)
Generation          Llama 3.3 70B via Groq
Deployment          Docker Compose on VPS, Caddy HTTPS
Monitoring          Prometheus + Grafana

Why I built this

Most portfolio chatbots are demos. They stuff a resume into a prompt and hope the model behaves. That is useful for a quick prototype, but it is not the system you want answering strangers on the internet.

I built Robi to answer a more serious question: what does a small, production RAG system look like when it has to be accurate, refuse questions it cannot support, cite its sources, survive public traffic, and tell me when it is failing?

What makes it different

It refuses when retrieval is weak. The retrieval gate is the main anti-hallucination control. If the reranked context is not strong enough, Robi declines to answer instead of stretching thin evidence into a guess.
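A minimal sketch of what such a gate could look like, assuming the reranker returns one relevance score per chunk, normalized to [0, 1]. The `RankedChunk` type, the `MIN_SCORE` threshold, and the refusal wording are hypothetical and not taken from Robi's code.

```python
from dataclasses import dataclass

# Hypothetical values: the real threshold and refusal text are not documented here.
MIN_SCORE = 0.5
REFUSAL = "I don't have enough grounded information to answer that."


@dataclass(frozen=True)
class RankedChunk:
    source: str   # citation label, e.g. a file name
    text: str     # chunk content passed to the LLM
    score: float  # reranker relevance, assumed to lie in [0, 1]


def gate(chunks: list[RankedChunk], min_score: float = MIN_SCORE) -> list[RankedChunk]:
    """Keep only chunks strong enough to ground an answer."""
    return [c for c in chunks if c.score >= min_score]


def answer_or_refuse(chunks: list[RankedChunk]) -> str:
    """Refuse when nothing passes the gate; otherwise report the usable sources."""
    strong = gate(chunks)
    if not strong:
        return REFUSAL
    sources = ", ".join(sorted({c.source for c in strong}))
    # A real system would call the LLM here with only the strong chunks.
    return f"ANSWER grounded in: {sources}"
```

The design choice worth noting is that the refusal happens before generation: when no chunk clears the bar, the model is never asked to answer, so it has nothing to stretch.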

It is measured with a golden set. Retrieval, refusals, false refusals, and faithfulness are all evaluated. The eval caught real bugs during development and drove changes to chunk headers.
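For readers unfamiliar with the retrieval metrics, the sketch below shows one common way to compute Recall@k and MRR over a golden set. It assumes each golden entry pairs the ranked chunk ids returned by retrieval with the set of ids labeled relevant; the function names and data shape are illustrative, not Robi's actual eval harness.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        raise ValueError("each golden entry needs at least one relevant id")
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden: list[tuple[list[str], set[str]]], k_values=(1, 5)) -> dict[str, float]:
    """Average each metric over all golden entries."""
    n = len(golden)
    report = {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in golden) / n
        for k in k_values
    }
    report["mrr"] = sum(reciprocal_rank(r, rel) for r, rel in golden) / n
    return report


# Toy golden set: the first query hits at rank 1, the second at rank 2.
toy = [(["a", "b", "c"], {"a"}), (["x", "y", "z"], {"y"})]
print(evaluate(toy))  # {'recall@1': 0.5, 'recall@5': 1.0, 'mrr': 0.75}
```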

It is operated like a service. Prometheus and Grafana track latency, outcomes, refusal rate, cost, and component errors. The dashboard and alert rules are provisioned as code.
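To make the outcome and refusal-rate tracking concrete, here is a stdlib-only sketch of a labeled counter rendered in the Prometheus text exposition format. It is a stand-in for illustration: a real service would normally rely on a Prometheus client library, and the metric name `robi_requests_total` and the outcome labels are hypothetical.

```python
from collections import Counter


class OutcomeCounter:
    """Counts requests per outcome and renders them as Prometheus text."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._counts: Counter[str] = Counter()

    def inc(self, outcome: str) -> None:
        self._counts[outcome] += 1

    def refusal_rate(self) -> float:
        """Share of requests labeled 'refused'; 0.0 before any traffic."""
        total = sum(self._counts.values())
        return self._counts["refused"] / total if total else 0.0

    def render(self) -> str:
        """Emit HELP and TYPE headers, then one sample line per label value."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for outcome, value in sorted(self._counts.items()):
            lines.append(f'{self.name}{{outcome="{outcome}"}} {value}')
        return "\n".join(lines) + "\n"


requests = OutcomeCounter("robi_requests_total", "Requests by outcome")
for outcome in ["answered", "answered", "answered", "refused"]:
    requests.inc(outcome)
print(requests.render())
```

In a Prometheus setup the refusal rate would usually be derived at query time from the counter rather than computed in the application; the `refusal_rate` method is included only to show the arithmetic.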

Tech stack

Layer               Technology
Backend             Python 3.12, FastAPI
Database            PostgreSQL 16
Vector search       pgvector
Keyword search      PostgreSQL tsvector
Cache / rate limits Redis
Embeddings          BAAI/bge-small-en-v1.5
Reranking           sentence-transformers cross-encoder
Generation          Groq, Llama 3.3 70B
Monitoring          Prometheus, Grafana
Reverse proxy       Caddy
Containers          Docker Compose
CI/CD               GitHub Actions, auto-deploy to VPS
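The vector and keyword searches each return their own ranked list, and the two must be merged before reranking. This page does not say how Robi merges them, so the sketch below uses reciprocal rank fusion (RRF) purely as an assumed, commonly used technique. The chunk ids are made up, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists; each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense = ["c1", "c2", "c3"]    # hypothetical pgvector results, best first
keyword = ["c3", "c1", "c4"]  # hypothetical tsvector results, best first
print(reciprocal_rank_fusion([dense, keyword]))  # ['c1', 'c3', 'c2', 'c4']
```

Chunks found by both searches rise to the top, which gives the cross-encoder a candidate pool that covers both semantic and exact-term matches.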

Where to go from here

  • architecture.md: ingestion flow, live query flow, storage, and deployment shape
  • retrieval-eval.md: hybrid search, reranking, golden set metrics, and the bugs the eval caught
  • guardrails.md: input guard, retrieval gate, grounding prompt, output handling
  • monitoring.md: Prometheus, Grafana, latency, refusal rate, cost, and alerts
  • deployment.md: Docker Compose, Caddy HTTPS, GitHub Actions, health checks
  • challenges.md: the engineering problems that shaped the system
  • api-reference.md: live API shape for ask and health endpoints
  • links/live-demo.url: ask Robi on the About page
  • links/api-health.url: backend health endpoint