Robi: production RAG assistant
Retrieval-augmented chatbot with hybrid search, guardrails, eval, and live monitoring.
Architecture
Robi has two main flows: offline ingestion and live question answering. The goal is to keep the runtime path small, fast, and measurable while making the corpus easy to update without manual cleanup.
Offline ingestion
The corpus lives as curated Markdown. During ingestion, documents are split by section so each chunk stays semantically focused. Every chunk is prefixed with its document title and heading before embedding. That small detail matters: the eval found that some chunks were technically correct but hard to retrieve because the body text never repeated the project or topic name.
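As a rough illustration of the section splitting and title/heading prefix described above, here is a minimal sketch. The `Chunk` class, the `split_by_section` function, and the `Title > Heading` prefix format are assumptions made for this example, not Robi's actual code; the sketch also assumes ATX-style (`#`) Markdown headings.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code-fence marker
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    title: str
    heading: str
    body: str

    @property
    def embed_text(self) -> str:
        # Prefix title and heading so the embedded text names its topic
        # even when the body itself never does.
        return f"{self.title} > {self.heading}\n\n{self.body}"


def split_by_section(title: str, markdown: str) -> list[Chunk]:
    """Split a Markdown document into one chunk per heading section."""
    # Text before the first heading is filed under the document title.
    sections: list[tuple[str, list[str]]] = [(title, [])]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with '#' inside a code fence are not headings.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            sections.append((match.group(2), []))
        else:
            sections[-1][1].append(line)

    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # skip headings with no content under them
            chunks.append(Chunk(title, heading, body))
    return chunks
```

The string passed to the embedding model would be `chunk.embed_text`, while `chunk.body` can still be stored and shown on its own.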
Each chunk is embedded with a self-hosted BAAI/bge-small-en-v1.5 model and upserted into Postgres. The ingest path is idempotent through content hashing, so unchanged chunks are skipped and changed chunks are replaced cleanly. When a source file is removed, its chunks are deleted as well, so stale answers do not linger.
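One way to picture the hash-based idempotency is a planning step that compares stored hashes with the incoming chunks. This is a sketch under assumptions: the `content_hash` and `plan_ingest` names, the choice of SHA-256, and the chunk ID format are illustrative, and the real ingest path works against Postgres rather than in-memory dictionaries.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingest(stored: dict[str, str], incoming: dict[str, str]) -> dict[str, list[str]]:
    """Decide what to do with each chunk.

    stored   maps chunk_id -> content hash already in the database
    incoming maps chunk_id -> chunk text from the current corpus
    """
    plan: dict[str, list[str]] = {"skip": [], "upsert": [], "delete": []}
    for chunk_id, text in incoming.items():
        if stored.get(chunk_id) == content_hash(text):
            plan["skip"].append(chunk_id)    # unchanged: no re-embedding needed
        else:
            plan["upsert"].append(chunk_id)  # new or changed: embed and replace
    # Anything stored but no longer in the corpus is stale.
    plan["delete"] = [cid for cid in stored if cid not in incoming]
    return plan
```

Running the same plan twice over an unchanged corpus yields only `skip` entries, which is what makes repeated ingestion safe.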
Live query flow
- The frontend sends a question to POST /ask.
- An input guard checks length and obvious prompt-injection patterns.
- The query embedding is loaded from the Redis cache, or computed on a cache miss.
- Dense retrieval pulls semantic candidates from pgvector.
- Keyword retrieval pulls lexical candidates from PostgreSQL tsvector.
- Candidates are merged and reranked with a cross-encoder.
- A retrieval gate checks whether the top context is strong enough.
- If retrieval passes, Llama 3.3 70B receives the grounded prompt and retrieved chunks.
- The response returns an answer plus source citations.
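The input guard step in the list above could look roughly like the following. The length limit, the pattern list, and the `check_input` name are all assumptions chosen for illustration; the real guard's rules are not described here, and a short regex list only catches the most obvious injection attempts.

```python
import re

MAX_QUESTION_CHARS = 500  # assumed limit, not Robi's actual value

# Assumed, deliberately small set of "obvious" injection phrasings.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"reveal (your |the )?(system )?prompt", re.I),
    re.compile(r"you are now\b", re.I),
]


def check_input(question: str) -> tuple[bool, str]:
    """Return (allowed, reason) for an incoming question."""
    text = question.strip()
    if not text:
        return False, "empty"
    if len(text) > MAX_QUESTION_CHARS:
        return False, "too_long"
    if any(p.search(text) for p in INJECTION_PATTERNS):
        return False, "possible_injection"
    return True, "ok"
```

Because the check runs before any embedding or retrieval work, rejected requests cost almost nothing.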
If retrieval fails, Robi refuses instead of answering from weak context. That is the main design choice that separates the system from a prompt-only chatbot.
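The merge, rerank, and gate steps can be sketched as below. This is a minimal sketch, not the production code: `rerank` and `generate` are hypothetical callables standing in for the cross-encoder and the LLM call, and the `min_score` threshold, `top_k` value, and refusal wording are assumed.

```python
from dataclasses import dataclass

REFUSAL = "I don't have enough grounded context to answer that."  # assumed wording


@dataclass
class Candidate:
    chunk_id: str
    text: str
    score: float = 0.0  # filled in by the reranker


def merge_candidates(dense: list[Candidate], keyword: list[Candidate]) -> list[Candidate]:
    """Union of both result lists, de-duplicated by chunk_id."""
    merged: dict[str, Candidate] = {}
    for cand in [*dense, *keyword]:
        merged.setdefault(cand.chunk_id, cand)  # keep the first occurrence
    return list(merged.values())


def answer(question, dense, keyword, rerank, generate, min_score=0.3, top_k=4):
    """rerank(question, candidates) returns scored candidates;
    generate(question, context) returns the answer text."""
    candidates = merge_candidates(dense, keyword)
    ranked = sorted(rerank(question, candidates), key=lambda c: c.score, reverse=True)

    # Retrieval gate: refuse when even the best chunk scores too low.
    if not ranked or ranked[0].score < min_score:
        return {"answer": REFUSAL, "sources": []}

    context = ranked[:top_k]
    return {
        "answer": generate(question, context),
        "sources": [c.chunk_id for c in context],
    }
```

The point of the gate is that `generate` is never called on weak context, so a refusal involves no model call at all.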
Storage model
Postgres stores the chunk text, document metadata, source URL, content hash, dense embedding, and tsvector field. Keeping dense and keyword search in the same database makes the system simpler to operate than running a separate vector service. Redis handles the query embedding cache, rate limiting, and short-lived operational state.
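A query embedding cache along these lines might be sketched as follows. The key format, the lowercase and whitespace normalization, the JSON encoding, and the TTL are assumptions for illustration. `DictCache` is an in-memory stand-in; a redis-py client exposes `get` and `set(..., ex=...)` with the same shape, but how Robi actually keys and serializes embeddings is not stated.

```python
import hashlib
import json

MODEL_NAME = "BAAI/bge-small-en-v1.5"
CACHE_TTL_SECONDS = 3600  # assumed expiry


class DictCache:
    """In-memory stand-in exposing the get/set subset used below."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, ex=None):
        self._data[key] = value  # expiry is ignored in this stand-in


def cache_key(query: str, model: str = MODEL_NAME) -> str:
    # Normalize case and whitespace so trivially different queries share a key.
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    # Including the model name avoids serving vectors from a different model.
    return f"emb:{model}:{digest}"


def get_query_embedding(query: str, cache, embed) -> list[float]:
    """Return the cached vector, or compute it with embed() and store it."""
    key = cache_key(query)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    vector = embed(query)
    cache.set(key, json.dumps(vector), ex=CACHE_TTL_SECONDS)
    return vector
```

Repeated questions then skip the embedding model entirely, which matters most when the model is self-hosted on a small VPS.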
Deployment shape
The backend runs in Docker Compose on a VPS. Caddy terminates HTTPS and routes traffic to the FastAPI service. GitHub Actions runs checks and deploys on push to main. Prometheus scrapes the service and Grafana renders the operations dashboard.