Sentinel — Fraud Detection Platform
Production-grade fraud operations platform with calibrated LightGBM scoring at 8.5 ms, SHAP explainability on every prediction, and $1.23M in modeled net savings from cost-aware threshold tuning.
Security Engineering
The CSV upload endpoint accepts arbitrary user files and runs them through a scoring pipeline. That's a security surface — every CSV is untrusted input from the network. Sentinel's upload pipeline is hardened against six attack classes, with defense-in-depth at every layer.
Multi-tenant isolation is enforced separately at the query level, so cross-tenant data leakage is structurally impossible.
The CSV upload attack surface
A naive CSV upload endpoint is vulnerable to:
- Oversized files — DoS by uploading multi-GB files that exhaust memory or disk
- Schema violations — malformed CSVs that crash the parser or produce garbage predictions
- Formula injection — CSV cells starting with =, +, -, or @ that execute as formulas in Excel/Sheets when an analyst opens the audit export
- Rate-limit abuse — flooding the endpoint with requests to amplify other attacks or exhaust compute
- Unauthorized access — non-admin users uploading data outside their authorization scope
- Tenant boundary violations — uploads scored against the wrong tenant's model
The defense stack
1. Nginx body size cap
Reverse proxy rejects requests larger than 5 MB at the edge. The Python process never sees an oversized payload. Configured in infra/nginx/upload_limits.conf with client_max_body_size 5M;.
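The edge limit amounts to a single directive. A sketch of what infra/nginx/upload_limits.conf could contain (the location path and upstream name here are illustrative, not taken from the repo):

```nginx
# Reject request bodies over 5 MB at the edge; nginx answers oversized
# uploads with 413 Request Entity Too Large before the app sees a byte.
location /api/uploads {
    client_max_body_size 5M;
    proxy_pass http://sentinel_api;
}
```

Enforcing the cap at the proxy means the Python worker's memory budget is never a factor in the oversized-file attack.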
2. FastAPI schema validation
Pydantic validates every CSV row against the expected schema. Type mismatches, missing required columns, and invalid values are rejected with structured error responses. The parser uses streaming reads — it never loads the full file into memory.
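The streaming, row-by-row shape of that validation can be sketched with only the stdlib csv module (the column names and types below are hypothetical; Sentinel's real schema lives in its Pydantic models):

```python
import csv
import io

# Hypothetical expected schema: column name -> coercion function.
EXPECTED = {"transaction_id": str, "amount": float, "type": str}

def validate_stream(text_stream):
    """Yield validated rows one at a time; never materialize the whole file."""
    reader = csv.DictReader(text_stream)
    missing = set(EXPECTED) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    for line_no, row in enumerate(reader, start=2):
        try:
            yield {col: cast(row[col]) for col, cast in EXPECTED.items()}
        except (ValueError, TypeError) as exc:
            raise ValueError(f"row {line_no}: {exc}") from exc
```

Because the function is a generator over a file-like stream, memory use stays flat regardless of row count, which is what makes the 5 MB edge cap the only size control needed.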
3. Formula injection neutralization
Every cell value is scanned for leading =, +, -, or @. Matches get prefixed with a single-quote (') so spreadsheet apps treat the cell as text, not a formula. This blocks attacks like =cmd|'/c calc'!A0 that would otherwise execute when an analyst opens the audit CSV in Excel.
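A minimal sketch of that neutralization step (function names and wiring are illustrative, not the actual Sentinel code):

```python
# Leading characters that trigger formula evaluation in Excel/Sheets.
FORMULA_PREFIXES = ("=", "+", "-", "@")

def neutralize_cell(value: str) -> str:
    """Prefix a single-quote so spreadsheet apps render the cell as text."""
    if value and value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value

def neutralize_row(row: dict) -> dict:
    """Apply neutralization to every string field in a parsed CSV row."""
    return {k: neutralize_cell(v) if isinstance(v, str) else v
            for k, v in row.items()}
```

One design consequence worth noting: string fields that legitimately start with a minus sign (e.g. "-12.50" stored as text) also get prefixed, which is the standard trade-off of this defense.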
4. Per-user rate limiting
Each user gets a rolling window allowance. Exceeding it returns 429. This caps the damage from a compromised credential and slows down brute-force discovery of valid schema patterns.
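The rolling-window allowance can be sketched as a per-user deque of timestamps (limit and window values here are illustrative defaults, not Sentinel's actual configuration):

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

class RollingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds, per user."""

    def __init__(self, limit: int = 10, window: float = 60.0):
        self.limit = limit
        self.window = window
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the rolling window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # the endpoint maps this to HTTP 429
        hits.append(now)
        return True
```

A true rolling window (as opposed to fixed buckets) avoids the burst-at-the-boundary problem where a user fires 2x the limit across a bucket edge.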
5. Role-based access control
The upload endpoint requires admin or senior_analyst role. Regular analysts can score individual transactions but can't bulk-upload. The check happens via a FastAPI dependency before any handler code runs — there's no path to the endpoint that skips it.
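The role gate can be sketched in plain Python (in the real app this logic sits inside a FastAPI dependency and raises HTTPException(status_code=403); the names below are illustrative):

```python
# Roles permitted to use the bulk-upload endpoint.
UPLOAD_ROLES = {"admin", "senior_analyst"}

class ForbiddenError(Exception):
    """Stands in for FastAPI's HTTPException(status_code=403) in this sketch."""

def require_upload_role(user: dict) -> dict:
    """Runs before any handler code, mirroring a FastAPI dependency:
    if the role check fails, the handler body is never reached."""
    if user.get("role") not in UPLOAD_ROLES:
        raise ForbiddenError(f"role {user.get('role')!r} may not bulk-upload")
    return user
```

Wiring the check as a dependency rather than an in-handler `if` is what makes "no path to the endpoint skips it" a structural property instead of a convention.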
6. Audit trail
Every upload is logged to upload_audits with the filename, file size, row count, success/failure status, and per-risk-band counts. Failed uploads are logged with the error reason in JSONB. The audit panel shows the full history per tenant.
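The shape of one upload_audits record can be sketched as a dataclass (field names are illustrative; the real table schema may differ):

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class UploadAudit:
    """One row of the upload audit trail (illustrative field names)."""
    filename: str
    file_size: int
    row_count: int
    status: str                           # "success" or "failed"
    risk_band_counts: dict = field(default_factory=dict)
    error: Optional[dict] = None          # persisted as JSONB on failure
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```

Recording per-risk-band counts at upload time means the audit panel can show score distributions historically without re-scoring old files.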
Multi-tenant isolation
Separate from the upload pipeline, every database query is scoped by tenant_id automatically. The SQLAlchemy session yielded by the get_db() dependency is already filtered — a query like db.query(Transaction).all() returns only the current tenant's transactions, not the global set.
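The effect of that scoped session can be illustrated with a plain-Python stand-in (the real implementation filters at the SQLAlchemy session level inside get_db(); everything below is a simplified model of the idea):

```python
class TenantScopedDB:
    """Stand-in for a session whose queries are pre-filtered by tenant_id."""

    def __init__(self, rows: list, tenant_id: str):
        self._rows = rows
        self.tenant_id = tenant_id

    def query_all(self) -> list:
        # Handler code never passes tenant_id: the scope is baked in at
        # session construction, so an unscoped query cannot be expressed.
        return [r for r in self._rows if r["tenant_id"] == self.tenant_id]

ROWS = [
    {"id": 1, "tenant_id": "acme", "amount": 10.0},
    {"id": 2, "tenant_id": "globex", "amount": 25.0},
]
```

The key property is that the filter lives in the object handed to handlers, not in each query site, so forgetting it is not a possible bug.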
A bug in a handler cannot cause cross-tenant data leakage because there's no code path that produces an unscoped query. This is the standard multi-tenancy guarantee of enforcing tenant boundaries at the data layer rather than in per-handler code.
What's not implemented (and why)
No virus scanning. The CSV is parsed in Python, not executed. There's no upload path that runs binaries.
No DLP scanning. PaySim is synthetic data. A real deployment processing actual financial data would add a DLP layer to redact PAN/SSN patterns before persistence.
No CSP / WAF. The frontend is statically built and served — there's no SSR injection surface to defend. Adding a WAF for the API would be the next step in a production deployment.