# Sentinel — Fraud Detection Platform
Production-grade fraud operations platform with calibrated LightGBM scoring at 8.5ms, SHAP explainability on every prediction, and $1.23M in modeled net savings from cost-aware threshold tuning.
## Machine Learning Deep Dive
### The problem
Fraud detection at scale pits three pressures against each other. Models that maximize accuracy miss the rare fraud cases. Models that maximize fraud recall flood analysts with false positives. And models that look great on validation data often fail in production because of distribution drift or label leakage that wasn't caught during training.
### The approach
A calibrated LightGBM classifier with strict leakage controls, isotonic probability calibration, and SHAP-based explainability on every prediction.
### Why LightGBM?
- Handles class imbalance natively via `scale_pos_weight` (fraud is roughly 0.13% of transactions; see the training sketch after this list)
- Trains fast enough for iterative experimentation across hundreds of hyperparameter combinations
- Pairs with SHAP TreeExplainer for fast, exact attributions on every prediction
- Outperformed XGBoost and Logistic Regression baselines on PR-AUC across multiple ablations
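A minimal sketch of that setup, assuming the data is already split into `X_train`/`y_train` and `X_val`/`y_val`; the hyperparameters shown are illustrative placeholders, not the tuned configuration:

```python
import lightgbm as lgb

# Weight the positive class by inverse prevalence (negatives / positives)
# to counter the ~0.13% fraud rate without resampling.
spw = (y_train == 0).sum() / (y_train == 1).sum()

model = lgb.LGBMClassifier(
    objective="binary",
    scale_pos_weight=spw,
    n_estimators=500,       # illustrative, not the tuned values
    learning_rate=0.05,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="average_precision",  # track PR-AUC, not accuracy
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
```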
### Dataset
PaySim — 6.36M synthetic mobile money transactions with a ~0.13% fraud rate. Tracked via DVC so the exact training data is reproducible from a clone.
| Feature | Source | Description |
|---|---|---|
| `amount` | raw | Transaction amount |
| `type_TRANSFER`, `type_CASH_OUT`, etc. | one-hot | Transaction type indicators |
| `old_balance_org` | raw | Sender balance before transaction |
| `new_balance_org` | raw | Sender balance after transaction |
| `old_balance_dest` | raw | Receiver balance before transaction |
| `new_balance_dest` | raw | Receiver balance after transaction |
| `amount_to_balance_ratio` | engineered | Amount relative to sender balance |
| `drains_full_balance` | engineered | Flag for transactions that empty the sender account |
| `hour`, `day` | engineered | Temporal patterns derived from the `step` column |
Total: 16 features after one-hot encoding.
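A sketch of the engineered features, under one assumption: the raw PaySim columns (`oldbalanceOrg`, `newbalanceOrig`, ...) have already been renamed to the snake_case names in the table. PaySim's `step` column advances one simulated hour per step, which is where `hour` and `day` come from:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Amount relative to the sender's pre-transaction balance; empty
    # accounts would divide by zero, so they get a ratio of 0.
    out["amount_to_balance_ratio"] = (
        out["amount"] / out["old_balance_org"].replace(0, np.nan)
    ).fillna(0.0)
    # Flag transactions that empty the sender account entirely.
    out["drains_full_balance"] = (
        (out["old_balance_org"] > 0) & (out["new_balance_org"] == 0)
    ).astype(int)
    # One step of PaySim's `step` column is one simulated hour.
    out["hour"] = out["step"] % 24
    out["day"] = out["step"] // 24
    # One-hot encode transaction type into type_TRANSFER, type_CASH_OUT, ...
    return pd.get_dummies(out, columns=["type"], prefix="type")
```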
### Results
Final evaluation on the hidden test set, which was never used for model selection:
| Metric | Value | Notes |
|---|---|---|
| Test PR-AUC | 0.992 | PaySim signal is clean; real-world is usually 0.3 to 0.7 |
| Test ROC-AUC | 0.999 | Less informative than PR-AUC for rare-event problems |
| Validation PR-AUC | 0.993 | 0.001 gap vs. test suggests no overfitting |
| Precision at default τ | 0.972 | At threshold τ=0.01 |
| Recall at default τ | 0.995 | |
| Net savings | $1.23M | At τ=0.01 with cost model $1000/missed fraud, $5/false positive |
| Latency | 8.5ms | Single prediction including SHAP attribution |
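The net-savings figure comes from sweeping the threshold against that explicit cost model. A hedged sketch of the computation, with `y_val` and `val_probs` assumed to be held-out labels and calibrated scores:

```python
import numpy as np

COST_MISSED_FRAUD = 1_000.0  # dollars lost per fraud the model fails to flag
COST_FALSE_POSITIVE = 5.0    # dollars of analyst time per false alert

def net_savings(y_true, y_prob, threshold):
    """Savings relative to a no-model baseline that misses every fraud."""
    flagged = y_prob >= threshold
    caught = int(((y_true == 1) & flagged).sum())
    false_alerts = int(((y_true == 0) & flagged).sum())
    return caught * COST_MISSED_FRAUD - false_alerts * COST_FALSE_POSITIVE

# Sweep candidate thresholds and keep the one that maximizes savings.
thresholds = np.linspace(0.001, 0.5, 500)
best_tau = max(thresholds, key=lambda t: net_savings(y_val, val_probs, t))
```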
#### Top SHAP features by global importance
| Feature | Interpretation |
|---|---|
| `old_balance_dest` | Receiver's balance before the transaction (laundering accounts often start at zero) |
| `amount_to_balance_ratio` | Transactions that move most of the sender's balance |
| `amount` | Raw transaction size |
| `drains_full_balance` | Engineered flag for cash-out patterns |
| `day`, `hour` | Temporal patterns (fraud spikes at specific times) |
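A minimal sketch of how these attributions are computed, assuming `model` is the fitted LightGBM classifier and `X_val` the validation frame (SHAP's `TreeExplainer` only supports tree models, so it runs against the underlying booster rather than the isotonic wrapper):

```python
import numpy as np
import shap

# TreeExplainer produces exact attributions for tree ensembles,
# fast enough to run on every request.
explainer = shap.TreeExplainer(model)
explanation = explainer(X_val)  # one attribution vector per row

# Global importance = mean |SHAP value| per feature across the dataset.
importance = np.abs(explanation.values).mean(axis=0)
ranked = sorted(zip(X_val.columns, importance), key=lambda kv: -kv[1])
for name, value in ranked[:5]:
    print(f"{name}: {value:.4f}")
```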
### The hard engineering decisions
1. Stratified random split, not temporal split. PaySim's `step` column looks like a time index, but the simulator generates a non-uniform fraud distribution: the fraud rate jumps 10x in the last 40% of the timeline, and a naive temporal split caused validation PR-AUC to collapse below 0.01. The decision is documented in `docs/model_card.md` as a defensible portfolio-grade tradeoff; true temporal honesty is deferred to the production drift monitoring system.
2. Dropped sender/receiver aggregate features. The first version included `sender_avg_amount` and `sender_txn_count`. Ablation showed they actively hurt: PR-AUC rose from 0.971 (with aggregates) to 0.993 (without). The simpler model won, and the ablation itself serves as the leakage-free proof.
3. Hidden test set discipline. The test set was never loaded during training or hyperparameter search. It was revealed exactly once, in `scripts/final_eval.py`, to produce the locked metrics in `models/lightgbm_final_test_report.json`.
4. Isotonic calibration via `FrozenEstimator`. Raw LightGBM probabilities are not well-calibrated, which matters when threshold tuning is driven by an explicit cost model. The trained model is wrapped in `CalibratedClassifierCV` with isotonic regression so that a score of 0.8 actually corresponds to roughly an 80% fraud probability.
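A minimal sketch of that calibration step, assuming scikit-learn ≥ 1.6 (where `FrozenEstimator` replaces the older `cv="prefit"` pattern) and a held-out calibration split `X_cal`/`y_cal`:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.frozen import FrozenEstimator

# FrozenEstimator stops CalibratedClassifierCV from refitting the
# already-trained booster, so fit() below trains only the isotonic
# calibrator on the held-out split.
calibrated = CalibratedClassifierCV(FrozenEstimator(model), method="isotonic")
calibrated.fit(X_cal, y_cal)

# After calibration, a score of 0.8 corresponds to roughly an 80%
# empirical fraud probability on data like the calibration split.
fraud_scores = calibrated.predict_proba(X_val)[:, 1]
```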
### Production integration
The model loads into the FastAPI process at startup via a lifespan context manager. Every `/score` request goes through:
- Pydantic schema validation on the request body
- Feature engineering (one-hot encoding, ratio computation, temporal features)
- `model.predict_proba()` for the calibrated score
- `shap_explainer(X)` for per-feature attribution
- Risk band assignment based on the active threshold
- Persistence to the `predictions` table with the explanation as JSONB
The whole path completes in 8.5ms p50, including SHAP. Loading the model once at startup, rather than per request, keeps ~2 seconds of model deserialization out of every prediction.
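A sketch of that startup wiring, with hypothetical artifact paths and a hypothetical `build_feature_row` helper standing in for the real validation and feature-engineering steps:

```python
from contextlib import asynccontextmanager

import joblib
import pandas as pd
import shap
from fastapi import FastAPI, Request

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load artifacts once per process; this is the ~2 seconds that would
    # otherwise be paid on every request. Paths are hypothetical.
    app.state.model = joblib.load("models/lightgbm_calibrated.joblib")
    app.state.explainer = shap.TreeExplainer(
        joblib.load("models/lightgbm_raw.joblib")  # uncalibrated tree model for SHAP
    )
    yield  # serve requests; nothing to tear down on shutdown

app = FastAPI(lifespan=lifespan)

def build_feature_row(txn: dict) -> pd.DataFrame:
    # Hypothetical stand-in for Pydantic validation + feature engineering.
    return pd.DataFrame([txn])

@app.post("/score")
def score(request: Request, txn: dict):
    X = build_feature_row(txn)
    prob = float(request.app.state.model.predict_proba(X)[0, 1])
    attribution = request.app.state.explainer(X).values[0].tolist()
    return {"score": prob, "explanation": attribution}
```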