rpmjp/projects/sentinel/ml-deep-dive.md
Completed: October 2025 – January 2026

Sentinel — Fraud Detection Platform

Production-grade fraud operations platform with calibrated LightGBM scoring at 8.5ms, SHAP explainability on every prediction, and $1.23M in modeled net savings from cost-aware threshold tuning.

Python 3.12 · FastAPI · LightGBM · SHAP · PostgreSQL 16 · React 19 · TypeScript · Tailwind v4

Machine Learning Deep Dive

The problem

Fraud detection at scale involves three competing pressures. Models that maximize accuracy miss the rare fraud cases. Models that maximize fraud recall flood analysts with false positives. And models that look great on validation data often fail in production due to distribution drift or label leakage that wasn't caught during training.

The approach

A calibrated LightGBM classifier with strict leakage controls, isotonic probability calibration, and SHAP-based explainability on every prediction.

Why LightGBM?

  • Handles class imbalance natively via scale_pos_weight (fraud is roughly 0.13% of transactions)
  • Trains fast enough for iterative experimentation across hundreds of hyperparameter combinations
  • Pairs with SHAP TreeExplainer for fast, exact attributions on every prediction
  • Outperformed XGBoost and Logistic Regression baselines on PR-AUC across multiple ablations
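The imbalance handling mentioned above boils down to setting `scale_pos_weight` to the negative-to-positive class ratio. A minimal sketch of deriving it (the other hyperparameters and the exact class counts are illustrative assumptions, not the project's tuned configuration):

```python
# Sketch: derive scale_pos_weight from class counts, then assemble a
# LightGBM parameter dict. Everything except scale_pos_weight is an
# illustrative default, not the project's tuned values.

def lgbm_params(n_legit: int, n_fraud: int) -> dict:
    # Upweight the positive (fraud) class by the imbalance ratio.
    scale_pos_weight = n_legit / n_fraud
    return {
        "objective": "binary",
        "metric": "average_precision",  # PR-AUC-style signal for rare events
        "scale_pos_weight": scale_pos_weight,
        "learning_rate": 0.05,
        "num_leaves": 63,
    }

# Counts consistent with ~0.13% fraud in a PaySim-scale dataset
# (illustrative, not the exact dataset counts).
params = lgbm_params(n_legit=6_351_732, n_fraud=8_268)
```

With a ~0.13% fraud rate, the resulting weight is in the high hundreds, which is why a weighted objective is preferred here over undersampling the majority class.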

Dataset

PaySim — 6.36M synthetic mobile money transactions with a ~0.13% fraud rate. Tracked via DVC so the exact training data is reproducible from a clone.

| Feature | Source | Description |
| --- | --- | --- |
| `amount` | raw | Transaction amount |
| `type_TRANSFER`, `type_CASH_OUT`, etc. | one-hot | Transaction type indicators |
| `old_balance_org` | raw | Sender balance before transaction |
| `new_balance_org` | raw | Sender balance after transaction |
| `old_balance_dest` | raw | Receiver balance before transaction |
| `new_balance_dest` | raw | Receiver balance after transaction |
| `amount_to_balance_ratio` | engineered | Amount relative to sender balance |
| `drains_full_balance` | engineered | Flag for transactions that empty the sender account |
| `hour`, `day` | engineered | Temporal patterns derived from the `step` column |

Total: 16 features after one-hot encoding.
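The engineered features can be sketched as a single pandas transform. This assumes PaySim's `step` column counts hours (its documented convention) and uses an epsilon guard for zero balances; the exact definitions and edge-case handling in the project may differ:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Amount relative to the sender's pre-transaction balance;
    # epsilon guards against division by zero on empty accounts.
    out["amount_to_balance_ratio"] = out["amount"] / (out["old_balance_org"] + 1e-9)
    # Flag transactions that empty the sender's account.
    out["drains_full_balance"] = (
        (out["new_balance_org"] == 0) & (out["amount"] > 0)
    ).astype(int)
    # PaySim's step column is an hour counter; derive temporal features.
    out["hour"] = out["step"] % 24
    out["day"] = out["step"] // 24
    # One-hot encode transaction type (type_TRANSFER, type_CASH_OUT, ...).
    return pd.get_dummies(out, columns=["type"], prefix="type")

txns = pd.DataFrame({
    "step": [1, 30],
    "type": ["TRANSFER", "CASH_OUT"],
    "amount": [500.0, 1000.0],
    "old_balance_org": [500.0, 4000.0],
    "new_balance_org": [0.0, 3000.0],
})
features = engineer_features(txns)
```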


Results

Final evaluation on the hidden test set, which was never used for model selection:

| Metric | Value | Notes |
| --- | --- | --- |
| Test PR-AUC | 0.992 | PaySim signal is clean; real-world is usually 0.3 to 0.7 |
| Test ROC-AUC | 0.999 | Less informative than PR-AUC for rare-event problems |
| Validation PR-AUC | 0.993 | 0.001 gap from test indicates minimal overfitting |
| Precision at default τ | 0.972 | At threshold τ = 0.01 |
| Recall at default τ | 0.995 | |
| Net savings | $1.23M | At τ = 0.01 with cost model: $1,000/missed fraud, $5/false positive |
| Latency | 8.5 ms | Single prediction including SHAP attribution |
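The net-savings number comes from the stated cost model applied at the chosen threshold. A sketch of the underlying sweep, using the cost constants from the table (the sweep itself and the toy data are illustrative, not the project's evaluation code):

```python
import numpy as np

COST_MISSED_FRAUD = 1000.0   # loss per fraud the model fails to flag
COST_FALSE_POSITIVE = 5.0    # analyst review cost per false alarm

def net_savings(y_true: np.ndarray, scores: np.ndarray, tau: float) -> float:
    """Savings vs. a no-model baseline that misses every fraud."""
    flagged = scores >= tau
    caught = np.sum((y_true == 1) & flagged)        # true positives
    false_alarms = np.sum((y_true == 0) & flagged)  # false positives
    return caught * COST_MISSED_FRAUD - false_alarms * COST_FALSE_POSITIVE

def best_threshold(y_true, scores, taus):
    # Pick the threshold that maximizes modeled net savings.
    return max(taus, key=lambda t: net_savings(y_true, scores, t))

# Toy example: 3 frauds, 7 legitimate transactions.
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
s = np.array([0.9, 0.8, 0.02, 0.3, 0.1, 0.05, 0.01, 0.02, 0.03, 0.2])
tau = best_threshold(y, s, taus=[0.01, 0.05, 0.1, 0.5])
```

Because a missed fraud costs 200x a false positive in this model, the optimum sits at a very low threshold, which is consistent with τ = 0.01 in the results table.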

Top SHAP features by global importance

| Feature | Interpretation |
| --- | --- |
| `old_balance_dest` | Receiver's balance before the transaction (laundering accounts often start at zero) |
| `amount_to_balance_ratio` | Transactions that move most of the sender's balance |
| `amount` | Raw transaction size |
| `drains_full_balance` | Engineered flag for cash-out patterns |
| `day`, `hour` | Temporal patterns (fraud spikes at specific times) |

The hard engineering decisions

1. Stratified random split, not temporal split. PaySim's step column looks like a time index, but the simulator generates a non-uniform fraud distribution: the fraud rate jumps 10x in the last 40% of the timeline. A naive temporal split caused validation PR-AUC to collapse below 0.01. The decision is documented in docs/model_card.md as a defensible portfolio-grade tradeoff — true temporal honesty is deferred to the production drift monitoring system.
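A stratified split of this kind preserves the fraud rate in both partitions regardless of where fraud clusters in the timeline. A sketch with scikit-learn (toy data and split ratio are assumptions; the project's actual ratios aren't stated here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy data with a rare positive class, mimicking ~0.13% fraud.
y = (rng.random(100_000) < 0.0013).astype(int)
X = rng.normal(size=(100_000, 4))

# stratify=y keeps the fraud rate identical across both halves,
# unlike a temporal split over a drifting fraud distribution.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```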

2. Dropped sender/receiver aggregate features. The first version included sender_avg_amount and sender_txn_count. Ablation showed these were not just unnecessary but harmful — PR-AUC rose from 0.971 (with aggregates) to 0.993 (without). The simpler model won, and the ablation itself serves as the leakage-free proof.

3. Hidden test set discipline. The test set was never loaded during training or hyperparameter search. It was revealed exactly once in scripts/final_eval.py to produce the locked metrics in models/lightgbm_final_test_report.json.

4. Isotonic calibration via FrozenEstimator. Raw LightGBM probabilities are not well-calibrated, which matters when threshold tuning is driven by an explicit cost model. The trained model is wrapped in CalibratedClassifierCV with isotonic regression so that a score of 0.8 actually corresponds to roughly an 80% fraud probability.
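The mechanism underneath that wrapper is isotonic regression: a monotone map from raw scores onto empirical positive rates. A minimal sketch using scikit-learn's `IsotonicRegression` directly (the project reportedly uses `CalibratedClassifierCV` around the frozen model; the toy scores and labels here are illustrative):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Raw (uncalibrated) scores on a held-out calibration set, with labels.
raw_scores = np.array([0.05, 0.1, 0.2, 0.4, 0.6, 0.7, 0.8, 0.95])
labels     = np.array([0,    0,   0,   0,   1,   0,   1,   1])

# Fit a monotone map from raw score -> calibrated probability.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_scores, labels)

calibrated = iso.predict(np.array([0.1, 0.5, 0.9]))
```

Isotonic (rather than Platt/sigmoid) calibration makes no parametric assumption about the score distribution, which suits boosted trees whose raw outputs can be arbitrarily distorted.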


Production integration

The model loads into the FastAPI process at startup via a lifespan context manager. Every /score request goes through:

  1. Pydantic schema validation on the request body
  2. Feature engineering (one-hot encoding, ratio computation, temporal features)
  3. model.predict_proba() for the calibrated score
  4. shap_explainer(X) for per-feature attribution
  5. Risk band assignment based on the active threshold
  6. Persistence to the predictions table with the explanation as JSONB
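Step 5 can be sketched as a plain function. The band names and the multiplier on the active threshold are assumptions — the doc only specifies that bands derive from the active threshold:

```python
# Sketch of risk-band assignment (step 5). Band names and the
# 10x multiplier on the active threshold are illustrative assumptions.

def assign_risk_band(score: float, tau: float) -> str:
    if score >= tau * 10:   # well above the alerting threshold
        return "high"
    if score >= tau:        # at or above the active threshold: alert
        return "medium"
    return "low"

bands = [assign_risk_band(s, tau=0.01) for s in (0.002, 0.05, 0.4)]
```

Keeping the banding a pure function of (score, threshold) means retuning the threshold against a new cost model never requires retraining or redeploying the model itself.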

The whole path completes in 8.5 ms p50, including SHAP. Loading the model once at startup (rather than per request) avoids a ~2-second model-load penalty on every prediction.