# Sentinel — Fraud Detection Platform
Production-grade fraud operations platform with calibrated LightGBM scoring at 8.5ms, SHAP explainability on every prediction, and $1.23M in modeled net savings from cost-aware threshold tuning.
## Machine Learning Deep Dive
### The problem
Fraud detection at scale pits three pressures against each other. Models that maximize accuracy miss the rare fraud cases. Models that maximize fraud recall flood analysts with false positives. And models that look great on validation data often fail in production because of distribution drift or label leakage that wasn't caught during training.
### The approach
A calibrated LightGBM classifier with strict leakage controls, isotonic probability calibration, and SHAP-based explainability on every prediction.
### Why LightGBM?
- Handles class imbalance natively via `scale_pos_weight` (fraud is roughly 0.13% of transactions; see the training sketch after this list)
- Trains fast enough for iterative experimentation across hundreds of hyperparameter combinations
- Pairs with SHAP TreeExplainer for fast, exact attributions on every prediction
- Outperformed XGBoost and Logistic Regression baselines on PR-AUC across multiple ablations
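A minimal sketch of that setup, assuming the data is already split into `X_train`/`y_train` and `X_val`/`y_val`; the hyperparameters shown are illustrative placeholders, not the tuned configuration:

```python
import lightgbm as lgb

# Weight the positive class by inverse prevalence (negatives / positives)
# to counter the ~0.13% fraud rate without resampling.
spw = (y_train == 0).sum() / (y_train == 1).sum()

model = lgb.LGBMClassifier(
    objective="binary",
    scale_pos_weight=spw,
    n_estimators=500,       # illustrative, not the tuned values
    learning_rate=0.05,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="average_precision",  # track PR-AUC, not accuracy
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
```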
### Dataset
PaySim — 6.36M synthetic mobile money transactions with a ~0.13% fraud rate. Tracked via DVC so the exact training data is reproducible from a clone.
| Feature | Source | Description |
|---|---|---|
| `amount` | raw | Transaction amount |
| `type_TRANSFER`, `type_CASH_OUT`, etc. | one-hot | Transaction type indicators |
| `old_balance_org` | raw | Sender balance before transaction |
| `new_balance_org` | raw | Sender balance after transaction |
| `old_balance_dest` | raw | Receiver balance before transaction |
| `new_balance_dest` | raw | Receiver balance after transaction |
| `amount_to_balance_ratio` | engineered | Amount relative to sender balance |
| `drains_full_balance` | engineered | Flag for transactions that empty the sender account |
| `hour`, `day` | engineered | Temporal patterns derived from the `step` column |
Total: 16 features after one-hot encoding.
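A sketch of the engineered features, under one assumption: the raw PaySim columns (`oldbalanceOrg`, `newbalanceOrig`, ...) have already been renamed to the snake_case names in the table. PaySim's `step` column advances one simulated hour per step, which is where `hour` and `day` come from:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Amount relative to the sender's pre-transaction balance; empty
    # accounts would divide by zero, so they get a ratio of 0.
    out["amount_to_balance_ratio"] = (
        out["amount"] / out["old_balance_org"].replace(0, np.nan)
    ).fillna(0.0)
    # Flag transactions that empty the sender account entirely.
    out["drains_full_balance"] = (
        (out["old_balance_org"] > 0) & (out["new_balance_org"] == 0)
    ).astype(int)
    # One step of PaySim's `step` column is one simulated hour.
    out["hour"] = out["step"] % 24
    out["day"] = out["step"] // 24
    # One-hot encode transaction type into type_TRANSFER, type_CASH_OUT, ...
    return pd.get_dummies(out, columns=["type"], prefix="type")
```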
### Results
Final evaluation on the hidden test set, which was never used for model selection:
| Metric | Value | Notes |
|---|---|---|
| Test PR-AUC | 0.992 | PaySim signal is clean; real-world is usually 0.3 to 0.7 |
| Test ROC-AUC | 0.999 | Less informative than PR-AUC for rare-event problems |
| Validation PR-AUC | 0.993 | 0.001 gap vs. test suggests no overfitting |
| Precision at default τ | 0.972 | At threshold τ=0.01 |
| Recall at default τ | 0.995 | |
| Net savings | $1.23M | At τ=0.01 with cost model $1000/missed fraud, $5/false positive |
| Latency | 8.5ms | Single prediction including SHAP attribution |
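The net-savings figure comes from sweeping the threshold against that explicit cost model. A hedged sketch of the computation, with `y_val` and `val_probs` assumed to be held-out labels and calibrated scores:

```python
import numpy as np

COST_MISSED_FRAUD = 1_000.0  # dollars lost per fraud the model fails to flag
COST_FALSE_POSITIVE = 5.0    # dollars of analyst time per false alert

def net_savings(y_true, y_prob, threshold):
    """Savings relative to a no-model baseline that misses every fraud."""
    flagged = y_prob >= threshold
    caught = int(((y_true == 1) & flagged).sum())
    false_alerts = int(((y_true == 0) & flagged).sum())
    return caught * COST_MISSED_FRAUD - false_alerts * COST_FALSE_POSITIVE

# Sweep candidate thresholds and keep the one that maximizes savings.
thresholds = np.linspace(0.001, 0.5, 500)
best_tau = max(thresholds, key=lambda t: net_savings(y_val, val_probs, t))
```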
#### Top SHAP features by global importance
| Feature | Interpretation |
|---|---|
| `old_balance_dest` | Receiver's balance before the transaction (laundering accounts often start at zero) |
| `amount_to_balance_ratio` | Transactions that move most of the sender's balance |
| `amount` | Raw transaction size |
| `drains_full_balance` | Engineered flag for cash-out patterns |
| `day`, `hour` | Temporal patterns (fraud spikes at specific times) |
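A minimal sketch of how these attributions are computed, assuming `model` is the fitted LightGBM classifier and `X_val` the validation frame (SHAP's `TreeExplainer` only supports tree models, so it runs against the underlying booster rather than the isotonic wrapper):

```python
import numpy as np
import shap

# TreeExplainer produces exact attributions for tree ensembles,
# fast enough to run on every request.
explainer = shap.TreeExplainer(model)
explanation = explainer(X_val)  # one attribution vector per row

# Global importance = mean |SHAP value| per feature across the dataset.
importance = np.abs(explanation.values).mean(axis=0)
ranked = sorted(zip(X_val.columns, importance), key=lambda kv: -kv[1])
for name, value in ranked[:5]:
    print(f"{name}: {value:.4f}")
```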
### The hard engineering decisions
1. Stratified random split, not temporal split. PaySim's `step` column looks like a time index, but the simulator generates a non-uniform fraud distribution: the fraud rate jumps 10x in the last 40% of the timeline, and a naive temporal split caused validation PR-AUC to collapse below 0.01. The decision is documented in `docs/model_card.md` as a defensible portfolio-grade tradeoff; true temporal honesty is deferred to the production drift monitoring system.
2. Dropped sender/receiver aggregate features. The first version included `sender_avg_amount` and `sender_txn_count`. Ablation showed they actively hurt: PR-AUC rose from 0.971 (with aggregates) to 0.993 (without). The simpler model won, and the ablation itself serves as the leakage-free proof.
3. Hidden test set discipline. The test set was never loaded during training or hyperparameter search. It was revealed exactly once, in `scripts/final_eval.py`, to produce the locked metrics in `models/lightgbm_final_test_report.json`.
4. Isotonic calibration via `FrozenEstimator`. Raw LightGBM probabilities are not well-calibrated, which matters when threshold tuning is driven by an explicit cost model. The trained model is wrapped in `CalibratedClassifierCV` with isotonic regression so that a score of 0.8 actually corresponds to roughly an 80% fraud probability.
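A minimal sketch of that calibration step, assuming scikit-learn ≥ 1.6 (where `FrozenEstimator` replaces the older `cv="prefit"` pattern) and a held-out calibration split `X_cal`/`y_cal`:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.frozen import FrozenEstimator

# FrozenEstimator stops CalibratedClassifierCV from refitting the
# already-trained booster, so fit() below trains only the isotonic
# calibrator on the held-out split.
calibrated = CalibratedClassifierCV(FrozenEstimator(model), method="isotonic")
calibrated.fit(X_cal, y_cal)

# After calibration, a score of 0.8 corresponds to roughly an 80%
# empirical fraud probability on data like the calibration split.
fraud_scores = calibrated.predict_proba(X_val)[:, 1]
```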
### Production integration
The model loads into the FastAPI process at startup via a lifespan context manager. Every `/score` request goes through:
- Pydantic schema validation on the request body
- Feature engineering (one-hot encoding, ratio computation, temporal features)
- `model.predict_proba()` for the calibrated score
- `shap_explainer(X)` for per-feature attribution
- Risk band assignment based on the active threshold
- Persistence to the `predictions` table with the explanation as JSONB
The whole path completes in 8.5ms p50, including SHAP. Loading the model once at startup, rather than per request, keeps ~2 seconds of model deserialization out of every prediction.
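A sketch of that startup wiring, with hypothetical artifact paths and a hypothetical `build_feature_row` helper standing in for the real validation and feature-engineering steps:

```python
from contextlib import asynccontextmanager

import joblib
import pandas as pd
import shap
from fastapi import FastAPI, Request

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load artifacts once per process; this is the ~2 seconds that would
    # otherwise be paid on every request. Paths are hypothetical.
    app.state.model = joblib.load("models/lightgbm_calibrated.joblib")
    app.state.explainer = shap.TreeExplainer(
        joblib.load("models/lightgbm_raw.joblib")  # uncalibrated tree model for SHAP
    )
    yield  # serve requests; nothing to tear down on shutdown

app = FastAPI(lifespan=lifespan)

def build_feature_row(txn: dict) -> pd.DataFrame:
    # Hypothetical stand-in for Pydantic validation + feature engineering.
    return pd.DataFrame([txn])

@app.post("/score")
def score(request: Request, txn: dict):
    X = build_feature_row(txn)
    prob = float(request.app.state.model.predict_proba(X)[0, 1])
    attribution = request.app.state.explainer(X).values[0].tolist()
    return {"score": prob, "explanation": attribution}
```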