
# Sentinel — Fraud Detection Platform

Production-grade fraud operations platform with calibrated LightGBM scoring at 8.5ms, SHAP explainability on every prediction, and $1.23M in modeled net savings from cost-aware threshold tuning.

Python 3.12 · FastAPI · LightGBM · SHAP · PostgreSQL 16 · React 19 · TypeScript · Tailwind v4

## Challenges and Solutions

Four real engineering problems that shaped the final design of Sentinel. Each one cost real debugging time and changed how I think about ML systems, leakage discipline, and production-grade engineering tradeoffs.


### 1. Temporal split collapsed validation PR-AUC to near zero

Challenge: PaySim's step column looks like a clean time index. The first split was temporal — train on steps 1–500, validate on 500–700, test on 700–744. Validation PR-AUC came back at 0.008. Something was catastrophically wrong.

Investigation: I plotted the fraud rate per step and discovered that PaySim's simulator generates a highly non-uniform fraud distribution. The last 40% of the timeline contains 10x the fraud rate of the early steps. A temporal split was training the model on almost no fraud and validating on almost all of it.
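A minimal sketch of that check, assuming the raw PaySim frame with its standard `step` and `isFraud` columns (the CSV path is a placeholder):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/raw/paysim.csv")  # placeholder path

# Fraud rate per simulator step: mean of the binary label within each step.
fraud_rate = df.groupby("step")["isFraud"].mean()

fraud_rate.plot(xlabel="step", ylabel="fraud rate",
                title="PaySim fraud rate per step")
plt.show()
```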

Solution: Switched to stratified random split, preserving the global fraud rate (0.13%) in train, validation, and test sets. Validation PR-AUC recovered to 0.993. Documented the decision in docs/model_card.md as a defensible portfolio-grade tradeoff: PaySim's temporal structure isn't realistic enough to make temporal honesty meaningful, so true temporal validation is deferred to the production drift monitoring system.
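A sketch of the split, assuming `X` and `y` are the engineered feature frame and fraud labels; the split ratios and seed here are illustrative, not necessarily the ones Sentinel uses:

```python
from sklearn.model_selection import train_test_split

# First carve off train, then split the remainder into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# Each split now carries the same ~0.13% fraud rate as the full dataset.
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, f"{labels.mean():.4%}")
```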

What I learned: Time-series-like features aren't automatically time-series problems. Always plot the label distribution against the candidate time axis before assuming a temporal split is correct. The drift monitoring system has to do the temporal work that the static evaluation can't.


### 2. Aggregate features looked helpful but were actually noise

Challenge: The first feature set included sender_avg_amount, sender_txn_count, and similar aggregate features computed across the sender's history. They felt useful — domain knowledge says fraud patterns differ from typical user behavior. PR-AUC with aggregates: 0.971.

Suspicion: Aggregate features risk label leakage if computed across the full dataset. The sender's average amount includes the transactions you're about to predict on. Even if the feature is technically computable at scoring time, the model can use it to memorize patterns from the train set in a way that doesn't generalize.
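For context, this is the usual way such aggregates get computed, and why the current row leaks into its own feature (column names follow PaySim; Sentinel's actual feature code may differ):

```python
# Leaky version: the group statistics are computed over the sender's entire
# history, so each row's own amount contributes to its own feature.
df["sender_avg_amount"] = df.groupby("nameOrig")["amount"].transform("mean")
df["sender_txn_count"] = df.groupby("nameOrig")["amount"].transform("count")
```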

Solution: Ran a clean ablation — trained the identical pipeline with and without aggregates. PR-AUC without aggregates went up to 0.993. The aggregates weren't helping. They were either pure noise or the leakage was hurting test-set generalization. The simpler model won.
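A sketch of what that ablation looks like, reusing the split from above; the hyperparameters and the aggregate column list are illustrative:

```python
import lightgbm as lgb
from sklearn.metrics import average_precision_score

AGG_COLS = ["sender_avg_amount", "sender_txn_count"]

def fit_and_score(cols):
    # Identical pipeline both times; only the feature list changes.
    model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(X_train[cols], y_train)
    scores = model.predict_proba(X_val[cols])[:, 1]
    return average_precision_score(y_val, scores)  # PR-AUC

all_cols = list(X_train.columns)
base_cols = [c for c in all_cols if c not in AGG_COLS]

print("with aggregates:   ", round(fit_and_score(all_cols), 3))
print("without aggregates:", round(fit_and_score(base_cols), 3))
```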

What I learned: Always ablate features you suspect might leak. The ablation itself serves as the leakage-free proof. "Domain knowledge says this should help" is not evidence — controlled experiments are evidence. And when a simpler model performs better, ship the simpler model.


### 3. Raw LightGBM probabilities lied to the threshold tuner

Challenge: The cost-aware threshold tuner depends on the model's probability output being calibrated — a score of 0.8 should mean roughly an 80% probability of fraud. Raw LightGBM outputs are not calibrated by default; tree ensembles are often overconfident on extreme cases and underconfident in the middle.

When I plotted predicted probability vs. actual fraud rate per probability bucket, the calibration curve was visibly wrong — the model said 70% probability for transactions that were fraud 90% of the time.
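That diagnostic is a few lines with scikit-learn's `calibration_curve`, assuming `model` is the trained LightGBM classifier and `X_val`/`y_val` are the validation split:

```python
from sklearn.calibration import calibration_curve

raw_scores = model.predict_proba(X_val)[:, 1]

# Bucket predictions and compare the mean predicted probability in each
# bucket to the fraud rate actually observed inside it.
observed, predicted = calibration_curve(y_val, raw_scores,
                                        n_bins=10, strategy="quantile")
for p, o in zip(predicted, observed):
    print(f"predicted {p:.2f} -> observed {o:.2f}")
```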

Solution: Wrapped the trained model in CalibratedClassifierCV with isotonic regression, using FrozenEstimator to avoid retraining the base model. The calibration step uses the validation set, fits a monotonic mapping from raw scores to calibrated probabilities, and applies it at inference time.
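A minimal sketch of that wrapping, assuming scikit-learn ≥ 1.6 (where `FrozenEstimator` lives in `sklearn.frozen`):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.frozen import FrozenEstimator  # scikit-learn >= 1.6

# Freeze the already-trained LightGBM model so only the isotonic mapping is
# fit, on the validation set, without retraining the trees.
calibrated = CalibratedClassifierCV(FrozenEstimator(model), method="isotonic")
calibrated.fit(X_val, y_val)

calibrated_scores = calibrated.predict_proba(X_val)[:, 1]
```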

After calibration, the reliability diagram is nearly diagonal. The cost-aware threshold tuner now works — moving τ from 0.01 to 0.02 actually corresponds to a one-percentage-point shift in fraud probability, not an opaque score change.
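For illustration, this is the shape of a cost-aware threshold sweep; the cost constants and the use of the transaction amount as the miss cost are placeholders, not Sentinel's actual cost model:

```python
import numpy as np

REVIEW_COST = 5.0                       # flat cost of an analyst review (placeholder)
y = np.asarray(y_val)                   # 1 = fraud
amounts = np.asarray(X_val["amount"])   # transaction amounts
scores = calibrated_scores              # calibrated probabilities from above

def expected_cost(tau: float) -> float:
    flagged = scores >= tau
    missed_fraud = amounts[(~flagged) & (y == 1)].sum()        # fraud let through
    wasted_reviews = REVIEW_COST * np.sum(flagged & (y == 0))  # false alarms
    return missed_fraud + wasted_reviews

taus = np.linspace(0.005, 0.5, 200)
best_tau = min(taus, key=expected_cost)
print(f"cost-optimal threshold: {best_tau:.3f}")
```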

What I learned: Calibration matters whenever downstream decisions use probabilities directly (cost models, ensemble blending, risk-band assignment). It's a 30-line fix that takes the model from "reports scores" to "reports probabilities you can reason about."


### 4. Hidden test set discipline under iteration pressure

Challenge: During hundreds of training runs across hyperparameter search, ablation studies, and feature engineering iterations, the temptation to peek at test set performance is constant. Every time validation looks good, the question is "how does this look on test?" One peek and the test set becomes a second validation set — the leakage is subtle but real.

Solution: Architectural separation. The test set lives in a separate file that's loaded exactly once, by scripts/final_eval.py, which I committed to running only after the final model was selected. Training scripts physically can't load the test set — they don't know the path.

The locked metrics in models/lightgbm_final_test_report.json are timestamped and committed. They're the ground truth.
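A rough sketch of the shape of such a script; beyond the two filenames mentioned above, the paths, model artifact name, and report fields are assumptions:

```python
# scripts/final_eval.py: the only code path that knows where the test set lives.
import json
from datetime import datetime, timezone

import joblib
import pandas as pd
from sklearn.metrics import average_precision_score

TEST_PATH = "data/holdout/test.parquet"          # referenced nowhere else
MODEL_PATH = "models/lightgbm_final.joblib"
REPORT_PATH = "models/lightgbm_final_test_report.json"

def main() -> None:
    test = pd.read_parquet(TEST_PATH)
    model = joblib.load(MODEL_PATH)

    scores = model.predict_proba(test.drop(columns=["isFraud"]))[:, 1]
    report = {
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
        "pr_auc": average_precision_score(test["isFraud"], scores),
    }
    with open(REPORT_PATH, "w") as f:
        json.dump(report, f, indent=2)

if __name__ == "__main__":
    main()
```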

What I learned: Discipline isn't a personality trait — it's an architecture. If the test set is reachable from the training script, it will eventually be touched. The right defense is making it physically unreachable from any code path that runs during iteration. This is the same principle as private fields and access modifiers — make the wrong thing hard to do.