Machine Learning Deep Dive

The problem

Given the last N hours of incident data for a Chicago police beat, predict whether the next reporting window will exceed the beat's historical median for a given crime category. This is framed as a binary classification (above/below median), not a regression on raw counts, because the absolute counts are small enough at the beat level that count regression is dominated by Poisson noise.
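
A minimal sketch of the labeling rule described above, assuming "exceed" means strictly greater than the median, so that a window tied with the median counts as not elevated. The function name and the sample counts are hypothetical and not taken from the project.

```python
from statistics import median


def above_median_label(history: list[int], next_count: int) -> int:
    """Return 1 if next_count is strictly above the median of history, else 0."""
    if not history:
        raise ValueError("history must contain at least one past window")
    return int(next_count > median(history))


# Hypothetical per-window counts for one beat and one crime category.
past_counts = [0, 1, 0, 2, 1, 0, 3, 1]  # median is 1.0

# Label three candidate next-window counts: 0, 1 and 2 incidents.
labels = [above_median_label(past_counts, c) for c in (0, 1, 2)]
print(labels)
```

With small integer counts, ties with the median are common, so how ties are handled has a visible effect on the positive rate.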

Models

Four XGBoost classifiers, one per high-frequency crime category:

Model	Crime category	Beat-level positive rate
`theft_classifier`	Theft	~24% above-median
`battery_classifier`	Battery	~22% above-median
`burglary_classifier`	Burglary	~18% above-median
`motor_theft_classifier`	Motor vehicle theft	~16% above-median

Plus an ensemble that averages the four predicted probabilities for the methodology page's predicted-vs-actual hot spot comparison.
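
The ensemble step can be sketched as a plain unweighted mean of the four per-category probabilities. This assumes equal weights, which is the simplest reading of "averages"; the function name and the probabilities below are made up for illustration.

```python
def ensemble_probability(probs: dict[str, float]) -> float:
    """Unweighted mean of the per-category predicted probabilities."""
    if not probs:
        raise ValueError("need at least one model probability")
    return sum(probs.values()) / len(probs)


# Hypothetical predicted probabilities for a single beat and window.
beat_probs = {
    "theft_classifier": 0.30,
    "battery_classifier": 0.20,
    "burglary_classifier": 0.10,
    "motor_theft_classifier": 0.40,
}
score = ensemble_probability(beat_probs)
print(round(score, 3))
```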

Features

Feature group	Examples
Temporal	hour of day, day of week, week of year, is_weekend, is_holiday
Beat-level lag	counts in the last 1h / 6h / 24h / 7d / 30d at this beat
Beat-level baseline	beat's historical median count for this hour-of-week
Neighbor effects	rolling counts in adjacent beats (PostGIS `ST_Touches`)
Seasonal	month-of-year, year-over-year baseline

No demographic features. No socioeconomic indicators. The model has access only to temporal and spatial-temporal aggregates of the public incident data itself; see ethics.md for why.
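
As an illustration of the beat-level lag row in the feature table, the sketch below counts incidents in trailing windows that end at the prediction time. It assumes the windows are half-open (incidents at or after the prediction time are excluded, so nothing from the future leaks in). The feature names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Trailing windows from the feature table: 1h / 6h / 24h / 7d / 30d.
LAG_WINDOWS = {
    "lag_1h": timedelta(hours=1),
    "lag_6h": timedelta(hours=6),
    "lag_24h": timedelta(hours=24),
    "lag_7d": timedelta(days=7),
    "lag_30d": timedelta(days=30),
}


def lag_counts(incident_times: list[datetime], as_of: datetime) -> dict[str, int]:
    """Count incidents in each window [as_of - width, as_of) for one beat."""
    return {
        name: sum(1 for t in incident_times if as_of - width <= t < as_of)
        for name, width in LAG_WINDOWS.items()
    }


as_of = datetime(2024, 3, 10, 12, 0)
incidents = [
    datetime(2024, 3, 10, 11, 30),  # 30 minutes before as_of
    datetime(2024, 3, 10, 8, 0),    # 4 hours before
    datetime(2024, 3, 9, 20, 0),    # 16 hours before
    datetime(2024, 3, 5, 12, 0),    # 5 days before
    datetime(2024, 2, 20, 12, 0),   # 19 days before
    datetime(2024, 3, 10, 12, 30),  # after as_of, must be ignored
]
features = lag_counts(incidents, as_of)
print(features)
```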

Hyperparameter tuning

Optuna, 100 trials per model. Search space:

Hyperparameter	Range
`n_estimators`	100 to 1000
`max_depth`	3 to 12
`learning_rate`	0.01 to 0.3 (log scale)
`subsample`	0.6 to 1.0
`colsample_bytree`	0.6 to 1.0
`min_child_weight`	1 to 10
`reg_alpha`	1e-8 to 10 (log scale)
`reg_lambda`	1e-8 to 10 (log scale)
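
The search space above could be expressed with Optuna's `suggest_int` and `suggest_float` trial methods roughly as follows. This is a sketch, not the project's tuning code: the function name is hypothetical, and treating `min_child_weight` as an integer is an assumption, since the table does not say whether it is sampled as an integer or a float.

```python
def suggest_xgb_params(trial) -> dict:
    """Sample one XGBoost configuration from the ranges in the table.

    `trial` is expected to be an optuna.Trial; only its suggest_int and
    suggest_float methods are used here.
    """
    return {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        # log=True samples uniformly in log space, matching "(log scale)".
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        # Assumption: sampled as an integer.
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
    }
```

Inside an Optuna objective function this would be called as `params = suggest_xgb_params(trial)`, with the study created via `optuna.create_study(direction="maximize")` and run with `study.optimize(objective, n_trials=100)`.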

Objective: maximize PR-AUC on a temporal holdout (most recent 6 months held out, never used during tuning). PR-AUC is preferred over ROC-AUC because the class imbalance is moderate and the cost of a false positive (drawing attention to a beat that isn't actually elevated) matters more than ROC-AUC implies.
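
For readers unfamiliar with the metric, PR-AUC is often estimated as average precision: the mean of the precision values at the rank of each true positive. The sketch below is a dependency-free illustration that assumes no tied scores. Whether the project uses this estimator or a trapezoidal area under the precision-recall curve is not stated, and the labels and scores are invented.

```python
def average_precision(y_true: list[int], y_score: list[float]) -> float:
    """Average precision for binary labels, assuming all scores are distinct."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank examples from highest to lowest predicted score.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Hypothetical holdout labels and predicted probabilities.
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.2]
ap = average_precision(y_true, y_score)
print(round(ap, 4))
```

A classifier with no skill scores roughly the positive rate on this metric, so PR-AUC values are best read against each category's base rate rather than against 0.5.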

Six experiments that shaped the final design

Experiment 1: count regression vs. above-median classification. Tried regressing raw next-window counts first. RMSE was dominated by Poisson noise at the beat level: small counts mean huge relative error. Switched to binary classification against the beat's own historical median. This gave a clearer signal and a more useful product: the answer to "is this beat elevated?" matters more than exactly how many incidents will occur.
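
To make the relative-error point concrete: for a Poisson count with mean λ, the standard deviation is √λ, so the noise relative to the mean is 1/√λ. The short illustration below uses invented mean counts; the function name is hypothetical.

```python
import math


def poisson_relative_noise(mean_count: float) -> float:
    """Standard deviation divided by mean for a Poisson count: 1 / sqrt(mean)."""
    if mean_count <= 0:
        raise ValueError("mean_count must be positive")
    return 1.0 / math.sqrt(mean_count)


# A beat averaging a quarter of an incident per window vs. an aggregate of 100.
beat_level = poisson_relative_noise(0.25)
city_level = poisson_relative_noise(100.0)
print(beat_level, city_level)
```

At a mean of 0.25 incidents per window the noise is twice the size of the signal, while at a mean of 100 it is a tenth of it.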

Experiment 2: single combined model vs. per-category models. Tried one model with `crime_category` as a categorical feature. The four-model setup beat it on every category's holdout PR-AUC, which suggests that the features predicting burglary patterns differ from those predicting motor vehicle theft.

Experiment 3: temporal split discipline. The first version used a random train/test split. PR-AUC looked great (0.78), but the random split let the model train on incidents from the same period it was tested on, so it was scoring seasonal patterns it would never see ahead of time in production. Switched to a strict temporal split with the most recent 6 months held out, and PR-AUC dropped to a more honest 0.62. The temporal split is the real benchmark.
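
A strict temporal split of the kind described in Experiment 3 can be sketched as a single cutoff date: everything before it is available for training, everything at or after it is holdout. The row layout, the `window_start` key and the dates are hypothetical.

```python
from datetime import datetime


def temporal_split(rows: list[dict], cutoff: datetime) -> tuple[list[dict], list[dict]]:
    """Split rows so that training data strictly precedes the holdout period."""
    train = [r for r in rows if r["window_start"] < cutoff]
    holdout = [r for r in rows if r["window_start"] >= cutoff]
    return train, holdout


# Hypothetical labeled windows; the last 6 months start on 2024-01-01.
rows = [
    {"window_start": datetime(2023, 5, 1), "label": 0},
    {"window_start": datetime(2023, 11, 15), "label": 1},
    {"window_start": datetime(2024, 1, 1), "label": 0},
    {"window_start": datetime(2024, 6, 30), "label": 1},
]
train, holdout = temporal_split(rows, cutoff=datetime(2024, 1, 1))
print(len(train), len(holdout))
```

Unlike a random split, no training row can postdate a holdout row, which is what keeps the holdout score comparable to production conditions.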

Experiment 4: neighbor-beat features. Added rolling counts from adjacent beats (PostGIS `ST_Touches`). PR-AUC improved by 0.03 across all four models. Crime patterns spill across beat boundaries, and the model picks that up.
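
Once beat adjacency is known, the neighbor feature reduces to summing the rolling counts of the touching beats. The sketch below assumes the adjacency map has already been derived (in the project this comes from PostGIS `ST_Touches`); the beat IDs, counts and function name are hypothetical.

```python
def neighbor_count(beat: str, adjacency: dict[str, set[str]], counts: dict[str, int]) -> int:
    """Sum a rolling count over the beats adjacent to `beat`, excluding itself."""
    return sum(counts.get(neighbor, 0) for neighbor in adjacency.get(beat, ()))


# Hypothetical adjacency (which beats share a boundary) and 24h counts.
adjacency = {
    "0111": {"0112", "0113"},
    "0112": {"0111"},
    "0113": {"0111"},
}
counts_24h = {"0111": 4, "0112": 2, "0113": 5}

neighbor_lag_24h = neighbor_count("0111", adjacency, counts_24h)
print(neighbor_lag_24h)
```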

Experiment 5: demographic features ablation. Tested adding census tract demographics (income, racial composition, age distribution) as a deliberate exercise to demonstrate why they're excluded from the production model. PR-AUC barely moved (+0.005), and feature importance showed the model could now learn proxies for race even though race wasn't a direct feature. Stripped them out and documented the decision in ethics.md.

Experiment 6: the data ceiling finding. After extensive tuning, all four models converge to PR-AUC in the 0.58 to 0.65 range and plateau. More trees, deeper trees, more features don't help. The honest interpretation: reported crime is a noisy signal of underlying criminal activity, and the noise floor is the data ceiling. A better model can't fix the data. This is the most important finding in the whole project and it's documented prominently on the methodology page.

Final metrics (temporal holdout)

Model	PR-AUC	ROC-AUC
Theft	0.64	0.81
Battery	0.61	0.79
Burglary	0.58	0.77
Motor vehicle theft	0.65	0.82

Good enough to surface real patterns at the beat level. Not good enough to drive enforcement decisions, and the product is explicit about that.

SHAP integration

Every prediction stored in the `predictions` table includes its SHAP attribution as JSONB. The frontend's prediction view renders the top contributing features with their attribution values, so users can see which features drove each prediction. No black-box outputs.
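
One way the "top contributing features" payload might be built is to rank features by absolute SHAP value and serialize the top few as JSON. This is a sketch only: the JSON shape, the feature names and the attribution values are assumptions, and the project's actual JSONB schema is not shown in this document.

```python
import json


def top_attributions(feature_names: list[str], shap_values: list[float], k: int = 3) -> list[dict]:
    """Return the k features with the largest absolute SHAP attribution."""
    if len(feature_names) != len(shap_values):
        raise ValueError("feature_names and shap_values must align")
    ranked = sorted(zip(feature_names, shap_values), key=lambda p: abs(p[1]), reverse=True)
    return [{"feature": name, "shap": float(value)} for name, value in ranked[:k]]


# Hypothetical attribution vector for a single prediction.
names = ["lag_24h", "hour_of_day", "neighbor_lag_24h", "is_weekend"]
values = [0.42, -0.10, 0.25, 0.03]

payload = json.dumps(top_attributions(names, values, k=2))
print(payload)
```

Ranking by absolute value keeps strongly negative contributions visible, which matters when the view is meant to explain why a beat was not flagged as well as why it was.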

TreeExplainer is fast enough to compute attributions inline during the prediction request (under 50 ms for the full attribution on a single prediction). Loading the explainer once at FastAPI startup amortizes the model-load cost.

CommunityShield