CommunityShield
ML-powered crime pattern explorer for Chicago. 8.5M rows, 4 XGBoost models with SHAP explanations, beat-level heatmap, and an honest methodology page about what the data can and cannot tell you.
Machine Learning Deep Dive
The problem
Given the last N hours of incident data for a Chicago police beat, predict whether the next reporting window will exceed the beat's historical median for a given crime category. This is framed as a binary classification (above/below median), not a regression on raw counts, because the absolute counts are small enough at the beat level that count regression is dominated by Poisson noise.
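To make the framing concrete, here is a minimal sketch of the above-median target and of why small counts are noisy. The function names and the sample counts are hypothetical; the only facts relied on are the "exceeds the historical median" rule stated above and the standard Poisson property that the standard deviation equals the square root of the mean.

```python
from math import sqrt
from statistics import median


def above_median_labels(history, upcoming):
    """Return 1 for each upcoming window whose count exceeds the median of `history`, else 0."""
    baseline = median(history)
    return [1 if count > baseline else 0 for count in upcoming]


def poisson_relative_noise(mean_count):
    """Relative standard deviation of a Poisson count: sqrt(mean) / mean = 1 / sqrt(mean)."""
    return 1.0 / sqrt(mean_count)


# Hypothetical per-window counts for one beat and one crime category.
history = [0, 1, 1, 2, 0, 3, 1]
upcoming = [0, 1, 2, 4]

labels = above_median_labels(history, upcoming)
noise_at_mean_4 = poisson_relative_noise(4)
```

A window that merely equals the median is labeled 0 here, matching the "exceed" wording. The relative noise shrinks only with the square root of the mean, which is why beat-level counts of a few incidents per window are hard to regress on directly.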
Models
Four XGBoost classifiers, one per high-frequency crime category:
| Model | Crime category | Beat-level positive rate |
|---|---|---|
| theft_classifier | Theft | ~24% above-median |
| battery_classifier | Battery | ~22% above-median |
| burglary_classifier | Burglary | ~18% above-median |
| motor_theft_classifier | Motor vehicle theft | ~16% above-median |
Plus an ensemble that averages the four predicted probabilities for the methodology page's predicted-vs-actual hot spot comparison.
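The averaging step is simple enough to sketch. This assumes an unweighted mean of the four per-model probabilities for the same beat and window, which is how the description above reads; the function name and the sample probabilities are hypothetical.

```python
def ensemble_probability(per_model_probs):
    """Unweighted mean of the per-category predicted probabilities."""
    values = list(per_model_probs.values())
    if not values:
        raise ValueError("need at least one model probability")
    return sum(values) / len(values)


# Hypothetical predictions for one beat and one reporting window.
probs = {
    "theft_classifier": 0.30,
    "battery_classifier": 0.20,
    "burglary_classifier": 0.10,
    "motor_theft_classifier": 0.40,
}
ensemble = ensemble_probability(probs)
```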
Features
| Feature group | Examples |
|---|---|
| Temporal | hour of day, day of week, week of year, is_weekend, is_holiday |
| Beat-level lag | counts in the last 1h / 6h / 24h / 7d / 30d at this beat |
| Beat-level baseline | beat's historical median count for this hour-of-week |
| Neighbor effects | rolling counts in adjacent beats (PostGIS ST_Touches) |
| Seasonal | month-of-year, year-over-year baseline |
No demographic features. No socioeconomic indicators. The model has access only to temporal and spatial-temporal aggregates of the public incident data itself — see ethics.md for why.
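As an illustration of the beat-level lag group, the sketch below counts incidents in each trailing window before a reference time. The feature names and the in-memory list of timestamps are assumptions made for the example; the project's actual pipeline may compute these in the database instead.

```python
from datetime import datetime, timedelta

# Trailing windows from the feature table; the feature names are hypothetical.
LAG_WINDOWS = {
    "count_1h": timedelta(hours=1),
    "count_6h": timedelta(hours=6),
    "count_24h": timedelta(hours=24),
    "count_7d": timedelta(days=7),
    "count_30d": timedelta(days=30),
}


def lag_counts(incident_times, as_of):
    """Count incidents in each trailing window, strictly before `as_of`."""
    features = {}
    for name, window in LAG_WINDOWS.items():
        start = as_of - window
        # Excluding `as_of` itself keeps the prediction window out of its own features.
        features[name] = sum(1 for t in incident_times if start <= t < as_of)
    return features


as_of = datetime(2024, 3, 10, 12, 0)
incidents = [
    datetime(2024, 3, 10, 11, 30),
    datetime(2024, 3, 10, 8, 0),
    datetime(2024, 3, 9, 20, 0),
    datetime(2024, 3, 5, 12, 0),
    datetime(2024, 2, 20, 12, 0),
    datetime(2024, 3, 10, 12, 0),  # at the reference time, so not counted
    datetime(2024, 1, 1, 0, 0),    # older than 30 days, so not counted
]
features = lag_counts(incidents, as_of)
```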
Hyperparameter tuning
Optuna, 100 trials per model. Search space:
| Hyperparameter | Range |
|---|---|
| n_estimators | 100 – 1000 |
| max_depth | 3 – 12 |
| learning_rate | 0.01 – 0.3 (log scale) |
| subsample | 0.6 – 1.0 |
| colsample_bytree | 0.6 – 1.0 |
| min_child_weight | 1 – 10 |
| reg_alpha | 1e-8 – 10 (log scale) |
| reg_lambda | 1e-8 – 10 (log scale) |
Objective: maximize PR-AUC on a temporal holdout (most recent 6 months held out, never used during tuning). PR-AUC over ROC-AUC because the class imbalance is moderate and the cost of a false positive (drawing attention to a beat that isn't actually elevated) matters more than ROC-AUC implies.
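The search space above maps onto Optuna's trial API roughly as follows. This is a sketch, not the project's tuning script: the function name is hypothetical, and treating `min_child_weight` as an integer is an assumption, since the table gives only its range. The function takes any object exposing `suggest_int` and `suggest_float`, so it does not import Optuna itself.

```python
def suggest_xgb_params(trial):
    """Sample one XGBoost configuration from the documented search space."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        # Assumption: integer-valued; the table does not say int or float.
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
    }
```

In a real run this would be called inside an objective function that trains a classifier with the sampled parameters and returns its PR-AUC, with the study created via `optuna.create_study(direction="maximize")` and run via `study.optimize(objective, n_trials=100)`.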
Six experiments that shaped the final design
Experiment 1: count regression vs. above-median classification. Tried regressing raw next-window counts first. RMSE was dominated by Poisson noise at the beat level: small counts mean huge relative error. Switched to binary classification against the beat's own historical median. Clearer signal, more useful product (the answer to "is this beat elevated?" matters more than exactly how many incidents will occur).
Experiment 2: citywide vs. per-category models. Tried one model with crime_category as a categorical feature. The four-model setup beat it on every category's holdout PR-AUC. The features that predict burglary patterns are genuinely different from the features that predict motor vehicle theft.
Experiment 3: temporal split discipline. The first version used a random train/test split. PR-AUC looked great (0.78), but follow-up analysis showed the model was exploiting seasonal patterns from periods it would not yet have seen at production time. Switched to a strict temporal split (most recent 6 months held out) and PR-AUC dropped to a more honest 0.62. The temporal split is the real benchmark.
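A strict temporal split can be sketched in a few lines. The row layout and the function name are hypothetical, and approximating "6 months" as 183 days is an assumption made for the example.

```python
from datetime import datetime, timedelta


def temporal_split(rows, holdout_days=183):
    """Split (timestamp, label) rows so the holdout is strictly later than the training set."""
    latest = max(ts for ts, _ in rows)
    cutoff = latest - timedelta(days=holdout_days)
    train = [row for row in rows if row[0] <= cutoff]
    holdout = [row for row in rows if row[0] > cutoff]
    return train, holdout


# Hypothetical rows: (window start, above-median label).
rows = [
    (datetime(2023, 1, 1), 0),
    (datetime(2023, 6, 1), 1),
    (datetime(2023, 12, 1), 0),
    (datetime(2024, 1, 1), 1),
]
train, holdout = temporal_split(rows)
```

Unlike a random split, no training row here is later than any holdout row, so the evaluation mimics predicting a period the model has not seen.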
Experiment 4: neighbor-beat features. Added rolling counts from adjacent beats (PostGIS ST_Touches). PR-AUC improved by 0.03 across all four models. Crime patterns spill across beat boundaries, and the model picks that up.
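The neighbor feature amounts to summing each beat's adjacent-beat counts over the same window. The sketch below assumes the adjacency has already been extracted (for example from the PostGIS ST_Touches relation) into a plain dictionary; the beat IDs and the function name are hypothetical.

```python
def neighbor_counts(beat_counts, adjacency):
    """For each beat, sum the window counts of the beats that touch it."""
    return {
        beat: sum(beat_counts.get(neighbor, 0) for neighbor in adjacency.get(beat, ()))
        for beat in beat_counts
    }


# Hypothetical counts over one rolling window, and a hypothetical adjacency map.
beat_counts = {"A": 3, "B": 1, "C": 0}
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

neighbor_features = neighbor_counts(beat_counts, adjacency)
```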
Experiment 5: demographic features ablation. Tested adding census tract demographics — income, racial composition, age distribution — as a deliberate exercise to demonstrate why they're excluded from the production model. PR-AUC barely moved (+0.005), and feature importance showed the model could now learn proxies for race even though race wasn't a direct feature. Stripped them out and documented the decision in ethics.md.
Experiment 6: the data ceiling finding. After extensive tuning, all four models converge to PR-AUC in the 0.58 – 0.65 range and plateau. More trees, deeper trees, more features don't help. The honest interpretation: reported crime is a noisy signal of underlying criminal activity, and the noise floor is the data ceiling. A better model can't fix the data. This is the most important finding in the whole project and it's documented prominently on the methodology page.
Final metrics (temporal holdout)
| Model | PR-AUC | ROC-AUC |
|---|---|---|
| Theft | 0.64 | 0.81 |
| Battery | 0.61 | 0.79 |
| Burglary | 0.58 | 0.77 |
| Motor vehicle theft | 0.65 | 0.82 |
Good enough to surface real patterns at the beat level. Not good enough to drive enforcement decisions, and the product is explicit about that.
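For readers less familiar with the metric, here is a minimal sketch of average precision, one common way to compute PR-AUC. Whether the project uses this exact estimator is an assumption; the function name and sample data are hypothetical. A useful reference point when reading the table: a ranker with no skill scores roughly the positive rate, so the PR-AUC values should be compared with the 16–24% positive rates listed earlier rather than with 0.5.

```python
def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive, highest scores first."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positive labels")
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if y_true[index] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / total_positives


# Hypothetical labels and scores with no tied scores.
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(y_true, y_score)
```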
SHAP integration
Every prediction stored in the predictions table includes its SHAP attribution as JSONB. The frontend's prediction view renders the top contributing features with their attribution values — users see exactly why the model predicted what it did. No black-box outputs.
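The storage and display steps can be sketched without the SHAP library itself. The example assumes the attribution is held as a flat mapping of feature name to SHAP value, that the frontend ranks features by absolute contribution, and that the JSONB column receives a JSON string; the feature names, values, and function names are all hypothetical.

```python
import json


def top_attributions(attributions, k=3):
    """Return the k features with the largest absolute attribution, largest first."""
    ranked = sorted(attributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


def to_jsonb_payload(attributions):
    """Serialize attributions to a JSON string suitable for a JSONB column."""
    return json.dumps(attributions, sort_keys=True)


# Hypothetical attribution for one prediction.
attributions = {
    "count_24h": 0.42,
    "hour_of_day": -0.31,
    "is_weekend": 0.05,
    "neighbor_count_6h": -0.12,
}
top_two = top_attributions(attributions, k=2)
payload = to_jsonb_payload(attributions)
```

Ranking by absolute value keeps features that push the prediction down as visible as those that push it up.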
TreeExplainer is fast enough to compute attributions inline during the prediction request — sub-50ms for the full attribution on a single prediction. Loading the explainer once at FastAPI startup amortizes the model-load cost.