CommunityShield
ML-powered crime pattern explorer for Chicago. 8.5M rows, 4 XGBoost models with SHAP explanations, beat-level heatmap, and an honest methodology page about what the data can and cannot tell you.
Database Design
The schema is built around two ideas: raw incident data is immutable, and aggregations are precomputed. The combination delivers sub-100ms heatmap queries against an 8.5M-row dataset without resorting to caching layers or denormalization gymnastics.
Schema
-- Raw incident data
crimes (
id BIGSERIAL PRIMARY KEY,
case_number TEXT UNIQUE NOT NULL,
occurred_at TIMESTAMPTZ NOT NULL,
primary_type TEXT NOT NULL,
description TEXT,
location_description TEXT,
arrest BOOLEAN,
domestic BOOLEAN,
beat_id INT REFERENCES beats(id),
district INT,
ward INT,
community_area INT,
fbi_code TEXT,
latitude DOUBLE PRECISION,
longitude DOUBLE PRECISION,
location GEOMETRY(Point, 4326),
imported_at TIMESTAMPTZ DEFAULT NOW()
);
-- Police beat polygons
beats (
id INT PRIMARY KEY,
beat_number TEXT UNIQUE NOT NULL,
district INT NOT NULL,
geometry GEOMETRY(Polygon, 4326) NOT NULL
);
-- Pre-aggregated rollups (7.8M rows across all buckets)
rollups_hourly (beat_id, hour_bucket TIMESTAMPTZ, crime_category, count, PRIMARY KEY (beat_id, hour_bucket, crime_category))
rollups_daily (beat_id, day_bucket DATE, crime_category, count, PRIMARY KEY (beat_id, day_bucket, crime_category))
rollups_weekly (beat_id, week_bucket DATE, crime_category, count, PRIMARY KEY (beat_id, week_bucket, crime_category))
rollups_monthly (beat_id, month_bucket DATE, crime_category, count, PRIMARY KEY (beat_id, month_bucket, crime_category))
-- Model predictions cache
predictions (
id BIGSERIAL PRIMARY KEY,
beat_id INT REFERENCES beats(id),
prediction_window DATERANGE NOT NULL,
model_version TEXT NOT NULL,
predicted_count INT,
predicted_probability DOUBLE PRECISION,
shap_attribution JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Key design decisions
Raw crimes table is append-only. Once an incident is ingested, the row is never updated. Corrections to Chicago's data portal arrive as new rows with the same case_number, and the unique constraint forces conflict resolution at ETL time, not at query time. This makes the table cache-friendly and time-travel queryable.
Four rollup granularities (hourly, daily, weekly, monthly) because the frontend's time-window selector spans "last 6 hours" to "last 5 years." Hitting the hourly table for a 5-year query would be slow; hitting the monthly table for a 6-hour query would have zero resolution. The query layer picks the right rollup based on window length: ≤72h hits hourly, ≤90d hits daily, ≤2y hits weekly, longer hits monthly.
GIST spatial index on beats.geometry and crimes.location. Point-in-polygon assignments (ST_Contains) run in milliseconds. The ETL pipeline uses this to populate crimes.beat_id at ingest, so the rollup queries never need to do spatial joins — they're already keyed by beat_id.
predictions.shap_attribution as JSONB, not a wide column or a separate table. Different model versions return different numbers of features, and the SHAP output shape evolves with the feature set. JSONB lets the schema absorb that without migration churn.
prediction_window as DATERANGE instead of two timestamp columns. PostgreSQL's range types compose cleanly with && (overlap) and @> (contains), so "what prediction was active for date X" is one operator, not a BETWEEN with edge-case bugs.
No foreign key from predictions to a model registry table. Model versions are tracked as strings (v1.2.3-xgboost-burglary) rather than a normalized lookup. The cost of a string column is trivial compared to the operational simplicity of not needing to insert a model registry row before logging a prediction.
ETL pipeline
Daily ingest from Chicago's Socrata API. The pipeline:
- Pulls incidents updated since the last successful run (idempotent via
case_number) - Spatial-joins each incident to its
beat_idusing PostGISST_Contains - Upserts into
crimeswithON CONFLICT (case_number) DO UPDATE - Incrementally updates the four rollup tables for affected (beat, time bucket, category) tuples
- Logs ingest metrics to a separate
etl_runsaudit table
The rollup update is incremental, not full-rebuild. A run that ingests 5,000 new rows updates roughly 5,000 rollup row counts, not 7.8M.