rpmjp/portfolio
rpmjp/projects/communityshield/database-design.md
CompletedMay – August 2025

CommunityShield

ML-powered crime pattern explorer for Chicago. 8.5M rows, 4 XGBoost models with SHAP explanations, beat-level heatmap, and an honest methodology page about what the data can and cannot tell you.

Python 3.12FastAPIPostgreSQL 16PostGISXGBoostSHAPReact 19MapLibre GL
Languages
TypeScript52.4%
Python41.8%
CSS3.2%
Other2.6%
database-design.md

Database Design

The schema is built around two ideas: raw incident data is immutable, and aggregations are precomputed. The combination delivers sub-100ms heatmap queries against an 8.5M-row dataset without resorting to caching layers or denormalization gymnastics.


Schema

-- Raw incident data
crimes (
  id BIGSERIAL PRIMARY KEY,
  case_number TEXT UNIQUE NOT NULL,
  occurred_at TIMESTAMPTZ NOT NULL,
  primary_type TEXT NOT NULL,
  description TEXT,
  location_description TEXT,
  arrest BOOLEAN,
  domestic BOOLEAN,
  beat_id INT REFERENCES beats(id),
  district INT,
  ward INT,
  community_area INT,
  fbi_code TEXT,
  latitude DOUBLE PRECISION,
  longitude DOUBLE PRECISION,
  location GEOMETRY(Point, 4326),
  imported_at TIMESTAMPTZ DEFAULT NOW()
);

-- Police beat polygons
beats (
  id INT PRIMARY KEY,
  beat_number TEXT UNIQUE NOT NULL,
  district INT NOT NULL,
  geometry GEOMETRY(Polygon, 4326) NOT NULL
);

-- Pre-aggregated rollups (7.8M rows across all buckets)
rollups_hourly  (beat_id, hour_bucket TIMESTAMPTZ, crime_category, count, PRIMARY KEY (beat_id, hour_bucket, crime_category))
rollups_daily   (beat_id, day_bucket DATE, crime_category, count, PRIMARY KEY (beat_id, day_bucket, crime_category))
rollups_weekly  (beat_id, week_bucket DATE, crime_category, count, PRIMARY KEY (beat_id, week_bucket, crime_category))
rollups_monthly (beat_id, month_bucket DATE, crime_category, count, PRIMARY KEY (beat_id, month_bucket, crime_category))

-- Model predictions cache
predictions (
  id BIGSERIAL PRIMARY KEY,
  beat_id INT REFERENCES beats(id),
  prediction_window DATERANGE NOT NULL,
  model_version TEXT NOT NULL,
  predicted_count INT,
  predicted_probability DOUBLE PRECISION,
  shap_attribution JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Key design decisions

Raw crimes table is append-only. Once an incident is ingested, the row is never updated. Corrections to Chicago's data portal arrive as new rows with the same case_number, and the unique constraint forces conflict resolution at ETL time, not at query time. This makes the table cache-friendly and time-travel queryable.

Four rollup granularities (hourly, daily, weekly, monthly) because the frontend's time-window selector spans "last 6 hours" to "last 5 years." Hitting the hourly table for a 5-year query would be slow; hitting the monthly table for a 6-hour query would have zero resolution. The query layer picks the right rollup based on window length: ≤72h hits hourly, ≤90d hits daily, ≤2y hits weekly, longer hits monthly.

GIST spatial index on beats.geometry and crimes.location. Point-in-polygon assignments (ST_Contains) run in milliseconds. The ETL pipeline uses this to populate crimes.beat_id at ingest, so the rollup queries never need to do spatial joins — they're already keyed by beat_id.

predictions.shap_attribution as JSONB, not a wide column or a separate table. Different model versions return different numbers of features, and the SHAP output shape evolves with the feature set. JSONB lets the schema absorb that without migration churn.

prediction_window as DATERANGE instead of two timestamp columns. PostgreSQL's range types compose cleanly with && (overlap) and @> (contains), so "what prediction was active for date X" is one operator, not a BETWEEN with edge-case bugs.

No foreign key from predictions to a model registry table. Model versions are tracked as strings (v1.2.3-xgboost-burglary) rather than a normalized lookup. The cost of a string column is trivial compared to the operational simplicity of not needing to insert a model registry row before logging a prediction.


ETL pipeline

Daily ingest from Chicago's Socrata API. The pipeline:

  1. Pulls incidents updated since the last successful run (idempotent via case_number)
  2. Spatial-joins each incident to its beat_id using PostGIS ST_Contains
  3. Upserts into crimes with ON CONFLICT (case_number) DO UPDATE
  4. Incrementally updates the four rollup tables for affected (beat, time bucket, category) tuples
  5. Logs ingest metrics to a separate etl_runs audit table

The rollup update is incremental, not full-rebuild. A run that ingests 5,000 new rows updates roughly 5,000 rollup row counts, not 7.8M.