Database Design

The schema is built around two ideas: raw incident data is immutable, and aggregations are precomputed. The combination delivers sub-100ms heatmap queries against an 8.5M-row dataset without resorting to caching layers or denormalization gymnastics.

Schema

-- Raw incident data
crimes (
  id BIGSERIAL PRIMARY KEY,
  case_number TEXT UNIQUE NOT NULL,
  occurred_at TIMESTAMPTZ NOT NULL,
  primary_type TEXT NOT NULL,
  description TEXT,
  location_description TEXT,
  arrest BOOLEAN,
  domestic BOOLEAN,
  beat_id INT REFERENCES beats(id),
  district INT,
  ward INT,
  community_area INT,
  fbi_code TEXT,
  latitude DOUBLE PRECISION,
  longitude DOUBLE PRECISION,
  location GEOMETRY(Point, 4326),
  imported_at TIMESTAMPTZ DEFAULT NOW()
);

-- Police beat polygons
beats (
  id INT PRIMARY KEY,
  beat_number TEXT UNIQUE NOT NULL,
  district INT NOT NULL,
  geometry GEOMETRY(Polygon, 4326) NOT NULL
);

-- Pre-aggregated rollups (7.8M rows across all buckets)
rollups_hourly  (beat_id, hour_bucket TIMESTAMPTZ, crime_category, count, PRIMARY KEY (beat_id, hour_bucket, crime_category))
rollups_daily   (beat_id, day_bucket DATE, crime_category, count, PRIMARY KEY (beat_id, day_bucket, crime_category))
rollups_weekly  (beat_id, week_bucket DATE, crime_category, count, PRIMARY KEY (beat_id, week_bucket, crime_category))
rollups_monthly (beat_id, month_bucket DATE, crime_category, count, PRIMARY KEY (beat_id, month_bucket, crime_category))

-- Model predictions cache
predictions (
  id BIGSERIAL PRIMARY KEY,
  beat_id INT REFERENCES beats(id),
  prediction_window DATERANGE NOT NULL,
  model_version TEXT NOT NULL,
  predicted_count INT,
  predicted_probability DOUBLE PRECISION,
  shap_attribution JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Key design decisions

Raw crimes table is append-only. Once an incident is ingested, the row is never updated. Corrections to Chicago's data portal arrive as new rows with the same case_number, and the unique constraint forces conflict resolution at ETL time, not at query time. This makes the table cache-friendly and time-travel queryable.

Four rollup granularities (hourly, daily, weekly, monthly) because the frontend's time-window selector spans "last 6 hours" to "last 5 years." Hitting the hourly table for a 5-year query would be slow; hitting the monthly table for a 6-hour query would have zero resolution. The query layer picks the right rollup based on window length: ≤72h hits hourly, ≤90d hits daily, ≤2y hits weekly, longer hits monthly.

GIST spatial index on beats.geometry and crimes.location. Point-in-polygon assignments (ST_Contains) run in milliseconds. The ETL pipeline uses this to populate crimes.beat_id at ingest, so the rollup queries never need to do spatial joins: they're already keyed by beat_id.

predictions.shap_attribution as JSONB, not a wide column or a separate table. Different model versions return different numbers of features, and the SHAP output shape evolves with the feature set. JSONB lets the schema absorb that without migration churn.

prediction_window as DATERANGE instead of two timestamp columns. PostgreSQL's range types compose cleanly with && (overlap) and @> (contains), so "what prediction was active for date X" is one operator, not a BETWEEN with edge-case bugs.

No foreign key from predictions to a model registry table. Model versions are tracked as strings (v1.2.3-xgboost-burglary) rather than a normalized lookup. The cost of a string column is trivial compared to the operational simplicity of not needing to insert a model registry row before logging a prediction.

ETL pipeline

Daily ingest from Chicago's Socrata API. The pipeline:

Pulls incidents updated since the last successful run (idempotent via case_number)
Spatial-joins each incident to its beat_id using PostGIS ST_Contains
Upserts into crimes with ON CONFLICT (case_number) DO UPDATE
Incrementally updates the four rollup tables for affected (beat, time bucket, category) tuples
Logs ingest metrics to a separate etl_runs audit table

The rollup update is incremental, not full-rebuild. A run that ingests 5,000 new rows updates roughly 5,000 rollup row counts, not 7.8M.

CommunityShield

Database Design

Schema

Key design decisions

ETL pipeline