Challenges and Solutions

Ten engineering problems from this build. Each one cost real time and changed how I approach ML systems, geospatial data, or production-grade UX tradeoffs.

1. 8.5M rows made the heatmap query unusable

Challenge: Initial heatmap implementation queried the raw crimes table, grouped by beat_id, filtered by time window. Even with indexes the query took 4 to 8 seconds. Interactive map dead on arrival.

Solution: Built four pre-aggregated rollup tables (hourly, daily, weekly, monthly), each keyed by (beat_id, time_bucket, crime_category). The heatmap query became a single indexed lookup against the correct rollup. Sub-100ms.

What I learned: Precomputation is the right answer when the query pattern is predictable. The disk cost is trivial compared to the latency win.

2. Temporal split changed the project

Challenge: Random train/test split produced PR-AUC of 0.78 across all models. Looked great. The model was learning seasonal patterns it would never see at production time.

Solution: Switched to strict temporal holdout: most recent 6 months held out, never seen during tuning. PR-AUC dropped to 0.62. The temporal number is the real number.

What I learned: If the production model will run on future data, the holdout has to be the future. Anything else inflates metrics and hides the data ceiling.

3. WebGL isn't universal

Challenge: MapLibre GL needs WebGL. Corporate browsers, older devices, and accessibility tools sometimes don't have it. First version showed a blank screen in those environments.

Solution: 3-tier rendering fallback. WebGL detection on mount selects MapLibre, server-side SVG choropleth, or a sortable data table: in that priority order. Same data, three rendering paths.

What I learned: Map libraries that depend on a single rendering technology are fragile. Plan the fallback before shipping.

4. Adding census features looked tempting and was wrong

Challenge: Demographic features at the census-tract level were trivial to add and feel like they should help.

Solution: Deliberate ablation. PR-AUC moved by less than 0.01, and feature importance showed the model could learn race proxies. Stripped them out and documented why in ethics.md.

What I learned: A feature that doesn't help accuracy and introduces a known harm is not a tradeoff. It's a strict loss. Run the ablation, then make the right call.

5. PostGIS point-in-polygon was a bottleneck at ingest

Challenge: Assigning every new crime record to a beat (PostGIS ST_Contains against 274 polygons) was slow at the rates Chicago's API publishes.

Solution: Added a GIST spatial index on beats.geometry. Lookup latency dropped from seconds to milliseconds. The ETL pipeline went from a slow trickle to comfortably keeping up.

What I learned: PostGIS is fast when indexed and dog-slow when not. Always add the GIST index.

6. Holiday detection was a long tail

Challenge: is_holiday as a temporal feature seemed simple. The actual list includes federal holidays, Chicago-specific observances, religious holidays with floating dates, and adjacent days that act like holidays (the Friday after Thanksgiving).

Solution: Used the holidays Python library plus a manually-curated supplement for Chicago-specific dates. Backfilled the full feature against the historical dataset.

What I learned: "Is it a holiday" is a deceptively complex feature. Use a maintained library.

7. SHAP attribution latency on the prediction endpoint

Challenge: Computing SHAP attribution per request added 200 to 400ms to /predict calls. Acceptable but felt fragile.

Solution: Loaded the four TreeExplainer instances once at FastAPI startup via the lifespan context manager, keeping them in memory for the lifetime of the process. Latency dropped to under 50ms.

What I learned: Anything that loads from disk on a hot path is a configuration smell. Load once at startup.

8. Rollup table updates were a transaction hazard

Challenge: Incremental rollup updates need to atomically increment counts across four tables for every new incident. Naive implementation produced race conditions when the ETL pipeline ran concurrently with manual backfills.

Solution: Wrapped the rollup update in a single transaction per incident batch with SELECT ... FOR UPDATE on the affected rollup rows. Slower than the racy version, correct by construction.

What I learned: "Increment a counter" is never as simple as it looks under concurrent writes. Lock the row.

9. Beat polygons aren't free in MapLibre

Challenge: Rendering 274 beat polygons as individual layers tanked MapLibre's frame rate on lower-end devices.

Solution: Compiled the polygons to a single vector tile source with the choropleth color baked in as a feature property. One layer, GPU-accelerated, smooth on every device tested.

What I learned: Vector tile pipelines are the right shape for choropleth maps. Per-feature layers are not.

10. The methodology page had to be first-class

Challenge: First version buried the limitations in a footer-linked /about page. Nobody reads /about pages. Users would see predictions without seeing the caveats.

Solution: The methodology and ethics pages are part of the primary navigation, with the data ceiling chart visible the moment you click in. Predictions on the main app link to the relevant methodology section for the feature that drove them.

What I learned: If the limitations matter, and on a project like this they're the most important thing, they have to be in the foreground. Buried disclaimers are equivalent to no disclaimers.

CommunityShield