You can't just shove a tensor into a neural network and hope for the best.

I mean, you can. People do it all the time. But you'll have no idea whether your model is learning something real or just memorising noise. Before we get anywhere near ConvLSTM or ST-ResNet, we need to properly understand what patterns actually exist in this data — and whether they're strong enough for a model to learn.

This is the part that most ML blog posts skip. It's also the part that saves you weeks of debugging later.

When does crime happen?

The monthly pattern across Auckland is surprisingly consistent year to year. Crime peaks in late spring and early summer — October through January — and dips in late summer through winter. February is reliably the quietest month at around 7,000–8,000 victimisations, while November and December regularly push past 9,000.

This tracks with what criminology research has found globally — warmer months mean more people out and about, more opportunities for property crime, and more interpersonal conflict. It's a well-documented pattern called seasonal variation in crime, and it shows up clearly in the NZ data.

But here's the interesting bit — the seasonal signal isn't uniform across crime types. Theft drives most of the swing. It surges in summer and drops in winter, accounting for nearly all the monthly variance. Assault has its own rhythm — it peaks around the holiday period (December–January) and shows a secondary bump on winter weekends, probably pub-related. Burglary is flatter, with a slight winter uptick when houses are dark earlier.
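To make "theft drives most of the swing" concrete, here's a minimal variance-decomposition sketch on synthetic monthly totals (the amplitudes, noise levels, and two-channel setup are invented for illustration). Because variances add through covariances, each channel's covariance with the total is its contribution to the total's monthly variance.

```python
import numpy as np

# Synthetic monthly totals for two channels over 24 months (invented numbers):
# theft with a large seasonal amplitude, burglary with a small one.
rng = np.random.default_rng(0)
months = np.arange(24)
seasonal = np.sin(2 * np.pi * months / 12)          # one cycle per 12 months
theft = 5000 + 1200 * seasonal + rng.normal(0, 50, 24)
burglary = 1500 + 100 * seasonal + rng.normal(0, 50, 24)

# Var(total) = Cov(theft, total) + Cov(burglary, total), so each channel's
# covariance share is its contribution to the monthly variance of the total.
total = theft + burglary
for name, series in (("theft", theft), ("burglary", burglary)):
    share = np.cov(series, total, ddof=0)[0, 1] / total.var()
    print(f"{name}: {share:.0%} of monthly variance")
```

With amplitudes in this ratio, theft ends up carrying around 90% of the variance, which is the shape of the claim above.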

2023 was the peak year across the board, with a noticeable decline through 2024 and into early 2025. Whether that's a real trend or a reporting artefact, I genuinely don't know. But it means the model's training data includes both an upswing and a downswing, which is actually useful — it can't just learn "crime always goes up."

Where does crime cluster?

Crime in Auckland is not randomly distributed. That's obvious to anyone who lives here, but it's worth quantifying.

Running a Moran's I test on our 500m grid confirms strong positive spatial autocorrelation — cells with high crime counts are surrounded by other high-crime cells. The statistic comes out at 0.43 (p < 0.001), so the clustering is far too strong to be chance: hotspots come in clumps, not isolated cells.
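Moran's I is simple enough to compute from scratch. Here's a self-contained sketch using rook-contiguity binary weights (each cell linked to its up/down/left/right neighbours) on toy grids, as an illustration of the statistic rather than the actual pipeline:

```python
import numpy as np

def morans_i(grid):
    """Global Moran's I for a 2-D array, rook-contiguity binary weights."""
    z = grid - grid.mean()
    num = 0.0    # sum of z_i * z_j over all directed neighbour pairs
    w_sum = 0    # total number of directed neighbour pairs
    rows, cols = grid.shape
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    num += z[i, j] * z[ni, nj]
                    w_sum += 1
    return (grid.size / w_sum) * num / (z ** 2).sum()

clustered = np.zeros((6, 6))
clustered[:2, :2] = 10                        # one hot corner, like a CBD
checkerboard = np.indices((6, 6)).sum(axis=0) % 2

print(morans_i(clustered))        # clearly positive: like values cluster
print(morans_i(checkerboard))     # close to -1: like values repel
```

Values near +1 mean clustering, near 0 mean spatial randomness, and negative values mean a repelling, checkerboard-like pattern. Our 0.43 sits firmly in clustered territory.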

The hotspots are exactly where you'd expect. The CBD dominates — Queen Street, Karangahape Road, and the surrounding blocks consistently light up across all crime types. South Auckland corridors — Manukau, Ōtāhuhu, Papatoetoe — form a second cluster, particularly for assault and robbery. Henderson in the west shows up for burglary.

What's less obvious is how stable these hotspots are over time. The top 5% of cells (about 227 cells) account for over 60% of all recorded crime across the entire four-year period. These aren't random spikes — they're persistent. A cell that's hot in 2022 is almost certainly still hot in 2025. That temporal persistence is exactly what makes this data amenable to prediction. If hotspots moved randomly month to month, no model could learn them.
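The top-5% share is easy to recompute from per-cell totals. A sketch on hypothetical heavy-tailed counts (the Pareto draw is a stand-in for the real grid, whose size is implied by 227 cells being about 5%):

```python
import numpy as np

# Hypothetical per-cell crime totals: a heavy-tailed draw standing in for
# the real grid (227 cells ≈ 5% implies roughly 4,540 cells in total).
rng = np.random.default_rng(42)
cell_totals = rng.pareto(1.2, size=4540)

sorted_totals = np.sort(cell_totals)[::-1]     # busiest cells first
k = int(len(sorted_totals) * 0.05)             # top 5% = 227 cells
top_share = sorted_totals[:k].sum() / sorted_totals.sum()
print(f"top 5% of cells hold {top_share:.0%} of the total")
```

On the real data the same computation gives the 60%+ figure quoted above; a year-over-year rank correlation of the same per-cell totals is the matching check for persistence.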

Crime type correlations

The six channels in our tensor don't behave independently. Theft and burglary show moderate positive correlation (r ≈ 0.52) — cells with lots of theft tend to have more burglary too, which makes sense given similar opportunity structures (commercial areas, transport hubs).

Assault correlates weakly with everything else (r ≈ 0.15–0.25). It has its own spatial logic — nightlife areas, specific residential pockets — that doesn't align neatly with property crime.

Robbery, sexual offences, and harm are so sparse at the 500m monthly resolution that correlation analysis is basically meaningless. Most cells have zero counts for these types in any given month. That sparsity is going to be a real headache for the models.
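The correlation computation itself is a one-liner once the tensor is flattened to cell-months. A sketch on a toy two-channel tensor (the shapes, Poisson rates, and the theft-burglary coupling are all invented):

```python
import numpy as np

# Toy (months, rows, cols) counts for two channels; burglary is deliberately
# coupled to theft so a positive correlation exists to recover.
rng = np.random.default_rng(1)
T, H, W = 48, 10, 10
theft = rng.poisson(2.0, (T, H, W))
burglary = rng.poisson(0.5, (T, H, W)) + (theft > 3)

tensor = np.stack([theft, burglary], axis=-1)   # (months, rows, cols, channels)
flat = tensor.reshape(-1, tensor.shape[-1])     # one row per cell-month
r = np.corrcoef(flat, rowvar=False)             # channels as variables
print(f"theft-burglary r = {r[0, 1]:.2f}")
```

For the sparse channels, most rows of `flat` are zero in that column, which is why the resulting r values there are small and unstable.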

The sparsity problem — again

We flagged this in Part 3: 91.7% of the tensor is zeros. But the EDA makes the problem even clearer.

The distribution of non-zero cell values is heavily right-skewed. The median non-zero value is 1. One crime, in one cell, in one month. The mean is about 2.3. A handful of cells — the CBD, Manukau — hit 30–50+ in peak months for theft. The model needs to learn the difference between "always zero" cells, "occasionally one" cells, and "consistently busy" cells.

If you plot the crime count distribution across non-zero cells, it follows something close to a power law. A tiny number of cells carry an outsized share of the signal. This is textbook spatial concentration of crime — it's been documented in basically every city ever studied.
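These summary statistics are cheap to recompute. A sketch on a toy tensor with the same flavour (the shape and Poisson rate are invented: mostly zeros, a thin tail of non-zero counts):

```python
import numpy as np

# Toy count tensor, ~90% zeros, standing in for one crime-type channel.
rng = np.random.default_rng(7)
tensor = rng.poisson(0.1, size=(48, 60, 76))

nonzero = tensor[tensor > 0]
print(f"zeros: {1 - nonzero.size / tensor.size:.1%}")
print(f"median non-zero: {np.median(nonzero):.0f}, mean: {nonzero.mean():.2f}")
```

Even on this toy draw the median non-zero value is 1, because when counts are this sparse, "non-zero" almost always means "exactly one".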

For modelling, this means two things. First, aggregate metrics like RMSE will be dominated by how well the model predicts the high-count cells. Second, predicting "zero" for a sparse cell is almost always correct but completely uninformative. We'll need to think carefully about what "accuracy" actually means when we get to evaluation.
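The second point is easy to demonstrate on synthetic sparse counts: the all-zeros "model" is nearly unbeatable on RMSE, which says more about the metric than about any model (rates and sizes invented for illustration):

```python
import numpy as np

# ~90% zeros, like a sparse crime channel (rate invented for illustration).
rng = np.random.default_rng(3)
y_true = rng.poisson(0.1, size=10_000).astype(float)

# Two trivial "models": predict zero everywhere, or the global mean everywhere.
rmse_zeros = np.sqrt(np.mean((y_true - 0.0) ** 2))
rmse_mean = np.sqrt(np.mean((y_true - y_true.mean()) ** 2))
print(f"all-zeros RMSE: {rmse_zeros:.3f}, global-mean RMSE: {rmse_mean:.3f}")
```

Both numbers come out tiny and nearly identical, yet neither "model" has learned anything, which is exactly the evaluation trap we'll need to design around.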

What this means for the models

The EDA tells us a few things that should directly shape how we build and evaluate the models:

The seasonal signal is strong and consistent. A model that can't capture monthly seasonality is worse than useless — it's worse than a calendar.

Spatial structure is real and persistent. Hotspots don't move much. A model that learns static spatial patterns will get a lot of the way there, even without understanding temporal dynamics.

The interesting question isn't "can we predict that the CBD will have lots of theft next month" — of course it will. It's "can we predict the changes at the margins?" The cells that go from quiet to active, or the months where a normally stable area spikes. That's where deep learning might actually add value over simple baselines.

Speaking of which — we need baselines. Otherwise we won't know if ConvLSTM is actually clever or just expensive. That's next.