This is where the project gets properly fun.
We've got 1.15 million clean crime records. Every one of them has coordinates — either precise meshblock centroids or area unit fallbacks from Part 2. But a bag of lat/lon points isn't what a neural network wants. ConvLSTM and ST-ResNet are fundamentally image-processing architectures. They expect regular 2D grids — rows and columns, like pixels in a photograph.
So our job now is to convert the messy reality of crime locations into clean, regular "crime images" that a convolutional network can actually consume. And once you see it framed that way, crime prediction becomes video prediction. Each month is a frame. Each grid cell is a pixel. The brightness is the crime count.
Choosing 500m
This is the single most consequential decision in the entire data pipeline. Get the grid resolution wrong and everything downstream suffers.
Too fine — say 100m cells — and the vast majority of cells are empty in any given month. The model sees an ocean of zeros with occasional spikes, which is incredibly hard to learn from. Too coarse — say 2km — and you've blurred away the spatial patterns you're trying to detect. "Auckland CBD" and "Ponsonby" become the same cell, which is useless.
We computed Auckland's urban crime extent from the meshblock centroids (5th to 95th percentile to exclude outliers like Great Barrier Island):
| Metric | Value |
|---|---|
| Urban extent | 27.7 km × 36.9 km |
| Grid resolution | 500m × 500m |
| Grid dimensions | 77 rows × 59 columns |
| Total cells | 4,543 |
At 500m, each cell covers roughly a few city blocks. That's fine enough to distinguish a commercial strip from a residential street, but coarse enough that most cells accumulate at least some crime over the 48-month period. It's a sweet spot — and it's consistent with what recent crime forecasting research uses for similar models in US cities.
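The percentile-clipped bounding box described above can be sketched in a few lines of numpy. This is a minimal illustration, not the project's actual code — the function name `urban_grid_bounds` and the synthetic NZTM-range coordinates are assumptions for the example:

```python
import numpy as np

def urban_grid_bounds(x, y, cell_size=500.0, lo=5, hi=95):
    """Clip to the lo-hi percentile of point coordinates to drop
    far-flung outliers, then size a regular grid over that extent."""
    xmin, xmax = np.percentile(x, [lo, hi])
    ymin, ymax = np.percentile(y, [lo, hi])
    n_cols = int(np.ceil((xmax - xmin) / cell_size))  # west -> east
    n_rows = int(np.ceil((ymax - ymin) / cell_size))  # south -> north
    return xmin, ymin, n_rows, n_cols

# Toy example with synthetic coordinates in a plausible NZTM2000 range
rng = np.random.default_rng(0)
x = rng.uniform(1_730_000, 1_790_000, 10_000)  # eastings (metres)
y = rng.uniform(5_880_000, 5_950_000, 10_000)  # northings (metres)
xmin, ymin, n_rows, n_cols = urban_grid_bounds(x, y)
```

The 5th–95th percentile clip is what keeps outliers like Great Barrier Island from inflating the grid to thousands of mostly-ocean cells.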
Simple maths, no spatial joins
Here's the nice thing about working in NZTM2000, the metre-based projected coordinate system we set up in Part 2: assigning a crime to a grid cell is just floor division.
```python
grid_j = int((x - xmin) // 500)  # column index (west -> east)
grid_i = int((y - ymin) // 500)  # row index (south -> north)
```
No spatial joins, no polygon intersection, no geopandas overhead. Just arithmetic. It processes all 400k Auckland records in under a second.
For the ~22% of Auckland records that didn't get meshblock coordinates in Part 2, we fall back to area unit centroids converted to NZTM2000. Those records land at the centre of their suburb rather than their exact location — less precise, but dropping them entirely would be worse.
The result: 354,387 of 412,669 Auckland records (85.9%) fall within the grid. The remaining 14.1% are in Auckland's outer fringes — Great Barrier Island, rural Rodney, the edges of the Waitakere Ranges — beyond our urban bounding box. That's fine. We're modelling urban crime patterns, not rural ones.
The 4D tensor
With every crime assigned to a cell, we aggregate by grid position, month, and crime type:
(grid_i, grid_j, month, crime_type) → sum(victimisations)
This gives us a 4D tensor:
| Dimension | Size | Meaning |
|---|---|---|
| T (time) | 48 | Months: Feb 2022 – Jan 2026 |
| H (height) | 77 | Grid rows (south → north) |
| W (width) | 59 | Grid columns (west → east) |
| C (channels) | 6 | Crime types: theft, burglary, assault, robbery, sexual, harm |
Think of it as a 48-frame video with 6 colour channels. A regular video has 3 channels — red, green, blue. Ours has 6 — theft, burglary, assault, robbery, sexual offences, harm. Each pixel's brightness in a given channel tells you how many of that crime type happened in that 500m cell during that month.
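The aggregation step can be sketched with numpy's scatter-add. This is a minimal illustration under assumed column names (`month_idx`, `grid_i`, `grid_j`, `crime_idx`, `victimisations`) — the actual pipeline's naming may differ:

```python
import numpy as np
import pandas as pd

T, H, W, C = 48, 77, 59, 6
tensor = np.zeros((T, H, W, C), dtype=np.float32)

# Toy records: two thefts in the same cell-month, one assault elsewhere
df = pd.DataFrame({
    "month_idx":      [0, 0, 1],
    "grid_i":         [10, 10, 40],
    "grid_j":         [5, 5, 20],
    "crime_idx":      [0, 0, 2],
    "victimisations": [2, 3, 1],
})

# np.add.at is an unbuffered scatter-add: duplicate (t, i, j, c)
# keys accumulate correctly instead of overwriting each other
np.add.at(
    tensor,
    (df["month_idx"].to_numpy(), df["grid_i"].to_numpy(),
     df["grid_j"].to_numpy(), df["crime_idx"].to_numpy()),
    df["victimisations"].to_numpy(),
)
```

The unbuffered part matters: a plain fancy-index assignment like `tensor[idx] += v` silently drops duplicate keys, which would undercount every cell-month with more than one record.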
I genuinely love this framing. It takes a complicated spatial-temporal prediction problem and maps it onto something that decades of computer vision research already knows how to handle.
91.7% zeros
The tensor is overwhelmingly empty. 91.7% of all cells are zero.
This makes complete sense if you think about it. Most 500m squares in Auckland don't have a single reported crime in any given month. Crime clusters — commercial corridors, transport hubs, specific residential pockets. The non-zero 8.3% is where all the signal lives.
The sparsity does create a training challenge though. If the model just predicted zero everywhere, it'd be right 91.7% of the time. Useless, but technically accurate. That's why we'll use log1p normalisation during training — it compresses the range from [0, 50+] to [0, ~4], giving the model a more balanced gradient to learn from. And it's why the loss function needs to care more about the non-zero cells than the empty ones.
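A minimal sketch of both ideas in plain numpy. The 10× weight on non-zero cells is an illustrative number for the example, not a tuned value from the project:

```python
import numpy as np

def normalise(counts):
    # log1p squashes [0, 50+] down to roughly [0, 4]
    return np.log1p(counts)

def weighted_mse(pred, target, nonzero_weight=10.0):
    """MSE that cares more about cells with actual crime.
    The 10x weight is an illustrative choice, not a tuned one."""
    w = np.where(target > 0, nonzero_weight, 1.0)
    return float(np.mean(w * (pred - target) ** 2))

x = np.array([0.0, 1.0, 10.0, 50.0])
z = normalise(x)      # ~[0.0, 0.69, 2.40, 3.93]
back = np.expm1(z)    # expm1 inverts log1p exactly
```

The `expm1` round trip matters at prediction time: the model learns in log space, but the forecasts we report are counts.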
The upside of all those zeros is storage. The compressed numpy format handles sparse data beautifully — the full 4D tensor saves to just 0.2 MB. Compare that to the 21.9 MB Parquet from Part 2.
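The round trip looks like this — a self-contained sketch that writes to an in-memory buffer rather than the project's actual file path:

```python
import io
import numpy as np

tensor = np.zeros((48, 77, 59, 6), dtype=np.float32)
tensor[0, 10, 5, 0] = 3.0   # mostly zeros, like the real data

# savez_compressed applies zlib, which eats runs of zeros for breakfast
buf = io.BytesIO()
np.savez_compressed(buf, tensor=tensor)
buf.seek(0)
loaded = np.load(buf)["tensor"]
```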
Train, validate, test
We split the 48 months temporally — no shuffling, no random sampling:
| Set | Months | Range |
|---|---|---|
| Train | 36 | Feb 2022 – Jan 2025 |
| Validation | 6 | Feb 2025 – Jul 2025 |
| Test | 6 | Aug 2025 – Jan 2026 |
The model trains on three years, tunes on six months, and gets evaluated on the most recent six months it's never seen. There's no spatial leakage either — we don't hold out specific grid cells. The model has to predict all locations for future months simultaneously.
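With time as the leading axis of the tensor, the split is plain slicing — no library machinery needed:

```python
import numpy as np

tensor = np.zeros((48, 77, 59, 6), dtype=np.float32)

# Strictly chronological: no shuffling, no leakage from the future
train = tensor[:36]     # Feb 2022 - Jan 2025
val   = tensor[36:42]   # Feb 2025 - Jul 2025
test  = tensor[42:]     # Aug 2025 - Jan 2026
```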
This is the only honest way to evaluate a time-series model. If you randomly shuffle months into train and test, the model can memorise seasonal patterns and look brilliant without actually learning anything useful about temporal dynamics.
What the tensor reveals
Even at this aggregate level, clear patterns jump out.
February tends to be the quietest month (~7–8k victimisations across Auckland), while October through January — spring and early summer — consistently peaks at 8.5–9.5k. 2023 was the peak year across the board, with a gradual decline through 2024 and into 2025.
Theft accounts for 72% of the tensor values (283k victimisations), burglary 17% (68k), and assault 9% (34k). That theft dominance from Part 1 — the 66% figure — gets even more pronounced when you focus on Auckland, because theft clusters harder in urban areas than other crime types do.
What's next
The tensor is built. The model input is ready. But before throwing deep learning at anything, we need to properly understand what patterns actually exist in this data — when crime peaks, where it clusters, and how the different crime types behave. Next post: exploratory data analysis.