ConvLSTM was invented to predict rainstorms.

Specifically, Shi et al. at the Hong Kong Observatory needed to forecast radar echo maps — 2D grids of rainfall intensity that evolve over time. They had sequences of spatial images and wanted to predict the next frames. Sound familiar?

That's exactly what we built in Part 3. Crime on a 500m grid, one frame per month, six channels for crime types. The Auckland crime tensor is structurally identical to a weather radar sequence. Same dimensionality, same prediction task, just a very different domain.

Why not regular LSTM?

Standard LSTM networks are fantastic at learning sequences. They're the backbone of a lot of time-series forecasting. But they have a fundamental problem with spatial data: they need flat vectors as input.

To feed our 77×59 grid into a regular LSTM, we'd have to flatten it into a vector of 4,543 values per crime type. That's 27,258 values per timestep across all six channels. The network would process this as a sequence of big flat vectors, with no concept that cell (10, 5) is next to cell (10, 6).

All the spatial structure — the fact that crime clusters, that hotspots have neighbourhoods, that the CBD is a contiguous area — gets thrown away. The model would have to rediscover spatial relationships from scratch, purely from correlations in the flattened vector. With only 36 training months, that's not happening.
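
To make that loss concrete, here's a quick NumPy sketch (toy values) showing how flattening discards adjacency:

```python
import numpy as np

# One month of one crime type: a 77x59 grid of counts (toy values).
grid = np.zeros((77, 59))
grid[10, 5] = 3   # a hotspot cell
grid[10, 6] = 2   # its immediate neighbour

flat = grid.reshape(-1)  # what a plain LSTM sees: 4,543 values

# Horizontal neighbours happen to land at consecutive indices...
print(np.ravel_multi_index((10, 5), grid.shape))  # 595
print(np.ravel_multi_index((10, 6), grid.shape))  # 596

# ...but vertical neighbours end up 59 positions apart, and nothing in
# the LSTM's weight matrix encodes either kind of distance.
print(np.ravel_multi_index((11, 5), grid.shape))  # 654
```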

The convolutional trick

ConvLSTM's insight is elegant. Take the standard LSTM equations — the input gate, forget gate, output gate, cell state update — and replace every matrix multiplication with a convolution operation.

In a regular LSTM:

input_gate = sigmoid(W_xi * x_t + W_hi * h_{t-1} + b_i)    # * is matrix multiplication

In ConvLSTM:

input_gate = sigmoid(W_xi ∗ X_t + W_hi ∗ H_{t-1} + b_i)    # ∗ is 2D convolution

That is a convolution instead of a matrix multiply. X_t is the full 2D grid at time t, and H_{t-1} is the previous hidden state — also a 2D grid. The convolution kernel slides across the spatial dimensions, so each cell's gate values depend on its local neighbourhood.

This means the network naturally learns that a spike in cell (10, 5) might affect predictions for cell (10, 6). Spatial proximity is baked into the architecture. It doesn't need to learn it from data.

The kernel size controls how much spatial context each cell sees. A 3×3 kernel means each cell looks at its immediate 8 neighbours. Stack multiple ConvLSTM layers and the effective receptive field grows — deeper layers can capture relationships between cells that are several kilometres apart.
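
For readers who want to see the gates in code, here's a minimal ConvLSTM cell sketch in PyTorch. It follows the structure described above — a single convolution over the concatenated input and hidden state computes all four gates — but omits the peephole terms from the original Shi et al. formulation. It's an illustration, not the exact implementation used in this project.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: every matrix multiply in the LSTM gate
    equations is replaced by a 2D convolution (peepholes omitted)."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # 'same' padding keeps the grid size
        self.conv = nn.Conv2d(in_channels + hidden_channels,
                              4 * hidden_channels,  # i, f, o, g at once
                              kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state  # hidden and cell state: both 2D grids, (B, C, H, W)
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g        # cell state update (element-wise, per cell)
        h = o * torch.tanh(c)    # new hidden state -- still a 2D grid
        return h, c

# One timestep on a toy frame: 6 input channels, 32 hidden, 77x59 grid.
cell = ConvLSTMCell(6, 32)
x = torch.randn(1, 6, 77, 59)
h = torch.zeros(1, 32, 77, 59)
c = torch.zeros(1, 32, 77, 59)
h, c = cell(x, (h, c))
print(h.shape)  # torch.Size([1, 32, 77, 59]) -- spatial dims preserved
```

Note that the hidden state keeps the full spatial layout, so each gate value at cell (10, 5) is computed from a 3×3 neighbourhood of both the input and the previous hidden state.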

Architecture choices

Here's what I settled on after a fair bit of experimentation (which on CPU means "a lot of patient waiting"):

Input: (batch, 6, 6, 77, 59) — 6 months, 6 crime types, 77×59 grid
  ↓
ConvLSTM2d(in=6, hidden=32, kernel=3×3, padding=1)
  ↓
BatchNorm2d
  ↓
ConvLSTM2d(in=32, hidden=32, kernel=3×3, padding=1)
  ↓
BatchNorm2d
  ↓
Conv2d(in=32, out=6, kernel=1×1) — project to 6 crime type channels
  ↓
Output: (batch, 6, 77, 59) — next month prediction

Two ConvLSTM layers with 32 hidden channels each. The 3×3 kernel gives each cell a neighbourhood view, and stacking two layers means the effective receptive field covers about 1–1.5 km — enough to capture the spatial extent of most crime hotspots.

Why only 32 hidden channels? This is where the CPU constraint actually helps. A bigger model would be tempting with a GPU, but on a Ryzen 5 we need to keep it tight. 32 channels gives us about 200k trainable parameters — small enough to train in under an hour, large enough to learn meaningful spatial-temporal patterns.

The 1×1 convolution at the end is just a channel projection — it maps the 32 learned features back to 6 crime type predictions.
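
A quick PyTorch check of that projection head — the shapes and the (tiny) parameter count:

```python
import torch
import torch.nn as nn

# A 1x1 convolution is a per-cell linear layer across channels: it maps
# the 32 learned features at each grid cell to 6 crime-type predictions.
head = nn.Conv2d(32, 6, kernel_size=1)

features = torch.randn(4, 32, 77, 59)   # a batch of final hidden states
pred = head(features)
print(pred.shape)  # torch.Size([4, 6, 77, 59])

# Parameters: 32*6 weights + 6 biases = 198 -- negligible next to the
# ConvLSTM layers.
print(sum(p.numel() for p in head.parameters()))  # 198
```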

Sequence length: six months

The lookback window is six months. The model sees January through June and predicts July. Then February through July to predict August. And so on.

Six months captures one half of the seasonal cycle, which turned out to be the sweet spot. Shorter sequences (3 months) missed seasonal context. Longer sequences (12 months) didn't improve results — likely because the model doesn't have enough data to learn year-long dependencies with only 36 training months total.

The training set gives us 30 sequences (months 1–6 predict 7, months 2–7 predict 8, all the way to months 30–35 predict 36). That's not a lot. Every sequence counts.
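
The sliding-window construction is a few lines of NumPy. This sketch uses random stand-in data, but the shapes match the real tensor:

```python
import numpy as np

LOOKBACK = 6
data = np.random.rand(36, 6, 77, 59)  # 36 training months (toy values)

# Window t..t+5 is the input sequence; month t+6 is the target frame.
X = np.stack([data[t:t + LOOKBACK] for t in range(len(data) - LOOKBACK)])
y = np.stack([data[t + LOOKBACK] for t in range(len(data) - LOOKBACK)])

print(X.shape)  # (30, 6, 6, 77, 59) -- 30 sequences of 6 months
print(y.shape)  # (30, 6, 77, 59)   -- one target frame each
```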

Training details

optimiser = Adam(lr=1e-4)
loss = MSE  # on log1p-transformed values
batch_size = 4  # small because sequences are large
epochs = 150 with early stopping (patience=15)
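
For anyone reconstructing the loop, here's a skeleton of how those settings fit together. The model and data below are tiny stand-ins (a single Conv2d and random tensors) so the sketch runs end to end — the real loop uses the ConvLSTM model and batched DataLoaders.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(6, 6, 3, padding=1)  # stand-in for the ConvLSTM model
train_x = torch.randn(8, 6, 77, 59); train_y = torch.randn(8, 6, 77, 59)
val_x = torch.randn(4, 6, 77, 59);   val_y = torch.randn(4, 6, 77, 59)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # applied to log1p-transformed values in practice

best_val, patience, bad_epochs = float("inf"), 15, 0
for epoch in range(150):
    model.train()
    optimiser.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    optimiser.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(val_x), val_y).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0  # improvement: reset the clock
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stop after 15 epochs without improvement
```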

The log1p transformation from Part 3 is critical here. Raw crime counts range from 0 to 50+. After log1p, the range compresses to 0–4. Without this, the loss function would be dominated by the handful of high-count CBD cells, and the model would essentially ignore the rest of the grid.
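
A quick demonstration of the compression, and of why log1p rather than a plain log (it's defined at zero and exactly invertible):

```python
import numpy as np

counts = np.array([0, 1, 5, 20, 50])  # raw monthly counts
scaled = np.log1p(counts)             # log(1 + x)
print(scaled.round(2))  # roughly 0, 0.69, 1.79, 3.04, 3.93

# Invertible, so predictions map cleanly back to counts:
recovered = np.expm1(scaled)
print(recovered.round(0))  # back to 0, 1, 5, 20, 50
```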

Training on CPU takes about 40 minutes per run. Not fast, but manageable. I could typically fit in 3–4 experimental runs per evening, which meant progress was slow but steady. Each run I'd tweak one thing — kernel size, hidden channels, learning rate — and compare validation MAE.

Early stopping triggers around epoch 80–100 in most runs. The model converges relatively quickly, which makes sense given the small dataset and architecture.

Results

So how does ConvLSTM stack up against the baselines from Part 5?

Crime Type    Hist. Avg MAE    ConvLSTM MAE    Improvement
Theft         1.28             1.14            10.9%
Burglary      0.35             0.32            8.6%
Assault       0.20             0.19            5.0%
Robbery       0.04             0.04            2.5%
Sexual        0.03             0.03            ~0%
Harm          0.01             0.01            ~0%
All types     0.39             0.35            10.3%
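
The improvement column is just the relative MAE reduction — for the aggregate row:

```python
hist_avg_mae = 0.39   # baseline from Part 5 (all types)
convlstm_mae = 0.35

improvement = (hist_avg_mae - convlstm_mae) / hist_avg_mae
print(f"{improvement:.1%}")  # 10.3%
```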

A 10% improvement on the aggregate MAE. Not earth-shattering, but real.

Theft gets the biggest lift because there's the most signal to work with. The model genuinely learns spatial dynamics that the historical average can't capture — when a cluster of cells in South Auckland trends upward over several months, ConvLSTM picks up on that momentum and adjusts its predictions accordingly.

Burglary sees a decent improvement too, likely driven by the spatial correlation with theft that we spotted in the EDA.

For the sparse crime types — robbery, sexual offences, harm — ConvLSTM basically learns to predict near-zero, same as the baseline. There simply isn't enough signal at 500m monthly resolution for these types. The model is honest about what it doesn't know, which I actually respect.

Where it shines and where it doesn't

The improvement isn't uniform across the grid. ConvLSTM does best in the transition zones — cells on the edges of established hotspots where crime counts fluctuate month to month. It learns that these boundary cells tend to follow the trend of their neighbours, which is exactly the kind of spatial-temporal pattern it was designed to capture.

In the stable hotspot cores — the CBD, Manukau — the model performs about the same as the baseline. Those cells are consistently high, and the historical average already captures that well.

Where it properly struggles is with sudden spikes in normally quiet areas. A cell that's been near-zero for months and then gets 5 thefts in one month — the model doesn't see that coming. Neither does any other model, to be fair. Those events are closer to random noise than learnable signal.

Putting it in perspective

A 10% MAE improvement is meaningful but modest. Recent ConvLSTM crime prediction papers report larger gains, but they typically work with much more data — years of daily records across cities with higher crime density. Our setup is tougher: monthly resolution limits temporal signal, Auckland is relatively low-crime by global standards, and we only have four years.

The model is also running on CPU with a deliberately small architecture. A bigger model on a GPU might squeeze out more performance. But the point of this project was always to see how far you can push it with modest resources, and a 10% beat over simple baselines feels like a genuine result.

The question now is whether ST-ResNet's different approach to temporal modelling can do better. ConvLSTM processes time as one continuous sequence. ST-ResNet breaks it into three separate temporal scales — closeness, period, and trend. With a seasonal dataset like crime, that decomposition might be exactly what's needed.