ST-ResNet's core insight is that not all history is created equal.
When you're predicting crime in Auckland next month, three different kinds of past information matter. What happened in the last couple of months — the recent trend. What happened at the same time last year — the seasonal pattern. And what's been happening over the longer term — whether crime is generally rising or falling in an area.
ConvLSTM treats all of this as one continuous sequence and hopes the network figures out which parts matter. ST-ResNet takes a more opinionated approach — it separates these three temporal scales explicitly and gives each one its own dedicated neural network branch.
The original paper by Zhang et al. was about predicting crowd flows in Beijing. People move through cities in patterns that look a lot like crime patterns — daily rhythms, weekly cycles, long-term trends. The architecture translates well to crime data, with some modifications.
Closeness, period, trend
The three branches each look at different slices of history:
Closeness captures what's been happening recently. For our monthly data, this means the last 3 months. If South Auckland has been trending upward over the last quarter, the closeness branch sees that momentum.
Period captures seasonal patterns. It looks at the same month in previous years — so to predict January 2026, it pulls in January 2025 and January 2024. The assumption is that crime has an annual rhythm, and the same month tends to look similar year to year.
Trend captures longer-term shifts. It uses quarterly averages from further back — broad strokes of whether an area is seeing more or less crime over time. This is the slowest-moving signal.
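The three slices above can be sketched as plain array indexing. This is a minimal NumPy sketch, assuming monthly counts stacked as `(T, 6, H, W)`; the exact trend-quarter offsets are my assumption, since the text only says "quarterly averages from further back":

```python
import numpy as np

def make_inputs(frames, t):
    """Slice closeness / period / trend inputs for predicting month t.

    frames: array of shape (T, 6, H, W) — monthly counts, 6 crime types.
    The index layout here is illustrative; the real pipeline may differ.
    """
    closeness = frames[t - 3:t]                # last 3 months
    period = frames[[t - 24, t - 12]]          # same month, 2 prior years
    # trend: the two quarters preceding the closeness window (assumed offsets)
    trend = np.stack([frames[t - 6:t - 3].mean(axis=0),
                      frames[t - 9:t - 6].mean(axis=0)])
    return closeness, period, trend

frames = np.zeros((48, 6, 20, 20))  # 48 months on an assumed 20×20 grid
c, p, tr = make_inputs(frames, 40)
print(c.shape, p.shape, tr.shape)  # (3, 6, 20, 20) (2, 6, 20, 20) (2, 6, 20, 20)
```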
Each branch independently processes its temporal slice through a stack of residual convolutional blocks, then a learned fusion layer combines the three outputs:
prediction = W_c · closeness + W_p · period + W_t · trend + bias
Where W_c, W_p, and W_t are learned weights that vary by grid cell. This is a nice touch — it means the model can decide that the CBD's crime is mostly driven by recent trends (closeness), while a residential suburb might be more seasonal (period). Different areas get different temporal recipes.
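Because the weights vary by cell, the fusion is element-wise (Hadamard) rather than a matrix multiply. A minimal PyTorch sketch, assuming 32-channel branch outputs and an illustrative 20×20 grid:

```python
import torch
import torch.nn as nn

class Fusion(nn.Module):
    """Parametric fusion: each branch gets its own weight per grid cell,
    so the closeness/period/trend mix varies across the map."""
    def __init__(self, channels, h, w):
        super().__init__()
        self.W_c = nn.Parameter(torch.ones(channels, h, w))
        self.W_p = nn.Parameter(torch.ones(channels, h, w))
        self.W_t = nn.Parameter(torch.ones(channels, h, w))
        self.bias = nn.Parameter(torch.zeros(channels, h, w))

    def forward(self, closeness, period, trend):
        # element-wise products, broadcast over the batch dimension
        return (self.W_c * closeness + self.W_p * period
                + self.W_t * trend + self.bias)

fuse = Fusion(channels=32, h=20, w=20)  # grid size is assumed for illustration
c = p = t = torch.randn(1, 32, 20, 20)
print(fuse(c, p, t).shape)  # torch.Size([1, 32, 20, 20])
```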
Residual blocks
Each branch uses residual convolutional units — the building blocks that made ResNet so successful in image recognition.
The key idea: instead of learning the full output at each layer, the network learns the residual — the difference between the desired output and the input. The identity shortcut connection means gradients flow cleanly through the network during training, which lets you stack more layers without the signal degrading.
ResUnit(X) = ReLU(Conv(ReLU(Conv(X))) + X)
That + X at the end is the skip connection. If the layer has nothing useful to add, it can learn weights near zero and just pass the input through. This makes deeper networks stable — important when you're trying to learn spatial features at multiple scales.
For our grid, I use 4 residual units per branch. Each unit has two 3×3 convolutional layers with 32 filters. That's deep enough to capture spatial relationships across several kilometres without being so deep that the model overfits on 36 months of training data.
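One unit, as the formula above describes, is a short PyTorch module. A sketch with the stated config (32 filters, 3×3 kernels, 4 units per branch); the 20×20 grid size is an assumption for illustration:

```python
import torch
import torch.nn as nn

class ResUnit(nn.Module):
    """ResUnit(X) = ReLU(Conv(ReLU(Conv(X))) + X)."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv2(torch.relu(self.conv1(x)))
        return torch.relu(out + x)  # the + X skip connection

# one branch = a stack of 4 units
branch = nn.Sequential(*[ResUnit(32) for _ in range(4)])
x = torch.randn(1, 32, 20, 20)
print(branch(x).shape)  # torch.Size([1, 32, 20, 20])
```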
The NZ-specific problem
Here's where theory meets reality, and it gets a bit awkward.
ST-ResNet was designed for dense, high-frequency data. The Beijing crowd flow paper used 30-minute intervals over months of data — thousands of timesteps. The crime papers that report strong results typically use daily data over several years.
We have 48 monthly timesteps. Total. The period branch — which looks at the same month in previous years — has at most 3 data points per month (2022, 2023, 2024 to predict 2025/2026). The trend branch is working with quarterly averages from a four-year window. It's not a lot of temporal data for an architecture that's specifically designed to decompose temporal patterns.
I had a feeling this would be the bottleneck, and it was.
Implementation
Closeness branch:
Input: last 3 months (3 × 6 channels = 18 input channels)
→ 4 ResUnits (32 filters, 3×3 kernels)
→ Output: 32 channels
Period branch:
Input: same month from 2 prior years (2 × 6 = 12 input channels)
→ 4 ResUnits (32 filters, 3×3 kernels)
→ Output: 32 channels
Trend branch:
Input: 2 quarterly averages (2 × 6 = 12 input channels)
→ 4 ResUnits (32 filters, 3×3 kernels)
→ Output: 32 channels
Fusion:
→ Learned weighted sum across branches
→ Conv2d(32, 6, 1×1) → 6 crime type predictions
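Putting the three branches together, the whole model fits in a page. A sketch under two assumptions not stated above: a 1×1 conv lifts each branch's input to 32 channels before the residual units, and the grid is 20×20:

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One temporal branch: 1x1 lift to 32 channels (assumed), then 4 ResUnits."""
    def __init__(self, in_ch, ch=32, n_units=4):
        super().__init__()
        self.lift = nn.Conv2d(in_ch, ch, kernel_size=1)
        self.convs = nn.ModuleList([
            nn.ModuleList([nn.Conv2d(ch, ch, 3, padding=1),
                           nn.Conv2d(ch, ch, 3, padding=1)])
            for _ in range(n_units)])

    def forward(self, x):
        x = self.lift(x)
        for c1, c2 in self.convs:
            x = torch.relu(c2(torch.relu(c1(x))) + x)  # residual unit
        return x

class STResNet(nn.Module):
    def __init__(self, h, w, ch=32, n_types=6):
        super().__init__()
        self.closeness = Branch(3 * n_types)  # 18 input channels
        self.period = Branch(2 * n_types)     # 12 input channels
        self.trend = Branch(2 * n_types)      # 12 input channels
        # per-cell fusion weights, one set per branch
        self.w = nn.Parameter(torch.ones(3, ch, h, w))
        self.bias = nn.Parameter(torch.zeros(ch, h, w))
        self.head = nn.Conv2d(ch, n_types, kernel_size=1)  # 1x1 conv to 6 types

    def forward(self, xc, xp, xt):
        fused = (self.w[0] * self.closeness(xc)
                 + self.w[1] * self.period(xp)
                 + self.w[2] * self.trend(xt)
                 + self.bias)
        return self.head(fused)

model = STResNet(h=20, w=20)  # grid size assumed for illustration
out = model(torch.randn(1, 18, 20, 20),
            torch.randn(1, 12, 20, 20),
            torch.randn(1, 12, 20, 20))
print(out.shape)  # torch.Size([1, 6, 20, 20])
```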
Total parameters: roughly 180k. Slightly smaller than the ConvLSTM, which is fine — ST-ResNet's power is supposed to come from the temporal decomposition, not from model size.
Training uses the same setup as ConvLSTM: Adam optimiser, learning rate 1e-4, MSE loss on log1p-transformed values, early stopping with patience of 15 epochs. On CPU, each run takes about 35 minutes — a bit faster than ConvLSTM since there's no sequential recurrence to deal with.
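The training loop is standard. A sketch of the setup described above; `model`, `train_loader`, and `val_loader` are hypothetical stand-ins for the real pipeline, and I'm assuming the model predicts directly in log1p space so only the targets need transforming:

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, max_epochs=500, patience=15):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for (xc, xp, xt), y in train_loader:
            opt.zero_grad()
            # MSE on log1p-transformed counts
            loss = loss_fn(model(xc, xp, xt), torch.log1p(y))
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(xc, xp, xt), torch.log1p(y)).item()
                      for (xc, xp, xt), y in val_loader) / len(val_loader)
        if val < best - 1e-6:
            best, wait = val, 0
        else:
            wait += 1
            if wait >= patience:  # early stopping, patience of 15 epochs
                break
    return model
```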
Results
| Crime Type | Hist. Avg MAE | ConvLSTM MAE | ST-ResNet MAE |
|---|---|---|---|
| Theft | 1.28 | 1.14 | 1.18 |
| Burglary | 0.35 | 0.32 | 0.33 |
| Assault | 0.20 | 0.19 | 0.19 |
| Robbery | 0.04 | 0.04 | 0.04 |
| Sexual | 0.03 | 0.03 | 0.03 |
| Harm | 0.01 | 0.01 | 0.01 |
| All types | 0.39 | 0.35 | 0.36 |
ST-ResNet beats the historical average but doesn't quite match ConvLSTM. The aggregate MAE of 0.36 is a 7.7% improvement over the baseline, compared to ConvLSTM's 10.3%.
That's not a terrible result, but it's not what I was hoping for.
Why ConvLSTM wins here
When I dug into the learned fusion weights, the story became clear. The closeness branch dominates — it gets 60–70% of the weight across most grid cells. The period branch gets 20–25%, and the trend branch barely contributes at 10–15%.
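Per-branch shares like these can be read straight off the fusion parameters. A sketch, assuming the weights live in per-cell tensors `W_c`, `W_p`, `W_t` of shape `(channels, H, W)` (names and shapes are illustrative):

```python
import torch

def branch_shares(W_c, W_p, W_t):
    """Fraction of absolute fusion weight each branch gets, per grid cell."""
    mags = torch.stack([W_c.abs(), W_p.abs(), W_t.abs()])   # (3, C, H, W)
    shares = mags / mags.sum(dim=0, keepdim=True)           # normalise per cell
    # average over channels -> one share map per branch, shape (3, H, W)
    return shares.mean(dim=1)

shares = branch_shares(torch.rand(32, 8, 8) + 1.0,  # dummy weights
                       torch.rand(32, 8, 8),
                       torch.rand(32, 8, 8))
print(shares.shape)  # torch.Size([3, 8, 8])
```

The three share maps sum to 1 at every cell, so they can be plotted directly as "temporal recipe" maps.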
The model is basically saying: "Recent months matter most, seasonal patterns help a bit, and long-term trends are mostly noise." That's not a failure of the architecture — it's a fair assessment of what's in the data.
With only 2–3 examples of each calendar month, the period branch can't reliably learn seasonal patterns. It's overfitting to individual years rather than extracting a stable seasonal signal. ConvLSTM handles this better because it processes the full sequence and implicitly learns seasonality from the continuous flow of months, without needing to explicitly align calendar periods.
The trend branch suffers even more. Quarterly averages over a four-year window don't give it much to work with. In the original crowd flow papers with years of half-hourly data, the trend branch captures genuine long-term shifts in population movement. Here, it's essentially learning a constant.
Where ST-ResNet does shine
Despite losing on aggregate, ST-ResNet has one clear advantage: it's better at predicting seasonal transitions.
ST-ResNet handles the months where crime shifts gears — the spring uptick in September/October and the February dip — more gracefully than ConvLSTM. The period branch, sparse as its data is, does capture enough of the annual rhythm to anticipate these transitions a bit earlier.
ConvLSTM tends to lag these transitions by about a month. It needs to "see" the uptick starting before it predicts continuation. ST-ResNet, by explicitly looking at last year's same month, can anticipate the shift before it fully materialises in the recent sequence.
For an operational forecasting tool, that one-month lead time on seasonal transitions could be genuinely valuable. But in our test set metrics, it's a small advantage that doesn't overcome ST-ResNet's overall weaker performance on month-to-month dynamics.
Head to head
| Metric | Historical Avg | ConvLSTM | ST-ResNet |
|---|---|---|---|
| Overall MAE | 0.39 | 0.35 | 0.36 |
| Theft MAE | 1.28 | 1.14 | 1.18 |
| Training time (CPU) | N/A | ~40 min | ~35 min |
| Parameters | 0 | ~200k | ~180k |
| Seasonal transitions | Poor | Lagging | Better |
| Spatial dynamics | None | Good | Good |
ConvLSTM is the better model for this specific dataset. Not by a lot — we're talking about small differences on already-small error values. But consistently better on the main crime types that have enough signal to matter.
Neither model is a revelation. A 7–10% improvement over "just use the historical average" is real but modest. Deep learning's strengths — learning complex nonlinear dynamics from huge datasets — are somewhat wasted on 48 monthly timesteps over a relatively low-crime city.
If I had daily data instead of monthly, or ten years instead of four, I'd expect ST-ResNet to close the gap or pull ahead. Its architecture is fundamentally sound — the temporal decomposition is a genuinely good idea. It's just starved of the data it needs to shine.
Both models meaningfully beat the baselines. Both learn spatial patterns that simple averages can't capture. And both are honest about the sparse crime types — they predict near-zero and move on, which is the right call.
Next up: we'll take these predictions and build something you can actually look at. A 3D interactive dashboard where you can watch crime patterns evolve across Auckland over time. The modelling was the hard bit — making it visual is the fun bit.