Every machine learning project needs a reality check.
It's tempting to jump straight to the neural network — that's the exciting bit, right? But if you don't establish what a dead-simple model can do first, you've got no idea whether your fancy architecture is actually learning anything useful or just being expensive.
So before ConvLSTM gets anywhere near this data, we're going to throw three gloriously simple baselines at it and see how they do.
## Persistence: next month equals this month
The dumbest possible model. To predict April, just use March's values. Every cell, every crime type — carbon copy.
It sounds ridiculous, but it works surprisingly well when patterns are stable. And as we saw in the EDA, Auckland's crime hotspots are remarkably persistent. The CBD doesn't suddenly go quiet. South Auckland doesn't randomly calm down.
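Persistence is essentially one line in practice. Here's a minimal sketch, assuming the data lives in a NumPy tensor of shape `(months, crime_types, rows, cols)`; the shape, names, and toy data are illustrative, not the project's actual code:

```python
import numpy as np

def persistence_forecast(history: np.ndarray) -> np.ndarray:
    """Predict the next month as a carbon copy of the latest month.

    history: array of shape (months, crime_types, rows, cols).
    """
    return history[-1]

# Toy data standing in for the real tensor (shape is an assumption).
rng = np.random.default_rng(42)
data = rng.poisson(0.3, size=(42, 6, 20, 20)).astype(float)

# Walk forward over a six-month test window, scoring each prediction.
maes = []
for t in range(36, 42):
    pred = persistence_forecast(data[:t])
    maes.append(np.abs(pred - data[t]).mean())
print(f"Persistence MAE on toy data: {np.mean(maes):.3f}")
```

The walk-forward loop matters: each test month is predicted using only the data available before it, so there's no leakage from the test window.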
On the six-month test set (August 2025 – January 2026):
| Crime Type | MAE | RMSE |
|---|---|---|
| Theft | 1.42 | 3.18 |
| Burglary | 0.38 | 0.91 |
| Assault | 0.22 | 0.64 |
| Robbery | 0.04 | 0.15 |
| Sexual | 0.03 | 0.12 |
| Harm | 0.01 | 0.04 |
Those MAE numbers for theft and burglary look small until you remember that most cells are zero. For the active cells — the ones we actually care about — the error is larger. A busy CBD cell might have 35 thefts in one month and 28 the next. Persistence would be off by 7 there, which is a 20% miss on an important prediction.
## Seasonal naive: same month last year
Instead of copying last month, copy the same month from the previous year. January 2026 gets predicted from January 2025. This should capture seasonal patterns — the summer spike, the February dip.
The catch? We only have four years of data. The test set months (August–January) each have at most three prior examples of the same month. That's not a lot of seasonal training data.
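The same walk-forward harness works for seasonal naive; only the lookup changes. A sketch under the same assumed `(months, crime_types, rows, cols)` layout:

```python
import numpy as np

def seasonal_naive_forecast(history: np.ndarray, period: int = 12) -> np.ndarray:
    """Predict the next month using the same calendar month one year earlier."""
    if history.shape[0] < period:
        raise ValueError("need at least one full year of history")
    return history[-period]

# Toy stand-in for the real tensor (shape is an assumption).
rng = np.random.default_rng(42)
data = rng.poisson(0.3, size=(42, 6, 20, 20)).astype(float)

maes = []
for t in range(36, 42):
    pred = seasonal_naive_forecast(data[:t])  # i.e. month t - 12
    maes.append(np.abs(pred - data[t]).mean())
print(f"Seasonal naive MAE on toy data: {np.mean(maes):.3f}")
```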
| Crime Type | MAE | RMSE |
|---|---|---|
| Theft | 1.51 | 3.42 |
| Burglary | 0.41 | 0.97 |
| Assault | 0.24 | 0.68 |
| Robbery | 0.05 | 0.17 |
| Sexual | 0.04 | 0.13 |
| Harm | 0.01 | 0.04 |
Slightly worse than persistence across the board. That surprised me initially — shouldn't capturing seasonality help?
The issue is that the 2023-to-2025 decline we spotted in the EDA bites hard here. If you predict January 2026 from January 2025, you're using data from a period when crime was higher. The seasonal pattern is real, but the year-over-year trend works against it. With more years of data, seasonal naive would likely pull ahead.
## Historical average: the mean of all training months
For each cell and crime type, take the average across all 36 training months. This smooths out month-to-month noise and gives you a "typical" value for each location.
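Unlike the other two baselines, this one is computed once from the training window and reused for every test month. A sketch, again assuming an illustrative `(months, crime_types, rows, cols)` tensor:

```python
import numpy as np

def historical_average_forecast(history: np.ndarray) -> np.ndarray:
    """Per-cell, per-type mean over all training months."""
    return history.mean(axis=0)

# Toy stand-in for the real tensor (shape is an assumption).
rng = np.random.default_rng(42)
data = rng.poisson(0.3, size=(42, 6, 20, 20)).astype(float)

# Freeze the average on the 36 training months, then score it
# against each of the six held-out test months.
avg = historical_average_forecast(data[:36])
maes = [np.abs(avg - data[t]).mean() for t in range(36, 42)]
print(f"Historical average MAE on toy data: {np.mean(maes):.3f}")
```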
| Crime Type | MAE | RMSE |
|---|---|---|
| Theft | 1.28 | 2.95 |
| Burglary | 0.35 | 0.84 |
| Assault | 0.20 | 0.58 |
| Robbery | 0.04 | 0.14 |
| Sexual | 0.03 | 0.11 |
| Harm | 0.01 | 0.04 |
The best baseline. By averaging over three years, it smooths out the month-to-month noise and the year-over-year trend simultaneously. It won't capture seasonal peaks or sudden changes, but for the "typical month" prediction it's solid.
## Why MAPE breaks down
You might wonder why I'm not reporting MAPE (Mean Absolute Percentage Error) — it's the standard metric in a lot of forecasting work. The reason: sparse data.
MAPE divides the error by the actual value. When the actual value is zero, which it is for 91.7% of our tensor, the division is undefined. And even for cells with small counts the metric is wildly unstable: if the actual count is 1, predicting 0 is a 100% error, and predicting 2 is also a 100% error, so a single-crime miss dominates the score as much as a huge miss on a busy cell.
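A tiny numerical demonstration of the failure mode (the values here are made up for illustration):

```python
import numpy as np

actual = np.array([0.0, 1.0, 2.0, 35.0])   # most cells are zero
pred   = np.array([1.0, 0.0, 1.0, 28.0])

# MAPE divides by the actual value: undefined at zero, explosive near it.
with np.errstate(divide="ignore"):
    ape = np.abs(pred - actual) / actual
print(ape)  # inf on the zero cell; 100% for each one-crime miss

# MAE reports the same errors in plain crime counts.
mae = np.abs(pred - actual).mean()
print(f"MAE: {mae}")  # (1 + 1 + 1 + 7) / 4 = 2.5
```

The busy cell (35 actual, 28 predicted) has by far the largest absolute error, yet its percentage error (20%) is the smallest of the four. That inversion is exactly why MAPE misleads on sparse count data.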
MAE and RMSE are more honest here. They tell you the absolute magnitude of your errors in actual crime counts, which is what we care about. A miss of 3 victimisations means the same thing whether the cell usually has 5 or 50.
## The bar to clear
Here's the scoreboard going forward. Any deep learning model needs to beat the historical average baseline to justify its existence:
| Crime Type | Historical Avg MAE | Historical Avg RMSE |
|---|---|---|
| Theft | 1.28 | 2.95 |
| Burglary | 0.35 | 0.84 |
| Assault | 0.20 | 0.58 |
| All types | 0.39 | 0.95 |
Theft is the easiest to beat because there's the most signal — high counts, clear spatial patterns, strong seasonality. Robbery, sexual offences, and harm are essentially noise at this resolution. The models will probably predict near-zero for those types and be mostly correct.
The real test will be the middle ground — can ConvLSTM or ST-ResNet predict the changes in theft and burglary better than a static average? Can they catch the months where a cell spikes or dips? That's where simple baselines fall flat, because they don't model dynamics at all.
If the deep learning can't meaningfully beat "just use the average," then it's not worth the CPU cycles. Or in my case, the many hours of a Ryzen 5 grinding away.