A crime record that says "Woodglen, meshblock 0284305" is useless for spatial modelling. It's a name and a number. You can't plot it, you can't measure distances from it, and you definitely can't feed it to a neural network that thinks in grid cells.
To do anything spatial, every record needs actual coordinates — latitude, longitude, or ideally metres on a proper projection. That means downloading Stats NZ's geographic boundary files and joining them to our crime data.
NZ's geographic hierarchy
New Zealand has a neat nested system of geographic units maintained by Stats NZ:
graph TD
A["Region (16)"] --> B["Territorial Authority (67)"]
B --> C["Area Unit / SA2 (~2,000)"]
C --> D["Meshblock (~53,000)"]
Regions are the big ones — Auckland, Canterbury, Wellington. Territorial authorities are your cities and districts. Area units are roughly suburb-sized. And meshblocks are the smallest unit — about 100 people each, roughly a city block. Our crime data uses area units and meshblocks, so those are the layers we need.
There's a gotcha here. Stats NZ replaced "Area Units" with "Statistical Area 2" (SA2) in 2018 as part of a geographic classification overhaul. But the NZ Police crime data still uses the old area unit names. So we need the 2017 vintage boundary files, not the current ones. Use the wrong vintage and your join silently fails on hundreds of area units. Ask me how I know.
Three boundary files
We downloaded three layers from Stats NZ DataFinder via their WFS API:
| Layer | Features | Size |
|---|---|---|
| Area Unit 2017 (generalised) | 2,004 | 88 MB |
| Meshblock 2018 (generalised) | 53,589 | 213 MB |
| Territorial Authority 2023 | 68 | 34 MB |
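For anyone wanting to reproduce the download, here is a rough sketch of how a WFS `GetFeature` request could be put together. The query parameters are standard WFS 2.0. The endpoint pattern (API key embedded in the path) and the `layer-00000` id are assumptions for illustration, not the exact values we used, so check DataFinder for the real layer ids.

```python
from urllib.parse import urlencode

# Assumed endpoint pattern; confirm against the DataFinder documentation.
BASE_URL = "https://datafinder.stats.govt.nz/services;key={api_key}/wfs"


def build_wfs_url(api_key: str, layer_id: str, srs: str = "EPSG:2193") -> str:
    """Build a WFS 2.0 GetFeature URL that asks for GeoJSON in the given CRS."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": layer_id,   # hypothetical layer id, e.g. "layer-00000"
        "srsName": srs,          # keep everything in NZTM2000 metres
        "outputFormat": "application/json",
    }
    return BASE_URL.format(api_key=api_key) + "?" + urlencode(params)


url = build_wfs_url("YOUR_API_KEY", "layer-00000")
print(url)
```

Requesting `srsName=EPSG:2193` up front means the file arrives already in metres, so there is no reprojection step to forget.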
All three come in EPSG:2193 — that's NZTM2000, New Zealand's official projected coordinate system. The units are metres, not degrees. This matters a lot later when we need to build a "500m grid" — you want that to be 500 actual metres, not some approximation based on latitude.
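To see why "some approximation based on latitude" is a real problem, here is a quick back-of-envelope sketch. It uses a spherical-Earth approximation (about 111.32 km per degree at the equator), which is an assumption on my part and not something the pipeline relies on. The city latitudes are rounded.

```python
import math

KM_PER_DEGREE_AT_EQUATOR = 111.32  # spherical approximation, not survey-grade


def km_per_degree_lon(lat_deg: float) -> float:
    """Approximate ground length of one degree of longitude at a latitude."""
    return KM_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(lat_deg))


# Rounded latitudes for the two ends of the country.
for name, lat in [("Auckland", -36.85), ("Invercargill", -46.41)]:
    print(f"{name}: {km_per_degree_lon(lat):.1f} km per degree of longitude")
```

The same number of degrees covers noticeably less ground in Invercargill than in Auckland, so a grid defined in degrees has cells that change size as you move south. In NZTM2000 that problem simply doesn't come up.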
We use generalised (simplified) versions rather than high-definition. The full-resolution meshblock layer is over a gigabyte. For centroid calculations and spatial joins, the generalised versions are more than accurate enough.
The area unit join: 99.4%
Joining crime records to area unit boundaries by name was almost perfect. 1,146,721 of 1,154,102 records matched — 99.4%.
Only two area unit codes failed:
- `999999`: the official "unspecified" catch-all (7,331 records)
- `-29`: a straight-up data entry error (50 records)
That's an excellent result. The unmatched records aren't a bug in our pipeline. They're crimes the police couldn't assign to a specific area, so there's no location to recover. Nothing we can do about those, and nothing we should try to do.
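The habit worth keeping from this step is to tally what failed to match rather than just looking at the match rate. Below is a minimal sketch of that idea in plain Python. The function name, the `area_unit` field, and the toy coordinates are all made up for illustration. The real join ran on the full dataset, not a handful of dicts.

```python
from collections import Counter


def join_area_units(records, boundaries, key="area_unit"):
    """Left-join records to boundary centroids and tally the keys that miss."""
    matched = []
    unmatched = Counter()
    for rec in records:
        centroid = boundaries.get(rec[key])
        if centroid is None:
            unmatched[rec[key]] += 1
        else:
            matched.append({**rec, "au_x": centroid[0], "au_y": centroid[1]})
    return matched, unmatched


# Toy data: the coordinates are placeholders, not real centroids.
boundaries = {"Woodglen": (1745000.0, 5918000.0)}
records = [
    {"id": 1, "area_unit": "Woodglen"},
    {"id": 2, "area_unit": "999999"},
    {"id": 3, "area_unit": "-29"},
    {"id": 4, "area_unit": "Woodglen"},
]

matched, unmatched = join_area_units(records, boundaries)
match_rate = len(matched) / len(records)
print(f"matched {match_rate:.1%}, unmatched keys: {dict(unmatched)}")
```

On the real numbers the tally closes neatly: 7,331 + 50 = 7,381, which is exactly 1,154,102 minus 1,146,721. When the unmatched counts account for the whole gap, you know nothing else is silently falling through.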
The meshblock join: 81.2%
The meshblock join came in lower at 81.2% — 937,604 records matched out of 1,154,102.
This is expected and it's fine. Here's why: NZ meshblock boundaries get revised with every census. We're using 2018 boundaries, but our crime data runs through January 2026. Any crime from 2023 onwards might reference a 2023-vintage meshblock code that simply doesn't exist in the 2018 file. Some meshblocks get split, some get merged, some get renumbered entirely.
81.2% still gives us fine-grained coordinates for the vast majority of records. For the ~19% that miss, we fall back to the area unit centroid. It's less precise — suburb-level instead of block-level — but it's better than dropping the records entirely.
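The fallback itself is only a few lines. Here is a sketch under stated assumptions: `locate`, the `meshblock` and `area_unit` field names, and the lookup dicts are hypothetical stand-ins, and the coordinates in the usage lines are placeholders.

```python
def locate(record, mb_centroids, au_centroids):
    """Return (x, y, source) in NZTM2000 metres, preferring the meshblock.

    Falls back to the area unit centroid when the meshblock code is not in
    the boundary file, and returns None when neither lookup succeeds.
    """
    mb = mb_centroids.get(record.get("meshblock"))
    if mb is not None:
        return mb[0], mb[1], "meshblock"
    au = au_centroids.get(record.get("area_unit"))
    if au is not None:
        return au[0], au[1], "area_unit"
    return None


# Placeholder centroids for illustration only.
mb_centroids = {"0284305": (1745120.0, 5918340.0)}
au_centroids = {"Woodglen": (1745000.0, 5918000.0)}

hit = locate({"meshblock": "0284305", "area_unit": "Woodglen"}, mb_centroids, au_centroids)
miss = locate({"meshblock": "9999999", "area_unit": "Woodglen"}, mb_centroids, au_centroids)
print(hit, miss)
```

Carrying the `source` label along is a cheap way to let later stages know which coordinates are block-level and which are only suburb-level.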
Two coordinate systems
This is one of those things that seems like a minor detail but will bite you hard if you get it wrong. We use two coordinate reference systems throughout the project:
NZTM2000 (EPSG:2193) for all spatial analysis. The units are metres, which makes grid construction trivial — a 500m cell is literally 500 units on each axis. Distance calculations are straightforward. No need to worry about the fact that a degree of longitude means different things at different latitudes.
WGS84 (EPSG:4326) for the frontend dashboard only. deck.gl and MapLibre expect coordinates in degrees (latitude/longitude), which is the standard for web mapping.
The rule is simple: do everything in NZTM2000, convert to WGS84 at the very end when exporting for the dashboard. Mixing coordinate systems mid-pipeline is a recipe for bugs that are incredibly annoying to track down.
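To make the "grid construction is trivial in metres" claim concrete, here is a minimal sketch of assigning a point to a 500m cell. The grid origin and the sample point are hypothetical values picked for illustration. The actual grid is built in the next post.

```python
import math

CELL_SIZE_M = 500  # cell edge length in metres


def grid_cell(easting, northing, origin_e, origin_n, cell=CELL_SIZE_M):
    """Map an NZTM2000 coordinate to a (col, row) index on a square grid."""
    col = math.floor((easting - origin_e) / cell)
    row = math.floor((northing - origin_n) / cell)
    return col, row


# Hypothetical grid origin (south-west corner) and sample point, in metres.
origin_e, origin_n = 1_750_000.0, 5_900_000.0
cell = grid_cell(1_757_250.0, 5_920_750.0, origin_e, origin_n)
print(cell)
```

That's the whole thing: a subtraction, a division, and a floor. Doing the same in degrees would need a latitude-dependent correction on one axis, which is exactly the sort of detail that gets forgotten mid-pipeline.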
The output
Each crime record now has up to 8 new geographic columns — area unit centroids, meshblock centroids, and areas in both coordinate systems. The enriched dataset saves as crimes_with_geo.parquet at 21.9 MB with 29 columns.
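For readers curious what a "centroid" and "area" actually are under the hood, here is a sketch of the standard shoelace calculation for a single simple polygon ring. It is an illustration, not our pipeline code. Real meshblocks can be multipolygons with holes, and a geometry library handles those cases, so treat this as the idea only.

```python
def polygon_area_and_centroid(ring):
    """Shoelace area and centroid of a simple polygon.

    `ring` is a list of (x, y) vertices in metres, without repeating the
    first vertex at the end. Holes and multipolygons are not handled.
    """
    twice_area = 0.0
    cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    signed_area = twice_area / 2.0
    centroid = (cx / (6.0 * signed_area), cy / (6.0 * signed_area))
    return abs(signed_area), centroid


# A 100 m x 100 m square in local metre coordinates.
square = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
area, centroid = polygon_area_and_centroid(square)
print(area, centroid)
```

Because the inputs are in metres, the area comes out in square metres with no further conversion, which is one more reason to compute it in NZTM2000 first.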
Quick sanity check: Auckland's mean crime centroid lands at lat -36.90, lon 174.78 — right in the middle of the urban area. If that number had come back as somewhere in the Waikato, we'd know something went wrong.
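A check like this is easy to automate. The sketch below assumes a rough, hand-picked bounding box around urban Auckland. The box limits and function names are my own placeholders, so tune them before relying on them.

```python
from statistics import fmean

# Rough bounding box (lat_min, lat_max, lon_min, lon_max); an assumption.
AUCKLAND_BBOX = (-37.10, -36.60, 174.50, 175.10)


def mean_centroid(points):
    """Mean latitude and longitude of a list of (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return fmean(lats), fmean(lons)


def looks_like_auckland(lat, lon, bbox=AUCKLAND_BBOX):
    """True when the point falls inside the rough Auckland bounding box."""
    lat_min, lat_max, lon_min, lon_max = bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


# The mean reported in this post passes; a point near Hamilton does not.
print(looks_like_auckland(-36.90, 174.78))
print(looks_like_auckland(-37.79, 175.28))
```

Averaging raw latitudes and longitudes is fine for a smoke test over a single city. It is not a substitute for doing the real analysis in NZTM2000.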
What's next
Every crime record now has a place in physical space. But individual points aren't what the neural network needs — it needs a regular grid. In the next post, we'll overlay a 500m × 500m grid on Auckland, count crimes per cell per month, and build the 4D tensor that turns crime prediction into a video prediction problem.