The very first thing NZ Police's crime dataset teaches you is that government data is never straightforward.
You download the CSV from policedata.nz, expecting to do a quick pd.read_csv() and start exploring. Instead you get a 503MB file encoded in UTF-16 Little Endian with tab delimiters. Not a regular CSV. Not even close. This is a legacy format from old Excel exports and most tools just silently corrupt it if you try to read it as UTF-8.
df = pd.read_csv("data.csv", encoding="utf-16-le", sep="\t")
That one line took longer to figure out than I'd like to admit.
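A quick way to catch this before anything gets corrupted is to peek at the file's first bytes: UTF-16 LE exports usually start with the byte-order mark `0xFF 0xFE`. A minimal sketch (the helper name is mine, not part of the pipeline):

```python
def sniff_utf16_le(path: str) -> bool:
    """Return True if the file starts with a UTF-16 LE byte-order mark."""
    with open(path, "rb") as f:
        return f.read(2) == b"\xff\xfe"
```

Two bytes of reading can save you from a run of mojibake column names.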
What's actually in here
Once you get past the encoding, there's a lot to work with. 1,154,102 rows covering every reported victimisation in New Zealand from February 2022 through January 2026. Each row tells you the crime type (ANZSOC Division), where it happened (down to meshblock level), when it happened (month, day of week, hour of day), and sometimes what weapon was involved.
There are 20 columns, but five of them are useless — three are just duplicates of "Year Month" and two are constants that add zero information. Every area name and territorial authority has a trailing period stuck on the end: "Auckland.", "Woodglen.", "Christchurch City.". A quirk of the export that'll break any geographic join if you don't strip them.
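The fix is a one-liner per column — a sketch with illustrative column names and toy data, assuming the snake_case names from the cleaned schema:

```python
import pandas as pd

# Toy stand-in for the raw frame; column names are illustrative.
df = pd.DataFrame({
    "area_unit": ["Woodglen.", "Auckland Central"],
    "territorial_authority": ["Auckland.", "Christchurch City."],
})

# Strip the export's trailing periods from geographic name columns.
for col in ["area_unit", "territorial_authority"]:
    df[col] = df[col].str.rstrip(".")
```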
And meshblock IDs? Some are 6 digits, some are 7. Stats NZ boundary files use 7-digit codes consistently, so shorter ones need zero-padding. The kind of thing that's invisible until your join silently drops 19% of your records and you spend an afternoon figuring out why.
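The padding itself is trivial once you know it's needed — a sketch, with the key detail being that meshblocks must be read as strings, because as integers the leading zeros are already gone:

```python
import pandas as pd

# Read meshblock IDs as strings, then normalise to the 7-digit
# codes Stats NZ boundary files use.
mb = pd.Series(["123456", "1234567"], dtype="string")
mb = mb.str.zfill(7)
```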
The interesting absence
Here's the bit that actually made me stop and think. 32.2% of records have the hour of day recorded as 99 — unknown. Another 23.2% have the day of week as "UNKNOWN".
At first this looks like a data quality problem. But it's not — it's actually telling you something about the nature of the crime. If someone breaks into your house while you're at work, you come home to find your stuff gone. Was it 9am or 2pm? You've got no idea, and neither do the police.
Property crimes — theft, burglary — make up the bulk of these unknowns. Assault, by contrast, almost always has a precise time because there's a victim present when it happens. The absence of data is itself a signal about what kind of crime you're looking at.
78.6% of location type values are just "." — effectively missing. That column is sparsely populated but still useful for the roughly one in five records that have it.
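Whatever the sentinel means, it shouldn't masquerade as a real observation. A sketch of the unknown-encoding step, with illustrative column names and toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw frame; column names are illustrative.
df = pd.DataFrame({
    "hour": [14, 99, 9],
    "day_of_week": ["MONDAY", "UNKNOWN", "FRIDAY"],
    "location_type": ["DWELLING", ".", "STREET"],
})

# Map each column's sentinel "unknown" value to a proper NaN.
df["hour"] = df["hour"].replace(99, np.nan)
df["day_of_week"] = df["day_of_week"].replace("UNKNOWN", np.nan)
df["location_type"] = df["location_type"].replace(".", np.nan)
```

Once the sentinels are NaN, aggregations and plots skip them automatically instead of treating hour 99 as a very late night.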
Cleaning it up
We built a modular pipeline where each cleaning step is its own function. Nothing fancy, just practical:
def ingest() -> pd.DataFrame:
    df = load_raw_csv(RAW_CSV)       # UTF-16 LE, tab-delimited
    df = drop_redundant_columns(df)  # Remove 5 useless columns
    df = rename_columns(df)          # snake_case everything
    df = parse_dates(df)             # "July 2022" → datetime
    df = clean_strings(df)           # Strip trailing periods
    df = clean_meshblocks(df)        # Zero-pad to 7 digits
    df = encode_unknowns(df)         # 99 → NaN, "UNKNOWN" → NaN
    df = map_crime_types(df)         # ANZSOC Division → short enum
    return df
Each function does one thing. If something breaks, you know exactly where. If someone wants to understand the pipeline, they can read it top to bottom in about thirty seconds. I've been bitten enough times by monolithic data scripts that I'm allergic to them now.
The crime type mapping turns six ANZSOC Division values into short enums:
| Crime Type | Count | Share |
|---|---|---|
| Theft | 761,977 | 66.0% |
| Burglary | 247,034 | 21.4% |
| Assault | 115,383 | 10.0% |
| Robbery | 14,860 | 1.3% |
| Sexual | 13,943 | 1.2% |
| Harm | 905 | 0.1% |
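The mapping itself can be as simple as a dict lookup. A sketch — the ANZSOC Division strings below are my assumptions about what appears in the raw data, so check them against the actual file before trusting the mapping:

```python
import pandas as pd

# Assumed ANZSOC Division labels → short enums; verify against the raw data.
CRIME_TYPE_MAP = {
    "Theft and Related Offences": "theft",
    "Unlawful Entry With Intent/Burglary, Break and Enter": "burglary",
    "Acts Intended to Cause Injury": "assault",
    "Robbery, Extortion and Related Offences": "robbery",
    "Sexual Assault and Related Offences": "sexual",
    "Abduction, Harassment and Other Related Offences Against a Person": "harm",
}

divisions = pd.Series(["Theft and Related Offences",
                       "Acts Intended to Cause Injury"])
crime_type = divisions.map(CRIME_TYPE_MAP)
```

Using `.map()` rather than `.replace()` has a useful side effect: any division string not in the dict becomes NaN, which your sanity checks will catch instead of silently passing through.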
That 66% theft number is going to haunt us when we get to model training. Any loss function you throw at this data will overwhelmingly optimise for predicting theft, because that's two-thirds of everything. The class imbalance is real and it matters.
503MB to 6.3MB
The cleaned output goes to Apache Parquet with snappy compression. The result?
- Input: 503MB CSV (UTF-16, 20 columns)
- Output: 6.3MB Parquet (21 columns including derived fields)
- Compression: ~80x
That's not a typo. Parquet's columnar storage is dramatically more efficient than row-oriented CSV, especially when you've got columns full of repeated values like crime types and territorial authorities. The file loads in under a second compared to 3+ seconds for the CSV. When you're iterating on analysis and loading this data hundreds of times, that adds up fast.
The 21 output columns are the original columns we kept plus the derived fields: a proper datetime, year, month, day-of-week as an integer, and the short crime type enum.
Sanity checks
Before calling the data clean, we verify everything that matters:
- Row count: 1,154,102 — all rows preserved, nothing dropped
- No nulls in key columns: crime_type, date, area_unit, territorial_authority, meshblock
- Date range: Feb 2022 to Jan 2026 — all 48 months present
- Auckland: 412,669 records, 36% of total — exactly where it should be
- Theft: 761,977 records, 66% — as expected
- No trailing periods anywhere in area names
- All meshblock IDs are 7 digits
- Max hour value is 23 — no more 99s leaking through
You want these checks automated and running every time you regenerate the data. Future you will thank past you when something upstream changes and a check catches it.
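A sketch of what that automated pass might look like — column names mirror the cleaned schema, and the expected row count is a parameter so the function stays testable on small frames:

```python
import pandas as pd

def validate(df: pd.DataFrame, expected_rows: int = 1_154_102) -> None:
    """Run the sanity checks; raise AssertionError on the first failure."""
    assert len(df) == expected_rows, "row count changed"
    key_cols = ["crime_type", "date", "area_unit",
                "territorial_authority", "meshblock"]
    assert df[key_cols].notna().all().all(), "nulls in key columns"
    assert (df["meshblock"].str.len() == 7).all(), "bad meshblock length"
    assert not df["area_unit"].str.endswith(".").any(), "trailing periods"
    assert df["hour"].max() <= 23, "sentinel hour leaked through"
```

Wire it into the end of `ingest()` and a broken upstream export fails loudly at regeneration time instead of quietly skewing an analysis weeks later.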
What's next
We've got clean, compressed crime data — but the records only have meshblock IDs and area unit names. No coordinates. No shapes on a map. In the next post, we'll download Stats NZ geographic boundary files and join them to our crime records, giving every victimisation a place in physical space.