The very first thing NZ Police's crime dataset teaches you is that government data is never straightforward.
You download the CSV from policedata.nz, expecting to do a quick pd.read_csv() and start exploring. Instead you get a 503MB file encoded in UTF-16 Little Endian with tab delimiters. Not a regular CSV. Not even close. This is a legacy format from old Excel exports and most tools just silently corrupt it if you try to read it as UTF-8.
df = pd.read_csv("data.csv", encoding="utf-16-le", sep="\t")
That one line took longer to figure out than I'd like to admit.
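A quick way to catch this before anything gets corrupted is to peek at the file's first bytes: UTF-16 LE exports usually start with the byte-order mark `0xFF 0xFE`. A minimal sketch (the helper name is mine, not part of the pipeline):

```python
def sniff_utf16_le(path: str) -> bool:
    """Return True if the file starts with a UTF-16 LE byte-order mark."""
    with open(path, "rb") as f:
        return f.read(2) == b"\xff\xfe"
```

Two bytes of reading can save you from a run of mojibake column names.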
What's actually in here
Once you get past the encoding, there's a lot to work with. 1,154,102 rows covering every reported victimisation in New Zealand from February 2022 through January 2026. Each row tells you the crime type (ANZSOC Division), where it happened (down to meshblock level), when it happened (month, day of week, hour of day), and sometimes what weapon was involved.
There are 20 columns, but five of them are useless — three are just duplicates of "Year Month" and two are constants that add zero information. Every area name and territorial authority has a trailing period stuck on the end: "Auckland.", "Woodglen.", "Christchurch City.". A quirk of the export that'll break any geographic join if you don't strip them.
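The fix is a one-liner per column — a sketch with illustrative column names and toy data, assuming the snake_case names from the cleaned schema:

```python
import pandas as pd

# Toy stand-in for the raw frame; column names are illustrative.
df = pd.DataFrame({
    "area_unit": ["Woodglen.", "Auckland Central"],
    "territorial_authority": ["Auckland.", "Christchurch City."],
})

# Strip the export's trailing periods from geographic name columns.
for col in ["area_unit", "territorial_authority"]:
    df[col] = df[col].str.rstrip(".")
```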
And meshblock IDs? Some are 6 digits, some are 7. Stats NZ boundary files use 7-digit codes consistently, so shorter ones need zero-padding. The kind of thing that's invisible until your join silently drops 19% of your records and you spend an afternoon figuring out why.
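The padding itself is trivial once you know it's needed — a sketch, with the key detail being that meshblocks must be read as strings, because as integers the leading zeros are already gone:

```python
import pandas as pd

# Read meshblock IDs as strings, then normalise to the 7-digit
# codes Stats NZ boundary files use.
mb = pd.Series(["123456", "1234567"], dtype="string")
mb = mb.str.zfill(7)
```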
The interesting absence
Here's the bit that actually made me stop and think. 32.2% of records have the hour of day recorded as 99 — unknown. Another 23.2% have the day of week as "UNKNOWN".
At first this looks like a data quality problem. But it's not — it's actually telling you something about the nature of the crime. If someone breaks into your house while you're at work, you come home to find your stuff gone. Was it 9am or 2pm? You've got no idea, and neither do the police.
Property crimes — theft, burglary — make up the bulk of these unknowns. Assault, by contrast, almost always has a precise time because there's a victim present when it happens. The absence of data is itself a signal about what kind of crime you're looking at.
78.6% of location type values are just "." — effectively missing. That column is sparsely populated but still useful for the roughly one in five records that have it.
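Whatever the sentinel means, it shouldn't masquerade as a real observation. A sketch of the unknown-encoding step, with illustrative column names and toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw frame; column names are illustrative.
df = pd.DataFrame({
    "hour": [14, 99, 9],
    "day_of_week": ["MONDAY", "UNKNOWN", "FRIDAY"],
    "location_type": ["DWELLING", ".", "STREET"],
})

# Map each column's sentinel "unknown" value to a proper NaN.
df["hour"] = df["hour"].replace(99, np.nan)
df["day_of_week"] = df["day_of_week"].replace("UNKNOWN", np.nan)
df["location_type"] = df["location_type"].replace(".", np.nan)
```

Once the sentinels are NaN, aggregations and plots skip them automatically instead of treating hour 99 as a very late night.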
Cleaning it up
We built a modular pipeline where each cleaning step is its own function. Nothing fancy, just practical:
def ingest() -> pd.DataFrame:
    df = load_raw_csv(RAW_CSV)       # UTF-16 LE, tab-delimited
    df = drop_redundant_columns(df)  # Remove 5 useless columns
    df = rename_columns(df)          # snake_case everything
    df = parse_dates(df)             # "July 2022" → datetime
    df = clean_strings(df)           # Strip trailing periods
    df = clean_meshblocks(df)        # Zero-pad to 7 digits
    df = encode_unknowns(df)         # 99 → NaN, "UNKNOWN" → NaN
    df = map_crime_types(df)         # ANZSOC Division → short enum
    return df
Each function does one thing. If something breaks, you know exactly where. If someone wants to understand the pipeline, they can read it top to bottom in about thirty seconds. I've been bitten enough times by monolithic data scripts that I'm allergic to them now.
The crime type mapping turns six ANZSOC Division values into short enums:
| Crime Type | Count | Share |
|---|---|---|
| Theft | 761,977 | 66.0% |
| Burglary | 247,034 | 21.4% |
| Assault | 115,383 | 10.0% |
| Robbery | 14,860 | 1.3% |
| Sexual | 13,943 | 1.2% |
| Harm | 905 | 0.1% |
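The mapping itself can be as simple as a dict lookup. A sketch — the ANZSOC Division strings below are my assumptions about what appears in the raw data, so check them against the actual file before trusting the mapping:

```python
import pandas as pd

# Assumed ANZSOC Division labels → short enums; verify against the raw data.
CRIME_TYPE_MAP = {
    "Theft and Related Offences": "theft",
    "Unlawful Entry With Intent/Burglary, Break and Enter": "burglary",
    "Acts Intended to Cause Injury": "assault",
    "Robbery, Extortion and Related Offences": "robbery",
    "Sexual Assault and Related Offences": "sexual",
    "Abduction, Harassment and Other Related Offences Against a Person": "harm",
}

divisions = pd.Series(["Theft and Related Offences",
                       "Acts Intended to Cause Injury"])
crime_type = divisions.map(CRIME_TYPE_MAP)
```

Using `.map()` rather than `.replace()` has a useful side effect: any division string not in the dict becomes NaN, which your sanity checks will catch instead of silently passing through.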
That 66% theft number is going to haunt us when we get to model training. Any loss function you throw at this data will overwhelmingly optimise for predicting theft, because that's two-thirds of everything. The class imbalance is real and it matters.
503MB to 6.3MB
The cleaned output goes to Apache Parquet with snappy compression. The result?
- Input: 503MB CSV (UTF-16, 20 columns)
- Output: 6.3MB Parquet (21 columns including derived fields)
- Compression: ~80x
That's not a typo. Parquet's columnar storage is dramatically more efficient than row-oriented CSV, especially when you've got columns full of repeated values like crime types and territorial authorities. The file loads in under a second compared to 3+ seconds for the CSV. When you're iterating on analysis and loading this data hundreds of times, that adds up fast.
The 21 output columns are the original columns we kept plus the derived fields: a proper datetime, year, month, day-of-week as an integer, and the short crime type enum.
Sanity checks
Before calling the data clean, we verify everything that matters:
- Row count: 1,154,102 — all rows preserved, nothing dropped
- No nulls in key columns: crime_type, date, area_unit, territorial_authority, meshblock
- Date range: Feb 2022 to Jan 2026 — all 48 months present
- Auckland: 412,669 records, 36% of total — exactly where it should be
- Theft: 761,977 records, 66% — as expected
- No trailing periods anywhere in area names
- All meshblock IDs are 7 digits
- Max hour value is 23 — no more 99s leaking through
You want these checks automated and running every time you regenerate the data. Future you will thank past you when something upstream changes and a check catches it.
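A sketch of what that automated pass might look like — column names mirror the cleaned schema, and the expected row count is a parameter so the function stays testable on small frames:

```python
import pandas as pd

def validate(df: pd.DataFrame, expected_rows: int = 1_154_102) -> None:
    """Run the sanity checks; raise AssertionError on the first failure."""
    assert len(df) == expected_rows, "row count changed"
    key_cols = ["crime_type", "date", "area_unit",
                "territorial_authority", "meshblock"]
    assert df[key_cols].notna().all().all(), "nulls in key columns"
    assert (df["meshblock"].str.len() == 7).all(), "bad meshblock length"
    assert not df["area_unit"].str.endswith(".").any(), "trailing periods"
    assert df["hour"].max() <= 23, "sentinel hour leaked through"
```

Wire it into the end of `ingest()` and a broken upstream export fails loudly at regeneration time instead of quietly skewing an analysis weeks later.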
What's next
We've got clean, compressed crime data — but the records only have meshblock IDs and area unit names. No coordinates. No shapes on a map. In the next post, we'll download Stats NZ geographic boundary files and join them to our crime records, giving every victimisation a place in physical space.