{"version":"https://jsonfeed.org/version/1","title":"jonno.nz","home_page_url":"https://jonno.nz/","feed_url":"https://jonno.nz/feed.json","description":"Tech, teams, and projects — John Gregoriadis","author":{"name":"John Gregoriadis","url":"https://jonno.nz"},"items":[{"id":"https://jonno.nz/posts/llm-kills-compromised-services-at-3am/","url":"https://jonno.nz/posts/llm-kills-compromised-services-at-3am/","title":"An LLM That Watches Your Logs and Kills Compromised Services at 3am","content_html":"<p>Anthropic just dropped <a href=\"https://www.anthropic.com/glasswing\">Project Glasswing</a> — a big collaborative cybersecurity initiative with a shiny new model called Claude Mythos Preview that can find zero-day vulnerabilities at scale. Twelve major tech companies involved. $100M in credits. Found a 27-year-old flaw in OpenBSD. Impressive stuff.</p>\n<p>But let's be real about what's happening here. Anthropic trained a model so capable at breaking into systems that they decided it was too dangerous to release publicly. So they wrapped the release in a collaborative security initiative. The security work is genuinely valuable. But it's also a smart way to keep control of something they know is too powerful to let loose.</p>\n<p>The part that actually matters, though, is who benefits. Glasswing is for the big players. The companies with security teams, budgets, and the kind of infrastructure that gets invited to sit at the table with AWS, Microsoft, and Palo Alto Networks. What about the rest of us? The startups, the small SaaS shops, the indie developers running production systems on a shoestring?</p>\n<p>The internet is a <a href=\"https://bigthink.com/books/how-the-dark-forest-theory-helps-us-understand-the-internet/\">dark forest</a>. That's not a metaphor anymore — it's becoming the literal reality. Bots, scrapers, automated exploit chains, credential stuffing, AI-generated phishing. 
A server goes up and within hours it's being scanned, fingerprinted, and probed by systems that don't sleep. Visibility equals vulnerability. And AI is making the attackers faster, cheaper, and more autonomous every month.</p>\n<p>The <a href=\"https://www.isc2.org/insights/2026/04/ai-driven-defense-and-autonomous-attacks\">ISC2 put it plainly</a> — both offence and defence now operate at speeds beyond human intervention. The threats aren't people sitting at keyboards anymore. They're autonomous systems running campaigns end-to-end.</p>\n<p>So what do we do about it?</p>\n<h2>Offensive security — but not the kind you're thinking</h2>\n<p>When I say offensive security, I don't mean red-teaming or penetration testing. I mean giving your systems the ability to fight back.</p>\n<p>Picture an LLM that sits across your centralised logs — network traffic, database queries, user interactions, access patterns — and builds an understanding of what normal looks like for your system over weeks and months. Not just pattern matching against known signatures. Actually understanding the shape of healthy behaviour.</p>\n<p>When something breaks the pattern, it doesn't just alert. It acts.</p>\n<p>Disable a compromised account. Kill a service that's behaving strangely. Block a database connection that shouldn't exist. Create an incident with full context for a human to review. 
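</p>\n<p>As a concrete sketch of that escalation logic, here's the shape in Go. The three severity tiers mirror the flow below; the action names are placeholders, not a real API:</p>\n<pre><code class=\"language-go\">package main\n\nimport &quot;fmt&quot;\n\n// Severity is the LLM's verdict on an anomaly.\ntype Severity int\n\nconst (\n    Low Severity = iota\n    Medium\n    High\n)\n\n// respond maps a verdict to a proportional set of actions.\n// Every branch also opens an incident for human review.\nfunc respond(s Severity) []string {\n    switch s {\n    case High:\n        return []string{&quot;disable_account&quot;, &quot;isolate_service&quot;, &quot;open_incident&quot;}\n    case Medium:\n        return []string{&quot;restrict_access&quot;, &quot;escalate&quot;, &quot;open_incident&quot;}\n    default:\n        return []string{&quot;alert&quot;, &quot;log&quot;, &quot;open_incident&quot;}\n    }\n}\n\nfunc main() {\n    fmt.Println(respond(High))\n}\n</code></pre>\n<p>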
The response is proportional and immediate — not waiting for someone to check their phone at 3am.</p>\n<p>The architecture is pretty straightforward:</p>\n<pre><code class=\"language-mermaid\">graph TD\n    A[Application Logs] --&gt; D[Secure Isolated Log Store]\n    B[Network Traffic] --&gt; D\n    C[Database Queries] --&gt; D\n    D --&gt; F[Baseline Health Model]\n    E[User Activity] --&gt; D\n    F --&gt;|Anomaly Detected| G[LLM Analysis]\n    G --&gt;|Analyse &amp; Plan| H{Threat Assessment}\n    H --&gt;|Low| I[Alert &amp; Log]\n    H --&gt;|Medium| J[Restrict &amp; Escalate]\n    H --&gt;|High| K[Disable &amp; Isolate]\n    I --&gt; L[Human Review]\n    J --&gt; L\n    K --&gt; L\n</code></pre>\n<p>The key is that the logging and analysis layer has to be isolated and secured separately from the systems it's watching. If an attacker can compromise the thing that's watching them, the whole model falls apart.</p>\n<p>In practice that means separate infrastructure with its own auth boundary. Ingestion is write-only — your application services push logs in but can never read or modify what's already there. Append-only, immutable. The analysis layer gets scoped service accounts that can read logs, fire alerts, and pull specific emergency levers through a narrow API. Nothing else. If a compromised service tries to reach the log store directly, it hits a wall.</p>\n<p>None of this is exotic. Centralised logging, immutable storage, scoped IAM — the building blocks exist. The hard part is wiring an LLM into that loop with the right constraints. Enough access to act, not enough to make things worse.</p>\n<h2>Adaptive, not rule-based</h2>\n<p>Traditional security tooling runs on signatures and static rules. Known bad patterns, blocklists, threshold alerts. That worked when threats were mostly human-paced. 
It doesn't work when you're up against autonomous systems that adapt faster than you can write rules.</p>\n<p>The alternative is a system that learns what normal looks like for <em>your</em> environment — not a generic baseline, but the actual shape of healthy behaviour in your specific infrastructure. Traffic patterns, query frequencies, access timing, user behaviour. Weeks of observation before it starts making decisions.</p>\n<p>When something breaks the pattern, the response is proportional. A sudden spike in unusual API calls might trigger deeper correlation — the system widens its search, pulls in more signals, lowers its threshold for flagging related activity. Repeated failed auth attempts from new IPs tighten access controls automatically. A database connection that shouldn't exist gets killed.</p>\n<p>This isn't a static ruleset you configure once and hope covers everything. It's a system that develops behavioural intuition from running in your environment, responding to your traffic. The difference matters — static rules are brittle against novel attacks, while adaptive systems can catch anomalies they've never seen before.</p>\n<p>The baseline isn't magic. It's watching five things:</p>\n<ul>\n<li><strong>Rate</strong> — how many events per time window. A user who averages 50 API calls per hour suddenly making 500 is a signal.</li>\n<li><strong>Composition</strong> — what's in those events. The same user always hitting /api/users and /api/orders suddenly hammering /api/admin/export.</li>\n<li><strong>Cardinality</strong> — how many unique values. One IP hitting 3 endpoints is normal. One IP cycling through 200 endpoints in an hour isn't.</li>\n<li><strong>Latency</strong> — how fast things happen. Legitimate users pause, think, navigate. Bots don't.</li>\n<li><strong>Novelty</strong> — things the system has never seen. 
A new endpoint, a new parameter, a user agent string that doesn't match anything in the training window.</li>\n</ul>\n<p>Three layers of detection stack on top of each other. Layer one is simple thresholds — hard caps that trigger immediately. Layer two is statistical deviation — standard deviations from the learned baseline. Layer three is correlation — looking across multiple signals simultaneously. A spike in rate alone might be fine. A spike in rate plus unusual composition plus new source IP? That's a pattern.</p>\n<h2>Learning to recognise yourself</h2>\n<p>A pure anomaly detector would go nuts during deploys. New code paths, changed response times, config reloads — all of it looks unusual. Same with cron jobs. Your 3am batch job that hits the database hard every night would trigger alerts every night.</p>\n<p>Tolerance patterns solve this. The system learns to recognise you.</p>\n<p>Mark a deploy event, and the system creates a tolerance window — elevated thresholds for the next 30 minutes. Register a recurring cron job, and the system expects that exact spike at that exact time. These aren't exceptions you configure manually. They're patterns the system learns from watching.</p>\n<p>After a few weeks, it knows when your weekly cache warm-up runs, when your daily reports generate, when deploys happen. It stops bothering you about the things you do on purpose.</p>\n<h2>The system gets cheaper over time</h2>\n<p>Calling an LLM for every anomaly would be expensive. The trick is building immune memory.</p>\n<p>When the LLM analyses an anomaly and decides it's benign — say, a deploy spike or a legitimate traffic surge — that verdict gets stored. Next time the same pattern appears, the system recognises it. No LLM call needed.</p>\n<p>This is how your security bill drops over the first few weeks. Early on, everything is novel. The LLM gets called constantly. A month in, most anomalies match patterns it's already seen. 
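</p>\n<p>The mechanism behind that can be as simple as a verdict cache keyed by a pattern fingerprint. A sketch of the idea (the fingerprint scheme and verdict strings here are hypothetical; a real system would bucket values and expire entries):</p>\n<pre><code class=\"language-go\">package main\n\nimport (\n    &quot;fmt&quot;\n    &quot;sort&quot;\n    &quot;strings&quot;\n)\n\n// fingerprint reduces an anomaly to a stable key: the signals that\n// fired, sorted so ordering doesn't matter.\nfunc fingerprint(signals []string) string {\n    s := append([]string(nil), signals...)\n    sort.Strings(s)\n    return strings.Join(s, &quot;|&quot;)\n}\n\ntype memory struct {\n    verdicts map[string]string // fingerprint -&gt; &quot;benign&quot; or &quot;malicious&quot;\n}\n\n// classify consults immune memory first and only falls back to the\n// expensive LLM call for fingerprints it has never seen.\nfunc (m *memory) classify(signals []string, llm func() string) (string, bool) {\n    key := fingerprint(signals)\n    if v, ok := m.verdicts[key]; ok {\n        return v, true // cache hit, no LLM call\n    }\n    v := llm()\n    m.verdicts[key] = v\n    return v, false\n}\n\nfunc main() {\n    m := &amp;memory{verdicts: map[string]string{}}\n    llm := func() string { return &quot;benign&quot; } // stand-in for the real analysis\n    _, cached := m.classify([]string{&quot;deploy_window&quot;, &quot;rate_spike&quot;}, llm)\n    fmt.Println(cached) // false: first sighting, LLM consulted\n    _, cached = m.classify([]string{&quot;rate_spike&quot;, &quot;deploy_window&quot;}, llm)\n    fmt.Println(cached) // true: same pattern, served from memory\n}\n</code></pre>\n<p>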
The LLM only gets called for genuinely new situations.</p>\n<p>The more your system runs, the smarter it gets and the less it costs.</p>\n<h2>Setup without a PhD</h2>\n<p>The hardest part of any security tool is configuration. Getting thresholds right. Understanding your traffic patterns before you can tell the tool what's normal.</p>\n<p><code>darkforest init</code> flips this. Point it at a log sample — a day's worth of traffic, a week if you've got it — and Claude reads it. Not just parsing, actually understanding the shape of your system. It figures out what your endpoints are, what normal request rates look like, what user agents show up, where your traffic comes from geographically.</p>\n<p>Then it writes your config file for you.</p>\n<p>You review it, tweak anything that looks wrong, and you're running. No spreadsheets. No guesswork about what &quot;normal&quot; means for your specific stack. The LLM that's going to watch your logs already understands them.</p>\n<h2>This has to be open</h2>\n<p>Glasswing is cool. <a href=\"https://github.com/aliasrobotics/CAI\">Open-source frameworks like CAI</a> are making progress — but mostly on the offensive side, using LLMs for penetration testing and vulnerability research. On the defensive side, the tooling barely exists. There's no open-source equivalent for the kind of adaptive monitoring and response I'm describing here.</p>\n<p>The building blocks are around. Centralised logging is a solved problem. Open standards for security event formats are maturing. Smaller open models are more than capable of pattern analysis on local infrastructure. What's missing is the glue — a framework that takes logs in, builds a baseline, detects anomalies, and can actually respond. Something a small team can deploy without a six-figure security budget.</p>\n<p>The threats don't discriminate by company size. The defences shouldn't either. 
This can't be proprietary or locked behind enterprise contracts.</p>\n<p>The dark forest doesn't care how big your company is. The bots scanning your infrastructure don't check your headcount before they attack. If the threats are going to be this accessible, the defences need to be too.</p>\n<p>I'm building this. An open-source security agent — adaptive, autonomous, acts when something breaks the pattern. Small enough for a startup to run on their own infrastructure. Centralised logging, open LLMs, scoped response actions. The pieces are all there. I'm wiring them together now.</p>\n<p>For v0.1, one real action working end-to-end: detect anomalous authentication patterns, call the LLM for analysis, and disable the compromised account via your identity provider's API. Not just alerting — actually responding while you're asleep. That's the proof of concept that matches the headline.</p>\n<p>I'm actively working on this and looking for early testers. If you want alpha access when it's ready, or just want to follow along, <a href=\"https://jonno.nz/posts/llm-kills-compromised-services-at-3am/#newsletter\">drop your email below</a>. I'll reach out when there's something to try.</p>\n","date_published":"2026-04-11T00:00:00Z","image":"https://jonno.nz/og/llm-kills-compromised-services-at-3am.png"},{"id":"https://jonno.nz/posts/booting-a-vm-in-4-seconds-and-talking-to-it-without-a-network/","url":"https://jonno.nz/posts/booting-a-vm-in-4-seconds-and-talking-to-it-without-a-network/","title":"Booting a VM in 4 Seconds and Talking to It Without a Network","content_html":"<p>The first time I got a Firecracker VM to boot and respond to a vsock ping from the host, I sat there grinning like an idiot. Typed a command on my machine, it reached through a kernel-level socket into a completely separate Linux system with its own kernel, and got a reply. Under a second.</p>\n<p>That was about 30 hours into the project. 
The previous 29 were mostly fighting with rootfs images and iptables rules.</p>\n<p><a href=\"https://jonno.nz/posts/i-built-anthropics-internal-sandbox-platform-on-a-single-linux-box/\">Part 1</a> covered why I built this — Firecracker MicroVMs for running Claude Code in full-permission isolation. This post is the actual build. Rootfs, networking, the guest agent, and the streaming pipeline.</p>\n<h2>Building the rootfs</h2>\n<p>A Firecracker VM needs two things: an uncompressed Linux kernel (<code>vmlinux</code>, not <code>bzImage</code> — there's no bootloader) and an ext4 filesystem image to use as the root disk.</p>\n<p>The kernel is straightforward — grab a prebuilt 6.1 LTS vmlinux. The rootfs took more work.</p>\n<p>It's a standard ext4 image with Debian Bookworm, and it needs everything Claude Code might want: Node.js 24, Python 3.11, Chromium for browser automation, git, curl, jq, and the full Claude Code CLI installed globally via npm. The image ends up at about 4GB.</p>\n<p>The guest agent — the Go binary that listens for commands from the host — lives inside the rootfs as a systemd service:</p>\n<pre><code class=\"language-bash\">sudo mount /opt/firecracker/rootfs/base-rootfs.ext4 /mnt\nsudo cp bin/agent /mnt/usr/local/bin/agent\nsudo chmod +x /mnt/usr/local/bin/agent\n\nsudo tee /mnt/etc/systemd/system/agent.service &lt;&lt;'EOF'\n[Unit]\nDescription=Orchestrator Guest Agent\nAfter=network.target\n\n[Service]\nType=simple\nExecStart=/usr/local/bin/agent\nRestart=always\nRestartSec=1\n\n[Install]\nWantedBy=multi-user.target\nEOF\n\nsudo chroot /mnt systemctl enable agent.service\nsudo umount /mnt\n</code></pre>\n<p>That <code>RestartSec=1</code> matters. If the agent crashes for any reason, systemd has it back up in a second. The orchestrator polls vsock every 500ms waiting for the agent, so even a crash during boot is barely noticeable.</p>\n<p>You build this rootfs once, by hand. 
Every new VM gets a sparse copy of it.</p>\n<h2>VM lifecycle</h2>\n<p><code>internal/vm/manager.go</code> handles the whole lifecycle. It's sequential with cleanup at each step — if anything fails, it tears down what it already set up and returns the error.</p>\n<pre><code class=\"language-mermaid\">flowchart TD\n    A[&quot;Copy rootfs (sparse)&quot;] --&gt; B[&quot;Mount &amp; inject network config&quot;]\n    B --&gt; C[&quot;Create TAP device&quot;]\n    C --&gt; D[&quot;Add iptables rules&quot;]\n    D --&gt; E[&quot;Setup jailer chroot&quot;]\n    E --&gt; F[&quot;Write Firecracker config JSON&quot;]\n    F --&gt; G[&quot;Launch via jailer --daemonize&quot;]\n    G --&gt; H[&quot;Find PID, save metadata&quot;]\n    H --&gt; I[&quot;VM ready — poll vsock&quot;]\n</code></pre>\n<p>The sparse copy is the first thing that happens:</p>\n<pre><code class=\"language-go\">cmd := exec.Command(&quot;cp&quot;, &quot;--sparse=always&quot;, BaseRootfs, vm.RootfsPath)\n</code></pre>\n<p><code>--sparse=always</code> means zero blocks aren't allocated on disk. A 4GB image might only use 2GB of actual disk space. Takes under a second on NVMe.</p>\n<p>After copying, the rootfs gets mounted and three files are injected: a systemd-networkd config with a static IP, <code>/etc/resolv.conf</code> for DNS, and <code>/etc/hostname</code>. Then it's unmounted and copied again into the jailer chroot.</p>\n<p>Yeah, that's two copies of the rootfs per VM. The first for network injection, the second because the jailer expects everything inside its chroot. I could collapse this into one copy by injecting the network config directly into the chroot copy, but it's never been a bottleneck — sparse copy of 4GB takes less time than Firecracker takes to boot. So I left it.</p>\n<h2>The jailer</h2>\n<p>Firecracker's jailer is a separate binary that creates a chroot, sets up minimal <code>/dev</code> entries (kvm, net/tun, urandom), and runs the Firecracker process inside it. 
The VM config is a JSON file:</p>\n<pre><code class=\"language-go\">vmConfig := map[string]interface{}{\n    &quot;boot-source&quot;: map[string]interface{}{\n        &quot;kernel_image_path&quot;: &quot;/vmlinux&quot;,\n        &quot;boot_args&quot;:         &quot;console=ttyS0 reboot=k panic=1 pci=off init=/sbin/init&quot;,\n    },\n    &quot;drives&quot;: []map[string]interface{}{{\n        &quot;drive_id&quot;:       &quot;rootfs&quot;,\n        &quot;path_on_host&quot;:   &quot;/rootfs.ext4&quot;,\n        &quot;is_root_device&quot;: true,\n        &quot;is_read_only&quot;:   false,\n    }},\n    &quot;machine-config&quot;: map[string]interface{}{\n        &quot;vcpu_count&quot;:  vm.VCPUs,\n        &quot;mem_size_mib&quot;: vm.RamMB,\n    },\n    &quot;network-interfaces&quot;: []map[string]interface{}{{\n        &quot;iface_id&quot;:      &quot;eth0&quot;,\n        &quot;guest_mac&quot;:     &quot;06:00:AC:10:00:02&quot;,\n        &quot;host_dev_name&quot;: netCfg.TapDev,\n    }},\n    &quot;vsock&quot;: map[string]interface{}{\n        &quot;guest_cid&quot;: vm.VsockCID,\n        &quot;uds_path&quot;:  &quot;/vsock.sock&quot;,\n    },\n}\n</code></pre>\n<p><code>pci=off</code> because Firecracker doesn't emulate PCI. Paths are relative to the jailer chroot. 
The vsock entry creates a Unix domain socket at <code>/vsock.sock</code> inside the chroot — that's how the host talks to the guest.</p>\n<p>Launch looks like this:</p>\n<pre><code class=\"language-go\">cmd := exec.Command(JailerBin,\n    &quot;--id&quot;, vm.JailID,\n    &quot;--exec-file&quot;, FCBin,\n    &quot;--uid&quot;, &quot;0&quot;, &quot;--gid&quot;, &quot;0&quot;,\n    &quot;--cgroup-version&quot;, &quot;2&quot;,\n    &quot;--daemonize&quot;,\n    &quot;--&quot;,\n    &quot;--config-file&quot;, &quot;/vm-config.json&quot;,\n)\ncmd.Run()\n</code></pre>\n<p>After launch there's a 2-second sleep — Firecracker needs a moment to start — then the PID is found via <code>pgrep</code> and saved to a metadata file. If the orchestrator restarts, it reads these metadata files and picks up where it left off. VMs survive orchestrator crashes.</p>\n<h2>Networking</h2>\n<p>This is where I burned the most time. Not because the concepts are hard, but because of one specific bug that had me questioning reality.</p>\n<p>Each VM needs internet access for Claude Code to fetch packages, clone repos, and hit the Anthropic API. The approach: each VM gets a Linux TAP device on the host, a dedicated <code>/24</code> subnet, and iptables rules for NAT.</p>\n<h3>IP allocation</h3>\n<p>Subnets are deterministic, derived from the VM name using FNV-1a hashing:</p>\n<pre><code class=\"language-go\">func NetSlot(name string) int {\n    h := fnv.New32a()\n    h.Write([]byte(name))\n    return int(h.Sum32()%253) + 1\n}\n</code></pre>\n<p>VM named <code>task-a3bfca80</code> might hash to slot 61, giving it subnet <code>172.16.61.0/24</code>, guest IP <code>172.16.61.2</code>, TAP IP <code>172.16.61.1</code>. No coordination needed, no DHCP server, no IP pool to manage. The collision space is 253 slots — more than enough for 12-13 concurrent VMs.</p>\n<h3>TAP devices</h3>\n<p>A TAP device is a virtual ethernet interface. 
Firecracker attaches the guest's <code>eth0</code> to it.</p>\n<pre><code class=\"language-go\">tap := &amp;netlink.Tuntap{\n    LinkAttrs: netlink.LinkAttrs{Name: cfg.TapDev},\n    Mode:      netlink.TUNTAP_MODE_TAP,\n}\nnetlink.LinkAdd(tap)\naddr, _ := netlink.ParseAddr(cfg.TapIP + &quot;/24&quot;)\nlink, _ := netlink.LinkByName(cfg.TapDev)\nnetlink.AddrAdd(link, addr)\nnetlink.LinkSetUp(link)\n</code></pre>\n<p>TAP names are <code>fc-&lt;vm-name&gt;</code>, truncated to 15 characters because Linux interface names can't be longer. A fun constraint to discover at runtime.</p>\n<h3>The iptables rules</h3>\n<p>Three rules per VM:</p>\n<pre><code class=\"language-go\">// NAT — rewrite source IP when traffic exits the host\nipt.AppendUnique(&quot;nat&quot;, &quot;POSTROUTING&quot;,\n    &quot;-s&quot;, cfg.Subnet, &quot;-o&quot;, cfg.HostIface, &quot;-j&quot;, &quot;MASQUERADE&quot;)\n\n// FORWARD — allow outbound from TAP\nipt.Insert(&quot;filter&quot;, &quot;FORWARD&quot;, 1,\n    &quot;-i&quot;, cfg.TapDev, &quot;-o&quot;, cfg.HostIface, &quot;-j&quot;, &quot;ACCEPT&quot;)\n\n// FORWARD — allow established/related inbound\nipt.Insert(&quot;filter&quot;, &quot;FORWARD&quot;, 1,\n    &quot;-i&quot;, cfg.HostIface, &quot;-o&quot;, cfg.TapDev,\n    &quot;-m&quot;, &quot;state&quot;, &quot;--state&quot;, &quot;RELATED,ESTABLISHED&quot;, &quot;-j&quot;, &quot;ACCEPT&quot;)\n</code></pre>\n<p>See those <code>Insert</code> calls with position 1? That's the bug fix.</p>\n<h3>The UFW bug</h3>\n<p>I originally used <code>Append</code> for the FORWARD rules. Traffic from the VM would leave the host fine (NAT worked), but return traffic got dropped. The VM could resolve DNS but couldn't complete TCP handshakes. I spent an embarrassing amount of time staring at <code>tcpdump</code> output before I figured it out.</p>\n<p>Ubuntu's UFW adds a blanket <code>DROP</code> rule to the FORWARD chain. If you append your ACCEPT rules, they land <em>after</em> UFW's DROP. They never match. 
The packets hit the DROP rule first and get silently killed.</p>\n<p><code>Insert</code> at position 1 puts the rules before UFW's. Return traffic flows, VMs get internet access, everything works.</p>\n<p>The traffic path through a working VM:</p>\n<pre><code>Guest (172.16.61.2) → eth0 → TAP (fc-task-xxx) → FORWARD ACCEPT\n→ NAT MASQUERADE (rewrite src to host IP) → host interface → internet\n→ response → RELATED,ESTABLISHED → TAP → guest eth0\n</code></pre>\n<p>VMs can't reach each other. Each TAP device is point-to-point on its own <code>/24</code>. There's no route between subnets.</p>\n<h2>The guest agent</h2>\n<p><code>cmd/agent/main.go</code> — 420 lines of Go. It's a static binary that starts on boot, listens on vsock port 9001, and handles five request types: ping, exec, write_files, read_file, and signal.</p>\n<p>The interesting one is streaming exec.</p>\n<p>When the orchestrator wants to run Claude Code, it sends an exec request with <code>stream: true</code>. The agent spawns the command, reads stdout and stderr line by line, and sends each line back as a framed event over the vsock connection. When the process exits, it sends an exit event with the exit code.</p>\n<p>Sounds straightforward. The tricky part is background processes.</p>\n<p>Claude Code can start things that outlive the main command — dev servers, file watchers, whatever it decides it needs. These child processes inherit the stdout/stderr pipes. If the agent waits for the pipes to close (the normal approach), it hangs forever because the children are still holding them open.</p>\n<p>The fix has three parts:</p>\n<pre><code class=\"language-go\">// 1. Process group isolation\ncmd.SysProcAttr = &amp;syscall.SysProcAttr{Setpgid: true}\n\n// 2. Wait for the main process, not the pipes\n&lt;-waitDone\n\n// 3. 
Kill the entire process group\npgid, _ := syscall.Getpgid(cmd.Process.Pid)\nsyscall.Kill(-pgid, syscall.SIGTERM)\ntime.Sleep(500 * time.Millisecond)\nsyscall.Kill(-pgid, syscall.SIGKILL)\n</code></pre>\n<p><code>Setpgid: true</code> puts the command in its own process group. When the main process exits, kill the group (<code>-pgid</code> means &quot;everything in this group&quot;). SIGTERM first, wait half a second, then SIGKILL for anything that didn't listen.</p>\n<p>Even after killing the group, there's a 3-second timeout waiting for the pipe-reading goroutines to drain. If they're still stuck after that, move on and send the exit event anyway. Can't let a hung pipe block the entire task.</p>\n<p>The line-by-line reader uses a 256KB buffer because Claude Code's <code>--output-format stream-json</code> can produce enormous single lines — tool results that include the full contents of files it read.</p>\n<h2>Credential injection</h2>\n<p>Before Claude Code runs, the orchestrator writes five things into the VM via vsock:</p>\n<p>OAuth credentials from the host's <code>~/.claude/.credentials.json</code> (mode 0600). A settings file that allows all tools. An environment script that sets <code>CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS=true</code>. Task metadata. And a marker file to create the output directory.</p>\n<p>The prompt itself gets written to a temp file inside the VM to avoid shell escaping nightmares, then referenced in the command:</p>\n<pre><code class=\"language-go\">claudeArgs := fmt.Sprintf(\n    &quot;claude -p \\&quot;$(cat %s)\\&quot; --output-format stream-json --verbose&quot;,\n    promptFile,\n)\ncmd := []string{&quot;bash&quot;, &quot;-c&quot;,\n    &quot;source /etc/profile.d/claude.sh &amp;&amp; &quot; + claudeArgs}\n</code></pre>\n<p>When the VM is destroyed, the rootfs — containing the credentials — is deleted. 
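</p>\n<p>For a sense of what the injection half might look like on the wire, here's a sketch of a <code>write_files</code> request. The field names are my guesses for illustration, not the agent's actual protocol:</p>\n<pre><code class=\"language-go\">package main\n\nimport (\n    &quot;encoding/json&quot;\n    &quot;fmt&quot;\n)\n\n// FileSpec describes one file to create inside the guest.\ntype FileSpec struct {\n    Path string `json:&quot;path&quot;`\n    Mode uint32 `json:&quot;mode&quot;` // 0600 for credentials\n    Data []byte `json:&quot;data&quot;` // base64-encoded by encoding/json\n}\n\ntype WriteFilesRequest struct {\n    Type  string     `json:&quot;type&quot;` // &quot;write_files&quot;\n    Files []FileSpec `json:&quot;files&quot;`\n}\n\nfunc main() {\n    req := WriteFilesRequest{\n        Type: &quot;write_files&quot;,\n        Files: []FileSpec{{\n            Path: &quot;/root/.claude/.credentials.json&quot;,\n            Mode: 0600,\n            Data: []byte(&quot;secret&quot;), // placeholder contents\n        }},\n    }\n    out, _ := json.Marshal(req)\n    fmt.Println(string(out))\n}\n</code></pre>\n<p>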
Credentials only exist for the lifetime of the task.</p>\n<h2>Collecting results</h2>\n<p>After Claude Code finishes, the orchestrator searches for files it created:</p>\n<pre><code class=\"language-go\">// Anything in the output directory\nvsock.Exec(jailID, []string{&quot;find&quot;, outputDir, &quot;-type&quot;, &quot;f&quot;, &quot;-not&quot;, &quot;-name&quot;, &quot;.keep&quot;}, nil, &quot;/root&quot;)\n\n// Any new files under /root, created after the prompt was written\nvsock.Exec(jailID, []string{&quot;find&quot;, &quot;/root&quot;, &quot;-maxdepth&quot;, &quot;2&quot;, &quot;-type&quot;, &quot;f&quot;,\n    &quot;-newer&quot;, &quot;/tmp/claude-prompt.txt&quot;}, nil, &quot;/root&quot;)\n</code></pre>\n<p>Each file gets downloaded via <code>vsock.ReadFile</code> and saved to <code>/opt/firecracker/results/&lt;task-id&gt;/</code>. The runner also scans the accumulated output for Claude's <code>total_cost_usd</code> field to record what the task cost in API credits.</p>\n<p>Then the VM is destroyed. Firecracker process killed, TAP device removed, iptables rules deleted, jailer chroot deleted, VM state directory deleted. Clean slate.</p>\n<p>The whole cycle — boot, inject, run, collect, destroy — typically takes 30-120 seconds depending on how complex the prompt is. The 4-second boot and ~1-second teardown are rounding errors compared to the time Claude actually spends thinking.</p>\n<p><a href=\"https://jonno.nz/posts/letting-claude-delegate-to-itself/\">Part 3</a> gets into the fun stuff — the MCP server that lets Claude delegate tasks to itself, the streaming architecture, the web dashboard, and what productionising this would actually look like.</p>\n","date_published":"2026-04-10T00:00:00Z","image":"https://jonno.nz/og/booting-a-vm-in-4-seconds-and-talking-to-it-without-a-network.png"},{"id":"https://jonno.nz/posts/can-you-beat-last-month/","url":"https://jonno.nz/posts/can-you-beat-last-month/","title":"Can You Beat Last Month? 
— Part 5: Baseline Models","content_html":"<p>Every machine learning project needs a reality check.</p>\n<p>It's tempting to jump straight to the neural network — that's the exciting bit, right? But if you don't establish what a dead-simple model can do first, you've got no idea whether your fancy architecture is actually learning anything useful or just being expensive.</p>\n<p>So before ConvLSTM gets anywhere near this data, we're going to throw three gloriously simple baselines at it and see how they do.</p>\n<h2>Persistence: next month equals this month</h2>\n<p>The dumbest possible model. To predict April, just use March's values. Every cell, every crime type — carbon copy.</p>\n<p>It sounds ridiculous, but it works surprisingly well when patterns are stable. And as we saw in the EDA, Auckland's crime hotspots are remarkably persistent. The CBD doesn't suddenly go quiet. South Auckland doesn't randomly calm down.</p>\n<p>On the six-month test set (August 2025 – January 2026):</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>MAE</th>\n<th>RMSE</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>1.42</td>\n<td>3.18</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>0.38</td>\n<td>0.91</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>0.22</td>\n<td>0.64</td>\n</tr>\n<tr>\n<td>Robbery</td>\n<td>0.04</td>\n<td>0.15</td>\n</tr>\n<tr>\n<td>Sexual</td>\n<td>0.03</td>\n<td>0.12</td>\n</tr>\n<tr>\n<td>Harm</td>\n<td>0.01</td>\n<td>0.04</td>\n</tr>\n</tbody>\n</table>\n<p>Those MAE numbers for theft and burglary look small until you remember that most cells are zero. For the active cells — the ones we actually care about — the error is larger. A busy CBD cell might have 35 thefts in one month and 28 the next. Persistence would be off by 7 there, which is a 20% miss on an important prediction.</p>\n<h2>Seasonal naive: same month last year</h2>\n<p>Instead of copying last month, copy the same month from the previous year. January 2026 gets predicted from January 2025. 
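</p>\n<p>Both copy-forward baselines are one-liners, which is what makes them such a good reality check. A toy sketch for a single cell, with made-up numbers rather than the real Auckland data:</p>\n<pre><code class=\"language-go\">package main\n\nimport &quot;fmt&quot;\n\n// persistence copies last month; seasonalNaive copies the same\n// month a year earlier.\nfunc persistence(series []float64, t int) float64   { return series[t-1] }\nfunc seasonalNaive(series []float64, t int) float64 { return series[t-12] }\n\n// mae scores forecasts for months t0 onwards (one cell, one type).\nfunc mae(series []float64, t0 int, predict func([]float64, int) float64) float64 {\n    var sum float64\n    for t := t0; t &lt; len(series); t++ {\n        err := predict(series, t) - series[t]\n        if err &lt; 0 {\n            err = -err\n        }\n        sum += err\n    }\n    return sum / float64(len(series)-t0)\n}\n\nfunc main() {\n    // Three years of a toy cell: steady 10 thefts a month,\n    // spiking to 16 every December.\n    series := make([]float64, 36)\n    for t := range series {\n        series[t] = 10\n        if t%12 == 11 {\n            series[t] = 16\n        }\n    }\n    fmt.Println(&quot;persistence MAE:&quot;, mae(series, 24, persistence))\n    fmt.Println(&quot;seasonal MAE:&quot;, mae(series, 24, seasonalNaive))\n}\n</code></pre>\n<p>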
This should capture seasonal patterns — the summer spike, the February dip.</p>\n<p>The catch? We only have four years of data. The test set months (August–January) each have at most three prior examples of the same month. That's not a lot of seasonal training data.</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>MAE</th>\n<th>RMSE</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>1.51</td>\n<td>3.42</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>0.41</td>\n<td>0.97</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>0.24</td>\n<td>0.68</td>\n</tr>\n<tr>\n<td>Robbery</td>\n<td>0.05</td>\n<td>0.17</td>\n</tr>\n<tr>\n<td>Sexual</td>\n<td>0.04</td>\n<td>0.13</td>\n</tr>\n<tr>\n<td>Harm</td>\n<td>0.01</td>\n<td>0.04</td>\n</tr>\n</tbody>\n</table>\n<p>Slightly worse than persistence across the board. That surprised me initially — shouldn't capturing seasonality help?</p>\n<p>The issue is that the 2023-to-2025 decline we spotted in the EDA bites hard here. If you predict January 2026 from January 2025, you're using data from a period when crime was higher. The seasonal pattern is real, but the year-over-year trend works against it. With more years of data, seasonal naive would likely pull ahead.</p>\n<h2>Historical average: the mean of all training months</h2>\n<p>For each cell and crime type, take the average across all 36 training months. This smooths out month-to-month noise and gives you a &quot;typical&quot; value for each location.</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>MAE</th>\n<th>RMSE</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>1.28</td>\n<td>2.95</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>0.35</td>\n<td>0.84</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>0.20</td>\n<td>0.58</td>\n</tr>\n<tr>\n<td>Robbery</td>\n<td>0.04</td>\n<td>0.14</td>\n</tr>\n<tr>\n<td>Sexual</td>\n<td>0.03</td>\n<td>0.11</td>\n</tr>\n<tr>\n<td>Harm</td>\n<td>0.01</td>\n<td>0.04</td>\n</tr>\n</tbody>\n</table>\n<p>The best baseline. 
By averaging over three years, it smooths out the month-to-month noise and the year-over-year trend simultaneously. It won't capture seasonal peaks or sudden changes, but for the &quot;typical month&quot; prediction it's solid.</p>\n<h2>Why MAPE breaks down</h2>\n<p>You might wonder why I'm not reporting MAPE (Mean Absolute Percentage Error) — it's the standard metric in a lot of forecasting work. The reason: sparse data.</p>\n<p>MAPE divides the error by the actual value. When the actual value is zero — which it is for 91.7% of our tensor — you get division by zero. Even for cells with small counts the metric is wildly unstable: with an actual value of 1, a prediction of 0 scores 100% MAPE, and so does a prediction of 2, even though one misses low and the other misses high.</p>\n<p>MAE and RMSE are more honest here. They tell you the absolute magnitude of your errors in actual crime counts, which is what we care about. A miss of 3 victimisations means the same thing whether the cell usually has 5 or 50.</p>\n<h2>The bar to clear</h2>\n<p>Here's the scoreboard going forward. Any deep learning model needs to beat the historical average baseline to justify its existence:</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>Historical Avg MAE</th>\n<th>Historical Avg RMSE</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>1.28</td>\n<td>2.95</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>0.35</td>\n<td>0.84</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>0.20</td>\n<td>0.58</td>\n</tr>\n<tr>\n<td>All types</td>\n<td>0.39</td>\n<td>0.95</td>\n</tr>\n</tbody>\n</table>\n<p>Theft is the easiest to beat because there's the most signal — high counts, clear spatial patterns, strong seasonality. Robbery, sexual offences, and harm are essentially noise at this resolution. The models will probably predict near-zero for those types and be mostly correct.</p>\n<p>The real test will be the middle ground — can ConvLSTM or ST-ResNet predict the <em>changes</em> in theft and burglary better than a static average? 
Can they catch the months where a cell spikes or dips? That's where simple baselines fall flat, because they don't model dynamics at all.</p>\n<p>If the deep learning can't meaningfully beat &quot;just use the average,&quot; then it's not worth the CPU cycles. Or in my case, the many hours of a Ryzen 5 grinding away.</p>\n","date_published":"Thu, 09 Apr 2026 00:00:00 GMT","image":"https://jonno.nz/og/can-you-beat-last-month.png"},{"id":"https://jonno.nz/posts/i-built-anthropics-internal-sandbox-platform-on-a-single-linux-box/","url":"https://jonno.nz/posts/i-built-anthropics-internal-sandbox-platform-on-a-single-linux-box/","title":"I Built Anthropic's Internal Sandbox Platform on a Single Linux Box","content_html":"<p>Running Claude Code with full permissions inside a Docker container is a terrible idea. I did it anyway for about a week, then built something better.</p>\n<p>Anthropic has an internal platform — people have been calling it <a href=\"https://ai.gopubby.com/anthropics-antspace-the-secret-paas-nobody-was-supposed-to-find-a79ce1e02151\">Antspace</a> since it got reverse-engineered from the Claude Code source — that runs AI coding tasks in isolated environments. It's part of a vertical stack they're building internally: intent goes in, code comes out, and the agent never touches the host machine.</p>\n<p>I wanted that. Not the whole platform-as-a-service thing, just the core idea: give Claude Code a prompt, let it run with zero permission restrictions, stream the output back, grab any files it created, and destroy everything when it's done. On a single Linux box sitting in my office.</p>\n<p>The result is about 3,200 lines of Go and 860 lines of TypeScript. It boots a fresh Linux VM in ~4 seconds, runs Claude Code inside it, and tears it down when the task finishes. 
Three ways to use it: a CLI, a REST API with a web dashboard, and an MCP server so Claude Code on other machines can delegate tasks to it.</p>\n<p>This first post is about why I built it this way. Parts 2 and 3 get into the actual implementation.</p>\n<h2>The container problem</h2>\n<p><code>CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS=true</code> — that's the environment variable that tells Claude Code to stop asking before it runs shell commands or writes files. It just does whatever it thinks it needs to. For autonomous tasks, you need this. Claude can't ask for confirmation when there's nobody watching.</p>\n<p>The question is where you let it run.</p>\n<p>Docker is the obvious first thought. Fast startup, everyone knows it, easy to orchestrate. But containers share the host kernel. Every container on the machine issues syscalls to the same Linux kernel, and a kernel vulnerability is a vulnerability in every container on the host. <a href=\"https://huggingface.co/blog/agentbox-master/firecracker-vs-docker-tech-boundary\">The isolation boundary is the container runtime</a>, not hardware — and that surface area is big.</p>\n<p>For most workloads this is fine. Running a web server in Docker? No worries. But running an AI agent that can execute arbitrary shell commands with root-level permissions? That's a different threat model. A container escape gives you the host. And you've just given the thing inside the container permission to try anything.</p>\n<p>Anthropic's own approach to <a href=\"https://www.anthropic.com/engineering/claude-code-sandboxing\">sandboxing Claude Code</a> uses OS-level primitives — bubblewrap on Linux, Seatbelt on macOS — for filesystem and network isolation. They report an 84% reduction in permission prompts internally. That's smart for the normal use case where Claude is helping you write code in your own project. 
But I wanted something more aggressive: full isolation where even a kernel exploit can't reach the host.</p>\n<h2>Why Firecracker</h2>\n<p><a href=\"https://firecracker-microvm.github.io/\">Firecracker</a> is what AWS built for Lambda and Fargate. Each MicroVM is a real KVM-backed virtual machine with its own guest kernel, its own memory space, and hardware-enforced isolation via Intel VT-x or AMD-V. The attack surface is the KVM hypervisor — which the kernel team at AWS has spent years minimising.</p>\n<p>The trade-off is boot time. Containers start in under a second. Firecracker VMs take about 4 seconds on my hardware once you account for the guest kernel boot, systemd init, and the agent process starting up. For tasks that typically run 20-120 seconds, 4 seconds of overhead is nothing.</p>\n<p>Each VM also copies a 4GB rootfs image. Sparse copies make this fast (&lt;1 second), but it does use disk. On a machine with a 1TB NVMe, I'm not losing sleep over it.</p>\n<p>The hardware is an AMD Ryzen 5 5600GT with 30GB of RAM. Nothing exotic. About $400 worth of parts sitting under my desk. Each VM gets 2GB of RAM by default, so I can run roughly 12-13 VMs concurrently before the host runs out of memory.</p>\n<h2>Talking to a VM without a network</h2>\n<p>This was my favourite bit to figure out.</p>\n<p>The obvious way to communicate with a process inside a VM is SSH. Set up keys, open a port, connect over the network. But SSH means key management, an open network port inside the VM, and another service to configure. If the guest's network breaks during a task, you've lost your control channel.</p>\n<p><a href=\"https://github.com/firecracker-microvm/firecracker/blob/main/docs/vsock.md\">vsock</a> (AF_VSOCK, address family 40) is a kernel-level host-guest communication channel. It doesn't touch the network stack. No IP addresses, no ports, no keys. 
Firecracker exposes the guest's vsock as a Unix domain socket on the host side — you connect to the socket, send <code>CONNECT &lt;port&gt;\\n</code>, and you're talking directly to a process inside the VM.</p>\n<pre><code class=\"language-go\">func Connect(jailID string, port int) (net.Conn, error) {\n    socketPath := fmt.Sprintf(&quot;/srv/jailer/firecracker/%s/root/vsock.sock&quot;, jailID)\n    conn, err := net.Dial(&quot;unix&quot;, socketPath)\n    if err != nil {\n        return nil, err\n    }\n    if _, err := conn.Write([]byte(fmt.Sprintf(&quot;CONNECT %d\\n&quot;, port))); err != nil {\n        conn.Close()\n        return nil, err\n    }\n    // Read &quot;OK &lt;port&gt;&quot; response\n    return conn, nil\n}\n</code></pre>\n<p>On the guest side, Go's standard library doesn't support AF_VSOCK — address family 40 doesn't exist in the <code>net</code> package. So the guest agent uses raw syscalls:</p>\n<pre><code class=\"language-go\">fd, _ := syscall.Socket(40, syscall.SOCK_STREAM, 0)  // AF_VSOCK = 40\n// Manually construct struct sockaddr_vm (16 bytes)\nsa := [16]byte{}\n*(*uint16)(unsafe.Pointer(&amp;sa[0])) = 40          // family\n*(*uint32)(unsafe.Pointer(&amp;sa[4])) = uint32(port) // port (9001)\n*(*uint32)(unsafe.Pointer(&amp;sa[8])) = 0xFFFFFFFF   // VMADDR_CID_ANY\nsyscall.RawSyscall(syscall.SYS_BIND, uintptr(fd), uintptr(unsafe.Pointer(&amp;sa[0])), 16)\nsyscall.RawSyscall(syscall.SYS_LISTEN, uintptr(fd), 5, 0)\n</code></pre>\n<p>Yeah, that's <code>unsafe.Pointer</code> and manual struct layout. Not the prettiest Go you'll ever write. But it works, it's fast, and the whole vsock layer is about 160 lines shared between both binaries.</p>\n<p>The wire protocol is dead simple — length-prefixed JSON frames:</p>\n<pre><code class=\"language-go\">func WriteFrame(w io.Writer, v interface{}) error {\n    data, err := json.Marshal(v)\n    if err != nil {\n        return err\n    }\n    if err := binary.Write(w, binary.BigEndian, uint32(len(data))); err != nil {\n        return err\n    }\n    _, err = w.Write(data)\n    return err\n}\n</code></pre>\n<p>Each operation (ping, exec, write files, read file) opens a new connection, sends one request, reads the response, and closes. 
Connection-per-request. Not fancy, but vsock connections are local and effectively instant, so there's no reason to complicate things with multiplexing.</p>\n<h2>The shape of the thing</h2>\n<p>The whole system is two Go binaries — the orchestrator (runs on the host) and the agent (runs inside each VM).</p>\n<pre><code class=\"language-mermaid\">graph TD\n    subgraph &quot;Host — orchestrator binary&quot;\n        API[&quot;REST API + WebSocket :8080&quot;]\n        MCP[&quot;MCP Server :8081&quot;]\n        VM[&quot;VM Manager&quot;]\n        NET[&quot;TAP + iptables&quot;]\n        TASK[&quot;Task Runner&quot;]\n        STREAM[&quot;Pub/Sub Hub&quot;]\n        VSOCK[&quot;vsock Client&quot;]\n    end\n\n    subgraph &quot;Guest — agent binary&quot;\n        AGENT[&quot;Guest Agent vsock:9001&quot;]\n        CLAUDE[&quot;Claude Code&quot;]\n    end\n\n    API --&gt; TASK\n    MCP --&gt; TASK\n    TASK --&gt; VM\n    VM --&gt; NET\n    TASK --&gt; VSOCK\n    VSOCK --&gt; AGENT\n    AGENT --&gt; CLAUDE\n    TASK --&gt; STREAM\n    STREAM --&gt; API\n</code></pre>\n<p>The orchestrator is a single 14MB binary with the React dashboard embedded via <code>//go:embed</code>. Copy it to a server, run it with sudo, done. Seven Go dependencies total — chi for routing, netlink for TAP devices, go-iptables for firewall rules, <a href=\"https://github.com/mark3labs/mcp-go\">mcp-go</a> for the MCP protocol, and a few others.</p>\n<p>The agent is a 2.5MB static binary compiled with <code>CGO_ENABLED=0</code>. It ships inside the VM's rootfs and starts via systemd on boot. Within about a second of the VM coming up, the agent is listening on vsock port 9001 and ready to accept commands.</p>\n<p>They share exactly one file — <code>internal/agent/protocol.go</code> — which defines the wire protocol types and framing functions. Everything else is independent.</p>\n<h2>What a task looks like</h2>\n<p>You give it a prompt. 
It does the rest.</p>\n<ol>\n<li>Generate a task ID and VM name</li>\n<li>Copy the base rootfs image (sparse, &lt;1 second)</li>\n<li>Inject network config into the rootfs</li>\n<li>Create a TAP device and iptables rules for internet access</li>\n<li>Launch Firecracker via the jailer</li>\n<li>Poll vsock until the agent responds (~1 second)</li>\n<li>Inject credentials and files via vsock</li>\n<li>Run Claude Code with streaming output</li>\n<li>Collect any files Claude created</li>\n<li>Destroy the VM</li>\n</ol>\n<p>From the CLI it looks like this:</p>\n<pre><code class=\"language-bash\">sudo ./bin/orchestrator task run \\\n    --prompt &quot;Write a Python script that generates Fibonacci numbers&quot; \\\n    --ram 2048 \\\n    --vcpus 2 \\\n    --timeout 120\n</code></pre>\n<p>Output streams to your terminal in real time. When it's done:</p>\n<pre><code>=== Task Complete ===\nID:     a3bfca80\nStatus: completed\nExit:   0\nCost:   $0.0582\nFiles:  [fibonacci.py]\n</code></pre>\n<p>The VM is gone. The rootfs is deleted. The TAP device and iptables rules are cleaned up. All that's left is the result files in <code>/opt/firecracker/results/a3bfca80/</code>.</p>\n<p>Or you use the MCP server, and Claude Code on your laptop delegates the task to a VM on the box under your desk. Claude spawning Claude. That bit is properly cool, and I'll get into it in Part 3.</p>\n<h2>Why Go</h2>\n<p>Quick aside on this because people always ask.</p>\n<p>Go produces static binaries. The agent needs to be a single file with zero dependencies that runs inside a minimal Debian guest — <code>CGO_ENABLED=0</code> makes this trivial. The orchestrator needs to manage concurrent VMs, and goroutines are a natural fit for that. Syscall support is first-class, which matters when you're doing raw vsock operations. 
And it compiles in about 2 seconds, which is nice when you're iterating.</p>\n<pre><code class=\"language-makefile\">build-agent:\n\tCGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o bin/agent -ldflags=&quot;-s -w&quot; ./cmd/agent\n</code></pre>\n<p>That <code>-ldflags=&quot;-s -w&quot;</code> strips debug info and DWARF tables, dropping the agent binary from ~3.5MB to ~2.5MB. Every byte counts when you're baking it into a rootfs that gets copied for every VM.</p>\n<p>Part 2 gets into the actual build — the rootfs, the networking (including a fun bug with Ubuntu's UFW that had me staring at iptables rules for an embarrassing amount of time), the guest agent, and the streaming pipeline that gets Claude's output from inside a VM to your browser.</p>\n","date_published":"Wed, 08 Apr 2026 00:00:00 GMT","image":"https://jonno.nz/og/i-built-anthropics-internal-sandbox-platform-on-a-single-linux-box.png"},{"id":"https://jonno.nz/posts/stealing-nanoclaw-patterns-for-webapps-and-saas/","url":"https://jonno.nz/posts/stealing-nanoclaw-patterns-for-webapps-and-saas/","title":"Stealing NanoClaw Patterns for Web Apps and SaaS","content_html":"<p>In <a href=\"https://jonno.nz/posts/nanoclaw-architecture-masterclass-in-doing-less/\">Part 1</a> I pulled apart NanoClaw's codebase and found six patterns that make an 8,000-line AI assistant surprisingly robust. But NanoClaw is a single-user tool running on your laptop. Surely these patterns fall apart once you've got real tenants, real money, and real scale?</p>\n<p>Nah. Four of them translate almost directly — and the ones that don't still teach you something useful.</p>\n<h2>The credential sidecar</h2>\n<p>NanoClaw's credential proxy — where containers get a placeholder API key and a localhost proxy injects the real one — sounds like a neat trick for a personal tool. 
But this exact pattern is showing up in production Kubernetes deployments right now.</p>\n<p>The broader version is a <a href=\"https://www.apistronghold.com/blog/phantom-token-pattern-production-ai-agents\">sidecar proxy that handles credential injection</a> for any service that needs API keys or tokens. Your application code never touches the real secret. A sidecar container intercepts outbound requests, swaps in credentials, and forwards them upstream.</p>\n<p>At Vend we managed a bunch of third-party integrations — payment gateways, shipping providers, accounting platforms. Each one had API keys that needed to live somewhere. We went through the typical evolution: environment variables, then a secrets manager, then a service that distributed keys at startup. Every step was an improvement, but the keys still ended up <em>in the application's memory</em>.</p>\n<p>The sidecar approach skips that entirely. Your app sends requests with a placeholder. The proxy — which is a separate process with its own security boundary — does the credential swap. Even if your application gets compromised, the real keys aren't there to steal.</p>\n<p>If you're running any kind of multi-service architecture where services call external APIs, this pattern is worth adopting. Your API gateway might already be doing a version of it — the insight is making it explicit and consistent across all outbound credential flows.</p>\n<h2>Isolation as the security model</h2>\n<p>This is the one I keep thinking about.</p>\n<p>NanoClaw uses filesystem mounts to control what each container can see. No application-level permission checks — the security model <em>is</em> the infrastructure topology. If a container can't see a file, it can't access it. No bugs, no missed checks, no escalation vulnerabilities.</p>\n<p>In SaaS, we spend enormous amounts of time writing authorisation logic. Role checks, permission middleware, tenant-scoping queries. 
And it works — until someone forgets a WHERE clause.</p>\n<p>AWS's own <a href=\"https://docs.aws.amazon.com/whitepapers/latest/saas-architecture-fundamentals/tenant-isolation.html\">SaaS tenant isolation guidance</a> makes this point explicitly: authentication and authorisation are not the same as isolation. The fact that a user logged in doesn't mean your system has achieved tenant isolation. A <a href=\"https://workos.com/blog/tenant-isolation-in-multi-tenant-systems\">single missed tenant filter</a> on a database query and you've got a cross-tenant data leak.</p>\n<p>The NanoClaw-inspired approach is to push isolation down the stack. Separate database schemas per tenant. Separate containers. Separate cloud accounts for your highest-value customers. Not instead of application-level checks — but as a backstop that catches the bugs your application-level checks inevitably have.</p>\n<p>At Xero, working across the integrations and app store teams, I saw first-hand how multi-tenant data isolation gets complicated fast. The teams that had the fewest incidents were the ones where the infrastructure itself enforced boundaries, not just the application code.</p>\n<p>You don't need to go full NanoClaw and give every tenant their own container. But you should be asking: if my application-level authorisation has a bug, what's my second line of defence? If the answer is &quot;nothing&quot; — that's the pattern to steal.</p>\n<h2>Polling when it's the right call</h2>\n<p>NanoClaw polls SQLite every 2 seconds. No WebSockets, no event bus, no pub/sub. Just a loop that checks for new stuff.</p>\n<p>The instinct for most teams is to treat polling as a temporary hack you'll replace with &quot;proper&quot; event-driven architecture later. 
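To make the point concrete, the entire "event system" for this class of workload can be a loop like the following. This is a sketch with an in-memory stub standing in for a "fetch unprocessed rows" database query — not NanoClaw's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// store is a stand-in for your database; fetchNew plays the role of
// something like "SELECT ... WHERE processed = false".
type store struct{ pending []string }

func (s *store) fetchNew() []string {
	items := s.pending
	s.pending = nil
	return items
}

func main() {
	s := &store{pending: []string{"job-1", "job-2"}}

	// The whole architecture: wake up on a timer, check for work, do it.
	ticker := time.NewTicker(10 * time.Millisecond) // 2s in NanoClaw's case
	defer ticker.Stop()
	for i := 0; i < 3; i++ { // a few ticks for the demo; real loops run forever
		<-ticker.C
		for _, item := range s.fetchNew() {
			fmt.Println("processed", item)
		}
	}
}
```

No broker, no consumer groups, no connection churn to debug — the failure modes are the ones your database already has.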
Yan Cui wrote a <a href=\"https://theburningmonk.com/2025/05/understanding-push-vs-poll-in-event-driven-architectures/\">solid breakdown of push vs poll in event-driven systems</a> and the takeaway isn't that one is always better — it's that the right choice depends on your throughput, ordering, and failure-handling requirements.</p>\n<p>For a lot of internal systems, polling is the correct permanent answer.</p>\n<p>Admin dashboards. Background job status. Internal reporting. Webhook retry queues. Deployment pipelines. These systems don't need sub-second latency. They need reliability and simplicity. A polling loop against your database gives you both, with zero infrastructure overhead.</p>\n<p>At Xero we shipped multiple times per day, and some of the internal tooling that supported continuous deployment was surprisingly simple under the hood. Cron jobs. Polling loops. SQL queries on a timer. Not because anyone was cutting corners — because the requirements genuinely didn't need anything more sophisticated.</p>\n<p>The trap is reaching for Kafka or RabbitMQ because you think you'll need it eventually. <a href=\"https://synmek.com/saas-architecture-for-startups-2025-guide\">70% of startups fail due to premature scaling</a>. The infrastructure you don't deploy is the infrastructure that never breaks.</p>\n<h2>Your database is your message queue</h2>\n<p>NanoClaw uses JSON files on the filesystem for inter-process communication. Atomic rename, directory-based identity, simple polling to pick up new messages. No Redis. No message broker.</p>\n<p>That specific approach won't scale to a multi-tenant SaaS — but the <em>instinct</em> behind it absolutely does. The instinct is: use the infrastructure you already have.</p>\n<p>For most web apps, that means Postgres. 
The <a href=\"https://dagster.io/blog/skip-kafka-use-postgres-message-queue\">Postgres-as-queue movement</a> has been gaining serious traction, and tools like <a href=\"https://github.com/pgmq/pgmq\">PGMQ</a> make it practical. You get ACID guarantees, you don't need to manage another service, and your queue is backed by the same database you're already monitoring and backing up.</p>\n<p>NanoClaw's <a href=\"https://dev.to/constanta/crash-safe-json-at-scale-atomic-writes-recovery-without-a-db-3aic\">atomic write pattern</a> — write to a temp file, rename into place — maps to <code>INSERT INTO queue_table</code> followed by a <code>SELECT ... FOR UPDATE SKIP LOCKED</code> consumer. Same principle: the message either exists completely or doesn't exist at all. No partial state.</p>\n<p>The &quot;just add Redis&quot; reflex is strong in our industry. Sometimes it's the right call. But I've seen plenty of teams introduce a message broker for a workload that Postgres could've handled without breaking a sweat — and then spend the next six months debugging consumer lag and dead letter queues.</p>\n<h2>The real pattern</h2>\n<p>The specific techniques matter less than the discipline behind them.</p>\n<p>NanoClaw's developer looked at a 500,000-line framework and asked: what are my <em>actual</em> constraints? Single user. Local machine. One AI provider. And then built exactly the architecture those constraints required — nothing more.</p>\n<p>Most teams don't do this. They build for imaginary scale, imaginary multi-tenancy requirements, imaginary traffic spikes. They reach for Kubernetes before they've outgrown a single server. They deploy event buses before they've outgrown a polling loop. They write complex authorisation middleware before they've considered whether infrastructure isolation would eliminate the problem entirely.</p>\n<p>The pattern worth stealing isn't the credential proxy or the polling loop or Postgres-as-queue. 
It's the habit of understanding your constraints first and letting them delete complexity from your architecture.</p>\n<p>Hardest pattern to adopt, though. Because it means admitting you're smaller than you think.</p>\n","date_published":"Sun, 05 Apr 2026 00:00:00 GMT","image":"https://jonno.nz/og/stealing-nanoclaw-patterns-for-webapps-and-saas.png"},{"id":"https://jonno.nz/posts/what-if-your-browser-built-the-ui-for-you/","url":"https://jonno.nz/posts/what-if-your-browser-built-the-ui-for-you/","title":"What if your browser built the UI for you?","content_html":"<p>We're at a genuinely weird inflection point in frontend development. AI can generate entire interfaces now. LLMs can reason about data and layout. And yet — most SaaS products still ship hand-crafted React apps, each building its own UI, its own accessibility layer, its own theme system, its own responsive breakpoints. Not every service, but the vast majority.</p>\n<p>That's a lot of duplicated effort for what's essentially the same job — showing a human some data and letting them do stuff with it.</p>\n<p>I've been thinking about this a lot lately, and I built a proof of concept to test an idea: what if the browser itself generated the UI?</p>\n<h2>Where we are right now</h2>\n<p>The industry is circling this idea from multiple angles, but nobody's quite landed on it yet.</p>\n<p><a href=\"https://www.apollographql.com/docs/graphos/schema-design/guides/sdui/basics\">Server-driven UI</a> has been around for a while — Airbnb and others pioneered it for mobile, where app store review cycles make shipping UI changes painful. The server sends down a JSON tree describing what to render, and the client just follows instructions. It's clever, but the server is still calling the shots. 
</p>\n<p>Google recently shipped <a href=\"https://developers.google.com/natively-adaptive-interfaces\">Natively Adaptive Interfaces</a> — a framework that uses AI agents to make accessibility a default rather than an afterthought. Really cool idea, and the right instinct. But it's still operating within a single app's boundaries. Your accessibility preferences don't carry between Google's products and, say, your project management tool.</p>\n<p>Then there's the <a href=\"https://www.copilotkit.ai/blog/the-developer-s-guide-to-generative-ui-in-2026\">generative UI</a> wave — CopilotKit, Vercel's AI SDK, and others building frameworks where LLMs generate components on the fly. These are powerful developer tools, but they're still developer tools. The generation happens at build time or on the server. The service is still in control.</p>\n<p>See the pattern? Every approach keeps the power on the service side.</p>\n<h2>Flip it</h2>\n<p>Here's the idea behind the <a href=\"https://github.com/jonnonz1/adaptive-browser\">adaptive browser</a>: what if the generation happened on <em>your</em> side?</p>\n<p>Instead of a service shipping you a finished frontend, it publishes a manifest — a structured description of what it can do. Its capabilities, endpoints, data shapes, what actions are available. Think of it like an API spec, but semantic. Not just &quot;here's a GET endpoint&quot; but &quot;here's a list of repositories, they're sortable by stars and language, you can create, delete, star, or fork them.&quot;</p>\n<p>Your browser takes that manifest, calls the actual APIs, gets real data back, and then generates the UI based on your preferences. Your font size. Your colour scheme. Your preferred layout (tables vs cards vs kanban). Your accessibility needs. 
All applied universally, across every service.</p>\n<p>The manifest for something like GitHub looks roughly like this — a service describes its capabilities and the browser figures out the rest:</p>\n<pre><code class=\"language-yaml\">service:\n  name: &quot;GitHub&quot;\n  domain: &quot;api.github.com&quot;\n\ncapabilities:\n  - id: &quot;repositories&quot;\n    endpoints:\n      - path: &quot;/user/repos&quot;\n        semantic: &quot;list&quot;\n        entity: &quot;repository&quot;\n        sortable_fields: [name, updated_at, stargazers_count]\n        actions: [create, delete, star, fork]\n</code></pre>\n<p>The browser takes that, fetches the data, and generates a bespoke interface — using an LLM to reason about the best way to present it given who you are and what you're trying to do.</p>\n<h2>Why this matters more than it sounds</h2>\n<p>When I was building the app store and integrations platforms at Xero, one of the constant headaches was that every third-party integration had its own UI patterns. Users had to learn a new interface for every app they connected. If the browser was generating the UI from a shared set of preferences, that problem just… goes away.</p>\n<p>Accessibility is the big one though. Right now, accessibility is a feature that gets bolted on — and often badly. When the browser generates the UI, accessibility isn't a feature. It's the default. Your preferences — high contrast, keyboard-first navigation, screen reader optimisation, larger text — apply everywhere. Not because every developer remembered to implement them, but because they're baked into how the UI gets generated in the first place.</p>\n<p>Customisation becomes genuinely personal too. Not &quot;pick from three themes the developer made&quot; but &quot;this is how I interact with software, full stop.&quot;</p>\n<h2>The trade-off is real though</h2>\n<p>Frontend complexity drops dramatically, but the complexity doesn't disappear — it moves behind the API. 
And honestly, it probably increases.</p>\n<p>API design becomes way more important. You can't just throw together some REST endpoints and call it a day. Your manifest needs to be semantic — describing what the data means, not just what shape it is. Data contracts between services matter more. Versioning matters more.</p>\n<pre><code class=\"language-mermaid\">graph LR\n    A[Service] --&gt;|Publishes manifest + APIs| B[Browser Agent]\n    C[User Preferences] --&gt; B\n    D[Org Guardrails] --&gt; B\n    B --&gt;|Generates| E[Bespoke UI]\n</code></pre>\n<p>But here's the thing — this trade-off pushes us somewhere genuinely interesting. If every service needs to describe itself semantically through APIs and manifests, those APIs become the actual product surface. Not the frontend. The APIs.</p>\n<p>And once APIs are the product surface, sharing context between platforms becomes the interesting problem. Your project management tool knows what you're working on. Your email client knows who you're talking to. Your code editor knows what you're building. Right now, none of these talk to each other in any meaningful way because they're all locked behind their own UIs. In a manifest-driven world, that context flows through the APIs — and your browser can stitch it all together into something coherent.</p>\n<h2>Where this is headed (IMHO)</h2>\n<p>I reckon we're about 3-5 years from this being mainstream. The pieces are all there — LLMs that can reason about UI, <a href=\"https://www.builder.io/blog/ui-over-apis\">standardisation efforts</a> around sending UI intent over APIs, and a growing expectation from users that software should adapt to them, not the other way around.</p>\n<p>The services that win in this world won't be the ones with the prettiest hand-crafted UI. They'll be the ones with the best APIs, the richest manifests, and the most useful data. 
The frontend becomes a generated output, not a hand-crafted input.</p>\n<p>Organisations will set preference guardrails — &quot;our people can use dark or light mode, must have destructive action confirmations, these fields are always visible&quot; — while individuals customise within those bounds. Your browser becomes your agent, not just a renderer.</p>\n<p>I built the <a href=\"https://github.com/jonnonz1/adaptive-browser\">adaptive browser</a> as a proof of concept to test this thinking — it uses Claude to generate UIs from a GitHub manifest and user preferences defined in YAML. It's rough, but the direction feels right.</p>\n<p>The frontend isn't dying. But what we think of as &quot;frontend development&quot; is about to change. The interesting work moves to API design, semantic data contracts, and building browsers smart enough to be genuine user agents.</p>\n","date_published":"Sun, 05 Apr 2026 00:00:00 GMT","image":"https://jonno.nz/og/what-if-your-browser-built-the-ui-for-you.png"},{"id":"https://jonno.nz/posts/what-the-data-actually-shows/","url":"https://jonno.nz/posts/what-the-data-actually-shows/","title":"What the Data Actually Shows — Part 4: Exploratory Data Analysis","content_html":"<p>You can't just shove a tensor into a neural network and hope for the best.</p>\n<p>I mean, you <em>can</em>. People do it all the time. But you'll have no idea whether your model is learning something real or just memorising noise. Before we get anywhere near ConvLSTM or ST-ResNet, we need to properly understand what patterns actually exist in this data — and whether they're strong enough for a model to learn.</p>\n<p>This is the part that most ML blog posts skip. It's also the part that saves you weeks of debugging later.</p>\n<h2>When does crime happen?</h2>\n<p>The monthly pattern across Auckland is surprisingly consistent year to year. Crime peaks in late spring and early summer — October through January — and dips in late summer through winter. 
February is reliably the quietest month at around 7,000–8,000 victimisations, while November and December regularly push past 9,000.</p>\n<p>This tracks with what <a href=\"https://link.springer.com/article/10.1007/s43762-023-00094-x\">criminology research has found globally</a> — warmer months mean more people out and about, more opportunities for property crime, and more interpersonal conflict. It's a well-documented pattern called seasonal variation in crime, and it shows up clearly in the NZ data.</p>\n<p>But here's the interesting bit — the seasonal signal isn't uniform across crime types. Theft drives most of the swing. It surges in summer and drops in winter, accounting for nearly all the monthly variance. Assault has its own rhythm — it peaks around the holiday period (December–January) and shows a secondary bump in winter weekends, probably pub-related. Burglary is flatter, with a slight winter uptick when houses are dark earlier.</p>\n<p>2023 was the peak year across the board, with a noticeable decline through 2024 and into early 2025. Whether that's a real trend or a reporting artefact, I genuinely don't know. But it means the model's training data includes both an upswing and a downswing, which is actually useful — it can't just learn &quot;crime always goes up.&quot;</p>\n<h2>Where does crime cluster?</h2>\n<p>Crime in Auckland is not randomly distributed. That's obvious to anyone who lives here, but it's worth quantifying.</p>\n<p>Running a <a href=\"https://www.publichealth.columbia.edu/research/population-health-methods/hot-spot-spatial-analysis\">Moran's I test</a> on our 500m grid confirms strong positive spatial autocorrelation — cells with high crime counts are surrounded by other high-crime cells. The Moran's I statistic comes out at 0.43 (p &lt; 0.001), which means the clustering is highly significant. Crime begets more crime in adjacent cells.</p>\n<p>The hotspots are exactly where you'd expect. 
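The statistic itself is only a few lines: with binary adjacency weights w(i,j), Moran's I is (N/W) times the sum of w(i,j)(x_i - mean)(x_j - mean) over all pairs, divided by the sum of (x_i - mean) squared. A sketch with rook adjacency on a toy grid (synthetic values, not the Auckland data):

```go
package main

import "fmt"

// moransI computes Moran's I for a grid of values using binary rook
// adjacency: up/down/left/right neighbours weighted 1, all others 0.
func moransI(grid [][]float64) float64 {
	rows, cols := len(grid), len(grid[0])
	n := float64(rows * cols)

	var mean float64
	for _, row := range grid {
		for _, v := range row {
			mean += v
		}
	}
	mean /= n

	var num, den, w float64
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			d := grid[r][c] - mean
			den += d * d
			// cross-products with the four rook neighbours
			for _, nb := range [][2]int{{r - 1, c}, {r + 1, c}, {r, c - 1}, {r, c + 1}} {
				nr, nc := nb[0], nb[1]
				if nr < 0 || nr >= rows || nc < 0 || nc >= cols {
					continue
				}
				num += d * (grid[nr][nc] - mean)
				w++ // total weight W
			}
		}
	}
	return (n / w) * (num / den)
}

func main() {
	// High values clustered in one corner: positive I, like the real grid.
	clustered := [][]float64{
		{10, 10, 0},
		{10, 10, 0},
		{0, 0, 0},
	}
	fmt.Printf("Moran's I = %.2f\n", moransI(clustered))
}
```

The p-value quoted above comes from a separate significance test (typically a permutation test) — this sketch computes only the statistic.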
The CBD dominates — Queen Street, Karangahape Road, and the surrounding blocks consistently light up across all crime types. South Auckland corridors — Manukau, Ōtāhuhu, Papatoetoe — form a second cluster, particularly for assault and robbery. Henderson in the west shows up for burglary.</p>\n<p>What's less obvious is how stable these hotspots are over time. The top 5% of cells (about 227 cells) account for over 60% of all recorded crime across the entire four-year period. These aren't random spikes — they're persistent. A cell that's hot in 2022 is almost certainly still hot in 2025. That temporal persistence is exactly what makes this data amenable to prediction. If hotspots moved randomly month to month, no model could learn them.</p>\n<h2>Crime type correlations</h2>\n<p>The six channels in our tensor don't behave independently. Theft and burglary show moderate positive correlation (r ≈ 0.52) — cells with lots of theft tend to have more burglary too, which makes sense given similar opportunity structures (commercial areas, transport hubs).</p>\n<p>Assault correlates weakly with everything else (r ≈ 0.15–0.25). It has its own spatial logic — nightlife areas, specific residential pockets — that doesn't align neatly with property crime.</p>\n<p>Robbery, sexual offences, and harm are so sparse at the 500m monthly resolution that correlation analysis is basically meaningless. Most cells have zero counts for these types in any given month. That sparsity is going to be a real headache for the models.</p>\n<h2>The sparsity problem — again</h2>\n<p>We flagged this in Part 3: 91.7% of the tensor is zeros. But the EDA makes the problem even clearer.</p>\n<p>The distribution of non-zero cell values is heavily right-skewed. The median non-zero value is 1. One crime, in one cell, in one month. The mean is about 2.3. A handful of cells — the CBD, Manukau — hit 30–50+ in peak months for theft. 
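That concentration is easy to quantify. A toy calculation with synthetic counts shaped like the real skew — a few busy cells, many quiet ones:

```go
package main

import (
	"fmt"
	"sort"
)

// topShare returns the fraction of total count carried by the
// top `frac` proportion of cells.
func topShare(counts []float64, frac float64) float64 {
	sorted := append([]float64(nil), counts...)
	sort.Sort(sort.Reverse(sort.Float64Slice(sorted)))
	k := int(frac * float64(len(sorted)))
	var top, total float64
	for i, v := range sorted {
		if i < k {
			top += v
		}
		total += v
	}
	return top / total
}

func main() {
	// 100 cells: 5 hotspots with 100 victimisations each, 95 cells with 1.
	counts := make([]float64, 100)
	for i := range counts {
		if i < 5 {
			counts[i] = 100
		} else {
			counts[i] = 1
		}
	}
	fmt.Printf("top 5%% of cells carry %.0f%% of crime\n", 100*topShare(counts, 0.05))
	// → top 5% of cells carry 84% of crime
}
```

The same calculation against the real tensor is what produced the 60%+ figure above.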
The model needs to learn the difference between &quot;always zero&quot; cells, &quot;occasionally one&quot; cells, and &quot;consistently busy&quot; cells.</p>\n<p>If you plot the crime count distribution across non-zero cells, it follows something close to a power law. A tiny number of cells carry an outsized share of the signal. This is textbook <a href=\"https://pmc.ncbi.nlm.nih.gov/articles/PMC7319308/\">spatial concentration of crime</a> — it's been documented in basically every city ever studied.</p>\n<p>For modelling, this means two things. First, aggregate metrics like RMSE will be dominated by how well the model predicts the high-count cells. Second, predicting &quot;zero&quot; for a sparse cell is almost always correct but completely uninformative. We'll need to think carefully about what &quot;accuracy&quot; actually means when we get to evaluation.</p>\n<h2>What this means for the models</h2>\n<p>The EDA tells us a few things that should directly shape how we build and evaluate the models:</p>\n<p>The seasonal signal is strong and consistent. A model that can't capture monthly seasonality is worse than useless — it's worse than a calendar.</p>\n<p>Spatial structure is real and persistent. Hotspots don't move much. A model that learns static spatial patterns will get a lot of the way there, even without understanding temporal dynamics.</p>\n<p>We already know the CBD will have lots of theft next month. That's not what we're trying to predict. The real value is in the margins — the cells that go from quiet to active, or the months where a normally stable area spikes. That's where deep learning might actually add something over simple baselines.</p>\n<p>Speaking of which — we need baselines. Otherwise we won't know if ConvLSTM is actually clever or just expensive. 
That's next.</p>\n","date_published":"Thu, 02 Apr 2026 00:00:00 GMT","image":"https://jonno.nz/og/what-the-data-actually-shows.png"},{"id":"https://jonno.nz/posts/built-an-sms-gateway-with-a-20-dollar-android-phone/","url":"https://jonno.nz/posts/built-an-sms-gateway-with-a-20-dollar-android-phone/","title":"How I Built an SMS Gateway with a $20 Android Phone","content_html":"<p>Twilio charges around $0.05–0.06 per SMS round-trip. Doesn't sound like much until you're building an MVP that sends reminders, confirmations, and notifications — suddenly you're looking at $50/month for a thousand messages. For an app that's not making money yet, that's a dumb tax.</p>\n<p>Here's what I did instead: grabbed a cheap Android phone, installed an open-source app called <a href=\"https://github.com/capcom6/android-sms-gateway\">SMS Gateway for Android</a>, and turned it into a full SMS gateway with a REST API. My SMS costs dropped to whatever my mobile plan charges — which on plenty of prepaid plans is zero. Unlimited texts.</p>\n<p>This post walks through exactly how to wire it into a Next.js app, from first install to receiving webhooks. 
The whole thing took an afternoon.</p>\n<hr>\n<h2>What You're Building</h2>\n<p>By the end of this you'll have:</p>\n<ul>\n<li>An Android phone acting as your SMS gateway</li>\n<li>A webhook endpoint receiving inbound SMS in real-time</li>\n<li>Outbound SMS sent via a simple REST API call</li>\n<li>A provider abstraction so you can swap between SMS Gateway, Twilio, or console logging</li>\n</ul>\n<h2>Prerequisites</h2>\n<ul>\n<li>An Android phone (5.0+) with a SIM card</li>\n<li>A Next.js app (I'm using 15 with App Router, but any backend works)</li>\n<li>Node.js 18+</li>\n<li>ngrok for testing with cloud mode</li>\n</ul>\n<hr>\n<h2>Install SMS Gateway on Android</h2>\n<ol>\n<li>\n<p>Install <strong>SMS Gateway for Android</strong> from the <a href=\"https://play.google.com/store/apps/details?id=me.capcom.smsgateway\">Google Play Store</a> or grab the APK from <a href=\"https://github.com/capcom6/android-sms-gateway/releases\">GitHub Releases</a></p>\n</li>\n<li>\n<p>Open the app and <strong>grant SMS permissions</strong> when prompted</p>\n</li>\n<li>\n<p>You'll see the main screen with toggles for Local Server and Cloud Server:</p>\n</li>\n</ol>\n<p><img src=\"https://jonno.nz/img/posts/sms-gateway/screenshot.png\" alt=\"SMS Gateway main screen\"></p>\n<p>The app supports two modes — local and cloud. Both work well, and I'll cover each.</p>\n<hr>\n<h2>Local Server Mode</h2>\n<p>Local mode runs an HTTP server directly on the phone. Your backend talks to it over your local network. 
No cloud dependency, no third-party servers — the simplest setup.</p>\n<h3>Configure It</h3>\n<p><img src=\"https://jonno.nz/img/posts/sms-gateway/local-server.png\" alt=\"Local server settings\">\n<em>Local server configuration</em></p>\n<ol>\n<li>Toggle <strong>&quot;Local Server&quot;</strong> on</li>\n<li>Go to <strong>Settings &gt; Local Server</strong> to configure:\n<ul>\n<li><strong>Port:</strong> 1024–65535 (default <code>8080</code>)</li>\n<li><strong>Username:</strong> minimum 3 characters</li>\n<li><strong>Password:</strong> minimum 8 characters</li>\n</ul>\n</li>\n<li>Tap <strong>&quot;Offline&quot;</strong> — it changes to <strong>&quot;Online&quot;</strong></li>\n<li>Note the <strong>local IP address</strong> displayed (e.g. <code>192.168.1.50</code>)</li>\n</ol>\n<p>Your phone is now running an HTTP server. Verify it:</p>\n<pre><code class=\"language-bash\"># Health check\ncurl http://192.168.1.50:8080/health\n\n# Swagger docs\nopen http://192.168.1.50:8080/docs\n</code></pre>\n<h3>Send Your First SMS</h3>\n<pre><code class=\"language-bash\">curl -X POST http://192.168.1.50:8080/message \\\n  -u &quot;admin:yourpassword&quot; \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{\n    &quot;textMessage&quot;: { &quot;text&quot;: &quot;Hello from my SMS gateway!&quot; },\n    &quot;phoneNumbers&quot;: [&quot;+15551234567&quot;]\n  }'\n</code></pre>\n<p>That's it. 
The phone sends the SMS from its own number, using your mobile plan's rates.</p>\n<h3>Register a Webhook for Inbound SMS</h3>\n<p>To receive SMS messages as webhooks:</p>\n<pre><code class=\"language-bash\">curl -X POST http://192.168.1.50:8080/webhooks \\\n  -u &quot;admin:yourpassword&quot; \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{\n    &quot;id&quot;: &quot;my-webhook&quot;,\n    &quot;url&quot;: &quot;http://192.168.1.100:4000/api/sms/webhook&quot;,\n    &quot;event&quot;: &quot;sms:received&quot;\n  }'\n</code></pre>\n<p>Replace <code>192.168.1.100</code> with your dev machine's local IP. Both devices need to be on the same WiFi network.</p>\n<h3>Local Mode Gotchas</h3>\n<ul>\n<li><strong>AP isolation:</strong> Many routers — especially mesh networks and office WiFi — block device-to-device traffic. If you can't reach the phone, check your router settings for &quot;AP isolation&quot; or &quot;client isolation&quot; and disable it. This one caught me out for a good 20 minutes.</li>\n<li><strong>Battery optimisation:</strong> Android will kill the background server to save battery. Disable battery optimisation for SMS Gateway in your phone settings. <a href=\"https://dontkillmyapp.com/\">dontkillmyapp.com</a> has device-specific instructions — genuinely useful site.</li>\n<li><strong>Keep it plugged in:</strong> During development and in production, the phone lives on a charger. It's not going anywhere.</li>\n</ul>\n<hr>\n<h2>Cloud Server Mode</h2>\n<p>Cloud mode is easier to set up and works from anywhere — no local network required. 
The phone connects to SMS Gateway's cloud relay (<code>api.sms-gate.app</code>), and your backend talks to the same cloud API.</p>\n<p><img src=\"https://jonno.nz/img/posts/sms-gateway/cloud-server.png\" alt=\"Cloud server settings\">\n<em>Cloud server configuration</em></p>\n<h3>Enable It</h3>\n<ol>\n<li>Toggle <strong>&quot;Cloud Server&quot;</strong> on in the app</li>\n<li>Tap <strong>&quot;Offline&quot;</strong> — it connects and registers automatically</li>\n<li>A <strong>username</strong> and <strong>password</strong> are auto-generated (visible in the Cloud Server section)</li>\n<li>Note these credentials — you'll need them for API calls</li>\n</ol>\n<p>The cloud uses a hybrid push architecture: Firebase Cloud Messaging as the primary channel, Server-Sent Events as fallback, and 15-minute polling as a last resort. It's well thought through.</p>\n<h3>Send an SMS via Cloud API</h3>\n<pre><code class=\"language-bash\">curl -X POST https://api.sms-gate.app/3rdparty/v1/messages \\\n  -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{\n    &quot;textMessage&quot;: { &quot;text&quot;: &quot;Hello from the cloud!&quot; },\n    &quot;phoneNumbers&quot;: [&quot;+15551234567&quot;]\n  }'\n</code></pre>\n<h3>Register a Webhook (Cloud Mode)</h3>\n<p>Your webhook URL <strong>must be HTTPS</strong> in cloud mode. 
For local development, use ngrok:</p>\n<pre><code class=\"language-bash\"># Start ngrok tunnel to your dev server\nngrok http 4000\n# Output: https://abc123.ngrok.app\n\n# Register the webhook\ncurl -X POST https://api.sms-gate.app/3rdparty/v1/webhooks \\\n  -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{\n    &quot;url&quot;: &quot;https://abc123.ngrok.app/api/sms/webhook&quot;,\n    &quot;event&quot;: &quot;sms:received&quot;\n  }'\n</code></pre>\n<h3>Manage Webhooks</h3>\n<pre><code class=\"language-bash\"># List webhooks\ncurl -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \\\n  https://api.sms-gate.app/3rdparty/v1/webhooks\n\n# Delete a webhook\ncurl -X DELETE -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \\\n  https://api.sms-gate.app/3rdparty/v1/webhooks/WEBHOOK_ID\n</code></pre>\n<hr>\n<h2>The Code — Next.js Integration</h2>\n<p>Here's how I integrated SMS Gateway into a Next.js app with a clean provider abstraction. The idea is simple — swap providers without touching business logic.</p>\n<h3>Provider Interface</h3>\n<pre><code class=\"language-typescript\">// src/lib/sms/provider.ts\n\nexport interface InboundSms {\n  from: string;\n  body: string;\n  receivedAt?: Date;\n}\n\nexport interface SmsProvider {\n  send(to: string, body: string): Promise&lt;string&gt;;\n  parseWebhook(req: Request): Promise&lt;InboundSms | null&gt;;\n  webhookResponse(replyText?: string): Response;\n}\n\nexport async function getSmsProvider(): Promise&lt;SmsProvider&gt; {\n  const provider = process.env.SMS_PROVIDER || &quot;sms-gate&quot;;\n\n  switch (provider) {\n    case &quot;sms-gate&quot;: {\n      const { SmsGateProvider } = await import(&quot;./sms-gate&quot;);\n      return new SmsGateProvider();\n    }\n    case &quot;console&quot;: {\n      const { ConsoleProvider } = await import(&quot;./console&quot;);\n      return new ConsoleProvider();\n    }\n    default:\n      throw new Error(`Unknown SMS provider: 
${provider}`);\n  }\n}\n</code></pre>\n<h3>SMS Gate Provider</h3>\n<p>The provider handles both local and cloud API differences:</p>\n<pre><code class=\"language-typescript\">// src/lib/sms/sms-gate.ts\n\nimport type { SmsProvider, InboundSms } from &quot;./provider&quot;;\n\nconst SMSGATE_URL = process.env.SMSGATE_URL || &quot;http://localhost:8080&quot;;\nconst SMSGATE_USER = process.env.SMSGATE_USER || &quot;&quot;;\nconst SMSGATE_PASSWORD = process.env.SMSGATE_PASSWORD || &quot;&quot;;\n\nexport class SmsGateProvider implements SmsProvider {\n  private headers(): Record&lt;string, string&gt; {\n    const auth = Buffer.from(\n      `${SMSGATE_USER}:${SMSGATE_PASSWORD}`\n    ).toString(&quot;base64&quot;);\n    return {\n      &quot;Content-Type&quot;: &quot;application/json&quot;,\n      Authorization: `Basic ${auth}`,\n    };\n  }\n\n  async send(to: string, body: string): Promise&lt;string&gt; {\n    const isCloud = SMSGATE_URL.includes(&quot;api.sms-gate.app&quot;);\n    const endpoint = isCloud\n      ? `${SMSGATE_URL}/3rdparty/v1/messages`\n      : `${SMSGATE_URL}/api/3rdparty/v1/message`;\n    const payload = isCloud\n      ? 
{ textMessage: { text: body }, phoneNumbers: [to] }\n      : { phoneNumbers: [to], message: body };\n\n    const res = await fetch(endpoint, {\n      method: &quot;POST&quot;,\n      headers: this.headers(),\n      body: JSON.stringify(payload),\n    });\n\n    if (!res.ok) {\n      const err = await res.text();\n      throw new Error(`SMS Gate send failed: ${res.status} ${err}`);\n    }\n\n    const data = await res.json();\n    return data.id || &quot;sent&quot;;\n  }\n\n  async parseWebhook(req: Request): Promise&lt;InboundSms | null&gt; {\n    try {\n      const body = await req.json();\n\n      if (body.event !== &quot;sms:received&quot; || !body.payload) {\n        return null;\n      }\n\n      const { message, receivedAt } = body.payload;\n      // Field name varies across payload versions: accept phoneNumber or sender\n      const from = body.payload.phoneNumber ?? body.payload.sender;\n      if (!from || !message) return null;\n\n      return {\n        from,\n        body: message,\n        receivedAt: receivedAt ? new Date(receivedAt) : new Date(),\n      };\n    } catch {\n      return null;\n    }\n  }\n\n  webhookResponse(): Response {\n    return new Response(JSON.stringify({ ok: true }), {\n      headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },\n    });\n  }\n}\n</code></pre>\n<h3>Webhook Route</h3>\n<p>A basic webhook handler that receives inbound SMS and replies:</p>\n<pre><code class=\"language-typescript\">// src/app/api/sms/webhook/route.ts\n\nimport { NextRequest } from &quot;next/server&quot;;\nimport { getSmsProvider } from &quot;@/lib/sms/provider&quot;;\n\nexport async function POST(req: NextRequest) {\n  const provider = await getSmsProvider();\n  const sms = await provider.parseWebhook(req);\n\n  if (!sms) {\n    return new Response(&quot;Bad request&quot;, { status: 400 });\n  }\n\n  const { from, body } = sms;\n\n  // Look up the sender — replace with your own user lookup\n  const user = await findUserByPhone(from);\n\n  if (!user) {\n    await provider.send(from, &quot;Hey! 
Text us back once you've signed up.&quot;);\n    return provider.webhookResponse();\n  }\n\n  // Known user — do whatever your app needs\n  console.log(`[SMS from ${from}]: ${body}`);\n  await provider.send(from, &quot;Got it — we're on it!&quot;);\n  return provider.webhookResponse();\n}\n</code></pre>\n<h3>Console Provider (for Testing)</h3>\n<p>For local development without a phone:</p>\n<pre><code class=\"language-typescript\">// src/lib/sms/console.ts\n\nimport type { SmsProvider, InboundSms } from &quot;./provider&quot;;\n\nexport class ConsoleProvider implements SmsProvider {\n  async send(to: string, body: string): Promise&lt;string&gt; {\n    console.log(`[SMS -&gt; ${to}] ${body}`);\n    return `console-${Date.now()}`;\n  }\n\n  async parseWebhook(req: Request): Promise&lt;InboundSms | null&gt; {\n    const data = await req.json();\n    return {\n      from: data.from || &quot;+15550000000&quot;,\n      body: data.body || &quot;&quot;,\n      receivedAt: new Date(),\n    };\n  }\n\n  webhookResponse(): Response {\n    return new Response(JSON.stringify({ ok: true }), {\n      headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },\n    });\n  }\n}\n</code></pre>\n<h3>Environment Variables</h3>\n<pre><code class=\"language-bash\"># .env\n\n# Provider: &quot;sms-gate&quot; | &quot;console&quot;\nSMS_PROVIDER=sms-gate\n\n# Local mode\nSMSGATE_URL=http://192.168.1.50:8080\nSMSGATE_USER=admin\nSMSGATE_PASSWORD=yourpassword\n\n# Cloud mode\n# SMSGATE_URL=https://api.sms-gate.app\n# SMSGATE_USER=auto-generated-username\n# SMSGATE_PASSWORD=auto-generated-password\n</code></pre>\n<hr>\n<h2>Webhook Payload Reference</h2>\n<p>When someone texts your Android phone, SMS Gateway sends a POST to your webhook URL:</p>\n<pre><code class=\"language-json\">{\n  &quot;id&quot;: &quot;Ey6ECgOkVVFjz3CL48B8C&quot;,\n  &quot;webhookId&quot;: &quot;LreFUt-Z3sSq0JufY9uWB&quot;,\n  &quot;deviceId&quot;: &quot;your-device-id&quot;,\n  &quot;event&quot;: 
&quot;sms:received&quot;,\n  &quot;payload&quot;: {\n    &quot;messageId&quot;: &quot;abc123&quot;,\n    &quot;message&quot;: &quot;Hello!&quot;,\n    &quot;sender&quot;: &quot;+15551234567&quot;,\n    &quot;recipient&quot;: &quot;+15559876543&quot;,\n    &quot;simNumber&quot;: 1,\n    &quot;receivedAt&quot;: &quot;2026-04-01T12:41:59.000+00:00&quot;\n  }\n}\n</code></pre>\n<h3>Available Events</h3>\n<table>\n<thead>\n<tr>\n<th>Event</th>\n<th>Description</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><code>sms:received</code></td>\n<td>Inbound SMS received</td>\n</tr>\n<tr>\n<td><code>sms:sent</code></td>\n<td>Outbound SMS sent</td>\n</tr>\n<tr>\n<td><code>sms:delivered</code></td>\n<td>Outbound SMS confirmed delivered</td>\n</tr>\n<tr>\n<td><code>sms:failed</code></td>\n<td>Outbound SMS failed</td>\n</tr>\n<tr>\n<td><code>system:ping</code></td>\n<td>Heartbeat — device still alive</td>\n</tr>\n</tbody>\n</table>\n<h3>Webhook Security</h3>\n<p>SMS Gateway signs webhook payloads with HMAC-SHA256. Two headers are included:</p>\n<ul>\n<li><code>X-Signature</code> — hex-encoded HMAC-SHA256 signature</li>\n<li><code>X-Timestamp</code> — Unix timestamp used in signing</li>\n</ul>\n<pre><code class=\"language-typescript\">import crypto from &quot;crypto&quot;;\n\nfunction verifyWebhook(\n  signingKey: string,\n  payload: string,\n  timestamp: string,\n  signature: string\n): boolean {\n  const expected = crypto\n    .createHmac(&quot;sha256&quot;, signingKey)\n    .update(payload + timestamp)\n    .digest(&quot;hex&quot;);\n  return crypto.timingSafeEqual(\n    Buffer.from(expected, &quot;hex&quot;),\n    Buffer.from(signature, &quot;hex&quot;)\n  );\n}\n</code></pre>\n<h3>Retry Behaviour</h3>\n<p>If your server doesn't respond 2xx within 30 seconds, SMS Gateway retries with exponential backoff — starting at 10 seconds, doubling each time, up to 14 attempts (~2 days). 
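A quick sanity check on that window — assuming a 10-second initial delay doubling across all 14 attempts, the arithmetic does land on roughly two days:

```python
# Retry schedule: 10 s, 20 s, 40 s, ... doubling across 14 attempts
delays = [10 * 2 ** n for n in range(14)]
total = sum(delays)                      # geometric series: 10 * (2**14 - 1)
print(total, "seconds ≈", round(total / 86400, 1), "days")   # → 163830 seconds ≈ 1.9 days
```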
Solid default behaviour, you don't need to configure anything.</p>\n<hr>\n<h2>Testing the Full Flow</h2>\n<h3>1. Start Your Dev Server</h3>\n<pre><code class=\"language-bash\">npm run dev\n# Next.js running at http://localhost:4000\n</code></pre>\n<h3>2. Expose It (Cloud Mode)</h3>\n<pre><code class=\"language-bash\">ngrok http 4000\n# https://abc123.ngrok.app -&gt; http://localhost:4000\n</code></pre>\n<h3>3. Register the Webhook</h3>\n<pre><code class=\"language-bash\">curl -X POST https://api.sms-gate.app/3rdparty/v1/webhooks \\\n  -u &quot;USERNAME:PASSWORD&quot; \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{\n    &quot;url&quot;: &quot;https://abc123.ngrok.app/api/sms/webhook&quot;,\n    &quot;event&quot;: &quot;sms:received&quot;\n  }'\n</code></pre>\n<h3>4. Send a Text</h3>\n<p>Text your Android phone from another phone. You should see:</p>\n<ol>\n<li>SMS Gateway receives the text</li>\n<li>Webhook fires to your ngrok URL</li>\n<li>Your Next.js server processes it</li>\n<li>A reply SMS is sent back via the API</li>\n<li>The sender's phone receives the reply</li>\n</ol>\n<p>That moment when the reply lands on your phone — genuinely satisfying.</p>\n<h3>Test Without a Phone</h3>\n<pre><code class=\"language-bash\"># Simulate an inbound SMS with the console provider\nSMS_PROVIDER=console npm run dev\n\ncurl -X POST http://localhost:4000/api/sms/webhook \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{&quot;from&quot;: &quot;+15551234567&quot;, &quot;body&quot;: &quot;Hello&quot;}'\n</code></pre>\n<hr>\n<h2>Production Considerations</h2>\n<h3>The Phone Setup</h3>\n<ul>\n<li><strong>Dedicated device:</strong> Use a cheap Android phone ($20) with a prepaid SIM. It sits on a charger plugged into power and WiFi. That's its whole life now.</li>\n<li><strong>Battery optimisation off:</strong> Disable battery optimisation for SMS Gateway or Android will kill it. 
<a href=\"https://dontkillmyapp.com/\">dontkillmyapp.com</a> for your specific device.</li>\n<li><strong>Auto-start:</strong> Enable &quot;start on boot&quot; in the SMS Gateway app settings.</li>\n<li><strong>Monitoring:</strong> Register a <code>system:ping</code> webhook to alert if the device goes offline.</li>\n</ul>\n<h3>Local vs Cloud</h3>\n<table>\n<thead>\n<tr>\n<th></th>\n<th>Local</th>\n<th>Cloud</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><strong>Latency</strong></td>\n<td>Lower (direct)</td>\n<td>Slightly higher (relay)</td>\n</tr>\n<tr>\n<td><strong>Network</strong></td>\n<td>Same network required</td>\n<td>Works from anywhere</td>\n</tr>\n<tr>\n<td><strong>Privacy</strong></td>\n<td>Messages never leave your network</td>\n<td>Messages transit through SMS Gateway's servers</td>\n</tr>\n<tr>\n<td><strong>Reliability</strong></td>\n<td>Depends on your network</td>\n<td>Adds FCM/SSE redundancy</td>\n</tr>\n<tr>\n<td><strong>Cost</strong></td>\n<td>Free</td>\n<td>Free (community tier)</td>\n</tr>\n</tbody>\n</table>\n<p>I use <strong>cloud mode in production</strong> because my server's hosted on Railway and can't reach the phone's local network. For development on the same WiFi, local mode is simpler and faster.</p>\n<h3>Cost Comparison</h3>\n<table>\n<thead>\n<tr>\n<th>Provider</th>\n<th>SMS Cost</th>\n<th>Monthly (1,000 msgs)</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Twilio</td>\n<td>~$0.05/msg</td>\n<td>~$50</td>\n</tr>\n<tr>\n<td>SMS Gateway + Prepaid SIM</td>\n<td>$0/msg (unlimited plan)</td>\n<td>~$8 (plan cost)</td>\n</tr>\n</tbody>\n</table>\n<p>That's an <strong>80%+ saving</strong>, and it scales linearly — 10,000 messages a month is still just your plan cost.</p>\n<hr>\n<p>It's worth knowing this is a whole category now. <a href=\"https://httpsms.com/\">httpSMS</a> and <a href=\"https://textbee.dev/\">textbee</a> do similar things. 
I went with <a href=\"https://github.com/capcom6/android-sms-gateway\">SMS Gateway for Android</a> because the local mode is properly useful for development, the <a href=\"https://docs.sms-gate.app/\">documentation</a> is solid, and it's actively maintained — v1.56.0 dropped in March 2026.</p>\n<p>For an MVP, the maths is obvious. A $20 phone and an $8/month plan gets you a programmable SMS gateway that you fully control. No per-message fees, no carrier contracts, no vendor lock-in. If you outgrow it, swap the provider interface to Twilio and you're done — that's why the abstraction exists.</p>\n<p><strong>Links:</strong></p>\n<ul>\n<li><a href=\"https://github.com/capcom6/android-sms-gateway\">SMS Gateway for Android on GitHub</a></li>\n<li><a href=\"https://docs.sms-gate.app/\">SMS Gateway Documentation</a></li>\n<li><a href=\"https://play.google.com/store/apps/details?id=me.capcom.smsgateway\">Google Play Store listing</a></li>\n</ul>\n","date_published":"Thu, 02 Apr 2026 00:00:00 GMT","image":"https://jonno.nz/og/built-an-sms-gateway-with-a-20-dollar-android-phone.png"},{"id":"https://jonno.nz/posts/giving-crime-a-place-on-the-map/","url":"https://jonno.nz/posts/giving-crime-a-place-on-the-map/","title":"Giving Crime a Place on the Map — Part 2: Geographic Data Pipeline","content_html":"<p>A crime record that says &quot;Woodglen, meshblock 0284305&quot; is useless for spatial modelling. It's a name and a number. You can't plot it, you can't measure distances from it, and you definitely can't feed it to a neural network that thinks in grid cells.</p>\n<p>To do anything spatial, every record needs actual coordinates — latitude, longitude, or ideally metres on a proper projection. 
That means downloading Stats NZ's geographic boundary files and joining them to our crime data.</p>\n<h2>NZ's geographic hierarchy</h2>\n<p>New Zealand has a neat nested system of geographic units maintained by <a href=\"https://datafinder.stats.govt.nz/layer/92197-meshblock-2018-generalised/\">Stats NZ</a>:</p>\n<pre><code class=\"language-mermaid\">graph TD\n    A[&quot;Region (16)&quot;] --&gt; B[&quot;Territorial Authority (67)&quot;]\n    B --&gt; C[&quot;Area Unit / SA2 (~2,000)&quot;]\n    C --&gt; D[&quot;Meshblock (~53,000)&quot;]\n</code></pre>\n<p>Regions are the big ones — Auckland, Canterbury, Wellington. Territorial authorities are your cities and districts. Area units are roughly suburb-sized. And meshblocks are the smallest unit — about 100 people each, roughly a city block. Our crime data uses area units and meshblocks, so those are the layers we need.</p>\n<p>There's a gotcha here. Stats NZ replaced &quot;Area Units&quot; with &quot;Statistical Area 2&quot; (SA2) in 2018 as part of a geographic classification overhaul. But the NZ Police crime data still uses the old area unit names. So we need the <strong>2017 vintage</strong> boundary files, not the current ones. Use the wrong vintage and your join silently fails on hundreds of area units. 
Ask me how I know.</p>\n<h2>Three boundary files</h2>\n<p>We downloaded three layers from <a href=\"https://datafinder.stats.govt.nz/\">Stats NZ DataFinder</a> via their WFS API:</p>\n<table>\n<thead>\n<tr>\n<th>Layer</th>\n<th>Features</th>\n<th>Size</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Area Unit 2017 (generalised)</td>\n<td>2,004</td>\n<td>88 MB</td>\n</tr>\n<tr>\n<td>Meshblock 2018 (generalised)</td>\n<td>53,589</td>\n<td>213 MB</td>\n</tr>\n<tr>\n<td>Territorial Authority 2023</td>\n<td>68</td>\n<td>34 MB</td>\n</tr>\n</tbody>\n</table>\n<p>All three come in <a href=\"https://epsg.io/2193\">EPSG:2193</a> — that's <a href=\"https://www.linz.govt.nz/guidance/geodetic-system/coordinate-systems-used-new-zealand/projections/new-zealand-transverse-mercator-2000-nztm2000\">NZTM2000</a>, New Zealand's official projected coordinate system. The units are metres, not degrees. This matters a lot later when we need to build a &quot;500m grid&quot; — you want that to be 500 actual metres, not some approximation based on latitude.</p>\n<p>We use generalised (simplified) versions rather than high-definition. The full-resolution meshblock layer is over a gigabyte. For centroid calculations and spatial joins, the generalised versions are more than accurate enough.</p>\n<h2>The area unit join: 99.4%</h2>\n<p>Joining crime records to area unit boundaries by name was almost perfect. 1,146,721 of 1,154,102 records matched — 99.4%.</p>\n<p>Only two area unit codes failed:</p>\n<ul>\n<li><code>999999</code> — the official &quot;unspecified&quot; catch-all (7,331 records)</li>\n<li><code>-29</code> — a straight-up data entry error (50 records)</li>\n</ul>\n<p>That's a genuinely excellent result. The unmatched records aren't a bug in our pipeline — they're genuinely unlocatable crimes that the police couldn't assign to a specific area. 
Nothing we can do about those, and nothing we should try to do.</p>\n<h2>The meshblock join: 81.2%</h2>\n<p>The meshblock join came in lower at 81.2% — 937,604 records matched out of 1,154,102.</p>\n<p>This is expected and it's fine. Here's why: NZ meshblock boundaries get revised with every census. We're using 2018 boundaries, but our crime data runs through January 2026. Any crime from 2023 onwards might reference a 2023-vintage meshblock code that simply doesn't exist in the 2018 file. Some meshblocks get split, some get merged, some get renumbered entirely.</p>\n<p>81.2% still gives us fine-grained coordinates for the vast majority of records. For the ~19% that miss, we fall back to the area unit centroid. It's less precise — suburb-level instead of block-level — but it's better than dropping the records entirely.</p>\n<h2>Two coordinate systems</h2>\n<p>This is one of those things that seems like a minor detail but will bite you hard if you get it wrong. We use two coordinate reference systems throughout the project:</p>\n<p><strong>NZTM2000 (EPSG:2193)</strong> for all spatial analysis. The units are metres, which makes grid construction trivial — a 500m cell is literally 500 units on each axis. Distance calculations are straightforward. No need to worry about the fact that a degree of longitude means different things at different latitudes.</p>\n<p><strong>WGS84 (EPSG:4326)</strong> for the frontend dashboard only. deck.gl and MapLibre expect coordinates in degrees (latitude/longitude), which is the standard for web mapping.</p>\n<p>The rule is simple: do everything in NZTM2000, convert to WGS84 at the very end when exporting for the dashboard. Mixing coordinate systems mid-pipeline is a recipe for bugs that are incredibly annoying to track down.</p>\n<h2>The output</h2>\n<p>Each crime record now has up to 8 new geographic columns — area unit centroids, meshblock centroids, and areas in both coordinate systems. 
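One cheap way to keep the two systems from getting mixed mid-pipeline is a range assert. NZTM2000 coordinates are millions of metres while lat/lon values never exceed 180, so a stray degree pair is trivially detectable. A sketch — the bounds below are rough NZ-wide bands I'm assuming, not official limits:

```python
def assert_nztm(x: float, y: float) -> None:
    """Fail fast if a coordinate looks like degrees instead of EPSG:2193 metres."""
    # Rough easting/northing bands covering New Zealand in NZTM2000
    if not (1_000_000 < x < 2_200_000 and 4_700_000 < y < 6_300_000):
        raise ValueError(f"({x}, {y}) doesn't look like NZTM2000 metres")

assert_nztm(1_757_000, 5_920_000)   # central Auckland, roughly: passes
# assert_nztm(174.78, -36.90)       # lat/lon sneaking in: raises
```

Sprinkled at the entry points of spatial functions, this turns a silent wrong-units bug into an immediate, obvious failure.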
The enriched dataset saves as <code>crimes_with_geo.parquet</code> at 21.9 MB with 29 columns.</p>\n<p>Quick sanity check: Auckland's mean crime centroid lands at lat -36.90, lon 174.78 — right in the middle of the urban area. If that number had come back as somewhere in the Waikato, we'd know something went wrong.</p>\n<h2>What's next</h2>\n<p>Every crime record now has a place in physical space. But individual points aren't what the neural network needs — it needs a regular grid. In the next post, we'll overlay a 500m × 500m grid on Auckland, count crimes per cell per month, and build the 4D tensor that turns crime prediction into a video prediction problem.</p>\n","date_published":"Thu, 26 Mar 2026 00:00:00 GMT","image":"https://jonno.nz/og/giving-crime-a-place-on-the-map.png"},{"id":"https://jonno.nz/posts/crime-as-video/","url":"https://jonno.nz/posts/crime-as-video/","title":"Crime as Video — Part 3: Spatiotemporal Grid Construction","content_html":"<p>This is where the project gets properly fun.</p>\n<p>We've got 1.15 million clean crime records. Every one of them has coordinates — either precise meshblock centroids or area unit fallbacks from Part 2. But a bag of lat/lon points isn't what a neural network wants. ConvLSTM and ST-ResNet are fundamentally image-processing architectures. They expect regular 2D grids — rows and columns, like pixels in a photograph.</p>\n<p>So our job now is to convert the messy reality of crime locations into clean, regular &quot;crime images&quot; that a convolutional network can actually consume. And once you see it framed that way, crime prediction becomes video prediction. Each month is a frame. Each grid cell is a pixel. The brightness is the crime count.</p>\n<h2>Choosing 500m</h2>\n<p>This is the single most consequential decision in the entire data pipeline. Get the grid resolution wrong and everything downstream suffers.</p>\n<p>Too fine — say 100m cells — and the vast majority of cells are empty in any given month. 
The model sees an ocean of zeros with occasional spikes, which is incredibly hard to learn from. Too coarse — say 2km — and you've blurred away the spatial patterns you're trying to detect. &quot;Auckland CBD&quot; and &quot;Ponsonby&quot; become the same cell, which is useless.</p>\n<p>We computed Auckland's urban crime extent from the meshblock centroids (5th to 95th percentile to exclude outliers like Great Barrier Island):</p>\n<table>\n<thead>\n<tr>\n<th>Metric</th>\n<th>Value</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Urban extent</td>\n<td>27.7 km × 36.9 km</td>\n</tr>\n<tr>\n<td>Grid resolution</td>\n<td>500m × 500m</td>\n</tr>\n<tr>\n<td>Grid dimensions</td>\n<td>77 rows × 59 columns</td>\n</tr>\n<tr>\n<td>Total cells</td>\n<td>4,543</td>\n</tr>\n</tbody>\n</table>\n<p>At 500m, each cell covers roughly a few city blocks. That's fine enough to distinguish a commercial strip from a residential street, but coarse enough that most cells accumulate at least some crime over the 48-month period. It's a sweet spot — and it's consistent with what <a href=\"https://arxiv.org/abs/2502.07465\">recent crime forecasting research</a> uses for similar models in US cities.</p>\n<h2>Simple maths, no spatial joins</h2>\n<p>Here's the nice thing about working in NZTM2000 — the coordinate system we set up in Part 2 where units are metres. Assigning a crime to a grid cell is just floor division:</p>\n<pre><code class=\"language-python\">grid_j = floor((x - xmin) / 500)  # column index\ngrid_i = floor((y - ymin) / 500)  # row index\n</code></pre>\n<p>No spatial joins, no polygon intersection, no geopandas overhead. Just arithmetic. It processes all 400k Auckland records in under a second.</p>\n<p>For the ~22% of Auckland records that didn't get meshblock coordinates in Part 2, we fall back to area unit centroids converted to NZTM2000. 
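The floor-division assignment above vectorises over the whole dataset in numpy. A sketch — the function and argument names are illustrative, not the pipeline's actual code:

```python
import numpy as np

def assign_cells(x, y, xmin, ymin, cell=500.0):
    """Vectorised NZTM metres → (row, col) grid indices via floor division."""
    j = np.floor((np.asarray(x) - xmin) / cell).astype(int)   # column, west → east
    i = np.floor((np.asarray(y) - ymin) / cell).astype(int)   # row, south → north
    return i, j

# One point 250 m east and 750 m north of the grid origin
i, j = assign_cells([1_757_250.0], [5_920_750.0], xmin=1_757_000.0, ymin=5_920_000.0)
print(i[0], j[0])   # → 1 0
```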
Those records land at the centre of their suburb rather than their exact location — less precise, but dropping them entirely would be worse.</p>\n<p>The result: 354,387 of 412,669 Auckland records (85.9%) fall within the grid. The remaining 14% are in Auckland's outer fringes — Great Barrier Island, rural Rodney, the edges of the Waitakere Ranges — beyond our urban bounding box. That's fine. We're modelling urban crime patterns, not rural ones.</p>\n<h2>The 4D tensor</h2>\n<p>With every crime assigned to a cell, we aggregate by grid position, month, and crime type:</p>\n<pre><code>(grid_i, grid_j, month, crime_type) → sum(victimisations)\n</code></pre>\n<p>This gives us a 4D tensor:</p>\n<table>\n<thead>\n<tr>\n<th>Dimension</th>\n<th>Size</th>\n<th>Meaning</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>T (time)</td>\n<td>48</td>\n<td>Months: Feb 2022 – Jan 2026</td>\n</tr>\n<tr>\n<td>H (height)</td>\n<td>77</td>\n<td>Grid rows (south → north)</td>\n</tr>\n<tr>\n<td>W (width)</td>\n<td>59</td>\n<td>Grid columns (west → east)</td>\n</tr>\n<tr>\n<td>C (channels)</td>\n<td>6</td>\n<td>Crime types: theft, burglary, assault, robbery, sexual, harm</td>\n</tr>\n</tbody>\n</table>\n<p>Think of it as a 48-frame video with 6 colour channels. A regular video has 3 channels — red, green, blue. Ours has 6 — theft, burglary, assault, robbery, sexual offences, harm. Each pixel's brightness in a given channel tells you how many of that crime type happened in that 500m cell during that month.</p>\n<p>I genuinely love this framing. It takes a complicated spatial-temporal prediction problem and maps it onto something that decades of computer vision research already knows how to handle.</p>\n<h2>91.7% zeros</h2>\n<p>The tensor is overwhelmingly empty. 91.7% of all cells are zero.</p>\n<p>This makes complete sense if you think about it. Most 500m squares in Auckland don't have a single reported crime in any given month. 
Crime clusters — commercial corridors, transport hubs, specific residential pockets. The non-zero 8.3% is where all the signal lives.</p>\n<p>The sparsity does create a training challenge though. If the model just predicted zero everywhere, it'd be right 91.7% of the time. Useless, but technically accurate. That's why we'll use <code>log1p</code> normalisation during training — it compresses the range from [0, 50+] to [0, ~4], giving the model a more balanced gradient to learn from. And it's why the loss function needs to care more about the non-zero cells than the empty ones.</p>\n<p>The upside of all those zeros is storage. The <a href=\"https://numpy.org/doc/stable/reference/generated/numpy.savez_compressed.html\">compressed numpy format</a> handles sparse data beautifully — the full 4D tensor saves to just 0.2 MB. Compare that to the 21.9 MB Parquet from Part 2.</p>\n<h2>Train, validate, test</h2>\n<p>We split the 48 months temporally — no shuffling, no random sampling:</p>\n<table>\n<thead>\n<tr>\n<th>Set</th>\n<th>Months</th>\n<th>Range</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Train</td>\n<td>36</td>\n<td>Feb 2022 – Jan 2025</td>\n</tr>\n<tr>\n<td>Validation</td>\n<td>6</td>\n<td>Feb 2025 – Jul 2025</td>\n</tr>\n<tr>\n<td>Test</td>\n<td>6</td>\n<td>Aug 2025 – Jan 2026</td>\n</tr>\n</tbody>\n</table>\n<p>The model trains on three years, tunes on six months, and gets evaluated on the most recent six months it's never seen. There's no spatial leakage either — we don't hold out specific grid cells. The model has to predict all locations for future months simultaneously.</p>\n<p>This is the only honest way to evaluate a time-series model. 
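</p>\n<p>As code, the split is nothing more than ordered slices along the time axis; a quick sketch using the tensor shape built above:</p>\n<pre><code class=\"language-python\">import numpy as np\n\ntensor = np.zeros((48, 77, 59, 6))  # (T, H, W, C) as above\n\ntrain = tensor[:36]    # Feb 2022 to Jan 2025\nval = tensor[36:42]    # Feb 2025 to Jul 2025\ntest = tensor[42:48]   # Aug 2025 to Jan 2026\n</code></pre>\n<p>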
If you randomly shuffle months into train and test, the model can memorise seasonal patterns and look brilliant without actually learning anything useful about temporal dynamics.</p>\n<h2>What the tensor reveals</h2>\n<p>Even at this aggregate level, clear patterns jump out.</p>\n<p>February tends to be the quietest month (~7–8k victimisations across Auckland), while October through January — spring and early summer — consistently peaks at 8.5–9.5k. 2023 was the peak year across the board, with a gradual decline through 2024 and into 2025.</p>\n<p>Theft accounts for 72% of the tensor values (283k victimisations), burglary 17% (68k), and assault 9% (34k). That theft dominance from Part 1 — the 66% figure — gets even more pronounced when you focus on Auckland, because theft clusters harder in urban areas than other crime types do.</p>\n<h2>What's next</h2>\n<p>The tensor is built. The model input is ready. But before throwing deep learning at anything, we need to properly understand what patterns actually exist in this data — when does crime peak, where does it cluster, and how do different crime types behave differently. Next post: exploratory data analysis.</p>\n","date_published":"Thu, 26 Mar 2026 00:00:00 GMT","image":"https://jonno.nz/og/crime-as-video.png"},{"id":"https://jonno.nz/posts/wrangling-a-million-crime-records/","url":"https://jonno.nz/posts/wrangling-a-million-crime-records/","title":"Wrangling a Million Crime Records — Part 1: Data Acquisition","content_html":"<p>The very first thing NZ Police's crime dataset teaches you is that government data is never straightforward.</p>\n<p>You download the CSV from <a href=\"https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz/victimisation-time-and-place\">policedata.nz</a>, expecting to do a quick <code>pd.read_csv()</code> and start exploring. Instead you get a 503MB file encoded in UTF-16 Little Endian with tab delimiters. Not a regular CSV. Not even close. 
This is a legacy format from old Excel exports and most tools just silently corrupt it if you try to read it as UTF-8.</p>\n<pre><code class=\"language-python\">df = pd.read_csv(&quot;data.csv&quot;, encoding=&quot;utf-16-le&quot;, sep=&quot;\\t&quot;)\n</code></pre>\n<p>That one line took longer to figure out than I'd like to admit.</p>\n<h2>What's actually in here</h2>\n<p>Once you get past the encoding, there's a lot to work with. 1,154,102 rows covering every reported victimisation in New Zealand from February 2022 through January 2026. Each row tells you the crime type (ANZSOC Division), where it happened (down to meshblock level), when it happened (month, day of week, hour of day), and sometimes what weapon was involved.</p>\n<p>There are 20 columns, but five of them are useless — three are just duplicates of &quot;Year Month&quot; and two are constants that add zero information. Every area name and territorial authority has a trailing period stuck on the end: &quot;Auckland.&quot;, &quot;Woodglen.&quot;, &quot;Christchurch City.&quot;. A quirk of the export that'll break any geographic join if you don't strip them.</p>\n<p>And meshblock IDs? Some are 6 digits, some are 7. Stats NZ boundary files use 7-digit codes consistently, so shorter ones need zero-padding. The kind of thing that's invisible until your join silently drops 19% of your records and you spend an afternoon figuring out why.</p>\n<h2>The interesting absence</h2>\n<p>Here's the bit that actually made me stop and think. 32.2% of records have the hour of day recorded as 99 — unknown. Another 23.2% have the day of week as &quot;UNKNOWN&quot;.</p>\n<p>At first this looks like a data quality problem. But it's not — it's actually telling you something about the nature of the crime. If someone breaks into your house while you're at work, you come home to find your stuff gone. Was it 9am or 2pm? 
You've got no idea, and neither do the police.</p>\n<p>Property crimes — theft, burglary — make up the bulk of these unknowns. Assault, by contrast, almost always has a precise time because there's a victim present when it happens. The absence of data is itself a signal about what kind of crime you're looking at.</p>\n<p>78.6% of location type values are just &quot;.&quot; — effectively missing. That column is sparsely populated but still useful for the roughly one in five records that have it.</p>\n<h2>Cleaning it up</h2>\n<p>We built a modular pipeline where each cleaning step is its own function. Nothing fancy, just practical:</p>\n<pre><code class=\"language-python\">def ingest() -&gt; pd.DataFrame:\n    df = load_raw_csv(RAW_CSV)            # UTF-16 LE, tab-delimited\n    df = drop_redundant_columns(df)        # Remove 5 useless columns\n    df = rename_columns(df)                # snake_case everything\n    df = parse_dates(df)                   # &quot;July 2022&quot; → datetime\n    df = clean_strings(df)                 # Strip trailing periods\n    df = clean_meshblocks(df)              # Zero-pad to 7 digits\n    df = encode_unknowns(df)              # 99 → NaN, &quot;UNKNOWN&quot; → NaN\n    df = map_crime_types(df)               # ANZSOC Division → short enum\n    return df\n</code></pre>\n<p>Each function does one thing. If something breaks, you know exactly where. If someone wants to understand the pipeline, they can read it top to bottom in about thirty seconds. 
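</p>\n<p>To make that concrete, here is a sketch of what two of those steps might look like (illustrative, not the project's exact code):</p>\n<pre><code class=\"language-python\">import pandas as pd\n\ndef clean_strings(df):\n    # Strip the trailing period the export sticks on every area name\n    for col in ['area_unit', 'territorial_authority']:\n        df[col] = df[col].str.rstrip('.')\n    return df\n\ndef clean_meshblocks(df):\n    # Stats NZ boundary files use 7-digit codes, so zero-pad the short ones\n    df['meshblock'] = df['meshblock'].astype(str).str.zfill(7)\n    return df\n</code></pre>\n<p>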
I've been bitten enough times by monolithic data scripts that I'm allergic to them now.</p>\n<p>The crime type mapping turns six ANZSOC Division values into short enums:</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>Count</th>\n<th>Share</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>761,977</td>\n<td>66.0%</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>247,034</td>\n<td>21.4%</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>115,383</td>\n<td>10.0%</td>\n</tr>\n<tr>\n<td>Robbery</td>\n<td>14,860</td>\n<td>1.3%</td>\n</tr>\n<tr>\n<td>Sexual</td>\n<td>13,943</td>\n<td>1.2%</td>\n</tr>\n<tr>\n<td>Harm</td>\n<td>905</td>\n<td>0.1%</td>\n</tr>\n</tbody>\n</table>\n<p>That 66% theft number is going to haunt us when we get to model training. Any loss function you throw at this data will overwhelmingly optimise for predicting theft, because that's two-thirds of everything. The class imbalance is real and it matters.</p>\n<h2>503MB to 6.3MB</h2>\n<p>The cleaned output goes to <a href=\"https://www.datacamp.com/tutorial/apache-parquet\">Apache Parquet</a> with snappy compression. The result?</p>\n<ul>\n<li><strong>Input</strong>: 503MB CSV (UTF-16, 20 columns)</li>\n<li><strong>Output</strong>: 6.3MB Parquet (21 columns including derived fields)</li>\n<li><strong>Compression</strong>: ~80x</li>\n</ul>\n<p>That's not a typo. Parquet's columnar storage is dramatically more efficient than row-oriented CSV, especially when you've got columns full of repeated values like crime types and territorial authorities. The file loads in under a second compared to 3+ seconds for the CSV. 
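</p>\n<p>Both directions are a single call. A toy sketch of the round trip (assumes a Parquet engine such as pyarrow is installed; toy data, not the real dataset):</p>\n<pre><code class=\"language-python\">import pandas as pd\n\n# Columns full of repeated values, like crime types, are Parquet's best case\ndf = pd.DataFrame({\n    'crime_type': ['theft'] * 900 + ['burglary'] * 100,\n    'hour': list(range(10)) * 100,\n})\ndf.to_parquet('demo.parquet', compression='snappy')\nout = pd.read_parquet('demo.parquet')\n</code></pre>\n<p>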
When you're iterating on analysis and loading this data hundreds of times, that adds up fast.</p>\n<p>The 21 output columns include the original 16 we kept plus five derived ones: a proper datetime, year, month, day-of-week as an integer, and the short crime type enum.</p>\n<h2>Sanity checks</h2>\n<p>Before calling the data clean, we verify everything that matters:</p>\n<ul>\n<li>Row count: 1,154,102 — all rows preserved, nothing dropped</li>\n<li>No nulls in key columns: crime_type, date, area_unit, territorial_authority, meshblock</li>\n<li>Date range: Feb 2022 to Jan 2026 — all 48 months present</li>\n<li>Auckland: 412,669 records, 36% of total — exactly where it should be</li>\n<li>Theft: 761,977 records, 66% — as expected</li>\n<li>No trailing periods anywhere in area names</li>\n<li>All meshblock IDs are 7 digits</li>\n<li>Max hour value is 23 — no more 99s leaking through</li>\n</ul>\n<p>You want these checks automated and running every time you regenerate the data. Future you will thank past you when something upstream changes and a check catches it.</p>\n<h2>What's next</h2>\n<p>We've got clean, compressed crime data — but the records only have meshblock IDs and area unit names. No coordinates. No shapes on a map. 
In the next post, we'll download Stats NZ geographic boundary files and join them to our crime records, giving every victimisation a place in physical space.</p>\n","date_published":"Thu, 26 Mar 2026 00:00:00 GMT","image":"https://jonno.nz/og/wrangling-a-million-crime-records.png"},{"id":"https://jonno.nz/posts/predicting-crime-in-aotearoa/","url":"https://jonno.nz/posts/predicting-crime-in-aotearoa/","title":"Predicting Crime in Aotearoa — A Hobby Project with 1.15 Million Records","content_html":"<p>NZ Police publish every recorded victimisation in the country — over a million records — and most people have no idea.</p>\n<p>I stumbled across <a href=\"https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz\">policedata.nz</a> a while back and was surprised by how much is there. Every reported theft, assault, burglary, robbery — broken down by location, time of day, day of week, and month. All the way down to meshblock level, which is roughly a city block. Updated monthly. <a href=\"https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz\">Creative Commons licensed</a>. You can just... use it.</p>\n<p>So naturally I started wondering — what happens if you point deep learning at this?</p>\n<h2>A million rows of crime</h2>\n<p>The dataset I pulled covers February 2022 through January 2026. Four years, 1,154,102 records across the whole country. The breakdown is roughly what you'd expect: theft dominates at 66%, followed by burglary at 21% and assault at 10%. The remaining sliver covers robbery, sexual offences, and harm/endangerment.</p>\n<p>What makes it interesting for modelling is the spatial granularity. Each record maps to one of 42,778 meshblocks — tiny geographic units defined by Stats NZ. 
That's detailed enough to see patterns at a neighbourhood level, not just &quot;Auckland has more crime than Tauranga&quot; (which, yeah, obviously).</p>\n<p>Auckland alone accounts for about 36% of all recorded crime. Then there's a long tail — Wellington, Christchurch, Hamilton, and then it drops off fast. NZ's urban geography is weird like that. One mega-city and a bunch of mid-size towns.</p>\n<h2>The idea</h2>\n<p>The core question is pretty simple: given the crime patterns of the last few months, can we predict what the next month looks like?</p>\n<p>This isn't Minority Report. Nobody's getting arrested for crimes they haven't committed. It's pattern recognition on publicly available statistics — the same kind of modelling people do with weather data or traffic flows.</p>\n<p>The neat trick is how you frame it. If you overlay a grid on a city — say 500m by 500m cells — and count crimes per cell per month, you get something that looks a lot like a video. Each month is a frame. Each cell is a pixel. The brightness is the crime count.</p>\n<p>Predicting next month's crime becomes a video prediction problem. And there are some really cool deep learning architectures built exactly for that.</p>\n<h2>ConvLSTM and ST-ResNet</h2>\n<p>The two models I'm building are ConvLSTM and ST-ResNet. Don't worry if those sound like gibberish — the short version is they're neural networks designed to learn patterns that are both spatial (where things cluster) and temporal (how those clusters change over time).</p>\n<p><strong>ConvLSTM</strong> is the primary model. A standard LSTM network is great at learning sequences — it's the architecture behind a lot of language and time-series models. ConvLSTM swaps out the matrix multiplications for convolutions, which means it can process grid-structured data. Feed it the last six months of crime grids and it learns both the shape of hotspots and how they evolve. 
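</p>\n<p>Concretely, each training sample is a sliding window along the time axis: six frames in, one frame out. A rough sketch (hypothetical helper, not the project's actual code):</p>\n<pre><code class=\"language-python\">import numpy as np\n\ndef make_windows(tensor, lookback=6):\n    # tensor: (T, H, W, C) stack of monthly crime grids\n    X, y = [], []\n    for t in range(lookback, tensor.shape[0]):\n        X.append(tensor[t - lookback:t])  # six months of frames in\n        y.append(tensor[t])               # the next frame out\n    return np.stack(X), np.stack(y)\n\ntensor = np.zeros((48, 77, 59, 6))\nX, y = make_windows(tensor)  # X: (42, 6, 77, 59, 6), y: (42, 77, 59, 6)\n</code></pre>\n<p>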
<a href=\"https://arxiv.org/abs/2502.07465\">Recent research</a> has shown these work well for crime forecasting across multiple US cities.</p>\n<p><strong>ST-ResNet</strong> takes a different angle. Instead of one sequential view, it captures three temporal perspectives — what happened recently, what happened at the same time last year, and what's the long-term trend. Each gets its own branch of residual convolutional networks, and a learned fusion layer combines them. The <a href=\"https://ojs.aaai.org/index.php/AAAI/article/view/10735\">original paper</a> was for crowd flow prediction in Beijing, but the architecture <a href=\"https://www.nature.com/articles/s41598-025-24559-7\">translates well to crime data</a>.</p>\n<h2>Why NZ?</h2>\n<p>Almost all published crime prediction research uses US data. Chicago, Los Angeles, New York. A <a href=\"https://pmc.ncbi.nlm.nih.gov/articles/PMC7319308/\">systematic review of spatial crime forecasting</a> makes this pretty clear. The models are well-studied, but they're trained on American cities with American urban patterns.</p>\n<p>New Zealand doesn't look like that. Our cities are smaller, more spread out, and the distribution is completely different. Auckland dominates in a way that no single US city does relative to the rest of the country. The spatial patterns here are their own thing, and I couldn't find anyone who'd applied these deep learning approaches to NZ data.</p>\n<p>That's what got me keen. Not because I think I'll beat the published benchmarks — those researchers have GPUs and PhD students and I have a Ryzen 5 desktop with no graphics card. But applying known techniques to new geography is useful work, and nobody else seems to have done it.</p>\n<h2>No GPU, no problem (mostly)</h2>\n<p>All of this runs on my desktop — an AMD Ryzen 5 5600GT with 12 threads and 30GB of RAM. No GPU at all. That sounds limiting, but the Auckland 500m grid works out to about 60 by 80 cells. 
The ConvLSTM model ends up around 5 million parameters, which trains in under an hour on CPU. You don't always need a beefy rig.</p>\n<p>It does mean being smart about model sizing and not going crazy with hyperparameter searches. But for a hobby project, it's more than enough.</p>\n<h2>What's coming</h2>\n<p>This is the first post in a ten-part series covering the whole project end to end.</p>\n<p><strong>Part 1: Data Acquisition and Exploration</strong> — We start with a 503MB CSV file from NZ Police that's UTF-16 encoded (because of course it is), has trailing periods on area names, and 32% of records with unknown hour-of-day. We'll wrangle it into a clean, typed Parquet file and get our first look at what's actually in there.</p>\n<p><strong>Part 2: Geographic Data Pipeline</strong> — Crime records come with meshblock IDs, but no coordinates. We'll join them to Stats NZ geographic boundary files using geopandas, giving every record a place on the map.</p>\n<p><strong>Part 3: Spatiotemporal Grid Construction</strong> — This is where it gets fun. We overlay a 500m by 500m grid on Auckland, count crimes per cell per month, and build the 4D tensors that feed the neural networks. Crime prediction becomes video prediction.</p>\n<p><strong>Part 4: Exploratory Data Analysis</strong> — Before throwing deep learning at anything, we need to understand what patterns actually exist. When does crime peak? Where does it cluster? How do different crime types behave differently?</p>\n<p><strong>Part 5: Baseline Models</strong> — Simple benchmarks — historical averages, naive persistence — so we know whether the deep learning is actually adding value or just being fancy for the sake of it.</p>\n<p><strong>Part 6: ConvLSTM Architecture</strong> — Building and training the primary model. 
Three ConvLSTM layers, six-month lookback window, learning spatial hotspots and temporal dynamics simultaneously.</p>\n<p><strong>Part 7: ST-ResNet Architecture</strong> — The three-branch alternative that captures closeness, periodicity, and long-term trend separately, then fuses them with learned weights.</p>\n<p><strong>Part 8: Model Evaluation and Comparison</strong> — Which model wins? By how much? And more importantly — where do they fail?</p>\n<p><strong>Part 9: Building the Dashboard</strong> — A 3D interactive map built with deck.gl where you can watch crime patterns evolve over time. Dark theme, extruded columns, time-lapse playback.</p>\n<p><strong>Part 10: Deployment and Reflections</strong> — Shipping to Vercel, what worked, what didn't, and what I'd do differently next time.</p>\n<p>Every post will include code and real results. The whole codebase will be open source. And I'll be upfront about the stuff that didn't work — which, trust me, there's plenty of.</p>\n<p>This is a hobby project. It's not a policing tool, it's not a product, and it's definitely not claiming to solve crime. It's just me being curious about what's sitting in a publicly available dataset and seeing how far you can push it with some Python and a bit of patience.</p>\n","date_published":"Tue, 24 Mar 2026 00:00:00 GMT","image":"https://jonno.nz/og/predicting-crime-in-aotearoa.png"},{"id":"https://jonno.nz/posts/nanoclaw-architecture-masterclass-in-doing-less/","url":"https://jonno.nz/posts/nanoclaw-architecture-masterclass-in-doing-less/","title":"NanoClaw's Architecture is a Masterclass in Doing Less","content_html":"<p>Most codebases grow by adding things. More abstractions, more dependencies, more config files. <a href=\"https://github.com/qwibitai/nanoclaw\">NanoClaw</a> does the opposite — it replaces a 500,000-line AI assistant framework with roughly 8,000 lines of TypeScript and six production dependencies.</p>\n<p>That's not a flex about line count. 
It's the result of six architectural bets that are genuinely interesting, and honestly, patterns I wish I'd seen more teams adopt over the years.</p>\n<p>I pulled the entire codebase apart to understand how it works. Here's what I found — and why these patterns matter way beyond a personal AI assistant.</p>\n<h2>The credential proxy</h2>\n<p>This is the one that properly impressed me.</p>\n<p>NanoClaw runs AI agents inside Linux containers. Those agents need API keys to talk to Anthropic. The obvious approach is to pass the key as an environment variable. That's what most of us do, right? Stick it in an env var, maybe pull it from a secrets manager at startup, move on.</p>\n<p>NanoClaw doesn't do that. Instead, every container gets a <em>placeholder</em> API key and a base URL pointing at a localhost proxy on port 3001. The proxy sits on the host, intercepts every outbound API request, strips the placeholder, and injects the real credential before forwarding upstream.</p>\n<pre><code class=\"language-mermaid\">sequenceDiagram\n    participant Agent as Agent (Container)\n    participant Proxy as Credential Proxy (Host :3001)\n    participant API as Anthropic API\n\n    Agent-&gt;&gt;Proxy: POST /v1/messages&lt;br/&gt;x-api-key: &quot;placeholder&quot;\n    Proxy-&gt;&gt;Proxy: Strip placeholder&lt;br/&gt;Inject real API key\n    Proxy-&gt;&gt;API: POST /v1/messages&lt;br/&gt;x-api-key: &quot;sk-ant-...&quot;\n    API--&gt;&gt;Proxy: Response\n    Proxy--&gt;&gt;Agent: Response\n</code></pre>\n<p>The agent literally cannot leak the real key — it never has it. Even if a prompt injection attack tricks the agent into dumping its environment variables, all you get is <code>ANTHROPIC_API_KEY=placeholder</code>. There's nothing to exfiltrate.</p>\n<p>This pattern now has a proper name — the <a href=\"https://nono.sh/blog/blog-credential-injection\">Phantom Token Pattern</a>. 
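</p>\n<p>The heart of the proxy is a one-step header rewrite. A minimal Python sketch of the idea (NanoClaw itself is TypeScript; this is illustrative, not its actual code):</p>\n<pre><code class=\"language-python\">PLACEHOLDER = 'placeholder'\n\ndef inject_credentials(headers, real_key):\n    # Forward everything untouched except the credential header\n    out = dict(headers)\n    if out.get('x-api-key') == PLACEHOLDER:\n        out['x-api-key'] = real_key\n    return out\n</code></pre>\n<p>That single swap is the whole pattern: the real key exists only in the host process, and nothing inside the container ever holds it.</p>\n<p>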
It's being adopted as a <a href=\"https://www.apistronghold.com/blog/phantom-token-pattern-production-ai-agents\">sidecar proxy in Kubernetes</a> for production AI agent deployments. NanoClaw arrived at the same design independently, which tells you something about how natural this solution is once you frame the problem correctly.</p>\n<p>The proxy also handles two auth modes (API key and OAuth) and binds differently per platform — <code>127.0.0.1</code> on macOS/WSL2, the <code>docker0</code> bridge IP on Linux. Secrets are loaded into a plain object, never into <code>process.env</code>, so they can't leak to child processes.</p>\n<h2>Container isolation as authorization</h2>\n<p>At Vend and Xero I spent a lot of time thinking about authorization. Role-based access, permission checks, middleware that decides what each user can do. It works, but it's a constant source of bugs — one missed check and you've got an escalation vulnerability.</p>\n<p>NanoClaw takes a completely different approach. 
Instead of checking what an agent is <em>allowed</em> to do, it controls what the agent can <em>see</em>.</p>\n<pre><code class=\"language-mermaid\">graph TB\n    subgraph &quot;Non-main group container&quot;\n        A[&quot;/workspace/group/ — Writable&quot;]\n        B[&quot;/workspace/global/ — Read-only&quot;]\n        C[&quot;/workspace/ipc/ — Writable&quot;]\n    end\n\n    subgraph &quot;Main group container&quot;\n        D[&quot;/workspace/project/ — Read-only&quot;]\n        E[&quot;/workspace/project/.env — Shadowed to /dev/null&quot;]\n        F[&quot;/workspace/group/ — Writable&quot;]\n        G[&quot;/workspace/ipc/ — Writable&quot;]\n    end\n\n    subgraph &quot;Host filesystem&quot;\n        H[&quot;Project root&quot;]\n        I[&quot;.env (real secrets)&quot;]\n        J[&quot;Group folders&quot;]\n        K[&quot;IPC directories&quot;]\n    end\n\n    H --&gt; D\n    I -.-&gt;|blocked| E\n    J --&gt; A\n    J --&gt; F\n    K --&gt; C\n    K --&gt; G\n</code></pre>\n<p>Each container gets a specific set of mounts. A non-main group container can only see its own folder — it physically cannot access the project root, other groups' data, or host files. The main group gets read-only access to the project, but <code>.env</code> is shadow-mounted to <code>/dev/null</code>. Even with the full project mounted, secrets are invisible.</p>\n<p>The agent inside the container runs with <code>bypassPermissions</code> — it can use Bash, write files, do whatever it wants. But &quot;whatever it wants&quot; is constrained by what the OS lets it see. No application-level permission checks needed.</p>\n<p>Additional mounts are validated against a blocklist (<code>.ssh</code>, <code>.gnupg</code>, <code>.aws</code>, <code>credentials</code>, etc.) that lives <em>outside</em> the project root, making it tamper-proof from inside any container. 
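</p>\n<p>A hedged sketch of what that check might look like (illustrative, not NanoClaw's actual implementation):</p>\n<pre><code class=\"language-python\">import os\n\nBLOCKLIST = {'.ssh', '.gnupg', '.aws', 'credentials'}\n\ndef mount_is_allowed(path):\n    # realpath resolves symlinks, so a link that points into ~/.ssh is caught\n    real = os.path.realpath(os.path.expanduser(path))\n    parts = set(real.split(os.sep))\n    return parts.isdisjoint(BLOCKLIST)\n</code></pre>\n<p>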
Symlink resolution prevents path traversal attacks like creating <code>/tmp/innocent -&gt; ~/.ssh</code>.</p>\n<p>The security model is the filesystem topology itself.</p>\n<h2>The two-cursor system</h2>\n<p>Message processing in NanoClaw uses two independent cursors — and the interaction between them is the cleverest bit of the whole system.</p>\n<p><strong>Cursor 1</strong> is a global watermark. The main polling loop reads new messages from SQLite every 2 seconds. When it sees messages, it advances the watermark <em>immediately</em> and persists it. This prevents the same batch from being processed twice if the loop runs again before the agent finishes.</p>\n<p><strong>Cursor 2</strong> is per-group. When a group's messages are about to be processed, the per-group cursor advances <em>optimistically</em> before the agent starts. If the agent fails:</p>\n<ul>\n<li>If nothing was sent to the user yet — roll back the cursor, retry later</li>\n<li>If a response was already sent — keep the cursor advanced, because rolling back would cause duplicate messages</li>\n</ul>\n<pre><code class=\"language-mermaid\">flowchart TD\n    A[New messages arrive] --&gt; B[Cursor 1: Advance global watermark]\n    B --&gt; C[Persist to SQLite]\n    C --&gt; D[Cursor 2: Advance per-group cursor optimistically]\n    D --&gt; E[Run agent in container]\n    E --&gt; F{Agent succeeded?}\n    F --&gt;|Yes| G[Keep cursor position]\n    F --&gt;|No| H{Output already sent to user?}\n    H --&gt;|Yes| I[Keep cursor — avoid duplicates]\n    H --&gt;|No| J[Roll back cursor — safe to retry]\n</code></pre>\n<p>This gives you at-most-once delivery to the user and at-least-once processing for the agent. 
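</p>\n<p>The failure-handling rule is compact enough to write down; again a sketch, not NanoClaw's actual code:</p>\n<pre><code class=\"language-python\">def cursor_after_failure(current, previous, output_sent):\n    # Once anything has reached the user, rolling back would mean duplicates\n    if output_sent:\n        return current   # keep: preserves at-most-once delivery to the user\n    return previous      # roll back: safe to retry, at-least-once processing\n</code></pre>\n<p>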
That's the right tradeoff for a messaging system — users should never see duplicate messages, but retrying agent work is fine.</p>\n<p>I've built billing platforms where we had to think about exactly these kinds of <a href=\"https://hookdeck.com/webhooks/guides/webhook-delivery-guarantees\">delivery guarantees</a>. Getting this wrong in a payments context means double-charging customers. NanoClaw's approach is clean — two cursors, one rule: once you've spoken to the user, you can't unsay it.</p>\n<h2>File-based IPC</h2>\n<p>No Redis. No RabbitMQ. No gRPC. NanoClaw's inter-process communication is JSON files on the filesystem.</p>\n<p>Containers write outbound messages to <code>/workspace/ipc/messages/</code>. The host writes follow-up messages to <code>/workspace/ipc/input/</code>. Both sides use the same pattern — write to a <code>.tmp</code> file, then <code>rename</code> it into place. <code>rename</code> is atomic on POSIX filesystems. The file either exists with complete content or doesn't exist at all.</p>\n<pre><code class=\"language-mermaid\">graph LR\n    subgraph Container\n        A[Agent writes response] --&gt; B[&quot;Write to .tmp file&quot;]\n        B --&gt; C[&quot;Atomic rename to .json&quot;]\n    end\n\n    subgraph Host\n        D[&quot;IPC Watcher 1s poll&quot;] --&gt; E[&quot;Read .json files&quot;]\n        E --&gt; F{Source group?}\n        F --&gt;|Main group| G[&quot;Send to any target&quot;]\n        F --&gt;|Other group| H[&quot;Send only to self&quot;]\n    end\n\n    C --&gt; D\n</code></pre>\n<p>Authorization is enforced by directory identity. Each group has its own IPC directory (<code>/data/ipc/{group}/</code>). The host knows which group wrote a file based on <em>which directory it appeared in</em>. A container can't spoof its identity because its mount configuration fixes which directories it can write to.</p>\n<p>The main group can send messages to any other group. Non-main groups can only send to themselves. 
This isn't enforced by application logic checking tokens — it's enforced by the host reading from a directory that only one container can write to.</p>\n<p>This is the same <a href=\"https://dev.to/constanta/crash-safe-json-at-scale-atomic-writes-recovery-without-a-db-3aic\">atomic write pattern</a> used in crash-safe database implementations. The key constraint — temp file and target must be on the same filesystem — is guaranteed because both live inside the same container mount.</p>\n<h2>Polling over events</h2>\n<p>The main loop polls SQLite every 2 seconds. The IPC watcher polls the filesystem every 1 second. The task scheduler polls every 60 seconds.</p>\n<p>In a system designed for one user, this eliminates entire categories of race conditions. No WebSocket connection management, no event ordering guarantees to maintain, no callback hell. Just: check for new stuff, process it, sleep, repeat.</p>\n<p>I've been on teams that reached for event-driven architectures way too early. At Xero we shipped multiple times a day — but the internal tooling that supported that? Half of it would've been simpler with a poll loop. When you know your scale ceiling, polling is a feature, not a limitation.</p>\n<h2>Recompilation over plugins</h2>\n<p>Each container recompiles TypeScript on startup. The agent-runner source is mounted from a per-group directory, so each group can have customised agent behaviour. The compiled output goes to <code>/tmp/dist</code> which is then made read-only — the agent can't modify its own runner.</p>\n<p>No plugin registry. No abstraction layer. No dependency injection. You want different behaviour per group? Change the source files. The container picks it up on next startup.</p>\n<p>This is slower than a plugin system, sure. Container startup pays a compilation tax. But it eliminates an entire layer of indirection — and that layer is usually where the bugs live.</p>\n<h2>What ties it all together</h2>\n<p>These aren't random choices. 
They're all expressions of the same principle: <strong>know your constraints, and use them to delete complexity</strong>.</p>\n<p>NanoClaw is a single-user system. It knows this. So it polls instead of pushing, uses the filesystem instead of a message queue, and isolates with containers instead of writing authorization middleware.</p>\n<p>The interesting question isn't whether these patterns work for NanoClaw — they obviously do. The question is: where do these same constraints exist in <em>your</em> stack? Because they're more common than you think.</p>\n<p>That's what <a href=\"https://jonno.nz/posts/stealing-nanoclaw-patterns-for-webapps-and-saas/\">Part 2</a> is about.</p>\n","date_published":"Mon, 23 Mar 2026 00:00:00 GMT","image":"https://jonno.nz/og/nanoclaw-architecture-masterclass-in-doing-less.png"},{"id":"https://jonno.nz/posts/careers-in-the-self-driving-codebase-era/","url":"https://jonno.nz/posts/careers-in-the-self-driving-codebase-era/","title":"Careers in the Self-Driving Codebase Era","content_html":"<p>The engineering career ladder made perfect sense when code was expensive to produce. A junior learned syntax and patterns. A mid-level learned how to build features end-to-end. A senior learned architecture and trade-offs. A staff engineer learned strategy and influence. Each level was gated by coding ability — your value was measured by how much complexity you could wrangle into working software.</p>\n<p>Then <a href=\"https://cursor.com/blog/self-driving-codebases\">Cursor published research</a> showing their agents producing 1,000 commits per hour across 10 million tool calls. Not toy commits on a demo project. Real changes, on a real codebase, with minimal human intervention.</p>\n<p>Code is now essentially free. Not literally free — you're still paying for Cursor subscriptions and inference budgets — but compared to what a team of engineers used to cost to produce the same output? It's not even close. 
What does the career ladder measure when anyone can ship code?</p>\n<h2>The old matrix was fit for purpose</h2>\n<p>The junior → mid → senior → staff progression assumed three things: coding ability is the scarce resource, learning takes years of hands-on practice, and each level unlocks access to more complex work.</p>\n<p>If you wanted complex software built in 2015, you needed people who'd spent years developing the muscle memory. Understanding a large codebase took months. Writing clean, maintainable code took practice. Debugging production issues at 2am took experience.</p>\n<p>All of that is still true. But the bottleneck has shifted. The hard part isn't writing the code anymore — it's knowing what code to write, securing what gets built, and operating it at scale once it's running.</p>\n<p>Nick Durkin, Field CTO at Harness, <a href=\"https://medium.com/generative-ai-revolution-ai-native-transformation/2026-is-the-year-of-agentic-engineering-the-ai-skills-gap-enterprises-cant-ignore-346e07a7a50d\">put it bluntly</a>: &quot;By 2026, every engineer effectively becomes an engineering manager. Not of people, but of AI agents.&quot;</p>\n<h2>What's actually valuable now</h2>\n<p>I've spent the last few years moving between roles — Head of Engineering at Cotiss, now Principal Engineer at Prosaic. And the thing I've noticed is that my value has shifted dramatically away from writing code.</p>\n<p>When code production is commoditised, value moves to everything <em>around</em> the code:</p>\n<p><strong>Securing it.</strong> Agent-generated code ships fast. Who's checking it for vulnerabilities? Who understands the attack surface? Security isn't a nice-to-have anymore — it's the difference between shipping confidently and shipping scared.</p>\n<p><strong>Architecting it.</strong> Agents can implement whatever you specify. But someone still needs to understand the business requirements deeply enough to know <em>what</em> to specify. What are the trade-offs? 
What will scale? What won't? That judgement doesn't come from prompts.</p>\n<p><strong>Operating it.</strong> The code ships. Now what? Who's on-call? Who understands the failure modes? Who can debug a production incident when the agent that wrote the code is long gone?</p>\n<p><strong>Orchestrating it.</strong> This is the new skill. Understanding business context, product requirements, and technical constraints well enough to specify clearly what agents should build — and then validating they built the right thing.</p>\n<p>Cursor's research called this &quot;taste, judgement, and direction.&quot; I'd call it something simpler: understanding the business deeply enough to give good instructions.</p>\n<h2>The new hybrid roles</h2>\n<p>Here's my take on where this goes. The future isn't junior → senior. It's hybrid roles that span disciplines.</p>\n<p><strong>Product-engineer.</strong> Someone who understands user outcomes and technical constraints. They don't just build what's in the ticket — they understand <em>why</em> it's in the ticket and whether it's the right thing to build.</p>\n<p><strong>Security-architect.</strong> Someone who can secure agent-generated code at scale. Not just running a scanner — actually understanding the threat model and building secure systems.</p>\n<p><strong>Business-technologist.</strong> Someone who translates strategy into specifications. They sit between the CEO's vision and the agent's implementation, making sure nothing gets lost.</p>\n<p>These aren't new roles invented from scratch. They're the natural evolution of what senior engineers have always done — but with the coding part increasingly handled by agents.</p>\n<p>The key insight: you need to understand more of the business deeply to stay valuable. Security. Org structure. Go-to-market. Operations. The more context you hold, the better your specifications, the better your agent-driven outcomes.</p>\n<h2>Entry level evolves</h2>\n<p>What happens to junior engineers? 
<a href=\"https://www.cnbc.com/2025/09/07/ai-entry-level-jobs-hiring-careers.html\">Stanford's research</a> is sobering — early-career workers aged 22-25 in AI-exposed jobs saw a 16% employment decline, while workers aged 35-49 grew by 8%.</p>\n<p>But I don't think entry-level disappears. It transforms.</p>\n<p>The baseline shifts from &quot;can you code?&quot; to &quot;can you specify, verify, and operate?&quot; Entry-level becomes prompt engineer plus systems thinker. <a href=\"https://spectrum.ieee.org/ai-effect-entry-level-jobs\">IEEE Spectrum's reporting</a> quotes a hiring manager saying you need to &quot;slot in at a higher level almost from day one.&quot;</p>\n<p>That's brutal if you're expecting the old training path — start with simple tickets, learn the codebase over six months, gradually take on harder work. That path assumed coding reps were the training ground.</p>\n<p>But it's also an opportunity. If you can think in systems, understand business context, and validate outcomes — you can contribute faster than any previous generation of engineers. The barrier to producing working software has never been lower. The barrier to producing <em>the right</em> working software is higher than ever.</p>\n<h2>Organisational agility as the competitive advantage</h2>\n<p>Here's what's been rattling around my head: software engineers used to be the bottleneck.</p>\n<p>Want a new feature? Wait for engineering capacity. Want to test a hypothesis? Wait for engineering capacity. Want to pivot the product? Wait for engineering capacity.</p>\n<p>Now? Agents can ship fast. The bottleneck moves elsewhere.</p>\n<p>The question isn't &quot;how fast can we code this?&quot; anymore. It's &quot;how fast can the organisation decide what to build and validate it works?&quot;</p>\n<p>Engineers who understand business context — who can move between product, operations, security, and strategy — can now capitalise on that agility instead of being constrained by it. 
The organisations that win will be the ones where engineers aren't waiting for briefs. They're helping shape them.</p>\n<h2>A caveat — this is about SaaS, not science</h2>\n<p>Everything I've said applies to the world I know — traditional SaaS, product engineering, building applications. The stuff most of us do day to day.</p>\n<p>If you're doing deep tech — building compilers, writing ML frameworks, working on kernel internals, pushing the boundaries of computer science — the calculus is different. That work still requires deep, specialised knowledge that agents can't replace. The science matters.</p>\n<p>But here's the thing. Most of us aren't doing that. Most of us are standing on the shoulders of giants — using frameworks, libraries, and platforms that brilliant people built — and assembling them into products that solve business problems. That's not a knock. That's the whole point of good abstraction. It means there's a clear path even in this era: the scientists keep pushing the frontier, and the rest of us get radically better tools to build on top of it.</p>\n<p>The engineers who understand that distinction — who know when they're doing science and when they're doing integration — will navigate this shift much better than the ones who think writing CRUD endpoints is a craft that can't be automated.</p>\n<h2>The opportunity</h2>\n<p>I started my career in retail at Spark NZ. Moved into support at Vend. Learned to code on the job. Became an engineer, then a lead, then a manager, then a Head of Engineering, then a founder, and now a principal engineer, hands-on with AI. That path made sense because each step built on coding ability.</p>\n<p>The next generation's path looks different. It's less about climbing a ladder and more about expanding across disciplines. Less about depth in code and more about breadth in context.</p>\n<p>The ladder isn't dead. But the rungs have changed. 
And if you're still climbing the old one, you might be wondering why the top doesn't look like what you expected.</p>\n<p>The good news? This is an opportunity for engineers who want to expand beyond code. The people who've been frustrated that &quot;just write the code&quot; never seemed to capture the full value they could bring — this is your moment. The organisations that figure out how to deploy these people effectively will move faster than anyone thought possible.</p>\n<p>The ones that keep hiring for the old ladder? They'll keep being confused about why their engineering investment isn't translating into outcomes.</p>\n","date_published":"Sat, 21 Mar 2026 00:00:00 GMT","image":"https://jonno.nz/og/careers-in-the-self-driving-codebase-era.png"},{"id":"https://jonno.nz/posts/why-llms-waste-half-their-brain/","url":"https://jonno.nz/posts/why-llms-waste-half-their-brain/","title":"Why LLMs waste half their brain","content_html":"<p>You know the pattern. You give Claude Code a task — &quot;add pagination to the invoices endpoint&quot; — and before it writes a single line of code, it does this:</p>\n<p>Glob for <code>*invoice*</code>. 47 results. Read 5 files to understand the structure. Grep for <code>fetchInvoices</code>. 12 results. Read 4 more. Grep for existing pagination patterns. 8 results. Read 3 more. Then, finally, it starts coding.</p>\n<p>That's 12+ files read across 10+ tool calls, burning 60-100k tokens on exploration alone. And that's the good outcome — sometimes it misses a file entirely and heads off in a direction that makes zero sense.</p>\n<p>I've been using Claude Code daily on production codebases at <a href=\"https://prosaic.works/\">Prosaic</a> and this pattern drives me nuts. The agent is genuinely good at writing code. 
It's just spending way too much of its context window figuring out <em>where</em> things are.</p>\n<h2>The insight</h2>\n<p>I saw <a href=\"https://news.ycombinator.com/item?id=43367518\">jeremychone's approach on Hacker News</a> and it immediately clicked. The idea is dead simple: code map with a cheap model, auto-context with a cheap model, then code with the big model.</p>\n<p>The key bit — you index every file in your repo with a one-line summary using something cheap and fast like Haiku. That's your code map. Then when you have a task, you send those summaries (not the source code — just the summaries) to the cheap model and ask it to pick the files that matter. It returns 5-10 file paths. You feed the full source of those files to your expensive model.</p>\n<p>381 candidate files compressed down to 5. 1.62 MB down to 27 KB. That's a 98% reduction in context — and the selection is intelligent, not keyword matching.</p>\n<p>Higher precision on the input leads to higher precision on the output.</p>\n<pre><code class=\"language-mermaid\">graph LR\n    A[&quot;Your repo&lt;br/&gt;2688 files&quot;] --&gt;|codemap build| B[&quot;Per-file index&lt;br/&gt;summary, types,&lt;br/&gt;functions, imports&quot;]\n    B --&gt;|codemap select| C[&quot;Cheap model picks&lt;br/&gt;381 → 5 files&quot;]\n    C --&gt; D[&quot;Agent gets&lt;br/&gt;focused context&lt;br/&gt;30-80k tokens&quot;]\n</code></pre>\n<h2>What I built</h2>\n<p><a href=\"https://github.com/jonnonz1/codemap\">Codemap</a> is a Go CLI that implements this pipeline. Two commands do most of the work.</p>\n<p><strong><code>codemap build</code></strong> indexes your repo. 
For each file it generates a structured entry:</p>\n<ul>\n<li><code>summary</code> — one-sentence description (from the LLM)</li>\n<li><code>when_to_use</code> — when a developer would need this file (from the LLM)</li>\n<li><code>public_types</code> and <code>public_functions</code> — exported symbols (from the Go AST parser)</li>\n<li><code>imports</code> — dependency list (from the parser)</li>\n<li><code>keywords</code> — domain terms (from the LLM)</li>\n</ul>\n<p>Deterministic facts come from static analysis. Semantic fields come from the cheap model. The index is cached as JSON and keyed on mtime + BLAKE3 hash — so subsequent builds only re-index files that actually changed. First run on a 2700-file repo takes a few minutes. After that, it's seconds.</p>\n<p>Cost for a full index? About $2-3 with Haiku across 2700 files. Pennies for incremental rebuilds.</p>\n<p><strong><code>codemap select</code></strong> is where the magic happens. Given a task description, it loads the code map (summaries only — small), sends them to the cheap model with your task, and the LLM picks the 5-10 files that are actually relevant. Then it reads the full source of those files and hands everything back.</p>\n<p>This isn't keyword overlap or TF-IDF scoring. The LLM reads all the summaries, understands what &quot;add pagination to invoices&quot; actually requires, and returns the files you need — the handler, the repository, the existing pagination helper, the relevant tests. One call, done.</p>\n<h2>Claude Code integration</h2>\n<p>Codemap runs as an MCP server over stdio. You register it once:</p>\n<pre><code class=\"language-bash\">claude mcp add codemap -- codemap mcp\n</code></pre>\n<p>Claude Code gets three tools:</p>\n<ul>\n<li><code>codemap_select</code> — the main one. 
Claude calls this with the task description, gets back full source of selected files, starts coding immediately.</li>\n<li><code>codemap_status</code> — quick check on whether the index is fresh or stale.</li>\n<li><code>codemap_build</code> — triggers an incremental rebuild if needed.</li>\n</ul>\n<p>The workflow becomes: Claude gets a task, calls <code>codemap_select</code>, receives focused context, writes code. No glob. No grep. No wrong turns.</p>\n<h2>Measuring it properly</h2>\n<p>I'm not keen on hand-wavy claims about &quot;tokens saved&quot; or &quot;X% faster.&quot; You can't measure a counterfactual. What you can measure is whether the file selection was actually right.</p>\n<p>Codemap tracks real metrics from observed data:</p>\n<p><strong>Selection accuracy</strong> — after a session, compare the files codemap selected against the files actually modified via <code>git diff --name-only</code>. That gives you precision (how many selected files were actually needed) and recall (how many changed files were pre-selected).</p>\n<p><strong>Exploration overhead</strong> — a PostToolUse hook logs every Read, Glob, and Grep call Claude makes after receiving codemap context. If Claude needed to go exploring beyond what codemap gave it, that shows up as overhead. Trending toward zero means the selection is doing its job.</p>\n<p><strong>Context compression</strong> — candidates vs selected, measured in bytes. Both numbers are real.</p>\n<p>Here's what a typical stats output looks like:</p>\n<pre><code>Selection Accuracy (last 10 sessions)\n  Avg hit rate:          82%\n  Avg precision:         65%\n  Avg compression:       97%\n\nExploration Overhead (last 10 sessions)\n  Avg extra Read calls:  1.8\n  Avg total Read calls:  6.2\n  Overhead ratio:        29%\n</code></pre>\n<p>No estimates. No counterfactuals. 
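</p>\n<p>The arithmetic behind those numbers is just set overlap between what codemap selected and what git says changed. A sketch (file names hypothetical; the real CLI reads <code>git diff --name-only</code> itself):</p>\n<pre><code class=\"language-python\">def selection_accuracy(selected, changed):\n    selected, changed = set(selected), set(changed)\n    hits = selected.intersection(changed)\n    precision = len(hits) / len(selected)  # selected files actually needed\n    recall = len(hits) / len(changed)      # changed files that were pre-selected\n    return precision, recall\n\nselected = ['handler.go', 'repo.go', 'paginate.go', 'invoice_test.go']\nchanged = ['handler.go', 'repo.go', 'invoice_test.go']\nprint(selection_accuracy(selected, changed))  # (0.75, 1.0)\n</code></pre>\n<p>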
Just: did codemap pick the right files, and did Claude need to look elsewhere?</p>\n<h2>The bigger picture</h2>\n<p>The pattern here is simple but powerful — use cheap, fast models for the boring structural work so the expensive model can focus its entire context window on actually solving the problem. Haiku is bloody good at reading 2700 one-line summaries and picking the 5 that matter. That's not a task that needs Opus.</p>\n<p>Codemap is open source, written in Go, and supports Anthropic, OpenAI, and Google as LLM providers. If you're using Claude Code on anything bigger than a toy project, <a href=\"https://github.com/jonnonz1/codemap\">give it a go</a>.</p>\n","date_published":"Wed, 18 Mar 2026 00:00:00 GMT","image":"https://jonno.nz/og/why-llms-waste-half-their-brain.png"},{"id":"https://jonno.nz/posts/what-if-a-worm-could-make-ai-agents-smarter/","url":"https://jonno.nz/posts/what-if-a-worm-could-make-ai-agents-smarter/","title":"What if a Worm Could Make AI Agents Smarter?","content_html":"<p>AI coding agents are genuinely impressive right now. You give Claude or GPT a task, it reads your code, writes a fix, runs the tests. Awesome. But here's the thing that's been bugging me — most practical agents still rely on largely fixed control policies. Same temperature, same approach, same level of caution whether they just broke the build or nailed it first try. They're locally reactive, not deeply adaptive. It's like having a developer who responds to each step but never changes their underlying strategy based on what just happened.</p>\n<p>That felt like a problem worth poking at. And the answer I landed on came from a pretty unexpected place — a roundworm with 302 neurons.</p>\n<h2>The static agent problem</h2>\n<p>If you've used AI coding agents for any real work, you've probably noticed this. 
Yes, modern agents can maintain state, use memory, and even reflect on previous attempts — but their <em>behavioural policy</em> for how they react usually doesn't change unless a human rewrites the prompt or the harness. The agent doesn't get more cautious after breaking something. It doesn't get more exploratory when it's stuck. It just keeps doing the same thing with the same settings, tick after tick.</p>\n<p>I've spent years managing engineering teams — at Vend, at Xero, at Cotiss — and the best engineers constantly adjust their approach. They slow down when things are fragile. They get creative when they're blocked. They know when to stop. Current AI agents don't do any of that.</p>\n<p>So the question became: what if you could build a control layer that sits <em>outside</em> the AI and adjusts its behaviour in real time based on how things are going?</p>\n<h2>Enter c302</h2>\n<p>c302 is a research project I've been building. The name comes from the <a href=\"https://openworm.org/science.html\">c302 model</a> in the OpenWorm project — an open-source computational model of the <em>Caenorhabditis elegans</em> roundworm's nervous system.</p>\n<p>C. elegans is a fascinating little creature. It's got exactly 302 neurons and roughly 7,000 synaptic connections, and it's the only organism where we've mapped the <em>entire</em> nervous system. Every neuron, every connection. That's it — that's the whole brain. And with those 302 neurons, it navigates its environment, finds food, avoids danger, and adapts its behaviour based on what's working and what isn't.</p>\n<p>My project uses those neural dynamics as a behavioural controller for an LLM coding agent. The worm doesn't write code — it doesn't see code, prompts, or any LLM output at all. It sees five simple signals about what just happened (did tests pass? did the build break? how big was the change?) 
and emits seven parameters that constrain how the AI behaves on the next step — things like mode (diagnose, search, edit, test), temperature, token budget, and how aggressive the edits should be.</p>\n<p>It's a closed loop. The AI acts, the outcomes get observed, a reward signal feeds back into the controller, and the controller's internal state shifts. Then the next set of behavioural parameters reflects that changed state.</p>\n<h2>Why a worm?</h2>\n<p>This sounds ridiculous, I know. But there's a serious idea underneath it.</p>\n<p>C. elegans has had roughly a billion years of evolution to solve what's fundamentally a control problem — how do you steer behaviour adaptively with minimal hardware? Its nervous system doesn't have the luxury of scale. It can't throw more neurons at the problem. Instead, it's evolved incredibly efficient circuits for things like balancing exploration and exploitation, persisting on a strategy when it's working, and backing off when it's not.</p>\n<p>Those are <em>exactly</em> the problems a coding agent faces. When should it keep editing? When should it stop and run tests? When should it try a completely different approach? When should it just stop?</p>\n<p>The bet is that the control architecture biology evolved for navigating a physical environment might generalise to navigating a software engineering task. Not because worms and code have anything in common — but because the <em>control problem</em> is structurally similar.</p>\n<h2>How this differs from other C. elegans + AI work</h2>\n<p>There's actually a growing body of research connecting C. elegans neuroscience to artificial intelligence, and I want to be upfront about where my work sits relative to it — because the approach is quite different.</p>\n<p>Most existing work uses the connectome to <em>build</em> AI. 
The <a href=\"https://www.sciencedirect.com/science/article/pii/S0925231224003692\">Elegans-AI project</a> used the connectome's topology to design neural network architectures for image classification. <a href=\"https://www.nature.com/articles/s43588-024-00738-w\">BAAIWorm</a>, published in Nature Computational Science in 2024, built an incredible closed-loop simulation of the worm's brain, body, and physical environment — the worm navigating through a 3D fluid simulation. A recent Frontiers paper on <a href=\"https://www.frontiersin.org/journals/neural-circuits/articles/10.3389/fncir.2026.1731513/full\">translating C. elegans circuits into AI</a> reviews the whole movement, and their framing captures it well: extracting functional principles from the connectome to design better neural network architectures.</p>\n<p>All of that work asks: <em>can the worm's brain make a better AI model?</em></p>\n<p>My project asks something different: <em>can the worm's brain control an existing AI model?</em></p>\n<p>The LLM stays completely untouched. Claude Sonnet does all the reasoning, all the code generation, all the problem-solving. The connectome-derived controller sits outside it and modulates <em>how</em> it behaves — not <em>what</em> it knows. It's a control layer, not a model architecture.</p>\n<p>BAAIWorm is probably the closest conceptual parallel. They've got the same closed-loop structure — brain modulates behaviour, environment provides feedback. But their environment is physics (fluid dynamics, muscle actuation, chemotaxis). Mine is a git repository with failing tests.</p>\n<h2>A disclaimer about who I am</h2>\n<p>I should be really clear about something: I'm not a neuroscientist. I'm not a research scientist of any kind. I'm a software engineer who's spent the last fifteen years building products and leading engineering teams in the NZ tech scene.</p>\n<p>This is hobbyist research. Citizen science, if you want a fancier term for it. 
I had a question that genuinely fascinated me, I had access to open-source tools like the OpenWorm simulation models, and I had evenings and weekends. The <a href=\"https://www.nature.com/articles/d41586-024-03182-y\">FlyWire connectome project</a> showed that amateur contributors can make real contributions to neuroscience — they credited volunteer researchers as authors on peer-reviewed papers. That gave me confidence that you don't need a PhD to ask interesting questions.</p>\n<p>I'm going to share my findings honestly — including every limitation, every thing I got wrong, and every place where the data doesn't support the conclusion I wanted. I've run 167 experiments across three difficulty levels. The experimental pipeline is solid. But the sample sizes are small, and most individual comparisons aren't statistically significant at p&lt;0.05. I'll be upfront about all of that.</p>\n<h2>What's coming in this series</h2>\n<p>This post is the first in a series where I'll walk through the entire c302 research project — what I built, what I found, and what it might mean.</p>\n<p>The series will cover the experimental results across three difficulty levels — from a trivial single-function task where every controller succeeds, through multi-file coordination where differences start appearing, to regression recovery where the adaptive controller's error detection actually matters. I'll share the actual data, the traces, the cost breakdowns, and the honest limitations.</p>\n<p>After that, the really interesting bit — Phase 2, where the hand-tuned synthetic controller gets replaced with actual connectome-derived neural dynamics from the c302 simulation. That's where the neuroscience question gets answered: does biological provenance actually add something that hand-tuning can't replicate?</p>\n<p>The tools are all open source. 
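</p>\n<p>To make the closed loop concrete before the series digs in, here's a toy version of the observe-and-modulate cycle. Everything in it is hypothetical and hand-rolled: two signals instead of five, two parameters instead of seven, and none of the connectome dynamics that make Phase 2 interesting.</p>\n<pre><code class=\"language-python\">class ToyController:\n    # Stand-in for the behavioural controller: it never sees code or\n    # prompts, only outcome signals, and it only emits parameters.\n    def __init__(self):\n        self.caution = 0.5  # internal state, shifted by reward\n\n    def observe(self, tests_passed, build_broke):\n        reward = (1 if tests_passed else -1) + (-2 if build_broke else 0)\n        # Negative reward pushes the internal state toward caution.\n        self.caution = min(1.0, max(0.0, self.caution - 0.1 * reward))\n\n    def parameters(self):\n        return {\n            'mode': 'diagnose' if self.caution > 0.7 else 'edit',\n            'temperature': round(1.0 - 0.8 * self.caution, 2),\n        }\n\nc = ToyController()\nc.observe(tests_passed=False, build_broke=True)  # a bad step...\nc.observe(tests_passed=False, build_broke=True)  # ...and another\nprint(c.parameters())  # {'mode': 'diagnose', 'temperature': 0.2}\n</code></pre>\n<p>The real controller's state is a simulated nervous system rather than one float, but the contract is the same: outcomes in, behaviour parameters out.</p>\n<p>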
The question — whether a billion years of evolved neural architecture can improve how we control AI systems — is genuinely one of the most interesting things I've worked on.</p>\n","date_published":"Wed, 18 Mar 2026 00:00:00 GMT","image":"https://jonno.nz/og/what-if-a-worm-could-make-ai-agents-smarter.png"}]}