The first time I got a Firecracker VM to boot and respond to a vsock ping from the host, I sat there grinning like an idiot. I typed a command on my machine, and it reached through a kernel-level socket into a completely separate Linux system with its own kernel and got a reply. Under a second.
That was about 30 hours into the project. The previous 29 were mostly fighting with rootfs images and iptables rules.
Part 1 covered why I built this — Firecracker MicroVMs for running Claude Code in full-permission isolation. This post is the actual build. Rootfs, networking, the guest agent, and the streaming pipeline.
Building the rootfs
A Firecracker VM needs two things: an uncompressed Linux kernel (vmlinux, not bzImage — there's no bootloader) and an ext4 filesystem image to use as the root disk.
The kernel is straightforward — grab a prebuilt 6.1 LTS vmlinux. The rootfs took more work.
It's a standard ext4 image with Debian Bookworm, and it needs everything Claude Code might want: Node.js 24, Python 3.11, Chromium for browser automation, git, curl, jq, and the full Claude Code CLI installed globally via npm. The image ends up at about 4GB.
The guest agent — the Go binary that listens for commands from the host — lives inside the rootfs as a systemd service:
sudo mount /opt/firecracker/rootfs/base-rootfs.ext4 /mnt
sudo cp bin/agent /mnt/usr/local/bin/agent
sudo chmod +x /mnt/usr/local/bin/agent
sudo tee /mnt/etc/systemd/system/agent.service <<'EOF'
[Unit]
Description=Orchestrator Guest Agent
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/agent
Restart=always
RestartSec=1
[Install]
WantedBy=multi-user.target
EOF
sudo chroot /mnt systemctl enable agent.service
sudo umount /mnt
That RestartSec=1 matters. If the agent crashes for any reason, systemd has it back up in a second. The orchestrator polls vsock every 500ms waiting for the agent, so even a crash during boot is barely noticeable.
You build this rootfs once, by hand. Every new VM gets a sparse copy of it.
VM lifecycle
internal/vm/manager.go handles the whole lifecycle. It's sequential with cleanup at each step — if anything fails, it tears down what it already set up and returns the error.
flowchart TD
A["Copy rootfs (sparse)"] --> B["Mount & inject network config"]
B --> C["Create TAP device"]
C --> D["Add iptables rules"]
D --> E["Setup jailer chroot"]
E --> F["Write Firecracker config JSON"]
F --> G["Launch via jailer --daemonize"]
G --> H["Find PID, save metadata"]
H --> I["VM ready — poll vsock"]
The sparse copy is the first thing that happens:
cmd := exec.Command("cp", "--sparse=always", BaseRootfs, vm.RootfsPath)
--sparse=always tells cp to turn runs of zero bytes into holes, so they take up no actual disk blocks. A 4GB image might only use 2GB of real disk space. Takes under a second on NVMe.
After copying, the rootfs gets mounted and three files are injected: a systemd-networkd config with a static IP, /etc/resolv.conf for DNS, and /etc/hostname. Then it's unmounted and copied again into the jailer chroot.
Yeah, that's two copies of the rootfs per VM. The first for network injection, the second because the jailer expects everything inside its chroot. I could collapse this into one copy by injecting the network config directly into the chroot copy, but it's never been a bottleneck — sparse copy of 4GB takes less time than Firecracker takes to boot. So I left it.
The jailer
Firecracker's jailer is a separate binary that creates a chroot, sets up minimal /dev entries (kvm, net/tun, urandom), and runs the Firecracker process inside it. The VM config is a JSON file:
vmConfig := map[string]interface{}{
"boot-source": map[string]interface{}{
"kernel_image_path": "/vmlinux",
"boot_args": "console=ttyS0 reboot=k panic=1 pci=off init=/sbin/init",
},
"drives": []map[string]interface{}{{
"drive_id": "rootfs",
"path_on_host": "/rootfs.ext4",
"is_root_device": true,
"is_read_only": false,
}},
"machine-config": map[string]interface{}{
"vcpu_count": vm.VCPUs,
"mem_size_mib": vm.RamMB,
},
"network-interfaces": []map[string]interface{}{{
"iface_id": "eth0",
"guest_mac": "06:00:AC:10:00:02",
"host_dev_name": netCfg.TapDev,
}},
"vsock": map[string]interface{}{
"guest_cid": vm.VsockCID,
"uds_path": "/vsock.sock",
},
}
pci=off because Firecracker doesn't emulate PCI. Paths are relative to the jailer chroot. The vsock entry creates a Unix domain socket at /vsock.sock inside the chroot — that's how the host talks to the guest.
Launch looks like this:
cmd := exec.Command(JailerBin,
"--id", vm.JailID,
"--exec-file", FCBin,
"--uid", "0", "--gid", "0",
"--cgroup-version", "2",
"--daemonize",
"--",
"--config-file", "/vm-config.json",
)
if err := cmd.Run(); err != nil {
    return fmt.Errorf("jailer: %w", err)
}
After launch there's a 2-second sleep — Firecracker needs a moment to start — then the PID is found via pgrep and saved to a metadata file. If the orchestrator restarts, it reads these metadata files and picks up where it left off. VMs survive orchestrator crashes.
Networking
This is where I burned the most time. Not because the concepts are hard, but because of one specific bug that had me questioning reality.
Each VM needs internet access for Claude Code to fetch packages, clone repos, and hit the Anthropic API. The approach: each VM gets a Linux TAP device on the host, a dedicated /24 subnet, and iptables rules for NAT.
IP allocation
Subnets are deterministic, derived from the VM name using FNV-1a hashing:
func NetSlot(name string) int {
h := fnv.New32a()
h.Write([]byte(name))
return int(h.Sum32()%253) + 1
}
A VM named task-a3bfca80 might hash to slot 61, giving it subnet 172.16.61.0/24, guest IP 172.16.61.2, TAP IP 172.16.61.1. No coordination needed, no DHCP server, no IP pool to manage. The collision space is 253 slots — more than enough for 12-13 concurrent VMs.
TAP devices
A TAP device is a virtual ethernet interface. Firecracker attaches the guest's eth0 to it.
tap := &netlink.Tuntap{
LinkAttrs: netlink.LinkAttrs{Name: cfg.TapDev},
Mode: netlink.TUNTAP_MODE_TAP,
}
netlink.LinkAdd(tap)
addr, _ := netlink.ParseAddr(cfg.TapIP + "/24")
link, _ := netlink.LinkByName(cfg.TapDev)
netlink.AddrAdd(link, addr)
netlink.LinkSetUp(link)
TAP names are fc-<vm-name>, truncated to 15 characters because Linux interface names can't be longer. A fun constraint to discover at runtime.
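The truncation itself is one line, but it's worth putting behind a helper so it can't be forgotten. Linux's IFNAMSIZ is 16 bytes including the terminating NUL, hence 15 usable characters; `tapName` is an illustrative helper:

```go
package main

// tapName derives the host TAP device name from the VM name, capped at
// the 15-character limit Linux enforces on interface names (IFNAMSIZ).
func tapName(vmName string) string {
	name := "fc-" + vmName
	if len(name) > 15 {
		return name[:15]
	}
	return name
}
```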
The iptables rules
Three rules per VM:
// NAT — rewrite source IP when traffic exits the host
ipt.AppendUnique("nat", "POSTROUTING",
"-s", cfg.Subnet, "-o", cfg.HostIface, "-j", "MASQUERADE")
// FORWARD — allow outbound from TAP
ipt.Insert("filter", "FORWARD", 1,
"-i", cfg.TapDev, "-o", cfg.HostIface, "-j", "ACCEPT")
// FORWARD — allow established/related inbound
ipt.Insert("filter", "FORWARD", 1,
"-i", cfg.HostIface, "-o", cfg.TapDev,
"-m", "state", "--state", "RELATED,ESTABLISHED", "-j", "ACCEPT")
See those Insert calls with position 1? That's the bug fix.
The UFW bug
I originally used Append for the FORWARD rules. Traffic from the VM would leave the host fine (NAT worked), but return traffic got dropped. The VM could resolve DNS but couldn't complete TCP handshakes. I spent an embarrassing amount of time staring at tcpdump output before I figured it out.
Ubuntu's UFW adds a blanket DROP rule to the FORWARD chain. If you append your ACCEPT rules, they land after UFW's DROP. They never match. The packets hit the DROP rule first and get silently killed.
Insert at position 1 puts the rules before UFW's. Return traffic flows, VMs get internet access, everything works.
The traffic path through a working VM:
Guest (172.16.61.2) → eth0 → TAP (fc-task-xxx) → FORWARD ACCEPT
→ NAT MASQUERADE (rewrite src to host IP) → host interface → internet
→ response → RELATED,ESTABLISHED → TAP → guest eth0
VMs can't reach each other. Each TAP device is point-to-point on its own /24. There's no route between subnets.
The guest agent
cmd/agent/main.go — 420 lines of Go. It's a static binary that starts on boot, listens on vsock port 9001, and handles five request types: ping, exec, write_files, read_file, and signal.
The interesting one is streaming exec.
When the orchestrator wants to run Claude Code, it sends an exec request with stream: true. The agent spawns the command, reads stdout and stderr line by line, and sends each line back as a framed event over the vsock connection. When the process exits, it sends an exit event with the exit code.
Sounds straightforward. The tricky part is background processes.
Claude Code can start things that outlive the main command — dev servers, file watchers, whatever it decides it needs. These child processes inherit the stdout/stderr pipes. If the agent waits for the pipes to close (the normal approach), it hangs forever because the children are still holding them open.
The fix has three parts:
// 1. Process group isolation
cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
// 2. Wait for the main process, not the pipes
<-waitDone
// 3. Kill the entire process group
pgid, _ := syscall.Getpgid(cmd.Process.Pid)
syscall.Kill(-pgid, syscall.SIGTERM)
time.Sleep(500 * time.Millisecond)
syscall.Kill(-pgid, syscall.SIGKILL)
Setpgid: true puts the command in its own process group. When the main process exits, kill the group (-pgid means "everything in this group"). SIGTERM first, wait half a second, then SIGKILL for anything that didn't listen.
Even after killing the group, there's a 3-second timeout waiting for the pipe-reading goroutines to drain. If they're still stuck after that, move on and send the exit event anyway. Can't let a hung pipe block the entire task.
The line-by-line reader uses a 256KB buffer because Claude Code's --output-format stream-json can produce enormous single lines — tool results that include the full contents of files it read.
Credential injection
Before Claude Code runs, the orchestrator writes five things into the VM via vsock:
OAuth credentials from the host's ~/.claude/.credentials.json (mode 0600). A settings file that allows all tools. An environment script that sets CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS=true. Task metadata. And a marker file to create the output directory.
The prompt itself gets written to a temp file inside the VM to avoid shell escaping nightmares, then referenced in the command:
claudeArgs := fmt.Sprintf(
"claude -p \"$(cat %s)\" --output-format stream-json --verbose",
promptFile,
)
cmd := []string{"bash", "-c",
"source /etc/profile.d/claude.sh && " + claudeArgs}
When the VM is destroyed, the rootfs — containing the credentials — is deleted. Credentials only exist for the lifetime of the task.
Collecting results
After Claude Code finishes, the orchestrator searches for files it created:
// Anything in the output directory
vsock.Exec(jailID, []string{"find", outputDir, "-type", "f", "-not", "-name", ".keep"}, nil, "/root")
// Any new files under /root, created after the prompt was written
vsock.Exec(jailID, []string{"find", "/root", "-maxdepth", "2", "-type", "f",
"-newer", "/tmp/claude-prompt.txt"}, nil, "/root")
Each file gets downloaded via vsock.ReadFile and saved to /opt/firecracker/results/<task-id>/. The runner also scans the accumulated output for Claude's total_cost_usd field to record what the task cost in API credits.
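The cost scan is a line-by-line JSON probe. A sketch, assuming the final stream-json event carries a top-level total_cost_usd field (`totalCost` is an illustrative helper):

```go
package main

import (
	"bufio"
	"encoding/json"
	"strings"
)

// totalCost scans accumulated stream-json output for the last
// total_cost_usd value it can find.
func totalCost(output string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(output))
	sc.Buffer(make([]byte, 64*1024), 1024*1024) // lines can be huge
	var cost float64
	found := false
	for sc.Scan() {
		var ev struct {
			TotalCostUSD *float64 `json:"total_cost_usd"`
		}
		if json.Unmarshal(sc.Bytes(), &ev) == nil && ev.TotalCostUSD != nil {
			cost = *ev.TotalCostUSD
			found = true
		}
	}
	return cost, found
}
```

Using a pointer field distinguishes "field absent" from "cost was zero", and non-JSON lines are simply skipped.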
Then the VM is destroyed. Firecracker process killed, TAP device removed, iptables rules deleted, jailer chroot deleted, VM state directory deleted. Clean slate.
The whole cycle — boot, inject, run, collect, destroy — typically takes 30-120 seconds depending on how complex the prompt is. The 4-second boot and ~1-second teardown are rounding errors compared to the time Claude actually spends thinking.
Part 3 gets into the fun stuff — the MCP server that lets Claude delegate tasks to itself, the streaming architecture, the web dashboard, and what productionising this would actually look like.