What Happens When a Worm Drives Claude?

A few weeks ago I was on the couch with my eight-year-old, and she was going on about worms. Kids ask the kind of questions we forget to ask as adults, the ones that drop off once we get all mature and sensible. That conversation is the reason a worm ended up driving Claude.

Right now everyone builds LLM agents the same way. Take a model, bolt more tools onto it, hand it a bigger toolbox. I wanted to try the opposite. Leave the model alone, and put a brain inside the loop to do the steering.

I'll say this up front: it's a fun project, not a serious one. But the thing it does is genuinely strange, so it's worth a watch.

(Can't see the embed? Watch it on YouTube.) All the code is on GitHub if you'd rather pull it apart yourself.

The setup is a car

Think of it as a car. Claude (Sonnet 4) is the engine. A big engine with nobody at the wheel. It has five tools: read, grep, write, bash, and run the tests. That's the lot.

The driver is a small controller program. Every tick it looks at what Claude just did and decides what Claude does next. It never writes a line of code itself. It just shapes the behaviour, the way you'd drive a car without ever touching the pistons.

And the driver doesn't read the code either. It reads the dashboard. Five numbers, every tick, and only these five: how many tests are failing, the share passing, how many files Claude has touched, how many ticks have gone by, and the last tool it used. That's the whole view of the world.
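If it helps to see it, here is a minimal sketch of that dashboard in Python. The class name, field names, the `run_tests` tool label and the `read_dashboard` helper are all made up for illustration; they are not lifted from the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dashboard:
    """The five readings the driver sees each tick (hypothetical names)."""
    failing_tests: int   # how many tests are failing
    pass_rate: float     # share of tests passing, 0.0 to 1.0
    files_touched: int   # how many files Claude has touched so far
    tick: int            # how many ticks have gone by
    last_tool: str       # e.g. "read", "grep", "write", "bash", "run_tests"


def read_dashboard(passed: int, failed: int, files_touched: int,
                   tick: int, last_tool: str) -> Dashboard:
    """Build one reading from raw test counts. The driver sees nothing else."""
    total = passed + failed
    pass_rate = passed / total if total else 0.0  # avoid dividing by zero
    return Dashboard(failed, pass_rate, files_touched, tick, last_tool)


reading = read_dashboard(passed=9, failed=1, files_touched=3,
                         tick=4, last_tool="grep")
print(reading)
```

The point of keeping it this small is that the driver has no way to peek at the code. Everything it knows has to fit in those five fields.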

Out the other side there are seven knobs. The big one is the gear: diagnose, edit, test, or stop. Then a handful more, like how much risk to take, how long to think, how widely to read, how hard to commit, and how done it reckons it is. Five numbers in, seven knobs out.
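Roughly, the output side and the tick loop could look like the sketch below. Treat every name in it as an assumption. It only includes the six knobs named above and leaves the seventh out rather than guessing at it. The toy controller and toy agent are stand-ins so the loop runs; they are not the worm and not Claude.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Gear(Enum):
    DIAGNOSE = "diagnose"
    EDIT = "edit"
    TEST = "test"
    STOP = "stop"


@dataclass
class Knobs:
    """Driver output for one tick (hypothetical field names)."""
    gear: Gear
    risk: float          # how much risk to take
    think_time: float    # how long to think
    read_breadth: float  # how widely to read
    commitment: float    # how hard to commit
    doneness: float      # how done the driver reckons the job is


def drive(controller: Callable[[dict], Knobs],
          step_agent: Callable[[Knobs], dict],
          first_reading: dict,
          max_ticks: int = 50) -> int:
    """Tick loop: reading in, knobs out, until the gear is STOP."""
    reading = first_reading
    for tick in range(max_ticks):
        knobs = controller(reading)
        if knobs.gear is Gear.STOP:
            return tick
        reading = step_agent(knobs)  # the agent acts, then reports back
    return max_ticks


# Toy stand-ins so the loop can run end to end.
def toy_controller(reading: dict) -> Knobs:
    gear = Gear.STOP if reading["failing"] == 0 else Gear.EDIT
    return Knobs(gear, 0.5, 0.5, 0.5, 0.5, 0.0)


state = {"failing": 3}


def toy_agent(knobs: Knobs) -> dict:
    state["failing"] -= 1  # pretend every edit fixes exactly one test
    return dict(state)


ticks = drive(toy_controller, toy_agent, dict(state))
print(ticks)
```

Notice that `drive` never touches the code under repair. It passes knobs one way and readings the other, which is all the driver in this project does.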

So who turns the knobs?

The worm.

It's C. elegans, a roundworm about a millimetre long. It has 302 neurons, and scientists mapped every single connection between them back in 1986 (the year I was born, which I'm choosing not to read anything into). It's one of the few nervous systems whose wiring has been mapped from end to end.

I took 14 of those neurons and ran them live in a simulation. Four of them do the heavy lifting. One is a salt sensor, and to a worm salt means food, so I wired it to reward. Its opposite I wired to bad results. One drives the worm forward, so I wired it to how much work is left. One throws it into reverse, so I wired it to errors.

I didn't make any of that wiring up. It's the published connectome. All I did was connect the worm's senses to the agent's situation and let it run.
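The part I did write is the translation from the agent's situation into sensory drive. A sketch of what that layer might look like is below. The function name, the scaling, and the choice to treat "errors" as a simple did-the-last-run-error flag are my assumptions for illustration, and the sketch stops at the sensory inputs. It does not model the connectome itself.

```python
from dataclasses import dataclass


@dataclass
class SensoryInput:
    """Drive for the four neurons doing the heavy lifting (roles from the text)."""
    reward: float   # salt sensor: things got better
    punish: float   # its opposite: things got worse
    forward: float  # forward drive: how much work is left
    reverse: float  # reversal drive: errors


def clamp01(x: float) -> float:
    """Keep a value inside the range 0.0 to 1.0."""
    return max(0.0, min(1.0, x))


def sense(prev_pass_rate: float, pass_rate: float,
          last_run_errored: bool) -> SensoryInput:
    """Translate the agent's situation into sensory drive (scaling is a guess)."""
    delta = pass_rate - prev_pass_rate
    return SensoryInput(
        reward=clamp01(delta),              # pass rate went up
        punish=clamp01(-delta),             # pass rate went down
        forward=clamp01(1.0 - pass_rate),   # share of tests still failing
        reverse=1.0 if last_run_errored else 0.0,
    )


# A regression: the pass rate fell from 0.8 to 0.6 and the last run errored.
print(sense(prev_pass_rate=0.8, pass_rate=0.6, last_run_errored=True))
```

Everything downstream of this function belongs to the worm. The only design decision on my side is which number feeds which sense.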

(If you want the longer argument for why a worm of all things, I made the case when I first started this.)

One run, 14 ticks, 45 seconds

Does it crash, or does it do something useful?

It starts at baseline with one failing test. Claude opens a few files, runs a search. The tests are failing, so the worm leans on the accelerator. I never wrote a rule that says "press the accelerator when there's work to do". It falls straight out of the wiring.

Then Claude runs the tests and gets a real failure. The error neuron lights up, the worm changes gear, and Claude makes its first proper edit. Risk drops, it reads less, it commits harder. Again, no rule for that combination. The error signal and the work-left signal crossed a threshold together, and the worm turned that into "stop wandering, make a change". It found a scent.

A test goes green. Then another. The worm is cooking. If the run ended there I'd have gone home happy.

It didn't. The second edit was sloppy and broke something else. The pass rate drops, the errors jump, and the punish neuron (my favourite, for the record) lights up hot.

This is the moment I cared about. The obvious move is to panic, drop back to square one, and start over. The worm doesn't. It stays in gear, eases off, gets careful, reads before it edits. Two ticks later the careful edit lands and the pass rate climbs back higher than it was before the break.

It recovered without restarting, which is exactly what a good pair programmer would do: look at the diff, don't burn the house down and start fresh. And nothing in the wiring connects a regression to a change of gear. The punish signal pulls one knob down. It doesn't yank the worm out of edit mode, because that's a different knob entirely.

By tick 14 the last test goes green, ten out of ten, and the worm stops and parks the car. Not because I told it to stop at ten, but because there was nothing left pulling it forward.
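To be clear, the real controller contains no rules like the ones below. The behaviour comes out of the simulated neurons. This is only a hand-written caricature of the outcome, with made-up thresholds and names, to show why it matters that the gear and the risk knob are separate outputs.

```python
from typing import Tuple


def readout(forward: float, reverse: float, punish: float,
            gear: str, risk: float,
            threshold: float = 0.5, rest: float = 0.05) -> Tuple[str, float]:
    """Return (gear, risk) for the next tick. All numbers are illustrative."""
    # Punishment acts on the risk knob only. It never touches the gear.
    risk = max(0.0, risk - punish)

    if forward < rest:
        gear = "stop"  # nothing left pulling forward, so park
    elif gear == "diagnose" and forward > threshold and reverse > threshold:
        gear = "edit"  # work left plus a real error: stop wandering, make a change
    return gear, risk


# First real failure: both signals are high, so the gear changes.
first_failure = readout(forward=0.8, reverse=0.9, punish=0.0,
                        gear="diagnose", risk=0.6)

# Regression mid-edit: risk drops, but the gear stays where it was.
regression = readout(forward=0.3, reverse=0.7, punish=0.4,
                     gear="edit", risk=0.6)

# All green: no forward drive left, so it stops.
all_green = readout(forward=0.0, reverse=0.0, punish=0.0,
                    gear="edit", risk=0.2)

print(first_failure, regression, all_green)
```

If punishment fed the gear instead of the risk knob, the regression case would drop back to diagnose and the run would start over. Keeping the two outputs apart is what lets a setback produce caution rather than a restart.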

Same brain, one of them alive

So I ran it properly. About 200 times. The live worm averaged a 0.96 pass rate.

Then I ran the exact same connectome, same wiring, same neurons, but pre-recorded, like playing a tape of the worm's brain instead of letting it react. That version averaged 0.87, and the live worm beat it every single time. Same brain. The only difference is one of them was alive and getting feedback, and the other was just replaying.
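The difference between the two conditions is easier to see in a toy. The sketch below is not the experiment and says nothing about the 0.96 or 0.87 figures. The controller, the two environments and the tape are all invented for illustration. It only shows the structural difference between reacting to the current reading and playing back a recording.

```python
from typing import Callable, List

EnvStep = Callable[[int, str, int], int]  # (failing, action, tick) -> failing


def run_live(controller: Callable[[int], str], env_step: EnvStep,
             failing: int, max_ticks: int = 20) -> int:
    """Closed loop: the controller sees the current reading every tick."""
    for tick in range(max_ticks):
        action = controller(failing)
        if action == "stop":
            break
        failing = env_step(failing, action, tick)
    return failing


def run_replay(tape: List[str], env_step: EnvStep, failing: int) -> int:
    """Open loop: play back recorded actions, ignoring what actually happens."""
    for tick, action in enumerate(tape):
        if action == "stop":
            break
        failing = env_step(failing, action, tick)
    return failing


def controller(failing: int) -> str:
    return "edit" if failing > 0 else "stop"


def bumpy_env(failing: int, action: str, tick: int) -> int:
    # Every edit fixes one test, except the edit at tick 1, which breaks one.
    return failing + 1 if tick == 1 else max(0, failing - 1)


# What this controller does on a run with no breakage, starting from 3 failures.
tape = ["edit", "edit", "edit", "stop"]

live_left = run_live(controller, bumpy_env, failing=3)
replay_left = run_replay(tape, bumpy_env, failing=3)
print(live_left, replay_left)
```

Both runs use the same decision logic. The replay falls short on the bumpy run only because the tape was recorded on a run where nothing broke, and it has no way to notice that this one went differently.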

Now the part I have to be honest about. I ran a colder version where Claude hadn't indexed the repo first, and plain Claude with no worm and no driver beat every controller I built, the live worm included. The worm is great at following a gradient once the ground is mapped. It cannot plan its way around a codebase it has never seen. So this isn't "worm beats Claude". It's "feedback beats no feedback", which is a smaller and far more useful claim. (The full methodology and numbers are in the paper.)

What I took from it

Give your agent a way to know when it's done. The baseline that couldn't tell kept running long after the job was finished, burning money for nothing.

Don't restart on a setback. The worm that held its nerve through the regression did most of the real work.

And feedback beats wiring. Two identical brains, and the live one won purely because it was in the loop. The data you feed an agent and the loop you close around it matter more than how clever the thing in the middle is.

The bit that isn't about a worm

I built this in about three weeks, a couple of evenings here and there, on roughly $50 of API credits. I used Claude Code as my research assistant, which means my brain was steering Claude to build a thing that lets a worm's brain steer Claude.

A few years ago a question this daft would have been a research project. Now it's a few evenings and a coffee budget. The cost of chasing a strange idea has fallen through the floor.

So chase the strange ideas. The code's on GitHub. Go on.