You know the pattern. You give Claude Code a task — "add pagination to the invoices endpoint" — and before it writes a single line of code, it does this:
Glob for `*invoice*`. 47 results. Read 5 files to understand the structure. Grep for `fetchInvoices`. 12 results. Read 4 more. Grep for existing pagination patterns. 8 results. Read 3 more. Then, finally, it starts coding.
That's 12+ files read across 10+ tool calls, burning 60-100k tokens on exploration alone. And that's the good outcome — sometimes it misses a file entirely and heads off in a direction that makes zero sense.
I've been using Claude Code daily on production codebases at Prosaic and this pattern drives me nuts. The agent is genuinely good at writing code. It's just spending way too much of its context window figuring out where things are.
## The insight
I saw jeremychone's approach on Hacker News and it immediately clicked. The idea is dead simple: build a code map with a cheap model, select context with that same cheap model, then write code with the big model.
The key bit — you index every file in your repo with a one-line summary using something cheap and fast like Haiku. That's your code map. Then when you have a task, you send those summaries (not the source code — just the summaries) to the cheap model and ask it to pick the files that matter. It returns 5-10 file paths. You feed the full source of those files to your expensive model.
381 candidate files compressed down to 5. 1.62 MB down to 27 KB. That's a 98% reduction in context — and the selection is intelligent, not keyword matching.
Higher precision on the input leads to higher precision on the output.
```mermaid
graph LR
  A["Your repo<br/>2688 files"] -->|codemap build| B["Per-file index<br/>summary, types,<br/>functions, imports"]
  B -->|codemap select| C["Cheap model picks<br/>381 → 5 files"]
  C --> D["Agent gets<br/>focused context<br/>30-80k tokens"]
```
## What I built
Codemap is a Go CLI that implements this pipeline. Two commands do most of the work.
`codemap build` indexes your repo. For each file it generates a structured entry:
- `summary` — one-sentence description (from the LLM)
- `when_to_use` — when a developer would need this file (from the LLM)
- `public_types` and `public_functions` — exported symbols (from the Go AST parser)
- `imports` — dependency list (from the parser)
- `keywords` — domain terms (from the LLM)
Deterministic facts come from static analysis. Semantic fields come from the cheap model. The index is cached as JSON and keyed on mtime + BLAKE3 hash — so subsequent builds only re-index files that actually changed. First run on a 2700-file repo takes a few minutes. After that, it's seconds.
Cost for a full index? About $2-3 with Haiku across 2700 files. Pennies for incremental rebuilds.
`codemap select` is where the magic happens. Given a task description, it loads the code map (summaries only — small), sends them to the cheap model with your task, and the LLM picks the 5-10 files that are actually relevant. Then it reads the full source of those files and hands everything back.
This isn't keyword overlap or TF-IDF scoring. The LLM reads all the summaries, understands what "add pagination to invoices" actually requires, and returns the files you need — the handler, the repository, the existing pagination helper, the relevant tests. One call, done.
## Claude Code integration
Codemap runs as an MCP server over stdio. You register it once:
```shell
claude mcp add codemap -- codemap mcp
```
Claude Code gets three tools:
- `codemap_select` — the main one. Claude calls this with the task description, gets back full source of selected files, starts coding immediately.
- `codemap_status` — quick check on whether the index is fresh or stale.
- `codemap_build` — triggers an incremental rebuild if needed.
The workflow becomes: Claude gets a task, calls `codemap_select`, receives focused context, writes code. No glob. No grep. No wrong turns.
## Measuring it properly
I'm not keen on hand-wavy claims about "tokens saved" or "X% faster." You can't measure a counterfactual. What you can measure is whether the file selection was actually right.
Codemap tracks real metrics from observed data:
**Selection accuracy** — after a session, compare the files codemap selected against the files actually modified via `git diff --name-only`. That gives you precision (how many selected files were actually needed) and recall (how many changed files were pre-selected).

**Exploration overhead** — a `PostToolUse` hook logs every Read, Glob, and Grep call Claude makes after receiving codemap context. If Claude needed to go exploring beyond what codemap gave it, that shows up as overhead. Trending toward zero means the selection is doing its job.

**Context compression** — candidates vs selected, measured in bytes. Both numbers are real.
Here's what a typical stats output looks like:
```
Selection Accuracy (last 10 sessions)
  Avg hit rate:     82%
  Avg precision:    65%
  Avg compression:  97%

Exploration Overhead (last 10 sessions)
  Avg extra Read calls:  1.8
  Avg total Read calls:  6.2
  Overhead ratio:        29%
```
No estimates. No counterfactuals. Just: did codemap pick the right files, and did Claude need to look elsewhere?
## The bigger picture
The pattern here is simple but powerful — use cheap, fast models for the boring structural work so the expensive model can focus its entire context window on actually solving the problem. Haiku is bloody good at reading 2700 one-line summaries and picking the 5 that matter. That's not a task that needs Opus.
Codemap is open source, written in Go, and supports Anthropic, OpenAI, and Google as LLM providers. If you're using Claude Code on anything bigger than a toy project, give it a go.