You're Blaming the Model. The Harness Did It.
Everyone's arguing GPT-5 vs Opus while the real bottleneck in LLM coding agents is something nobody talks about: the edit tool format.
A benchmark dropped quietly on Hacker News today. By tomorrow it'll probably be forgotten under the next model release announcement. But I think it's the most important AI engineering post of the week, and possibly the month.
The headline: one engineer improved 15 different LLMs at coding tasks in a single afternoon. He didn't change the models. He didn't fine-tune anything. He changed the edit tool.
That's it.
The benchmark nobody's running
If you're building with coding agents (or just using them), you've probably blamed the model when things went sideways. The file didn't get updated correctly. The patch failed. The agent got confused about whitespace and spiraled into retry hell.
That's the harness failing, not the model.
The author of oh-my-pi, a fork of Mario Zechner's Pi coding agent, ran 16 models across three different edit tool formats against 180 tasks derived from the React codebase. Real bugs. Real files. The results are not subtle:
- Grok Code Fast 1: 6.7% → 68.3% (roughly 10x)
- MiniMax: more than doubled
- Grok 4 Fast's output token count: dropped 61%, because it stopped burning tokens on failed retry loops
The format that destroyed performance? apply_patch, the OpenAI-flavored diff format used by Codex. It works great if you're using a model specifically fine-tuned to produce it. Give it to any other model and patch failures hit 46–51%. The model wasn't confused about what to fix. It just couldn't express it in a format the harness understood.
Three broken approaches everyone's using
apply_patch (Codex): a string-based diff format with strict rules, baked into the weights of OpenAI models through training. Hand it to Grok or GLM and the model knows exactly what the bug is. It just can't speak the language. Failure rate: around 50% on non-OpenAI models.
str_replace (Claude Code, most others): find the exact old text, swap in the new. Simple to reason about. The catch is that the model has to reproduce every character exactly — whitespace, indentation, the whole thing. When it can't (which happens a lot with longer files), you get "String to replace not found in file," a bug so common it has its own GitHub issues megathread with 27+ linked issues.
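A minimal sketch makes the failure mode concrete. This is not any agent's actual implementation, just the exact-match contract that str_replace-style tools impose (function name and error text are illustrative, though the "not found" message mirrors the one quoted above):

```python
def str_replace(source: str, old: str, new: str) -> str:
    # The model must reproduce `old` byte-for-byte: whitespace, tabs, everything.
    count = source.count(old)
    if count == 0:
        raise ValueError("String to replace not found in file")
    if count > 1:
        raise ValueError("String occurs multiple times; add surrounding context")
    return source.replace(old, new, 1)

file_text = "def greet():\n\treturn 'hi'\n"

# Exact match (tab preserved) succeeds:
patched = str_replace(file_text, "\treturn 'hi'", "\treturn 'hello'")

# The model emits four spaces where the file has a tab -> hard failure,
# even though it clearly "knows" the edit it wants to make:
try:
    str_replace(file_text, "    return 'hi'", "    return 'hello'")
except ValueError as err:
    print(err)
```

One wrong character of whitespace and the whole edit is rejected, which is exactly the retry loop described above.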
Cursor's approach: they trained a separate 70B model just to handle the edit application step. The problem was hard enough that a well-funded company threw another model at it. Their own blog post admits that "fully rewriting the full file outperforms diff-like edits for files under 400 lines." That's a workaround dressed up as an architecture decision.
The Hashline idea
The alternative proposed in the post is clever and surprisingly simple. When the model reads a file, every line comes back tagged with a short content hash:
1:a3|function hello() {
2:f1|  return "world";
3:0e|}
The model edits by referencing those tags: "replace line 2:f1" or "insert after 3:0e." No need to reproduce the content. If the file changed since the last read, the hashes won't match and the edit is rejected before anything corrupts.
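The mechanics fit in a few lines. Here's a hedged sketch of the idea, not the oh-my-pi implementation: the post excerpt doesn't specify the hash function or tag length, so truncated SHA-1 and the helper names are assumptions:

```python
import hashlib

def hashline_view(text: str) -> list[str]:
    # Tag every line with its number and a short content hash.
    # (Hash choice and 2-char length are assumptions, not the post's spec.)
    return [f"{i + 1}:{hashlib.sha1(line.encode()).hexdigest()[:2]}|{line}"
            for i, line in enumerate(text.splitlines())]

def replace_line(text: str, tag: str, new_line: str) -> str:
    # A tag like "2:f1" names a line by number plus expected content hash.
    num, expected = tag.split(":")
    lines = text.splitlines()
    idx = int(num) - 1
    actual = hashlib.sha1(lines[idx].encode()).hexdigest()[:2]
    if actual != expected:
        # The file changed since the last read: reject instead of corrupting.
        raise ValueError(f"Stale tag {tag}; line {num} now hashes to {actual}")
    lines[idx] = new_line
    return "\n".join(lines) + "\n"

src = 'function hello() {\n  return "world";\n}\n'
view = hashline_view(src)        # tagged lines the model would see
tag = view[1].split("|")[0]      # the tag for line 2
patched = replace_line(src, tag, '  return "universe";')
```

The key property: the model never reproduces file content, it only quotes short tags, and any tag minted against a stale read fails closed.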
I've seen variations of this in production systems before: UUID-keyed operations, content-addressable edits. But nobody had benchmarked it cleanly across 16 models until now. Hashline matches or beats str_replace across the board. Weaker models gain the most.
The broader point
Aider's own benchmarks showed that format choice alone swung GPT-4 Turbo from 26% to 59%. A JetBrains paper (Diff-XYZ) confirmed it systematically: no single edit format dominates across models and use cases. This isn't new information. It's been in the literature for over a year. The industry keeps ignoring it because arguing about which foundation model is better is a much more exciting conversation.
The engineer spent $300 on benchmarking. The improvement on Gemini alone was +8%, bigger than most model upgrades deliver, at zero training compute.
We have an entire discourse built around "which model is the best coder" and we're not even controlling for the most basic variable in the experiment: how you ask the model to write the file.
You wouldn't benchmark a race car driver by putting them in a car with a broken transmission and then publishing a chart of lap times. But that's exactly what most model coding benchmarks do.
What this means if you're integrating LLMs
If you're building on top of LLM coding APIs or running evals, three things to act on:
Your eval harness is probably a confound. If you're using the same edit format across all models, you're measuring format compatibility as much as coding ability. That's not a model benchmark. It's a format benchmark.
Retry loops are a symptom. If your agent is burning tokens on retries, check the edit format before you blame the model. Nine times out of ten, the model knew the answer. It just couldn't write it down in a way your tooling accepted.
The cheapest performance win you're ignoring: audit your tool schemas, error messages, and output parsing. The model probably knows the answer. The question is whether the interface lets it show you.
The real question isn't "which model is best at coding?" It's "given this model, what interface lets it actually show what it knows?" Those are completely different problems, and right now almost nobody is working on the second one.