Building Repo Bench
For more than a year now, I’ve been spending most of my free time trying to understand the best ways to prompt the world’s most powerful AI models.
In the summer of 2024, right before Saint-Jean-Baptiste, Quebec’s national holiday, Anthropic released Sonnet 3.5. While many of you reading this probably spent hours prompting that model, most of you likely didn’t spend much time with Opus 3, Sonnet’s predecessor.
You see, at the time I was working on a game for the Apple Vision Pro called Bomb Squad, and I ran into three main challenges trying to use AI models to help me build it.
1. I didn’t want to write any of the code for this game by hand. I was an experienced game developer, but I wanted to see how far I could push these models to write the entire thing for me.
2. Because of point 1, I needed the model to stay aware of my growing codebase, but before Opus, the biggest context window you could find was 32k tokens. Gemini / Bard existed at the time, but was nowhere near as capable as it is today.
3. Also because of point 1, I needed the model to output complete code so that I could have it update my increasingly complex files.
When I started working on that game, GPT-4 was widely considered the smartest model around, but it had a big character flaw - it was LAZY. You could beg and plead, but it would only ever output small slivers of code, which made using it to change your code extremely tedious, as you’d spend a lot of time copy-pasting snippets back into your files.
When Opus 3 came around in February 2024, it solved two of GPT-4’s biggest flaws, which made working on this project possible. When asked, it would output complete code and adhere to my instructions far better than GPT-4. Furthermore, it was the first model to have a truly massive context window - 200k tokens - which meant I could feed it much larger parts of my code than was previously possible, and it showed glimpses of producing code that would have taken me many rounds of iteration to get right had I written it myself.
It had one major flaw, however - it was extremely slow to output code, which meant that every time I wanted to make a change to my increasingly complex level generator, I’d have to wait - a lot.
That brings us back to June 2024 - Sonnet 3.5. While many people flocked to this model because it was much more capable in Cursor, I turned to it because it was much faster, wrote better code, and still supported that 200k token context window. While Cursor did support the model, it limited the available context to 32k tokens and ran complex black-box systems that broke down whatever context you tried to serve it, in order to stay within the budget that made it cost effective to run at scale. Because of this, to use Sonnet 3.5 to its full potential without going broke, I had to build my own prompts from my files and paste them into the Claude.ai website, which thankfully took my large prompts verbatim and served them to the model intact.
Two weeks later, the first version of Repo Prompt was ready to be tested, fueled by long hours of tediously prompting Claude, having it iterate on my files in full, and patiently waiting for my 5-hour limits to reset.
While the first version of Repo Prompt solved the problem of feeding Claude my codebase context, applying edits to my code still involved a lot of tedious manual copy-pasting. At the time, Cursor’s solution to this was having the model submit a set of changes for their apply model to consume, which would, after a long wait, rewrite your file in full. As files grew in size, the apply model made more and more mistakes, sometimes failing to apply the specified edits or accidentally deleting large swaths of code.
I looked around at what other tools were doing, and the only other tool with a sensible solution to this problem was Aider - a CLI tool that pioneered the concept of diff edits. This approach really spoke to me because the model proposing the change only had to output enough tokens to describe it, and the tool would generate the patch. You no longer needed to ask for the complete file, and as a result you could spend FAR less on tokens to get that patch, while also iterating on code much faster.
The main idea was simple: you’d have the model echo the part of the code you wanted to replace, and then fill in the replacement block. Before agreeing with Aider that this was the best solution, however, I obsessed over trying different approaches to the problem.
- I tried having it generate a unified diff. You could rely on context lines to locate the chunk of code to change, and then use its +/- lines to apply the change. The problem was that in practice, every model I tested made trivial mistakes: context lines that didn’t match the file, a function quietly changed to private, whitespace added that wasn’t there, or lines never marked as deleted or added.
- I tried having the model emit start and end selectors to mark the section of code I wanted to change. This often worked quite well and was more token efficient than search/replace, but it had other issues. The model’s mental map of the file was flawed, and you’d often see large sections of code deleted between the start and end selectors, as the model failed to perfectly fill the holes it created.
Moving to search/replace was a lot more reliable, but it still had many issues. Models loved to change whitespace, make subtle unintended changes to the code they echoed, or provide ambiguous search blocks that matched in many locations. Today, this problem is much improved, but even Sonnet 4.5 or GPT-5 Codex will make these exact same mistakes, albeit much less frequently.
I spent hours on end poring over how to engineer systems to accommodate these failures and improve edit success rates, and I’m quite proud of Repo Prompt’s apply_edits tooling today, as it makes it possible to one-shot many complicated edits and handle fairly severe model formatting failures with grace.
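To make the search/replace idea concrete, here is a minimal sketch of how such an edit might be applied, with a whitespace-tolerant fallback and an ambiguity check. This is an illustration of the general technique, not Repo Prompt’s actual apply_edits implementation, and the function name is my own.

```python
import re

def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one model-proposed edit: swap `search` for `replace` in `source`."""
    # 1. Prefer an exact, unique match - cheapest and least ambiguous.
    count = source.count(search)
    if count == 1:
        return source.replace(search, replace, 1)
    if count > 1:
        raise ValueError("ambiguous search block: it matches multiple locations")

    # 2. Fall back to whitespace-tolerant matching, since models love to
    #    silently reformat indentation or trailing spaces when echoing code.
    pattern = r"\s+".join(re.escape(token) for token in search.split())
    matches = list(re.finditer(pattern, source))
    if len(matches) == 1:
        start, end = matches[0].span()
        return source[:start] + replace + source[end:]
    if matches:
        raise ValueError("ambiguous search block even after whitespace normalization")
    raise ValueError("search block not found in file")
```

Real tooling layers on more recovery strategies (fuzzier matching, indentation repair, picking the right file among several candidates), but even this sketch surfaces the two failure modes above: drifted whitespace and search blocks that match in more than one place.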
File editing prowess is at the core of what makes a model capable as a coding assistant. Being precise about what to change leads to easier-to-review diffs, fewer regressions, and more efficient use of output tokens. Most people only know how to get models to edit code from the convenience of an agent harness, but with sufficiently detailed prompts, models can write their responses with structured formatting that can be parsed and used to make dozens of file edits at once in a single response. With the right prompting, you can get models to output far more ambitious changes than they’d manage with a tool call, and, given how expensive tool use can be (every tool call is another model request, made cheaper only by caching), this approach can dramatically improve token efficiency.
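To illustrate what that structured formatting could look like, here is a hypothetical response format and a small parser for it. The tags and layout are invented for this example; Repo Prompt’s actual prompt and edit format differ.

```python
import re
from dataclasses import dataclass

@dataclass
class FileEdit:
    path: str
    search: str
    replace: str

# Hypothetical format: the model wraps each change in an <edit> block, so a
# single response can carry dozens of edits across many files.
EDIT_PATTERN = re.compile(
    r'<edit file="(?P<path>[^"]+)">\s*'
    r'<<<<<<< SEARCH\n(?P<search>.*?)\n=======\n(?P<replace>.*?)\n>>>>>>> REPLACE\s*'
    r'</edit>',
    re.DOTALL,
)

def parse_edits(response: str) -> list[FileEdit]:
    """Pull every proposed edit out of a single model response."""
    return [
        FileEdit(m.group("path"), m.group("search"), m.group("replace"))
        for m in EDIT_PATTERN.finditer(response)
    ]
```

Each parsed edit can then be run through the same apply logic sketched earlier, so one response turns into a whole batch of file changes without any extra round trips to the model.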
With all that infrastructure for prompting, parsing, and applying edits already in place, I set out to build a suite of problems that formalizes the many failure modes models run into when trying to edit files with this technique.
When applying edits from a model, the first thing you have to do is locate a search block that uniquely identifies the block of code to be replaced. As I mentioned, models fail at this a lot. Consequently, for every test in this benchmark, as difficulty scales, the challenge of efficiently locating a minimal search block increases, and models that fail to identify these blocks, or that echo too much code (like trying to rewrite the entire file), simply fail their tests.
Second, as context builds up over more complex requests or long-running agent sessions, models start to forget some of the intricacies of your prompt and problem, and start making the wrong changes. This bench goes a step further and creates purposeful decoy files that require the model to properly reason over the context it’s given. Given these decoys, many models will attempt edits by selecting either the wrong section of code in the right file, or similar-looking code in the wrong file. While this may seem contrived, in a long-running agentic session models retain many duplicate or slightly different file slices from repeated file reads, holding an increasingly noisy history of changes.
Finally, in addition to respecting the formatting requirements for proposing edits and navigating noisy context, each task in this bench requires precise multi-edits and fails the model for making unintended changes. One example I’ve often run into, especially with the GPT series of models, is that they will attempt to change a complex function but leave curly braces imbalanced, leading to very noisy, hard-to-read compiler errors. Fixing such broken files is one of the problems on the test! Another involves making precise changes to functions, with any collateral edit causing a failure.
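That brace imbalance is the kind of damage that is cheap to detect mechanically. As a toy illustration (not the benchmark’s actual verification logic), a simple structural check looks like this:

```python
def braces_balanced(source: str) -> bool:
    """Toy structural check: do (), [], and {} pairs line up?

    Real code has strings and comments that a proper checker would skip;
    a compiler or parser is the true arbiter.
    """
    closers = {")": "(", "]": "[", "}": "{"}
    stack: list[str] = []
    for ch in source:
        if ch in "([{":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack
```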
Now, given this gauntlet of tests, we have two main problems to address to build a reliable benchmark.
- We need to efficiently serve many variations of these problems to models, and a model must never have seen the exact variant it is served, to control for memorization.
- We need to verify that each problem is solved correctly, without failing the model for minor formatting differences that are technically correct.
As a result, Repo Bench is a generative problem set: code is created deterministically from a seed to build a set of 10 problems, with difficulty scaled to present models with easy, medium, and hard variants. Problem difficulty scales on many axes in Repo Bench, but all along the vectors described above: harder problems involve editing larger files with more decoys and noise, while also requiring more edits in a single response than easier ones. Given how much more demanding the hard problems are for models to solve, it only makes sense to weight them differently.
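As a rough sketch of how a seeded, generative problem set can scale difficulty along those axes (the parameters and structure here are invented for illustration, not Repo Bench’s actual generator):

```python
import random
from dataclasses import dataclass

@dataclass
class ProblemSpec:
    file_count: int   # files served as context, including decoys
    decoy_count: int  # near-duplicate files meant to mislead the model
    edit_count: int   # edits required in a single response
    file_lines: int   # approximate size of the target file

# Illustrative difficulty knobs only; the real benchmark's numbers differ.
DIFFICULTY = {
    "easy":   ProblemSpec(file_count=2, decoy_count=0, edit_count=2, file_lines=150),
    "medium": ProblemSpec(file_count=4, decoy_count=2, edit_count=4, file_lines=500),
    "hard":   ProblemSpec(file_count=8, decoy_count=5, edit_count=8, file_lines=1500),
}

def generate_problem(seed: int, difficulty: str) -> dict:
    """Build one problem variant; the same seed always yields the same problem."""
    rng = random.Random(f"{seed}:{difficulty}")  # string seeds are deterministic
    spec = DIFFICULTY[difficulty]
    # A real generator would emit actual code files; here we only show that
    # every choice (target file, lines to edit) flows from the seeded RNG.
    target_file = rng.randrange(spec.file_count - spec.decoy_count)
    edit_lines = sorted(rng.sample(range(spec.file_lines), spec.edit_count))
    return {"spec": spec, "target_file": target_file, "edit_lines": edit_lines}
```

In a scheme like this, two users running the same seed see the same problem, while a fresh seed yields a variant no model has seen before.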
The last thing to control for with a benchmark like this is response variance. It might be surprising to learn that models are non-deterministic: each time you feed them a prompt, they can and will respond with potentially very different answers. To account for this while still properly ranking model scores, the final piece of the puzzle was to give this bench to the community of Repo Prompt users, allowing them to test the models they care about and upload their scores to the leaderboard.
It took a bit of time to get the right formula for handling noisy variance in responses, but here’s where I landed (sketched in code after the list):
- We measure the interquartile range of the response distribution and remove outliers at both ends that sit more than 2 standard deviations from the rest.
- From that filtered set, we take the highest score and remove all responses that are more than 15% below it.
- We take the median of the remaining scores, favoring the higher one if only two runs remain in the dataset.
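In code, that aggregation looks roughly like this. It is a minimal sketch that follows the steps above, using distance from the mean for the outlier cut; the production implementation may differ in detail.

```python
import statistics

def aggregate_score(runs: list[float]) -> float:
    """Collapse a model's noisy per-run scores into a single leaderboard score."""
    if not runs:
        raise ValueError("need at least one run")
    scores = sorted(runs)

    # 1. Trim outliers at both ends that sit more than 2 standard deviations
    #    away from the rest of the distribution.
    if len(scores) >= 3:
        mean, stdev = statistics.mean(scores), statistics.stdev(scores)
        trimmed = [s for s in scores if abs(s - mean) <= 2 * stdev]
        scores = trimmed or scores  # never throw away everything

    # 2. Keep only runs within 15% of the best remaining score.
    top = max(scores)
    scores = [s for s in scores if s >= top * 0.85]

    # 3. Median of what survives; with exactly two runs, a plain median would
    #    average them, so favor the higher of the two instead.
    if len(scores) == 2:
        return max(scores)
    return statistics.median(scores)
```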
The end result is a scoreboard that, for the skills these problems are measuring, matches my experience and intuition about how these models perform in practice.
As I mention on the leaderboard, these tests will not tell you how intelligent a model is, nor how good it is at writing the best code for a problem. Rather, they paint a picture of a model’s adaptability to output formats it wasn’t explicitly trained on, and of how well it adheres to instructions given only few-shot examples in the system prompt. Most importantly, they push the model to respond with a high degree of accuracy when given large amounts of confusing and ambiguous context.
In practice, these skills differentiate the best assistant and coding models from the rest.
If you’ve made it this far, I hope you’ve come to appreciate what makes Repo Bench different and worth running. Check out the benchmark and leaderboard here, and if you haven’t already, please also give Repo Prompt a try! It’s become a rather powerful toolbox for context engineers and AI agents alike (via MCP!).