Friday, June 26, 2026
AI Coded Nonstop for 19 Days — The MirrorCode Benchmark Changes How We Measure Code Generation
Posted by

One MirrorCode task cost $2,600 to run. A single AI agent worked on it continuously for 19 days, no human involvement, no interruptions. It didn't finish.
That's the headline, and it's the least interesting thing about Epoch AI's new MirrorCode benchmark — because the 19-day failure is actually progress. The AI kept going for 19 days. A year ago, it wouldn't have attempted the task at all.
Let me explain why MirrorCode matters more than any coding benchmark you've seen this year.
What MirrorCode Actually Measures
Here's what sets MirrorCode apart from every other AI coding benchmark: it doesn't give the model the source code.
In SWE-bench, you give the model a codebase, an issue description, and ask it to patch the diff. In HumanEval, you give it a function signature and docstring. In MirrorCode, the model gets a compiled binary, a blank directory, and a CLI prompt. No source. No internet. No hints.
The task is to reimplement the program from scratch, relying entirely on test outputs to reverse-engineer the behavior. The model only sees pass/fail from hidden end-to-end tests. It has to figure out the entire architecture — data structures, parsing logic, edge cases — through inference alone.
This is much closer to real software maintenance and legacy migration than generating a function from a comment. Most professional software engineering is understanding and replacing existing systems, not greenfield generation. MirrorCode is the first benchmark that seriously tests that capability.
The $2,600 Run
The 19-day run that's getting all the attention came from one of MirrorCode's largest targets: Pkl, Apple's open-source programmable configuration language, clocking in at roughly 61,000 lines of original code. The model burned through a 1-billion-token budget — at about $2.60 per million tokens, that's $2,600 before you blink.
The model didn't solve it. But it didn't collapse, hallucinate itself into a corner, or go silent either. It kept writing code, running tests, and iterating for 19 straight days. That staying power is the actual signal here.
Compare this to older models on the same benchmark. Claude Opus 4.1 would submit at less than 1% of its token budget with a message like "Given the time constraints, let me submit what we have." The model was anthropomorphizing time pressure it didn't actually face. Opus 4.6 and 4.7 don't do that anymore — they persist until the visible tests pass or the budget runs out.
That's not a small improvement. That's a fundamental behavioral shift in how these systems handle open-ended tasks.
Model Rankings — Clear Leader, Tight Race
| Model | MirrorCode Solve Rate |
|---|---|
| Claude Opus 4.7 | 56% |
| GPT-5.5 | 44% |
| Gemini 3.1 Pro Preview | 32% |
Opus 4.7 leads, but GPT-5.5 isn't far behind at 44%. Gemini 3.1 Pro Preview trails at 32%. These gaps are significant — MirrorCode tasks require genuine architectural reasoning, and a 12-point gap between the top two models is meaningful.
But here's what's more interesting: even failed solutions typically pass 90% or more of the tests. The models are close to correct on almost every task. The difference between a "solve" and a "near-miss" is often a single architectural choice made in the first few hundred messages.
The Gotree Story: 16K Lines, 14 Hours, $251
The standout result is Claude Opus 4.7's reimplementation of gotree — a bioinformatics toolkit for phylogenetic tree analysis, originally 16,905 lines of Go. Opus 4.7 rebuilt it in Rust at 7,644 lines, in 14 hours, for $251 in compute.
Human baseline estimates from the researchers: 2 to 17 weeks for a skilled engineer without AI assistance.
The 14-hour timeline matters less than the quality. Opus 4.7 passed 2,000 of 2,001 end-to-end tests. The single failure was a trivial formatting edge case. It made the correct architectural choice — a graph with first-class Edge objects carrying length values — where older models chose a parent/child tree structure that couldn't handle topology modifications.
It also discovered a quirk in the original program's parsing of Newick-format tree files (the reference binary treats quote characters as regular characters, violating the spec), wrote a quote-aware parser initially, tested it against the binary, caught the discrepancy, and deleted its quoting code in favor of matching the actual behavior.
That's not regurgitating training data. That's exploratory debugging driven by test feedback.
The Cost Story
MirrorCode also reveals something weird about pricing trends. Epoch AI notes that GPT-5.5 costs three times as much as GPT-5 for the same tasks, while Claude Opus 4.7 runs three times cheaper than Claude Opus 4.1. There's no pattern here — model generations are getting both more expensive and cheaper depending on who you're buying from.
The most cost-effective model on the leaderboard right now is probably the one you can afford to run at sufficient inference budget. Opus 4.7's gotree run at $251 is the benchmark to beat.
Open Source, Mostly
Epoch AI has open-sourced the scaffold and 22 of 25 target programs covering 132 task instances across six programming languages (Python, Rust, C, Go, OCaml, Ada). Three programs remain private for holdout testing. The scaffold uses the Inspect library with a ReAct agent pattern, a text_editor tool, and a compaction step that summarizes conversation history to stay within context windows.
The Caveat Everyone Will Cite
The researchers are upfront about the elephant in the room: since MirrorCode uses open-source programs as targets, the models may have seen the original code during training. Initial testing suggests "the results were not dominated by memorization, but we cannot rule out the possibility that memorization contributes to AI performance."
This is the same debate that follows every coding benchmark. I'd argue the gotree result — with its architectural re-decision, its discovery and accommodation of the Newick parsing quirk, and its deletion of the quoting code based on test feedback — is hard to explain with pure memorization. But we should treat the numbers as upper bounds, not absolutes.
Why This Matters Now
MirrorCode landed months ago with the initial Epoch AI and METR papers. The reason it's making headlines today with The Decoder's coverage is that the Opus 4.7 results are new — and they're the strongest signal yet that the gap between "code generation from prompts" and "actual software engineering by AI" is closing faster than most people realize.
If you're building on AI agents for code tasks, MirrorCode is the benchmark to watch, not SWE-bench. The proxy is better. The tasks are harder in the right ways. And the trend line — 30% solve rates a year ago to 56% today — suggests we're going to need harder tasks soon.
The 19-day run didn't finish, but it ran for 19 days. That's not a failure. That's a floor.