Frontier models now exceed 99% Pass@1 on LiveCodeBench easy, saturating our benchmarks. BenchEvolver automatically transforms existing coding problems into substantially harder, verified variants β tasks that are hard even for the model that created them.
On LiveCodeBench, state-of-the-art models exceed 99% Pass@1 on the newest easy split and over 90% on average. Building new, sufficiently hard datasets by hand is slow and expensive β a bottleneck for continued progress.
Most benchmark generation methods are problem-centric: they start by writing a new task and hope it requires new reasoning. In practice, this often produces surface-level variants of existing problems, while still relying on increasingly strong models to solve and validate them. BenchEvolver flips the direction. We evolve solutions first, then derive tasks from them. Because the reasoning structure changes before the problem statement is written, the resulting benchmarks impose genuinely new algorithmic demands while retaining executable ground truth by construction.
Mutate the reference solution to force a dominant algorithmic lift, then derive the statement and tests around the evolved, executable solution.
Brute-force triangulation and statement-faithfulness checks ensure statement, solution, and tests define the same task β not a single LLM judge.
Difficulty is measured, not assigned: a candidate is accepted only if a panel of target models empirically fails more than on the seed.
The surface story stays familiar; the underlying computation jumps to a different regime. The same solution-centric principle works across two very different coding domains.
Count arrays whose adjacent differences match the original and whose entries satisfy per-index bounds.
one unknown: copy[i] = original[i] + d bounds: u_i β€ copy[i] β€ v_i solve: intersect intervals for d β O(N)Now adjacent XORs must match. The feasible sets are no longer contiguous β interval intersection fails.
one unknown: copy[i] = x XOR p_i bounds: u_i β€ x XOR p_i β€ v_i solve: XOR sets non-contiguous β digit-DP, O(NΒ·bits)Implement a classical fourth-order RungeβKutta integrator for a driven damped pendulum, returning the full state-space trajectory.
# given f, state, dt, n ... runge_kutta_4th_order(...) # integrate forward β trajectoryEstimate the unknown initial state and ODE parameters from sparse observations β turning integration into a full nonlinear solver.
# RK4 forward sim, then: damped Gauss-Newton + finite-diff Jacobian + backtracking line searchA Proposer evolves solutions and writes tasks; an Evaluator validates and measures empirical difficulty; a Memory module feeds accepted lineages and past failures back into search β turning repeated sampling into adaptive evolution.
Mutates the parent solution into a structurally different one, then derives a natural statement, public examples, and tiered hidden tests β all anchored by executing the evolved reference.
Triangulates the reference, a brute-force solver, and a statement-only oracle to catch inconsistencies; runs bounded repair; then accepts only if the target panel empirically fails more.
Local memory tracks each seed's lineage and error patterns; global memory enforces diversity across seeds β a family that already succeeded must clear a higher difficulty bar.
Across two domains, four target models, and multiple evolvers, evolved problems consistently and substantially reduce Pass@1 relative to their seeds. Crucially, each evolver also drops on its own evolved tasks β this is self-challenging generation, not teacher-to-student distillation.
Six competitive-programming experts (Codeforces master / IOI / ICPC level) blindly reviewed 207 evolved problems across 72 seeds. Evolved tasks are rated more novel and far more difficult, span a much broader algorithmic surface β and are actually rated clearer than their seeds.
A 91-problem benchmark combining 64 human-vetted evolved tasks with 27 difficult original LCB-v6 problems. Every problem passes correctness, quality (β₯3/5, Olympiad standard), and difficulty-range gates. Frontier Pass@1 spans 27.5%β62.6% β restoring clear discrimination among the strongest models.
Reasoning settings: GPT models use medium reasoning effort, DeepSeek uses high, and Gemini models use adaptive reasoning.
| Model | Medium seed | Medium evolved | Hard seed | Hard evolved | Ξ Hard |
|---|---|---|---|---|---|
| GPT-5.5 | 100.0 | 80.0 | 97.1 | 62.3 | β34.8 |
| GPT-5.4 | 98.9 | 74.3 | 94.8 | 49.7 | β45.1 |
| GPT-5.4-mini | 95.7 | 59.3 | 79.7 | 21.7 | β58.0 |
| Gemini-3.1-Pro | 100.0 | 78.6 | 96.5 | 56.8 | β39.7 |
| DeepSeek-V4-Pro | 95.7 | 57.1 | 83.7 | 23.2 | β60.5 |
Averaged across all evaluated models, the Hard split drops from 87.0% β 45.7% Pass@1 β an absolute reduction of 41.3 points.
Using gpt-oss-20b as both evolver and target, we evolve problems it already solves into harder verified variants, then train on them with RL. Evolved tasks improve held-out coding performance beyond training on the original seeds alone β the same model exposes and then learns from its own weaknesses.
Any fixed benchmark eventually saturates. BenchEvolver points to a different model of evaluation: a reproducible pipeline that periodically generates, validates, and calibrates new tasks against current frontier models β aligning evaluation with training, so the same verified tasks that reveal failures also become the environments that fix them.
Difficulty is measured by executable model failure β including the generator's own. Frontier models can expose and train on their own weaknesses, not just distill from a larger model.
Only the execution harness is domain-specific. The same mutate β write β verify β select loop works on stdin/stdout competitive programming and assertion-based scientific coding alike.
If you find our work useful, please consider citing: