BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

The saturation problem

Frontier models solve almost everything. Static benchmarks have stopped telling models apart — and stopped providing useful training signal.

On LiveCodeBench, state-of-the-art models exceed 99% Pass@1 on the newest easy split and over 90% on average. Building new, sufficiently hard datasets by hand is slow and expensive — a bottleneck for continued progress.

The key inversion: evolve the solution, not the statement

Most benchmark generation methods are problem-centric: they start by writing a new task and hope it requires new reasoning. In practice, this often produces surface-level variants of existing problems, while still relying on increasingly strong models to solve and validate them. BenchEvolver flips the direction. We evolve solutions first, then derive tasks from them. Because the reasoning structure changes before the problem statement is written, the resulting benchmarks impose genuinely new algorithmic demands while retaining executable ground truth by construction.

🧬

Generate in solution space

Mutate the reference solution to force a dominant algorithmic lift, then derive the statement and tests around the evolved, executable solution.

✓

Verify by consistency

Brute-force triangulation and statement-faithfulness checks, rather than a single LLM judge, ensure that the statement, solution, and tests all define the same task.

📉

Select by real failure

Difficulty is measured, not assigned: a candidate is accepted only if a panel of target models empirically fails on it more often than on the seed.
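The "select by real failure" rule can be sketched as a comparison of panel pass rates on the seed and on the candidate. This is a minimal illustration, not the authors' implementation: the function names, the per-model averaging, and the `min_drop` threshold are all assumptions made for the example.

```python
from typing import Dict, List


def pass_rate(results: List[bool]) -> float:
    """Fraction of attempts that passed all hidden tests."""
    return sum(results) / len(results)


def panel_rate(runs: Dict[str, List[bool]]) -> float:
    """Mean pass rate across the target models in the panel (assumed aggregation)."""
    return sum(pass_rate(r) for r in runs.values()) / len(runs)


def accept_candidate(seed_runs: Dict[str, List[bool]],
                     evolved_runs: Dict[str, List[bool]],
                     min_drop: float = 0.0) -> bool:
    """Accept only if the panel passes the evolved task less often than the seed.

    min_drop is a hypothetical margin; 0.0 means any measured drop is enough.
    """
    return panel_rate(seed_runs) - panel_rate(evolved_runs) > min_drop
```

For instance, a seed solved in 8 of 8 attempts and a candidate solved in 4 of 8 would be accepted, while a candidate that the panel solves just as often as the seed would be rejected.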

See it in action

One mutation, a whole new algorithm

The surface story stays familiar; the underlying computation jumps to a different regime. The same solution-centric principle works across two very different coding domains.

Example 1: Competitive programming (LiveCodeBench)

Seed · Pass@1 8/8

Copy Arrays

Count arrays whose adjacent differences match the original and whose entries satisfy per-index bounds.

one unknown: copy[i] = original[i] + d
bounds: u_i ≤ copy[i] ≤ v_i
solve: intersect intervals for d → O(N)

→

ALGORITHMIC LIFT

Evolved · Pass@1 4/8

XOR-Linked Sequence

Now adjacent XORs must match. The feasible sets are no longer contiguous — interval intersection fails.

one unknown: copy[i] = x XOR p_i
bounds: u_i ≤ x XOR p_i ≤ v_i
solve: XOR sets non-contiguous → digit-DP, O(N·bits)

Why it is harder: the seed is solved by an O(N) interval intersection over one free variable. Switching addition to XOR makes the constraints u_i ≤ x ⊕ p_i ≤ v_i, whose solution sets are non-contiguous — requiring a bitwise digit-DP or trie. The parent's shortcut is provably insufficient.
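To make the contrast concrete, the sketch below counts solutions for both variants. It assumes `p[i]` is the prefix XOR that links `copy[i]` to the free value `x`, and that all values fit in `bits` bits; the function names are illustrative. The XOR counter is a plain bitwise walk that tracks, per constraint, whether the prefix is still tight against the lower or upper bound. It is written for clarity and does not claim the O(N·bits) bound quoted above.

```python
from typing import List, Tuple

Bounds = List[Tuple[int, int]]


def count_additive(original: List[int], bounds: Bounds) -> int:
    """Seed: copy[i] = original[i] + d, so every bound is an interval on d."""
    lo = max(u - o for o, (u, _) in zip(original, bounds))
    hi = min(v - o for o, (_, v) in zip(original, bounds))
    return max(0, hi - lo + 1)


def count_xor(p: List[int], bounds: Bounds, bits: int) -> int:
    """Evolved: count x in [0, 2**bits) with u_i <= x ^ p_i <= v_i for all i."""

    def walk(bit: int, active: list) -> int:
        if not active:            # no constraint is tight: remaining bits are free
            return 1 << (bit + 1)
        if bit < 0:               # all bits fixed and every constraint satisfied
            return 1
        total = 0
        for xb in (0, 1):         # choose the current bit of x
            nxt = []
            for pi, u, v, tight_lo, tight_hi in active:
                yb = xb ^ ((pi >> bit) & 1)          # bit of x ^ p_i
                ub, vb = (u >> bit) & 1, (v >> bit) & 1
                if (tight_lo and yb < ub) or (tight_hi and yb > vb):
                    break                             # bound violated: prune
                tight_lo, tight_hi = tight_lo and yb == ub, tight_hi and yb == vb
                if tight_lo or tight_hi:
                    nxt.append((pi, u, v, tight_lo, tight_hi))
            else:
                total += walk(bit - 1, nxt)
        return total

    start = [(pi, u, v, True, True) for pi, (u, v) in zip(p, bounds)]
    return walk(bits - 1, start)
```

With `p = [0, 5, 3]` and 3-bit values, the single constraint `2 ≤ x ^ 5 ≤ 7` is satisfied by x in {0, 1, 2, 3, 6, 7}, which is not an interval. That gap is why the seed's interval intersection no longer applies.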

Example 2: Scientific coding (SciCode)

Seed · forward simulation

RK4 Integrator

Implement a classical fourth-order Runge–Kutta integrator for a driven damped pendulum, returning the full state-space trajectory.

# given f, state, dt, n ...
runge_kutta_4th_order(...)  # integrate forward → trajectory

→

ALGORITHMIC LIFT

Evolved · inverse problem

Fit ODE Trajectory (Gauss–Newton)

Estimate the unknown initial state and ODE parameters from sparse observations — turning integration into a full nonlinear solver.

# RK4 forward sim, then: damped Gauss-Newton + finite-diff Jacobian + backtracking line search

Why it is harder: the seed performs a single forward simulation of a known ODE. The evolved task inverts it — recovering unknown initial conditions and parameters from noisy, sparsely sampled observations. This requires repeated RK4 simulation inside a damped Gauss–Newton loop with finite-difference Jacobians and a backtracking line search, a qualitatively harder numerical-optimization pipeline.
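The shape of the evolved pipeline can be sketched on a deliberately small stand-in problem. The example below swaps the driven damped pendulum for the scalar decay ODE y' = -k·y (an assumption made to keep the code short) and recovers the initial value `y0` and rate `k` from a few samples. All function names, the damping constant, and the step size `h` are illustrative choices, not values from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def rk4_trajectory(f: Callable[[float, float, float], float],
                   y0: float, dt: float, n: int, theta: float) -> List[float]:
    """Classical RK4 for a scalar ODE y' = f(t, y, theta); returns y_0..y_n."""
    t, y, ys = 0.0, y0, [y0]
    for _ in range(n):
        k1 = f(t, y, theta)
        k2 = f(t + dt / 2, y + dt / 2 * k1, theta)
        k3 = f(t + dt / 2, y + dt / 2 * k2, theta)
        k4 = f(t + dt, y + dt * k3, theta)
        y += dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
        ys.append(y)
    return ys


def decay_rhs(t: float, y: float, k: float) -> float:
    return -k * y  # toy ODE standing in for the pendulum


def residuals(params: Sequence[float], obs, dt: float, n: int) -> List[float]:
    traj = rk4_trajectory(decay_rhs, params[0], dt, n, params[1])
    return [traj[i] - y for i, y in obs]  # obs holds (step index, value) pairs


def fit_gauss_newton(obs, guess: Tuple[float, float], dt: float, n: int,
                     iters: int = 50, damping: float = 1e-8,
                     h: float = 1e-6) -> List[float]:
    p = list(guess)
    for _ in range(iters):
        r = residuals(p, obs, dt, n)
        cost = sum(x * x for x in r)
        cols = []  # finite-difference Jacobian, one column per parameter
        for j in range(2):
            q = list(p)
            q[j] += h
            cols.append([(a - b) / h for a, b in zip(residuals(q, obs, dt, n), r)])
        # Damped normal equations: (J^T J + damping * I) s = -J^T r
        a11 = sum(c * c for c in cols[0]) + damping
        a22 = sum(c * c for c in cols[1]) + damping
        a12 = sum(a * b for a, b in zip(cols[0], cols[1]))
        g1 = sum(c * x for c, x in zip(cols[0], r))
        g2 = sum(c * x for c, x in zip(cols[1], r))
        det = a11 * a22 - a12 * a12
        s1 = (-g1 * a22 + g2 * a12) / det
        s2 = (-g2 * a11 + g1 * a12) / det
        step = 1.0  # backtracking line search: halve until the cost decreases
        while step > 1e-8:
            cand = [p[0] + step * s1, p[1] + step * s2]
            if sum(x * x for x in residuals(cand, obs, dt, n)) < cost:
                break
            step /= 2
        else:
            return p  # no improving step found: treat as converged
        p = cand
    return p
```

Every Gauss-Newton iteration here runs the RK4 integrator several times (once for the residual, once per Jacobian column, and once per line-search trial), which is the sense in which the evolved task wraps the seed's forward simulation inside an optimization loop.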

The framework

A closed loop: Proposer → Evaluator → Memory

A Proposer evolves solutions and writes tasks; an Evaluator validates and measures empirical difficulty; a Memory module feeds accepted lineages and past failures back into search — turning repeated sampling into adaptive evolution.

[Figure: BenchEvolver framework overview] **Overview of BenchEvolver.** A saturated seed task is mutated in solution space; the evaluator filters candidates for validity, diversity, and difficulty; memory records outcomes with reasons; accepted candidates become new parents.

🛠️

Proposer

Mutates the parent solution into a structurally different one, then derives a natural statement, public examples, and tiered hidden tests — all anchored by executing the evolved reference.

⚖️

Evaluator

Triangulates the reference, a brute-force solver, and a statement-only oracle to catch inconsistencies; runs bounded repair; then accepts a candidate only if the target panel empirically fails on it more often than on the seed.

🧠

Memory

Local memory tracks each seed's lineage and error patterns; global memory enforces diversity across seeds — a family that already succeeded must clear a higher difficulty bar.
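The Evaluator's triangulation step can be illustrated with a small differential test: several independent solvers run on the same random inputs, and any disagreement flags an inconsistency between statement, solution, and tests. The sketch below is an assumption-laden toy, not the paper's harness. It uses maximum subarray sum as a placeholder task, and the solver names and input generator are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional

Solver = Callable[[List[int]], int]


def kadane(a: List[int]) -> int:
    """Reference solution: maximum sum of a non-empty contiguous subarray."""
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def brute_force(a: List[int]) -> int:
    """Slow but obviously correct solver used for cross-checking."""
    n = len(a)
    return max(sum(a[i:j]) for i in range(n) for j in range(i + 1, n + 1))


def triangulate(solvers: Dict[str, Solver], trials: int = 200,
                seed: int = 0) -> Optional[List[int]]:
    """Return the first small random input on which the solvers disagree."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.randint(-5, 5) for _ in range(rng.randint(1, 6))]
        if len({solve(a) for solve in solvers.values()}) > 1:
            return a
    return None  # all solvers agreed on every sampled input
```

If a third solver written from the statement alone allowed an empty subarray (returning 0 on all-negative input), `triangulate` would return a counterexample, signalling that the statement and the reference do not define the same task.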

Result 1 · Empirically harder

Evolved tasks cut pass rates — even for their own generator

Across two domains, four target models, and multiple evolvers, evolved problems consistently and substantially reduce Pass@1 relative to their seeds. Crucially, each evolver also drops on its own evolved tasks — this is self-challenging generation, not teacher-to-student distillation.

[Figure: LiveCodeBench seed vs evolved Pass@1] **LiveCodeBench.** Pass@1 on original seeds vs. evolved problems across evolver models (columns) and target models (rows). Evolved tasks reduce pass rates everywhere (k = 4 attempts/model).

[Figure: SciCode seed vs evolved Pass@1] **SciCode.** The same solution-centric principle extends beyond competitive programming to research-style scientific coding, with high validity and large difficulty gains.

Result 2 · Human-verified quality

Harder and more diverse — without losing clarity

Six competitive-programming experts (Codeforces master / IOI / ICPC level) blindly reviewed 207 evolved problems across 72 seeds. Evolved tasks are rated more novel and far more difficult, span a much broader algorithmic surface — and are actually rated clearer than their seeds.

[Figure: Clarity, novelty, difficulty distributions] **Clarity · Novelty · Difficulty.** Evolved problems shift toward higher novelty and difficulty while keeping clarity high.

[Figure: Algorithm category shift, seed to evolved] **Algorithmic surface area.** Seeds are dominated by search/simulation; evolved tasks spread mass across segment trees, HLD/LCT, AC automata, polynomial methods, and more.

The artifact

LiveCodeBench-Plus

A 91-problem benchmark combining 64 human-vetted evolved tasks with 27 difficult original LCB-v6 problems. Every problem passes correctness, quality (≥3/5, Olympiad standard), and difficulty-range gates. Frontier Pass@1 spans 27.5%–62.6% — restoring clear discrimination among the strongest models.

Pass@1 on LiveCodeBench-Plus (91 problems · k=4)

Model	Provider	Pass@1 (%)
GPT-5.5	OpenAI	62.6
Gemini-3.1-Pro	Google	59.1
GPT-5.4	OpenAI	54.1
Gemini-3.5-Flash	Google	50.0
Qwen-3.7-Max	Alibaba	47.5
Gemini-3-Flash	Google	40.1
GPT-5.4-mini	OpenAI	29.6
DeepSeek-V4-Pro	DeepSeek	27.5

Reasoning settings: GPT models use medium reasoning effort, DeepSeek uses high, and Gemini models use adaptive reasoning.

Difficulty shift: seed → evolved (Pass@1 %)

Model	Medium seed	Medium evolved	Hard seed	Hard evolved	Δ Hard
GPT-5.5	100.0	80.0	97.1	62.3	−34.8
GPT-5.4	98.9	74.3	94.8	49.7	−45.1
GPT-5.4-mini	95.7	59.3	79.7	21.7	−58.0
Gemini-3.1-Pro	100.0	78.6	96.5	56.8	−39.7
DeepSeek-V4-Pro	95.7	57.1	83.7	23.2	−60.5

Averaged across all evaluated models, Pass@1 on the Hard split drops from 87.0% to 45.7%, an absolute reduction of 41.3 points.

Result 3 · Closing the loop

Self-generated challenges become training signal

Using gpt-oss-20b as both evolver and target, we evolve problems it already solves into harder verified variants, then train on them with RL. Evolved tasks improve held-out coding performance beyond training on the original seeds alone — the same model exposes and then learns from its own weaknesses.

[Figure: RL gains on LCB v6 Hard] **LCB v6 Hard.** +8.7 with seed+evolved

[Figure: RL gains on LCB-Pro Easy] **LCB-Pro Easy.** +8.3 with seed+evolved

[Figure: RL gains on LCB-Evolved Medium] **LCB-Evolved Medium.** +7.8 with evolved-only

Evolved tasks are not only harder benchmark items — they are reusable training environments that help a model improve on difficult coding regimes beyond its saturated training distribution.

The bigger picture

Toward living benchmarks

Any fixed benchmark eventually saturates. BenchEvolver points to a different model of evaluation: a reproducible pipeline that periodically generates, validates, and calibrates new tasks against current frontier models — aligning evaluation with training, so the same verified tasks that reveal failures also become the environments that fix them.

🔁 Self-challenging, no stronger teacher

Difficulty is measured by executable model failure — including the generator's own. Frontier models can expose and train on their own weaknesses, not just distill from a larger model.

🌐 Domain-general by design

Only the execution harness is domain-specific. The same mutate → write → verify → select loop works on stdin/stdout competitive programming and assertion-based scientific coding alike.

Citation

Cite BenchEvolver

If you find our work useful, please consider citing:

@misc{wu2026benchevolverfrontiertasksynthesis,
  title={BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution},
  author={Yangzhen Wu and Aaron J. Li and Wenjie Ma and Li Cao and Ziheng Zhou and Mert Cemri and Shu Liu and Yuran Xiu and Chenxiao Yan and Haikun Zhao and Bin Yu and Ion Stoica and Dawn Song},
  year={2026},
  eprint={2606.01286},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2606.01286},
}