SKILLFLOW: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Ziao Zhang1 Kou Shi1 Shiting Huang1 Avery Nie2 Yu Zeng1 Yiming Zhao1 Zhen Fang1 Qisheng Su1 Haibo Qiu3 Wei Yang1 Qingnan Ren1 Shun Zou1 Wenxuan Huang1 Lin Chen1 Zehui Chen1 Feng Zhao1*
1 University of Science and Technology of China | 2 University of Toronto | 3 University of Sydney
* Corresponding author

SKILLFLOW studies whether autonomous agents can externalize experience into reusable skills, repair those skills after failure, and maintain a compact high-utility library over time.

166 tasks · 20 workflow families · 5 broad domains · +8.43 best completion gain

Why SkillFlow

Most agent benchmarks stop at asking whether a model can consume a provided skill. SKILLFLOW asks a harder and more realistic question: can an agent solve a task, distill the lesson into a reusable skill artifact, repair that skill after later failures, and carry a better library forward through a family of related tasks?

Core idea. SKILLFLOW organizes 166 tasks into 20 workflow families across five domains, all grounded by Domain-Agnostic Execution Flow (DAEF) so that transfer is measured at the level of reusable procedure rather than topical overlap.
[Figure: SkillFlow conceptual overview]
The overview contrasts conventional static-skill evaluation with SKILLFLOW's lifelong setting, where agents externalize experience into reusable artifacts, revise them through patches, and transfer them across tasks that share a common DAEF.

Abstract

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time.

SKILLFLOW is a benchmark of 166 tasks across 20 workflow families in which task construction follows a shared Domain-Agnostic Execution Flow (DAEF). Under the proposed Agentic Lifelong Learning protocol, agents begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward.

Experiments reveal selective rather than universal gains. Claude Opus 4.6 improves from 62.65% to 71.08% task completion, but high skill usage does not automatically mean high utility: Kimi K2.5 gains only +0.60 despite 66.87% skill usage, while both Qwen-Coder-Next and Qwen3-Coder-480B regress under skill evolution.

Benchmark Visuals

These figures summarize the dual-agent construction pipeline, the five-domain benchmark taxonomy, and the DAEF correspondences that let SKILLFLOW study procedural transfer beyond surface-level domain overlap.

[Figure: SkillFlow construction pipeline]
The dual-agent construction pipeline starts from seed tasks and curated skills, expands new families under a fixed DAEF, and filters them through Docker-grounded validation plus human review.
[Figure: SkillFlow data taxonomy]
SKILLFLOW spans five broad domains and 20 workflow families, giving the benchmark both topical breadth and repeated opportunities for skill transfer within each family.
[Figure: DAEF example]
A DAEF example shows how the same abstract workflow can survive major changes in files, entities, and business semantics while still supporting reusable skill transfer.

What the Benchmark Tracks

SKILLFLOW reports both outcome and process: whether agents solve tasks, how efficiently they do so, and how their skill libraries evolve under sequential evaluation.

Task Success Rate

The primary score is whether the final output satisfies the verifier, giving a clean measure of end-to-end task completion.

Efficiency

SKILLFLOW also reports interaction turns, monetary cost, and output tokens to show whether gains come with better or worse execution efficiency.

Skill Generation and Reuse

Each setting tracks how many skills survive in the final family-local library and how often previously written skills are actually reused later.

Family-Local Transfer

Tasks follow a fixed order within each workflow family, making every improvement or regression traceable to how well lessons move forward under a shared DAEF.
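The family-local protocol these metrics describe can be sketched as a simple loop: start with an empty library, solve tasks in a fixed order, verify outcomes, and patch the library after each task. The function names below (`solve`, `verify`, `patch_library`) are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class FamilyRun:
    """Per-family tallies for the metrics SKILLFLOW reports."""
    completed: int = 0                       # tasks passing the verifier
    reused: int = 0                          # tasks that reused an earlier skill
    library: dict = field(default_factory=dict)  # skill path -> skill body

def run_family(tasks, solve, patch_library, verify):
    """Run one workflow family under the lifelong protocol (sketch)."""
    run = FamilyRun()                        # agents begin without skills
    for task in tasks:                       # fixed within-family order
        trajectory, used_skill = solve(task, run.library)
        if used_skill:                       # previously written skill reused
            run.reused += 1
        if verify(task, trajectory):         # outcome metric: verifier pass
            run.completed += 1
        # externalize the lesson; the updated library carries forward
        run.library = patch_library(run.library, trajectory)
    return run                               # library resets across families
```

Because the library is threaded through the loop, every later success or regression is attributable to what earlier tasks wrote into it.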

Main Experimental Results

Table 1 reports benchmark-level averages for 11 model variants across Claude Code, Codex CLI, Qwen Coder, and Kimi CLI. The clearest positive case is Claude Opus 4.6, which improves from 62.65% to 71.08% completion under lifelong skill evolution.

Legend: ▲/▼ mark the direction of change under skill evolution. For completion, ▲ (higher) is better; for turns, cost, and output tokens, ▼ (lower) is better. Δ %Comp. and Δ Turns are absolute differences; Δ Cost and Δ Tok. are relative changes.

| Agent | Model | Vanilla %Comp. | Van. Turns | Van. Cost | Van. Out Tok. | Evolve %Comp. | Ev. Turns | Ev. Cost | Ev. Out Tok. | #Skills | %Use | Δ %Comp. | Δ Turns | Δ Cost | Δ Tok. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Code | Claude Sonnet 4.5 | 49.40 | 25.04 | 0.293 | 1.07 | 55.42 | 24.88 | 0.246 | 0.85 | 2.55 | 72.89 | ▲ 6.02 | ▼ 0.16 | ▼ 16.04% | ▼ 20.56% |
| | Claude Opus 4.5 | 58.43 | 18.83 | 0.571 | 1.50 | 60.84 | 18.31 | 0.384 | 1.40 | 1.50 | 60.84 | ▲ 2.41 | ▼ 0.52 | ▼ 32.87% | ▼ 6.67% |
| | Claude Sonnet 4.6 | 56.63 | 17.48 | 0.168 | 1.23 | 56.63 | 17.42 | 0.245 | 1.59 | 2.55 | 53.01 | • 0.00 | ▼ 0.06 | ▲ 45.83% | ▲ 29.27% |
| | Claude Opus 4.6 | 62.65 | 17.34 | 0.665 | 3.00 | 71.08 | 19.00 | 0.615 | 2.39 | 1.05 | 45.78 | ▲ 8.43 | ▲ 1.66 | ▼ 7.52% | ▼ 20.33% |
| | MiniMax M2.5 | 28.31 | 35.22 | 0.010 | 0.44 | 34.94 | 34.01 | 0.010 | 0.54 | 2.50 | 32.53 | ▲ 6.63 | ▼ 1.21 | • 0.00% | ▲ 22.73% |
| | MiniMax M2.7 | 37.35 | 25.44 | 0.012 | 0.50 | 36.75 | 27.42 | 0.017 | 0.96 | 4.65 | 51.20 | ▼ 0.60 | ▲ 1.98 | ▲ 41.67% | ▲ 92.00% |
| Codex CLI | GPT 5.4 | 33.13 | 23.89 | 0.410 | 4.05 | 36.75 | 24.17 | 0.459 | 4.43 | 1.05 | 81.33 | ▲ 3.62 | ▲ 0.28 | ▲ 11.95% | ▲ 9.38% |
| | GPT 5.3 Codex | 52.41 | 17.74 | 0.492 | 6.80 | 46.39 | 17.14 | 0.434 | 6.82 | 1.10 | 84.94 | ▼ 6.02 | ▼ 0.60 | ▼ 11.79% | ▲ 0.29% |
| Qwen Coder | Qwen-Coder-Next | 45.18 | 18.64 | 0.103 | 9.74 | 44.58 | 19.91 | 0.113 | 10.69 | 5.45 | 12.05 | ▼ 0.60 | ▲ 1.27 | ▲ 9.71% | ▲ 9.75% |
| | Qwen3-Coder-480B | 24.70 | 26.22 | 0.189 | 12.58 | 24.10 | 28.80 | 0.199 | 12.12 | 5.20 | 66.87 | ▼ 0.60 | ▲ 2.58 | ▲ 5.29% | ▼ 3.66% |
| Kimi CLI | Kimi K2.5 | 55.42 | 12.62 | 0.103 | 7.31 | 56.02 | 11.51 | 0.104 | 7.10 | 1.50 | 66.87 | ▲ 0.60 | ▼ 1.11 | ▲ 0.97% | ▼ 2.87% |
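The Δ columns in Table 1 mix units: completion and turns are absolute differences (skills-evolve minus vanilla), while cost and output-token deltas are relative changes in percent. A minimal sketch of that computation, illustrative rather than the paper's code:

```python
def deltas(vanilla: dict, evolve: dict) -> dict:
    """Compute Table-1-style Δ columns from the two settings.

    `vanilla` / `evolve` hold 'comp' (completion %), 'turns',
    'cost', and 'tok' (output tokens) for one model.
    Completion and turns: absolute difference.
    Cost and tokens: relative change in percent.
    """
    return {
        "comp": evolve["comp"] - vanilla["comp"],
        "turns": evolve["turns"] - vanilla["turns"],
        "cost_pct": 100.0 * (evolve["cost"] - vanilla["cost"]) / vanilla["cost"],
        "tok_pct": 100.0 * (evolve["tok"] - vanilla["tok"]) / vanilla["tok"],
    }

# Claude Opus 4.6 row from Table 1:
row = deltas(
    {"comp": 62.65, "turns": 17.34, "cost": 0.665, "tok": 3.00},
    {"comp": 71.08, "turns": 19.00, "cost": 0.615, "tok": 2.39},
)
# comp ≈ +8.43, turns ≈ +1.66, cost_pct ≈ -7.52, tok_pct ≈ -20.33
```

This reproduces the reported +8.43-point completion gain alongside a 7.52% cost reduction, which is why a ▲ in the completion column can coexist with ▼ marks elsewhere.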

Main Findings

Finding 1

Opus 4.6 is the clearest positive case

Claude Opus 4.6 rises from 104/166 to 118/166 solved tasks, and a history-only control reaches just 51.04%, suggesting the gain comes from structured skill externalization rather than longer context alone.

Finding 2

Bad skills can create downstream drift

Once an incorrect abstraction is written into the library, later tasks may inherit the same mistake, turning a local failure into a sequence-level pattern.

Finding 3

Compact evolving skills beat fragmented ones

The strongest libraries are built around a small number of repeatedly repaired high-utility skills, not a pile of narrowly scoped task-by-task memories.

Finding 4

Qwen and some MiniMax settings suffer from skill inflation

Several weaker settings keep adding overlapping skills almost monotonically with task index, yet still fail to convert that growth into benchmark-level gains.

Finding 5

Codex stays compact, but compactness is not enough

Codex often consolidates nearby variants into a shared evolving core skill, but that organizational strength alone does not match the strongest end-to-end gains.

Finding 6

Repairing bad skills is harder than writing them

Most models can write something after a task, but the real gap is whether they can recognize a flawed skill, repair it, and obtain better behavior on later tasks.

Benchmark Design

  • DAEF-structured workflow families: tasks are grouped by shared executable topology rather than shallow domain overlap.
  • Family-local curricula: agents start with an empty library, follow a fixed within-family difficulty order, and reset the library across different workflows.
  • Auditable skill patches: every update records a summary plus upsert_files and delete_paths, making repair and uncontrolled growth directly inspectable.
  • Docker-based closed evaluation: tasks run inside controlled containerized environments, so transfer is measured under stable execution constraints instead of open-world drift.
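The auditable patch format above (a summary plus upsert_files and delete_paths) can be applied as a pure function over the library. The sketch below assumes the library is a dict mapping file paths to contents; this is an illustration, not the benchmark's implementation.

```python
def apply_patch(library: dict, patch: dict) -> dict:
    """Apply one auditable skill patch to a family-local library.

    `patch` carries the fields named in the benchmark design:
    a human-readable 'summary', 'upsert_files' (path -> new contents),
    and 'delete_paths' (paths to remove).
    """
    updated = dict(library)                        # never mutate in place
    for path in patch.get("delete_paths", []):
        updated.pop(path, None)                    # deleting a missing path is a no-op
    updated.update(patch.get("upsert_files", {}))  # add or overwrite skill files
    return updated

# Hypothetical patch repairing one skill and retiring an old variant:
patch = {
    "summary": "Repair CSV-join skill: handle missing header row",
    "upsert_files": {"skills/csv_join.md": "…repaired skill body…"},
    "delete_paths": ["skills/csv_join_v0.md"],
}
```

Because each patch is a small, explicit diff over the library, both repair (upserts that overwrite a flawed skill) and uncontrolled growth (upserts that only ever add) are directly visible in the patch log.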

BibTeX

@article{zhang2026skillflow,
  title={SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents},
  author={Zhang, Ziao and Shi, Kou and Huang, Shiting and Nie, Avery and Zeng, Yu and Zhao, Yiming and Fang, Zhen and Su, Qisheng and Qiu, Haibo and Yang, Wei and Ren, Qingnan and Zou, Shun and Huang, Wenxuan and Chen, Lin and Chen, Zehui and Zhao, Feng},
  year={2026},
  journal={arXiv preprint arXiv:2604.17308},
  eprint={2604.17308},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2604.17308}
}