SKILLFLOW: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Zhang, Ziao; Shi, Kou; Huang, Shiting; Nie, Avery

SKILLFLOW: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Ziao Zhang¹ Kou Shi¹ Shiting Huang¹ Avery Nie² Yu Zeng¹ Yiming Zhao¹ Zhen Fang¹ Qisheng Su¹ Haibo Qiu³ Wei Yang¹ Qingnan Ren¹ Shun Zou¹ Wenxuan Huang¹ Lin Chen¹ Zehui Chen¹ Feng Zhao^1*

¹ University of Science and Technology of China | ² University of Toronto | ³ University of Sydney

^* Corresponding author

SKILLFLOW studies whether autonomous agents can externalize experience into reusable skills, repair those skills after failure, and maintain a compact high-utility library over time.

arXiv Paper HuggingFace Paper GitHub Project HuggingFace Data

166tasks

20workflow families

5broad domains

+8.43best completion gain

Why SkillFlow

Most agent benchmarks stop at asking whether a model can consume a provided skill. SKILLFLOW asks a harder and more realistic question: can an agent solve a task, distill the lesson into a reusable skill artifact, repair that skill after later failures, and carry a better library forward through a family of related tasks?

Core idea. SKILLFLOW organizes 166 tasks into 20 workflow families across five domains, all grounded by Domain-Agnostic Execution Flow (DAEF) so that transfer is measured at the level of reusable procedure rather than topical overlap.

SkillFlow conceptual overview — The overview contrasts conventional static-skill evaluation with SKILLFLOW's lifelong setting, where agents externalize experience into reusable artifacts, revise them through patches, and transfer them across tasks that share a common DAEF.

Abstract

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time.

SKILLFLOW is a benchmark of 166 tasks across 20 workflow families in which task construction follows a shared Domain-Agnostic Execution Flow (DAEF). Under the proposed Agentic Lifelong Learning protocol, agents begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward.

Experiments reveal selective rather than universal gains. Claude Opus 4.6 improves from 62.65% to 71.08% task completion, but high skill usage does not automatically mean high utility: Kimi K2.5 gains only +0.60 despite 66.87% skill usage, while both Qwen-Coder-Next and Qwen3-Coder-480B regress under skill evolution.

Benchmark Visuals

These figures summarize the dual-agent construction pipeline, the five-domain benchmark taxonomy, and the DAEF correspondences that let SKILLFLOW study procedural transfer beyond surface-level domain overlap.

SkillFlow construction pipeline — The dual-agent construction pipeline starts from seed tasks and curated skills, expands new families under a fixed DAEF, and filters them through Docker-grounded validation plus human review.

SkillFlow data taxonomy — SKILLFLOW spans five broad domains and 20 workflow families, giving the benchmark both topical breadth and repeated opportunities for skill transfer within each family.

A DAEF example shows how the same abstract workflow can survive major changes in files, entities, and business semantics while still supporting reusable skill transfer.

What the Benchmark Tracks

SKILLFLOW reports both outcome and process: whether agents solve tasks, how efficiently they do so, and how their skill libraries evolve under sequential evaluation.

Task Success Rate

The primary score is whether the final output satisfies the verifier, giving a clean measure of end-to-end task completion.

Efficiency

SKILLFLOW also reports interaction turns, monetary cost, and output tokens to show whether gains come with better or worse execution efficiency.

Skill Generation and Reuse

Each setting tracks how many skills survive in the final family-local library and how often previously written skills are actually reused later.

Family-Local Transfer

Tasks follow a fixed order within each workflow family, making every improvement or regression traceable to how well lessons move forward under a shared DAEF.

Main Experimental Results

Table 1 reports benchmark-level averages for 11 model variants across Claude Code, Codex CLI, Qwen Coder, and Kimi CLI. The clearest positive case is Claude Opus 4.6, which improves from 62.65% to 71.08% completion under lifelong skill evolution.

▲ higher completion ▼ lower cost / tokens / turns ▼ lower completion ▲ higher cost / tokens / turns

Agent	Model	Vanilla				Skills Evolve						Δ
Agent	Model	%comp.	Turns	Cost	Out Tok.	%comp.	Turns	Cost	Out Tok.	#Skills	%use	%Comp.	Turns	Cost	Tok.
Claude Code	Claude Sonnet 4.5	49.4	25.04	0.293	1.07	55.42	24.88	0.246	0.85	2.55	72.89	▲ 6.02	▼ 0.16	▼ 16.04%	▼ 20.56%
	Claude Opus 4.5	58.43	18.83	0.571	1.5	60.84	18.31	0.384	1.4	1.5	60.84	▲ 2.41	▼ 0.52	▼ 32.87%	▼ 6.67%
	Claude Sonnet 4.6	56.63	17.48	0.168	1.23	56.63	17.42	0.245	1.59	2.55	53.01	• 0.00	▼ 0.06	▲ 45.83%	▲ 29.27%
	Claude Opus 4.6	62.65	17.34	0.665	3.00	71.08	19.00	0.615	2.39	1.05	45.78	▲ 8.43	▲ 1.66	▼ 7.52%	▼ 20.33%
	MiniMax M2.5	28.31	35.22	0.010	0.44	34.94	34.01	0.010	0.54	2.50	32.53	▲ 6.63	▼ 1.21	• 0%	▲ 22.73%
	MiniMax M2.7	37.35	25.44	0.012	0.5	36.75	27.42	0.017	0.96	4.65	51.2	▼ 0.60	▲ 1.98	▲ 41.67%	▲ 92%
Codex CLI	GPT 5.4	33.13	23.89	0.41	4.05	36.75	24.17	0.459	4.43	1.05	81.33	▲ 3.62	▲ 0.28	▲ 11.95%	▲ 9.38%
Codex CLI	GPT 5.3 Codex	52.41	17.74	0.492	6.8	46.39	17.14	0.434	6.82	1.1	84.94	▼ 6.02	▼ 0.60	▼ 11.79%	▲ 0.29%
Qwen Coder	Qwen-Coder-Next	45.18	18.64	0.103	9.74	44.58	19.91	0.113	10.69	5.45	12.05	▼ 0.60	▲ 1.27	▲ 9.71%	▲ 9.75%
Qwen Coder	Qwen3-Coder-480B	24.7	26.22	0.189	12.58	24.1	28.8	0.199	12.12	5.2	66.87	▼ 0.60	▲ 2.58	▲ 5.29%	▼ 3.66%
Kimi CLI	Kimi K2.5	55.42	12.62	0.103	7.31	56.02	11.51	0.104	7.10	1.50	66.87	▲ 0.60	▼ 1.11	▲ 0.97%	▼ 2.87%

Main Findings

Finding 1

Opus 4.6 is the clearest positive case

Claude Opus 4.6 rises from 104/166 to 118/166 solved tasks, and a history-only control reaches just 51.04%, suggesting the gain comes from structured skill externalization rather than longer context alone.

Finding 2

Bad skills can create downstream drift

Once an incorrect abstraction is written into the library, later tasks may inherit the same mistake, turning a local failure into a sequence-level pattern.

Finding 3

Compact evolving skills beat fragmented ones

The strongest libraries are built around a small number of repeatedly repaired high-utility skills, not a pile of narrowly scoped task-by-task memories.

Finding 4

Qwen and part of MiniMax suffer from skill inflation

Several weaker settings keep adding overlapping skills almost monotonically with task index, yet still fail to convert that growth into benchmark-level gains.

Finding 5

Codex stays compact, but compactness is not enough

Codex often consolidates nearby variants into a shared evolving core skill, but that organizational strength alone does not match the strongest end-to-end gains.

Finding 6

Repairing bad skills is harder than writing them

Most models can write something after a task, but the real gap is whether they can recognize a flawed skill, repair it, and obtain better behavior on later tasks.

Benchmark Design

DAEF-structured workflow families: tasks are grouped by shared executable topology rather than shallow domain overlap.
Family-local curricula: agents start with an empty library, follow a fixed within-family difficulty order, and reset the library across different workflows.
Auditable skill patches: every update records a summary plus upsert_files and delete_paths, making repair and uncontrolled growth directly inspectable.
Docker-based closed evaluation: tasks run inside controlled containerized environments, so transfer is measured under stable execution constraints instead of open-world drift.

BibTeX

@article{zhang2026skillflow,
  title={SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents},
  author={Zhang, Ziao and Shi, Kou and Huang, Shiting and Nie, Avery and Zeng, Yu and Zhao, Yiming and Fang, Zhen and Su, Qisheng and Qiu, Haibo and Yang, Wei and Ren, Qingnan and Zou, Shun and Huang, Wenxuan and Chen, Lin and Chen, Zehui and Zhao, Feng},
  year={2026},
  journal={arXiv preprint arXiv:2604.17308},
  eprint={2604.17308},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2604.17308}
}

Explore SkillFlow

Overview

Visuals

Main Results

Citation

SKILLFLOW: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Why SkillFlow

Abstract

Benchmark Visuals

What the Benchmark Tracks

Task Success Rate

Efficiency

Skill Generation and Reuse

Family-Local Transfer

Main Experimental Results

Main Findings

Opus 4.6 is the clearest positive case

Bad skills can create downstream drift

Compact evolving skills beat fragmented ones

Qwen and part of MiniMax suffer from skill inflation

Codex stays compact, but compactness is not enough

Repairing bad skills is harder than writing them

Benchmark Design

BibTeX