Train Your Skills Like Models: TextGrad, SkillOpt, SkillClaw

What is a skill-evolution framework?

A skill-evolution framework is a system that automatically improves the natural-language instructions an LLM agent runs on — the SKILL.md files, prompts, or query templates — using real usage data instead of hand-rewriting. In 2026 three open-source projects defined the category: TextGrad (Stanford, Nature paper) provides text autograd as a primitive; SkillOpt (Microsoft Research) trains skills with validation-gated supervised learning; SkillClaw (Alibaba DreamX) evolves a shared skill library from multi-user trajectories. This post compares them and tells you which one to pick for which job.

All three are MIT-licensed, paper-anchored, and have ≥1.5K GitHub stars. Star counts verified on 2026-06-01.

The three frameworks at a glance

Project	Stars	License	Best for	Maintainer	Paper
SkillOpt	4,073	MIT	Train a single skill against a benchmark, like a neural net	Microsoft Research	arXiv:2605.23904
TextGrad	3,582	MIT	Build your own optimizer over text objects (prompts, skills, code)	Stanford Zou Group	Nature 2025
SkillClaw	1,537	MIT	Evolve a shared skill library from many real users	Alibaba AMAP-ML / DreamX	arXiv:2604.08377

Star count is not a quality proxy here. TextGrad's 3.6K stars sit on a primitive that many downstream optimizers reuse, SkillOpt's 4K on a complete training loop, and SkillClaw's 1.5K on a collective system with a runtime component. Different shapes, different markets.

Which skill-evolution framework should I use?

Decision tree:

I want to train a specific skill against a benchmark, with validation-gated edits and a deployable best_skill.md → SkillOpt
I want to build my own optimizer over arbitrary text objects (prompt, skill, query, code) → TextGrad
I run an agent serving many users and want their collective experience to improve everyone's skills → SkillClaw
I want to read the paper that started this entire category → TextGrad (Nature, 2025)
I want the most empirical results across models and harnesses → SkillOpt (52 cells, 6 benchmarks, 7 models, 3 harnesses)
I want immediate compatibility with OpenClaw / Hermes / nanobot / PicoClaw / NemoClaw → SkillClaw (broadest ecosystem support out of the box)

Now in detail.

SkillOpt — gradient descent for skills

SkillOpt is Microsoft Research's text-space optimizer that trains reusable natural-language skills for frozen LLM agents through trajectory-driven edits and validation-gated updates. 4,073 stars in three weeks (created 2026-05-08), MIT, arXiv:2605.23904.

The defining design choice is rigor borrowed from supervised learning. A separate optimizer model turns scored rollouts into bounded add / delete / replace edits on a single skill document; a candidate edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and an epoch-wise slow / meta update make skill training stable. Zero inference-time model calls at deployment — the deployed artifact is a compact best_skill.md (typically 300-2,000 tokens) that runs against the unchanged target model.
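To make the validation gate concrete, here is a minimal sketch of the accept-or-reject loop described above. It is not SkillOpt's code or API: `Edit`, `apply_edit`, `train_skill`, the toy proposer, and the toy scorer are all hypothetical names invented for illustration, the optimizer model is replaced by a fixed pool of candidate edits, and the epoch-wise slow / meta update is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # line position the edit targets
    text: str = ""  # new line content for add / replace


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a copy of the skill with one bounded edit applied."""
    lines = list(skill)
    if edit.op == "add":
        lines.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del lines[edit.index]
    elif edit.op == "replace":
        lines[edit.index] = edit.text
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return lines


Proposer = Callable[[List[str], Set[Edit]], Optional[Edit]]
Scorer = Callable[[List[str]], float]


def train_skill(
    skill: List[str],
    propose: Proposer,
    validate: Scorer,
    epochs: int = 2,
    edit_budget: int = 2,  # assumed stand-in for the textual learning-rate budget
) -> Tuple[List[str], float, Set[Edit]]:
    """Keep an edit only if it strictly improves the held-out score."""
    best, best_score = list(skill), validate(skill)
    rejected: Set[Edit] = set()  # rejected-edit buffer, fed back to the proposer
    for _ in range(epochs):
        for _ in range(edit_budget):
            edit = propose(best, rejected)
            if edit is None:
                break
            candidate = apply_edit(best, edit)
            score = validate(candidate)
            if score > best_score:  # strict improvement gate
                best, best_score = candidate, score
            else:
                rejected.add(edit)
    return best, best_score, rejected


# Toy stand-ins: a fixed edit pool instead of an optimizer model, and a
# scorer that rewards two required rules while penalising skill length.
REQUIRED = ("Cite sources.", "Show your work.")
POOL = [
    Edit("add", 1, "Cite sources."),
    Edit("add", 1, "Be verbose."),
    Edit("add", 2, "Show your work."),
]


def toy_propose(skill: List[str], rejected: Set[Edit]) -> Optional[Edit]:
    for edit in POOL:
        if edit not in rejected and edit.text not in skill:
            return edit
    return None


def toy_validate(skill: List[str]) -> float:
    hits = sum(1 for rule in REQUIRED if rule in skill)
    return 10 * hits - len(skill)


best, score, rejected = train_skill(["Answer the question."], toy_propose, toy_validate)
print(best, score)
```

In this toy run the "Be verbose." edit lengthens the skill without raising the score, so it lands in the rejected buffer and is never proposed again. That is the behaviour the gate is meant to enforce.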

The empirical bar is the strongest in this space:

52 (model × benchmark × harness) cells evaluated, all best or tied-best
6 benchmarks, 7 target models, 3 execution harnesses (direct chat, Codex CLI, Claude Code CLI)
On GPT-5.5: +23.5 points in direct chat, +24.8 inside Codex agentic loop, +19.1 inside Claude Code
Optimized skill artifacts transfer across model scales, between Codex and Claude Code harnesses, and to nearby benchmarks without further optimization

Best for: anyone with a concrete skill to train and a benchmark to train it against. The "neural-network-style discipline applied to text" framing makes it easy to explain to ML colleagues.

TextGrad — the autograd primitive

TextGrad is the Stanford Zou Group's framework for automatic differentiation via text: it uses LLMs to backpropagate textual gradients across prompts, skills, queries, and code. 3,582 stars, MIT, published in Nature in 2025. From the same lab as Virtual Lab and CellVoyager (both featured in our Biomedical AI Agents 2026 roundup).

The defining design choice is generality. Where SkillOpt optimizes one specific target (a SKILL.md file) with one specific protocol (validation-gated edits), TextGrad provides the underlying primitive — a backward() over arbitrary text — and lets you compose it into whatever optimization scheme fits your problem.

Concretely, TextGrad lets you treat a prompt, a piece of code, a SQL query, or a SKILL.md as a "variable", attach a loss function written in natural language, and run an optimizer that updates the variable in the gradient direction the loss suggests. The same machinery handles all four. Downstream optimizers including SkillOpt and parts of DSPy compose on top of similar ideas.
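The variable / loss / optimizer pattern can be illustrated with a small self-contained toy. This is not the TextGrad library's API: `TextVariable`, `backward`, `step`, and `optimize` are hypothetical names, and the two stub functions stand in for the LLM calls that would normally produce the textual feedback and the rewrite.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TextVariable:
    value: str                  # the text being optimized
    role: str                   # what the text is for
    grad: Optional[str] = None  # textual feedback, playing the role of a gradient


def backward(var: TextVariable, critic: Callable[[str], str]) -> None:
    """Store the critic's feedback on the variable as its 'gradient'."""
    var.grad = critic(var.value)


def step(var: TextVariable, rewriter: Callable[[str, str], str]) -> None:
    """Rewrite the variable in the direction the feedback suggests."""
    if var.grad:
        var.value = rewriter(var.value, var.grad)
    var.grad = None


def optimize(var, critic, rewriter, max_steps: int = 5) -> int:
    """Alternate backward and step until the critic has no feedback left."""
    for n in range(max_steps):
        backward(var, critic)
        if not var.grad:
            return n  # number of updates applied
        step(var, rewriter)
    return max_steps


# Stubs standing in for LLM calls (assumptions for illustration only).
def stub_critic(query: str) -> str:
    if "LIMIT" not in query.upper():
        return "Add a LIMIT clause so the query cannot return unbounded rows."
    if "SELECT *" in query.upper():
        return "Select only the needed columns instead of *."
    return ""


def stub_rewriter(query: str, feedback: str) -> str:
    if "LIMIT" in feedback:
        return query.rstrip(";") + " LIMIT 100;"
    if "columns" in feedback:
        return query.replace("SELECT *", "SELECT id, score")
    return query


query = TextVariable("SELECT * FROM runs;", role="SQL query under refinement")
updates = optimize(query, stub_critic, stub_rewriter)
print(updates, query.value)
```

The same loop would work unchanged if `value` held a prompt or a SKILL.md body. Only the critic and the rewriter need to know what kind of text they are looking at, which is the sense in which one mechanism covers all four cases.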

Best for: researchers and builders who want to design their own optimization loop. If your problem is something other than "train this skill against this benchmark" — e.g. "iteratively refine this SQL query until it explains the experiment results" — TextGrad gives you the primitives and stays out of your way.

The Nature paper is the canonical reference and gives an intuition for why text autograd makes sense as a general formalism.

SkillClaw — collective evolution from multi-user traces

SkillClaw is Alibaba DreamX Team's framework for collective skill evolution in multi-user agent ecosystems. 1,537 stars, MIT, arXiv:2604.08377. The defining design choice — and the one that makes it distinct from SkillOpt and TextGrad — is the signal source: SkillClaw treats real cross-user interactions as the primary training signal, not a curated benchmark.

The pipeline: an aggregator continuously collects trajectories generated by all users during agent use, an autonomous evolver identifies recurring behavioural patterns across them, and the evolver writes updates back into the shared skill set — either refining existing skills or extending them with new capabilities. Improvements discovered in one context propagate system-wide, with zero additional effort from individual users.
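A toy version of that aggregate-then-evolve flow might look like the following. It is a sketch under heavy assumptions rather than SkillClaw's implementation: `Trajectory`, `aggregate`, `evolve`, and the `min_users` threshold are hypothetical, and a simple count of distinct users replaces the LLM-based evolver that identifies recurring patterns.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Trajectory:
    user: str
    skill: str
    failure_note: str  # "" when the run succeeded


def aggregate(trajectories: Iterable[Trajectory]) -> Dict[Tuple[str, str], Set[str]]:
    """Group failures by (skill, note) and record which users hit each one."""
    groups: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for t in trajectories:
        if t.failure_note:
            groups[(t.skill, t.failure_note)].add(t.user)
    return groups


def evolve(
    library: Dict[str, List[str]],
    trajectories: Iterable[Trajectory],
    min_users: int = 2,  # assumed threshold for calling a pattern "recurring"
) -> Dict[str, List[str]]:
    """Write recurring cross-user patterns back into a copy of the shared library."""
    updated = {name: list(lines) for name, lines in library.items()}
    groups = aggregate(trajectories)
    for skill, note in sorted(groups):
        if len(groups[(skill, note)]) >= min_users:
            lines = updated.setdefault(skill, [])  # may create a new skill
            rule = f"Known pitfall: {note}"
            if rule not in lines:
                lines.append(rule)
    return updated


library = {"csv-export": ["Export tables as CSV."]}
trajectories = [
    Trajectory("alice", "csv-export", "escape embedded commas"),
    Trajectory("bob", "csv-export", "escape embedded commas"),
    Trajectory("carol", "csv-export", "use UTF-8 BOM"),
    Trajectory("alice", "csv-export", ""),
]
print(evolve(library, trajectories))
```

Here the comma-escaping failure reaches two distinct users and is promoted into the shared skill, while the single-user note is left out. That filtering is what separates a cross-user signal from one user's noise.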

The problem framing is explicit in the paper:

"The challenge is not only to improve performance within a single session, but also to enable cross-user knowledge transfer."

In other words, SkillClaw is not trying to make one agent try harder in one session; it is trying to stop the entire user population from rediscovering the same patches over and over.

Out-of-the-box compatibility is the broadest in this space: Hermes, OpenClaw, Codex, Claude Code, QwenPaw, IronClaw, PicoClaw, ZeroClaw, NanoClaw, NemoClaw, and any OpenAI-compatible API.

Validated on WildClawBench; reported as significantly improving Qwen3-Max in real-world agent scenarios.

Best for: anyone operating an agent that serves many users — internal tools, SaaS, multi-tenant deployments — where the aggregate experience contains signal that no single user's trajectory does.

How do they compare on the architecture axes?

Axis	TextGrad	SkillOpt	SkillClaw
Layer	Primitive (text autograd)	Training loop	Collective system
Signal source	Loss function (natural language)	Scored rollouts on validation set	Real cross-user trajectories
Optimization unit	Any text variable	A single SKILL.md document	A shared skill repository
Update protocol	Gradient step	Validation-gated add/delete/replace edits	Autonomous evolver against shared repo
Deployment artifact	Whatever you defined	Compact `best_skill.md` (300-2,000 tokens)	Updated shared library, synced to users
Inference-time cost	Depends on caller	Zero	Zero (runs in background)
Best empirical bar	Nature paper, broad scope	52/52 cells best or tied-best	Qwen3-Max lift on WildClawBench
Maintainer	Stanford Zou Group	Microsoft Research	Alibaba AMAP-ML / DreamX

The three projects are not competing for the same job. TextGrad is the algebra (the autograd primitive), SkillOpt is one specific optimizer in the spirit of that algebra, and SkillClaw is the runtime system that evolves and distributes skills across an actual user population.

A clean mental model: TextGrad is autograd. SkillOpt is a PyTorch-style Trainer loop. SkillClaw is Hugging Face Hub plus an evolutionary update bot.

Why this matters

For five years the assumption in agent design has been: (a) write skills by hand, (b) improve them by hand. The Anthropic skill standard accelerated (a). Skill-evolution frameworks now attack (b) head-on. If the empirical numbers from SkillOpt hold up (+19 to +25 points across harnesses) and the multi-user signal in SkillClaw generalises, the marginal value of a hand-written skill drops fast.

The natural follow-on questions:

Does TextGrad-style text autograd become the standard primitive that DSPy, SkillOpt, SkillClaw, and others converge on?
Does SkillOpt-style "ML discipline for text" become the default training protocol, the way Adam became default for weights?
Does SkillClaw-style "data flywheel for skills" become the default deployment shape for any multi-tenant agent?

We watch these signals every month in our ecosystem reports.

FAQ

Are these three skill-evolution frameworks free?

Yes. All three are MIT-licensed. Free for personal, academic, and commercial use including modification.

Do I still need to write skills by hand?

To start, yes — all three frameworks need an initial skill (a SKILL.md, a prompt, or a starter template) as input. They improve what you wrote; they don't conjure a skill from nothing. Bootstrapping is still on you.

Which model do they require?

All three are model-agnostic. SkillOpt was evaluated on 7 models including GPT-5.5. TextGrad works with any model that can score text outputs. SkillClaw needs an autonomous-evolver LLM (usually Qwen / GPT-class) plus the target model whose skills are being evolved.

Can I combine them?

In principle yes. TextGrad's primitives can be embedded into a SkillOpt-style training loop, and the output of either can feed a SkillClaw-style cross-user deployment system. In practice the three teams haven't yet published a worked example of the combination — likely a 2026 H2 development.

Which has the strongest peer-reviewed credentials?

TextGrad — Nature 2025. That is the strongest journal endorsement in the category. SkillOpt has an arXiv paper with strong empirical results across 52 cells. SkillClaw has an arXiv paper with WildClawBench validation.

Which integrates with which agent harness?

SkillOpt: Codex CLI, Claude Code CLI, and direct chat (3 harnesses evaluated)
TextGrad: model-agnostic, harness-agnostic — works wherever Python runs
SkillClaw: Hermes, OpenClaw, Codex, Claude Code, QwenPaw, IronClaw, PicoClaw, ZeroClaw, NanoClaw, NemoClaw, any OpenAI-compatible API (broadest)

What about DSPy?

DSPy is the older, more general framework for programming LLM systems. Its MIPROv2 and GEPA optimizers solve problems that overlap with SkillOpt's. We treat DSPy as adjacent infrastructure (not strictly a "skill-evolution" framework in the SkillOpt / SkillClaw sense). It's worth knowing about; it's not what we'd file in this category.

Try it

Train a skill against a benchmark → start with SkillOpt
Build your own text-optimizer → start with TextGrad
Deploy a self-improving skill library across many users → start with SkillClaw
Browse all → Claw4Science skill-evolution group

If you build a fourth skill-evolution framework (especially anything that pairs the three approaches), submit it — this category is fresh and we expect it to grow fast through 2026.

Last updated 2026-06-01. Star counts, paper links, and feature claims verified against the three official repos on the publication date.
