One repo started a movement
On March 6, 2026, Andrej Karpathy pushed a repo called autoresearch. The idea was simple: an AI agent that runs experiments overnight. Modify code, run it, score the result, keep what works, discard what doesn't. Repeat until morning.
Six weeks later, the repo has 72,000+ stars. And it's spawned an entire family of variants — each adapting the core idea for a different platform, a different chip, or a different workflow.
We tracked six of them. Here's the family tree.
The Family Tree
karpathy/autoresearch (72K stars, Mar 6)
├── davebcn87/pi-autoresearch (3.6K, Mar 11) — generic version for any metric
├── trevin-creator/autoresearch-mlx (1.4K, Mar 8) — Apple Silicon port
├── uditgoenka/autoresearch (3.7K, Mar 13) — multi-mode Claude Code plugin
├── drivelineresearch/autoresearch-claude-code (258, Mar 12) — Claude Code skill port
└── leo-lilinxiao/codex-autoresearch (1.4K, Mar 18) — Codex-native with smart recovery

Total family: 82,000+ stars across 6 repos. All created within 12 days of each other.
The Original: karpathy/autoresearch
72,499 stars · Python · Last push: Mar 26
The one that started it all. Karpathy's original implementation is specific: it runs AI research on single-GPU nanochat training automatically. The agent modifies training code, executes it, measures loss, and keeps improvements.
The genius is in the simplicity. The loop is just:
```python
while True:
    modify()       # Agent changes something
    run()          # Execute the experiment
    score()        # Measure the result
    if better:
        keep()     # Commit the change
    else:
        revert()   # Throw it away
```

That's it. No multi-agent orchestration, no knowledge graphs, no 23-stage pipelines. Just a tight loop that runs all night.
Use this if: You want the original experience on a Linux GPU box doing ML training experiments.
pi-autoresearch: The Generalist
3,600 stars · TypeScript · Last push: Apr 13 · View on Claw4Science
The first major fork broke the loop free from ML training. pi-autoresearch works on any measurable optimization target — not just loss curves. Code performance, test coverage, latency, memory usage — if you can measure it, pi-autoresearch can optimize it.
Built as an extension for the Pi agent (hence the name), it's the version that proved the autoresearch pattern isn't specific to ML. It's a general-purpose autonomous improvement loop.
Key difference from original: Works on any codebase with any metric, not just ML training.
Use this if: You want the autoresearch loop for non-ML projects — web performance, compiler optimization, algorithm tuning.
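The generalization is mostly an interface change: instead of hard-coding "training run + loss," the loop takes the metric and the mutation as inputs. A hedged sketch, in Python rather than pi-autoresearch's TypeScript, with made-up names (`optimize`, `measure`, `mutate`) that are not from the actual project:

```python
import random
from typing import Callable

def optimize(measure: Callable[[dict], float],
             mutate: Callable[[dict], dict],
             config: dict,
             iterations: int = 100,
             higher_is_better: bool = True) -> tuple[dict, float]:
    """Generic modify-run-score-keep loop over an arbitrary metric."""
    best_score = measure(config)
    for _ in range(iterations):
        candidate = mutate(config)
        score = measure(candidate)
        improved = (score > best_score) if higher_is_better else (score < best_score)
        if improved:
            config, best_score = candidate, score  # keep(); otherwise revert
    return config, best_score

# Hypothetical usage: tune a batch size against a simulated benchmark.
def measure(cfg: dict) -> float:
    return -abs(cfg["batch"] - 64)  # pretend 64 is the sweet spot

def mutate(cfg: dict) -> dict:
    return {"batch": max(1, cfg["batch"] + random.choice([-8, 8]))}

best, score = optimize(measure, mutate, {"batch": 8})
```

Swap `measure` for test coverage, p95 latency, or binary size and the same loop applies, which is the point the fork proved.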
autoresearch (uditgoenka): The Swiss Army Knife
3,716 stars · Shell · Last push: Apr 6 · View on Claw4Science
This one took the autoresearch concept and expanded it into a multi-mode Claude Code plugin. Beyond the core optimization loop, it adds:
- Debug mode — autonomous bug hunting
- Fix mode — find and fix issues systematically
- Security audit mode — scan for vulnerabilities
- Ship mode — prepare code for production
- Predict mode — forecast the impact of changes
It's the version that says "if an AI can optimize experiments, why not optimize everything?"
Key difference from original: Not just an experiment loop — it's a full autonomous development toolkit.
Use this if: You use Claude Code and want autoresearch plus debugging, security, and shipping capabilities in one skill.
autoresearch-mlx: The Mac Native
1,422 stars · Python · Last push: Mar 10 · View on Claw4Science
Karpathy's original needs CUDA. If you're on a Mac, you're out of luck — unless you use this fork. autoresearch-mlx replaces PyTorch with Apple's MLX framework, letting the overnight loop run natively on M-series chips.
No PyTorch, no CUDA, no Docker. Just pip install and go.
Created just 2 days after Karpathy's original — someone really wanted this on their MacBook Pro.
Key difference from original: MLX instead of PyTorch. Runs on Apple Silicon without any translation layers.
Use this if: You have a Mac and want local ML experiment loops without setting up a cloud GPU.
autoresearch-claude-code: The Skill Port
258 stars · Last push: Mar 24
The most direct port — takes pi-autoresearch and packages it as a drop-in Claude Code skill. No configuration, no setup. Install the skill, and Claude Code gains the ability to run autonomous experiment loops.
Key difference from original: It's a skill, not a standalone tool. Lives inside your agent.
Use this if: You want the simplest possible way to add autoresearch to Claude Code.
codex-autoresearch: The Codex Specialist
1,365 stars · Last push: today · View on Claw4Science
Built specifically for OpenAI's Codex agent. The key addition: smart recovery. When the loop gets stuck — same score for too many iterations, or an error that keeps recurring — it automatically tries a different approach instead of grinding on the same dead end.
The most actively maintained variant (updated today), possibly because Codex's market share is growing fast.
Key difference from original: Codex-native, with stuck-detection and automatic strategy switching.
Use this if: You use Codex and want autoresearch that doesn't get trapped in local optima.
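The article describes stuck-detection as "same score for too many iterations." One plausible way to implement that check, sketched here with illustrative names and thresholds (`StuckDetector`, `patience`) that are not taken from codex-autoresearch itself:

```python
class StuckDetector:
    """Flags when the loop should switch strategy: the score has not
    improved for `patience` consecutive iterations."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best: float | None = None
        self.stale = 0

    def update(self, score: float) -> bool:
        """Record a score; return True if the loop looks stuck."""
        if self.best is None or score < self.best:  # loss-style: lower is better
            self.best = score
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

When the detector fires, a variant like this one would presumably switch mutation strategy or restart from the last good checkpoint instead of retrying the same dead end.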
The Comparison Table
| Variant | Stars | Platform | Scope | Last Active |
|---|---|---|---|---|
| karpathy/autoresearch | 72,499 | Python/CUDA | ML training only | Mar 26 |
| pi-autoresearch | 3,600 | TypeScript/Pi | Any metric | Apr 13 |
| autoresearch (uditgoenka) | 3,716 | Claude Code | Multi-mode toolkit | Apr 6 |
| autoresearch-mlx | 1,422 | Python/MLX | ML on Apple Silicon | Mar 10 |
| autoresearch-claude-code | 258 | Claude Code | Skill port | Mar 24 |
| codex-autoresearch | 1,365 | Codex | ML with recovery | Today |
What the Family Tree Tells Us
1. Good ideas get ported, not forked.
None of these are GitHub forks. They're independent reimplementations. Each author saw the original, understood the pattern, and rebuilt it for their own context. That's a stronger signal than a fork — it means the idea is valuable, not just the code.
2. The loop is the innovation, not the implementation.
Karpathy's contribution wasn't the code (it's not that much code). It was the insight that a simple modify-run-score-keep loop, running overnight, can produce meaningful research results. Every variant preserves this core loop while changing everything around it.
3. Platform fragmentation is real.
Six variants in six weeks, each for a different platform (CUDA, MLX, Claude Code, Codex, Pi, generic). The AI coding tool ecosystem is fragmented enough that "works everywhere" is a feature, not an assumption.
4. The overnight loop is becoming infrastructure.
When Karpathy posted it, "AI runs experiments while you sleep" was a novelty. Six weeks and 82,000 stars later, it's becoming a standard capability that people expect their agent to have. The variants are converging on a shared interface: give it a metric, point it at code, come back in the morning.
Which One Should You Use?
Decision tree:
- Are you doing ML training on NVIDIA GPUs? → karpathy/autoresearch (the original)
- Are you on a Mac with Apple Silicon? → autoresearch-mlx
- Are you using Claude Code? → uditgoenka/autoresearch (most features) or autoresearch-claude-code (simplest)
- Are you using Codex? → codex-autoresearch
- Are you optimizing something that isn't ML? → pi-autoresearch
Or just try them all. They're all open source, and the loop only takes a few minutes to understand.
All 6 variants are listed in our project directory. For the full ecosystem of 142 AI science agents, visit claw4science.org.
