Key Takeaways
- Eight active benchmarks now evaluate AI science agents, up from five in early April 2026.
- Scope ranges from 23 coding tasks (PinchBench) to 300 multimodal trajectories (Claw-Eval) to 153 live-web tasks (ClawBench).
- No frontier model has crossed 60% on any agent-style benchmark; ClawMark's best score is 55%, ClawBench's best is 33.3%.
- Scoring methods differ fundamentally: rubric checklists (ResearchClawBench), trajectory audits (Claw-Eval), prompt-injection success rate (ClawSafety), real-world task completion (PinchBench).
- Domain coverage spans general agents → coding → web tasks → bioinformatics → safety → autonomous research → multi-day enterprise workflows.
- Author overlap is rare but real: SJTU's Wanghan Xu group has shipped both ResearchClawBench (40 agent tasks) and SGI-Bench (1000+ LLM probes) — different scopes, easy to confuse.
- Adoption signal: Claw-Eval is already used internally by Qwen, GLM, and MiniMax — the strongest indicator a benchmark is doing useful work.
- Most benchmarks publish leaderboards; cross-comparison is now possible without rerunning experiments yourself.
What Is an AI Science Agent Benchmark?
An AI science agent benchmark is a standardized test suite that measures how well an autonomous AI agent — not a base LLM — performs scientific or professional work end-to-end.
Key differences from traditional LLM benchmarks:
- Tasks are open-ended — agents must plan, use tools, write code, and produce artifacts.
- Scoring is multi-step — judges evaluate trajectory, intermediate outputs, and final results, not single-shot answers.
- Environments are stateful — file systems, external APIs, browsers, and time-evolving data are part of the test.
- Model-agnostic — most suites support Claude, Codex, OpenClaw, NanoBot, EvoScientist, and custom agents.
This category matters because AI agents now write papers, run experiments, and ship code — and we need standardized ways to know whether they're doing it correctly.
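The mechanics above — open-ended tasks, rubric checks, trajectory-level grading — can be sketched with a toy scorer. All names here are illustrative and belong to no real benchmark's API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Trajectory:
    """Everything the agent produced: intermediate steps plus the final artifact."""
    steps: list[str] = field(default_factory=list)
    artifact: str = ""

@dataclass
class RubricItem:
    """One binary check; real suites attach hundreds of these per task."""
    description: str
    check: Callable[[Trajectory], bool]

def score(trajectory: Trajectory, rubric: list[RubricItem]) -> float:
    """Fraction of rubric items the trajectory satisfies (0.0 to 1.0)."""
    passed = sum(1 for item in rubric if item.check(trajectory))
    return passed / len(rubric)

# Hypothetical task: "download a dataset and report its row count".
rubric = [
    RubricItem("used a download tool", lambda t: any("download" in s for s in t.steps)),
    RubricItem("final artifact states a row count", lambda t: "rows" in t.artifact),
]
run = Trajectory(steps=["download dataset.csv", "count rows"], artifact="The file has 42 rows.")
print(score(run, rubric))  # 1.0
```

The key departure from LLM benchmarks is visible in the signature: the scorer consumes the whole trajectory, not a single answer string.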
The 8 Benchmarks at a Glance
| Benchmark | Tasks | Domain | Top Score | Scoring | Stars | Status |
|---|---|---|---|---|---|---|
| Claw-Eval | 300 | General agent | — | Trajectory + rubric (2,159 items) | 352 | Active |
| PinchBench | 23 | Coding agent | — | Auto-grade + LLM judge | 965 | Active |
| ClawMark | 100 | Multi-day enterprise | 55% | Cross-modal multi-turn | 51 | Active |
| ClawBench | 153 | Real-world web | 33.3% | Live website task completion | 47 | Active |
| ResearchClawBench | 40 | Autonomous research | 50 = match paper, 70+ = surpass | Expert checklist + LLM peer review | 67 | Active |
| BioAgent Bench | ~100 | Bioinformatics | — | Pipeline output + accuracy | 16 | Active |
| HeurekaBench | 40 | AI co-scientist | — | Real experimental research | 11 | ICLR 2026 |
| ClawSafety | 120 | Prompt injection | — | Attack success rate | 2 | Active |
Star counts are GitHub snapshots as of 2026-04-17. Top scores reflect the best frontier model reported by each benchmark.
Information Gain: What Each Benchmark Uniquely Tests
Claw-Eval — Trajectory-Aware Comprehensive Evaluation
Core attributes:
- Tasks: 300 human-verified
- Categories: 9 (including service orchestration, multimodal perception, multi-turn dialogue)
- Rubrics: 2,159 individual checks
- Modalities: Text, images, PDFs, video
- Evaluation axes: Completion + Safety + Robustness
- Used by: Qwen, GLM, MiniMax (production model evaluation)
Unique signal: The authors' experiments showed that outcome-only (trajectory-opaque) grading misses 44% of safety violations and 13% of robustness failures. Watching how the agent reaches the answer is not optional.
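Why outcome-only grading misses violations can be shown with a toy comparison. The unsafe-action patterns and function names below are invented for illustration, not Claw-Eval's actual checks:

```python
# Illustrative markers of unsafe intermediate actions — invented for this sketch.
UNSAFE_PATTERNS = ("rm -rf", "curl | sh", "chmod 777")

def outcome_only_grade(final_answer: str, expected: str) -> bool:
    """Grades only the final artifact; blind to how the agent got there."""
    return expected in final_answer

def trajectory_aware_grade(steps: list[str], final_answer: str, expected: str) -> bool:
    """Also fails the run if any intermediate step looks unsafe."""
    if any(p in step for step in steps for p in UNSAFE_PATTERNS):
        return False
    return expected in final_answer

steps = ["curl | sh install.sh", "run analysis", "write report"]
final = "Result: 7.3% improvement"
print(outcome_only_grade(final, "7.3%"))             # True  (violation invisible)
print(trajectory_aware_grade(steps, final, "7.3%"))  # False (unsafe step caught)
```

The same run passes one grader and fails the other — that divergence is exactly the 44% of safety violations an outcome-only judge never sees.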
PinchBench — The Practical Coding Agent Leaderboard
Core attributes:
- Tasks: 23 real-world
- Coverage: Productivity, research, writing, coding, analysis, email, memory, skills
- Leaderboard: Public at pinchbench.com
- Scoring: Auto-grade + LLM judge
Unique signal: PinchBench prioritizes practical outcomes over benchmark theatrics — it tests whether the agent actually completed the task you'd ask in real work.
ClawMark — Multi-Day Enterprise Workflows
Core attributes:
- Tasks: 100 across 13 professional domains
- Domains: Insurance, legal, EDA, finance, others
- Format: Multi-day, multimodal, dynamic environment
- Twist: New emails arrive, files update, schedules shift mid-task
- Best score: 55% (frontier models)
- Affiliation: NUS, Evolvent AI, HKU, MIT, UW, UC Berkeley, CUHK, HKUST (40+ scholars)
Unique signal: Most benchmarks freeze the environment. ClawMark mutates it — testing whether the agent notices, adapts, and recovers.
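An environment that mutates mid-task can be sketched as a schedule of mutations that fire as simulated time advances. This is a simplification with invented names, not ClawMark's actual harness:

```python
import heapq

class DynamicEnv:
    """World state plus scheduled mutations that fire as simulated time advances."""
    def __init__(self):
        self.state = {"inbox": [], "files": {}}
        self._events = []  # min-heap of (fire_time, seq, mutate_fn)
        self._seq = 0      # tie-breaker so heapq never compares functions

    def schedule(self, at, mutate):
        heapq.heappush(self._events, (at, self._seq, mutate))
        self._seq += 1

    def advance(self, now):
        """Apply every mutation scheduled at or before `now`."""
        while self._events and self._events[0][0] <= now:
            _, _, mutate = heapq.heappop(self._events)
            mutate(self.state)

env = DynamicEnv()
env.schedule(2, lambda s: s["inbox"].append("client changed the deadline"))
env.schedule(5, lambda s: s["files"].update({"policy.pdf": "v2"}))

env.advance(3)              # agent is mid-task; a new email has arrived
print(env.state["inbox"])   # ['client changed the deadline']
print(env.state["files"])   # {} — the file update has not fired yet
```

A frozen-environment benchmark only ever calls `advance(0)`; the agent's ability to notice the new inbox entry at time 3 is what a drifting environment uniquely tests.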
ClawBench — Real Web Tasks on Live Sites
Core attributes:
- Tasks: 153
- Categories: 15 life categories
- Sites: 144 live websites
- Top scores: Claude Sonnet 4.6 = 33.3%, GPT-5.4 = 6.5%
- Affiliation: UBC, Vector Institute, CMU, SJTU, Tsinghua
Unique signal: The sandbox-to-real-world performance gap is enormous. A model that passes 90%+ of tasks in synthetic web environments may still fail two-thirds of real ones.
ResearchClawBench — Re-Discovery to New-Discovery
Core attributes:
- Tasks: 40 real-science tasks
- Disciplines: 10 (Astronomy, Chemistry, Physics, Life sciences, others)
- Pipeline: Two-stage — autonomous research + LLM peer-review scoring
- Scoring: Score 50 = match the original paper; 70+ = surpass it
- Supported agents: Claude Code, Codex CLI, OpenClaw, NanoBot, EvoScientist, ResearchClaw, ARIS Codex
- Affiliation: Shanghai Jiao Tong University (InternScience)
Unique signal: Tasks are sourced from published papers with expert-curated checklists. Above 50 means the agent matched human-published work; above 70 means it produced something better.
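The 50/70 bar reduces to a simple interpretation rule. The function below is an invented sketch; the benchmark's real score comes from expert checklists combined with LLM peer review:

```python
def interpret(score: float) -> str:
    """Maps a 0-100 ResearchClawBench-style score to its headline meaning."""
    if score >= 70:
        return "surpasses the original paper"
    if score >= 50:
        return "matches the original paper"
    return "below published human work"

print(interpret(72))  # surpasses the original paper
print(interpret(55))  # matches the original paper
print(interpret(38))  # below published human work
```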
BioAgent Bench — Bioinformatics-Specific Tasks
Core attributes:
- Domain: Bioinformatics agents
- Coverage: Sequence analysis, genomics workflows, computational biology pipelines
- Format: Real bioinformatics tasks (not toy problems)
Unique signal: Domain-specific scoring tied to bioinformatics output correctness — not generic agent metrics.
HeurekaBench — AI Co-Scientist Framework
Core attributes:
- Venue: ICLR 2026
- Affiliation: EPFL Machine Learning & Bioinformatics Lab
- Focus: Experimental data-driven scientific research
- Format: Framework for creating benchmarks, not just one fixed benchmark
Unique signal: HeurekaBench is meta — it provides infrastructure to spin up new evaluation tasks for AI co-scientists across domains.
ClawSafety — Prompt Injection Under Realistic Conditions
Core attributes:
- Test cases: 120 adversarial
- Harm domains: 5
- Attack vectors: 3
- Harmful action types: 5
- Models tested: Claude, Gemini, GPT-5.1, DeepSeek
- Scaffolds tested: OpenClaw, NanoBot, NemoClaw
Unique signal: Chat safety ≠ agent safety. A model that refuses harmful chat requests can still be tricked when wrapped in an agent loop.
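Attack success rate is just the fraction of adversarial cases where the agent performed the harmful action, usually broken down by attack vector. A minimal sketch with an invented result format:

```python
from collections import defaultdict

def attack_success_rate(results):
    """results: list of (harm_domain, attack_vector, succeeded: bool) tuples.

    Returns the overall success rate and a per-vector breakdown.
    """
    overall = sum(ok for _, _, ok in results) / len(results)
    by_vector = defaultdict(lambda: [0, 0])  # vector -> [hits, total]
    for _, vector, ok in results:
        by_vector[vector][0] += ok
        by_vector[vector][1] += 1
    return overall, {v: hits / n for v, (hits, n) in by_vector.items()}

# Hypothetical results — domains and vectors invented for illustration.
results = [
    ("finance", "tool-output injection", True),
    ("finance", "tool-output injection", False),
    ("bio",     "web-page injection",    True),
    ("legal",   "email injection",       False),
]
overall, per_vector = attack_success_rate(results)
print(overall)                              # 0.5
print(per_vector["tool-output injection"])  # 0.5
```

The per-vector breakdown matters because a scaffold can be robust to one injection channel and wide open to another; a single aggregate number hides that.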
Scope vs Depth: How the 8 Benchmarks Trade Off
| Tradeoff | Wide Scope (Many Tasks) | Deep Scope (Few Tasks) |
|---|---|---|
| General agent | Claw-Eval (300) | PinchBench (23) |
| Domain-specific | ClawBench (153 web) | ResearchClawBench (40 science) |
| Specialized risk | ClawSafety (120 attacks) | HeurekaBench (40 experiments) |
| Enterprise | ClawMark (100 multi-day) | BioAgent Bench (~100) |
Pattern: Wide-scope benchmarks measure breadth; deep-scope benchmarks measure ceiling capability. Most labs need both.
Selecting a Benchmark by Use Case
Pick Claw-Eval if:
- You ship a general-purpose agent and need to evaluate Completion + Safety + Robustness together.
- You care about how the agent reaches its answer, not just the final output.
- You want a benchmark already trusted by major Chinese model labs.
Pick PinchBench if:
- You build coding agents and want a public leaderboard for credibility.
- You prefer practical, real-world tasks over synthetic problem sets.
- You want fast feedback — 23 tasks run faster than 300.
Pick ResearchClawBench if:
- Your agent claims to conduct scientific research independently.
- You need scoring grounded in real published papers, not synthetic tasks.
- You want a clear bar: 50 = match human work, 70 = exceed it.
Pick ClawBench if:
- Your agent operates on live websites, not sandboxed copies.
- You need to measure the sandbox-to-production capability gap.
- You care about real-world web navigation breadth across 15 categories.
Pick ClawMark if:
- Your agent must operate in enterprise environments with dynamic state.
- Tasks span multiple days and require multimodal context.
- You evaluate insurance, legal, EDA, or other professional workflows.
Pick BioAgent Bench if:
- Your domain is bioinformatics specifically.
- You need scoring tied to genomics pipeline correctness.
Pick HeurekaBench if:
- You're building an AI co-scientist for experimental research.
- You need a framework to generate new benchmarks, not just run an existing one.
Pick ClawSafety if:
- You need to know how your agent fares under prompt injection attacks.
- You operate in regulated or high-trust environments where safety is non-negotiable.
Common Confusions
ResearchClawBench vs SGI-Bench
Both are from Shanghai Jiao Tong University (lead author Wanghan Xu). They are different benchmarks.
| Attribute | ResearchClawBench | SGI-Bench |
|---|---|---|
| Subject under test | AI agent | Base LLM |
| Tasks | 40 real-science | 1,000+ cross-disciplinary |
| Source | Published papers | Science's 125 Big Questions |
| Scoring | Expert checklist + paper match | Practical Inquiry Model + TTRL |
| arXiv | (No paper as of 2026-04-17) | arxiv.org/abs/2512.16969 |
If you're evaluating an autonomous agent running real analyses, use ResearchClawBench. If you're probing a base model's scientific reasoning capacity, use SGI-Bench.
ClawSafety (Benchmark) vs ClawSafety (Scanner)
Two different projects share the name. The benchmark tests prompt injection across 120 cases. The scanner is a runtime safety tool. See /samename/clawsafety for full disambiguation.
FAQ
What's the difference between an LLM benchmark and an AI agent benchmark?
LLM benchmarks (MMLU, GPQA, HumanEval) test what a model knows in single-shot prompts. AI agent benchmarks test what an autonomous system can do end-to-end — including tool use, planning, multi-turn execution, error recovery, and final artifact quality.
Why are top scores so low (33%, 55%) on these benchmarks?
Real-world tasks are simply harder than the curated tasks designed to showcase raw capability. ClawBench at 33% and ClawMark at 55% expose the gap between sandbox performance and live execution; measuring that gap is the core reason these benchmarks exist.
Which benchmark should I use to compare frontier models?
For general capability: Claw-Eval (used by Qwen, GLM, MiniMax). For coding agents: PinchBench. For scientific research agents: ResearchClawBench. For real-world web execution: ClawBench. Most teams run two or three to triangulate.
Are these benchmarks open-source?
Yes — all eight are on GitHub with permissive licenses. Most include leaderboards, scoring scripts, and instructions to add your own agent. PinchBench, Claw-Eval, ClawBench, and ClawMark also publish public leaderboards.
Can I submit my own agent for evaluation?
Yes. PinchBench, ClawBench, ClawMark, and ResearchClawBench all accept community submissions — typically by opening a pull request or using a Hugging Face submission Space. ResearchClawBench moved task submissions to its HF Space.
Does any benchmark cover multi-agent systems?
Partial coverage. Claw-Eval handles multi-turn dialogue and orchestration. ClawMark explicitly tests multi-day workflows. None are dedicated to multi-agent evaluation yet — that gap is the next frontier.
How fast is the benchmark landscape changing?
Fast. The OpenClaw ecosystem went from five active benchmarks in early April 2026 to eight by mid-April. Expect 12+ by mid-2026 as more labs publish their evaluation suites.
Where do I track new benchmarks as they appear?
Claw4Science's benchmark group maintains a curated list with live GitHub stats, descriptions, and direct links. The group is updated within days of new benchmark releases.
Bottom Line
The AI science agent evaluation landscape now has eight production-grade benchmarks covering general capability, coding, web tasks, scientific research, bioinformatics, safety, multi-day enterprise work, and AI co-scientist frameworks.
No single benchmark is sufficient. Frontier models score below 60% on every agent-style suite, and each benchmark uniquely surfaces a different failure mode — trajectory opacity (Claw-Eval), sandbox-to-real-world gap (ClawBench), environment drift (ClawMark), paper-level reproduction (ResearchClawBench), or prompt injection (ClawSafety).
The field has moved past "does the model know things" into "can the system do things." That's the right question. We're still bad at it. These benchmarks are how we know.
