Key Takeaways
- Eight active benchmarks now evaluate AI science agents, up from five in early April 2026.
- Scope ranges from 23 coding tasks (PinchBench) to 300 multimodal trajectories (Claw-Eval) to 153 live-web tasks (ClawBench).
- No frontier model has crossed 60% on any agent-style benchmark; ClawMark's best score is 55%, ClawBench's best is 33.3%.
- Scoring methods differ fundamentally: rubric checklists (ResearchClawBench), trajectory audits (Claw-Eval), prompt-injection success rate (ClawSafety), real-world task completion (PinchBench).
- Domain coverage spans general agents → coding → web tasks → bioinformatics → safety → autonomous research → multi-day enterprise workflows.
- Author overlap is rare but real: SJTU's Wanghan Xu group has shipped both ResearchClawBench (40 agent tasks) and SGI-Bench (1000+ LLM probes) — different scopes, easy to confuse.
- Adoption signal: Claw-Eval is already used internally by Qwen, GLM, and MiniMax — the strongest indicator a benchmark is doing useful work.
- Most benchmarks publish leaderboards; cross-comparison is now possible without rerunning experiments yourself.
What Is an AI Science Agent Benchmark?
An AI science agent benchmark is a standardized test suite that measures how well an autonomous AI agent — not a base LLM — performs scientific or professional work end-to-end.
Key differences from traditional LLM benchmarks:
- Tasks are open-ended — agents must plan, use tools, write code, and produce artifacts.
- Scoring is multi-step — judges evaluate trajectory, intermediate outputs, and final results, not single-shot answers.
- Environments are stateful — file systems, external APIs, browsers, and time-evolving data are part of the test.
- Model-agnostic — most suites support Claude, Codex, OpenClaw, NanoBot, EvoScientist, and custom agents.
This category matters because AI agents now write papers, run experiments, and ship code — and we need standardized ways to know whether they're doing it correctly.
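The mechanics above — open-ended tasks, rubric checks, trajectory-level grading — can be sketched with a toy scorer. All names here are illustrative and belong to no real benchmark's API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Trajectory:
    """Everything the agent produced: intermediate steps plus the final artifact."""
    steps: list[str] = field(default_factory=list)
    artifact: str = ""

@dataclass
class RubricItem:
    """One binary check; real suites attach hundreds of these per task."""
    description: str
    check: Callable[[Trajectory], bool]

def score(trajectory: Trajectory, rubric: list[RubricItem]) -> float:
    """Fraction of rubric items the trajectory satisfies (0.0 to 1.0)."""
    passed = sum(1 for item in rubric if item.check(trajectory))
    return passed / len(rubric)

# Hypothetical task: "download a dataset and report its row count".
rubric = [
    RubricItem("used a download tool", lambda t: any("download" in s for s in t.steps)),
    RubricItem("final artifact states a row count", lambda t: "rows" in t.artifact),
]
run = Trajectory(steps=["download dataset.csv", "count rows"], artifact="The file has 42 rows.")
print(score(run, rubric))  # 1.0
```

The key departure from LLM benchmarks is visible in the signature: the scorer consumes the whole trajectory, not a single answer string.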
The 8 Benchmarks at a Glance
| Benchmark | Tasks | Domain | Top Score | Scoring | Stars | Status |
|---|---|---|---|---|---|---|
| Claw-Eval | 300 | General agent | — | Trajectory + rubric (2,159 items) | 352 | Active |
| PinchBench | 23 | Coding agent | — | Auto-grade + LLM judge | 965 | Active |
| ClawMark | 100 | Multi-day enterprise | 55% | Cross-modal multi-turn | 51 | Active |
| ClawBench | 153 | Real-world web | 33.3% | Live website task completion | 47 | Active |
| ResearchClawBench | 40 | Autonomous research | 50 = match paper, 70+ = surpass | Expert checklist + LLM peer review | 67 | Active |
| BioAgent Bench | ~100 | Bioinformatics | — | Pipeline output + accuracy | 16 | Active |
| HeurekaBench | 40 | AI co-scientist | — | Real experimental research | 11 | ICLR 2026 |
| ClawSafety | 120 | Prompt injection | — | Attack success rate | 2 | Active |
Star counts are GitHub snapshots as of 2026-04-17. Top scores reflect the best frontier model reported by each benchmark.
Information Gain: What Each Benchmark Uniquely Tests
Claw-Eval — Trajectory-Aware Comprehensive Evaluation
Core attributes:
- Tasks: 300 human-verified
- Categories: 9 (including service orchestration, multimodal perception, multi-turn dialogue)
- Rubrics: 2,159 individual checks
- Modalities: Text, images, PDFs, video
- Evaluation axes: Completion + Safety + Robustness
- Used by: Qwen, GLM, MiniMax (production model evaluation)
Unique signal: The authors' experiments showed that outcome-only (trajectory-opaque) grading misses 44% of safety violations and 13% of robustness failures. Watching how the agent reaches the answer is not optional.
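Why outcome-only grading misses violations can be shown with a toy comparison. The unsafe-action patterns and function names below are invented for illustration, not Claw-Eval's actual checks:

```python
# Illustrative markers of unsafe intermediate actions — invented for this sketch.
UNSAFE_PATTERNS = ("rm -rf", "curl | sh", "chmod 777")

def outcome_only_grade(final_answer: str, expected: str) -> bool:
    """Grades only the final artifact; blind to how the agent got there."""
    return expected in final_answer

def trajectory_aware_grade(steps: list[str], final_answer: str, expected: str) -> bool:
    """Also fails the run if any intermediate step looks unsafe."""
    if any(p in step for step in steps for p in UNSAFE_PATTERNS):
        return False
    return expected in final_answer

steps = ["curl | sh install.sh", "run analysis", "write report"]
final = "Result: 7.3% improvement"
print(outcome_only_grade(final, "7.3%"))             # True  (violation invisible)
print(trajectory_aware_grade(steps, final, "7.3%"))  # False (unsafe step caught)
```

The same run passes one grader and fails the other — that divergence is exactly the 44% of safety violations an outcome-only judge never sees.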
PinchBench — The Practical Coding Agent Leaderboard
Core attributes:
- Tasks: 23 real-world
- Coverage: Productivity, research, writing, coding, analysis, email, memory, skills
- Leaderboard: Public at pinchbench.com
- Scoring: Auto-grade + LLM judge
Unique signal: PinchBench prioritizes practical outcomes over benchmark theatrics — it tests whether the agent actually completed the task you'd ask in real work.
ClawMark — Multi-Day Enterprise Workflows
Core attributes:
- Tasks: 100 across 13 professional domains
- Domains: Insurance, legal, EDA, finance, others
- Format: Multi-day, multimodal, dynamic environment
- Twist: New emails arrive, files update, schedules shift mid-task
- Best score: 55% (frontier models)
- Affiliation: NUS, Evolvent AI, HKU, MIT, UW, UC Berkeley, CUHK, HKUST (40+ scholars)
Unique signal: Most benchmarks freeze the environment. ClawMark mutates it — testing whether the agent notices, adapts, and recovers.
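An environment that mutates mid-task can be sketched as a schedule of mutations that fire as simulated time advances. This is a simplification with invented names, not ClawMark's actual harness:

```python
import heapq

class DynamicEnv:
    """World state plus scheduled mutations that fire as simulated time advances."""
    def __init__(self):
        self.state = {"inbox": [], "files": {}}
        self._events = []  # min-heap of (fire_time, seq, mutate_fn)
        self._seq = 0      # tie-breaker so heapq never compares functions

    def schedule(self, at, mutate):
        heapq.heappush(self._events, (at, self._seq, mutate))
        self._seq += 1

    def advance(self, now):
        """Apply every mutation scheduled at or before `now`."""
        while self._events and self._events[0][0] <= now:
            _, _, mutate = heapq.heappop(self._events)
            mutate(self.state)

env = DynamicEnv()
env.schedule(2, lambda s: s["inbox"].append("client changed the deadline"))
env.schedule(5, lambda s: s["files"].update({"policy.pdf": "v2"}))

env.advance(3)              # agent is mid-task; a new email has arrived
print(env.state["inbox"])   # ['client changed the deadline']
print(env.state["files"])   # {} — the file update has not fired yet
```

A frozen-environment benchmark only ever calls `advance(0)`; the agent's ability to notice the new inbox entry at time 3 is what a drifting environment uniquely tests.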
ClawBench — Real Web Tasks on Live Sites
Core attributes:
- Tasks: 153
- Categories: 15 life categories
- Sites: 144 live websites
- Top scores: Claude Sonnet 4.6 = 33.3%, GPT-5.4 = 6.5%
- Affiliation: UBC, Vector Institute, CMU, SJTU, Tsinghua
Unique signal: The sandbox-to-real-world performance gap is enormous. A model that passes 90%+ of tasks in synthetic web environments may still fail two-thirds of real ones.
ResearchClawBench — Re-Discovery to New-Discovery
Core attributes:
- Tasks: 40 real-science tasks
- Disciplines: 10 (Astronomy, Chemistry, Physics, Life sciences, others)
- Pipeline: Two-stage — autonomous research + LLM peer-review scoring
- Scoring: Score 50 = match the original paper; 70+ = surpass it
- Supported agents: Claude Code, Codex CLI, OpenClaw, NanoBot, EvoScientist, ResearchClaw, ARIS Codex
- Affiliation: Shanghai Jiao Tong University (InternScience)
Unique signal: Tasks are sourced from published papers with expert-curated checklists. Above 50 means the agent matched human-published work; above 70 means it produced something better.
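The 50/70 bar reduces to a simple interpretation rule. The function below is an invented sketch; the benchmark's real score comes from expert checklists combined with LLM peer review:

```python
def interpret(score: float) -> str:
    """Maps a 0-100 ResearchClawBench-style score to its headline meaning."""
    if score >= 70:
        return "surpasses the original paper"
    if score >= 50:
        return "matches the original paper"
    return "below published human work"

print(interpret(72))  # surpasses the original paper
print(interpret(55))  # matches the original paper
print(interpret(38))  # below published human work
```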
BioAgent Bench — Bioinformatics-Specific Tasks
Core attributes:
- Domain: Bioinformatics agents
- Coverage: Sequence analysis, genomics workflows, computational biology pipelines
- Format: Real bioinformatics tasks (not toy problems)
Unique signal: Domain-specific scoring tied to bioinformatics output correctness — not generic agent metrics.
HeurekaBench — AI Co-Scientist Framework
Core attributes:
- Venue: ICLR 2026
- Affiliation: EPFL Machine Learning & Bioinformatics Lab
- Focus: Experimental data-driven scientific research
- Format: Framework for creating benchmarks, not just one fixed benchmark
Unique signal: HeurekaBench is meta — it provides infrastructure to spin up new evaluation tasks for AI co-scientists across domains.
ClawSafety — Prompt Injection Under Realistic Conditions
Core attributes:
- Test cases: 120 adversarial
- Harm domains: 5
- Attack vectors: 3
- Harmful action types: 5
- Models tested: Claude, Gemini, GPT-5.1, DeepSeek
- Scaffolds tested: OpenClaw, NanoBot, NemoClaw
Unique signal: Chat safety ≠ agent safety. A model that refuses harmful chat requests can still be tricked when wrapped in an agent loop.
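Attack success rate is just the fraction of adversarial cases where the agent performed the harmful action, usually broken down by attack vector. A minimal sketch with an invented result format:

```python
from collections import defaultdict

def attack_success_rate(results):
    """results: list of (harm_domain, attack_vector, succeeded: bool) tuples.

    Returns the overall success rate and a per-vector breakdown.
    """
    overall = sum(ok for _, _, ok in results) / len(results)
    by_vector = defaultdict(lambda: [0, 0])  # vector -> [hits, total]
    for _, vector, ok in results:
        by_vector[vector][0] += ok
        by_vector[vector][1] += 1
    return overall, {v: hits / n for v, (hits, n) in by_vector.items()}

# Hypothetical results — domains and vectors invented for illustration.
results = [
    ("finance", "tool-output injection", True),
    ("finance", "tool-output injection", False),
    ("bio",     "web-page injection",    True),
    ("legal",   "email injection",       False),
]
overall, per_vector = attack_success_rate(results)
print(overall)                              # 0.5
print(per_vector["tool-output injection"])  # 0.5
```

The per-vector breakdown matters because a scaffold can be robust to one injection channel and wide open to another; a single aggregate number hides that.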
Scope vs Depth: How the 8 Benchmarks Trade Off
| Tradeoff | Wide Scope (Many Tasks) | Deep Scope (Few Tasks) |
|---|---|---|
| General agent | Claw-Eval (300) | PinchBench (23) |
| Domain-specific | ClawBench (153 web) | ResearchClawBench (40 science) |
| Specialized risk | ClawSafety (120 attacks) | HeurekaBench (40 experiments) |
| Enterprise | ClawMark (100 multi-day) | BioAgent Bench (~100) |
Pattern: Wide-scope benchmarks measure breadth; deep-scope benchmarks measure ceiling capability. Most labs need both.
Selecting a Benchmark by Use Case
Pick Claw-Eval if:
- You ship a general-purpose agent and need to evaluate Completion + Safety + Robustness together.
- You care about how the agent reaches its answer, not just the final output.
- You want a benchmark already trusted by major Chinese model labs.
Pick PinchBench if:
- You build coding agents and want a public leaderboard for credibility.
- You prefer practical, real-world tasks over synthetic problem sets.
- You want fast feedback — 23 tasks run faster than 300.
Pick ResearchClawBench if:
- Your agent claims to conduct scientific research independently.
- You need scoring grounded in real published papers, not synthetic tasks.
- You want a clear bar: 50 = match human work, 70 = exceed it.
Pick ClawBench if:
- Your agent operates on live websites, not sandboxed copies.
- You need to measure the sandbox-to-production capability gap.
- You care about real-world web navigation breadth across 15 categories.
Pick ClawMark if:
- Your agent must operate in enterprise environments with dynamic state.
- Tasks span multiple days and require multimodal context.
- You evaluate insurance, legal, EDA, or other professional workflows.
Pick BioAgent Bench if:
- Your domain is bioinformatics specifically.
- You need scoring tied to genomics pipeline correctness.
Pick HeurekaBench if:
- You're building an AI co-scientist for experimental research.
- You need a framework to generate new benchmarks, not just run an existing one.
Pick ClawSafety if:
- You need to know how your agent fares under prompt injection attacks.
- You operate in regulated or high-trust environments where safety is non-negotiable.
Common Confusions
ResearchClawBench vs SGI-Bench
Both are from Shanghai Jiao Tong University (lead author Wanghan Xu). They are different benchmarks.
| Attribute | ResearchClawBench | SGI-Bench |
|---|---|---|
| Subject under test | AI agent | Base LLM |
| Tasks | 40 real-science | 1,000+ cross-disciplinary |
| Source | Published papers | Science's 125 Big Questions |
| Scoring | Expert checklist + paper match | Practical Inquiry Model + TTRL |
| arXiv | (No paper as of 2026-04-17) | arxiv.org/abs/2512.16969 |
If you're evaluating an autonomous agent running real analyses, use ResearchClawBench. If you're probing a base model's scientific reasoning capacity, use SGI-Bench.
ClawSafety (Benchmark) vs ClawSafety (Scanner)
Two different projects share the name. The benchmark tests prompt injection across 120 cases. The scanner is a runtime safety tool. See /samename/clawsafety for full disambiguation.
FAQ
What's the difference between an LLM benchmark and an AI agent benchmark?
LLM benchmarks (MMLU, GPQA, HumanEval) test what a model knows in single-shot prompts. AI agent benchmarks test what an autonomous system can do end-to-end — including tool use, planning, multi-turn execution, error recovery, and final artifact quality.
Why are top scores so low (33%, 55%) on these benchmarks?
Real-world tasks are simply harder than the curated tasks designed to showcase raw capability. ClawBench at 33% and ClawMark at 55% expose the gap between sandbox performance and live execution; measuring that gap is the core reason these benchmarks exist.
Which benchmark should I use to compare frontier models?
For general capability: Claw-Eval (used by Qwen, GLM, MiniMax). For coding agents: PinchBench. For scientific research agents: ResearchClawBench. For real-world web execution: ClawBench. Most teams run two or three to triangulate.
Are these benchmarks open-source?
Yes — all eight are on GitHub with permissive licenses. Most include leaderboards, scoring scripts, and instructions to add your own agent. PinchBench, Claw-Eval, ClawBench, and ClawMark also publish public leaderboards.
Can I submit my own agent for evaluation?
Yes. PinchBench, ClawBench, ClawMark, and ResearchClawBench all accept community submissions — typically by opening a pull request or using a Hugging Face submission Space. ResearchClawBench moved task submissions to its HF Space.
Does any benchmark cover multi-agent systems?
Partial coverage. Claw-Eval handles multi-turn dialogue and orchestration. ClawMark explicitly tests multi-day workflows. None are dedicated to multi-agent evaluation yet — that gap is the next frontier.
How fast is the benchmark landscape changing?
Fast. The OpenClaw ecosystem went from five active benchmarks in early April 2026 to eight by mid-April. Expect 12+ by mid-2026 as more labs publish their evaluation suites.
Where do I track new benchmarks as they appear?
Claw4Science's benchmark group maintains a curated list with live GitHub stats, descriptions, and direct links. The group is updated within days of new benchmark releases.
Bottom Line
The AI science agent evaluation landscape now has eight production-grade benchmarks covering general capability, coding, web tasks, scientific research, bioinformatics, safety, multi-day enterprise work, and AI co-scientist frameworks.
No single benchmark is sufficient. Frontier models score below 60% on every agent-style suite, and each benchmark uniquely surfaces a different failure mode — trajectory opacity (Claw-Eval), sandbox-to-real-world gap (ClawBench), environment drift (ClawMark), paper-level reproduction (ResearchClawBench), or prompt injection (ClawSafety).
The field has moved past "does the model know things" into "can the system do things." That's the right question. We're still bad at it. These benchmarks are how we know.
