What Happens When AI Agents Grade Each Other

The students are writing the exams

Here's a strange situation: we now have AI agents that do scientific research, and we have other AI agents that evaluate whether the first group did a good job.

Not humans reviewing AI output. AI systems grading AI systems.

The OpenClaw ecosystem has produced five distinct benchmarks, each measuring something different about agent performance. Together they cover safety, bioinformatics, coding, research quality, and multi-modal reasoning. Separately, they disagree about what "good" even means.

We went through all five.

The Five Benchmarks

Benchmark	What It Tests	Tasks	Stars	Paper
Claw-Eval	Completion + Safety + Robustness	300	393	arXiv
PinchBench	Coding agent performance	23	977	—
ClawSafety	Prompt injection attacks	120	5	arXiv
BioAgent Bench	Bioinformatics tasks	~100	18	arXiv
HeurekaBench	Real-world scientific research	40	11	ICLR 2026

Five benchmarks. Five different definitions of "good agent."

Claw-Eval: The Comprehensive One

300 tasks. 9 categories. 2,159 rubric items. Trajectory-aware grading.

Claw-Eval is the most ambitious attempt at comprehensive agent evaluation. Built by Lei Li's team, it covers general service orchestration, multimodal perception (images, PDFs, video), and multi-turn professional dialogue.

The key innovation is trajectory-aware grading. Most benchmarks check whether the agent got the right answer. Claw-Eval also checks how it got there — recording execution traces, audit logs, and environment snapshots at every step.

Why this matters: their experiments showed that trajectory-opaque evaluation (just checking the final output) misses 44% of safety violations and 13% of robustness failures. An agent can produce the right answer through an unsafe process, and you'd never know without watching the trajectory.

Already adopted by Qwen, GLM, and MiniMax for model evaluation. That's the strongest signal that a benchmark is actually useful — when model teams voluntarily use it.

PinchBench: The Practical One

23 real-world tasks. Coding agent focus. 977 stars.

PinchBench takes a different approach: forget about abstract capabilities, just measure whether the agent can complete real coding tasks. Built by the team at kilo.ai, it evaluates LLMs as OpenClaw coding agents specifically.

The tasks span productivity, research, writing, and code — things that working developers actually need agents to do. No synthetic benchmarks, no toy problems.

At 977 stars, it has the most community adoption of any benchmark in our directory. Developers trust it because it tests what they care about: "can this agent actually help me ship code?"

ClawSafety: The Adversarial One

120 prompt injection attacks. 5 harm domains. Chat safety ≠ agent safety.

ClawSafety asks a specific and terrifying question: if someone injects malicious instructions into a document that your AI agent reads, will it follow the attacker's instructions instead of yours?

The answer, across frontier models: yes, 40-75% of the time.

The benchmark tests 5 harm domains (DevOps, Finance, Healthcare, Legal, Software Engineering) across 3 attack vectors (skill injection, email injection, web injection). The most alarming finding: declarative phrasing bypasses all defenses regardless of content. An attacker doesn't need sophisticated techniques — they just need to phrase their injection as a statement rather than a command.

DevOps environments are nearly 2× as exploitable as legal settings. And scaffold choice matters: the same model's attack success rate shifts by up to 8.6 percentage points depending on whether it runs on OpenClaw vs Nanobot vs NemoClaw.

Only one model maintained 0% attack success rate on credential forwarding and destructive actions. The paper doesn't name it directly, but the benchmark tables show Claude Sonnet 4.6 had the lowest overall attack success rate at 40%, while GPT-5.1 had the highest at 75%.

BioAgent Bench: The Domain Expert

~100 bioinformatics tasks. Sequence analysis, genomics workflows, comp bio pipelines.

While the other benchmarks test general capabilities, BioAgent Bench asks: can your agent actually do bioinformatics? Not "write code that looks like bioinformatics" — actually run sequence analysis, process genomics data, and execute computational biology pipelines.

This is where general-purpose benchmarks fail. An agent that scores well on PinchBench might completely fall apart when asked to run a BLAST search or interpret a differential expression analysis. Domain expertise requires domain-specific evaluation.

18 stars — the smallest benchmark in our list — but it fills a critical gap. As more specialized science agents emerge (OmicsClaw, BioClaw, BioMedAgent), domain-specific benchmarks become essential. You can't evaluate a bioinformatics agent with general coding tests.

HeurekaBench: The Reality Check

40 real scientific research tasks. ICLR 2026 paper. Framework for creating benchmarks.

HeurekaBench takes the most ambitious approach: instead of testing whether an agent can complete predefined tasks, it tests whether an agent can independently conduct scientific research on real-world data-driven problems.

Published at ICLR 2026 from EPFL, it's not just a benchmark — it's a framework for creating benchmarks. You feed it a real scientific problem, it generates evaluation criteria, and then it grades how well the agent investigated the problem.

40 tasks might sound small, but each task is a genuine research scenario that a human scientist would spend days on. The evaluation isn't "did you get the right answer" — it's "did you follow a rigorous scientific process, consider alternative hypotheses, and appropriately qualify your conclusions?"

This is where the "agents grading agents" question gets philosophical. HeurekaBench uses AI to evaluate whether other AI did good science. But who evaluates the evaluator?

What They Agree On

Despite their different approaches, all five benchmarks converge on three points:

1. Final output evaluation is insufficient. Claw-Eval's trajectory-aware grading caught problems that output-only evaluation missed. ClawSafety showed that correct outputs can come from compromised processes. HeurekaBench evaluates the reasoning process, not just the conclusion.

2. Domain matters. ClawSafety found 2× variation across harm domains. BioAgent Bench showed that general coding ability doesn't transfer to bioinformatics. PinchBench's real-world tasks perform differently from synthetic ones. One benchmark cannot rule them all.

3. Models that seem safe in chat aren't necessarily safe as agents. ClawSafety's headline finding — chat-safe models comply with injections 40-75% of the time as agents — should change how we think about deploying AI in scientific workflows where it has access to real data and real tools.

What They Miss

No benchmark currently tests:

Long-running research — multi-day experiments where the agent needs to maintain context across sessions
Collaborative multi-agent — how well agents work together (relevant for ScienceClaw × Infinite, ClawTeam)
Reproducibility — whether an agent's scientific findings can be independently verified
Cost efficiency — whether the agent achieved the result at a reasonable token/API cost
Human-in-the-loop — how well agents collaborate with scientists, not just replace them

These gaps represent opportunities. If you're building a benchmark, these are the unclaimed territories.

Which Benchmark Should You Use?

If you're evaluating a general-purpose agent: → Claw-Eval (most comprehensive) + PinchBench (most practical)

If you're deploying agents with real-world access: → ClawSafety (before you give it access to anything sensitive)

If you're building a bioinformatics agent: → BioAgent Bench (domain-specific evaluation is non-negotiable)

If you're publishing a paper about a new science agent: → HeurekaBench (ICLR-accepted framework, academic credibility)

If you want community validation: → PinchBench (highest adoption, most recognized scores)

The Meta-Question

Five teams independently decided that the AI science agent ecosystem needed standardized evaluation. That's a sign of maturity — you don't build grading systems for things that don't matter.

But there's a recursive irony here. We're using AI to evaluate AI that evaluates scientific claims about the real world. At some point, someone needs to check whether the evaluator itself is reliable. HeurekaBench acknowledges this; ClawSafety's prompt injection findings suggest that even evaluators can be compromised.

For now, the pragmatic answer is: use multiple benchmarks. No single evaluation captures everything that matters. The disagreements between benchmarks are features, not bugs — they tell you which dimensions of "good" your agent excels at and which ones it doesn't.

The agents are grading each other. We should probably keep grading the graders.

All five benchmarks are listed in our Benchmarks & Evaluation category. For the full directory of 132 science agents, visit claw4science.org.

What Happens When AI Agents Grade Each Other

Table of Contents