39 benchmarks
Graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult.
Humanity's Last Exam — extremely challenging questions designed to test the upper limits of AI capability across diverse domains.
Science coding benchmark measuring AI ability to solve scientific computing tasks.
Coding benchmark built from competitive programming contest problems, testing code generation and problem-solving ability.
Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains.
Instruction Following Benchmark measuring LLM ability to adhere to nuanced writing constraints and formatting requirements.
Long Context Retrieval benchmark testing ability to find and use information in very long documents.
Tau2 benchmark testing multi-turn agent capabilities in airline and retail domains.
Terminal-based benchmark testing AI ability to interact with command-line interfaces and solve system tasks.
American Invitational Mathematics Examination 2025 problems testing olympiad-level mathematical reasoning.
Competition mathematics problems requiring multi-step reasoning, covering algebra, geometry, number theory, and calculus.
American Invitational Mathematics Examination 2024 problems testing olympiad-level mathematical reasoning.
AGIEval English — human-level reasoning tasks from standardized exams like SAT, LSAT, and civil service exams.
Big-Bench Hard — challenging subset of BIG-Bench focusing on tasks where language models previously underperformed.
OpenAI HumanEval benchmark measuring Python code generation from function docstrings.
Mostly Basic Python Problems Plus — tests Python code generation with enhanced test cases.
Accounting and audit benchmark testing financial reasoning capabilities.
Multimodal Understanding benchmark testing vision-language models on expert-level tasks.
Grade School Math 8K — 8,500 high-quality grade-school math word problems.
Software Engineering benchmark testing ability to resolve real GitHub issues.
Simple question answering benchmark testing factual accuracy and knowledge retrieval.
AI2 Reasoning Challenge (Challenge set) — grade-school science questions requiring complex reasoning.
AI2 Reasoning Challenge (Easy set) — grade-school science questions.
Big-Bench Extra Hard — even more challenging reasoning tasks pushing the limits of language model capabilities.
Medical question-answering benchmark built from USMLE-style exam questions.
AGIEval Chinese — reasoning tasks from Chinese standardized exams (Gaokao, civil service).
Mathematics benchmark covering algebra, geometry, number theory, and calculus problems.
Massive Multitask Language Understanding — tests knowledge across 57 subjects.
Stock market benchmark testing financial analysis capabilities.
BIRD-CRITIC — multi-turn benchmark testing SQL generation and database interaction.
Logic puzzle benchmark based on knights (truth-tellers) and knaves (liars) puzzles.
Instruction Following Evaluation benchmark testing how well LLMs follow detailed formatting and content constraints.
Berkeley Function Calling Leaderboard v3 — testing function/tool calling accuracy.
Extended formal logic benchmark testing deductive and propositional reasoning.
Weapons of Mass Destruction Proxy — benchmark measuring hazardous knowledge in biosecurity, cybersecurity, and chemical security.
Workshop on Statistical Machine Translation 2014 (WMT14) — multilingual translation-quality benchmark.
DarkBench — benchmark testing for manipulative "dark patterns" in model responses, such as sycophancy and user-retention behavior.
GAIA — General AI Assistants benchmark testing multi-step real-world tasks.
MultiChallenge — benchmark testing LLMs on realistic multi-turn conversations across diverse challenge categories.
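The knights-and-knaves entry above lends itself to a short illustration. Below is a minimal brute-force solver for a hypothetical two-person puzzle in that style (the puzzle is illustrative, not drawn from the benchmark itself):

```python
from itertools import product

# Knights always tell the truth; knaves always lie.
# Hypothetical puzzle: A says "B is a knave."
#                      B says "A and I are the same kind."
# Find every assignment of (knight/knave) consistent with both statements.

def solve():
    solutions = []
    for a_knight, b_knight in product([True, False], repeat=2):
        stmt_a = not b_knight            # truth value of "B is a knave"
        stmt_b = (a_knight == b_knight)  # truth value of "A and I are the same kind"
        # A speaker's statement is true if and only if the speaker is a knight.
        if stmt_a == a_knight and stmt_b == b_knight:
            solutions.append((a_knight, b_knight))
    return solutions

print(solve())  # [(True, False)] — A is a knight, B is a knave
```

Benchmark instances scale this idea up to many speakers and nested statements, where the consistent assignment must be found by multi-step deduction rather than enumeration.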

93 out of our 301 tracked models have had a price change in March.
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.