39 benchmarks
Graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry, designed to be "Google-proof": extremely difficult even with unrestricted web access.
Humanity's Last Exam — extremely challenging questions designed to test the upper limits of AI capability across diverse domains.
Science coding benchmark measuring AI ability to solve scientific computing tasks.
Instruction Following Benchmark measuring LLM ability to adhere to nuanced writing constraints and formatting requirements.
Long Context Retrieval benchmark testing ability to find and use information in very long documents.
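Long-context retrieval is commonly tested needle-in-a-haystack style: a fact is planted at a controlled depth inside filler text and the model is asked to recover it. A toy sketch of the prompt construction (the filler text and needle here are illustrative, not from any specific benchmark):

```python
def build_needle_prompt(needle: str, filler: str, depth: float, total_chars: int) -> str:
    """Place `needle` at a relative `depth` (0.0 = start, 1.0 = end) inside
    `filler` repeated out to `total_chars` characters. Illustrative sketch of
    the usual needle-in-a-haystack setup, not any benchmark's actual harness."""
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(depth * total_chars)
    return haystack[:pos] + needle + haystack[pos:]

prompt = build_needle_prompt(
    needle=" The secret code is 7421. ",
    filler="The grass is green. ",
    depth=0.5,
    total_chars=2000,
)
print("7421" in prompt)  # the needle survives in the middle of the haystack
```

Sweeping `depth` and `total_chars` is what produces the familiar retrieval heatmaps over context position and length.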
Tau2 (τ²-bench) — tests multi-turn agent capabilities, including tool use and user interaction, in airline and retail customer-service domains.
Terminal-based benchmark testing AI ability to interact with command-line interfaces and solve system tasks.
Coding benchmark built from recent competitive programming contest problems, testing code generation and problem-solving abilities.
Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains.
American Invitational Mathematics Examination 2025 problems testing olympiad-level mathematical reasoning.
AGIEval English — human-level reasoning tasks from standardized exams like SAT, LSAT, and civil service exams.
Competition mathematics problems requiring multi-step reasoning, covering algebra, geometry, number theory, and calculus.
American Invitational Mathematics Examination 2024 problems testing olympiad-level mathematical reasoning.
BIG-Bench Hard — challenging subset of BIG-Bench focusing on tasks where language models previously underperformed.
OpenAI HumanEval benchmark measuring Python code generation from function docstrings.
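HumanEval-style problems pair a function signature and docstring with hidden unit tests; a completion counts as correct if the tests run clean. A minimal sketch of that harness, using a made-up problem that is not from the actual HumanEval set:

```python
# Minimal HumanEval-style check: the "problem" is a signature plus docstring,
# the "completion" is the model-written body, and hidden asserts decide
# pass/fail. The problem below is illustrative, not a real HumanEval task.

PROMPT = '''
def count_vowels(s: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in s."""
'''

COMPLETION = '''
    return sum(1 for ch in s.lower() if ch in "aeiou")
'''

CHECK = '''
assert count_vowels("benchmark") == 2
assert count_vowels("AEIOU") == 5
assert count_vowels("xyz") == 0
'''

def run_problem(prompt: str, completion: str, check: str) -> bool:
    """Execute prompt + completion, then the hidden tests; True means pass."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)
        exec(check, namespace)
        return True
    except Exception:
        return False

print(run_problem(PROMPT, COMPLETION, CHECK))
```

Real harnesses run each completion in a sandboxed subprocess with a timeout rather than a bare `exec`, and report pass@k over multiple samples.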
Accounting and audit benchmark testing financial reasoning capabilities.
Mostly Basic Python Problems Plus — tests Python code generation with enhanced test cases.
Multimodal Understanding benchmark testing vision-language models on expert-level tasks.
Software Engineering benchmark testing ability to resolve real GitHub issues.
Grade School Math 8K — 8,500 high-quality grade-school math word problems.
Simple question answering benchmark testing factual accuracy and knowledge retrieval.
AI2 Reasoning Challenge (Challenge set) — grade-school science questions requiring complex reasoning.
AI2 Reasoning Challenge (Easy set) — grade-school science questions.
BIG-Bench Extra Hard — even more challenging reasoning tasks pushing the limits of language model capabilities.
Massive Multitask Language Understanding — tests knowledge across 57 subjects.
Mathematics benchmark covering algebra, geometry, number theory, and calculus problems.
Medical question-answering benchmark built from USMLE-style exam questions.
AGIEval Chinese — reasoning tasks from Chinese standardized exams (Gaokao, civil service).
BIRD-CRITIC — multi-turn benchmark testing SQL generation and database interaction.
Stock market benchmark testing financial analysis capabilities.
Instruction Following Evaluation benchmark testing how well LLMs follow detailed formatting and content constraints.
Logic puzzle benchmark based on knights (truth-tellers) and knaves (liars) puzzles.
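Knights-and-knaves puzzles have a clean formal structure: each statement must be true if its speaker is a knight and false if a knave, so small instances can be solved by brute force over truth assignments. A sketch using the classic two-islander puzzle (the specific puzzle is illustrative, not drawn from the benchmark itself):

```python
from itertools import product

def solve() -> list[tuple[bool, bool]]:
    """Classic puzzle: A says 'We are both knaves.' Knights always tell the
    truth and knaves always lie, so A's statement must hold exactly when A is
    a knight. Enumerate all knight/knave assignments and keep the consistent
    ones. Illustrative sketch, not the benchmark's own solver."""
    solutions = []
    for a_knight, b_knight in product([True, False], repeat=2):
        statement = (not a_knight) and (not b_knight)  # "we are both knaves"
        if statement == a_knight:  # true iff the speaker is a knight
            solutions.append((a_knight, b_knight))
    return solutions

print(solve())  # unique solution: A is a knave, B is a knight
```

Benchmark variants scale difficulty by adding more islanders and nested statements, which grows the search space exponentially.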
Berkeley Function Calling Leaderboard v3 — testing function/tool calling accuracy.
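Function-calling evaluations of this kind generally compare a model's emitted call (function name plus arguments) against a ground-truth schema. A rough sketch of that matching step (the schema format and checks here are illustrative, not BFCL's actual grader):

```python
def call_matches(schema: dict, call: dict) -> bool:
    """Check a model tool call against a ground-truth schema: the function
    name must match, all required parameters must be present, and no unknown
    parameters are allowed. Illustrative only, not BFCL's official scorer."""
    if call.get("name") != schema["name"]:
        return False
    params = schema["parameters"]
    args = call.get("arguments", {})
    required = {p for p, spec in params.items() if spec.get("required")}
    if not required.issubset(args):
        return False
    return set(args).issubset(params)

schema = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": "string", "required": True},
        "unit": {"type": "string", "required": False},
    },
}
good = {"name": "get_weather", "arguments": {"city": "Paris"}}
bad = {"name": "get_weather", "arguments": {"country": "France"}}
print(call_matches(schema, good), call_matches(schema, bad))  # True False
```

Real graders also type-check argument values and handle parallel and multi-step calls, which is where most of the leaderboard's difficulty comes from.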
Extended formal logic benchmark testing deductive and propositional reasoning.
Weapons of Mass Destruction Proxy — measures hazardous knowledge in biosecurity, cybersecurity, and chemical security as a proxy for misuse risk.
DarkBench — benchmark measuring dark design patterns in LLMs, such as sycophancy, brand bias, and user-retention behaviors.
GAIA — General AI Assistants benchmark testing multi-step real-world tasks.
Workshop on Machine Translation 2014 — multilingual translation quality benchmark.
MultiChallenge — multi-turn conversation benchmark testing abilities such as instruction retention and coherence across turns.
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.