39 benchmarks
Graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult.
Humanity's Last Exam — extremely challenging questions designed to test the upper limits of AI capability across diverse domains.
Science coding benchmark measuring AI ability to solve scientific computing tasks.
Instruction Following Benchmark measuring LLM ability to adhere to nuanced writing constraints and formatting requirements.
Long Context Retrieval benchmark testing ability to find and use information in very long documents.
Tau2 benchmark testing multi-turn agent capabilities in airline and retail domains.
Terminal-based benchmark testing AI ability to interact with command-line interfaces and solve system tasks.
Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains.
Coding benchmark with problems drawn from competitive programming contests, testing code generation and problem-solving abilities.
American Invitational Mathematics Examination 2025 problems testing olympiad-level mathematical reasoning.
AGIEval English — human-level reasoning tasks from standardized exams like SAT, LSAT, and civil service exams.
Competition mathematics problems requiring multi-step reasoning, covering algebra, geometry, number theory, and calculus.
American Invitational Mathematics Examination 2024 problems testing olympiad-level mathematical reasoning.
Big-Bench Hard — challenging subset of BIG-Bench focusing on tasks where language models previously underperformed.
OpenAI HumanEval benchmark measuring Python code generation from function docstrings (see the docstring-to-code sketch below).
Accounting and audit benchmark testing financial reasoning capabilities.
Mostly Basic Python Problems Plus — tests Python code generation with enhanced test cases.
Multimodal Understanding benchmark testing vision-language models on expert-level tasks.
Software Engineering benchmark testing ability to resolve real GitHub issues.
Grade School Math 8K — 8,500 high-quality grade school math word problems (a worked example appears below).
Simple question answering benchmark testing factual accuracy and knowledge retrieval.
AI2 Reasoning Challenge (Challenge set) — grade-school science questions requiring complex reasoning.
BIG-Bench Extra Hard — a harder successor to Big-Bench Hard, with reasoning tasks that push the limits of current language models.
AI2 Reasoning Challenge (Easy set) — grade-school science questions.
Medical question-answering benchmark built from USMLE-style exam questions.
Mathematics benchmark covering algebra, geometry, number theory, and calculus problems.
Massive Multitask Language Understanding — tests knowledge across 57 subjects.
AGIEval Chinese — reasoning tasks from Chinese standardized exams (Gaokao, civil service).
Stock market benchmark testing financial analysis capabilities.
BIRD-CRITIC — multi-turn benchmark testing SQL generation and database interaction.
Logic puzzle benchmark based on knights (truth-tellers) and knaves (liars) puzzles (a brute-force solver sketch appears below).
Instruction Following Evaluation benchmark testing how well LLMs follow detailed formatting and content constraints (a sample constraint checker appears below).
Berkeley Function Calling Leaderboard v3 — testing function/tool calling accuracy (an example tool-call schema appears below).
Extended formal logic benchmark testing deductive and propositional reasoning.
Weapons of Mass Destruction Proxy — benchmark measuring hazardous knowledge in biosecurity, cybersecurity, and chemical security to probe safety boundaries.
DarkBench — benchmark measuring manipulative "dark pattern" behaviors in model outputs, such as sycophancy, brand bias, and user-retention tactics.
Workshop on Machine Translation 2014 — multilingual translation quality benchmark.
GAIA — General AI Assistants benchmark testing multi-step real-world tasks.
MultiChallenge — multi-turn conversation benchmark covering diverse challenge categories, such as instruction retention and consistency across turns.
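The HumanEval and MBPP+ entries above describe a docstring-to-code task format: the model is shown a function signature and docstring and must produce a body that passes hidden unit tests. The sketch below illustrates that format; the problem, reference solution, and tests are invented for illustration and are not taken from either benchmark.

```python
# Illustrative HumanEval/MBPP+-style task (invented, not from the benchmarks).
# The model sees the signature and docstring and must produce the body;
# the completion is then executed against unit tests to compute pass@k.

def running_max(nums):
    """Return a list where element i is the maximum of nums[:i+1].

    >>> running_max([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    # A correct completion the grader would accept:
    out, best = [], float("-inf")
    for x in nums:
        best = max(best, x)
        out.append(best)
    return out

# Grading runs tests like these against the model's completion.
assert running_max([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_max([]) == []
```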
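GSM8K problems are short word problems whose answer is a single number reached through a few arithmetic steps; graders compare only the final numeric answer. The problem below is a made-up example in that style, not an item from the dataset.

```python
# Invented GSM8K-style word problem:
# "A bakery sells muffins for $3 each. On Monday it sells 14 muffins and on
#  Tuesday it sells twice as many. How much money does it make in total?"

monday = 14                        # muffins sold Monday
tuesday = 2 * monday               # "twice as many" on Tuesday -> 28
total_muffins = monday + tuesday   # 42 muffins overall
revenue = 3 * total_muffins        # $3 per muffin -> $126

# Only the final number is compared against the reference answer.
assert revenue == 126
```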
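Knights-and-knaves puzzles can be solved mechanically by enumerating who is a knight and who is a knave and keeping only assignments consistent with every statement. The solver below encodes a classic two-person puzzle as a minimal sketch of that idea; it is not the benchmark's own harness.

```python
from itertools import product

# Two islanders. Knights always tell the truth, knaves always lie.
# A says: "We are both knaves."  (B says nothing.)
# A's statement must be true exactly when A is a knight.

def consistent(a_is_knight, b_is_knight):
    a_statement = (not a_is_knight) and (not b_is_knight)  # "we are both knaves"
    return a_statement == a_is_knight

solutions = [
    (a, b)
    for a, b in product([True, False], repeat=2)
    if consistent(a, b)
]

# The only consistent assignment: A is a knave, B is a knight.
assert solutions == [(False, True)]
```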
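IFEval-style constraints are verifiable programmatically rather than judged by another model. The checker below is a simplified sketch of that approach using made-up constraints (exactly three bullet points, no use of the word "very"); it is not IFEval's actual verification code.

```python
import re

# Hypothetical instruction: "Answer in exactly three bullet points, each
# starting with '- ', and do not use the word 'very'."
def follows_instructions(response: str) -> bool:
    lines = [ln for ln in response.strip().splitlines() if ln.strip()]
    exactly_three_bullets = len(lines) == 3 and all(ln.startswith("- ") for ln in lines)
    no_forbidden_word = re.search(r"\bvery\b", response, re.IGNORECASE) is None
    return exactly_three_bullets and no_forbidden_word

assert follows_instructions("- First point\n- Second point\n- Third point")
assert not follows_instructions("- Only one point, and it is very short")
```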
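Function-calling benchmarks such as BFCL v3 score whether a model emits a structured tool call whose name and arguments match a schema. The tool definition and expected call below are illustrative stand-ins, not BFCL's real data format.

```python
import json

# Hypothetical tool definition shown to the model (JSON-schema-style parameters).
tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# User: "What's the weather in Paris in celsius?"
# A correct response is a structured call, not prose:
model_call = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}

# Scoring compares the emitted call against the expected name and arguments.
expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
assert json.dumps(model_call, sort_keys=True) == json.dumps(expected, sort_keys=True)
```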