Price Per Token

LLM Benchmarks

Discover and compare AI model performance across 39 benchmarks.

Data from Artificial Analysis and LayerLens


GPQA

Reasoning and Logic

Graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult.

1. Gemini 3 Pro Preview (Google): 90.8
2. GPT-5.2 Pro (OpenAI): 90.3
3. Gemini 3 Flash Preview (Google): 89.8
4. Claude Opus 4.6 (Anthropic): 89.6

Humanity's Last Exam

Reasoning and Logic

Humanity's Last Exam — extremely challenging questions designed to test the upper limits of AI capability across diverse domains.

1. Gemini 3 Pro Preview (Google): 37.2
2. Claude Opus 4.6 (Anthropic): 36.7
3. GPT-5.2 Pro (OpenAI): 35.4
4. Gemini 3 Flash Preview (Google): 34.7

SciCode

Computer Science and Programming

Science coding benchmark measuring AI ability to solve scientific computing tasks.

1. Gemini 3 Pro Preview (Google): 56.1
2. GPT-5.2 Pro (OpenAI): 52.1
3. Claude Opus 4.6 (Anthropic): 51.9
4. Gemini 3 Flash Preview (Google): 50.6

LiveCodeBench

Computer Science and Programming

Real-world coding benchmark with problems from competitive programming contests, testing code generation and problem-solving abilities.

1. Gemini 3 Pro Preview (Google): 91.7
2. Gemini 3 Flash Preview (Google): 90.8
3. DeepSeek V3.2 Speciale (DeepSeek): 89.6
4. GLM 4.7 (Z): 89.4

MMLU-Pro

General Knowledge

A harder extension of MMLU featuring reasoning-focused questions with ten answer choices each, spanning 14 subject categories across STEM, humanities, social sciences, and professional domains.

1. Gemini 3 Pro Preview (Google): 89.8
2. Claude Opus 4.5 (Anthropic): 89.5
3. Gemini 3 Flash Preview (Google): 89.0
4. Claude Opus 4.5 (Anthropic): 88.9

IFBench

Instruction Following

Instruction Following Benchmark measuring LLM ability to adhere to nuanced writing constraints and formatting requirements.

1. Gemini 3 Flash Preview (Google): 78.0
2. GPT-5 Mini (OpenAI): 75.4
3. GPT-5.2 Pro (OpenAI): 75.4
4. GPT-5 Codex (OpenAI): 74.1

LCR

Reasoning and Logic

Long Context Retrieval benchmark testing ability to find and use information in very long documents.

1. GPT-5 (OpenAI): 75.6
2. KAT-Coder-Pro V1 (K): 74.0
3. Claude Opus 4.5 (Anthropic): 74.0
4. GPT-5.2 Pro (OpenAI): 72.7
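Long-context retrieval suites of this kind are typically built as "needle in a haystack" probes. A minimal sketch of the construction (the filler text, needle, and question below are hypothetical; actual scoring requires querying a model, which is out of scope here):

```python
# Build a long document with one retrievable fact buried in the middle.
filler = "The sky was a uniform gray that afternoon. " * 3000
needle = "The vault code is 4417. "
midpoint = len(filler) // 2
haystack = filler[:midpoint] + needle + filler[midpoint:]

# The model is then asked a question answerable only via the needle.
prompt = haystack + "\n\nQuestion: What is the vault code?"
print("4417" in haystack)  # True
```

Varying the needle's depth and the haystack's length is what distinguishes models that merely accept long inputs from models that can actually use them.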

Tau2

Multi-turn

Tau2 benchmark testing multi-turn agent capabilities in airline and retail domains.

1. GLM-4.7-Flash (Z): 98.8
2. GLM 5 (Z): 98.2
3. GLM 5 (Z): 97.4
4. GLM 4.7 (Z): 95.9

TerminalBench

Multi-turn

Terminal-based benchmark testing AI ability to interact with command-line interfaces and solve system tasks.

1. Claude Opus 4.6 (Anthropic): 48.5
2. Claude Opus 4.5 (Anthropic): 47.0
3. GPT-5.2 Pro (OpenAI): 47.0
4. Claude Opus 4.6 (Anthropic): 46.2

AIME 2025

Mathematical Problem Solving

American Invitational Mathematics Examination 2025 problems testing olympiad-level mathematical reasoning.

1. GPT-5.2 Pro (OpenAI): 99.0
2. GPT-5 Codex (OpenAI): 98.7
3. Gemini 3 Flash Preview (Google): 97.0
4. DeepSeek V3.2 Speciale (DeepSeek): 96.7

MATH-500

Mathematical Problem Solving

Competition mathematics problems requiring multi-step reasoning, covering algebra, geometry, number theory, and calculus.

1. GPT-5 (OpenAI): 99.4
2. o3 (OpenAI): 99.2
3. Claude Sonnet 4 (Anthropic): 99.1
4. Grok 4 (xAI): 99.0

AIME 2024

Mathematical Problem Solving

American Invitational Mathematics Examination 2024 problems testing olympiad-level mathematical reasoning.

1. GPT-5 (OpenAI): 95.7
2. Grok 4 (xAI): 94.3
3. o4 Mini (OpenAI): 94.0
4. Qwen3 235B A22B Thinking 2507 (A): 94.0

AGIEval English

Reasoning and Logic

AGIEval English — human-level reasoning tasks from standardized exams like SAT, LSAT, and civil service exams.

1. Gemini 3.1 Pro Preview (Google): 94.0
2. Gemini 3 Pro Preview (Google): 93.2
3. Qwen3.5 397B A17B (Qwen): 91.4
4. GPT-5 (OpenAI): 91.4

BBH

Reasoning and Logic

Big-Bench Hard — challenging subset of BIG-Bench focusing on tasks where language models previously underperformed.

1. Claude Sonnet 4.5 (Anthropic): 94.4
2. Claude Sonnet 4.5 (Anthropic): 94.3
3. Gemini 3 Pro Preview (Google): 93.8
4. Qwen3.5-122B-A10B (Qwen): 92.7

HumanEval

Computer Science and Programming

OpenAI HumanEval benchmark measuring Python code generation from function docstrings.

1. Claude Sonnet 4.5 (Anthropic): 97.6
2. R1 (DeepSeek): 97.4
3. Grok 4 (xAI): 97.0
4. Claude Sonnet 4.5 (Anthropic): 97.0
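Each HumanEval problem gives the model a function signature and docstring; the model must produce the body, and hidden unit tests decide pass or fail (reported as pass@k). A hypothetical task in that shape (not from the actual dataset):

```python
def running_max(nums):
    """Return a list where element i is the maximum of nums[0..i].

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # The model sees everything above this line and must generate the body.
    result, current = [], float("-inf")
    for n in nums:
        current = max(current, n)
        result.append(current)
    return result

# Harness side: pass@1 means the first sampled completion passes all tests.
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
```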

MBPP Plus

Computer Science and Programming

Mostly Basic Python Problems Plus — tests Python code generation with enhanced test cases.

1. Qwen3 235B A22B (A): 66.2
2. Qwen3 235B A22B (A): 66.2
3. R1 (DeepSeek): 64.7
4. o4 Mini High (OpenAI): 64.3

Accounting Audit

Financial Reasoning

Accounting and audit benchmark testing financial reasoning capabilities.

1. GPT-4o (OpenAI): 90.0
2. Claude 3.7 Sonnet (Anthropic): 86.7
3. Gemini 2.5 Pro Preview 05-06 (Google): 86.7
4. Claude 3.7 Sonnet (Anthropic): 83.3

MMMU

Multimodal

Massive Multi-discipline Multimodal Understanding — benchmark testing vision-language models on expert-level, college-exam-style questions involving charts, diagrams, and images.

1. o4 Mini High (OpenAI): 79.2
2. GPT-5 (OpenAI): 79.1
3. Qwen3.5-122B-A10B (Qwen): 78.1
4. Qwen3.5-27B (Qwen): 77.6

GSM8K

Mathematical Problem Solving

Grade School Math 8K — 8,500 high quality grade school math word problems.

1. Claude Opus 4 (Anthropic): 96.2
2. Claude Opus 4 (Anthropic): 96.2
3. R1 (DeepSeek): 96.2
4. o4 Mini High (OpenAI): 96.0
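GSM8K problems require chaining a few arithmetic steps expressed in natural language. A hypothetical problem in the same style, with its solution worked as code (the problem itself is invented for illustration, not drawn from the dataset):

```python
# "A baker makes 4 trays of 12 muffins each, sells three quarters of them,
#  and gives away 5 more. How many muffins are left?"
total = 4 * 12            # 48 muffins baked
sold = total * 3 // 4     # 36 muffins sold
left = total - sold - 5   # 48 - 36 - 5
print(left)               # 7
```

Scoring only checks the final numeric answer, so multi-step errors anywhere in the chain count as a full miss.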

SWE-bench Lite

Multi-turn

Software Engineering benchmark testing ability to resolve real GitHub issues.

1. Claude Opus 4.6 (Anthropic): 62.7
2. Claude Opus 4.6 (Anthropic): 62.7
3. GPT-5 (OpenAI): 54.3
4. Claude Haiku 4.5 (Anthropic): 54.3

SimpleQA

Reasoning and Logic

Simple question answering benchmark testing factual accuracy and knowledge retrieval.

1. Gemini 2.5 Pro (Google): 53.0
2. Qwen3 235B A22B Instruct 2507 (A): 50.6
3. Qwen3 VL 235B A22B Instruct (A): 46.7
4. GPT-4.1 (OpenAI): 40.4

ARC Challenge

Reasoning and Logic

AI2 Reasoning Challenge (Challenge set) — grade-school science questions requiring complex reasoning.

1. GPT-5 (OpenAI): 96.3
2. MiniMax M1 (MiniMax): 95.3
3. Grok 3 Mini Beta (xAI): 95.2
4. GPT-4.1 (OpenAI): 95.1

ARC Easy

Reasoning and Logic

AI2 Reasoning Challenge (Easy set) — grade-school science questions.

1. Claude Opus 4 (Anthropic): 99.7
2. Claude Opus 4 (Anthropic): 99.7
3. Qwen3 32B (A): 99.1
4. Qwen3 32B (A): 99.1

BBEH

Reasoning and Logic

Big Bench Extra Hard — even more challenging reasoning tasks pushing the limits of language model capabilities.

1. GPT-5 (OpenAI): 64.1
2. GPT-5 Mini (OpenAI): 54.9
3. Claude Opus 4.6 (Anthropic): 49.2
4. Claude Opus 4.6 (Anthropic): 49.2

MedQA

Reasoning and Logic

Medical question answering benchmark of USMLE-style board exam questions.

1. o4 Mini High (OpenAI): 95.2
2. Gemini 2.5 Pro (Google): 94.6
3. Claude 3.7 Sonnet (Anthropic): 92.3
4. R1 (DeepSeek): 92.1

AGIEval Chinese

Reasoning and Logic

AGIEval Chinese — reasoning tasks from Chinese standardized exams (Gaokao, civil service).

1. DeepSeek V3.2 Exp (DeepSeek): 90.1
2. Qwen3 235B A22B (A): 89.4
3. Qwen3 235B A22B (A): 89.4
4. ERNIE 4.5 300B A47B (Baidu): 89.0

Mathematics

Mathematical Problem Solving

Mathematics benchmark covering algebra, geometry, number theory, and calculus problems.

1. Claude Opus 4.6 (Anthropic): 95.6
2. Claude Opus 4.6 (Anthropic): 95.6
3. o4 Mini High (OpenAI): 94.6
4. o3 Mini (OpenAI): 93.1

MMLU

General Knowledge

Massive Multitask Language Understanding — tests knowledge across 57 subjects.

1. R1 0528 (DeepSeek): 90.5
2. Grok 3 Mini Beta (xAI): 89.2
3. o3 Mini (OpenAI): 88.9
4. Kimi K2 0711 (K): 88.3

Stock BCS

Financial Reasoning

Stock market benchmark testing financial analysis capabilities.

1. Qwen2.5 72B Instruct (A): 100.0
2. R1 (DeepSeek): 91.7
3. GPT-4.1 (OpenAI): 91.7
4. o3 Mini (OpenAI): 83.3

BIRD-CRITIC

Multi-turn

BIRD-CRITIC — multi-turn benchmark testing SQL generation and database interaction.

1. Claude Opus 4.6 (Anthropic): 34.0
2. Claude Opus 4.6 (Anthropic): 34.0
3. GLM 4.7 (Z): 33.0
4. GLM 4.7 (Z): 33.0

Knights and Knaves

Reasoning and Logic

Logic puzzle benchmark based on knights (truth-tellers) and knaves (liars) puzzles.

1. o3 Mini (OpenAI): 99.7
2. o4 Mini High (OpenAI): 99.7
3. R1 0528 (DeepSeek): 97.9
4. R1 (DeepSeek): 97.3
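What makes these puzzles machine-checkable is that every instance has a mechanically verifiable answer: enumerate all knight/knave assignments and keep the consistent ones. A sketch with an invented two-person puzzle (A says "B is a knave"; B says "we are different kinds"):

```python
from itertools import product

def solve():
    """Brute-force all truth assignments; knights tell the truth, knaves lie."""
    solutions = []
    for a_knight, b_knight in product([True, False], repeat=2):
        claim_a = not b_knight          # A's statement: "B is a knave"
        claim_b = a_knight != b_knight  # B's statement: "we are different kinds"
        # A consistent world: each speaker's claim is true iff they are a knight.
        if claim_a == a_knight and claim_b == b_knight:
            solutions.append((a_knight, b_knight))
    return solutions

print(solve())  # [(False, True)] -> A is a knave, B is a knight
```

Because the solver gives the unique ground truth, the benchmark can scale puzzle difficulty (more speakers, nested statements) while keeping scoring exact.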

IFEval

Instruction Following

Instruction Following Evaluation benchmark testing how well LLMs follow detailed formatting and content constraints.

1. Kimi K2.5 (K): 92.6
2. Kimi K2.5 (K): 92.6
3. Gemini 2.5 Pro (Google): 90.8
4. GLM 4.7 (Z): 90.8
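The defining feature of IFEval is that its constraints are verifiable by code rather than by a judge model. A minimal sketch of that idea, for a hypothetical instruction ("answer in exactly three bullet points, each at most ten words, and mention the word 'latency'"):

```python
def check(response: str) -> bool:
    """Programmatically verify the formatting constraints of one instruction."""
    bullets = [l for l in response.splitlines() if l.strip().startswith("- ")]
    return (
        len(bullets) == 3                                # exactly three bullets
        and all(len(b.split()) <= 10 for b in bullets)   # each at most ten words
        and "latency" in response.lower()                # required keyword present
    )

sample = (
    "- Cache results aggressively\n"
    "- Batch small requests together\n"
    "- Measure latency before optimizing"
)
print(check(sample))  # True
```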

BFCL v3

Multi-turn

Berkeley Function Calling Leaderboard v3 — testing function/tool calling accuracy.

1. GLM 4.5 (Z): 76.7
2. Qwen3 32B (A): 75.7
3. Qwen3 32B (A): 75.7
4. Qwen3 Max (A): 74.9
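Function-calling benchmarks in this family score a model's emitted call structurally against a reference call, so argument order and formatting do not matter. A sketch with a hypothetical `get_weather` tool (the schema and expected call are invented for illustration):

```python
import json

# Reference call the grader expects for this test case.
expected = {"name": "get_weather", "args": {"city": "Paris", "unit": "celsius"}}

# Model's raw output: same call, arguments in a different order.
model_output = '{"name": "get_weather", "args": {"unit": "celsius", "city": "Paris"}}'

call = json.loads(model_output)
correct = (
    call["name"] == expected["name"]
    and call["args"] == expected["args"]  # dict equality ignores key order
)
print(correct)  # True
```

Structural matching of this kind is stricter than execution-based checks (a wrong argument fails even if the tool would tolerate it) but avoids needing live tool backends.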

Formal Logic Extended

Reasoning and Logic

Extended formal logic benchmark testing deductive and propositional reasoning.

1. o3 Mini (OpenAI): 99.8
2. R1 0528 (DeepSeek): 99.2
3. R1 (DeepSeek): 98.4
4. Claude 3.7 Sonnet (Anthropic): 95.6

WMDP

Reasoning and Logic

Weapons of Mass Destruction Proxy — multiple-choice benchmark measuring hazardous knowledge in biosecurity, cybersecurity, and chemical security.

1. Gemini 3 Flash Preview (Google): 86.8
2. Gemini 3 Flash Preview (Google): 86.8
3. o3 Mini (OpenAI): 80.5
4. Claude 3.7 Sonnet (Anthropic): 78.2

WMT 2014

Multilingual

Workshop on Machine Translation 2014 — multilingual translation quality benchmark.

1. Gemini 2.0 Flash (Google): 38.9
2. Llama 3.1 405B Instruct (Meta): 38.0
3. Llama 4 Maverick (Meta): 38.0
4. GPT-4.1 (OpenAI): 37.6

DarkBench

Reasoning and Logic

DarkBench — benchmark measuring the prevalence of manipulative dark patterns (such as sycophancy, brand bias, and user retention tactics) in model outputs.

1. Grok 4 (xAI): 51.1
2. Phi 4 (Microsoft Azure): 49.5
3. Qwen3 32B (A): 49.2
4. Qwen3 32B (A): 49.2

GAIA

Multi-turn

GAIA — General AI Assistants benchmark testing multi-step real-world tasks.

1. GPT-5 Mini (OpenAI): 44.8
2. Claude 3.7 Sonnet (Anthropic): 43.9
3. Claude 3.7 Sonnet (Anthropic): 43.9
4. Gemini 2.5 Pro (Google): 33.3

MultiChallenge

General Knowledge

MultiChallenge — benchmark of diverse, realistic multi-turn conversational challenges testing instruction retention and reasoning.

1. Mistral Medium 3.1 (Mistral): 37.4
2. Nova Pro 1.0 (Amazon): 19.0

93 out of our 301 tracked models have had a price change in March.
