Price Per Token

LLM Benchmarks

Discover and compare AI model performance across 39 benchmarks.

Data from Artificial Analysis and LayerLens


GPQA

Reasoning and Logic

Graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult.

1. Gemini 3 Pro Preview (Google): 90.8
2. GPT-5.2 Pro (OpenAI): 90.3
3. Gemini 3 Flash Preview (Google): 89.8
4. Claude Opus 4.6 (Anthropic): 89.6

Humanity's Last Exam

Reasoning and Logic

Humanity's Last Exam — extremely challenging questions designed to test the upper limits of AI capability across diverse domains.

1. Gemini 3 Pro Preview (Google): 37.2
2. Claude Opus 4.6 (Anthropic): 36.7
3. GPT-5.2 Pro (OpenAI): 35.4
4. Gemini 3 Flash Preview (Google): 34.7

SciCode

Computer Science and Programming

Science coding benchmark measuring AI ability to solve scientific computing tasks.

1. Gemini 3 Pro Preview (Google): 56.1
2. GPT-5.2 Pro (OpenAI): 52.1
3. Claude Opus 4.6 (Anthropic): 51.9
4. Gemini 3 Flash Preview (Google): 50.6

LiveCodeBench

Computer Science and Programming

Real-world coding benchmark with problems from competitive programming contests, testing code generation and problem-solving abilities.

1. Gemini 3 Pro Preview (Google): 91.7
2. Gemini 3 Flash Preview (Google): 90.8
3. DeepSeek V3.2 Speciale (DeepSeek): 89.6
4. GLM 4.7 (Z): 89.4

MMLU-Pro

General Knowledge

A harder extension of MMLU featuring reasoning-focused questions with ten answer choices each, spanning 14 subject categories across STEM, humanities, social sciences, and professional domains.

1. Gemini 3 Pro Preview (Google): 89.8
2. Claude Opus 4.5 (Anthropic): 89.5
3. Gemini 3 Flash Preview (Google): 89.0
4. Claude Opus 4.5 (Anthropic): 88.9

IFBench

Instruction Following

Instruction Following Benchmark measuring LLM ability to adhere to nuanced writing constraints and formatting requirements.

1. Gemini 3 Flash Preview (Google): 78.0
2. GPT-5 Mini (OpenAI): 75.4
3. GPT-5.2 Pro (OpenAI): 75.4
4. GPT-5 Codex (OpenAI): 74.1

LCR

Reasoning and Logic

Long Context Retrieval benchmark testing ability to find and use information in very long documents.

1. GPT-5 (OpenAI): 75.6
2. KAT-Coder-Pro V1 (K): 74.0
3. Claude Opus 4.5 (Anthropic): 74.0
4. GPT-5.2 Pro (OpenAI): 72.7
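Long-context retrieval suites of this kind are typically built as "needle in a haystack" probes. A minimal sketch of the construction (the filler text, needle, and question below are hypothetical; actual scoring requires querying a model, which is out of scope here):

```python
# Build a long document with one retrievable fact buried in the middle.
filler = "The sky was a uniform gray that afternoon. " * 3000
needle = "The vault code is 4417. "
midpoint = len(filler) // 2
haystack = filler[:midpoint] + needle + filler[midpoint:]

# The model is then asked a question answerable only via the needle.
prompt = haystack + "\n\nQuestion: What is the vault code?"
print("4417" in haystack)  # True
```

Varying the needle's depth and the haystack's length is what distinguishes models that merely accept long inputs from models that can actually use them.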

Tau2

Multi-turn

Tau2 benchmark testing multi-turn agent capabilities in airline and retail domains.

1. GLM-4.7-Flash (Z): 98.8
2. GLM 5 (Z): 98.2
3. GLM 5 (Z): 97.4
4. GLM 4.7 (Z): 95.9

TerminalBench

Multi-turn

Terminal-based benchmark testing AI ability to interact with command-line interfaces and solve system tasks.

1. Claude Opus 4.6 (Anthropic): 48.5
2. Claude Opus 4.5 (Anthropic): 47.0
3. GPT-5.2 Pro (OpenAI): 47.0
4. Claude Opus 4.6 (Anthropic): 46.2

AIME 2025

Mathematical Problem Solving

American Invitational Mathematics Examination 2025 problems testing olympiad-level mathematical reasoning.

1. GPT-5.2 Pro (OpenAI): 99.0
2. GPT-5 Codex (OpenAI): 98.7
3. Gemini 3 Flash Preview (Google): 97.0
4. DeepSeek V3.2 Speciale (DeepSeek): 96.7

MATH-500

Mathematical Problem Solving

Competition mathematics problems requiring multi-step reasoning, covering algebra, geometry, number theory, and calculus.

1. GPT-5 (OpenAI): 99.4
2. o3 (OpenAI): 99.2
3. Claude Sonnet 4 (Anthropic): 99.1
4. Grok 4 (xAI): 99.0

AIME 2024

Mathematical Problem Solving

American Invitational Mathematics Examination 2024 problems testing olympiad-level mathematical reasoning.

1. GPT-5 (OpenAI): 95.7
2. Grok 4 (xAI): 94.3
3. o4 Mini (OpenAI): 94.0
4. Qwen3 235B A22B Thinking 2507 (A): 94.0

AGIEval English

Reasoning and Logic

AGIEval English — human-level reasoning tasks from standardized exams like SAT, LSAT, and civil service exams.

1. Gemini 3.1 Pro Preview (Google): 94.0
2. Gemini 3 Pro Preview (Google): 93.2
3. Qwen3.5 397B A17B (Qwen): 91.4
4. GPT-5 (OpenAI): 91.4

BBH

Reasoning and Logic

Big-Bench Hard — challenging subset of BIG-Bench focusing on tasks where language models previously underperformed.

1. Claude Sonnet 4.5 (Anthropic): 94.4
2. Claude Sonnet 4.5 (Anthropic): 94.3
3. Gemini 3 Pro Preview (Google): 93.8
4. Qwen3.5-122B-A10B (Qwen): 92.7

HumanEval

Computer Science and Programming

OpenAI HumanEval benchmark measuring Python code generation from function docstrings.

1. Claude Sonnet 4.5 (Anthropic): 97.6
2. R1 (DeepSeek): 97.4
3. Grok 4 (xAI): 97.0
4. Claude Sonnet 4.5 (Anthropic): 97.0
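Each HumanEval problem gives the model a function signature and docstring; the model must produce the body, and hidden unit tests decide pass or fail (reported as pass@k). A hypothetical task in that shape (not from the actual dataset):

```python
def running_max(nums):
    """Return a list where element i is the maximum of nums[0..i].

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # The model sees everything above this line and must generate the body.
    result, current = [], float("-inf")
    for n in nums:
        current = max(current, n)
        result.append(current)
    return result

# Harness side: pass@1 means the first sampled completion passes all tests.
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
```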

MBPP Plus

Computer Science and Programming

Mostly Basic Python Problems Plus — tests Python code generation with enhanced test cases.

1. Qwen3 235B A22B (A): 66.2
2. Qwen3 235B A22B (A): 66.2
3. R1 (DeepSeek): 64.7
4. o4 Mini High (OpenAI): 64.3

Accounting Audit

Financial Reasoning

Accounting and audit benchmark testing financial reasoning capabilities.

1. GPT-4o (OpenAI): 90.0
2. Claude 3.7 Sonnet (Anthropic): 86.7
3. Gemini 2.5 Pro Preview 05-06 (Google): 86.7
4. Claude 3.7 Sonnet (Anthropic): 83.3

MMMU

Multimodal

Massive Multi-discipline Multimodal Understanding — benchmark testing vision-language models on expert-level, college-exam-style questions involving charts, diagrams, and images.

1. o4 Mini High (OpenAI): 79.2
2. GPT-5 (OpenAI): 79.1
3. Qwen3.5-122B-A10B (Qwen): 78.1
4. Qwen3.5-27B (Qwen): 77.6

GSM8K

Mathematical Problem Solving

Grade School Math 8K — 8,500 high quality grade school math word problems.

1. Claude Opus 4 (Anthropic): 96.2
2. Claude Opus 4 (Anthropic): 96.2
3. R1 (DeepSeek): 96.2
4. o4 Mini High (OpenAI): 96.0
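GSM8K problems require chaining a few arithmetic steps expressed in natural language. A hypothetical problem in the same style, with its solution worked as code (the problem itself is invented for illustration, not drawn from the dataset):

```python
# "A baker makes 4 trays of 12 muffins each, sells three quarters of them,
#  and gives away 5 more. How many muffins are left?"
total = 4 * 12            # 48 muffins baked
sold = total * 3 // 4     # 36 muffins sold
left = total - sold - 5   # 48 - 36 - 5
print(left)               # 7
```

Scoring only checks the final numeric answer, so multi-step errors anywhere in the chain count as a full miss.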

SWE-bench Lite

Multi-turn

Software Engineering benchmark testing ability to resolve real GitHub issues.

1. Claude Opus 4.6 (Anthropic): 62.7
2. Claude Opus 4.6 (Anthropic): 62.7
3. GPT-5 (OpenAI): 54.3
4. Claude Haiku 4.5 (Anthropic): 54.3

SimpleQA

Reasoning and Logic

Simple question answering benchmark testing factual accuracy and knowledge retrieval.

1. Gemini 2.5 Pro (Google): 53.0
2. Qwen3 235B A22B Instruct 2507 (A): 50.6
3. Qwen3 VL 235B A22B Instruct (A): 46.7
4. GPT-4.1 (OpenAI): 40.4

ARC Challenge

Reasoning and Logic

AI2 Reasoning Challenge (Challenge set) — grade-school science questions requiring complex reasoning.

1. GPT-5 (OpenAI): 96.3
2. MiniMax M1 (MiniMax): 95.3
3. Grok 3 Mini Beta (xAI): 95.2
4. GPT-4.1 (OpenAI): 95.1

ARC Easy

Reasoning and Logic

AI2 Reasoning Challenge (Easy set) — grade-school science questions.

1. Claude Opus 4 (Anthropic): 99.7
2. Claude Opus 4 (Anthropic): 99.7
3. Qwen3 32B (A): 99.1
4. Qwen3 32B (A): 99.1

BBEH

Reasoning and Logic

Big Bench Extra Hard — even more challenging reasoning tasks pushing the limits of language model capabilities.

1. GPT-5 (OpenAI): 64.1
2. GPT-5 Mini (OpenAI): 54.9
3. Claude Opus 4.6 (Anthropic): 49.2
4. Claude Opus 4.6 (Anthropic): 49.2

MedQA

Reasoning and Logic

Medical question answering benchmark of USMLE-style board exam questions.

1. o4 Mini High (OpenAI): 95.2
2. Gemini 2.5 Pro (Google): 94.6
3. Claude 3.7 Sonnet (Anthropic): 92.3
4. R1 (DeepSeek): 92.1

AGIEval Chinese

Reasoning and Logic

AGIEval Chinese — reasoning tasks from Chinese standardized exams (Gaokao, civil service).

1. DeepSeek V3.2 Exp (DeepSeek): 90.1
2. Qwen3 235B A22B (A): 89.4
3. Qwen3 235B A22B (A): 89.4
4. ERNIE 4.5 300B A47B (Baidu): 89.0

Mathematics

Mathematical Problem Solving

Mathematics benchmark covering algebra, geometry, number theory, and calculus problems.

1. Claude Opus 4.6 (Anthropic): 95.6
2. Claude Opus 4.6 (Anthropic): 95.6
3. o4 Mini High (OpenAI): 94.6
4. o3 Mini (OpenAI): 93.1

MMLU

General Knowledge

Massive Multitask Language Understanding — tests knowledge across 57 subjects.

1. R1 0528 (DeepSeek): 90.5
2. Grok 3 Mini Beta (xAI): 89.2
3. o3 Mini (OpenAI): 88.9
4. Kimi K2 0711 (K): 88.3

Stock BCS

Financial Reasoning

Stock market benchmark testing financial analysis capabilities.

1. Qwen2.5 72B Instruct (A): 100.0
2. R1 (DeepSeek): 91.7
3. GPT-4.1 (OpenAI): 91.7
4. o3 Mini (OpenAI): 83.3

BIRD-CRITIC

Multi-turn

BIRD-CRITIC — multi-turn benchmark testing SQL generation and database interaction.

1. Claude Opus 4.6 (Anthropic): 34.0
2. Claude Opus 4.6 (Anthropic): 34.0
3. GLM 4.7 (Z): 33.0
4. GLM 4.7 (Z): 33.0

Knights and Knaves

Reasoning and Logic

Logic puzzle benchmark based on knights (truth-tellers) and knaves (liars) puzzles.

1. o3 Mini (OpenAI): 99.7
2. o4 Mini High (OpenAI): 99.7
3. R1 0528 (DeepSeek): 97.9
4. R1 (DeepSeek): 97.3
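What makes these puzzles machine-checkable is that every instance has a mechanically verifiable answer: enumerate all knight/knave assignments and keep the consistent ones. A sketch with an invented two-person puzzle (A says "B is a knave"; B says "we are different kinds"):

```python
from itertools import product

def solve():
    """Brute-force all truth assignments; knights tell the truth, knaves lie."""
    solutions = []
    for a_knight, b_knight in product([True, False], repeat=2):
        claim_a = not b_knight          # A's statement: "B is a knave"
        claim_b = a_knight != b_knight  # B's statement: "we are different kinds"
        # A consistent world: each speaker's claim is true iff they are a knight.
        if claim_a == a_knight and claim_b == b_knight:
            solutions.append((a_knight, b_knight))
    return solutions

print(solve())  # [(False, True)] -> A is a knave, B is a knight
```

Because the solver gives the unique ground truth, the benchmark can scale puzzle difficulty (more speakers, nested statements) while keeping scoring exact.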

IFEval

Instruction Following

Instruction Following Evaluation benchmark testing how well LLMs follow detailed formatting and content constraints.

1. Kimi K2.5 (K): 92.6
2. Kimi K2.5 (K): 92.6
3. Gemini 2.5 Pro (Google): 90.8
4. GLM 4.7 (Z): 90.8
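The defining feature of IFEval is that its constraints are verifiable by code rather than by a judge model. A minimal sketch of that idea, for a hypothetical instruction ("answer in exactly three bullet points, each at most ten words, and mention the word 'latency'"):

```python
def check(response: str) -> bool:
    """Programmatically verify the formatting constraints of one instruction."""
    bullets = [l for l in response.splitlines() if l.strip().startswith("- ")]
    return (
        len(bullets) == 3                                # exactly three bullets
        and all(len(b.split()) <= 10 for b in bullets)   # each at most ten words
        and "latency" in response.lower()                # required keyword present
    )

sample = (
    "- Cache results aggressively\n"
    "- Batch small requests together\n"
    "- Measure latency before optimizing"
)
print(check(sample))  # True
```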

BFCL v3

Multi-turn

Berkeley Function Calling Leaderboard v3 — testing function/tool calling accuracy.

1. GLM 4.5 (Z): 76.7
2. Qwen3 32B (A): 75.7
3. Qwen3 32B (A): 75.7
4. Qwen3 Max (A): 74.9
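Function-calling benchmarks in this family score a model's emitted call structurally against a reference call, so argument order and formatting do not matter. A sketch with a hypothetical `get_weather` tool (the schema and expected call are invented for illustration):

```python
import json

# Reference call the grader expects for this test case.
expected = {"name": "get_weather", "args": {"city": "Paris", "unit": "celsius"}}

# Model's raw output: same call, arguments in a different order.
model_output = '{"name": "get_weather", "args": {"unit": "celsius", "city": "Paris"}}'

call = json.loads(model_output)
correct = (
    call["name"] == expected["name"]
    and call["args"] == expected["args"]  # dict equality ignores key order
)
print(correct)  # True
```

Structural matching of this kind is stricter than execution-based checks (a wrong argument fails even if the tool would tolerate it) but avoids needing live tool backends.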

Formal Logic Extended

Reasoning and Logic

Extended formal logic benchmark testing deductive and propositional reasoning.

1. o3 Mini (OpenAI): 99.8
2. R1 0528 (DeepSeek): 99.2
3. R1 (DeepSeek): 98.4
4. Claude 3.7 Sonnet (Anthropic): 95.6

WMDP

Reasoning and Logic

Weapons of Mass Destruction Proxy — multiple-choice benchmark measuring hazardous knowledge in biosecurity, cybersecurity, and chemical security.

1. Gemini 3 Flash Preview (Google): 86.8
2. Gemini 3 Flash Preview (Google): 86.8
3. o3 Mini (OpenAI): 80.5
4. Claude 3.7 Sonnet (Anthropic): 78.2

WMT 2014

Multilingual

Workshop on Machine Translation 2014 — multilingual translation quality benchmark.

1. Gemini 2.0 Flash (Google): 38.9
2. Llama 3.1 405B Instruct (Meta): 38.0
3. Llama 4 Maverick (Meta): 38.0
4. GPT-4.1 (OpenAI): 37.6

DarkBench

Reasoning and Logic

DarkBench — benchmark measuring the prevalence of manipulative dark patterns (such as sycophancy, brand bias, and user retention tactics) in model outputs.

1. Grok 4 (xAI): 51.1
2. Phi 4 (Microsoft Azure): 49.5
3. Qwen3 32B (A): 49.2
4. Qwen3 32B (A): 49.2

GAIA

Multi-turn

GAIA — General AI Assistants benchmark testing multi-step real-world tasks.

1. GPT-5 Mini (OpenAI): 44.8
2. Claude 3.7 Sonnet (Anthropic): 43.9
3. Claude 3.7 Sonnet (Anthropic): 43.9
4. Gemini 2.5 Pro (Google): 33.3

MultiChallenge

General Knowledge

MultiChallenge — benchmark of diverse, realistic multi-turn conversational challenges testing instruction retention and reasoning.

1. Mistral Medium 3.1 (Mistral): 37.4
2. Nova Pro 1.0 (Amazon): 19.0

93 out of our 301 tracked models have had a price change in March.
