Price Per Token

BBEH Leaderboard

Big Bench Extra Hard — even more challenging reasoning tasks pushing the limits of language model capabilities.

Data from LayerLens

As of April 18, 2026, the top-scoring model on BBEH is GPT-5 at 64.1%. 40 models have been evaluated on this benchmark.

Last updated: April 18, 2026

Models: 40
Best Score: 64.1
Average: 28.3
Std Dev: 17.0
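These summary statistics can be reproduced from the 40 scores listed on this page; note that the 17.0 figure matches the population (not sample) standard deviation:

```python
import statistics

# BBEH scores for the 40 ranked models, as listed on this page
scores = [
    64.1, 64.1, 64.1, 64.1, 54.9, 54.9, 49.2, 49.2, 38.4, 38.4,
    37.1, 34.7, 34.7, 29.3, 29.3, 29.3, 26.6, 26.6, 25.6, 19.5,
    19.1, 18.7, 18.5, 18.0, 18.0, 17.7, 17.1, 16.7, 15.9, 15.4,
    14.8, 14.2, 14.1, 14.0, 13.2, 11.9, 11.2, 11.2, 10.1, 10.0,
]

print(len(scores))                        # 40 models
print(max(scores))                        # best score: 64.1
print(round(statistics.mean(scores), 1))  # average: 28.3
print(round(statistics.pstdev(scores), 1))  # population std dev: 17.0
```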

Categories: Reasoning and Logic

Input $/M   Output $/M   BBEH
$1.250      $10.000      64.1
$1.250      $10.000      64.1
$1.250      $10.000      64.1
$1.250      $10.000      64.1
$0.125      $1.000       54.9
$0.125      $1.000       54.9
$5.000      $25.000      49.2
$5.000      $25.000      49.2
$15.000     $75.000      38.4
$15.000     $75.000      38.4
$3.000      $15.000      37.1
$3.000      $15.000      34.7
$3.000      $15.000      34.7
$0.050      $0.400       29.3
$0.050      $0.400       29.3
$0.050      $0.400       29.3
$0.060      $0.400       26.6
$0.060      $0.400       26.6
$0.150      $0.600       25.6
$0.080      $0.300       19.5
$0.900      $0.900       19.1
$0.500      $1.500       18.7
$0.100      $0.400       18.5
$3.000      $15.000      18.0
$3.000      $15.000      18.0
$2.000      $8.000       17.7
$2.500      $10.000      17.1
$0.800      $4.000       16.7
$0.800      $3.200       15.9
$0.014      $0.028       15.4
$2.000      $6.000       14.8
$0.075      $0.200       14.2
$0.400      $2.000       14.1
$0.340      $0.390       14.0
$0.035      $0.140       13.2
$0.070      $0.280       11.9
$1.000      $3.000       11.2
$1.000      $3.000       11.2
$0.065      $0.140       10.1
$2.500      $10.000      10.0

Pricing from OpenRouter. Benchmarks from Artificial Analysis.


About BBEH

This leaderboard shows all models with BBEH benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
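One way to compare performance against cost is points per blended dollar. The 3:1 input-to-output token mix below is an assumed workload, not something the page specifies, and the example pricing comes from the table's top-priced leader and one of its cheap mid-scorers:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_share: float = 0.75) -> float:
    """Blended $/M tokens under an assumed workload mix (default 3:1 input:output)."""
    return input_share * input_per_m + (1 - input_share) * output_per_m

def points_per_dollar(score: float, input_per_m: float, output_per_m: float) -> float:
    """BBEH points per blended dollar-per-million-tokens: a rough value metric."""
    return score / blended_price(input_per_m, output_per_m)

# Leader ($1.250 in / $10.000 out, 64.1) vs budget model ($0.050 in / $0.400 out, 29.3)
print(round(points_per_dollar(64.1, 1.25, 10.0), 1))   # -> 18.6
print(round(points_per_dollar(29.3, 0.05, 0.40), 1))   # -> 213.1
```

By this metric the cheap model delivers far more score per dollar, which is why raw rank alone can be a poor purchasing guide.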

Frequently Asked Questions

What is BBEH?
Big Bench Extra Hard — even more challenging reasoning tasks pushing the limits of language model capabilities.

Which model has the highest BBEH score?
As of April 18, 2026, GPT-5 leads the BBEH leaderboard with a score of 64.1. Rankings change as new models are released and evaluated.

How many models have been evaluated on BBEH?
Currently, 40 models have been evaluated on BBEH, with an average score of 28.3 and a standard deviation of 17.0.

How often is the data updated?
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.