Price Per Token

BBEH Leaderboard

Big Bench Extra Hard — even more challenging reasoning tasks pushing the limits of language model capabilities.

Data from LayerLens

As of April 18, 2026, the top-scoring model on BBEH is GPT-5 at 64.1%. 40 models have been evaluated on this benchmark.

Last updated: April 18, 2026

Models: 40
Best Score: 64.1
Average: 28.3
Std Dev: 17.0
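These summary statistics can be reproduced from the 40 scores listed on this page; note that the 17.0 figure matches the population (not sample) standard deviation:

```python
import statistics

# BBEH scores for the 40 ranked models, as listed on this page
scores = [
    64.1, 64.1, 64.1, 64.1, 54.9, 54.9, 49.2, 49.2, 38.4, 38.4,
    37.1, 34.7, 34.7, 29.3, 29.3, 29.3, 26.6, 26.6, 25.6, 19.5,
    19.1, 18.7, 18.5, 18.0, 18.0, 17.7, 17.1, 16.7, 15.9, 15.4,
    14.8, 14.2, 14.1, 14.0, 13.2, 11.9, 11.2, 11.2, 10.1, 10.0,
]

print(len(scores))                        # 40 models
print(max(scores))                        # best score: 64.1
print(round(statistics.mean(scores), 1))  # average: 28.3
print(round(statistics.pstdev(scores), 1))  # population std dev: 17.0
```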

Categories: Reasoning and Logic

Input $/M   Output $/M   BBEH
$1.250      $10.000      64.1
$1.250      $10.000      64.1
$1.250      $10.000      64.1
$1.250      $10.000      64.1
$0.125      $1.000       54.9
$0.125      $1.000       54.9
$5.000      $25.000      49.2
$5.000      $25.000      49.2
$15.000     $75.000      38.4
$15.000     $75.000      38.4
$3.000      $15.000      37.1
$3.000      $15.000      34.7
$3.000      $15.000      34.7
$0.050      $0.400       29.3
$0.050      $0.400       29.3
$0.050      $0.400       29.3
$0.060      $0.400       26.6
$0.060      $0.400       26.6
$0.150      $0.600       25.6
$0.080      $0.300       19.5
$0.900      $0.900       19.1
$0.500      $1.500       18.7
$0.100      $0.400       18.5
$3.000      $15.000      18.0
$3.000      $15.000      18.0
$2.000      $8.000       17.7
$2.500      $10.000      17.1
$0.800      $4.000       16.7
$0.800      $3.200       15.9
$0.014      $0.028       15.4
$2.000      $6.000       14.8
$0.075      $0.200       14.2
$0.400      $2.000       14.1
$0.340      $0.390       14.0
$0.035      $0.140       13.2
$0.070      $0.280       11.9
$1.000      $3.000       11.2
$1.000      $3.000       11.2
$0.065      $0.140       10.1
$2.500      $10.000      10.0

Pricing from OpenRouter. Benchmarks from Artificial Analysis.


About BBEH

This leaderboard shows all models with BBEH benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
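One way to compare performance against cost is points per blended dollar. The 3:1 input-to-output token mix below is an assumed workload, not something the page specifies, and the example pricing comes from the table's top-priced leader and one of its cheap mid-scorers:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_share: float = 0.75) -> float:
    """Blended $/M tokens under an assumed workload mix (default 3:1 input:output)."""
    return input_share * input_per_m + (1 - input_share) * output_per_m

def points_per_dollar(score: float, input_per_m: float, output_per_m: float) -> float:
    """BBEH points per blended dollar-per-million-tokens: a rough value metric."""
    return score / blended_price(input_per_m, output_per_m)

# Leader ($1.250 in / $10.000 out, 64.1) vs budget model ($0.050 in / $0.400 out, 29.3)
print(round(points_per_dollar(64.1, 1.25, 10.0), 1))   # -> 18.6
print(round(points_per_dollar(29.3, 0.05, 0.40), 1))   # -> 213.1
```

By this metric the cheap model delivers far more score per dollar, which is why raw rank alone can be a poor purchasing guide.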

Frequently Asked Questions

What is BBEH?
Big Bench Extra Hard — even more challenging reasoning tasks pushing the limits of language model capabilities.

Which model has the highest BBEH score?
As of April 18, 2026, GPT-5 leads the BBEH leaderboard with a score of 64.1. Rankings change as new models are released and evaluated.

How many models have been evaluated on BBEH?
Currently, 40 models have been evaluated on BBEH, with an average score of 28.3 and a standard deviation of 17.0.

How often is the data updated?
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.