Price Per Token

BBH Leaderboard

Big-Bench Hard — challenging subset of BIG-Bench focusing on tasks where language models previously underperformed.

Data from LayerLens

As of April 4, 2026, the top-scoring model on BBH is Claude Sonnet 4.5 at 94.4%, followed by a second Claude Sonnet 4.5 listing at 94.3% and GLM 5, also at 94.3%. 109 models have been evaluated on this benchmark.

Last updated: April 4, 2026

Models: 109
Best Score: 94.4
Average: 80.7
Std Dev: 13.3

Categories: Reasoning and Logic
Input $/M    Output $/M    BBH
$3.000       $15.000       94.4
$3.000       $15.000       94.3
$0.720       $2.300        94.3
$0.720       $2.300        94.3
$2.000       $12.000       93.8
$0.260       $2.080        92.7
$0.260       $2.080        92.7
$0.780       $3.900        92.6
$0.780       $3.900        92.6
$0.195       $0.900        92.4
$0.195       $0.900        92.4
$0.383       $1.720        91.0
$0.383       $1.720        91.0
$3.000       $15.000       91.0
$3.000       $15.000       91.0
$0.625       $5.000        90.9
$0.625       $5.000        90.9
$0.625       $5.000        90.9
$0.625       $5.000        90.9
$0.163       $0.900        90.8
$0.163       $0.900        90.8
$0.065       $0.260        90.6
$0.260       $1.560        90.6
$3.000       $15.000       90.3
$3.000       $15.000       90.3
$3.000       $15.000       90.3
$0.200       $0.500        89.6
$2.000       $8.000        89.6
$0.080       $0.280        89.5
$0.080       $0.280        89.5
$0.390       $0.900        89.4
$0.390       $0.900        89.4
$0.200       $0.500        89.3
$0.900       $0.900        89.0
$1.000       $10.000       88.7
$0.260       $0.380        88.5
$0.390       $1.750        88.4
$0.390       $1.750        88.4
$0.100       $0.400        88.3
$0.270       $0.410        88.0
$0.450       $2.150        87.9
$0.125       $1.000        87.7
$0.071       $0.100        87.4
$0.550       $2.200        87.1
$0.550       $2.200        87.1
$0.050       $0.400        86.9
$0.050       $0.400        86.9
$0.050       $0.400        86.9
$0.400       $2.000        86.7
$0.080       $0.240        85.8
$0.080       $0.240        85.8
$0.280       $0.900        85.7
$0.390       $1.700        85.5
$0.390       $1.700        85.5
$0.270       $0.950        85.0
$0.060       $0.200        84.6
$0.060       $0.200        84.6
$0.118       $0.950        84.5
$1.100       $4.400        84.4
$0.255       $1.000        84.4
$0.470       $2.000        84.4
$0.400       $0.800        84.2
$0.400       $0.800        84.2
$0.030       $0.100        84.1
$0.800       $3.200        83.9
$0.600       $2.200        83.8
$0.550       $2.000        83.5
$0.039       $0.100        82.1
$0.039       $0.100        82.1
$1.100       $4.400        82.0
$1.000       $5.000        82.0
$1.000       $5.000        82.0
$0.150       $0.600        81.8
$0.400       $2.000        81.5
$0.150       $0.580        81.2
$0.150       $0.580        81.2
$0.400       $2.000        80.5
$2.500       $10.000       79.6
$0.080       $0.300        79.3
$0.130       $0.850        77.9
$0.800       $4.000        77.0
$0.080       $0.160        76.6
$0.075       $0.200        76.5
$0.400       $1.760        74.3
$0.020       $0.040        72.7
$0.120       $0.390        72.4
$0.200       $0.500        72.1
$0.200       $0.500        71.6
$2.000       $6.000        70.3
$0.014       $0.028        69.8
$0.625       $5.000        69.8
$0.625       $5.000        69.8
$1.750       $14.000       67.6
$0.070       $0.280        64.8
$1.000       $1.000        64.2
$0.300       $2.500        61.6
$0.300       $2.500        61.6
$0.035       $0.140        61.5
$0.210       $0.790        60.5
$0.210       $0.790        60.5
$0.060       $0.240        59.0
$3.000       $15.000       53.8
$3.000       $15.000       53.4
$3.000       $15.000       53.4
$0.270       $0.410        52.8
$0.400       $2.000        48.1
$0.065       $0.140        39.4
$0.030       $0.050        36.0
$2.500       $10.000       28.2

Pricing from OpenRouter. Benchmarks from Artificial Analysis.


About BBH

Big-Bench Hard — challenging subset of BIG-Bench focusing on tasks where language models previously underperformed.

This leaderboard shows all models with BBH benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
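The price-versus-performance comparison can be sketched in a few lines. This is a minimal example, not part of the site's own tooling: it assumes a hypothetical 3:1 input-to-output token mix to blend the two prices (adjust the share for your workload), and uses a few (input $/M, output $/M, BBH) rows from the table above.

```python
# Sketch: comparing BBH score against blended price per million tokens.
# The 3:1 input:output token mix is an assumption, not part of the source data.

def blended_price(input_per_m: float, output_per_m: float,
                  input_share: float = 0.75) -> float:
    """Blended $ per million tokens for a given input/output token mix."""
    return input_share * input_per_m + (1 - input_share) * output_per_m

# A few rows from the table above: (input $/M, output $/M, BBH score).
rows = [
    (3.000, 15.000, 94.4),
    (0.720, 2.300, 94.3),
    (0.195, 0.900, 92.4),
]

for inp, out, bbh in rows:
    price = blended_price(inp, out)
    # "Value" here is simply BBH points per blended dollar per million tokens.
    print(f"${price:.3f}/M blended -> {bbh / price:.1f} BBH points per $/M")
```

Under this mix, the cheaper 94.3-scoring row delivers far more benchmark points per dollar than the $3.000/$15.000 leader, which is the kind of trade-off the pricing columns are meant to surface.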

Frequently Asked Questions

What is BBH?
Big-Bench Hard — challenging subset of BIG-Bench focusing on tasks where language models previously underperformed.

Which model leads the BBH leaderboard?
As of April 4, 2026, Claude Sonnet 4.5 leads the BBH leaderboard with a score of 94.4. Rankings change as new models are released and evaluated.

How many models have been evaluated?
Currently 109 models have been evaluated on BBH, with an average score of 80.7 and a standard deviation of 13.3.

How often is the data updated?
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.