Price Per Token

BBH Leaderboard

BIG-Bench Hard (BBH) is a challenging subset of BIG-Bench that focuses on tasks where language models previously underperformed.

Data from LayerLens

As of May 19, 2026, the top-scoring model on BBH is Claude Sonnet 4.5 at 94.4%, followed by a second Claude Sonnet 4.5 entry at 94.3% and GLM 5 at 94.3%. In total, 116 models have been evaluated on this benchmark.

Last updated: May 19, 2026

Models: 116
Best Score: 94.4
Average: 80.4
Std Dev: 13.3

Categories: Reasoning and Logic

Input ($/M tokens)  Output ($/M tokens)  BBH (%)
$3.000              $15.000              94.4
$3.000              $15.000              94.3
$0.600              $1.920               94.3
$0.600              $1.920               94.3
$2.000              $12.000              93.8
$2.000              $12.000              93.8
$0.260              $0.900               92.7
$0.260              $0.900               92.7
$0.780              $3.900               92.6
$0.780              $3.900               92.6
$0.195              $0.900               92.4
$0.195              $0.900               92.4
$0.400              $1.900               91.0
$0.400              $1.900               91.0
$3.000              $15.000              91.0
$3.000              $15.000              91.0
$1.250              $10.000              90.9
$1.250              $10.000              90.9
$1.250              $10.000              90.9
$1.250              $10.000              90.9
$0.140              $0.900               90.8
$0.140              $0.900               90.8
$0.065              $0.260               90.6
$0.260              $1.560               90.6
$3.000              $15.000              90.3
$3.000              $15.000              90.3
$3.000              $15.000              90.3
$0.200              $0.500               89.6
$2.000              $8.000               89.6
$0.080              $0.280               89.5
$0.080              $0.280               89.5
$0.390              $0.900               89.4
$0.390              $0.900               89.4
$0.200              $0.500               89.3
$0.900              $0.900               89.0
$1.000              $10.000              88.7
$0.252              $0.378               88.5
$0.400              $1.750               88.4
$0.400              $1.750               88.4
$0.100              $0.400               88.3
$0.270              $0.410               88.0
$0.500              $2.150               87.9
$0.250              $2.000               87.7
$0.250              $2.000               87.7
$0.071              $0.100               87.4
$0.550              $2.200               87.1
$0.550              $2.200               87.1
$0.050              $0.400               86.9
$0.050              $0.400               86.9
$0.050              $0.400               86.9
$0.400              $2.000               86.7
$0.080              $0.280               85.8
$0.080              $0.280               85.8
$0.280              $0.900               85.7
$0.390              $1.740               85.5
$0.390              $1.740               85.5
$0.290              $0.950               85.0
$0.060              $0.200               84.6
$0.060              $0.200               84.6
$0.150              $1.150               84.5
$1.100              $4.400               84.4
$0.255              $1.000               84.4
$0.600              $2.500               84.4
$0.455              $0.900               84.2
$0.455              $0.900               84.2
$0.030              $0.140               84.1
$0.030              $0.140               84.1
$0.800              $3.200               83.9
$0.600              $2.200               83.8
$0.550              $2.000               83.5
$0.039              $0.180               82.1
$0.039              $0.180               82.1
$1.100              $4.400               82.0
$1.000              $5.000               82.0
$1.000              $5.000               82.0
$0.150              $0.600               81.8
$0.400              $2.000               81.5
$0.900              $0.900               81.2
$0.900              $0.900               81.2
$0.400              $2.000               80.5
$2.500              $10.000              79.6
$0.080              $0.300               79.3
$0.130              $0.850               77.9
$0.800              $4.000               77.0
$0.080              $0.160               76.6
$0.075              $0.200               76.5
$0.400              $2.200               74.3
$0.400              $2.200               74.3
$0.060              $0.120               72.7
$0.360              $0.400               72.4
$0.200              $0.500               72.1
$0.200              $0.500               71.6
$2.000              $6.000               70.3
$0.014              $0.028               69.8
$0.625              $5.000               69.8
$0.625              $5.000               69.8
$0.875              $7.000               67.6
$0.875              $7.000               67.6
$0.070              $0.280               64.8
$1.000              $1.000               64.2
$0.300              $2.500               61.6
$0.300              $2.500               61.6
$0.300              $2.500               61.6
$0.300              $2.500               61.6
$0.035              $0.140               61.5
$0.270              $0.950               60.5
$0.270              $0.950               60.5
$0.060              $0.240               59.0
$3.000              $15.000              53.8
$3.000              $15.000              53.4
$3.000              $15.000              53.4
$0.270              $0.410               52.8
$0.400              $2.000               48.1
$0.065              $0.140               39.4
$0.030              $0.050               36.0
$2.500              $10.000              28.2

Pricing from OpenRouter. Benchmarks from Artificial Analysis.


About BBH


This leaderboard shows all models with BBH benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
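One way to compare performance against cost is to divide each BBH score by a blended price. The sketch below is illustrative only: the 75/25 input-to-output weighting (`input_share`) and the helper names `Row`, `blended_price`, and `score_per_dollar` are assumptions, not something the leaderboard defines. The rows are (input price, output price, BBH) triples copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens
    bbh: float           # BBH score in percent


def blended_price(row: Row, input_share: float = 0.75) -> float:
    """Weighted average price per million tokens.

    input_share is an assumed fraction of input tokens in a workload;
    adjust it to match your own traffic.
    """
    return input_share * row.input_per_m + (1.0 - input_share) * row.output_per_m


def score_per_dollar(row: Row, input_share: float = 0.75) -> float:
    """BBH points per blended dollar (per million tokens)."""
    return row.bbh / blended_price(row, input_share)


# A few rows copied from the leaderboard table.
rows = [
    Row(3.000, 15.000, 94.4),
    Row(0.600, 1.920, 94.3),
    Row(0.030, 0.140, 84.1),
    Row(2.500, 10.000, 28.2),
]

# Highest score-per-dollar first.
ranked = sorted(rows, key=score_per_dollar, reverse=True)
for row in ranked:
    print(f"BBH {row.bbh:5.1f}  blended ${blended_price(row):.4f}/M  "
          f"{score_per_dollar(row):8.1f} points per $")
```

A ratio like this favors very cheap models, so it is probably best read alongside the raw BBH column rather than instead of it.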

Frequently Asked Questions

As of May 19, 2026, Claude Sonnet 4.5 leads the BBH leaderboard with a score of 94.4. Rankings change as new models are released and evaluated.
Currently, 116 models have been evaluated on BBH, with an average score of 80.4 and a standard deviation of 13.3.
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.
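The summary figures quoted above (model count, best score, average, standard deviation) can be recomputed from the BBH column. The sketch below is a minimal illustration, assuming a population standard deviation; the page does not say whether it divides by N or N - 1, and the helper name `summarize` is hypothetical. The `sample` list is a small subset of the scores, not the full 116 rows.

```python
import statistics


def summarize(scores: list[float]) -> dict[str, float]:
    """Return count, best, mean, and population standard deviation."""
    return {
        "models": len(scores),
        "best": max(scores),
        "average": statistics.fmean(scores),
        # pstdev divides by N; statistics.stdev would divide by N - 1.
        "std_dev": statistics.pstdev(scores),
    }


# Illustrative subset of the BBH column, not the full leaderboard.
sample = [94.4, 94.3, 93.8, 92.7, 90.9, 80.5, 61.6, 28.2]
stats = summarize(sample)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Running `summarize` over all 116 scores and comparing against the published 80.4 and 13.3 would indicate which standard deviation convention the site uses.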