Big-Bench Hard — challenging subset of BIG-Bench focusing on tasks where language models previously underperformed.
Data from LayerLens
As of May 19, 2026, the top-scoring model on BBH is Claude Sonnet 4.5 at 94.4%, followed by a second Claude Sonnet 4.5 entry at 94.3% and GLM 5 at 94.3%. In total, 116 models have been evaluated on this benchmark.
Last updated: May 19, 2026
Models: 116
Best Score: 94.4
Average: 80.4
Std Dev: 13.3
| Provider | Model | Input $/M | Output $/M | BBH |
|---|---|---|---|---|
| | | $3.000 | $15.000 | 94.4 |
| | | $3.000 | $15.000 | 94.3 |
| | | $0.600 | $1.920 | 94.3 |
| | | $0.600 | $1.920 | 94.3 |
| | | $2.000 | $12.000 | 93.8 |
| | | $2.000 | $12.000 | 93.8 |
| | | $0.260 | $0.900 | 92.7 |
| | | $0.260 | $0.900 | 92.7 |
| | | $0.780 | $3.900 | 92.6 |
| | | $0.780 | $3.900 | 92.6 |
| | | $0.195 | $0.900 | 92.4 |
| | | $0.195 | $0.900 | 92.4 |
| | | $0.400 | $1.900 | 91.0 |
| | | $0.400 | $1.900 | 91.0 |
| | | $3.000 | $15.000 | 91.0 |
| | | $3.000 | $15.000 | 91.0 |
| | | $1.250 | $10.000 | 90.9 |
| | | $1.250 | $10.000 | 90.9 |
| | | $1.250 | $10.000 | 90.9 |
| | | $1.250 | $10.000 | 90.9 |
| | | $0.140 | $0.900 | 90.8 |
| | | $0.140 | $0.900 | 90.8 |
| | | $0.065 | $0.260 | 90.6 |
| | | $0.260 | $1.560 | 90.6 |
| | | $3.000 | $15.000 | 90.3 |
| | | $3.000 | $15.000 | 90.3 |
| | | $3.000 | $15.000 | 90.3 |
| | | $0.200 | $0.500 | 89.6 |
| | | $2.000 | $8.000 | 89.6 |
| | | $0.080 | $0.280 | 89.5 |
| | | $0.080 | $0.280 | 89.5 |
| | | $0.390 | $0.900 | 89.4 |
| | | $0.390 | $0.900 | 89.4 |
| | | $0.200 | $0.500 | 89.3 |
| | | $0.900 | $0.900 | 89.0 |
| | | $1.000 | $10.000 | 88.7 |
| | | $0.252 | $0.378 | 88.5 |
| | | $0.400 | $1.750 | 88.4 |
| | | $0.400 | $1.750 | 88.4 |
| | | $0.100 | $0.400 | 88.3 |
| | | $0.270 | $0.410 | 88.0 |
| | | $0.500 | $2.150 | 87.9 |
| | | $0.250 | $2.000 | 87.7 |
| | | $0.250 | $2.000 | 87.7 |
| | | $0.071 | $0.100 | 87.4 |
| | | $0.550 | $2.200 | 87.1 |
| | | $0.550 | $2.200 | 87.1 |
| | | $0.050 | $0.400 | 86.9 |
| | | $0.050 | $0.400 | 86.9 |
| | | $0.050 | $0.400 | 86.9 |
| | | $0.400 | $2.000 | 86.7 |
| | | $0.080 | $0.280 | 85.8 |
| | | $0.080 | $0.280 | 85.8 |
| | | $0.280 | $0.900 | 85.7 |
| | | $0.390 | $1.740 | 85.5 |
| | | $0.390 | $1.740 | 85.5 |
| | | $0.290 | $0.950 | 85.0 |
| | | $0.060 | $0.200 | 84.6 |
| | | $0.060 | $0.200 | 84.6 |
| | | $0.150 | $1.150 | 84.5 |
| | | $1.100 | $4.400 | 84.4 |
| | | $0.255 | $1.000 | 84.4 |
| | | $0.600 | $2.500 | 84.4 |
| | | $0.455 | $0.900 | 84.2 |
| | | $0.455 | $0.900 | 84.2 |
| | | $0.030 | $0.140 | 84.1 |
| | | $0.030 | $0.140 | 84.1 |
| | | $0.800 | $3.200 | 83.9 |
| | | $0.600 | $2.200 | 83.8 |
| | | $0.550 | $2.000 | 83.5 |
| | | $0.039 | $0.180 | 82.1 |
| | | $0.039 | $0.180 | 82.1 |
| | | $1.100 | $4.400 | 82.0 |
| | | $1.000 | $5.000 | 82.0 |
| | | $1.000 | $5.000 | 82.0 |
| | | $0.150 | $0.600 | 81.8 |
| | | $0.400 | $2.000 | 81.5 |
| | | $0.900 | $0.900 | 81.2 |
| | | $0.900 | $0.900 | 81.2 |
| | | $0.400 | $2.000 | 80.5 |
| | | $2.500 | $10.000 | 79.6 |
| | | $0.080 | $0.300 | 79.3 |
| | | $0.130 | $0.850 | 77.9 |
| | | $0.800 | $4.000 | 77.0 |
| | | $0.080 | $0.160 | 76.6 |
| | | $0.075 | $0.200 | 76.5 |
| | | $0.400 | $2.200 | 74.3 |
| | | $0.400 | $2.200 | 74.3 |
| | | $0.060 | $0.120 | 72.7 |
| | | $0.360 | $0.400 | 72.4 |
| | | $0.200 | $0.500 | 72.1 |
| | | $0.200 | $0.500 | 71.6 |
| | | $2.000 | $6.000 | 70.3 |
| | | $0.014 | $0.028 | 69.8 |
| | | $0.625 | $5.000 | 69.8 |
| | | $0.625 | $5.000 | 69.8 |
| | | $0.875 | $7.000 | 67.6 |
| | | $0.875 | $7.000 | 67.6 |
| | | $0.070 | $0.280 | 64.8 |
| | | $1.000 | $1.000 | 64.2 |
| | | $0.300 | $2.500 | 61.6 |
| | | $0.300 | $2.500 | 61.6 |
| | | $0.300 | $2.500 | 61.6 |
| | | $0.300 | $2.500 | 61.6 |
| | | $0.035 | $0.140 | 61.5 |
| | | $0.270 | $0.950 | 60.5 |
| | | $0.270 | $0.950 | 60.5 |
| | | $0.060 | $0.240 | 59.0 |
| | | $3.000 | $15.000 | 53.8 |
| | | $3.000 | $15.000 | 53.4 |
| | | $3.000 | $15.000 | 53.4 |
| | | $0.270 | $0.410 | 52.8 |
| | | $0.400 | $2.000 | 48.1 |
| | | $0.065 | $0.140 | 39.4 |
| | | $0.030 | $0.050 | 36.0 |
| | | $2.500 | $10.000 | 28.2 |
Pricing from OpenRouter. Benchmarks from Artificial Analysis.
This leaderboard shows all models with BBH benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
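One way to compare performance against cost is to collapse the two price columns into a single blended price and divide the BBH score by it. The sketch below is illustrative only: the names `Row`, `blended_price`, and `points_per_dollar` are hypothetical, and the 3:1 input-to-output token mix is an assumption, not something this leaderboard specifies. The four sample rows are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens
    bbh: float           # BBH score, in percent


def blended_price(row: Row, input_share: float = 0.75) -> float:
    """Weighted price per million tokens.

    input_share=0.75 encodes an assumed 3:1 input:output token mix;
    adjust it to match your own workload.
    """
    return input_share * row.input_per_m + (1.0 - input_share) * row.output_per_m


def points_per_dollar(row: Row, input_share: float = 0.75) -> float:
    """BBH points per blended dollar (per million tokens)."""
    return row.bbh / blended_price(row, input_share)


# A few rows taken from the leaderboard table.
rows = [
    Row(3.000, 15.000, 94.4),
    Row(0.600, 1.920, 94.3),
    Row(0.260, 0.900, 92.7),
    Row(0.030, 0.140, 84.1),
]

# Highest score per dollar first.
ranked = sorted(rows, key=points_per_dollar, reverse=True)

for r in ranked:
    print(f"BBH {r.bbh:5.1f}  blended ${blended_price(r):.4f}/M  "
          f"{points_per_dollar(r):8.1f} points per $")
```

A ratio like this favors very cheap models heavily, so it is probably best read alongside a minimum acceptable score rather than on its own.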
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.