Price Per Token

BBEH Leaderboard

Big Bench Extra Hard (BBEH) is a set of reasoning tasks, even more challenging than those in BIG-Bench Hard, that push the limits of language model capabilities.

Data from LayerLens

Models: 34
Best Score: 64.1
Average: 24.3
Std Dev: 13.8

Categories: Reasoning and Logic
Provider  Model  Input $/M  Output $/M  BBEH
                 $1.250     $10.000     64.1
                 $0.250     $2.000      54.9
                 $5.000     $25.000     49.2
                 $5.000     $25.000     49.2
                 $15.000    $75.000     38.4
                 $15.000    $75.000     38.4
                 $3.000     $15.000     37.1
                 $3.000     $15.000     34.7
                 $3.000     $15.000     34.7
                 $0.050     $0.400      29.3
                 $0.060     $0.400      26.6
                 $0.060     $0.400      26.6
                 $0.150     $0.600      25.6
                 $0.080     $0.300      19.5
                 $4.000     $4.000      19.1
                 $0.500     $1.500      18.7
                 $0.100     $0.400      18.5
                 $3.000     $15.000     18.0
                 $2.000     $8.000      17.7
                 $2.500     $10.000     17.1
                 $2.500     $10.000     17.1
                 $0.800     $4.000      16.7
                 $0.800     $3.200      15.9
                 $0.320     $0.890      15.4
                 $2.000     $6.000      14.8
                 $0.060     $0.180      14.2
                 $0.400     $2.000      14.1
                 $0.400     $0.400      14.0
                 $0.035     $0.140      13.2
                 $0.100     $0.300      11.9
                 $1.000     $3.000      11.2
                 $1.000     $3.000      11.2
                 $0.060     $0.140      10.1
                 $2.500     $10.000     10.0
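
The page does not say how its summary figures are derived. The sketch below assumes they are taken over the 34 BBEH scores listed above and that "Std Dev" means the population standard deviation; both are assumptions, not something the page states. Under those assumptions the mean and standard deviation round to the 24.3 and 13.8 shown at the top of the page.

```python
import statistics

# BBEH scores copied from the leaderboard table, highest to lowest.
bbeh_scores = [
    64.1, 54.9, 49.2, 49.2, 38.4, 38.4, 37.1, 34.7, 34.7, 29.3,
    26.6, 26.6, 25.6, 19.5, 19.1, 18.7, 18.5, 18.0, 17.7, 17.1,
    17.1, 16.7, 15.9, 15.4, 14.8, 14.2, 14.1, 14.0, 13.2, 11.9,
    11.2, 11.2, 10.1, 10.0,
]

model_count = len(bbeh_scores)
best_score = max(bbeh_scores)
mean_score = statistics.mean(bbeh_scores)

# Assumption: population standard deviation (divide by n, not n - 1).
std_dev = statistics.pstdev(bbeh_scores)

print(f"Models: {model_count}")
print(f"Best Score: {best_score:.1f}")
print(f"Average: {mean_score:.1f}")
print(f"Std Dev: {std_dev:.1f}")
```

Swapping `statistics.pstdev` for `statistics.stdev` gives the sample form, which comes out slightly larger on the same data.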

Pricing from OpenRouter. Benchmarks from Artificial Analysis.

About BBEH

This leaderboard shows all models with BBEH benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
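
One way to compare performance against cost is to divide each BBEH score by a blended price per million tokens. The sketch below is illustrative only: the `Row` type, the `blended_price` and `points_per_dollar` helpers, and the 75/25 input-to-output token mix are assumptions introduced here, not something the leaderboard defines. The four sample rows are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical container for this sketch)."""
    input_usd_per_m: float   # Input $/M
    output_usd_per_m: float  # Output $/M
    bbeh: float              # BBEH score


def blended_price(row: Row, input_share: float = 0.75) -> float:
    """Weighted price per million tokens.

    Assumption: 75% of tokens are input and 25% are output. Adjust
    input_share to match the workload being priced.
    """
    return (input_share * row.input_usd_per_m
            + (1.0 - input_share) * row.output_usd_per_m)


def points_per_dollar(row: Row, input_share: float = 0.75) -> float:
    """BBEH points per blended dollar per million tokens."""
    return row.bbeh / blended_price(row, input_share)


# Sample rows taken from the leaderboard table.
rows = [
    Row(1.250, 10.000, 64.1),
    Row(0.250, 2.000, 54.9),
    Row(5.000, 25.000, 49.2),
    Row(0.050, 0.400, 29.3),
]

# Rank by value for money rather than by raw score.
ranked = sorted(rows, key=points_per_dollar, reverse=True)

for row in ranked:
    print(f"BBEH {row.bbeh:5.1f}  "
          f"blended ${blended_price(row):.4f}/M  "
          f"{points_per_dollar(row):.1f} points per dollar")
```

A ratio like this rewards very cheap models heavily, so it is best read alongside the raw score rather than in place of it.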