Price Per Token

HumanEval Leaderboard

OpenAI HumanEval benchmark measuring Python code generation from function docstrings.

Data from LayerLens

As of April 18, 2026, the top-scoring model on HumanEval is Claude Sonnet 4.5 at 97.6%, followed by R1 at 97.4% and Grok 4 at 97.0%. In total, 75 models have been evaluated on this benchmark.

Last updated: April 18, 2026

Models: 75
Best Score: 97.6
Average: 89.7
Std Dev: 10.6
Categories: Computer Science and Programming
| Provider | Model | Input $/M | Output $/M | HumanEval |
|---|---|---|---|---|
| Anthropic | Claude Sonnet 4.5 | $3.000 | $15.000 | 97.6 |
| DeepSeek | R1 | $0.550 | $2.000 | 97.4 |
| xAI | Grok 4 | $3.000 | $15.000 | 97.0 |
|  |  | $3.000 | $15.000 | 97.0 |
|  |  | $2.000 | $12.000 | 97.0 |
|  |  | $2.000 | $12.000 | 97.0 |
|  |  | $5.000 | $25.000 | 97.0 |
|  |  | $5.000 | $25.000 | 97.0 |
|  |  | $5.000 | $25.000 | 97.0 |
|  |  | $5.000 | $25.000 | 97.0 |
|  |  | $0.720 | $2.300 | 97.0 |
|  |  | $0.720 | $2.300 | 97.0 |
|  |  | $1.100 | $4.400 | 96.3 |
|  |  | $3.000 | $15.000 | 96.3 |
|  |  | $3.000 | $15.000 | 96.3 |
|  |  | $15.000 | $75.000 | 96.3 |
|  |  | $15.000 | $75.000 | 96.3 |
|  |  | $1.000 | $5.000 | 96.3 |
|  |  | $0.390 | $0.900 | 96.3 |
|  |  | $0.390 | $0.900 | 96.3 |
|  |  | $3.000 | $15.000 | 96.2 |
|  |  | $0.300 | $0.500 | 95.7 |
|  |  | $0.080 | $0.240 | 95.7 |
|  |  | $0.080 | $0.240 | 95.7 |
|  |  | $15.000 | $75.000 | 95.7 |
|  |  | $15.000 | $75.000 | 95.7 |
|  |  | $0.065 | $0.140 | 95.5 |
|  |  | $0.150 | $0.580 | 95.5 |
|  |  | $0.150 | $0.580 | 95.5 |
|  |  | $0.550 | $2.200 | 95.1 |
|  |  | $1.250 | $10.000 | 95.1 |
|  |  | $0.200 | $1.100 | 95.1 |
|  |  | $1.250 | $10.000 | 94.5 |
|  |  | $1.250 | $10.000 | 94.5 |
|  |  | $1.000 | $5.000 | 93.9 |
|  |  | $0.260 | $0.380 | 93.9 |
|  |  | $2.000 | $8.000 | 93.3 |
|  |  | $0.080 | $0.280 | 93.3 |
|  |  | $0.080 | $0.280 | 93.3 |
|  |  | $0.500 | $2.150 | 93.3 |
|  |  | $0.050 | $0.200 | 93.3 |
|  |  | $0.050 | $0.200 | 93.3 |
|  |  | $0.260 | $1.560 | 93.3 |
|  |  | $0.220 | $0.900 | 92.7 |
|  |  | $3.000 | $15.000 | 92.1 |
|  |  | $0.071 | $0.100 | 92.1 |
|  |  | $0.060 | $0.400 | 92.1 |
|  |  | $0.060 | $0.400 | 92.1 |
|  |  | $3.000 | $15.000 | 91.5 |
|  |  | $3.000 | $15.000 | 90.9 |
|  |  | $3.000 | $15.000 | 90.9 |
|  |  | $0.455 | $0.900 | 90.2 |
|  |  | $0.455 | $0.900 | 90.2 |
|  |  | $0.500 | $1.500 | 90.2 |
|  |  | $0.400 | $2.000 | 88.4 |
|  |  | $0.100 | $0.400 | 87.8 |
|  |  | $0.014 | $0.028 | 87.2 |
|  |  | $0.280 | $0.900 | 86.6 |
|  |  | $0.080 | $0.160 | 85.4 |
|  |  | $0.400 | $2.000 | 85.4 |
|  |  | $0.150 | $0.600 | 84.8 |
|  |  | $0.075 | $0.200 | 83.5 |
|  |  | $2.500 | $10.000 | 82.9 |
|  |  | $0.120 | $0.390 | 82.3 |
|  |  | $2.000 | $6.000 | 82.3 |
|  |  | $0.080 | $0.300 | 81.1 |
|  |  | $0.060 | $0.240 | 78.0 |
|  |  | $0.070 | $0.280 | 77.4 |
|  |  | $0.800 | $3.200 | 76.8 |
|  |  | $0.800 | $4.000 | 75.6 |
|  |  | $0.060 | $0.120 | 73.2 |
|  |  | $0.900 | $0.900 | 67.1 |
|  |  | $1.000 | $1.000 | 51.2 |
|  |  | $2.500 | $10.000 | 48.2 |
|  |  | $0.030 | $0.050 | 47.6 |

Pricing from OpenRouter. Benchmarks from Artificial Analysis.


About HumanEval

OpenAI HumanEval benchmark measuring Python code generation from function docstrings.
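HumanEval results are conventionally reported as pass@k: the probability that at least one of k generated samples passes the problem's unit tests (leaderboard scores like those above are typically pass@1). The original HumanEval paper (Chen et al., 2021) computes this with an unbiased estimator from n samples per problem, c of which are correct; a minimal sketch of that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), i.e. the probability that a random
    size-k subset of the n samples contains at least one correct one."""
    if n - c < k:
        return 1.0  # too few incorrect samples: every size-k draw succeeds
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples generated for a problem, 7 of which pass the tests:
print(pass_at_k(10, 7, 1))  # 0.7 — with k = 1 this is just the fraction correct
```

Averaging `pass_at_k` over all 164 HumanEval problems gives the benchmark score; multiplying by 100 yields the percentage figures shown on this page.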

This leaderboard shows all models with HumanEval benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
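One simple way to combine the score and pricing columns is to blend input and output prices into a single $/M figure and divide the HumanEval score by it. The sketch below is a hypothetical helper, not part of the leaderboard's methodology; the 3:1 input-to-output token ratio is an assumption you should adjust to your own workload:

```python
# Hypothetical cost-effectiveness sketch. Blends the Input $/M and
# Output $/M columns at an assumed 3:1 input-to-output token ratio,
# then divides the HumanEval score by the blended price.

def blended_price(input_per_m: float, output_per_m: float,
                  input_ratio: float = 3.0, output_ratio: float = 1.0) -> float:
    """Weighted-average $/M tokens for a given input:output token mix."""
    total = input_ratio + output_ratio
    return (input_per_m * input_ratio + output_per_m * output_ratio) / total

def score_per_dollar(score: float, input_per_m: float, output_per_m: float) -> float:
    """HumanEval points per blended dollar (higher = more cost-effective)."""
    return score / blended_price(input_per_m, output_per_m)

# Two rows taken from the table above: the top scorer ($3.000/$15.000, 97.6)
# and one of the cheapest high scorers ($0.050/$0.200, 93.3).
top = score_per_dollar(97.6, 3.000, 15.000)   # blended price $6.000/M
budget = score_per_dollar(93.3, 0.050, 0.200) # blended price $0.0875/M
print(f"top scorer: {top:.1f} pts/$  budget: {budget:.1f} pts/$")
```

As the example shows, a few points of benchmark score can cost orders of magnitude more per token, which is exactly the trade-off this table is meant to surface.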

Frequently Asked Questions

What does HumanEval measure?
OpenAI HumanEval is a benchmark measuring Python code generation from function docstrings.

Which model currently leads the leaderboard?
As of April 18, 2026, Claude Sonnet 4.5 leads the HumanEval leaderboard with a score of 97.6. Rankings change as new models are released and evaluated.

How many models have been evaluated?
Currently 75 models have been evaluated on HumanEval, with an average score of 89.7 and a standard deviation of 10.6.

How often is the data updated?
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.