Price Per Token

HumanEval Leaderboard

The OpenAI HumanEval benchmark measures Python code generation from function docstrings.

Data from LayerLens

As of March 15, 2026, the top-scoring model on HumanEval is Claude Sonnet 4.5 at 97.6%, followed by R1 at 97.4% and Grok 4 at 97.0%. 75 models have been evaluated on this benchmark.

Last updated: March 15, 2026

Models: 75 | Best Score: 97.6 | Average: 89.6 | Std Dev: 10.5
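The summary statistics above can be reproduced from the score column of the table below. A minimal sketch using Python's standard `statistics` module, assuming the site reports the population standard deviation (`pstdev`) rather than the sample one:

```python
import statistics

# All 75 HumanEval scores from the leaderboard table, highest to lowest.
scores = [
    97.6, 97.4, 97.0, 97.0, 97.0, 97.0, 97.0, 97.0, 97.0, 97.0, 97.0,
    96.3, 96.3, 96.3, 96.3, 96.3, 96.3, 96.3, 96.3, 96.2,
    95.7, 95.7, 95.7, 95.7, 95.7, 95.5, 95.5, 95.5,
    95.1, 95.1, 95.1, 94.5, 94.5, 93.9, 93.9,
    93.3, 93.3, 93.3, 93.3, 93.3, 93.3, 93.3, 92.7,
    92.1, 92.1, 92.1, 92.1, 91.5, 90.9, 90.9, 90.2, 90.2, 90.2,
    88.4, 88.4, 87.8, 87.2, 86.6, 85.4, 85.4, 84.8, 83.5, 82.9,
    82.3, 82.3, 81.1, 78.0, 77.4, 76.8, 75.6, 73.2,
    67.1, 51.2, 48.2, 47.6,
]

print(len(scores))                          # models: 75
print(max(scores))                          # best score: 97.6
print(round(statistics.mean(scores), 1))    # average: 89.6
print(round(statistics.pstdev(scores), 1))  # std dev (population): ~10.5
```

The population standard deviation lands at roughly 10.5, matching the header; the sample version (`statistics.stdev`) would come out slightly higher.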

Categories: Computer Science and Programming
Rank | Input $/M | Output $/M | HumanEval
---- | --------- | ---------- | ---------
1    | $3.000    | $15.000    | 97.6
2    | $0.550    | $2.190     | 97.4
3    | $3.000    | $15.000    | 97.0
4    | $3.000    | $15.000    | 97.0
5    | $2.000    | $12.000    | 97.0
6    | $5.000    | $25.000    | 97.0
7    | $5.000    | $25.000    | 97.0
8    | $5.000    | $25.000    | 97.0
9    | $5.000    | $25.000    | 97.0
10   | $0.720    | $2.300     | 97.0
11   | $0.720    | $2.300     | 97.0
12   | $1.100    | $4.400     | 96.3
13   | $3.000    | $15.000    | 96.3
14   | $3.000    | $15.000    | 96.3
15   | $15.000   | $75.000    | 96.3
16   | $15.000   | $75.000    | 96.3
17   | $1.000    | $5.000     | 96.3
18   | $0.390    | $0.900     | 96.3
19   | $0.390    | $0.900     | 96.3
20   | $3.000    | $15.000    | 96.2
21   | $0.300    | $0.500     | 95.7
22   | $0.080    | $0.240     | 95.7
23   | $0.080    | $0.240     | 95.7
24   | $15.000   | $75.000    | 95.7
25   | $15.000   | $75.000    | 95.7
26   | $0.060    | $0.140     | 95.5
27   | $0.150    | $0.400     | 95.5
28   | $0.150    | $0.400     | 95.5
29   | $0.550    | $2.200     | 95.1
30   | $1.250    | $10.000    | 95.1
31   | $0.200    | $1.100     | 95.1
32   | $1.250    | $10.000    | 94.5
33   | $1.250    | $10.000    | 94.5
34   | $1.000    | $5.000     | 93.9
35   | $0.260    | $0.380     | 93.9
36   | $2.000    | $8.000     | 93.3
37   | $0.080    | $0.280     | 93.3
38   | $0.080    | $0.280     | 93.3
39   | $0.450    | $2.150     | 93.3
40   | $0.050    | $0.200     | 93.3
41   | $0.050    | $0.200     | 93.3
42   | $0.260    | $1.560     | 93.3
43   | $0.220    | $0.900     | 92.7
44   | $3.000    | $15.000    | 92.1
45   | $0.071    | $0.100     | 92.1
46   | $0.060    | $0.400     | 92.1
47   | $0.060    | $0.400     | 92.1
48   | $3.000    | $15.000    | 91.5
49   | $3.000    | $15.000    | 90.9
50   | $3.000    | $15.000    | 90.9
51   | $0.400    | $0.800     | 90.2
52   | $0.400    | $0.800     | 90.2
53   | $0.500    | $1.500     | 90.2
54   | $2.500    | $10.000    | 88.4
55   | $0.400    | $2.000     | 88.4
56   | $0.100    | $0.400     | 87.8
57   | $0.014    | $0.028     | 87.2
58   | $0.280    | $0.900     | 86.6
59   | $0.030    | $0.110     | 85.4
60   | $0.400    | $2.000     | 85.4
61   | $0.150    | $0.600     | 84.8
62   | $0.060    | $0.180     | 83.5
63   | $2.500    | $10.000    | 82.9
64   | $0.120    | $0.390     | 82.3
65   | $2.000    | $6.000     | 82.3
66   | $0.080    | $0.300     | 81.1
67   | $0.060    | $0.240     | 78.0
68   | $0.070    | $0.280     | 77.4
69   | $0.800    | $3.200     | 76.8
70   | $0.800    | $4.000     | 75.6
71   | $0.020    | $0.040     | 73.2
72   | $0.900    | $0.900     | 67.1
73   | $1.000    | $1.000     | 51.2
74   | $2.500    | $10.000    | 48.2
75   | $0.030    | $0.050     | 47.6

Pricing from OpenRouter. Benchmarks from Artificial Analysis.
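Prices in the table are quoted in dollars per million tokens, so the cost of a single request is each token count times its per-million rate. A minimal sketch, using the top-ranked row's listed prices ($3.000 input / $15.000 output):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# 2,000 input tokens and 500 output tokens at $3/M in, $15/M out.
cost = request_cost(input_tokens=2_000, output_tokens=500,
                    input_per_m=3.0, output_per_m=15.0)
print(f"${cost:.4f}")  # 0.006 + 0.0075 = $0.0135
```

The token counts here are illustrative; actual counts depend on the tokenizer and prompt.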

108 out of our 483 tracked models have had a price change in March.


About HumanEval

The OpenAI HumanEval benchmark measures Python code generation from function docstrings.

This leaderboard shows all models with HumanEval benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
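One way to weigh performance against cost is points per blended dollar. This sketch uses the three top models named in the summary above, assuming the first 97.0 row ($3.000/$15.000) is Grok 4 and a 3:1 input:output token mix for the blended price; neither assumption is the site's own methodology.

```python
# Cost-effectiveness sketch: HumanEval score per blended $/M.
models = [
    # (name, input $/M, output $/M, HumanEval)
    ("Claude Sonnet 4.5", 3.000, 15.000, 97.6),
    ("R1",                0.550,  2.190, 97.4),
    ("Grok 4",            3.000, 15.000, 97.0),  # assumed: first 97.0 row
]

def blended_price(inp: float, out: float, input_share: float = 0.75) -> float:
    """Weighted per-million price, assuming 3 input tokens per output token."""
    return input_share * inp + (1 - input_share) * out

# Rank by score divided by blended price, best value first.
ranked = sorted(models, key=lambda m: m[3] / blended_price(m[1], m[2]), reverse=True)
for name, inp, out, score in ranked:
    print(f"{name}: {score / blended_price(inp, out):.1f} points per blended $/M")
```

With these numbers R1 comes out far ahead on value despite a marginally lower score, which is the kind of trade-off the pricing columns are meant to surface.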

Frequently Asked Questions

What is HumanEval?
HumanEval is an OpenAI benchmark that measures Python code generation from function docstrings.

Which model has the highest HumanEval score?
As of March 15, 2026, Claude Sonnet 4.5 leads the HumanEval leaderboard with a score of 97.6. Rankings change as new models are released and evaluated.

How many models have been evaluated?
Currently 75 models have been evaluated on HumanEval, with an average score of 89.6 and a standard deviation of 10.5.

How often is the data updated?
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.