The OpenAI HumanEval benchmark measures Python code generation from function docstrings.
Data from LayerLens
As of April 18, 2026, the top-scoring model on HumanEval is Claude Sonnet 4.5 at 97.6%, followed by R1 at 97.4% and Grok 4 at 97.0%. 75 models have been evaluated on this benchmark.
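Each HumanEval task supplies a function signature and docstring, and the model must generate a body that passes the task's unit tests. The first task in the suite, HumanEval/0, looks roughly like this; the body shown is one possible passing solution, not the official reference implementation:

```python
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer to
    each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # One passing solution: compare every ordered pair of distinct elements.
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False
```

A score of 97.6 means the model's generated bodies passed the hidden tests for 97.6% of the 164 problems.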
Last updated: April 18, 2026
Models: 75
Best Score: 97.6
Average: 89.7
Std Dev: 10.6
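The summary statistics above can be reproduced from the score column with the standard library. The list below is a small illustrative subset, not the full 75-model leaderboard, so its numbers will differ from the page's:

```python
import statistics

# Illustrative subset of HumanEval scores; the full leaderboard has 75 entries.
scores = [97.6, 97.4, 97.0, 89.7, 73.2, 47.6]

best = max(scores)
avg = statistics.mean(scores)
std = statistics.pstdev(scores)  # population std dev; a sample estimate would use stdev()

print(f"Best: {best:.1f}  Average: {avg:.1f}  Std Dev: {std:.1f}")
```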
| Model | Input $/M | Output $/M | HumanEval |
|---|---|---|---|
| Claude Sonnet 4.5 | $3.000 | $15.000 | 97.6 |
| R1 | $0.550 | $2.000 | 97.4 |
| Grok 4 | $3.000 | $15.000 | 97.0 |
| | $3.000 | $15.000 | 97.0 |
| | $2.000 | $12.000 | 97.0 |
| | $2.000 | $12.000 | 97.0 |
| | $5.000 | $25.000 | 97.0 |
| | $5.000 | $25.000 | 97.0 |
| | $5.000 | $25.000 | 97.0 |
| | $5.000 | $25.000 | 97.0 |
| | $0.720 | $2.300 | 97.0 |
| | $0.720 | $2.300 | 97.0 |
| | $1.100 | $4.400 | 96.3 |
| | $3.000 | $15.000 | 96.3 |
| | $3.000 | $15.000 | 96.3 |
| | $15.000 | $75.000 | 96.3 |
| | $15.000 | $75.000 | 96.3 |
| | $1.000 | $5.000 | 96.3 |
| | $0.390 | $0.900 | 96.3 |
| | $0.390 | $0.900 | 96.3 |
| | $3.000 | $15.000 | 96.2 |
| | $0.300 | $0.500 | 95.7 |
| | $0.080 | $0.240 | 95.7 |
| | $0.080 | $0.240 | 95.7 |
| | $15.000 | $75.000 | 95.7 |
| | $15.000 | $75.000 | 95.7 |
| | $0.065 | $0.140 | 95.5 |
| | $0.150 | $0.580 | 95.5 |
| | $0.150 | $0.580 | 95.5 |
| | $0.550 | $2.200 | 95.1 |
| | $1.250 | $10.000 | 95.1 |
| | $0.200 | $1.100 | 95.1 |
| | $1.250 | $10.000 | 94.5 |
| | $1.250 | $10.000 | 94.5 |
| | $1.000 | $5.000 | 93.9 |
| | $0.260 | $0.380 | 93.9 |
| | $2.000 | $8.000 | 93.3 |
| | $0.080 | $0.280 | 93.3 |
| | $0.080 | $0.280 | 93.3 |
| | $0.500 | $2.150 | 93.3 |
| | $0.050 | $0.200 | 93.3 |
| | $0.050 | $0.200 | 93.3 |
| | $0.260 | $1.560 | 93.3 |
| | $0.220 | $0.900 | 92.7 |
| | $3.000 | $15.000 | 92.1 |
| | $0.071 | $0.100 | 92.1 |
| | $0.060 | $0.400 | 92.1 |
| | $0.060 | $0.400 | 92.1 |
| | $3.000 | $15.000 | 91.5 |
| | $3.000 | $15.000 | 90.9 |
| | $3.000 | $15.000 | 90.9 |
| | $0.455 | $0.900 | 90.2 |
| | $0.455 | $0.900 | 90.2 |
| | $0.500 | $1.500 | 90.2 |
| | $0.400 | $2.000 | 88.4 |
| | $0.100 | $0.400 | 87.8 |
| | $0.014 | $0.028 | 87.2 |
| | $0.280 | $0.900 | 86.6 |
| | $0.080 | $0.160 | 85.4 |
| | $0.400 | $2.000 | 85.4 |
| | $0.150 | $0.600 | 84.8 |
| | $0.075 | $0.200 | 83.5 |
| | $2.500 | $10.000 | 82.9 |
| | $0.120 | $0.390 | 82.3 |
| | $2.000 | $6.000 | 82.3 |
| | $0.080 | $0.300 | 81.1 |
| | $0.060 | $0.240 | 78.0 |
| | $0.070 | $0.280 | 77.4 |
| | $0.800 | $3.200 | 76.8 |
| | $0.800 | $4.000 | 75.6 |
| | $0.060 | $0.120 | 73.2 |
| | $0.900 | $0.900 | 67.1 |
| | $1.000 | $1.000 | 51.2 |
| | $2.500 | $10.000 | 48.2 |
| | $0.030 | $0.050 | 47.6 |
Pricing from OpenRouter. Benchmarks from Artificial Analysis.
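One way to use the pricing columns alongside the scores: compute a blended per-million-token price and divide the HumanEval score by it. The 75% input / 25% output token mix below is an assumption (the page doesn't specify a ratio), and the model names are paired with the table's top rows per the summary at the top of the page:

```python
# Score-per-dollar sketch. Prices and scores come from the table above;
# the 75% input / 25% output token mix is an assumption, not from the page.
MODELS = {
    "Claude Sonnet 4.5": (3.000, 15.000, 97.6),
    "R1": (0.550, 2.000, 97.4),
    "Grok 4": (3.000, 15.000, 97.0),
}

def blended_price(input_per_m: float, output_per_m: float,
                  input_share: float = 0.75) -> float:
    """Blended $/M tokens, weighting input vs. output token prices."""
    return input_share * input_per_m + (1 - input_share) * output_per_m

for name, (inp, out, score) in MODELS.items():
    price = blended_price(inp, out)
    print(f"{name}: ${price:.2f}/M blended, {score / price:.1f} score points per $/M")
```

On these numbers R1's much lower pricing gives it by far the highest score-per-dollar despite trailing the top score by 0.2 points.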
This leaderboard shows all models with HumanEval benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.