Price Per Token

HumanEval Leaderboard

The OpenAI HumanEval benchmark measures Python code generation from function docstrings.

Data from LayerLens

As of March 15, 2026, the top-scoring model on HumanEval is Claude Sonnet 4.5 at 97.6%, followed by R1 at 97.4% and Grok 4 at 97.0%. 75 models have been evaluated on this benchmark.

Last updated: March 15, 2026

Models: 75 | Best Score: 97.6 | Average: 89.6 | Std Dev: 10.5
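The summary statistics above can be reproduced from the score column of the table below. A minimal sketch using Python's standard `statistics` module, assuming the site reports the population standard deviation (`pstdev`) rather than the sample one:

```python
import statistics

# All 75 HumanEval scores from the leaderboard table, highest to lowest.
scores = [
    97.6, 97.4, 97.0, 97.0, 97.0, 97.0, 97.0, 97.0, 97.0, 97.0, 97.0,
    96.3, 96.3, 96.3, 96.3, 96.3, 96.3, 96.3, 96.3, 96.2,
    95.7, 95.7, 95.7, 95.7, 95.7, 95.5, 95.5, 95.5,
    95.1, 95.1, 95.1, 94.5, 94.5, 93.9, 93.9,
    93.3, 93.3, 93.3, 93.3, 93.3, 93.3, 93.3, 92.7,
    92.1, 92.1, 92.1, 92.1, 91.5, 90.9, 90.9, 90.2, 90.2, 90.2,
    88.4, 88.4, 87.8, 87.2, 86.6, 85.4, 85.4, 84.8, 83.5, 82.9,
    82.3, 82.3, 81.1, 78.0, 77.4, 76.8, 75.6, 73.2,
    67.1, 51.2, 48.2, 47.6,
]

print(len(scores))                          # models: 75
print(max(scores))                          # best score: 97.6
print(round(statistics.mean(scores), 1))    # average: 89.6
print(round(statistics.pstdev(scores), 1))  # std dev (population): ~10.5
```

The population standard deviation lands at roughly 10.5, matching the header; the sample version (`statistics.stdev`) would come out slightly higher.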

Categories: Computer Science and Programming
Rank | Input $/M | Output $/M | HumanEval
---- | --------- | ---------- | ---------
1    | $3.000    | $15.000    | 97.6
2    | $0.550    | $2.190     | 97.4
3    | $3.000    | $15.000    | 97.0
4    | $3.000    | $15.000    | 97.0
5    | $2.000    | $12.000    | 97.0
6    | $5.000    | $25.000    | 97.0
7    | $5.000    | $25.000    | 97.0
8    | $5.000    | $25.000    | 97.0
9    | $5.000    | $25.000    | 97.0
10   | $0.720    | $2.300     | 97.0
11   | $0.720    | $2.300     | 97.0
12   | $1.100    | $4.400     | 96.3
13   | $3.000    | $15.000    | 96.3
14   | $3.000    | $15.000    | 96.3
15   | $15.000   | $75.000    | 96.3
16   | $15.000   | $75.000    | 96.3
17   | $1.000    | $5.000     | 96.3
18   | $0.390    | $0.900     | 96.3
19   | $0.390    | $0.900     | 96.3
20   | $3.000    | $15.000    | 96.2
21   | $0.300    | $0.500     | 95.7
22   | $0.080    | $0.240     | 95.7
23   | $0.080    | $0.240     | 95.7
24   | $15.000   | $75.000    | 95.7
25   | $15.000   | $75.000    | 95.7
26   | $0.060    | $0.140     | 95.5
27   | $0.150    | $0.400     | 95.5
28   | $0.150    | $0.400     | 95.5
29   | $0.550    | $2.200     | 95.1
30   | $1.250    | $10.000    | 95.1
31   | $0.200    | $1.100     | 95.1
32   | $1.250    | $10.000    | 94.5
33   | $1.250    | $10.000    | 94.5
34   | $1.000    | $5.000     | 93.9
35   | $0.260    | $0.380     | 93.9
36   | $2.000    | $8.000     | 93.3
37   | $0.080    | $0.280     | 93.3
38   | $0.080    | $0.280     | 93.3
39   | $0.450    | $2.150     | 93.3
40   | $0.050    | $0.200     | 93.3
41   | $0.050    | $0.200     | 93.3
42   | $0.260    | $1.560     | 93.3
43   | $0.220    | $0.900     | 92.7
44   | $3.000    | $15.000    | 92.1
45   | $0.071    | $0.100     | 92.1
46   | $0.060    | $0.400     | 92.1
47   | $0.060    | $0.400     | 92.1
48   | $3.000    | $15.000    | 91.5
49   | $3.000    | $15.000    | 90.9
50   | $3.000    | $15.000    | 90.9
51   | $0.400    | $0.800     | 90.2
52   | $0.400    | $0.800     | 90.2
53   | $0.500    | $1.500     | 90.2
54   | $2.500    | $10.000    | 88.4
55   | $0.400    | $2.000     | 88.4
56   | $0.100    | $0.400     | 87.8
57   | $0.014    | $0.028     | 87.2
58   | $0.280    | $0.900     | 86.6
59   | $0.030    | $0.110     | 85.4
60   | $0.400    | $2.000     | 85.4
61   | $0.150    | $0.600     | 84.8
62   | $0.060    | $0.180     | 83.5
63   | $2.500    | $10.000    | 82.9
64   | $0.120    | $0.390     | 82.3
65   | $2.000    | $6.000     | 82.3
66   | $0.080    | $0.300     | 81.1
67   | $0.060    | $0.240     | 78.0
68   | $0.070    | $0.280     | 77.4
69   | $0.800    | $3.200     | 76.8
70   | $0.800    | $4.000     | 75.6
71   | $0.020    | $0.040     | 73.2
72   | $0.900    | $0.900     | 67.1
73   | $1.000    | $1.000     | 51.2
74   | $2.500    | $10.000    | 48.2
75   | $0.030    | $0.050     | 47.6

Pricing from OpenRouter. Benchmarks from Artificial Analysis.
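Prices in the table are quoted in dollars per million tokens, so the cost of a single request is each token count times its per-million rate. A minimal sketch, using the top-ranked row's listed prices ($3.000 input / $15.000 output):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# 2,000 input tokens and 500 output tokens at $3/M in, $15/M out.
cost = request_cost(input_tokens=2_000, output_tokens=500,
                    input_per_m=3.0, output_per_m=15.0)
print(f"${cost:.4f}")  # 0.006 + 0.0075 = $0.0135
```

The token counts here are illustrative; actual counts depend on the tokenizer and prompt.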

108 out of our 483 tracked models have had a price change in March.


About HumanEval

The OpenAI HumanEval benchmark measures Python code generation from function docstrings.

This leaderboard shows all models with HumanEval benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
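One way to weigh performance against cost is points per blended dollar. This sketch uses the three top models named in the summary above, assuming the first 97.0 row ($3.000/$15.000) is Grok 4 and a 3:1 input:output token mix for the blended price; neither assumption is the site's own methodology.

```python
# Cost-effectiveness sketch: HumanEval score per blended $/M.
models = [
    # (name, input $/M, output $/M, HumanEval)
    ("Claude Sonnet 4.5", 3.000, 15.000, 97.6),
    ("R1",                0.550,  2.190, 97.4),
    ("Grok 4",            3.000, 15.000, 97.0),  # assumed: first 97.0 row
]

def blended_price(inp: float, out: float, input_share: float = 0.75) -> float:
    """Weighted per-million price, assuming 3 input tokens per output token."""
    return input_share * inp + (1 - input_share) * out

# Rank by score divided by blended price, best value first.
ranked = sorted(models, key=lambda m: m[3] / blended_price(m[1], m[2]), reverse=True)
for name, inp, out, score in ranked:
    print(f"{name}: {score / blended_price(inp, out):.1f} points per blended $/M")
```

With these numbers R1 comes out far ahead on value despite a marginally lower score, which is the kind of trade-off the pricing columns are meant to surface.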

Frequently Asked Questions

What is HumanEval?
HumanEval is an OpenAI benchmark that measures Python code generation from function docstrings.

Which model has the highest HumanEval score?
As of March 15, 2026, Claude Sonnet 4.5 leads the HumanEval leaderboard with a score of 97.6. Rankings change as new models are released and evaluated.

How many models have been evaluated?
Currently 75 models have been evaluated on HumanEval, with an average score of 89.6 and a standard deviation of 10.5.

How often is the data updated?
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.