MBPP Plus Leaderboard

MBPP Plus (Mostly Basic Python Problems Plus) tests Python code generation against an enhanced set of test cases.

As of June 2, 2026, the top-scoring models on MBPP Plus are Qwen3 235B A22B Thinking and Qwen3 235B A22B, tied at 66.2%, followed by R1 at 64.7%. 65 models have been evaluated on this benchmark.

Last updated: June 2, 2026

Models: 65
Best Score: 66.2
Average: 57.2
Std Dev: 9.5
Categories: Computer Science and Programming
Source: LayerLens

Provider	Model	Input $/M	Output $/M	MBPP Plus
Alibaba	Qwen3 235B A22B Thinking	$0.455	$0.900	66.2
Alibaba	Qwen3 235B A22B	$0.455	$0.900	66.2
DeepSeek	R1	$0.550	$2.000	64.7
OpenAI	o4 Mini High	$1.100	$4.400	64.3
Alibaba	Qwen3 32B Thinking	$0.080	$0.280	63.4
Alibaba	Qwen3 32B	$0.080	$0.280	63.4
Anthropic	Claude Opus 4 Thinking	$15.000	$75.000	63.4
Anthropic	Claude Opus 4	$15.000	$75.000	63.4
OpenAI	o3 Mini	$0.550	$2.200	63.2
Alibaba	Qwen3 Coder 480B A35B (exacto)	$0.220	$0.900	63.2
OpenAI	GPT-5 Mini	$0.250	$2.000	63.2
OpenAI	GPT-5 Mini	$0.250	$2.000	63.2
Alibaba	Qwen3 Next 80B A3B Instruct	$0.090	$0.780	63.2
xAI	Grok 4	$3.000	$15.000	63.2
OpenAI	GPT-4.1	$2.000	$8.000	63.0
Google	Gemini 2.5 Pro Preview 05-06	$1.250	$10.000	63.0
Google	Gemini 2.5 Pro	$1.000	$10.000	63.0
OpenAI	GPT-5 Nano	$0.050	$0.400	63.0
OpenAI	GPT-5 Nano	$0.050	$0.400	63.0
OpenAI	GPT-5 Nano	$0.050	$0.400	63.0
Alibaba	Qwen3 235B A22B Instruct 2507	$0.071	$0.100	62.7
Alibaba	Qwen3 30B A3B Thinking	$0.080	$0.280	62.6
Alibaba	Qwen3 30B A3B	$0.080	$0.280	62.6
Anthropic	Claude Sonnet 4 Thinking	$3.000	$15.000	62.2
Anthropic	Claude Sonnet 4	$3.000	$15.000	62.2
Anthropic	Claude 3.7 Sonnet Thinking	$3.000	$15.000	62.1
Alibaba	QwQ 32B	$0.900	$0.900	62.1
Alibaba	QwQ 32B	$0.900	$0.900	62.1
Microsoft	Phi 4	$0.065	$0.140	61.9
MiniMax	MiniMax M1	$0.400	$2.200	61.9
MiniMax	MiniMax M1	$0.400	$2.200	61.9
OpenAI	o3	$2.000	$8.000	61.6
Google	Gemini 2.0 Flash	$0.100	$0.400	61.1
xAI	Grok 3 Mini Beta	$0.300	$0.500	61.1
Alibaba	Qwen3 Coder 30B A3B Instruct	$0.070	$0.270	61.1
Anthropic	Claude 3.7 Sonnet	$3.000	$15.000	60.6
Meta	Llama 4 Maverick	$0.150	$0.600	60.1
Google	Gemma 3 27B	$0.080	$0.160	59.8
DeepSeek	R1 0528	$0.500	$2.150	59.5
xAI	Grok 3	$3.000	$15.000	58.7
xAI	Grok 3	$3.000	$15.000	58.7
xAI	Grok 3 Beta	$3.000	$15.000	58.5
Google	Gemini 2.5 Flash Thinking	$0.300	$2.500	57.7
Google	Gemini 2.5 Flash Thinking	$0.300	$2.500	57.7
Google	Gemini 2.5 Flash	$0.300	$2.500	57.7
Google	Gemini 2.5 Flash	$0.300	$2.500	57.7
Mistral	Mistral Small 3.2 24B	$0.075	$0.200	56.6
Google	Gemma 3n 4B	$0.060	$0.120	55.8
Mistral	Mistral Medium 3	$0.400	$2.000	54.5
Mistral	Pixtral Large 2411	$2.000	$6.000	54.0
Meta	Llama 4 Scout	$0.080	$0.300	54.0
Anthropic	Claude Haiku 4.5 Thinking	$1.000	$5.000	53.4
Anthropic	Claude Haiku 4.5	$1.000	$5.000	53.4
Cohere	Command A	$2.500	$10.000	53.2
DeepSeek	DeepSeek V3	$0.014	$0.028	51.1
Amazon	Nova Lite 1.0	$0.060	$0.240	50.8
Baidu	ERNIE 4.5 300B A47B	$0.900	$0.900	49.7
Amazon	Nova Pro 1.0	$0.800	$3.200	49.2
Amazon	Nova Micro 1.0	$0.035	$0.140	47.9
Meta	Llama 3.1 405B Instruct	$0.900	$0.900	36.8
Anthropic	Claude 3.5 Haiku	$0.800	$4.000	33.9
Inflection	Inflection 3 Productivity	$2.500	$10.000	32.3
Inflection	Inflection 3 Pi	$2.500	$10.000	30.5
Meta	Llama 3.2 3B Instruct	$0.030	$0.050	26.5
Nous Research	Hermes 3 70B Instruct	$0.300	$0.300	24.1

Pricing from OpenRouter. Benchmarks from Artificial Analysis.
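
The summary figures near the top of the page (best score, average, standard deviation) can be recomputed from the score column. The sketch below is only an illustration: it uses five rows copied from the table rather than all 65, the helper names `parse_rows` and `summarize` are hypothetical, and it assumes a population standard deviation, since the page does not say which variant it reports.

```python
import statistics

# Five rows copied from the table: provider, model, input $/M, output $/M, score.
# This is a small sample, so the results will not match the page's full-table figures.
ROWS = """\
Alibaba\tQwen3 235B A22B Thinking\t$0.455\t$0.900\t66.2
DeepSeek\tR1\t$0.550\t$2.000\t64.7
OpenAI\to4 Mini High\t$1.100\t$4.400\t64.3
Microsoft\tPhi 4\t$0.065\t$0.140\t61.9
Nous Research\tHermes 3 70B Instruct\t$0.300\t$0.300\t24.1
"""


def parse_rows(text):
    """Split tab-separated leaderboard rows into dictionaries with numeric fields."""
    rows = []
    for line in text.strip().splitlines():
        provider, model, input_price, output_price, score = line.split("\t")
        rows.append({
            "provider": provider,
            "model": model,
            "input_per_m": float(input_price.lstrip("$")),
            "output_per_m": float(output_price.lstrip("$")),
            "score": float(score),
        })
    return rows


def summarize(rows):
    """Return count, best score, mean, and population standard deviation."""
    scores = [row["score"] for row in rows]
    return {
        "models": len(scores),
        "best": max(scores),
        "average": round(statistics.mean(scores), 1),
        # Assumption: population std dev; swap in statistics.stdev for the sample variant.
        "std_dev": round(statistics.pstdev(scores), 1),
    }


if __name__ == "__main__":
    print(summarize(parse_rows(ROWS)))
```

A single low score (Hermes 3 70B Instruct at 24.1) pulls the sample average well below the median, which is worth keeping in mind when reading the page's own average.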

About MBPP Plus

This leaderboard shows all models with MBPP Plus benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
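
One way to compare performance against cost is to divide each score by a blended token price. The following is a minimal sketch, not the site's method: the 75/25 input-to-output token mix and the function names `blended_price` and `rank_by_value` are assumptions made for illustration, and the five rows are copied from the table.

```python
# Rows copied from the leaderboard: (model, input $/M, output $/M, MBPP Plus score).
MODELS = [
    ("Qwen3 235B A22B", 0.455, 0.900, 66.2),
    ("R1", 0.550, 2.000, 64.7),
    ("Claude Opus 4", 15.000, 75.000, 63.4),
    ("GPT-5 Nano", 0.050, 0.400, 63.0),
    ("Phi 4", 0.065, 0.140, 61.9),
]


def blended_price(input_price, output_price, input_share=0.75):
    """Weighted price per million tokens.

    input_share is an assumed fraction of traffic that is input tokens;
    adjust it to match your own workload.
    """
    return input_share * input_price + (1 - input_share) * output_price


def rank_by_value(models, input_share=0.75):
    """Sort models by score points per blended dollar, best value first."""
    ranked = []
    for name, input_price, output_price, score in models:
        price = blended_price(input_price, output_price, input_share)
        ranked.append((name, round(price, 4), round(score / price, 1)))
    return sorted(ranked, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for name, price, value in rank_by_value(MODELS):
        print(f"{name:<20} ${price:>8.4f}/M  {value:>7.1f} points per $")
```

Under this assumed mix, the cheaper models rank first even though their raw scores are a few points lower, because the price spread across the table is far wider than the score spread.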