MBPP Plus (Mostly Basic Python Problems Plus) tests Python code generation against an enhanced, more rigorous set of test cases.
Data from LayerLens
As of April 18, 2026, the top score on MBPP Plus is 66.2%, shared by two Qwen3 235B A22B entries, followed by R1 at 64.7%. A total of 65 models have been evaluated on this benchmark.
Last updated: April 18, 2026
Models: 65
Best Score: 66.2
Average: 57.2
Std Dev: 9.5
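The summary figures above can be recomputed from the score column of the leaderboard table. The following is a minimal sketch, not the site's own method: the `summarize` helper is a hypothetical name, and the use of the population standard deviation (`pstdev`) is an assumption, since the page does not say whether it reports the population or the sample form.

```python
from statistics import mean, pstdev

# MBPP Plus scores copied from the leaderboard table, in table order (65 rows).
scores = [
    66.2, 66.2, 64.7, 64.3, 63.4, 63.4, 63.4, 63.4, 63.2, 63.2,
    63.2, 63.2, 63.2, 63.2, 63.0, 63.0, 63.0, 63.0, 63.0, 63.0,
    62.7, 62.6, 62.6, 62.2, 62.2, 62.1, 62.1, 62.1, 61.9, 61.9,
    61.9, 61.6, 61.1, 61.1, 61.1, 60.6, 60.1, 59.8, 59.5, 58.7,
    58.7, 58.5, 57.7, 57.7, 57.7, 57.7, 56.6, 55.8, 54.5, 54.0,
    54.0, 53.4, 53.4, 53.2, 51.1, 50.8, 49.7, 49.2, 47.9, 36.8,
    33.9, 32.3, 30.5, 26.5, 24.1,
]


def summarize(values):
    """Return count, best score, mean, and standard deviation, rounded to 0.1.

    Assumption: population standard deviation; swap in statistics.stdev
    if the sample form is wanted.
    """
    return {
        "models": len(values),
        "best": max(values),
        "average": round(mean(values), 1),
        "std_dev": round(pstdev(values), 1),
    }


summary = summarize(scores)
print(summary)
```

With these scores, the count, best score, and rounded average agree with the figures reported above. The standard deviation depends on which form is chosen, so it may differ from the page by a rounding step.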
| Provider | Model | Input $/M | Output $/M | MBPP Plus (%) |
|---|---|---|---|---|
| | Qwen3 235B A22B | $0.455 | $0.900 | 66.2 |
| | Qwen3 235B A22B | $0.455 | $0.900 | 66.2 |
| | R1 | $0.550 | $2.000 | 64.7 |
| | | $1.100 | $4.400 | 64.3 |
| | | $0.080 | $0.240 | 63.4 |
| | | $0.080 | $0.240 | 63.4 |
| | | $15.000 | $75.000 | 63.4 |
| | | $15.000 | $75.000 | 63.4 |
| | | $0.550 | $2.200 | 63.2 |
| | | $0.220 | $0.900 | 63.2 |
| | | $0.125 | $1.000 | 63.2 |
| | | $0.125 | $1.000 | 63.2 |
| | | $0.090 | $0.780 | 63.2 |
| | | $3.000 | $15.000 | 63.2 |
| | | $2.000 | $8.000 | 63.0 |
| | | $1.250 | $10.000 | 63.0 |
| | | $1.000 | $10.000 | 63.0 |
| | | $0.050 | $0.400 | 63.0 |
| | | $0.050 | $0.400 | 63.0 |
| | | $0.050 | $0.400 | 63.0 |
| | | $0.071 | $0.100 | 62.7 |
| | | $0.080 | $0.280 | 62.6 |
| | | $0.080 | $0.280 | 62.6 |
| | | $3.000 | $15.000 | 62.2 |
| | | $3.000 | $15.000 | 62.2 |
| | | $3.000 | $15.000 | 62.1 |
| | | $0.150 | $0.580 | 62.1 |
| | | $0.150 | $0.580 | 62.1 |
| | | $0.065 | $0.140 | 61.9 |
| | | $0.400 | $1.760 | 61.9 |
| | | $0.400 | $1.760 | 61.9 |
| | | $2.000 | $8.000 | 61.6 |
| | | $0.100 | $0.400 | 61.1 |
| | | $0.300 | $0.500 | 61.1 |
| | | $0.070 | $0.270 | 61.1 |
| | | $3.000 | $15.000 | 60.6 |
| | | $0.150 | $0.600 | 60.1 |
| | | $0.080 | $0.160 | 59.8 |
| | | $0.500 | $2.150 | 59.5 |
| | | $3.000 | $15.000 | 58.7 |
| | | $3.000 | $15.000 | 58.7 |
| | | $3.000 | $15.000 | 58.5 |
| | | $0.300 | $2.500 | 57.7 |
| | | $0.300 | $2.500 | 57.7 |
| | | $0.300 | $2.500 | 57.7 |
| | | $0.300 | $2.500 | 57.7 |
| | | $0.075 | $0.200 | 56.6 |
| | | $0.060 | $0.120 | 55.8 |
| | | $0.400 | $2.000 | 54.5 |
| | | $2.000 | $6.000 | 54.0 |
| | | $0.080 | $0.300 | 54.0 |
| | | $1.000 | $5.000 | 53.4 |
| | | $1.000 | $5.000 | 53.4 |
| | | $2.500 | $10.000 | 53.2 |
| | | $0.014 | $0.028 | 51.1 |
| | | $0.060 | $0.240 | 50.8 |
| | | $0.280 | $0.900 | 49.7 |
| | | $0.800 | $3.200 | 49.2 |
| | | $0.035 | $0.140 | 47.9 |
| | | $0.900 | $0.900 | 36.8 |
| | | $0.800 | $4.000 | 33.9 |
| | | $2.500 | $10.000 | 32.3 |
| | | $2.500 | $10.000 | 30.5 |
| | | $0.030 | $0.050 | 26.5 |
| | | $0.300 | $0.300 | 24.1 |
Pricing from OpenRouter. Benchmarks from Artificial Analysis.
This leaderboard shows all models with MBPP Plus benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
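One way to compare performance against cost is to divide each score by a blended token price. The sketch below is illustrative only: the `Row` type, the `blended_price` and `score_per_dollar` helpers, and the 3:1 input-to-output token mix are all assumptions, not something the leaderboard defines. The sample rows are copied from the table and identified by their row position, since that is the only label needed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    position: int        # row position in the leaderboard table (1 = top)
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    score: float         # MBPP Plus score in percent


def blended_price(row, input_share=0.75):
    """Weighted price per million tokens.

    Assumption: 75% of tokens are input and 25% are output; adjust
    input_share to match the actual workload.
    """
    return input_share * row.input_price + (1 - input_share) * row.output_price


def score_per_dollar(row):
    """Benchmark points per blended dollar; higher means better value."""
    return row.score / blended_price(row)


# A few rows copied from the table (prices and scores as listed).
rows = [
    Row(1, 0.455, 0.900, 66.2),
    Row(3, 0.550, 2.000, 64.7),
    Row(7, 15.000, 75.000, 63.4),
    Row(21, 0.071, 0.100, 62.7),
    Row(55, 0.014, 0.028, 51.1),
]

# Sort from best value to worst value under the assumed token mix.
ranked = sorted(rows, key=score_per_dollar, reverse=True)
for row in ranked:
    print(
        f"row {row.position}: score {row.score} at "
        f"${blended_price(row):.3f}/M, {score_per_dollar(row):.1f} points per $"
    )
```

A ratio like this rewards very cheap models heavily, so it works best as a tiebreaker among models that already clear a minimum score rather than as a ranking on its own.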
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.