Price Per Token

ARC Challenge Leaderboard

AI2 Reasoning Challenge (Challenge set) — grade-school science questions requiring complex reasoning.

Data from LayerLens

As of April 18, 2026, the top-scoring model on ARC Challenge is GPT-5 at 96.3%, a score shared by the top four entries in the table. A total of 43 models have been evaluated on this benchmark.

Last updated: April 18, 2026

Models: 43
Best Score: 96.3
Average: 92.6
Std Dev: 6.3
Categories: Reasoning and Logic

Input $/M   Output $/M   ARC Challenge
$1.250      $10.000      96.3
$1.250      $10.000      96.3
$1.250      $10.000      96.3
$1.250      $10.000      96.3
$0.720      $2.300       96.0
$0.720      $2.300       96.0
$0.400      $1.760       95.3
$0.400      $1.760       95.3
$0.300      $0.500       95.2
$2.000      $8.000       95.1
$1.100      $4.400       95.1
$0.500      $2.150       95.1
$0.150      $0.600       95.0
$0.071      $0.100       94.8
$3.000      $15.000      94.7
$3.000      $15.000      94.7
$0.080      $0.240       94.7
$0.080      $0.240       94.7
$0.150      $0.580       94.7
$0.150      $0.580       94.7
$0.065      $0.140       94.6
$0.100      $0.400       94.2
$0.280      $0.900       94.0
$0.014      $0.028       94.0
$0.080      $0.300       93.9
$0.400      $2.000       93.9
$3.000      $15.000      93.9
$3.000      $15.000      93.9
$2.500      $10.000      93.7
$3.000      $15.000      93.7
$0.400      $2.000       93.4
$0.800      $3.200       92.7
$2.000      $6.000       92.2
$0.075      $0.200       91.8
$0.550      $2.000       91.5
$0.080      $0.160       91.0
$0.070      $0.280       90.4
$0.800      $4.000       90.4
$2.500      $10.000      89.2
$2.500      $10.000      89.2
$0.035      $0.140       86.9
$0.060      $0.120       81.7
$0.030      $0.050       56.3

Pricing from OpenRouter. Benchmarks from Artificial Analysis.

About ARC Challenge

This leaderboard shows all models with ARC Challenge benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
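
One way to compare performance against cost is to price a fixed workload at each row's per-million-token rates. The sketch below is illustrative only: the `Row` class, the `workload_cost` helper, and the workload size (1M input tokens, 250k output tokens) are assumptions made for the example, and the three rows are copied from the table above without model names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens
    score: float         # ARC Challenge score


def workload_cost(row: Row, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a workload at the row's per-million-token prices."""
    total = row.input_per_m * input_tokens + row.output_per_m * output_tokens
    return total / 1_000_000


# Three rows copied from the leaderboard table (hypothetical selection).
rows = [
    Row(1.250, 10.000, 96.3),
    Row(0.300, 0.500, 95.2),
    Row(0.014, 0.028, 94.0),
]

# Hypothetical workload: 1M input tokens and 250k output tokens.
for row in rows:
    cost = workload_cost(row, 1_000_000, 250_000)
    print(f"score {row.score:.1f}  cost ${cost:.3f}")
```

For this assumed workload, the 2.3-point gap between the first and third rows corresponds to a cost difference of more than two orders of magnitude, which is the kind of trade-off the pricing columns are meant to expose.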

Frequently Asked Questions

As of April 18, 2026, GPT-5 leads the ARC Challenge leaderboard with a score of 96.3. Rankings change as new models are released and evaluated.
Currently, 43 models have been evaluated on ARC Challenge, with an average score of 92.6 and a standard deviation of 6.3.
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.
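
The average and standard deviation quoted above can be re-derived from the 43 scores in the leaderboard table. This is a minimal sketch, assuming the published figures are a plain mean and standard deviation over the rounded scores. The page does not say whether the population or the sample formula is used, so both are computed.

```python
import statistics

# The 43 ARC Challenge scores, in leaderboard order.
scores = [
    96.3, 96.3, 96.3, 96.3, 96.0, 96.0, 95.3, 95.3, 95.2, 95.1,
    95.1, 95.1, 95.0, 94.8, 94.7, 94.7, 94.7, 94.7, 94.7, 94.7,
    94.6, 94.2, 94.0, 94.0, 93.9, 93.9, 93.9, 93.9, 93.7, 93.7,
    93.4, 92.7, 92.2, 91.8, 91.5, 91.0, 90.4, 90.4, 89.2, 89.2,
    86.9, 81.7, 56.3,
]

mean = statistics.mean(scores)
population_sd = statistics.pstdev(scores)  # divides by n
sample_sd = statistics.stdev(scores)       # divides by n - 1

print(f"models: {len(scores)}")
print(f"best: {max(scores):.1f}")
print(f"mean: {mean:.1f}")
print(f"std dev (population): {population_sd:.1f}")
print(f"std dev (sample): {sample_sd:.1f}")
```

Both formulas round to the published 6.3. Most of the spread comes from the single 56.3 score at the bottom of the table; the other 42 scores all fall between 81.7 and 96.3.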