ARC Challenge Leaderboard

AI2 Reasoning Challenge (Challenge set) — grade-school science questions requiring complex reasoning.

As of June 2, 2026, the top-scoring model on ARC Challenge is GPT-5 at 96.3%; all four GPT-5 entries listed below share that score, followed by GLM 5 Thinking and GLM 5 at 96.0%. 43 models have been evaluated on this benchmark.

Last updated: June 2, 2026

Models: 43
Best Score: 96.3
Average: 92.6
Std Dev: 6.3

Categories: Reasoning and Logic
Source: LayerLens

Provider	Model	Input $/M tokens	Output $/M tokens	ARC Challenge (%)
OpenAI	GPT-5	$1.250	$10.000	96.3
OpenAI	GPT-5	$1.250	$10.000	96.3
OpenAI	GPT-5	$1.250	$10.000	96.3
OpenAI	GPT-5	$1.250	$10.000	96.3
Z AI	GLM 5 Thinking	$0.600	$2.080	96.0
Z AI	GLM 5	$0.600	$2.080	96.0
MiniMax	MiniMax M1	$0.400	$2.200	95.3
MiniMax	MiniMax M1	$0.400	$2.200	95.3
xAI	Grok 3 Mini Beta	$0.300	$0.500	95.2
OpenAI	GPT-4.1	$2.000	$8.000	95.1
OpenAI	o4 Mini High	$1.100	$4.400	95.1
DeepSeek	R1 0528	$0.500	$2.150	95.1
Meta	Llama 4 Maverick	$0.150	$0.600	95.0
Alibaba	Qwen3 235B A22B Instruct 2507	$0.071	$0.100	94.8
Anthropic	Claude 3.7 Sonnet Thinking	$3.000	$15.000	94.7
Anthropic	Claude 3.7 Sonnet	$3.000	$15.000	94.7
Alibaba	Qwen3 32B Thinking	$0.080	$0.280	94.7
Alibaba	Qwen3 32B	$0.080	$0.280	94.7
Alibaba	QwQ 32B	$0.900	$0.900	94.7
Alibaba	QwQ 32B	$0.900	$0.900	94.7
Microsoft	Phi 4	$0.065	$0.140	94.6
Google	Gemini 2.0 Flash	$0.100	$0.400	94.2
Baidu	ERNIE 4.5 300B A47B	$0.900	$0.900	94.0
DeepSeek	DeepSeek V3	$0.014	$0.028	94.0
Meta	Llama 4 Scout	$0.080	$0.300	93.9
Mistral	Mistral Medium 3.1	$0.400	$2.000	93.9
xAI	Grok 3	$3.000	$15.000	93.9
xAI	Grok 3	$3.000	$15.000	93.9
Cohere	Command A	$2.500	$10.000	93.7
xAI	Grok 3 Beta	$3.000	$15.000	93.7
Mistral	Mistral Medium 3	$0.400	$2.000	93.4
Amazon	Nova Pro 1.0	$0.800	$3.200	92.7
Mistral	Pixtral Large 2411	$2.000	$6.000	92.2
Mistral	Mistral Small 3.2 24B	$0.075	$0.200	91.8
DeepSeek	R1	$0.550	$2.000	91.5
Google	Gemma 3 27B	$0.080	$0.160	91.0
Mistral	Devstral Small 1.1	$0.070	$0.280	90.4
Anthropic	Claude 3.5 Haiku	$0.800	$4.000	90.4
Inflection	Inflection 3 Productivity	$2.500	$10.000	89.2
Inflection	Inflection 3 Pi	$2.500	$10.000	89.2
Amazon	Nova Micro 1.0	$0.035	$0.140	86.9
Google	Gemma 3n 4B	$0.060	$0.120	81.7
Meta	Llama 3.2 3B Instruct	$0.030	$0.050	56.3

Pricing from OpenRouter. Benchmarks from Artificial Analysis.
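
The summary figures near the top of the page (best score 96.3, average 92.6, std dev 6.3) can be cross-checked against the score column. The sketch below is one way to do that, not the site's own method: the `summarize` helper is a hypothetical name, the scores are copied by hand from the 43 table rows, and the population standard deviation is an assumption, since the page does not say which variant it reports.

```python
from statistics import mean, pstdev, stdev

# ARC Challenge scores copied from the table, in listed order (43 rows).
scores = [
    96.3, 96.3, 96.3, 96.3, 96.0, 96.0, 95.3, 95.3, 95.2, 95.1,
    95.1, 95.1, 95.0, 94.8, 94.7, 94.7, 94.7, 94.7, 94.7, 94.7,
    94.6, 94.2, 94.0, 94.0, 93.9, 93.9, 93.9, 93.9, 93.7, 93.7,
    93.4, 92.7, 92.2, 91.8, 91.5, 91.0, 90.4, 90.4, 89.2, 89.2,
    86.9, 81.7, 56.3,
]


def summarize(values):
    """Return the headline statistics, rounded to one decimal place."""
    return {
        "models": len(values),
        "best": max(values),
        "average": round(mean(values), 1),
        # Assumption: population standard deviation (divide by n).
        "std_dev": round(pstdev(values), 1),
    }


summary = summarize(scores)
print(summary)
```

With these inputs the sample standard deviation (`stdev`, dividing by n - 1) rounds to the same one-decimal value, so the published figure does not reveal which variant was used. The single low outlier at 56.3 accounts for most of the spread.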

About ARC Challenge

This leaderboard shows all models with ARC Challenge benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
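
One way to compare performance against cost is to keep only the models that no other model beats on both price and score (a Pareto frontier). The following is a minimal sketch under stated assumptions: the rows are a hand-picked subset of the table, `Row`, `blended_price`, and `pareto_frontier` are hypothetical names, and the 75% input / 25% output token mix is an assumed workload, not something the leaderboard specifies.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    score: float         # ARC Challenge score (%)


# A subset of rows copied from the leaderboard table.
ROWS = [
    Row("GPT-5", 1.250, 10.000, 96.3),
    Row("GLM 5", 0.600, 2.080, 96.0),
    Row("Grok 3 Mini Beta", 0.300, 0.500, 95.2),
    Row("Claude 3.7 Sonnet", 3.000, 15.000, 94.7),
    Row("Qwen3 235B A22B Instruct 2507", 0.071, 0.100, 94.8),
    Row("DeepSeek V3", 0.014, 0.028, 94.0),
    Row("Llama 3.2 3B Instruct", 0.030, 0.050, 56.3),
]


def blended_price(row, input_share=0.75):
    """Weighted USD per million tokens; input_share is an assumed token mix."""
    return input_share * row.input_price + (1 - input_share) * row.output_price


def pareto_frontier(rows, input_share=0.75):
    """Keep rows whose score beats every row that is equally cheap or cheaper."""
    # Sort cheapest first; at equal price, the higher score comes first.
    ordered = sorted(rows, key=lambda r: (blended_price(r, input_share), -r.score))
    frontier, best_score = [], float("-inf")
    for row in ordered:
        if row.score > best_score:  # strictly better than anything cheaper
            frontier.append(row)
            best_score = row.score
    return frontier


frontier = pareto_frontier(ROWS)
for row in frontier:
    print(f"{row.model}: ${blended_price(row):.4f}/M blended, {row.score}")
```

On this subset, Claude 3.7 Sonnet and Llama 3.2 3B Instruct drop out because a cheaper model scores higher than each of them. Changing `input_share` changes the blended prices and can change which models survive, so the result is only as meaningful as the assumed token mix.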