MedQA Leaderboard

MedQA is a medical question-answering benchmark built from USMLE-style questions.

As of June 2, 2026, the top-scoring model on MedQA is o4 Mini High at 95.2%, followed by Gemini 2.5 Pro at 94.6% and Claude 3.7 Sonnet Thinking at 92.3%. In total, 34 models have been evaluated on this benchmark.

Last updated: June 2, 2026

Models: 34
Best Score: 95.2
Average: 79.4
Std Dev: 11.0

Categories: Reasoning and Logic
Source: LayerLens

Provider	Model	Input ($/M tokens)	Output ($/M tokens)	MedQA (%)
OpenAI	o4 Mini High	$1.100	$4.400	95.2
Google	Gemini 2.5 Pro	$1.000	$10.000	94.6
Anthropic	Claude 3.7 Sonnet Thinking	$3.000	$15.000	92.3
DeepSeek	R1	$0.550	$2.000	92.1
OpenAI	o3 Mini	$0.550	$2.200	91.4
OpenAI	GPT-4.1	$2.000	$8.000	89.7
Anthropic	Claude 3.7 Sonnet	$3.000	$15.000	87.6
Alibaba	QwQ 32B	$0.900	$0.900	86.5
Alibaba	QwQ 32B	$0.900	$0.900	86.5
xAI	Grok 3 Beta	$3.000	$15.000	86.1
Alibaba	Qwen3 Max	$0.780	$3.900	85.5
Alibaba	Qwen3 Max	$0.780	$3.900	85.5
Alibaba	Qwen3 30B A3B Thinking	$0.080	$0.280	85.3
Alibaba	Qwen3 30B A3B	$0.080	$0.280	85.3
Alibaba	Qwen3 235B A22B Instruct 2507	$0.071	$0.100	84.8
Google	Gemini 2.0 Flash	$0.100	$0.400	83.2
Meta	Llama 3.1 405B Instruct	$0.900	$0.900	82.9
DeepSeek	DeepSeek V3	$0.014	$0.028	80.3
Amazon	Nova Pro 1.0	$0.800	$3.200	79.3
Mistral	Mistral Medium 3	$0.400	$2.000	79.1
Meta	Llama 4 Maverick	$0.150	$0.600	78.4
Mistral	Pixtral Large 2411	$2.000	$6.000	78.3
Microsoft	Phi 4	$0.065	$0.140	77.8
Anthropic	Claude 3.5 Haiku	$0.800	$4.000	77.8
Cohere	Command A	$2.500	$10.000	73.3
Nous Research	Hermes 4 405B Thinking	$1.000	$3.000	72.8
Nous Research	Hermes 4 405B	$1.000	$3.000	72.8
Mistral	Mistral Small 3.2 24B	$0.075	$0.200	70.5
Nous Research	Hermes 4 70B Thinking	$0.130	$0.400	70.1
Nous Research	Hermes 4 70B	$0.130	$0.400	70.1
Google	Gemma 3 27B	$0.080	$0.160	67.5
Google	Gemma 3n 4B	$0.060	$0.120	54.2
Meta	Llama 3.2 3B Instruct	$0.030	$0.050	52.6
Meta	Llama 4 Scout	$0.080	$0.300	52.0

Pricing from OpenRouter. Benchmarks from Artificial Analysis.
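
The summary figures near the top of the page (best score, average, standard deviation) can be roughly reproduced from the MedQA column. The sketch below is an illustration, not the site's own method: it assumes all 34 table rows are counted, including the rows that appear twice, and it assumes the population form of the standard deviation, since the page does not say which formula it uses. Small differences from the displayed values are expected because the table scores are already rounded to one decimal.

```python
import statistics

# MedQA scores copied from the table, in table order (34 rows,
# including the rows that are listed twice).
SCORES = [
    95.2, 94.6, 92.3, 92.1, 91.4, 89.7, 87.6, 86.5, 86.5, 86.1,
    85.5, 85.5, 85.3, 85.3, 84.8, 83.2, 82.9, 80.3, 79.3, 79.1,
    78.4, 78.3, 77.8, 77.8, 73.3, 72.8, 72.8, 70.5, 70.1, 70.1,
    67.5, 54.2, 52.6, 52.0,
]

best = max(SCORES)
mean = statistics.fmean(SCORES)
# Assumption: population standard deviation (divide by n, not n - 1).
spread = statistics.pstdev(SCORES)

print(f"models={len(SCORES)} best={best:.1f} mean={mean:.2f} pstdev={spread:.2f}")
```

Swapping `statistics.pstdev` for `statistics.stdev` gives the sample form instead, which is slightly larger on a list this short.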

About MedQA

This leaderboard shows all models with MedQA benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
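
One way to compare performance against cost is to keep only the models that no cheaper model matches or beats, often called a Pareto frontier. The sketch below uses a handful of rows from the table rather than the full leaderboard. The names `Row`, `blended_price`, and `pareto_frontier` are hypothetical, and the blended price (75% input tokens, 25% output tokens) is an assumption chosen for illustration; the page itself does not define a single cost figure.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    score: float         # MedQA score, percent


# A subset of rows copied from the table (not the full leaderboard).
ROWS = [
    Row("o4 Mini High", 1.10, 4.40, 95.2),
    Row("Gemini 2.5 Pro", 1.00, 10.00, 94.6),
    Row("Claude 3.7 Sonnet Thinking", 3.00, 15.00, 92.3),
    Row("R1", 0.55, 2.00, 92.1),
    Row("GPT-4.1", 2.00, 8.00, 89.7),
    Row("Qwen3 30B A3B", 0.08, 0.28, 85.3),
    Row("DeepSeek V3", 0.014, 0.028, 80.3),
    Row("Llama 4 Scout", 0.08, 0.30, 52.0),
]


def blended_price(row: Row, input_share: float = 0.75) -> float:
    """Hypothetical blended $/M tokens, assuming `input_share` of tokens are input."""
    return input_share * row.input_price + (1 - input_share) * row.output_price


def pareto_frontier(rows: list[Row]) -> list[Row]:
    """Return rows that no cheaper-or-equal row matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a price tie, higher score first.
    for row in sorted(rows, key=lambda r: (blended_price(r), -r.score)):
        if row.score > best_score:
            frontier.append(row)
            best_score = row.score
    return frontier


for row in pareto_frontier(ROWS):
    print(f"{row.model}: ${blended_price(row):.4f}/M blended, MedQA {row.score}")
```

Changing `input_share` shifts the ranking toward models with cheap input or cheap output, so the frontier should be read as dependent on that assumed workload mix.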