WMDP Leaderboard

Weapons of Mass Destruction Proxy (WMDP): a multiple-choice benchmark that measures hazardous knowledge in biosecurity, cybersecurity, and chemical security.

As of June 2, 2026, Gemini 3 Flash Preview Thinking and Gemini 3 Flash Preview share the top WMDP score at 86.8%, followed by o3 Mini at 80.5%. 17 models have been evaluated on this benchmark.

Last updated: June 2, 2026

Models: 17
Best Score: 86.8
Average: 68.2
Std Dev: 17.3
Categories: Reasoning and Logic
Source: LayerLens

Provider	Model	Input ($/M tokens)	Output ($/M tokens)	WMDP (%)
Google	Gemini 3 Flash Preview Thinking	$0.500	$3.000	86.8
Google	Gemini 3 Flash Preview	$0.500	$3.000	86.8
OpenAI	o3 Mini	$0.550	$2.200	80.5
Anthropic	Claude 3.7 Sonnet Thinking	$3.000	$15.000	78.2
Anthropic	Claude 3.7 Sonnet	$3.000	$15.000	78.2
xAI	Grok 3 Beta	$3.000	$15.000	74.3
Google	Gemini 2.0 Flash	$0.100	$0.400	72.0
DeepSeek	DeepSeek V3	$0.014	$0.028	71.8
OpenAI	GPT-4.1	$2.000	$8.000	71.4
Cohere	Command A	$2.500	$10.000	69.6
Meta	Llama 3.1 405B Instruct	$0.900	$0.900	67.7
Amazon	Nova Pro 1.0	$0.800	$3.200	67.0
Mistral	Pixtral Large 2411	$2.000	$6.000	65.3
Anthropic	Claude 3.5 Haiku	$0.800	$4.000	64.3
Mistral	Devstral Small 1.1	$0.070	$0.280	61.0
Microsoft	Phi 4	$0.065	$0.140	58.0
Meta	Llama 3.2 3B Instruct	$0.030	$0.050	6.7

Pricing from OpenRouter. Benchmarks from Artificial Analysis.
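
The summary figures (best score, average, standard deviation) can be reproduced from the WMDP column of the table. The following is a minimal sketch, not the site's own calculation. It assumes the reported Std Dev is a population standard deviation, which the page does not state; the sample standard deviation of the same scores comes out near 17.8 instead. The variable names are illustrative.

```python
import statistics

# WMDP scores copied from the leaderboard table, in rank order.
wmdp_scores = [
    86.8, 86.8, 80.5, 78.2, 78.2, 74.3, 72.0, 71.8, 71.4,
    69.6, 67.7, 67.0, 65.3, 64.3, 61.0, 58.0, 6.7,
]

model_count = len(wmdp_scores)
best_score = max(wmdp_scores)
average = statistics.mean(wmdp_scores)

# Assumption: population standard deviation (divide by n, not n - 1).
std_dev = statistics.pstdev(wmdp_scores)

print(f"Models:     {model_count}")
print(f"Best Score: {best_score:.1f}")
print(f"Average:    {average:.1f}")
print(f"Std Dev:    {std_dev:.1f}")
```

The single low outlier (Llama 3.2 3B Instruct at 6.7) accounts for most of the spread, so the standard deviation says little about how tightly the other 16 models cluster.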


About WMDP

This leaderboard shows all models with WMDP benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
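
One way to compare performance against cost is to divide each score by a blended price per million tokens. The sketch below is only an illustration under stated assumptions: the 3:1 input-to-output token mix, the `blended_price` helper, and the "points per dollar" metric are hypothetical choices, not something the leaderboard defines. It uses a handful of rows from the table rather than all 17.

```python
# (model, input $/M tokens, output $/M tokens, WMDP score): a subset of table rows.
rows = [
    ("Gemini 3 Flash Preview", 0.500, 3.000, 86.8),
    ("o3 Mini", 0.550, 2.200, 80.5),
    ("Claude 3.7 Sonnet", 3.000, 15.000, 78.2),
    ("Gemini 2.0 Flash", 0.100, 0.400, 72.0),
    ("DeepSeek V3", 0.014, 0.028, 71.8),
]


def blended_price(input_price: float, output_price: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per million tokens.

    Assumption: a 3:1 input-to-output token mix; adjust for your workload.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Score points per blended dollar; higher means more score for the same spend.
ranked = sorted(
    ((name, score / blended_price(inp, out)) for name, inp, out, score in rows),
    key=lambda item: item[1],
    reverse=True,
)

for name, points_per_dollar in ranked:
    print(f"{name:<24} {points_per_dollar:10.1f}")
```

A ratio like this rewards very cheap models heavily, so it works better as a first filter than as a ranking: a model with a low absolute score can still top the list if its price is close to zero.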