DarkBench Leaderboard

DarkBench is a benchmark that tests model safety and resistance to adversarial attacks.

As of June 2, 2026, the top-scoring model on DarkBench is Grok 4 at 51.1%, followed by Phi 4 at 49.5% and Qwen3 32B at 49.2%. Thirteen models have been evaluated on this benchmark.

Last updated: June 2, 2026

Models: 13
Best Score: 51.1
Average: 45.2
Std Dev: 7.1
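
The summary figures above can be approximately reproduced from the scores listed in the leaderboard table. The following is a minimal sketch, not the site's actual method: it assumes the figures are computed over all 13 listed rows and that the standard deviation is the population form (`statistics.pstdev`). The variable names are illustrative only.

```python
import statistics

# DarkBench scores (percent) copied from the leaderboard table, top to bottom.
scores = [51.1, 49.5, 49.2, 49.2, 49.1, 49.1, 48.6,
          48.6, 47.7, 42.8, 42.7, 31.5, 27.7]

best = max(scores)
average = statistics.mean(scores)
# Assumption: population standard deviation (divide by N, not N - 1).
std_dev = statistics.pstdev(scores)

print(f"Models:     {len(scores)}")
print(f"Best Score: {best:.1f}")
print(f"Average:    {average:.1f}")
print(f"Std Dev:    {std_dev:.1f}")
```

On the rounded scores shown in the table, this gives a best score of 51.1, an average of about 45.1, and a standard deviation of about 7.1. The small gap from the published average of 45.2 may come from the site averaging unrounded scores, though that is a guess.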

Categories: Reasoning and Logic

Source: LayerLens

Provider	Model	Input ($/M tokens)	Output ($/M tokens)	DarkBench (%)
xAI	Grok 4	$3.000	$15.000	51.1
Microsoft	Phi 4	$0.065	$0.140	49.5
Alibaba	Qwen3 32B Thinking	$0.080	$0.280	49.2
Alibaba	Qwen3 32B	$0.080	$0.280	49.2
Google	Gemini 3 Flash Preview Thinking	$0.500	$3.000	49.1
Google	Gemini 3 Flash Preview	$0.500	$3.000	49.1
Kimi	Kimi K2 0711	$0.550	$2.200	48.6
Kimi	Kimi K2 0711	$0.550	$2.200	48.6
Meta	Llama 4 Scout	$0.080	$0.300	47.7
OpenAI	o4 Mini High	$1.100	$4.400	42.8
Amazon	Nova Pro 1.0	$0.800	$3.200	42.7
Anthropic	Claude 3.7 Sonnet	$3.000	$15.000	31.5
Anthropic	Claude 3.7 Sonnet Thinking	$3.000	$15.000	27.7

Pricing from OpenRouter. Benchmarks from Artificial Analysis.

About DarkBench

This leaderboard shows all models with DarkBench benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
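
One way to compare performance against cost is to divide each score by a blended token price. The sketch below is illustrative only: the `Row` class, the `blended_price` and `points_per_dollar` helpers, and the 50/50 split between input and output tokens are all assumptions, not something this leaderboard defines. It uses a handful of rows copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    score: float         # DarkBench score, percent


# A subset of rows copied from the leaderboard table.
ROWS = [
    Row("Grok 4", 3.000, 15.000, 51.1),
    Row("Phi 4", 0.065, 0.140, 49.5),
    Row("Qwen3 32B", 0.080, 0.280, 49.2),
    Row("Llama 4 Scout", 0.080, 0.300, 47.7),
    Row("o4 Mini High", 1.100, 4.400, 42.8),
    Row("Claude 3.7 Sonnet", 3.000, 15.000, 31.5),
]


def blended_price(row: Row, input_share: float = 0.5) -> float:
    """Weighted price per million tokens; input_share is a hypothetical traffic mix."""
    return input_share * row.input_price + (1.0 - input_share) * row.output_price


def points_per_dollar(row: Row, input_share: float = 0.5) -> float:
    """DarkBench points per blended dollar (an ad hoc metric, not an official one)."""
    return row.score / blended_price(row, input_share)


# Highest value first: more benchmark points per dollar spent.
ranked = sorted(ROWS, key=points_per_dollar, reverse=True)
for row in ranked:
    print(f"{row.model:<20} {points_per_dollar(row):8.1f}")
```

Under these assumptions the low-priced models rank first, because their prices differ by far more than their scores do. Changing `input_share` to match a real workload can shift the ordering, so the ratio is best read as a rough guide.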