DarkBench — benchmark testing model safety and resistance to adversarial attacks.
Data from LayerLens
As of April 18, 2026, the top-scoring model on DarkBench is Grok 4 at 51.1%, followed by Phi 4 at 49.5% and Qwen3 32B at 49.2%. Thirteen models have been evaluated on this benchmark.
Last updated: April 18, 2026
Models: 13
Best Score: 51.1
Average: 45.2
Std Dev: 7.1
| Provider | Model | Input ($/M tokens) | Output ($/M tokens) | DarkBench (%) |
|---|---|---|---|---|
| | Grok 4 | $3.000 | $15.000 | 51.1 |
| | Phi 4 | $0.065 | $0.140 | 49.5 |
| | Qwen3 32B | $0.080 | $0.240 | 49.2 |
| | | $0.080 | $0.240 | 49.2 |
| | | $0.500 | $3.000 | 49.1 |
| | | $0.500 | $3.000 | 49.1 |
| | | $0.550 | $2.200 | 48.6 |
| | | $0.550 | $2.200 | 48.6 |
| | | $0.080 | $0.300 | 47.7 |
| | | $1.100 | $4.400 | 42.8 |
| | | $0.800 | $3.200 | 42.7 |
| | | $3.000 | $15.000 | 31.5 |
| | | $3.000 | $15.000 | 27.7 |
Pricing from OpenRouter. Benchmarks from Artificial Analysis.
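The summary figures above (models, best score, average, standard deviation) can be approximately recomputed from the DarkBench column. The sketch below is illustrative only: it assumes the site uses the population standard deviation, and it works from the rounded scores shown in the table. Under those assumptions the standard deviation rounds to the displayed 7.1, while the mean comes out near 45.1 rather than 45.2, presumably because the site averages unrounded values. The `scores` and `summary` names are hypothetical.

```python
from statistics import mean, pstdev

# DarkBench scores copied from the table, highest to lowest.
scores = [51.1, 49.5, 49.2, 49.2, 49.1, 49.1, 48.6,
          48.6, 47.7, 42.8, 42.7, 31.5, 27.7]

summary = {
    "models": len(scores),
    "best": max(scores),
    # Mean of the rounded table values; may differ from the site's figure
    # in the last digit if the site averages unrounded scores.
    "average": round(mean(scores), 1),
    # Population standard deviation (assumed; divides by N, not N - 1).
    "std_dev": round(pstdev(scores), 1),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```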
This leaderboard shows all models with DarkBench benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
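One way to compare performance against cost is to divide each score by a blended token price. The sketch below is a rough illustration, not the site's method: the 3:1 input-to-output token mix, the `Row` type, and the `blended_price` and `points_per_dollar` helpers are all assumptions introduced here. Rows are given in leaderboard order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    score: float         # DarkBench score


# Price and score columns copied from the leaderboard table.
rows = [
    Row(3.000, 15.000, 51.1), Row(0.065, 0.140, 49.5),
    Row(0.080, 0.240, 49.2), Row(0.080, 0.240, 49.2),
    Row(0.500, 3.000, 49.1), Row(0.500, 3.000, 49.1),
    Row(0.550, 2.200, 48.6), Row(0.550, 2.200, 48.6),
    Row(0.080, 0.300, 47.7), Row(1.100, 4.400, 42.8),
    Row(0.800, 3.200, 42.7), Row(3.000, 15.000, 31.5),
    Row(3.000, 15.000, 27.7),
]


def blended_price(row: Row, input_share: float = 0.75) -> float:
    """Weighted price per million tokens, assuming a 3:1 input:output mix."""
    return input_share * row.input_price + (1 - input_share) * row.output_price


def points_per_dollar(row: Row) -> float:
    """Benchmark points per blended dollar; higher means better value."""
    return row.score / blended_price(row)


# Sort from best value to worst value.
ranked = sorted(rows, key=points_per_dollar, reverse=True)
for row in ranked:
    print(f"score {row.score:5.1f}  blended ${blended_price(row):6.3f}/M  "
          f"{points_per_dollar(row):7.1f} points per $")
```

A different token mix changes the blended price, so the ordering should be read as indicative only.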
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.