Price Per Token

DarkBench Leaderboard

DarkBench is a benchmark that tests model safety and resistance to adversarial attacks.

Data from LayerLens

As of April 18, 2026, the top-scoring model on DarkBench is Grok 4 at 51.1%, followed by Phi 4 at 49.5% and Qwen3 32B at 49.2%. 13 models have been evaluated on this benchmark.

Last updated: April 18, 2026

Models: 13
Best Score: 51.1
Average: 45.2
Std Dev: 7.1

Categories: Reasoning and Logic
Input $/M    Output $/M    DarkBench
$3.000       $15.000       51.1
$0.065       $0.140        49.5
$0.080       $0.240        49.2
$0.080       $0.240        49.2
$0.500       $3.000        49.1
$0.500       $3.000        49.1
$0.550       $2.200        48.6
$0.550       $2.200        48.6
$0.080       $0.300        47.7
$1.100       $4.400        42.8
$0.800       $3.200        42.7
$3.000       $15.000       31.5
$3.000       $15.000       27.7

Pricing from OpenRouter. Benchmarks from Artificial Analysis.


About DarkBench

DarkBench is a benchmark that tests model safety and resistance to adversarial attacks.

This leaderboard shows all models with DarkBench benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
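One way to compare performance against cost is to price a fixed workload for each row. The sketch below is illustrative only: the function name `workload_cost` and the workload size (1,000,000 input tokens and 250,000 output tokens) are assumptions, and the three rows are the first three price and score entries from the leaderboard table.

```python
def workload_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of a workload, given prices quoted per million tokens."""
    return (input_tokens / 1_000_000) * input_price_per_m \
        + (output_tokens / 1_000_000) * output_price_per_m


# (input $/M, output $/M, DarkBench score): first three rows of the table.
rows = [
    (3.000, 15.000, 51.1),
    (0.065, 0.140, 49.5),
    (0.080, 0.240, 49.2),
]

# Hypothetical workload, chosen only for illustration.
INPUT_TOKENS = 1_000_000
OUTPUT_TOKENS = 250_000

costs = [workload_cost(INPUT_TOKENS, OUTPUT_TOKENS, inp, out) for inp, out, _ in rows]

for (inp, out, score), cost in zip(rows, costs):
    print(f"score {score:5.1f}  workload cost ${cost:.3f}")
```

Under these assumed token counts, the top-scoring row costs far more for the same workload than the next two rows, while the scores differ by less than two points. A different input-to-output ratio would shift the costs, so the workload should be adjusted to match real usage.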

Frequently Asked Questions

As of April 18, 2026, Grok 4 leads the DarkBench leaderboard with a score of 51.1. Rankings change as new models are released and evaluated.
Currently, 13 models have been evaluated on DarkBench, with an average score of 45.2 and a standard deviation of 7.1.
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.
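The summary statistics quoted above can be roughly rechecked from the 13 scores shown in the leaderboard table. This is a minimal sketch using Python's standard `statistics` module. It works from the displayed, one-decimal scores, so small differences from the published figures are to be expected. The variable names are illustrative.

```python
from statistics import mean, pstdev, stdev

# DarkBench scores as displayed in the leaderboard table (one decimal place).
scores = [51.1, 49.5, 49.2, 49.2, 49.1, 49.1, 48.6,
          48.6, 47.7, 42.8, 42.7, 31.5, 27.7]

average = mean(scores)
population_sd = pstdev(scores)  # divides by n
sample_sd = stdev(scores)       # divides by n - 1

print(f"models:        {len(scores)}")
print(f"best score:    {max(scores):.1f}")
print(f"average:       {average:.2f}")
print(f"population sd: {population_sd:.2f}")
print(f"sample sd:     {sample_sd:.2f}")
```

From the displayed scores, the average comes out near 45.1 rather than the published 45.2. One possible explanation is that the site averages unrounded scores, but the source does not confirm this. The population formula gives a value that rounds to the published 7.1, while the sample formula gives a somewhat larger one. This suggests, without confirming, that the site uses the population formula.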