Workshop on Machine Translation 2014 — multilingual translation quality benchmark.
Data from LayerLens
As of March 15, 2026, the top-scoring model on WMT 2014 is Gemini 2.0 Flash at 38.9%, followed by Llama 3.1 405B Instruct and Llama 4 Maverick, both at 38.0%. 13 models have been evaluated on this benchmark.
Last updated: March 15, 2026
Models: 13 · Best Score: 38.9 · Average: 35.6 · Std Dev: 3.7
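The summary statistics above can be reproduced from the 13 per-model scores in the table below. A minimal check in Python (scores transcribed from the WMT 2014 column; the reported 3.7 matches the *population* standard deviation, not the sample one):

```python
from statistics import mean, pstdev

# WMT 2014 scores for the 13 evaluated models, as listed in the table
scores = [38.9, 38.0, 38.0, 37.6, 37.4, 37.3, 37.1, 36.9, 36.6,
          35.6, 34.1, 30.0, 25.2]

best = max(scores)            # 38.9
avg = round(mean(scores), 1)  # 35.6
sd = round(pstdev(scores), 1) # 3.7 (population standard deviation)

print(f"Models: {len(scores)}  Best: {best}  Average: {avg}  Std Dev: {sd}")
```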
| Provider | Model | Input $/M | Output $/M | WMT 2014 |
|---|---|---|---|---|
| Google | Gemini 2.0 Flash | $0.100 | $0.400 | 38.9 |
| Meta | Llama 3.1 405B Instruct | $0.900 | $0.900 | 38.0 |
| Meta | Llama 4 Maverick | $0.150 | $0.600 | 38.0 |
| — | — | $2.000 | $8.000 | 37.6 |
| — | — | $3.000 | $15.000 | 37.4 |
| — | — | $2.500 | $10.000 | 37.3 |
| — | — | $0.080 | $0.300 | 37.1 |
| — | — | $0.800 | $3.200 | 36.9 |
| — | — | $0.014 | $0.028 | 36.6 |
| — | — | $0.800 | $4.000 | 35.6 |
| — | — | $0.070 | $0.280 | 34.1 |
| — | — | $2.500 | $10.000 | 30.0 |
| — | — | $0.030 | $0.050 | 25.2 |
Pricing from OpenRouter. Benchmarks from Artificial Analysis.
This leaderboard shows all models with WMT 2014 benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
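One way to use the pricing columns for a cost-vs-performance comparison is a blended per-token price. The 3:1 input-to-output token ratio and the `score_per_dollar` metric below are illustrative assumptions, not something the leaderboard defines:

```python
def blended_price(input_per_m: float, output_per_m: float, ratio: float = 3.0) -> float:
    """Blend input/output $/M prices, assuming `ratio` input tokens per output token."""
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

def score_per_dollar(score: float, input_per_m: float, output_per_m: float) -> float:
    """Benchmark points per blended $/M tokens (hypothetical cost-efficiency metric)."""
    return score / blended_price(input_per_m, output_per_m)

# Top-ranked row: Gemini 2.0 Flash at $0.100 in / $0.400 out, WMT 2014 score 38.9
print(round(blended_price(0.100, 0.400), 3))           # 0.175
print(round(score_per_dollar(38.9, 0.100, 0.400), 1))  # 222.3
```

Re-ranking the table by such a metric will favor cheap models like the $0.014/$0.028 row over the absolute top scorers; whether that trade-off is acceptable depends on the use case.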
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.