Workshop on Machine Translation 2014 — multilingual translation quality benchmark.
Data from LayerLens
As of March 15, 2026, the top-scoring model on WMT 2014 is Gemini 2.0 Flash at 38.9%, followed by Llama 3.1 405B Instruct and Llama 4 Maverick, both at 38.0%. 13 models have been evaluated on this benchmark.
Last updated: March 15, 2026
Models: 13 · Best Score: 38.9 · Average: 35.6 · Std Dev: 3.7
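The summary statistics above can be reproduced from the 13 per-model scores in the table below. A minimal check in Python (scores transcribed from the WMT 2014 column; the reported 3.7 matches the *population* standard deviation, not the sample one):

```python
from statistics import mean, pstdev

# WMT 2014 scores for the 13 evaluated models, as listed in the table
scores = [38.9, 38.0, 38.0, 37.6, 37.4, 37.3, 37.1, 36.9, 36.6,
          35.6, 34.1, 30.0, 25.2]

best = max(scores)            # 38.9
avg = round(mean(scores), 1)  # 35.6
sd = round(pstdev(scores), 1) # 3.7 (population standard deviation)

print(f"Models: {len(scores)}  Best: {best}  Average: {avg}  Std Dev: {sd}")
```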
| Provider | Model | Input $/M | Output $/M | WMT 2014 |
|---|---|---|---|---|
| Google | Gemini 2.0 Flash | $0.100 | $0.400 | 38.9 |
| Meta | Llama 3.1 405B Instruct | $0.900 | $0.900 | 38.0 |
| Meta | Llama 4 Maverick | $0.150 | $0.600 | 38.0 |
| — | — | $2.000 | $8.000 | 37.6 |
| — | — | $3.000 | $15.000 | 37.4 |
| — | — | $2.500 | $10.000 | 37.3 |
| — | — | $0.080 | $0.300 | 37.1 |
| — | — | $0.800 | $3.200 | 36.9 |
| — | — | $0.014 | $0.028 | 36.6 |
| — | — | $0.800 | $4.000 | 35.6 |
| — | — | $0.070 | $0.280 | 34.1 |
| — | — | $2.500 | $10.000 | 30.0 |
| — | — | $0.030 | $0.050 | 25.2 |
Pricing from OpenRouter. Benchmarks from Artificial Analysis.
This leaderboard shows all models with WMT 2014 benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
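One way to use the pricing columns for a cost-vs-performance comparison is a blended per-token price. The 3:1 input-to-output token ratio and the `score_per_dollar` metric below are illustrative assumptions, not something the leaderboard defines:

```python
def blended_price(input_per_m: float, output_per_m: float, ratio: float = 3.0) -> float:
    """Blend input/output $/M prices, assuming `ratio` input tokens per output token."""
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

def score_per_dollar(score: float, input_per_m: float, output_per_m: float) -> float:
    """Benchmark points per blended $/M tokens (hypothetical cost-efficiency metric)."""
    return score / blended_price(input_per_m, output_per_m)

# Top-ranked row: Gemini 2.0 Flash at $0.100 in / $0.400 out, WMT 2014 score 38.9
print(round(blended_price(0.100, 0.400), 3))           # 0.175
print(round(score_per_dollar(38.9, 0.100, 0.400), 1))  # 222.3
```

Re-ranking the table by such a metric will favor cheap models like the $0.014/$0.028 row over the absolute top scorers; whether that trade-off is acceptable depends on the use case.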
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.