IFEval Leaderboard

IFEval (Instruction Following Evaluation) is a benchmark that tests how well LLMs follow detailed formatting and content constraints.

As of May 20, 2026, the top-scoring models on IFEval are Kimi K2.5 Thinking and Kimi K2.5, tied at 92.6%, followed by Gemini 2.5 Pro, GLM 4.7 Thinking, and GLM 4.7, tied at 90.8%. 26 models have been evaluated on this benchmark.

Last updated: May 20, 2026

Models: 26
Best Score: 92.6
Average: 83.8
Std Dev: 6.3
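
The summary figures can be reproduced from the IFEval column of the table on this page. The sketch below is one way to do that in Python. The page does not say how the standard deviation was computed; the population formula (`statistics.pstdev`) is an assumption here, chosen because it gives 6.3 on the listed scores. The `summarize` helper is a hypothetical name used only for illustration.

```python
from statistics import fmean, pstdev

# IFEval scores copied from the 26 table rows on this page, in table order.
scores = [
    92.6, 92.6, 90.8, 90.8, 90.8, 88.5, 88.1, 88.1, 87.6, 87.6,
    85.4, 84.3, 84.3, 84.3, 84.3, 83.9, 82.8, 81.9, 81.9, 81.3,
    81.3, 78.7, 75.4, 75.4, 68.6, 68.6,
]


def summarize(values):
    """Return the count, best score, mean, and standard deviation."""
    return {
        "models": len(values),
        "best": max(values),
        "average": round(fmean(values), 1),
        # Assumption: population (not sample) standard deviation.
        "std_dev": round(pstdev(values), 1),
    }


summary = summarize(scores)
print(summary)
```

Swapping `pstdev` for `statistics.stdev` would give the sample standard deviation instead, which is slightly larger for the same scores.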

Categories: Instruction Following

Source: LayerLens

Provider	Model	Input ($/M tokens)	Output ($/M tokens)	IFEval (%)
Kimi	Kimi K2.5 Thinking	$0.400	$1.900	92.6
Kimi	Kimi K2.5	$0.400	$1.900	92.6
Google	Gemini 2.5 Pro	$1.000	$10.000	90.8
Z AI	GLM 4.7 Thinking	$0.400	$1.750	90.8
Z AI	GLM 4.7	$0.400	$1.750	90.8
Anthropic	Claude 3.7 Sonnet Thinking	$3.000	$15.000	88.5
DeepSeek	DeepSeek V3.2 Exp Thinking	$0.270	$0.410	88.1
DeepSeek	DeepSeek V3.2 Exp	$0.270	$0.410	88.1
Kimi	Kimi K2 0711	$0.550	$2.200	87.6
Kimi	Kimi K2 0711	$0.550	$2.200	87.6
Baidu	ERNIE 4.5 300B A47B	$0.280	$0.900	85.4
Google	Gemini 2.5 Flash Thinking	$0.300	$2.500	84.3
Google	Gemini 2.5 Flash Thinking	$0.300	$2.500	84.3
Google	Gemini 2.5 Flash	$0.300	$2.500	84.3
Google	Gemini 2.5 Flash	$0.300	$2.500	84.3
Meta	Llama 4 Scout	$0.080	$0.300	83.9
Alibaba	Qwen3 Coder 480B A35B (exacto)	$0.220	$0.900	82.8
Amazon	Nova Pro 1.0	$0.800	$3.200	81.9
Google	Gemma 3 27B	$0.080	$0.160	81.9
Alibaba	Qwen3 32B Thinking	$0.080	$0.280	81.3
Alibaba	Qwen3 32B	$0.080	$0.280	81.3
Alibaba	Qwen3 Coder 30B A3B Instruct	$0.070	$0.270	78.7
Nous Research	Hermes 4 405B Thinking	$1.000	$3.000	75.4
Nous Research	Hermes 4 405B	$1.000	$3.000	75.4
Nous Research	Hermes 4 70B Thinking	$0.130	$0.400	68.6
Nous Research	Hermes 4 70B	$0.130	$0.400	68.6

Pricing from OpenRouter. Benchmarks from Artificial Analysis.

About IFEval

This leaderboard shows all models with IFEval benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
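
One way to compare performance against cost is to keep only the models that no other model beats on both score and price (a Pareto frontier). The sketch below applies that idea to a handful of rows copied from the table. The `Row` type, the `blended_price` and `pareto_frontier` helpers, and the equal weighting of input and output prices are all assumptions made for illustration, not something this page defines. Real workloads usually have a different input-to-output token ratio, so the weighting should be adjusted to match.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    ifeval: float        # IFEval score in percent


# A subset of rows copied from the leaderboard table.
ROWS = [
    Row("Kimi K2.5", 0.40, 1.90, 92.6),
    Row("Gemini 2.5 Pro", 1.00, 10.00, 90.8),
    Row("GLM 4.7", 0.40, 1.75, 90.8),
    Row("Claude 3.7 Sonnet Thinking", 3.00, 15.00, 88.5),
    Row("DeepSeek V3.2 Exp", 0.27, 0.41, 88.1),
    Row("Llama 4 Scout", 0.08, 0.30, 83.9),
    Row("Gemma 3 27B", 0.08, 0.16, 81.9),
    Row("Hermes 4 405B", 1.00, 3.00, 75.4),
]


def blended_price(row: Row) -> float:
    # Assumption: input and output tokens are weighted equally.
    return (row.input_price + row.output_price) / 2


def pareto_frontier(rows):
    """Return rows not dominated on (higher score, lower price), cheapest first."""
    frontier = []
    for r in rows:
        dominated = any(
            o.ifeval >= r.ifeval
            and blended_price(o) <= blended_price(r)
            and (o.ifeval > r.ifeval or blended_price(o) < blended_price(r))
            for o in rows
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=blended_price)


for row in pareto_frontier(ROWS):
    print(f"{row.model}: {row.ifeval}% at ${blended_price(row):.3f}/M blended")
```

On this subset, Gemini 2.5 Pro, Claude 3.7 Sonnet Thinking, and Hermes 4 405B drop out because a cheaper listed model scores at least as high. The remaining models each represent a different trade-off between price and score.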