AGIEval Chinese Leaderboard

AGIEval Chinese consists of reasoning tasks from Chinese standardized exams (Gaokao, civil service).

As of June 4, 2026, the top-scoring model on AGIEval Chinese is DeepSeek V3.2 Exp Thinking at 90.1%, followed by Qwen3 235B A22B Thinking and Qwen3 235B A22B, tied at 89.4%. A total of 31 models have been evaluated on this benchmark.

Last updated: June 4, 2026

Models: 31
Best Score: 90.1
Average: 76.8
Std Dev: 11.4
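
The summary figures above can be reproduced from the score column of the leaderboard table. The page does not state which standard-deviation formula it uses; as a minimal sketch, the population form (`statistics.pstdev`) matches the reported 11.4 for the 31 scores listed, while the sample form (`statistics.stdev`) would round to 11.6. The variable names are illustrative, not part of the site.

```python
from statistics import mean, pstdev

# AGIEval Chinese scores copied from the leaderboard table, highest to lowest.
SCORES = [
    90.1, 89.4, 89.4, 89.0, 88.2, 88.2, 87.8, 87.3, 86.7, 86.7,
    85.8, 85.5, 85.5, 84.4, 78.4, 77.4, 75.8, 74.8, 74.2, 74.2,
    73.6, 73.3, 71.4, 69.3, 64.8, 63.1, 60.8, 60.3, 59.8, 56.2,
    49.7,
]

n_models = len(SCORES)
best = max(SCORES)
average = round(mean(SCORES), 1)
# Assumption: population standard deviation (divide by n, not n - 1).
std_dev = round(pstdev(SCORES), 1)

print(n_models, best, average, std_dev)  # 31 90.1 76.8 11.4
```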

Categories: Reasoning and Logic, Multilingual

Source: LayerLens

Provider	Model	Input ($/M tokens)	Output ($/M tokens)	AGIEval Chinese (%)
DeepSeek	DeepSeek V3.2 Exp Thinking	$0.270	$0.410	90.1
Alibaba	Qwen3 235B A22B Thinking	$0.455	$0.900	89.4
Alibaba	Qwen3 235B A22B	$0.455	$0.900	89.4
Baidu	ERNIE 4.5 300B A47B	$0.900	$0.900	89.0
Z AI	GLM 4.6 Thinking	$0.430	$1.740	88.2
Z AI	GLM 4.6	$0.430	$1.740	88.2
DeepSeek	R1	$0.550	$2.000	87.8
Alibaba	QwQ 32B	$0.900	$0.900	87.3
Alibaba	Qwen3 32B Thinking	$0.080	$0.280	86.7
Alibaba	Qwen3 32B	$0.080	$0.280	86.7
DeepSeek	DeepSeek V3.2 Exp	$0.270	$0.410	85.8
Alibaba	Qwen3 30B A3B Thinking	$0.080	$0.280	85.5
Alibaba	Qwen3 30B A3B	$0.080	$0.280	85.5
Alibaba	Qwen3 235B A22B Instruct 2507	$0.071	$0.100	84.4
Alibaba	Qwen2.5 72B Instruct	$0.360	$0.400	78.4
xAI	Grok 3 Mini Beta	$0.300	$0.500	77.4
DeepSeek	DeepSeek V3	$0.014	$0.028	75.8
OpenAI	o3 Mini	$0.550	$2.200	74.8
Mistral	Mistral Medium 3	$0.400	$2.000	74.2
Mistral	Mistral Medium 3.1	$0.400	$2.000	74.2
Cohere	Command A	$2.500	$10.000	73.6
Google	Gemini 2.0 Flash	$0.100	$0.400	73.3
xAI	Grok 3 Beta	$3.000	$15.000	71.4
OpenAI	GPT-4.1	$2.000	$8.000	69.3
Amazon	Nova Pro 1.0	$0.800	$3.200	64.8
Mistral	Pixtral Large 2411	$2.000	$6.000	63.1
Microsoft	Phi 4	$0.065	$0.140	60.8
Google	Gemma 3 27B	$0.080	$0.160	60.3
Mistral	Devstral Small 1.1	$0.070	$0.280	59.8
Amazon	Nova Micro 1.0	$0.035	$0.140	56.2
Inflection	Inflection 3 Pi	$2.500	$10.000	49.7

Pricing from OpenRouter. Benchmarks from Artificial Analysis.

About AGIEval Chinese

This leaderboard shows all models with AGIEval Chinese benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
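
One way to compare performance against cost is to find the models that no cheaper model outscores (a Pareto frontier). The sketch below is illustrative only: it uses five rows copied from the table, and the blended price with a 75% input-token share is an assumption, not a metric the leaderboard defines. `blended_price` and `pareto_frontier` are hypothetical helper names.

```python
# (model, input $/M tokens, output $/M tokens, AGIEval Chinese score)
# A small subset of the leaderboard rows, for illustration.
ROWS = [
    ("DeepSeek V3.2 Exp Thinking", 0.270, 0.410, 90.1),
    ("Qwen3 235B A22B Instruct 2507", 0.071, 0.100, 84.4),
    ("DeepSeek V3", 0.014, 0.028, 75.8),
    ("GPT-4.1", 2.000, 8.000, 69.3),
    ("Inflection 3 Pi", 2.500, 10.000, 49.7),
]


def blended_price(input_price, output_price, input_share=0.75):
    """Blended $/M tokens. Assumption: input_share of all tokens are input."""
    return input_share * input_price + (1 - input_share) * output_price


def pareto_frontier(rows, input_share=0.75):
    """Return model names that no cheaper (or equally priced) model outscores."""
    priced = sorted(
        ((blended_price(i, o, input_share), score, name)
         for name, i, o, score in rows),
        key=lambda t: (t[0], -t[1]),  # cheapest first, higher score breaks ties
    )
    frontier, best_score = [], float("-inf")
    for _price, score, name in priced:
        # Keep a model only if it beats every cheaper model seen so far.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


for name in pareto_frontier(ROWS):
    print(name)
```

Under these assumptions, the frontier lists models from cheapest to most expensive, each scoring higher than the one before it. Changing `input_share` shifts the blended prices and can change which models appear.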