Price Per Token

AGIEval Chinese Leaderboard

AGIEval Chinese — reasoning tasks from Chinese standardized exams (Gaokao, civil service).

Data from LayerLens

As of April 18, 2026, the top-scoring model on AGIEval Chinese is DeepSeek V3.2 Exp at 90.1%, followed by two Qwen3 235B A22B entries tied at 89.4%. In total, 32 models have been evaluated on this benchmark.

Last updated: April 18, 2026

Models: 32
Best Score: 90.1
Average: 77.1
Std Dev: 11.4
Categories: Reasoning and Logic, Multilingual

Input $/M    Output $/M    AGIEval Chinese
$0.270       $0.410        90.1
$0.455       $0.900        89.4
$0.455       $0.900        89.4
$0.280       $0.900        89.0
$0.390       $1.740        88.2
$0.390       $1.740        88.2
$0.550       $2.000        87.8
$0.150       $0.580        87.3
$0.150       $0.580        87.3
$0.080       $0.240        86.7
$0.080       $0.240        86.7
$0.270       $0.410        85.8
$0.080       $0.280        85.5
$0.080       $0.280        85.5
$0.071       $0.100        84.4
$0.120       $0.390        78.4
$0.300       $0.500        77.4
$0.014       $0.028        75.8
$0.550       $2.200        74.8
$0.400       $2.000        74.2
$0.400       $2.000        74.2
$2.500       $10.000       73.6
$0.100       $0.400        73.3
$3.000       $15.000       71.4
$2.000       $8.000        69.3
$0.800       $3.200        64.8
$2.000       $6.000        63.1
$0.065       $0.140        60.8
$0.080       $0.160        60.3
$0.070       $0.280        59.8
$0.035       $0.140        56.2
$2.500       $10.000       49.7

Pricing from OpenRouter. Benchmarks from Artificial Analysis.

About AGIEval Chinese

This leaderboard shows all models with AGIEval Chinese benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
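One way to compare performance against cost is to divide each score by a blended token price. The sketch below is illustrative only: the 75/25 input-to-output weighting, the `Row` type, and the `blended_price` and `points_per_dollar` helpers are assumptions, not something the leaderboard defines. The four rows are copied from the table; only the top row is named, since it is the 90.1 entry attributed to DeepSeek V3.2 Exp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    label: str           # position in the leaderboard table
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens
    score: float         # AGIEval Chinese score


def blended_price(row: Row, input_share: float = 0.75) -> float:
    """Weighted price per million tokens; input_share is an assumed traffic mix."""
    return input_share * row.input_per_m + (1 - input_share) * row.output_per_m


def points_per_dollar(row: Row, input_share: float = 0.75) -> float:
    """Benchmark points per blended dollar (hypothetical metric for illustration)."""
    return row.score / blended_price(row, input_share)


# Sample rows copied from the leaderboard table.
ROWS = [
    Row("rank 1 (DeepSeek V3.2 Exp)", 0.270, 0.410, 90.1),
    Row("rank 15", 0.071, 0.100, 84.4),
    Row("rank 18", 0.014, 0.028, 75.8),
    Row("rank 24", 3.000, 15.000, 71.4),
]

# Sort from most to least points per blended dollar.
ranked = sorted(ROWS, key=points_per_dollar, reverse=True)

for row in ranked:
    print(f"{row.label:28s} ${blended_price(row):7.4f}/M  {points_per_dollar(row):8.1f} pts/$")
```

A ratio like this rewards very cheap models heavily, so it is best read alongside the raw score rather than in place of it. Changing `input_share` to match a real workload can reorder the results.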

Frequently Asked Questions

AGIEval Chinese is a benchmark of reasoning tasks drawn from Chinese standardized exams (Gaokao, civil service).
As of April 18, 2026, DeepSeek V3.2 Exp leads the AGIEval Chinese leaderboard with a score of 90.1. Rankings change as new models are released and evaluated.
Currently, 32 models have been evaluated on AGIEval Chinese, with an average score of 77.1 and a standard deviation of 11.4.
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.
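
The summary figures quoted above can be rechecked from the 32 scores in the leaderboard table. The page does not say whether its standard deviation is the population or the sample form, so this sketch computes both with Python's `statistics` module; the `summarize` helper is a hypothetical name used only for this example.

```python
import statistics

# AGIEval Chinese scores copied from the leaderboard table, highest first.
SCORES = [
    90.1, 89.4, 89.4, 89.0, 88.2, 88.2, 87.8, 87.3,
    87.3, 86.7, 86.7, 85.8, 85.5, 85.5, 84.4, 78.4,
    77.4, 75.8, 74.8, 74.2, 74.2, 73.6, 73.3, 71.4,
    69.3, 64.8, 63.1, 60.8, 60.3, 59.8, 56.2, 49.7,
]


def summarize(scores):
    """Return count, best score, mean, and both standard deviation variants."""
    return {
        "count": len(scores),
        "best": max(scores),
        "mean": statistics.fmean(scores),
        "population_sd": statistics.pstdev(scores),  # divides by n
        "sample_sd": statistics.stdev(scores),       # divides by n - 1
    }


summary = summarize(SCORES)
for name, value in summary.items():
    print(f"{name:14s} {value:.1f}" if isinstance(value, float) else f"{name:14s} {value}")
```

Rounded to one decimal, the mean and the population form agree with the published 77.1 and 11.4, which suggests (but does not confirm) that the page reports the population standard deviation.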