Price Per Token

AGIEval Chinese Leaderboard

AGIEval Chinese is a benchmark of reasoning tasks drawn from Chinese standardized exams (Gaokao, civil service).

Data from LayerLens

As of June 4, 2026, the top-scoring model on AGIEval Chinese is DeepSeek V3.2 Exp at 90.1%, followed by two Qwen3 235B A22B entries tied at 89.4%. In total, 31 models have been evaluated on this benchmark.

Last updated: June 4, 2026

Models: 31
Best Score: 90.1
Average: 76.8
Std Dev: 11.4

Categories: Reasoning and Logic, Multilingual
Input $/M    Output $/M    AGIEval Chinese
$0.270       $0.410        90.1
$0.455       $0.900        89.4
$0.455       $0.900        89.4
$0.900       $0.900        89.0
$0.430       $1.740        88.2
$0.430       $1.740        88.2
$0.550       $2.000        87.8
$0.900       $0.900        87.3
$0.080       $0.280        86.7
$0.080       $0.280        86.7
$0.270       $0.410        85.8
$0.080       $0.280        85.5
$0.080       $0.280        85.5
$0.071       $0.100        84.4
$0.360       $0.400        78.4
$0.300       $0.500        77.4
$0.014       $0.028        75.8
$0.550       $2.200        74.8
$0.400       $2.000        74.2
$0.400       $2.000        74.2
$2.500       $10.000       73.6
$0.100       $0.400        73.3
$3.000       $15.000       71.4
$2.000       $8.000        69.3
$0.800       $3.200        64.8
$2.000       $6.000        63.1
$0.065       $0.140        60.8
$0.080       $0.160        60.3
$0.070       $0.280        59.8
$0.035       $0.140        56.2
$2.500       $10.000       49.7

Pricing from OpenRouter. Benchmarks from Artificial Analysis.

About AGIEval Chinese

This leaderboard shows all models with AGIEval Chinese benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
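One way to compare performance against cost is to look for rows that no other row beats on both price and score. The sketch below is illustrative only: it uses a handful of price and score triples copied from the leaderboard, and the `Row` class, the `blended_price` helper, and the 75/25 input-to-output token mix are assumptions made for the example, not something the site defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row: prices in USD per million tokens, score in percent."""
    input_per_m: float
    output_per_m: float
    score: float

    def blended_price(self, input_share: float = 0.75) -> float:
        # Hypothetical workload mix: 75% input tokens, 25% output tokens.
        return input_share * self.input_per_m + (1 - input_share) * self.output_per_m


# A few rows taken from the table (model names are not needed for the comparison).
rows = [
    Row(0.270, 0.410, 90.1),
    Row(0.455, 0.900, 89.4),
    Row(0.080, 0.280, 86.7),
    Row(0.014, 0.028, 75.8),
    Row(2.500, 10.000, 73.6),
]


def is_dominated(candidate: Row, others: list[Row]) -> bool:
    """True if some other row is at least as cheap and at least as accurate,
    and strictly better on one of the two."""
    for other in others:
        if other is candidate:
            continue
        cheaper_or_equal = other.blended_price() <= candidate.blended_price()
        better_or_equal = other.score >= candidate.score
        strictly_better = (
            other.blended_price() < candidate.blended_price()
            or other.score > candidate.score
        )
        if cheaper_or_equal and better_or_equal and strictly_better:
            return True
    return False


# Rows worth considering: nothing else is both cheaper and higher scoring.
frontier = [row for row in rows if not is_dominated(row, rows)]

for row in frontier:
    print(f"${row.blended_price():.4f}/M blended -> {row.score}")
```

Changing `input_share` shifts the blended price toward input or output cost, so the set of non-dominated rows can differ for output-heavy workloads.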

Frequently Asked Questions

As of June 4, 2026, DeepSeek V3.2 Exp leads the AGIEval Chinese leaderboard with a score of 90.1. Rankings change as new models are released and evaluated.
Currently, 31 models have been evaluated on AGIEval Chinese, with an average score of 76.8 and a standard deviation of 11.4.
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.
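
The summary statistics quoted above can be recomputed from the scores in the leaderboard table. The following is a minimal sketch using Python's standard `statistics` module; which standard deviation formula the site uses is not stated, so both the population and sample forms are shown.

```python
import statistics

# The 31 AGIEval Chinese scores listed in the leaderboard table, highest first.
scores = [
    90.1, 89.4, 89.4, 89.0, 88.2, 88.2, 87.8, 87.3, 86.7, 86.7,
    85.8, 85.5, 85.5, 84.4, 78.4, 77.4, 75.8, 74.8, 74.2, 74.2,
    73.6, 73.3, 71.4, 69.3, 64.8, 63.1, 60.8, 60.3, 59.8, 56.2,
    49.7,
]

mean_score = statistics.mean(scores)
std_population = statistics.pstdev(scores)  # divides by N
std_sample = statistics.stdev(scores)       # divides by N - 1

print(f"models: {len(scores)}")
print(f"mean: {mean_score:.1f}")
print(f"population std dev: {std_population:.1f}")
print(f"sample std dev: {std_sample:.1f}")
```

With these inputs, the mean and the population standard deviation round to the published 76.8 and 11.4, while the sample form comes out slightly larger.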