Accounting Audit Leaderboard

Accounting Audit is an accounting and audit benchmark that tests financial reasoning capabilities.

As of June 2, 2026, the top score on Accounting Audit is 86.7%, shared by Claude 3.7 Sonnet Thinking and Gemini 2.5 Pro Preview 05-06. Claude 3.7 Sonnet follows at 83.3%, tied with several other models. The leaderboard lists 66 models, three of which have no reported score.

Last updated: June 2, 2026

Models: 66

Best Score: 86.7

Average: 71.6

Std Dev: 18.9
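The summary figures can be checked against the score column of the table. The sketch below is a minimal check, not the site's documented method: the tallies in SCORE_COUNTS were counted by hand from the table rows, and the names SCORE_COUNTS, UNSCORED_ROWS, and summarize are hypothetical. It computes the statistics two ways, over scored rows only and with the three unscored rows counted as 0. The published Average and Std Dev appear to match the second variant (using the population standard deviation), but that is an inference from the numbers.

```python
from statistics import fmean, pstdev

# Score -> number of table rows showing that score (tallied by hand from the table).
SCORE_COUNTS = {
    86.7: 2, 83.3: 13, 80.0: 15, 76.7: 14, 73.3: 8, 70.0: 3,
    66.7: 1, 63.3: 1, 53.3: 1, 50.0: 2, 43.3: 1, 40.0: 1, 33.3: 1,
}
UNSCORED_ROWS = 3  # rows whose score column shows "-"


def summarize(scores):
    """Return count, best score, mean, and population std dev (rounded to 0.1)."""
    return {
        "n": len(scores),
        "best": max(scores),
        "mean": round(fmean(scores), 1),
        "std_dev": round(pstdev(scores), 1),
    }


# Expand the tallies into one score per row.
scored = [score for score, count in SCORE_COUNTS.items() for _ in range(count)]

# Assumption: treat a missing score as 0 to see whether the published figures follow.
zero_filled = scored + [0.0] * UNSCORED_ROWS

scored_only_stats = summarize(scored)
zero_filled_stats = summarize(zero_filled)

print(scored_only_stats)  # n=63, mean 75.0, std_dev 11.0
print(zero_filled_stats)  # n=66, mean 71.6, std_dev 18.9
```

If the zero-filled reading is right, the average over models that actually have a score is closer to 75.0 than to 71.6, which matters when comparing a single model against the "average".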

Categories: Financial Reasoning

Source: LayerLens

Provider	Model	Input $/M	Output $/M	Accounting Audit (%)
Anthropic	Claude 3.7 Sonnet Thinking	$3.000	$15.000	86.7
Google	Gemini 2.5 Pro Preview 05-06	$1.250	$10.000	86.7
Anthropic	Claude 3.7 Sonnet	$3.000	$15.000	83.3
Alibaba	QwQ 32B	$0.900	$0.900	83.3
Alibaba	QwQ 32B	$0.900	$0.900	83.3
OpenAI	GPT-4.1	$2.000	$8.000	83.3
Google	Gemini 2.5 Flash Thinking	$0.300	$2.500	83.3
Google	Gemini 2.5 Flash Thinking	$0.300	$2.500	83.3
Google	Gemini 2.5 Flash	$0.300	$2.500	83.3
Google	Gemini 2.5 Flash	$0.300	$2.500	83.3
xAI	Grok 4	$3.000	$15.000	83.3
OpenAI	GPT-5	$1.250	$10.000	83.3
OpenAI	GPT-5	$1.250	$10.000	83.3
OpenAI	GPT-5	$1.250	$10.000	83.3
OpenAI	GPT-5	$1.250	$10.000	83.3
DeepSeek	DeepSeek V3	$0.014	$0.028	80.0
DeepSeek	R1	$0.550	$2.000	80.0
Meta	Llama 4 Maverick	$0.150	$0.600	80.0
OpenAI	o4 Mini High	$1.100	$4.400	80.0
Alibaba	Qwen3 14B Thinking	$0.080	$0.200	80.0
Alibaba	Qwen3 14B	$0.080	$0.200	80.0
Alibaba	Qwen3 30B A3B Thinking	$0.080	$0.280	80.0
Alibaba	Qwen3 30B A3B	$0.080	$0.280	80.0
Anthropic	Claude Sonnet 4 Thinking	$3.000	$15.000	80.0
Anthropic	Claude Sonnet 4	$3.000	$15.000	80.0
Anthropic	Claude Opus 4 Thinking	$15.000	$75.000	80.0
Anthropic	Claude Opus 4	$15.000	$75.000	80.0
DeepSeek	R1 0528	$0.500	$2.150	80.0
Z AI	GLM 4.7 Thinking	$0.400	$1.540	80.0
Z AI	GLM 4.7	$0.400	$1.540	80.0
Nous Research	Hermes 3 405B Instruct	$1.000	$1.000	76.7
Anthropic	Claude 3.5 Haiku	$0.800	$4.000	76.7
Amazon	Nova Pro 1.0	$0.800	$3.200	76.7
OpenAI	o3 Mini	$0.550	$2.200	76.7
Alibaba	Qwen3 235B A22B Thinking	$0.455	$0.900	76.7
Alibaba	Qwen3 235B A22B	$0.455	$0.900	76.7
Alibaba	Qwen3 32B Thinking	$0.080	$0.280	76.7
Alibaba	Qwen3 32B	$0.080	$0.280	76.7
xAI	Grok 3	$3.000	$15.000	76.7
xAI	Grok 3	$3.000	$15.000	76.7
MiniMax	MiniMax M1	$0.400	$2.200	76.7
MiniMax	MiniMax M1	$0.400	$2.200	76.7
Alibaba	Qwen3 Max	$0.780	$3.900	76.7
Alibaba	Qwen3 Max	$0.780	$3.900	76.7
xAI	Grok 3 Beta	$3.000	$15.000	73.3
Mistral	Mistral Medium 3	$0.400	$2.000	73.3
NVIDIA	Llama 3.3 Nemotron Super 49B V1.5 Thinking	$0.100	$0.400	73.3
NVIDIA	Llama 3.3 Nemotron Super 49B V1.5 Thinking	$0.100	$0.400	73.3
NVIDIA	Llama 3.3 Nemotron Super 49B V1.5	$0.100	$0.400	73.3
NVIDIA	Llama 3.3 Nemotron Super 49B V1.5	$0.100	$0.400	73.3
Anthropic	Claude Haiku 4.5 Thinking	$1.000	$5.000	73.3
Anthropic	Claude Haiku 4.5	$1.000	$5.000	73.3
Nous Research	Hermes 3 70B Instruct	$0.300	$0.300	70.0
Cohere	Command A	$2.500	$10.000	70.0
Meta	Llama 4 Scout	$0.080	$0.300	70.0
Microsoft	Phi 4	$0.065	$0.140	66.7
Mistral	Devstral Small 1.1	$0.070	$0.280	63.3
Amazon	Nova Micro 1.0	$0.035	$0.140	53.3
Meta	Llama 3.2 3B Instruct	$0.030	$0.050	50.0
Mistral	Pixtral Large 2411	$2.000	$6.000	50.0
Google	Gemma 3n 4B	$0.060	$0.120	43.3
Meta	Llama 3.1 405B Instruct	$0.900	$0.900	40.0
Google	Gemma 3 27B	$0.080	$0.160	33.3
Inflection	Inflection 3 Productivity	$2.500	$10.000	-
Inflection	Inflection 3 Pi	$2.500	$10.000	-
Google	Gemini 2.0 Flash	$0.100	$0.400	-

Pricing from OpenRouter. Benchmarks from Artificial Analysis.

About Accounting Audit

This leaderboard ranks models by Accounting Audit score, from highest to lowest; models without a reported score are listed last. Input and output prices (in dollars per million tokens) are included so you can compare performance against cost.
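One way to compare performance against cost is to look for models that no cheaper model matches or beats on score (a Pareto frontier). The sketch below uses a handful of rows copied from the table. The blended price is an assumption for illustration only: INPUT_SHARE = 0.75 supposes a workload with three input tokens for every output token, and the names Row, blended_price, and pareto_frontier are hypothetical. Adjust the share to match your own traffic.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    input_per_m: float   # dollars per million input tokens
    output_per_m: float  # dollars per million output tokens
    score: float         # Accounting Audit score in percent


# A small sample of rows copied from the leaderboard table.
ROWS = [
    Row("Claude 3.7 Sonnet Thinking", 3.000, 15.000, 86.7),
    Row("Gemini 2.5 Pro Preview 05-06", 1.250, 10.000, 86.7),
    Row("QwQ 32B", 0.900, 0.900, 83.3),
    Row("GPT-4.1", 2.000, 8.000, 83.3),
    Row("Gemini 2.5 Flash", 0.300, 2.500, 83.3),
    Row("DeepSeek V3", 0.014, 0.028, 80.0),
    Row("Claude Opus 4", 15.000, 75.000, 80.0),
    Row("Phi 4", 0.065, 0.140, 66.7),
]

# Assumption: 75% of billed tokens are input tokens. Not from the source.
INPUT_SHARE = 0.75


def blended_price(row: Row) -> float:
    """Weighted average price per million tokens under the assumed input share."""
    return INPUT_SHARE * row.input_per_m + (1 - INPUT_SHARE) * row.output_per_m


def pareto_frontier(rows: list[Row]) -> list[Row]:
    """Keep rows whose score strictly exceeds every cheaper (or equally priced) row."""
    frontier = []
    best_score = float("-inf")
    # Cheapest first; at equal price, the higher score goes first.
    for row in sorted(rows, key=lambda r: (blended_price(r), -r.score)):
        if row.score > best_score:
            frontier.append(row)
            best_score = row.score
    return frontier


frontier = pareto_frontier(ROWS)
for row in frontier:
    print(f"{row.model}: ${blended_price(row):.4f}/M blended, {row.score}%")
```

On this sample the frontier is DeepSeek V3, Gemini 2.5 Flash, and Gemini 2.5 Pro Preview 05-06. Every other sampled model is matched or beaten on score by something cheaper under the assumed token mix, and a different input share can change which models survive.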