GAIA — General AI Assistants benchmark testing multi-step real-world tasks.
Data from LayerLens
As of April 18, 2026, the top-scoring model on GAIA is GPT-5 Mini at 44.8%, followed by Claude 3.7 Sonnet at 43.9%. 12 models have been evaluated on this benchmark.
Last updated: April 18, 2026
Models: 12 | Best Score: 44.8 | Average: 27.5 | Std Dev: 13.6
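The summary statistics above can be reproduced from the 12 GAIA scores in the table below; a minimal sketch (the "Average" and "Std Dev" figures match the mean and population standard deviation, rounded to one decimal):

```python
from statistics import mean, pstdev

# GAIA scores for the 12 evaluated models, taken from the table below.
scores = [44.8, 44.8, 43.9, 43.9, 33.3, 27.9, 23.3, 20.6, 12.3, 12.3, 11.5, 11.5]

best = max(scores)                 # 44.8
average = round(mean(scores), 1)   # 27.5
std_dev = round(pstdev(scores), 1) # 13.6 (population std dev)
```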
| Provider | Model | Input $/M | Output $/M | GAIA |
|---|---|---|---|---|
| | | $0.125 | $1.000 | 44.8 |
| | | $0.125 | $1.000 | 44.8 |
| | | $3.000 | $15.000 | 43.9 |
| | | $3.000 | $15.000 | 43.9 |
| | | $1.000 | $10.000 | 33.3 |
| | | $0.500 | $2.150 | 27.9 |
| | | $0.400 | $2.000 | 23.3 |
| | | $0.090 | $0.400 | 20.6 |
| | | $0.080 | $0.240 | 12.3 |
| | | $0.080 | $0.240 | 12.3 |
| | | $0.150 | $0.750 | 11.5 |
| | | $0.150 | $0.750 | 11.5 |
Pricing from OpenRouter. Benchmarks from Artificial Analysis.
This leaderboard shows all models with GAIA benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
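One way to compare performance against cost is score per dollar of blended token price. A sketch using the prices and scores from the table above (model names are blank there, so rows are identified by price and score only; the 3:1 input-to-output mix is an illustrative assumption, not part of the leaderboard):

```python
# (input $/M, output $/M, GAIA score) per row of the table above.
rows = [
    (0.125, 1.000, 44.8), (0.125, 1.000, 44.8),
    (3.000, 15.000, 43.9), (3.000, 15.000, 43.9),
    (1.000, 10.000, 33.3), (0.500, 2.150, 27.9),
    (0.400, 2.000, 23.3), (0.090, 0.400, 20.6),
    (0.080, 0.240, 12.3), (0.080, 0.240, 12.3),
    (0.150, 0.750, 11.5), (0.150, 0.750, 11.5),
]

def blended_price(inp, out):
    # Assumed 3:1 input:output token mix; adjust for your workload.
    return (3 * inp + out) / 4

# Rank rows by GAIA points per blended $/M, best value first.
ranked = sorted(rows, key=lambda r: r[2] / blended_price(r[0], r[1]), reverse=True)
```

Under this mix, the cheapest top scorer (44.8 at $0.125/$1.000) comes out as the best value.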
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.