Price Per Token

BFCL v3 Leaderboard

Berkeley Function Calling Leaderboard v3 — testing function/tool calling accuracy.

Data from LayerLens

As of March 15, 2026, the top-scoring model on BFCL v3 is GLM 4.5 at 76.7%, followed by Qwen3 32B at 75.7%. 23 models have been evaluated on this benchmark.

Last updated: March 15, 2026

Models: 23
Best Score: 76.7
Average: 57.1
Std Dev: 18.6
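The summary statistics above can be reproduced from the 23 scores in the table; a minimal check with the Python standard library (the reported 18.6 matches the population standard deviation, not the sample one):

```python
from statistics import mean, pstdev

# The 23 BFCL v3 scores from the table, highest to lowest.
scores = [76.7, 75.7, 75.7, 74.9, 74.9, 74.6, 74.6, 74.2, 69.1, 67.9,
          64.5, 64.5, 63.5, 55.7, 53.5, 47.8, 41.6, 41.6, 40.8,
          25.3, 25.3, 25.3, 25.3]

print(len(scores))              # number of models: 23
print(max(scores))              # best score: 76.7
print(round(mean(scores), 1))   # average: 57.1
print(round(pstdev(scores), 1)) # population std dev: 18.6
```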

Input $/M    Output $/M    BFCL v3
$0.600       $2.200        76.7
$0.080       $0.240        75.7
$0.080       $0.240        75.7
$1.200       $6.000        74.9
$1.200       $6.000        74.9
$0.060       $0.400        74.6
$0.060       $0.400        74.6
$2.500       $10.000       74.2
$0.130       $0.850        69.1
$0.800       $3.200        67.9
$0.450       $2.200        64.5
$0.450       $2.200        64.5
$0.200       $1.100        63.5
$0.080       $0.300        55.7
$0.500       $3.000        53.5
$0.400       $1.760        47.8
$0.050       $0.200        41.6
$0.050       $0.200        41.6
$0.060       $0.140        40.8
$15.000      $75.000       25.3
$15.000      $75.000       25.3
$0.550       $2.200        25.3
$0.550       $2.200        25.3
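Because each row pairs pricing with a score, the table can also be ranked by score per blended dollar. The 3:1 input-to-output token mix below is an illustrative assumption, not the site's methodology:

```python
# (input $/M, output $/M, BFCL v3 score) rows from the table above.
rows = [
    (0.600, 2.200, 76.7), (0.080, 0.240, 75.7), (0.080, 0.240, 75.7),
    (1.200, 6.000, 74.9), (1.200, 6.000, 74.9), (0.060, 0.400, 74.6),
    (0.060, 0.400, 74.6), (2.500, 10.000, 74.2), (0.130, 0.850, 69.1),
    (0.800, 3.200, 67.9), (0.450, 2.200, 64.5), (0.450, 2.200, 64.5),
    (0.200, 1.100, 63.5), (0.080, 0.300, 55.7), (0.500, 3.000, 53.5),
    (0.400, 1.760, 47.8), (0.050, 0.200, 41.6), (0.050, 0.200, 41.6),
    (0.060, 0.140, 40.8), (15.000, 75.000, 25.3), (15.000, 75.000, 25.3),
    (0.550, 2.200, 25.3), (0.550, 2.200, 25.3),
]

def blended_price(inp: float, out: float) -> float:
    # Assumed 3:1 input:output token mix; adjust to your own workload.
    return (3 * inp + out) / 4

# Sort by score per blended dollar, best value first.
ranked = sorted(rows, key=lambda r: r[2] / blended_price(r[0], r[1]), reverse=True)
print(ranked[0])  # the $0.080/$0.240 row at 75.7 delivers the most score per dollar
```

Note that the cheapest model is not the best value here: the $0.050/$0.200 rows score only 41.6, while the $0.080/$0.240 rows score 75.7 for a similar blended price.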

Pricing from OpenRouter. Benchmarks from Artificial Analysis.

108 out of our 483 tracked models have had a price change in March.


About BFCL v3

Berkeley Function Calling Leaderboard v3 — testing function/tool calling accuracy.

This leaderboard shows all models with BFCL v3 benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.

Frequently Asked Questions

What does BFCL v3 measure?
Berkeley Function Calling Leaderboard v3 tests function/tool calling accuracy.

Which model currently leads BFCL v3?
As of March 15, 2026, GLM 4.5 leads the BFCL v3 leaderboard with a score of 76.7. Rankings change as new models are released and evaluated.

How many models have been evaluated?
Currently 23 models have been evaluated on BFCL v3, with an average score of 57.1 and a standard deviation of 18.6.

How often is the data updated?
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.
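OpenRouter reports model pricing in dollars per token, while the table above uses dollars per million tokens; the conversion is a single multiplication. A minimal sketch, using a hypothetical payload shaped like an OpenRouter models response (the model ID and prices below are illustrative, not real data):

```python
def to_per_million(per_token: str) -> float:
    """Convert a dollars-per-token price string to dollars per million tokens."""
    return float(per_token) * 1_000_000

# Hypothetical excerpt shaped like an OpenRouter models listing.
sample = {
    "data": [
        {"id": "example/model-a",
         "pricing": {"prompt": "0.0000006", "completion": "0.0000022"}},
    ]
}

for model in sample["data"]:
    p = model["pricing"]
    # 0.0000006 $/token -> 0.6 $/M input; 0.0000022 $/token -> 2.2 $/M output
    print(model["id"],
          round(to_per_million(p["prompt"]), 3),
          round(to_per_million(p["completion"]), 3))
```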