Price Per Token

BFCL v3 Leaderboard

Berkeley Function Calling Leaderboard v3 — testing function/tool calling accuracy.

Data from LayerLens

As of March 15, 2026, the top-scoring model on BFCL v3 is GLM 4.5 at 76.7%, followed by Qwen3 32B at 75.7%. 23 models have been evaluated on this benchmark.

Last updated: March 15, 2026

Models: 23
Best Score: 76.7
Average: 57.1
Std Dev: 18.6
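The summary statistics above can be reproduced from the 23 scores in the table; a minimal check with the Python standard library (the reported 18.6 matches the population standard deviation, not the sample one):

```python
from statistics import mean, pstdev

# The 23 BFCL v3 scores from the table, highest to lowest.
scores = [76.7, 75.7, 75.7, 74.9, 74.9, 74.6, 74.6, 74.2, 69.1, 67.9,
          64.5, 64.5, 63.5, 55.7, 53.5, 47.8, 41.6, 41.6, 40.8,
          25.3, 25.3, 25.3, 25.3]

print(len(scores))              # number of models: 23
print(max(scores))              # best score: 76.7
print(round(mean(scores), 1))   # average: 57.1
print(round(pstdev(scores), 1)) # population std dev: 18.6
```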

Input $/M    Output $/M    BFCL v3
$0.600       $2.200        76.7
$0.080       $0.240        75.7
$0.080       $0.240        75.7
$1.200       $6.000        74.9
$1.200       $6.000        74.9
$0.060       $0.400        74.6
$0.060       $0.400        74.6
$2.500       $10.000       74.2
$0.130       $0.850        69.1
$0.800       $3.200        67.9
$0.450       $2.200        64.5
$0.450       $2.200        64.5
$0.200       $1.100        63.5
$0.080       $0.300        55.7
$0.500       $3.000        53.5
$0.400       $1.760        47.8
$0.050       $0.200        41.6
$0.050       $0.200        41.6
$0.060       $0.140        40.8
$15.000      $75.000       25.3
$15.000      $75.000       25.3
$0.550       $2.200        25.3
$0.550       $2.200        25.3
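Because each row pairs pricing with a score, the table can also be ranked by score per blended dollar. The 3:1 input-to-output token mix below is an illustrative assumption, not the site's methodology:

```python
# (input $/M, output $/M, BFCL v3 score) rows from the table above.
rows = [
    (0.600, 2.200, 76.7), (0.080, 0.240, 75.7), (0.080, 0.240, 75.7),
    (1.200, 6.000, 74.9), (1.200, 6.000, 74.9), (0.060, 0.400, 74.6),
    (0.060, 0.400, 74.6), (2.500, 10.000, 74.2), (0.130, 0.850, 69.1),
    (0.800, 3.200, 67.9), (0.450, 2.200, 64.5), (0.450, 2.200, 64.5),
    (0.200, 1.100, 63.5), (0.080, 0.300, 55.7), (0.500, 3.000, 53.5),
    (0.400, 1.760, 47.8), (0.050, 0.200, 41.6), (0.050, 0.200, 41.6),
    (0.060, 0.140, 40.8), (15.000, 75.000, 25.3), (15.000, 75.000, 25.3),
    (0.550, 2.200, 25.3), (0.550, 2.200, 25.3),
]

def blended_price(inp: float, out: float) -> float:
    # Assumed 3:1 input:output token mix; adjust to your own workload.
    return (3 * inp + out) / 4

# Sort by score per blended dollar, best value first.
ranked = sorted(rows, key=lambda r: r[2] / blended_price(r[0], r[1]), reverse=True)
print(ranked[0])  # the $0.080/$0.240 row at 75.7 delivers the most score per dollar
```

Note that the cheapest model is not the best value here: the $0.050/$0.200 rows score only 41.6, while the $0.080/$0.240 rows score 75.7 for a similar blended price.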

Pricing from OpenRouter. Benchmarks from Artificial Analysis.

108 out of our 483 tracked models have had a price change in March.


About BFCL v3

Berkeley Function Calling Leaderboard v3 — testing function/tool calling accuracy.

This leaderboard shows all models with BFCL v3 benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.

Frequently Asked Questions

What does BFCL v3 measure?
Berkeley Function Calling Leaderboard v3 tests function/tool calling accuracy.

Which model currently leads BFCL v3?
As of March 15, 2026, GLM 4.5 leads the BFCL v3 leaderboard with a score of 76.7. Rankings change as new models are released and evaluated.

How many models have been evaluated?
Currently 23 models have been evaluated on BFCL v3, with an average score of 57.1 and a standard deviation of 18.6.

How often is the data updated?
Benchmark scores are updated when new evaluations are published by our data sources (Artificial Analysis and LayerLens). Pricing data is refreshed daily from OpenRouter.
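OpenRouter reports model pricing in dollars per token, while the table above uses dollars per million tokens; the conversion is a single multiplication. A minimal sketch, using a hypothetical payload shaped like an OpenRouter models response (the model ID and prices below are illustrative, not real data):

```python
def to_per_million(per_token: str) -> float:
    """Convert a dollars-per-token price string to dollars per million tokens."""
    return float(per_token) * 1_000_000

# Hypothetical excerpt shaped like an OpenRouter models listing.
sample = {
    "data": [
        {"id": "example/model-a",
         "pricing": {"prompt": "0.0000006", "completion": "0.0000022"}},
    ]
}

for model in sample["data"]:
    p = model["pricing"]
    # 0.0000006 $/token -> 0.6 $/M input; 0.0000022 $/token -> 2.2 $/M output
    print(model["id"],
          round(to_per_million(p["prompt"]), 3),
          round(to_per_million(p["completion"]), 3))
```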