Berkeley Function Calling Leaderboard v3 — testing function/tool calling accuracy.
Data from LayerLens
As of March 15, 2026, the top-scoring model on BFCL v3 is GLM 4.5 at 76.7%, followed by two Qwen3 32B entries tied at 75.7%. Twenty-three models have been evaluated on this benchmark.
Last updated: March 15, 2026
Models: 23
Best score: 76.7
Average: 57.1
Std dev: 18.6
| # | Model | Input ($/M tokens) | Output ($/M tokens) | BFCL v3 (%) |
|---|---|---|---|---|
| 1 | GLM 4.5 | $0.600 | $2.200 | 76.7 |
| 2 | Qwen3 32B | $0.080 | $0.240 | 75.7 |
| 3 | Qwen3 32B | $0.080 | $0.240 | 75.7 |
| 4 | | $1.200 | $6.000 | 74.9 |
| 5 | | $1.200 | $6.000 | 74.9 |
| 6 | | $0.060 | $0.400 | 74.6 |
| 7 | | $0.060 | $0.400 | 74.6 |
| 8 | | $2.500 | $10.000 | 74.2 |
| 9 | | $0.130 | $0.850 | 69.1 |
| 10 | | $0.800 | $3.200 | 67.9 |
| 11 | | $0.450 | $2.200 | 64.5 |
| 12 | | $0.450 | $2.200 | 64.5 |
| 13 | | $0.200 | $1.100 | 63.5 |
| 14 | | $0.080 | $0.300 | 55.7 |
| 15 | | $0.500 | $3.000 | 53.5 |
| 16 | | $0.400 | $1.760 | 47.8 |
| 17 | | $0.050 | $0.200 | 41.6 |
| 18 | | $0.050 | $0.200 | 41.6 |
| 19 | | $0.060 | $0.140 | 40.8 |
| 20 | | $15.000 | $75.000 | 25.3 |
| 21 | | $15.000 | $75.000 | 25.3 |
| 22 | | $0.550 | $2.200 | 25.3 |
| 23 | | $0.550 | $2.200 | 25.3 |
Pricing from OpenRouter. Benchmarks from Artificial Analysis.
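The summary figures at the top of the page can be checked against the BFCL v3 column of the table. The quick recomputation below assumes the reported standard deviation is the population one, which divides by n. This is inferred from the numbers; the page does not state it.

```python
import statistics

# BFCL v3 scores (%) copied from the leaderboard table, in table order (23 rows)
SCORES = [
    76.7, 75.7, 75.7, 74.9, 74.9, 74.6, 74.6, 74.2, 69.1, 67.9, 64.5, 64.5,
    63.5, 55.7, 53.5, 47.8, 41.6, 41.6, 40.8, 25.3, 25.3, 25.3, 25.3,
]


def summarize(scores):
    """Recompute the page's summary statistics, rounded to one decimal place."""
    return {
        "models": len(scores),
        "best": max(scores),
        "average": round(statistics.fmean(scores), 1),
        # pstdev divides by n (population SD); stdev divides by n - 1 (sample SD)
        "std_dev_population": round(statistics.pstdev(scores), 1),
        "std_dev_sample": round(statistics.stdev(scores), 1),
    }


if __name__ == "__main__":
    print(summarize(SCORES))
```

With these 23 scores, the population formula reproduces the page's 18.6. The sample formula would give about 19.0.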
108 out of our 483 tracked models have had a price change in March.
This leaderboard shows all models with BFCL v3 benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
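One way to compare performance against cost is to fold the two prices into a single blended price per million tokens. You can then keep only the models that no cheaper model outscores, which is a Pareto frontier. The sketch below assumes a hypothetical 3:1 input-to-output token mix; the leaderboard does not specify one, so adjust the weights to your workload. The names `blended_price` and `pareto_frontier` are illustrative. Rows are identified by their rank in the table.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    rank: int          # position in the leaderboard table (1 = highest score)
    input_usd: float   # $ per million input tokens
    output_usd: float  # $ per million output tokens
    score: float       # BFCL v3 score, %


# (input $/M, output $/M, BFCL v3 %) copied from the table, in table order
TABLE = [
    (0.600, 2.200, 76.7), (0.080, 0.240, 75.7), (0.080, 0.240, 75.7),
    (1.200, 6.000, 74.9), (1.200, 6.000, 74.9), (0.060, 0.400, 74.6),
    (0.060, 0.400, 74.6), (2.500, 10.000, 74.2), (0.130, 0.850, 69.1),
    (0.800, 3.200, 67.9), (0.450, 2.200, 64.5), (0.450, 2.200, 64.5),
    (0.200, 1.100, 63.5), (0.080, 0.300, 55.7), (0.500, 3.000, 53.5),
    (0.400, 1.760, 47.8), (0.050, 0.200, 41.6), (0.050, 0.200, 41.6),
    (0.060, 0.140, 40.8), (15.000, 75.000, 25.3), (15.000, 75.000, 25.3),
    (0.550, 2.200, 25.3), (0.550, 2.200, 25.3),
]
ROWS = [Row(i + 1, *vals) for i, vals in enumerate(TABLE)]

# Hypothetical workload: 3 input tokens for every output token.
INPUT_WEIGHT, OUTPUT_WEIGHT = 3, 1


def blended_price(row: Row) -> float:
    """Weighted average $ per million tokens under the assumed token mix."""
    total = INPUT_WEIGHT * row.input_usd + OUTPUT_WEIGHT * row.output_usd
    return total / (INPUT_WEIGHT + OUTPUT_WEIGHT)


def pareto_frontier(rows: List[Row]) -> List[Row]:
    """Walk rows from cheapest to priciest, keeping a row only if it scores
    strictly higher than every row already seen; exact ties keep the first."""
    frontier: List[Row] = []
    best = float("-inf")
    for row in sorted(rows, key=lambda r: (blended_price(r), -r.score)):
        if row.score > best:
            frontier.append(row)
            best = row.score
    return frontier


if __name__ == "__main__":
    for row in pareto_frontier(ROWS):
        print(f"#{row.rank}: {row.score:.1f}% at ${blended_price(row):.4f}/M blended")
```

Under the 3:1 assumption, the frontier is ranks 19, 17, 2 and 1. Only the top entry (GLM 4.5) improves on the Qwen3 32B score, and it costs roughly eight times the blended price. A different token mix changes the weights and can change which rows survive.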
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.