BIG-Bench Hard is a challenging subset of BIG-Bench that focuses on tasks where language models previously underperformed, testing advanced reasoning capabilities.
This leaderboard shows all models with BIG-Bench Hard benchmark scores, ranked from highest to lowest. Pricing data is included to help you compare performance against cost.
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.