Compare AI model performance across industry-standard benchmarks with pricing data.
Performance evaluations across domains
Graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry, designed to be "Google-proof": hard to answer correctly even with unrestricted web search.
Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains.
American Invitational Mathematics Examination problems testing advanced competition-level mathematical reasoning, with integer answers from 000 to 999 (see the exact-match scoring sketch after these descriptions).
Real-world coding benchmark with problems drawn from competitive programming contests, testing code generation and problem-solving ability (see the pass@k scoring sketch after these descriptions).
Aider code editing benchmark measuring an LLM's ability to modify existing code based on natural language instructions.
Challenging subset of BIG-Bench focusing on tasks where language models previously underperformed, testing advanced reasoning capabilities.
Competition mathematics problems requiring multi-step reasoning, covering algebra, geometry, number theory, and calculus.
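Because AIME answers are three-digit integers, scoring reduces to exact match after normalization. The sketch below is illustrative only; the aimeExactMatch name and the digit-extraction regex are assumptions, and the grader any given leaderboard actually uses may differ (for example, in how it extracts an answer from a longer response).

```typescript
// Minimal sketch of exact-match scoring for AIME-style answers (000-999):
// pull the first run of up to three digits and compare as integers, so
// "042" and "42" are treated as the same answer.
function aimeExactMatch(prediction: string, gold: string): boolean {
  const normalize = (s: string): number | null => {
    const match = s.match(/\d{1,3}/); // first 1-3 digit run in the text
    return match ? parseInt(match[0], 10) : null;
  };
  const p = normalize(prediction);
  const g = normalize(gold);
  return p !== null && g !== null && p === g;
}

// Example: zero-padding does not affect the comparison.
console.log(aimeExactMatch("The answer is 042", "42")); // true
```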
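Code-generation benchmarks like the coding benchmark above are commonly reported as pass@k: the probability that at least one of k sampled solutions passes the tests. This page does not state which metric it displays, so the snippet below is only a sketch of the standard unbiased pass@k estimator from the HumanEval paper; the passAtK name is chosen here for illustration.

```typescript
// Unbiased pass@k estimator: given n generated samples for a problem,
// of which c passed the tests, estimate the probability that at least
// one of k randomly chosen samples would pass.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // every size-k subset contains a passing sample
  // 1 - C(n-c, k) / C(n, k), computed as 1 - prod_{i=n-c+1}^{n} (1 - k/i)
  let estimate = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    estimate *= 1 - k / i;
  }
  return 1 - estimate;
}

// Example: 10 samples per problem, 3 passed; estimated pass@1 is 0.3.
console.log(passAtK(10, 3, 1)); // ≈ 0.3
```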
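The comparison pairs benchmark scores with pricing, but the page does not spell out how the two are combined. One illustrative way to rank models is dollars per benchmark point; the ModelRow shape, the 3:1 input/output token blend, and the simple averaging across benchmarks below are all assumptions made for this sketch, not this site's methodology.

```typescript
// Hypothetical shape of one comparison row: a model's benchmark scores
// (as percentages, 0-100) alongside its per-token pricing in USD.
interface ModelRow {
  name: string;
  scores: Record<string, number>; // e.g. { MMLU: 86.4 }
  inputPricePerMTok: number;      // USD per million input tokens
  outputPricePerMTok: number;     // USD per million output tokens
}

// Blended price per million tokens divided by the average benchmark score:
// "dollars per point", where lower means more performance per dollar.
// The 3:1 input/output blend is an illustrative choice, not a site setting.
function dollarsPerPoint(row: ModelRow): number {
  const scores = Object.values(row.scores);
  const avgScore = scores.reduce((a, b) => a + b, 0) / scores.length;
  const blendedPrice = (3 * row.inputPricePerMTok + row.outputPricePerMTok) / 4;
  return blendedPrice / avgScore;
}

// Usage with made-up numbers, purely to show the calculation:
const example: ModelRow = {
  name: "hypothetical-model",
  scores: { MMLU: 86.4, "AIME 2024": 55.0 },
  inputPricePerMTok: 3.0,
  outputPricePerMTok: 15.0,
};
console.log(dollarsPerPoint(example).toFixed(3)); // $6/MTok blended ÷ 70.7 pts ≈ 0.085
```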
Built by @aellman
© 2026 68 Ventures, LLC. All rights reserved.