Stay updated on LLM benchmarks and evaluations. MMLU, GPQA, coding benchmarks, and model comparisons. Daily updates.
These days, large language models handle increasingly complex tasks, writing intricate code and engaging in sophisticated reasoning. Yet when it comes to four-digit multiplication, a skill taught in elementary school, even state-of-the-art systems fail. Why?
In this post, we demonstrate how to optimize large language model (LLM) inference on Amazon SageMaker AI using BentoML's LLM-Optimizer to systematically identify the best serving configurations for your workload.
In this post, we demonstrate how to use Foundation Models (FMs) from Amazon Bedrock and the newly launched Amazon Bedrock AgentCore alongside W&B Weave to help build, evaluate, and monitor enterprise AI solutions. We cover the complete development lifecycle from tracking individual FM calls to monitoring complex agent workflows in production.
In this post, we share how dLocal worked closely with the AWS team to help shape the product roadmap, reinforce its role as an industry innovator, and set new benchmarks for operational excellence in the global fintech landscape.
I've been having an absurd amount of fun recently using LLMs for cooking. I started out using them for basic recipes, but as I've grown more confident in their culinary abilities I've leaned into them for more advanced tasks. Today I tried something new: having Claude vibe-code up a custom application to help with the timing for a complicated meal preparation. It worked really well!

A custom timing app for two recipes at once

We have family staying at the moment, which means cooking for four. We subscribe to a meal delivery service called Green Chef, mainly because it takes the thinking out of cooking three times a week: grab a bag from the fridge, follow the instructions, eat.

Each bag serves two portions, so cooking for four means preparing two bags at once. I have done this a few times now and it is always a mad flurry of pans and ingredients and timers and desperately trying to figure out what should happen when and how to get both recipes finished at the same time. It's fun but it's also chaotic and error-prone.

This time I decided to try something different, and potentially even more chaotic and error-prone: I outsourced the planning entirely to Claude.

I took this single photo of the two recipe cards side-by-side and fed it to Claude Opus 4.5 (in the Claude iPhone app) with this prompt:

> Extract both of these recipes in as much detail as possible

This is a moderately challenging vision task in that there is quite a lot of small text in the photo. I wasn't confident Opus could handle it. I hadn't read the recipe cards myself. The responsible thing to do here would be a thorough review or at least a spot-check - I chose to keep things chaotic and didn't do any more than quickly eyeball the result.

I asked what pots I'd need:

> Give me a full list of pots I would need if I was cooking both of them at once

Then I prompted it to build a custom application to help me with the cooking process itself:

> I am going to cook them both at the same time. Build me a no react, mobile, friendly, interactive, artifact that spells out the process with exact timing on when everything needs to happen have a start setting at the top, which starts a timer and persists when I hit start in localStorage in case the page reloads. The next steps should show prominently with countdowns to when they open. The full combined timeline should be shown slow with calculated times tor when each thing should happen

I copied the result out onto my own hosting (you can try it here) because I wasn't sure if localStorage would work inside the Claude app and I really didn't want it to forget my times! Then I clicked "start cooking"!

Here's the full Claude transcript.

There was just one notable catch: our dog, Cleo, knows exactly when her dinner time is, at 6pm sharp. I forgot to mention this to Claude, which had scheduled several key steps to collide with Cleo's meal. I got woofed at. I deserved it.

To my great surprise, it worked. I followed the recipe guide to the minute and served up both meals exactly 44 minutes after I started cooking.

The best way to learn the capabilities of LLMs is to throw tasks at them that may be beyond their abilities and see what happens. In this case I fully expected that something would get forgotten or a detail would be hallucinated and I'd end up scrambling to fix things halfway through the process. I was surprised and impressed that it worked so well.
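The core trick that prompt asks for - persisting the start time in localStorage so a page reload doesn't lose the clock - is simple enough to sketch. Here is a rough, illustrative outline in plain JavaScript (the step list and labels are made up for the example; this is not the code Claude generated):

```javascript
// Hypothetical combined timeline: minutes after "start cooking" for each step.
const STEPS = [
  { at: 0,  label: "Preheat oven, boil kettle" },
  { at: 5,  label: "Recipe A: roast veg in" },
  { at: 12, label: "Recipe B: start rice" },
  { at: 38, label: "Plate up both meals" },
];

const KEY = "cooking-start-time";

// Persist the start timestamp so a reload picks up where we left off.
function start() {
  if (!localStorage.getItem(KEY)) {
    localStorage.setItem(KEY, Date.now().toString());
  }
  setInterval(render, 1000);
}

// Compute a countdown for every step relative to the persisted start time.
function render() {
  const started = Number(localStorage.getItem(KEY));
  const elapsedMin = (Date.now() - started) / 60000;
  for (const step of STEPS) {
    const remaining = step.at - elapsedMin;
    console.log(
      remaining > 0
        ? `${step.label}: in ${Math.ceil(remaining)} min`
        : `${step.label}: now (or done)`
    );
  }
}

start();
```

The real artifact presumably renders those countdowns into the page rather than logging them, but the localStorage persistence is the part the prompt calls out explicitly.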
Some credit for the app idea should go to my fellow hackers at /dev/fort 2 in 2009, when we rented Knockbrex Castle in Dumfries, Scotland for a week and attempted to build a cooking timer application for complex meals.

Generating recipes from scratch

Most of my other cooking experiments with LLMs have been a whole lot simpler than this: I ask for a recipe, ask for some variations and then cook one of them and see what happens. This works remarkably well considering LLMs have no taste buds.

I've started to think of this as asking LLMs for the average recipe for a dish, based on all of the recipes they have hoovered up during their training. It turns out the mean version of every guacamole recipe on the internet is a decent guacamole!

Here's an example of a recipe I tried recently that worked out really well. I was helping Natalie run her ceramic stall at the farmers market and the stall next to us sold excellent dried beans. I've never used dried beans before, so I took a photo of their selection and asked Claude what I could do with them:

> Identify these beans

It took a guess at the beans, then I said:

> Get me excited about cooking with these! If I bought two varietiew what could I make

"Get me excited" switches Claude into a sort of hype-man mode, which is kind of entertaining:

> Oh, you're about to enter the wonderful world of bean cooking! Let me get you pumped about some killer two-bean combos: [...] Mixed bean salad with lemon, olive oil, fresh herbs, cherry tomatoes - light but satisfying [...]

I replied:

> OK Bean salad has me interested - these are dried beans. Give me some salad options I can make that would last a long time in the fridge

... and after some back and forth we arrived at the recipe in this transcript, which I cooked the following day (asking plenty of follow-up questions) and thoroughly enjoyed.

I've done this a bunch of times with a bunch of different recipes across both Claude and ChatGPT and honestly I've not had a notable miss yet. Being able to say "make it vegan" or "I don't have coriander, what can I use instead?" or just "make it tastier" is a really fun way to explore cooking. It's also fun to repeat "make it tastier" multiple times to see how absurd you can get.

I really want someone to turn this into a benchmark!

Cooking with LLMs is a lot of fun. There's an opportunity here for a really neat benchmark: take a bunch of leading models, prompt them for recipes, follow those recipes and taste-test the results!

The logistics of running this are definitely too much for me to handle myself. I have enough trouble cooking two meals at once; for a solid benchmark you'd ideally have several models serving meals up at the same time to a panel of tasters. If someone else wants to try this please let me know how it goes!

Tags: cooking, devfort, tools, ai, generative-ai, llms, anthropic, claude, vision-llms, vibe-coding
Introducing GPT-5.2-Codex

The latest in OpenAI's Codex family of models (not the same thing as their Codex CLI or Codex Cloud coding agent tools).

> GPT‑5.2-Codex is a version of GPT‑5.2 further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.

As with some previous Codex models, this one is available via their Codex coding agents now and will be coming to the API "in the coming weeks". Unlike previous models, there's a new invite-only preview process that gives vetted cybersecurity professionals access to "more permissive models".

I've been very impressed recently with GPT-5.2's ability to tackle multi-hour agentic coding challenges. GPT-5.2-Codex scores 64% on the Terminal-Bench 2.0 benchmark, against 62.2% for GPT-5.2. I'm not sure how concrete that 1.8% improvement will be!

I didn't hack API access together this time (see previous attempts), instead opting to just ask Codex CLI to "Generate an SVG of a pelican riding a bicycle" while running the new model (effort medium). Here's the transcript in my new Codex CLI timeline viewer, and here's the pelican it drew:

Tags: ai, openai, generative-ai, llms, pelican-riding-a-bicycle, llm-release, codex-cli, gpt-codex
swift-justhtml

First there was Emil Stenström's JustHTML in Python, then my justjshtml in JavaScript, then Anil Madhavapeddy's html5rw in OCaml, and now Kyle Howells has built a vibespiled dependency-free HTML5 parser for Swift using the same coding agent tricks against the html5lib-tests test suite.

Kyle ran some benchmarks to compare the different implementations:

Rust (html5ever) total parse time: 303 ms
Swift total parse time: 1313 ms
JavaScript total parse time: 1035 ms
Python total parse time: 4189 ms

Tags: html5, ai, generative-ai, llms, ai-assisted-programming, vibe-coding, swift
It continues to be a busy December, if not quite as busy as last year. Today's big news is Gemini 3 Flash, the latest in Google's "Flash" line of faster and less expensive models.

Google are emphasizing the comparison between the new Flash and their previous generation's top model Gemini 2.5 Pro:

> Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds.

Gemini 3 Flash's characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF inputs, outputs only text, handles up to 1,048,576 input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series).

The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro for prompts ≤200k tokens and 1/8 the price of Gemini 3 Pro for prompts >200k, and it's nice not to have a price increase for the new Flash at larger token lengths.

It's a little more expensive than previous Flash models - Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million on output; Gemini 3 Flash is $0.50/million and $3/million respectively. Google claim it may still end up cheaper though, due to more efficient output token usage:

> Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.

Here's a more extensive price comparison on my llm-prices.com site.

Generating some SVGs of pelicans

I released llm-gemini 0.28 this morning with support for the new model. You can try it out like this:

llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle"

According to the developer docs the new model supports four different thinking level options: minimal, low, medium, and high. This is different from Gemini 3 Pro, which only supported low and high. You can run those like this:

llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle"

Here are four pelicans, for thinking levels minimal, low, medium, and high:

I built the gallery component with Gemini 3 Flash

The gallery above uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this:

<image-gallery width="4">
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg"
       alt="A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background." />
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg"
       alt="Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background." />
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg"
       alt="A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background."
  />
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg"
       alt="Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background." />
</image-gallery>

Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe:

llm -m gemini-3-flash-preview --system '
You write alt text for any image pasted in by the user.
Alt text is always presented in a fenced code block to make it easy to copy and paste out.
It is always presented on a single line so it can be used easily in Markdown images.
All text on the image (for screenshots etc) must be exactly included.
A short note describing the nature of the image itself should go first.' \
  -a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg

You can see the code that powers the image gallery Web Component here on GitHub. I built it by prompting Gemini 3 Flash via LLM like this:

llm -m gemini-3-flash-preview '
Build a Web Component that implements a simple image gallery. Usage is like this:

<image-gallery width="5">
  <img src="image1.jpg" alt="Image 1">
  <img src="image2.jpg" alt="Image 2" data-thumb="image2-thumb.jpg">
  <img src="image3.jpg" alt="Image 3">
</image-gallery>

If an image has a data-thumb= attribute that one is used instead, other images are scaled down.
The image gallery always takes up 100% of available width. The width="5" attribute means that
five images will be shown next to each other in each row. The default is 3. There are gaps
between the images. When an image is clicked it opens a modal dialog with the full size image.

Return a complete HTML file with both the implementation of the Web Component several example
uses of it. Use https://picsum.photos/300/200 URLs for those example images.'

It took a few follow-up prompts using llm -c:

llm -c 'Use a real modal such that keyboard shortcuts and accessibility features work without extra JS'
llm -c 'Use X for the close icon and make it a bit more subtle'
llm -c 'remove the hover effect entirely'
llm -c 'I want no border on the close icon even when it is focused'

Here's the full transcript, exported using llm logs -cue.

Those five prompts took:

225 input, 3,269 output
2,243 input, 2,908 output
4,319 input, 2,516 output
6,376 input, 2,094 output
8,151 input, 1,806 output

Added together that's 21,314 input and 12,593 output tokens for a grand total of 4.8436 cents (the arithmetic is sketched out below).

The guide to migrating from Gemini 2.5 reveals one disappointment:

> Image segmentation: Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or Gemini Robotics-ER 1.5.

I wrote about this capability in Gemini 2.5 back in April. I hope it comes back in future models - it's a really neat capability that is unique to Gemini.

Tags: google, ai, web-components, generative-ai, llms, llm, gemini, llm-pricing, pelican-riding-a-bicycle, llm-release
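To double-check that cost figure against the Gemini 3 Flash prices quoted earlier ($0.50/million input tokens, $3/million output tokens for prompts in the ≤200k tier), here's the arithmetic as a quick JavaScript sketch:

```javascript
// Token counts for the five follow-up prompts listed above.
const prompts = [
  { input: 225,  output: 3269 },
  { input: 2243, output: 2908 },
  { input: 4319, output: 2516 },
  { input: 6376, output: 2094 },
  { input: 8151, output: 1806 },
];

// Gemini 3 Flash pricing (≤200k tokens): $0.50 per million input, $3 per million output.
const INPUT_PER_M = 0.50;
const OUTPUT_PER_M = 3.00;

const totalInput = prompts.reduce((sum, p) => sum + p.input, 0);   // 21,314
const totalOutput = prompts.reduce((sum, p) => sum + p.output, 0); // 12,593

const dollars =
  (totalInput / 1_000_000) * INPUT_PER_M +
  (totalOutput / 1_000_000) * OUTPUT_PER_M;

console.log(`${totalInput} input + ${totalOutput} output = ${(dollars * 100).toFixed(4)} cents`);
// -> 21314 input + 12593 output = 4.8436 cents
```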
OpenAI introduces FrontierScience, a benchmark testing AI reasoning in physics, chemistry, and biology to measure progress toward real scientific research.
It’s a weird time to be an AI doomer. This small but influential community of researchers, scientists, and policy experts believes, in the simplest terms, that AI could get so good it could be bad—very, very bad—for humanity. Though many of these people would be more likely to describe themselves as advocates for AI safety…
When the generative AI boom took off in 2022, Rudi Miller and her law school classmates were suddenly gripped with anxiety. “Before graduating, there was discussion about what the job market would look like for us if AI became adopted,” she recalls.  So when it came time to choose a speciality, Miller—now a junior associate…
I recently came across JustHTML, a new Python library for parsing HTML released by Emil Stenström. It's a very interesting piece of software, both as a useful library and as a case study in sophisticated AI-assisted programming.

First impressions of JustHTML

I didn't initially know that JustHTML had been written with AI assistance at all. The README caught my eye due to some attractive characteristics:

- It's pure Python. I like libraries that are pure Python (no C extensions or similar) because it makes them easy to use in less conventional Python environments, including Pyodide.
- "Passes all 9,200+ tests in the official html5lib-tests suite (used by browser vendors)" - this instantly caught my attention! HTML5 is a big, complicated but meticulously written specification.
- 100% test coverage. That's not something you see every day.
- CSS selector queries as a feature. I built a Python library for this many years ago and I'm always interested in seeing new implementations of that pattern.
- html5lib has been inconsistently maintained over the last few years, leaving me interested in potential alternatives.
- It's only 3,000 lines of implementation code (and another ~11,000 of tests).

I was out and about without a laptop so I decided to put JustHTML through its paces on my phone. I prompted Claude Code for web on my phone and had it build this Pyodide-powered HTML tool for trying it out:

This was enough for me to convince myself that the core functionality worked as advertised. It's a neat piece of code!

Turns out it was almost all built by LLMs

At this point I went looking for some more background information on the library and found Emil's blog entry about it: How I wrote JustHTML using coding agents:

> Writing a full HTML5 parser is not a short one-shot problem. I have been working on this project for a couple of months on off-hours.
>
> Tooling: I used plain VS Code with Github Copilot in Agent mode. I enabled automatic approval of all commands, and then added a blacklist of commands that I always wanted to approve manually. I wrote an agent instruction that told it to keep working, and don't stop to ask questions. Worked well!

Emil used several different models - an advantage of working in VS Code Agent mode rather than a provider-locked coding agent like Claude Code or Codex CLI. Claude Sonnet 3.7, Gemini 3 Pro and Claude Opus all get a mention.

Vibe engineering, not vibe coding

What's most interesting about Emil's 17-step account covering those several months of work is how much software engineering was involved, independent of typing out the actual code.

I wrote about vibe engineering a while ago as an alternative to vibe coding. Vibe coding is when you have an LLM knock out code without any semblance of code review - great for prototypes and toy projects, definitely not an approach to use for serious libraries or production code. I proposed "vibe engineering" as the grown-up version of vibe coding, where expert programmers use coding agents in a professional and responsible way to produce high quality, reliable results.

You should absolutely read Emil's account in full. A few highlights:

- He hooked in the 9,200-test html5lib-tests conformance suite almost from the start. There's no better way to construct a new HTML5 parser than using the test suite that the browsers themselves use.
- He picked the core API design himself - a TagHandler base class with handle_start() etc. methods - and told the model to implement that.
- He added a comparative benchmark to track performance compared to existing libraries like html5lib, then experimented with a Rust optimization based on those initial numbers.
- He threw the original code away and started from scratch as a rough port of Servo's excellent html5ever Rust library.
- He built a custom profiler and new benchmark and let Gemini 3 Pro loose on it, finally achieving micro-optimizations to beat the existing pure Python libraries.
- He used coverage to identify and remove unnecessary code.
- He had his agent build a custom fuzzer to generate vast numbers of invalid HTML documents and harden the parser against them (a tiny sketch of that idea follows at the end of this post).

This represents a lot of sophisticated development practices, tapping into Emil's deep experience as a software engineer. As described, this feels to me more like a lead architect role than a hands-on coding one. It perfectly fits what I was thinking about when I described vibe engineering. Setting the coding agent up with the html5lib-tests suite is also a great example of designing an agentic loop.

"The agent did the typing"

Emil concluded his article like this:

> JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn't have written it this quickly without the agent. But "quickly" doesn't mean "without thinking." I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking. That's probably the right division of labor.

I couldn't agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what's left to be a much more valuable use of my time.

Tags: html, python, ai, generative-ai, llms, ai-assisted-programming, vibe-coding, coding-agents
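That fuzzing step is worth a tiny illustration. Emil's fuzzer was presumably Python and tied to JustHTML's internals; the sketch below is just the general shape of the loop in JavaScript - generate deliberately broken HTML, feed it to the parser, and fail loudly if the parser ever throws. parseHtml is a hypothetical stand-in for whichever parser is being hardened:

```javascript
// Mutation fuzzer sketch: corrupt valid HTML snippets and check the parser never throws.
// parseHtml is a hypothetical stand-in for the parser under test.
function fuzzParser(parseHtml, iterations = 10_000) {
  const seeds = [
    "<p>Hello <b>world</b></p>",
    "<table><tr><td>cell</td></tr></table>",
    "<!DOCTYPE html><html><body><div id=x>text</div></body></html>",
  ];
  const mutations = [
    (s) => s.slice(0, Math.floor(Math.random() * s.length)),          // truncate mid-tag
    (s) => s.replace(/</g, () => (Math.random() < 0.2 ? "<<" : "<")), // double some brackets
    (s) => s.split("").reverse().join(""),                            // reverse everything
    (s) => s + "<" + s,                                               // dangling open bracket + repeat
  ];

  for (let i = 0; i < iterations; i++) {
    let doc = seeds[i % seeds.length];
    // Apply a random number of random mutations to the seed document.
    const rounds = 1 + Math.floor(Math.random() * 3);
    for (let r = 0; r < rounds; r++) {
      doc = mutations[Math.floor(Math.random() * mutations.length)](doc);
    }
    try {
      parseHtml(doc); // HTML5 parsers must accept any input without raising.
    } catch (err) {
      console.error("Parser crashed on input:", JSON.stringify(doc));
      throw err;
    }
  }
  console.log(`Survived ${iterations} malformed documents.`);
}
```

In a browser you could point the harness at the built-in parser, e.g. fuzzParser(html => new DOMParser().parseFromString(html, "text/html")), to sanity-check the loop itself; a conforming HTML5 parser should never raise on any input.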
GPT-5.2 is OpenAI’s strongest model yet for math and science, setting new state-of-the-art results on benchmarks like GPQA Diamond and FrontierMath. This post shows how those gains translate into real research progress, including solving an open theoretical problem and generating reliable mathematical proofs.
Systematically evaluating the factuality of large language models with the FACTS Benchmark Suite.
Learn how evals help businesses define, measure, and improve AI performance—reducing risk, boosting productivity, and driving strategic advantage.
At what point will AI change your daily life?
The future is biomechanical computation
OpenAI introduces IndQA, a new benchmark for evaluating AI systems in Indian languages. Built with domain experts, IndQA tests cultural understanding and reasoning across 12 languages and 10 knowledge areas.
gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are two open-weight reasoning models post-trained from the gpt-oss models to reason from a provided policy in order to label content under that policy. In this report, we describe gpt-oss-safeguard’s capabilities and provide safety evaluations of the gpt-oss-safeguard models, using the underlying gpt-oss models as a baseline. For more information about the development and architecture of the underlying gpt-oss models, see the original gpt-oss model card.
This system card details GPT-5’s improvements in handling sensitive conversations, including new benchmarks for emotional reliance, mental health, and jailbreak resistance.