Price Per TokenPrice Per Token

LLM Benchmark News

Stay updated on LLM benchmarks and evaluations. MMLU, GPQA, coding benchmarks, and model comparisons. Daily updates.

News Feed

Dec 29
Dec 24
Dec 23
5

Cooking with Claude

I've been having an absurd amount of fun recently using LLMs for cooking. I started out using them for basic recipes, but as I've grown more confident in their culinary abilities I've leaned into them for more advanced tasks. Today I tried something new: having Claude vibe-code up a custom application to help with the timing for a complicated meal preparation. It worked really well! A custom timing app for two recipes at once We have family staying at the moment, which means cooking for four. We subscribe to a meal delivery service called Green Chef, mainly because it takes the thinking out of cooking three times a week: grab a bag from the fridge, follow the instructions, eat. Each bag serves two portions, so cooking for four means preparing two bags at once. I have done this a few times now and it is always a mad flurry of pans and ingredients and timers and desperately trying to figure out what should happen when and how to get both recipes finished at the same time. It's fun but it's also chaotic and error-prone. This time I decided to try something different, and potentially even more chaotic and error-prone: I outsourced the planning entirely to Claude. I took this single photo of the two recipe cards side-by-side and fed it to Claude Opus 4.5 (in the Claude iPhone app) with this prompt: Extract both of these recipes in as much detail as possible This is a moderately challenging vision task in that there quite a lot of small text in the photo. I wasn't confident Opus could handle it. I hadn't read the recipe cards myself. The responsible thing to do here would be a thorough review or at least a spot-check - I chose to keep things chaotic and didn't do any more than quickly eyeball the result. I asked what pots I'd need: Give me a full list of pots I would need if I was cooking both of them at once Then I prompted it to build a custom application to help me with the cooking process itself: I am going to cook them both at the same time. Build me a no react, mobile, friendly, interactive, artifact that spells out the process with exact timing on when everything needs to happen have a start setting at the top, which starts a timer and persists when I hit start in localStorage in case the page reloads. The next steps should show prominently with countdowns to when they open. The full combined timeline should be shown slow with calculated times tor when each thing should happen I copied the result out onto my own hosting (you can try it here) because I wasn't sure if localStorage would work inside the Claude app and I really didn't want it to forget my times! Then I clicked "start cooking"! Here's the full Claude transcript. There was just one notable catch: our dog, Cleo, knows exactly when her dinner time is, at 6pm sharp. I forgot to mention this to Claude, which had scheduled several key steps colliding with Cleo's meal. I got woofed at. I deserved it. To my great surprise, it worked. I followed the recipe guide to the minute and served up both meals exactly 44 minutes after I started cooking. The best way to learn the capabilities of LLMs is to throw tasks at them that may be beyond their abilities and see what happens. In this case I fully expected that something would get forgotten or a detail would be hallucinated and I'd end up scrambling to fix things half way through the process. I was surprised and impressed that it worked so well. Some credit for the app idea should go to my fellow hackers at /dev/fort 2 in 2009, when we rented Knockbrex Castle in Dumfries, Scotland for a week and attempted to build a cooking timer application for complex meals. Generating recipes from scratch Most of my other cooking experiments with LLMs have been a whole lot simpler than this: I ask for a recipe, ask for some variations and then cook one of them and see what happens. This works remarkably well considering LLMs have no taste buds. I've started to think of this as asking LLMs for the average recipe for a dish, based on all of the recipes they have hoovered up during their training. It turns out the mean version of every guacamole recipe on the internet is a decent guacamole! Here's an example of a recipe I tried recently that worked out really well. I was helping Natalie run her ceramic stall at the farmers market and the stall next to us sold excellent dried beans. I've never used dried beans before, so I took a photo of their selection and asked Claude what I could do with them: Identify these beans It took a guess at the beans, then I said: Get me excited about cooking with these! If I bought two varietiew what could I make "Get me excited" switches Claude into a sort of hype-man mode, which is kind of entertaining: Oh, you're about to enter the wonderful world of bean cooking! Let me get you pumped about some killer two-bean combos: [...] Mixed bean salad with lemon, olive oil, fresh herbs, cherry tomatoes - light but satisfying [...] I replied: OK Bean salad has me interested - these are dried beans. Give me some salad options I can make that would last a long time in the fridge ... and after some back and forth we arrived on the recipe in this transcript, which I cooked the following day (asking plenty of follow-up questions) and thoroughly enjoyed. I've done this a bunch of times with a bunch of different recipes across both Claude and ChatGPT and honestly I've not had a notable miss yet. Being able to say "make it vegan" or "I don't have coriander, what can I use instead?" or just "make it tastier" is a really fun way to explore cooking. It's also fun to repeat "make it tastier" multiple times to see how absurd you can get. I really want someone to turn this into a benchmark! Cooking with LLMs is a lot of fun. There's an opportunity here for a really neat benchmark: take a bunch of leading models, prompt them for recipes, follow those recipes and taste-test the results! The logistics of running this are definitely too much for me to handle myself. I have enough trouble cooking two meals at once, for a solid benchmark you'd ideally have several models serving meals up at the same time to a panel of tasters. If someone else wants to try this please let me know how it goes! Tags: cooking, devfort, tools, ai, generative-ai, llms, anthropic, claude, vision-llms, vibe-coding

Dec 19
Dec 18
Dec 17
8

Gemini 3 Flash

It continues to be a busy December, if not quite as busy as last year. Today's big news is Gemini 3 Flash, the latest in Google's "Flash" line of faster and less expensive models. Google are emphasizing the comparison between the new Flash and their previous generation's top model Gemini 2.5 Pro: Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds. Gemini 3 Flash's characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF, outputs only text, handles 1,048,576 maximum input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series). The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro >200k, and it's nice not to have a price increase for the new Flash at larger token lengths. It's a little more expensive than previous Flash models - Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million on output, Gemini 3 Flash is $0.50/million and $3/million respectively. Google claim it may still end up cheaper though, due to more efficient output token usage: > Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro. Here's a more extensive price comparison on my llm-prices.com site. Generating some SVGs of pelicans I released llm-gemini 0.28 this morning with support for the new model. You can try it out like this: llm install -U llm-gemini llm keys set gemini # paste in key llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle" According to the developer docs the new model supports four different thinking level options: minimal, low, medium, and high. This is different from Gemini 3 Pro, which only supported low and high. You can run those like this: llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle" Here are four pelicans, for thinking levels minimal, low, medium, and high: I built the gallery component with Gemini 3 Flash The gallery above uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this: <image-gallery width="4"> <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg" alt="A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background." /> <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg" alt="Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background." /> <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg" alt="A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background." /> <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg" alt="Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background." /> </image-gallery> Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe: llm -m gemini-3-flash-preview --system ' You write alt text for any image pasted in by the user. Alt text is always presented in a fenced code block to make it easy to copy and paste out. It is always presented on a single line so it can be used easily in Markdown images. All text on the image (for screenshots etc) must be exactly included. A short note describing the nature of the image itself should go first.' \ -a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg You can see the code that powers the image gallery Web Component here on GitHub. I built it by prompting Gemini 3 Flash via LLM like this: llm -m gemini-3-flash-preview ' Build a Web Component that implements a simple image gallery. Usage is like this: <image-gallery width="5"> <img src="image1.jpg" alt="Image 1"> <img src="image2.jpg" alt="Image 2" data-thumb="image2-thumb.jpg"> <img src="image3.jpg" alt="Image 3"> </image-gallery> If an image has a data-thumb= attribute that one is used instead, other images are scaled down. The image gallery always takes up 100% of available width. The width="5" attribute means that five images will be shown next to each other in each row. The default is 3. There are gaps between the images. When an image is clicked it opens a modal dialog with the full size image. Return a complete HTML file with both the implementation of the Web Component several example uses of it. Use https://picsum.photos/300/200 URLs for those example images.' It took a few follow-up prompts using llm -c: llm -c 'Use a real modal such that keyboard shortcuts and accessibility features work without extra JS' llm -c 'Use X for the close icon and make it a bit more subtle' llm -c 'remove the hover effect entirely' llm -c 'I want no border on the close icon even when it is focused' Here's the full transcript, exported using llm logs -cue. Those five prompts took: 225 input, 3,269 output 2,243 input, 2,908 output 4,319 input, 2,516 output 6,376 input, 2,094 output 8,151 input, 1,806 output Added together that's 21,314 input and 12,593 output for a grand total of 4.8436 cents. The guide to migrating from Gemini 2.5 reveals one disappointment: Image segmentation: Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or Gemini Robotics-ER 1.5. I wrote about this capability in Gemini 2.5 back in April. I hope they come back in future models - they're a really neat capability that is unique to Gemini. Tags: google, ai, web-components, generative-ai, llms, llm, gemini, llm-pricing, pelican-riding-a-bicycle, llm-release

Dec 16
Dec 15
Dec 14
12

JustHTML is a fascinating example of vibe engineering in action

I recently came across JustHTML, a new Python library for parsing HTML released by Emil Stenström. It's a very interesting piece of software, both as a useful library and as a case study in sophisticated AI-assisted programming. First impressions of JustHTML I didn't initially know that JustHTML had been written with AI assistance at all. The README caught my eye due to some attractive characteristics: It's pure Python. I like libraries that are pure Python (no C extensions or similar) because it makes them easy to use in less conventional Python environments, including Pyodide. "Passes all 9,200+ tests in the official html5lib-tests suite (used by browser vendors)" - this instantly caught my attention! HTML5 is a big, complicated but meticulously written specification. 100% test coverage. That's not something you see every day. CSS selector queries as a feature. I built a Python library for this many years ago and I'm always interested in seeing new implementations of that pattern. html5lib has been inconsistently maintained over the last few years, leaving me interested in potential alternatives. It's only 3,000 lines of implementation code (and another ~11,000 of tests.) I was out and about without a laptop so I decided to put JustHTML through its paces on my phone. I prompted Claude Code for web on my phone and had it build this Pyodide-powered HTML tool for trying it out: This was enough for me to convince myself that the core functionality worked as advertised. It's a neat piece of code! Turns out it was almost all built by LLMs At this point I went looking for some more background information on the library and found Emil's blog entry about it: How I wrote JustHTML using coding agents: Writing a full HTML5 parser is not a short one-shot problem. I have been working on this project for a couple of months on off-hours. Tooling: I used plain VS Code with Github Copilot in Agent mode. I enabled automatic approval of all commands, and then added a blacklist of commands that I always wanted to approve manually. I wrote an agent instruction that told it to keep working, and don't stop to ask questions. Worked well! Emil used several different models - an advantage of working in VS Code Agent mode rather than a provider-locked coding agent like Claude Code or Codex CLI. Claude Sonnet 3.7, Gemini 3 Pro and Claude Opus all get a mention. Vibe engineering, not vibe coding What's most interesting about Emil's 17 step account covering those several months of work is how much software engineering was involved, independent of typing out the actual code. I wrote about vibe engineering a while ago as an alternative to vibe coding. Vibe coding is when you have an LLM knock out code without any semblance of code review - great for prototypes and toy projects, definitely not an approach to use for serious libraries or production code. I proposed "vibe engineering" as the grown up version of vibe coding, where expert programmers use coding agents in a professional and responsible way to produce high quality, reliable results. You should absolutely read Emil's account in full. A few highlights: He hooked in the 9,200 test html5lib-tests conformance suite almost from the start. There's no better way to construct a new HTML5 parser than using the test suite that the browsers themselves use. He picked the core API design himself - a TagHandler base class with handle_start() etc. methods - and told the model to implement that. He added a comparative benchmark to track performance compared to existing libraries like html5lib, then experimented with a Rust optimization based on those initial numbers. He threw the original code away and started from scratch as a rough port of Servo's excellent html5ever Rust library. He built a custom profiler and new benchmark and let Gemini 3 Pro loose on it, finally achieving micro-optimizations to beat the existing Pure Python libraries. He used coverage to identify and remove unnecessary code. He had his agent build a custom fuzzer to generate vast numbers of invalid HTML documents and harden the parser against them. This represents a lot of sophisticated development practices, tapping into Emil's deep experience as a software engineer. As described, this feels to me more like a lead architect role than a hands-on coder. It perfectly fits what I was thinking about when I described vibe engineering. Setting the coding agent up with the html5lib-tests suite is also a great example of designing an agentic loop. "The agent did the typing" Emil concluded his article like this: JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn't have written it this quickly without the agent. But "quickly" doesn't mean "without thinking." I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking. That's probably the right division of labor. I couldn't agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what's left to be a much more valuable use of my time. Tags: html, python, ai, generative-ai, llms, ai-assisted-programming, vibe-coding, coding-agents

Dec 11
Dec 9
Nov 19
Nov 17
Nov 10
Nov 3
Oct 29
Oct 27

Built by @aellman

2026 68 Ventures, LLC. All rights reserved.