Price Per Token

LLM Speed & Performance News

AI speed and performance news. Inference optimization, latency improvements, throughput benchmarks, and model efficiency.

News Feed

Dec 26

How uv got so fast

Andrew Nesbitt provides an insightful teardown of why uv is so much faster than pip. It's not nearly as simple as just "they rewrote it in Rust" - uv gets to skip a huge amount of Python packaging history (which pip needs to implement for backwards compatibility) and benefits enormously from work over recent years that makes it possible to resolve dependencies across most packages without having to execute the code in setup.py using a Python interpreter.

Two notes that caught my eye that I hadn't understood before:

> HTTP range requests for metadata. Wheel files are zip archives, and zip archives put their file listing at the end. uv tries PEP 658 metadata first, falls back to HTTP range requests for the zip central directory, then full wheel download, then building from source. Each step is slower and riskier. The design makes the fast path cover 99% of cases. None of this requires Rust. [...]
>
> Compact version representation. uv packs versions into u64 integers where possible, making comparison and hashing fast. Over 90% of versions fit in one u64. This is micro-optimization that compounds across millions of comparisons.

I wanted to learn more about these tricks, so I fired up an asynchronous research task and told it to checkout the astral-sh/uv repo, find the Rust code for both of those features and try porting it to Python to help me understand how it works. Here's the report that it wrote for me, the prompts I used and the Claude Code transcript.

You can try the script it wrote for extracting metadata from a wheel using HTTP range requests like this:

```bash
uv run --with httpx \
  https://raw.githubusercontent.com/simonw/research/refs/heads/main/http-range-wheel-metadata/wheel_metadata.py \
  https://files.pythonhosted.org/packages/8b/04/ef95b67e1ff59c080b2effd1a9a96984d6953f667c91dfe9d77c838fc956/playwright-1.57.0-py3-none-macosx_11_0_arm64.whl -v
```

The Playwright wheel there is ~40MB. Adding -v at the end causes the script to spit out verbose details of how it fetched the data - which looks like this. Key extract from that output:

```
[1] HEAD request to get file size...
    File size: 40,775,575 bytes
[2] Fetching last 16,384 bytes (EOCD + central directory)...
    Received 16,384 bytes
[3] Parsed EOCD:
    Central directory offset: 40,731,572
    Central directory size: 43,981
    Total entries: 453
[4] Fetching complete central directory...
...
[6] Found METADATA: playwright-1.57.0.dist-info/METADATA
    Offset: 40,706,744
    Compressed size: 1,286
    Compression method: 8
[7] Fetching METADATA content (2,376 bytes)...
[8] Decompressed METADATA: 3,453 bytes

Total bytes fetched: 18,760 / 40,775,575 (100.0% savings)
```

The section of the report on compact version representation is interesting too. Here's how it illustrates sorting version numbers correctly based on their custom u64 representation:

```
Sorted order (by integer comparison of packed u64):
  1.0.0a1       (repr=0x0001000000200001)
  1.0.0b1       (repr=0x0001000000300001)
  1.0.0rc1      (repr=0x0001000000400001)
  1.0.0         (repr=0x0001000000500000)
  1.0.0.post1   (repr=0x0001000000700001)
  1.0.1         (repr=0x0001000100500000)
  2.0.0.dev1    (repr=0x0002000000100001)
  2.0.0         (repr=0x0002000000500000)
```

Tags: performance, python, rust, uv
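To make the range-request trick described above concrete, here's a minimal Python sketch of the same idea: a HEAD request for the file size, a range request for the zip end-of-central-directory and central directory, then one more range request for just the compressed METADATA member. This is my own condensed illustration (assuming httpx, a single-disk zip and no Zip64 records), not the actual wheel_metadata.py script or uv's Rust implementation:

```python
import struct
import zlib

import httpx


def wheel_metadata(wheel_url: str, tail_bytes: int = 16_384) -> str:
    """Fetch the METADATA file from a remote wheel using HTTP range requests."""
    # [1] HEAD request to learn the total file size
    size = int(httpx.head(wheel_url, follow_redirects=True).headers["Content-Length"])
    tail_bytes = min(tail_bytes, size)

    def fetch(start: int, length: int) -> bytes:
        headers = {"Range": f"bytes={start}-{start + length - 1}"}
        return httpx.get(wheel_url, headers=headers, follow_redirects=True).content

    # [2] Grab the tail of the zip, which holds the End Of Central Directory record
    tail = fetch(size - tail_bytes, tail_bytes)
    eocd = tail[tail.rfind(b"PK\x05\x06"):][:22]
    _, _, _, _, entries, cd_size, cd_offset, _ = struct.unpack("<IHHHHIIH", eocd)

    # [3] Fetch the central directory and scan it for *.dist-info/METADATA
    cd = fetch(cd_offset, cd_size)
    pos = 0
    for _ in range(entries):
        fixed = struct.unpack("<IHHHHHHIIIHHHHHII", cd[pos:pos + 46])
        method, comp_size = fixed[4], fixed[8]
        name_len, extra_len, comment_len = fixed[10], fixed[11], fixed[12]
        local_offset = fixed[16]
        name = cd[pos + 46:pos + 46 + name_len].decode("utf-8")
        pos += 46 + name_len + extra_len + comment_len
        if name.endswith(".dist-info/METADATA"):
            break
    else:
        raise ValueError("no METADATA entry found in wheel")

    # [4] Read the local file header to find where the compressed bytes start,
    #     then fetch and decompress just that one member
    local = struct.unpack("<IHHHHHIIIHH", fetch(local_offset, 30))
    data_start = local_offset + 30 + local[9] + local[10]  # skip name + extra fields
    data = fetch(data_start, comp_size)
    return (zlib.decompress(data, -15) if method == 8 else data).decode("utf-8")


if __name__ == "__main__":
    url = (
        "https://files.pythonhosted.org/packages/8b/04/"
        "ef95b67e1ff59c080b2effd1a9a96984d6953f667c91dfe9d77c838fc956/"
        "playwright-1.57.0-py3-none-macosx_11_0_arm64.whl"
    )
    print("\n".join(wheel_metadata(url).splitlines()[:12]))
```

For the ~40MB Playwright wheel this fetches a few tens of kilobytes in total rather than the whole file.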
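And here's a toy illustration of the compact version representation idea: pack the release segments and pre-release phase into fixed-width fields of one integer so that plain integer comparison sorts versions correctly. The field layout below is my own simplification for readability, not uv's actual bit packing (which also has to detect the versions that don't fit and fall back to a slower path):

```python
# Phase values chosen so that dev < alpha < beta < rc < final < post,
# matching the PEP 440 ordering for a single release segment.
PHASES = {"dev": 1, "a": 2, "b": 3, "rc": 4, "": 5, "post": 7}


def pack(major: int, minor: int, micro: int, phase: str = "", number: int = 0) -> int:
    """Toy layout: 16 bits each for major/minor/micro, 4 bits for the phase,
    12 bits for the phase number."""
    assert all(0 <= part < 1 << 16 for part in (major, minor, micro))
    assert 0 <= number < 1 << 12
    return (major << 48) | (minor << 32) | (micro << 16) | (PHASES[phase] << 12) | number


versions = [
    ("2.0.0", pack(2, 0, 0)),
    ("1.0.0rc1", pack(1, 0, 0, "rc", 1)),
    ("1.0.1", pack(1, 0, 1)),
    ("1.0.0a1", pack(1, 0, 0, "a", 1)),
    ("2.0.0.dev1", pack(2, 0, 0, "dev", 1)),
    ("1.0.0.post1", pack(1, 0, 0, "post", 1)),
    ("1.0.0", pack(1, 0, 0)),
    ("1.0.0b1", pack(1, 0, 0, "b", 1)),
]

# Sorting by the packed integer reproduces the ordering shown above -
# comparison and hashing become single machine-word operations.
for label, packed in sorted(versions, key=lambda pair: pair[1]):
    print(f"{label:<12} 0x{packed:016x}")
```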

Dec 17

Gemini 3 Flash

It continues to be a busy December, if not quite as busy as last year. Today's big news is Gemini 3 Flash, the latest in Google's "Flash" line of faster and less expensive models.

Google are emphasizing the comparison between the new Flash and their previous generation's top model Gemini 2.5 Pro:

> Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds.

Gemini 3 Flash's characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF, outputs only text, handles 1,048,576 maximum input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series).

The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro >200k, and it's nice not to have a price increase for the new Flash at larger token lengths.

It's a little more expensive than previous Flash models - Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million on output, Gemini 3 Flash is $0.50/million and $3/million respectively. Google claim it may still end up cheaper though, due to more efficient output token usage:

> Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.

Here's a more extensive price comparison on my llm-prices.com site.

Generating some SVGs of pelicans

I released llm-gemini 0.28 this morning with support for the new model. You can try it out like this:

```bash
llm install -U llm-gemini
llm keys set gemini  # paste in key
llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle"
```

According to the developer docs the new model supports four different thinking level options: minimal, low, medium, and high. This is different from Gemini 3 Pro, which only supported low and high. You can run those like this:

```bash
llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle"
```

Here are four pelicans, for thinking levels minimal, low, medium, and high:

I built the gallery component with Gemini 3 Flash

The gallery above uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this:

```html
<image-gallery width="4">
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg" alt="A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background." />
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg" alt="Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background." />
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg" alt="A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background." />
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg" alt="Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background." />
</image-gallery>
```

Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe:

```bash
llm -m gemini-3-flash-preview --system '
You write alt text for any image pasted in by the user. Alt text is always
presented in a fenced code block to make it easy to copy and paste out. It is
always presented on a single line so it can be used easily in Markdown images.
All text on the image (for screenshots etc) must be exactly included. A short
note describing the nature of the image itself should go first.' \
  -a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg
```

You can see the code that powers the image gallery Web Component here on GitHub. I built it by prompting Gemini 3 Flash via LLM like this:

```bash
llm -m gemini-3-flash-preview '
Build a Web Component that implements a simple image gallery. Usage is like this:

<image-gallery width="5">
  <img src="image1.jpg" alt="Image 1">
  <img src="image2.jpg" alt="Image 2" data-thumb="image2-thumb.jpg">
  <img src="image3.jpg" alt="Image 3">
</image-gallery>

If an image has a data-thumb= attribute that one is used instead, other images
are scaled down. The image gallery always takes up 100% of available width.
The width="5" attribute means that five images will be shown next to each other
in each row. The default is 3. There are gaps between the images. When an image
is clicked it opens a modal dialog with the full size image.

Return a complete HTML file with both the implementation of the Web Component
several example uses of it. Use https://picsum.photos/300/200 URLs for those
example images.'
```

It took a few follow-up prompts using llm -c:

```bash
llm -c 'Use a real modal such that keyboard shortcuts and accessibility features work without extra JS'
llm -c 'Use X for the close icon and make it a bit more subtle'
llm -c 'remove the hover effect entirely'
llm -c 'I want no border on the close icon even when it is focused'
```

Here's the full transcript, exported using llm logs -cue. Those five prompts took:

- 225 input, 3,269 output
- 2,243 input, 2,908 output
- 4,319 input, 2,516 output
- 6,376 input, 2,094 output
- 8,151 input, 1,806 output

Added together that's 21,314 input and 12,593 output for a grand total of 4.8436 cents.

The guide to migrating from Gemini 2.5 reveals one disappointment:

> Image segmentation: Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or Gemini Robotics-ER 1.5.

I wrote about this capability in Gemini 2.5 back in April. I hope they come back in future models - they're a really neat capability that is unique to Gemini.

Tags: google, ai, web-components, generative-ai, llms, llm, gemini, llm-pricing, pelican-riding-a-bicycle, llm-release
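As a sanity check on that cost figure, here's the arithmetic using the Gemini 3 Flash prices quoted above ($0.50/million input tokens, $3/million output tokens, staying under the 200k threshold):

```python
input_tokens = 21_314
output_tokens = 12_593

# Gemini 3 Flash pricing quoted above (per million tokens, <=200k context)
INPUT_PER_MILLION = 0.50
OUTPUT_PER_MILLION = 3.00

cost = input_tokens / 1e6 * INPUT_PER_MILLION + output_tokens / 1e6 * OUTPUT_PER_MILLION
print(f"${cost:.6f}  ({cost * 100:.4f} cents)")
# -> $0.048436  (4.8436 cents)
```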

Dec 16

The new ChatGPT Images is here

OpenAI shipped an update to their ChatGPT Images feature - the feature that gained them 100 million new users in a week when they first launched it back in March, but has since been eclipsed by Google's Nano Banana and then further by Nano Banana Pro in November.

The focus for the new ChatGPT Images is speed and instruction following:

> It makes precise edits while keeping details intact, and generates images up to 4x faster

It's also a little cheaper: OpenAI say that the new gpt-image-1.5 API model makes image input and output "20% cheaper in GPT Image 1.5 as compared to GPT Image 1".

I tried a new test prompt against a photo I took of Natalie's ceramic stand at the farmers market a few weeks ago:

> Add two kakapos inspecting the pots

Here's the result from the new ChatGPT Images model. And here's what I got from Nano Banana Pro: the ChatGPT Kākāpō are a little chonkier, which I think counts as a win.

I was a little less impressed by the result I got for an infographic from the prompt "Infographic explaining how the Datasette open source project works" followed by "Run some extensive searches and gather a bunch of relevant information and then try again" (transcript). See my Nano Banana Pro post for comparison. Both models are clearly now usable for text-heavy graphics though, which makes them far more useful than previous generations of this technology.

Update 21st December 2025: I realized I already have a tool for accessing this new model via the API. Here's what I got from the following:

```bash
OPENAI_API_KEY="$(llm keys get openai)" \
  uv run openai_image.py -m gpt-image-1.5 \
  'a raccoon with a double bass in a jazz bar rocking out'
```

Total cost: $0.2041.

Tags: ai, kakapo, openai, generative-ai, text-to-image, nano-banana
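If you'd rather call the API directly than use that script, something along these lines should work with the official openai Python package. This is a hedged sketch: it assumes the new gpt-image-1.5 model name is accepted by the same images.generate endpoint used for gpt-image-1, and that it returns base64 image data in the same way:

```python
import base64

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumption: gpt-image-1.5 shares the Images API surface of gpt-image-1
result = client.images.generate(
    model="gpt-image-1.5",
    prompt="a raccoon with a double bass in a jazz bar rocking out",
    size="1024x1024",
)

with open("raccoon.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```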

Dec 14

JustHTML is a fascinating example of vibe engineering in action

I recently came across JustHTML, a new Python library for parsing HTML released by Emil Stenström. It's a very interesting piece of software, both as a useful library and as a case study in sophisticated AI-assisted programming.

First impressions of JustHTML

I didn't initially know that JustHTML had been written with AI assistance at all. The README caught my eye due to some attractive characteristics:

- It's pure Python. I like libraries that are pure Python (no C extensions or similar) because it makes them easy to use in less conventional Python environments, including Pyodide.
- "Passes all 9,200+ tests in the official html5lib-tests suite (used by browser vendors)" - this instantly caught my attention! HTML5 is a big, complicated but meticulously written specification.
- 100% test coverage. That's not something you see every day.
- CSS selector queries as a feature. I built a Python library for this many years ago and I'm always interested in seeing new implementations of that pattern. html5lib has been inconsistently maintained over the last few years, leaving me interested in potential alternatives.
- It's only 3,000 lines of implementation code (and another ~11,000 of tests).

I was out and about without a laptop so I decided to put JustHTML through its paces on my phone. I prompted Claude Code for web on my phone and had it build a Pyodide-powered HTML tool for trying it out. This was enough for me to convince myself that the core functionality worked as advertised. It's a neat piece of code!

Turns out it was almost all built by LLMs

At this point I went looking for some more background information on the library and found Emil's blog entry about it, How I wrote JustHTML using coding agents:

> Writing a full HTML5 parser is not a short one-shot problem. I have been working on this project for a couple of months on off-hours.
>
> Tooling: I used plain VS Code with Github Copilot in Agent mode. I enabled automatic approval of all commands, and then added a blacklist of commands that I always wanted to approve manually. I wrote an agent instruction that told it to keep working, and don't stop to ask questions. Worked well!

Emil used several different models - an advantage of working in VS Code Agent mode rather than a provider-locked coding agent like Claude Code or Codex CLI. Claude Sonnet 3.7, Gemini 3 Pro and Claude Opus all get a mention.

Vibe engineering, not vibe coding

What's most interesting about Emil's 17 step account covering those several months of work is how much software engineering was involved, independent of typing out the actual code.

I wrote about vibe engineering a while ago as an alternative to vibe coding. Vibe coding is when you have an LLM knock out code without any semblance of code review - great for prototypes and toy projects, definitely not an approach to use for serious libraries or production code. I proposed "vibe engineering" as the grown up version of vibe coding, where expert programmers use coding agents in a professional and responsible way to produce high quality, reliable results.

You should absolutely read Emil's account in full. A few highlights:

- He hooked in the 9,200-test html5lib-tests conformance suite almost from the start. There's no better way to construct a new HTML5 parser than using the test suite that the browsers themselves use.
- He picked the core API design himself - a TagHandler base class with handle_start() etc. methods - and told the model to implement that.
- He added a comparative benchmark to track performance compared to existing libraries like html5lib, then experimented with a Rust optimization based on those initial numbers.
- He threw the original code away and started from scratch as a rough port of Servo's excellent html5ever Rust library.
- He built a custom profiler and new benchmark and let Gemini 3 Pro loose on it, finally achieving micro-optimizations to beat the existing pure Python libraries.
- He used coverage to identify and remove unnecessary code.
- He had his agent build a custom fuzzer to generate vast numbers of invalid HTML documents and harden the parser against them (see the sketch after this entry).

This represents a lot of sophisticated development practices, tapping into Emil's deep experience as a software engineer. As described, this feels to me more like a lead architect role than a hands-on coding one. It perfectly fits what I was thinking about when I described vibe engineering. Setting the coding agent up with the html5lib-tests suite is also a great example of designing an agentic loop.

"The agent did the typing"

Emil concluded his article like this:

> JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn't have written it this quickly without the agent. But "quickly" doesn't mean "without thinking." I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking. That's probably the right division of labor.

I couldn't agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what's left to be a much more valuable use of my time.

Tags: html, python, ai, generative-ai, llms, ai-assisted-programming, vibe-coding, coding-agents
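The fuzzing step mentioned above is easy to picture. Here's a tiny illustrative sketch of the idea - random malformed HTML thrown at a parser to check it never raises - written against a generic parse() callable. This is my own toy, not Emil's actual fuzzer, and the entry point you plug in (standard library html.parser here, or whatever JustHTML exposes) is up to you:

```python
import random
import string

# Fragments that tend to stress HTML parsers: unclosed tags, half-finished
# comments, stray entities, misnested formatting elements, NUL bytes.
ATOMS = [
    "<div>", "</div>", "<p", ">", "</", "<!--", "-->", "<![CDATA[", "&amp", "&#x",
    "<script>", "</script>", '"', "'", "=", "<table><tr>", "</b></i>", "\x00",
]


def random_document(rng: random.Random, max_parts: int = 50) -> str:
    """Glue together tag fragments, entities and printable junk into invalid HTML."""
    parts = []
    for _ in range(rng.randint(1, max_parts)):
        if rng.random() < 0.7:
            parts.append(rng.choice(ATOMS))
        else:
            parts.append("".join(rng.choices(string.printable, k=rng.randint(1, 8))))
    return "".join(parts)


def fuzz(parse, iterations: int = 10_000, seed: int = 0) -> None:
    """Feed the parser lots of malformed documents; any exception is a bug,
    since an HTML5 parser must accept arbitrary input without crashing."""
    rng = random.Random(seed)
    for i in range(iterations):
        doc = random_document(rng)
        try:
            parse(doc)
        except Exception as exc:
            print(f"crash on iteration {i}: {exc!r}\n{doc!r}")
            raise


if __name__ == "__main__":
    # Swap in any HTML parser entry point here.
    from html.parser import HTMLParser

    fuzz(lambda doc: HTMLParser().feed(doc))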


Built by @aellman

© 2026 68 Ventures, LLC. All rights reserved.