AI speed and performance news. Inference optimization, latency improvements, throughput benchmarks, and model efficiency.
How uv got so fast

Andrew Nesbitt provides an insightful teardown of why uv is so much faster than pip. It's not nearly as simple as just "they rewrote it in Rust": uv gets to skip a huge amount of Python packaging history (which pip needs to implement for backwards compatibility) and benefits enormously from work over recent years that makes it possible to resolve dependencies across most packages without having to execute the code in setup.py using a Python interpreter.

Two notes caught my eye that I hadn't understood before:

> HTTP range requests for metadata. Wheel files are zip archives, and zip archives put their file listing at the end. uv tries PEP 658 metadata first, falls back to HTTP range requests for the zip central directory, then full wheel download, then building from source. Each step is slower and riskier. The design makes the fast path cover 99% of cases. None of this requires Rust. [...]
>
> Compact version representation. uv packs versions into u64 integers where possible, making comparison and hashing fast. Over 90% of versions fit in one u64. This is micro-optimization that compounds across millions of comparisons.

I wanted to learn more about these tricks, so I fired up an asynchronous research task and told it to check out the astral-sh/uv repo, find the Rust code for both of those features and try porting it to Python to help me understand how it works. Here's the report that it wrote for me, the prompts I used and the Claude Code transcript.

You can try the script it wrote for extracting metadata from a wheel using HTTP range requests like this:

uv run --with httpx https://raw.githubusercontent.com/simonw/research/refs/heads/main/http-range-wheel-metadata/wheel_metadata.py https://files.pythonhosted.org/packages/8b/04/ef95b67e1ff59c080b2effd1a9a96984d6953f667c91dfe9d77c838fc956/playwright-1.57.0-py3-none-macosx_11_0_arm64.whl -v

The Playwright wheel there is ~40MB. Adding -v at the end causes the script to spit out verbose details of how it fetched the data - which looks like this. Key extract from that output:

[1] HEAD request to get file size...
    File size: 40,775,575 bytes
[2] Fetching last 16,384 bytes (EOCD + central directory)...
    Received 16,384 bytes
[3] Parsed EOCD:
    Central directory offset: 40,731,572
    Central directory size: 43,981
    Total entries: 453
[4] Fetching complete central directory...
...
[6] Found METADATA: playwright-1.57.0.dist-info/METADATA
    Offset: 40,706,744
    Compressed size: 1,286
    Compression method: 8
[7] Fetching METADATA content (2,376 bytes)...
[8] Decompressed METADATA: 3,453 bytes

Total bytes fetched: 18,760 / 40,775,575 (100.0% savings)

The section of the report on compact version representation is interesting too. Here's how it illustrates sorting version numbers correctly based on their custom u64 representation:

Sorted order (by integer comparison of packed u64):
  1.0.0a1      (repr=0x0001000000200001)
  1.0.0b1      (repr=0x0001000000300001)
  1.0.0rc1     (repr=0x0001000000400001)
  1.0.0        (repr=0x0001000000500000)
  1.0.0.post1  (repr=0x0001000000700001)
  1.0.1        (repr=0x0001000100500000)
  2.0.0.dev1   (repr=0x0002000000100001)
  2.0.0        (repr=0x0002000000500000)

Tags: performance, python, rust, uv
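A postscript on that compact version representation: here's a small Python sketch of the idea. It is an illustration of the technique, not uv's actual bit layout - it packs the common "major.minor.patch plus optional dev/pre/post segment" shape into a single 64-bit integer so that plain integer comparison reproduces the ordering shown above.

```python
import re

# Illustrative sketch only: a simplified u64 packing for "small" versions,
# not uv's exact bit layout. Integer comparison of the packed values then
# matches PEP 440 ordering for the common case.
PHASES = {"dev": 1, "a": 2, "b": 3, "rc": 4, None: 5, "post": 7}  # dev < pre < final < post

VERSION_RE = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)"             # major.minor.patch
    r"(?:\.?(dev|a|b|rc|post)(\d+))?$"  # optional dev/pre/post segment
)

def pack(version: str) -> int:
    m = VERSION_RE.match(version)
    if m is None:
        raise ValueError(f"not a small version: {version}")
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    phase, number = m.group(4), int(m.group(5) or 0)
    if max(major, minor, patch) > 0xFFFF or number > 0xFFF:
        raise ValueError("component too large for the compact form")
    # Layout: major (16 bits) | minor (16) | patch (16) | phase (4) | number (12)
    return (major << 48) | (minor << 32) | (patch << 16) | (PHASES[phase] << 12) | number

versions = ["1.0.1", "1.0.0", "2.0.0", "1.0.0rc1", "1.0.0a1",
            "1.0.0b1", "1.0.0.post1", "2.0.0.dev1"]
for v in sorted(versions, key=pack):
    print(f"{v:>12}  0x{pack(v):016x}")
```

Anything that doesn't match this pattern (epochs, local versions, more release segments) would need to fall back to a slower full representation - the same trade-off behind the "over 90% of versions fit in one u64" figure quoted above.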
Enterprise organizations increasingly rely on web-based applications for critical business processes, yet many workflows remain manually intensive, creating operational inefficiencies and compliance risks. Despite significant technology investments, knowledge workers routinely navigate between eight to twelve different web applications during standard workflows, constantly switching contexts and manually transferring information between systems. Data entry and validation tasks […]
In this post, we explore how agentic QA automation addresses these challenges and walk through a practical example using Amazon Bedrock AgentCore Browser and Amazon Nova Act to automate testing for a sample retail application.
In this post, we demonstrate how to optimize large language model (LLM) inference on Amazon SageMaker AI using BentoML's LLM-Optimizer to systematically identify the best serving configurations for your workload.
This post explores Chain-of-Draft (CoD), an innovative prompting technique introduced in the Zoom AI Research paper Chain of Draft: Thinking Faster by Writing Less, which revolutionizes how models approach reasoning tasks. While Chain-of-Thought (CoT) prompting has been the go-to method for enhancing model reasoning, CoD offers a more efficient alternative that mirrors human problem-solving patterns: concise, high-signal thinking steps rather than verbose explanations.
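The difference is purely in the prompt. Here's a minimal sketch of the two instruction styles - the CoD wording below paraphrases the paper's idea of capping each reasoning step at a few words, and is not quoted from it:

```python
# Chain-of-Thought vs Chain-of-Draft style instructions (paraphrased, not
# verbatim from the paper). The only change is how the model is told to
# format its intermediate reasoning.
COT_INSTRUCTION = (
    "Think step by step to answer the question. "
    "Explain each step, then give the final answer after ####."
)

COD_INSTRUCTION = (
    "Think step by step, but keep only a minimal draft for each thinking "
    "step, at most five words per step. Give the final answer after ####."
)

QUESTION = (
    "Jason had 20 lollipops. He gave Denny some lollipops. "
    "Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?"
)

# A CoD-style answer is expected to look roughly like:
#   20 - x = 12; x = 8
#   #### 8
# instead of several sentences of narrated arithmetic, cutting output
# tokens (and therefore latency and cost) substantially.
```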
In this post, we demonstrate hosting Voxtral models on Amazon SageMaker AI endpoints using vLLM and the Bring Your Own Container (BYOC) approach. vLLM is a high-performance library for serving large language models (LLMs) that features paged attention for improved memory management and tensor parallelism for distributing models across multiple GPUs.
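For a sense of what the serving side looks like, here's a minimal vLLM sketch. The model ID and GPU count are illustrative placeholders, not the configuration from the post (serving Voxtral's audio inputs needs the BYOC setup the post describes):

```python
# Minimal vLLM offline-inference sketch. PagedAttention is handled
# internally by vLLM; tensor_parallel_size shards the model weights
# across multiple GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model ID
    tensor_parallel_size=2,              # split the model across 2 GPUs
)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Explain paged attention in one sentence."], params)
print(outputs[0].outputs[0].text)
```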
Today, we are excited to introduce a new feature for SageMaker Studio: SOCI (Seekable Open Container Initiative) indexing. SOCI supports lazy loading of container images, where only the necessary parts of an image are downloaded initially rather than the entire container.
This post demonstrates how to use the powerful combination of Strands Agents, Amazon Bedrock AgentCore, and NVIDIA NeMo Agent Toolkit to build, evaluate, optimize, and deploy AI agents on Amazon Web Services (AWS) from initial development through production deployment.
In this post, you will learn about bi-directional streaming on AgentCore Runtime and the prerequisites to create a WebSocket implementation. You will also learn how to use Strands Agents to implement a bi-directional streaming solution for voice agents.
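As background, bi-directional streaming simply means the client and server can each send frames whenever they like over a single connection, which is what makes low-latency voice agents possible. Here's a minimal Python sketch of that pattern using the websockets library - the endpoint URL and message shapes are placeholders, not the AgentCore Runtime protocol:

```python
# Generic bi-directional WebSocket sketch: one task streams chunks up
# while another consumes responses as they arrive. The URL and message
# format are placeholders, not AgentCore's actual protocol.
import asyncio
import json
import websockets

async def main(url: str = "wss://example.invalid/agent") -> None:  # placeholder endpoint
    async with websockets.connect(url) as ws:

        async def send_chunks() -> None:
            for i in range(5):                     # stand-in for audio frames
                await ws.send(json.dumps({"type": "audio", "seq": i}))
                await asyncio.sleep(0.1)
            await ws.send(json.dumps({"type": "end"}))

        async def receive() -> None:
            async for message in ws:               # server may respond at any time
                print("agent:", message)

        await asyncio.gather(send_chunks(), receive())

if __name__ == "__main__":
    asyncio.run(main())
```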
It continues to be a busy December, if not quite as busy as last year. Today's big news is Gemini 3 Flash, the latest in Google's "Flash" line of faster and less expensive models. Google are emphasizing the comparison between the new Flash and their previous generation's top model, Gemini 2.5 Pro:

> Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds.

Gemini 3 Flash's characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF input, outputs only text, handles up to 1,048,576 input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series).

The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro for prompts ≤200k tokens and 1/8 the price for prompts >200k, and it's nice not to have a price increase for the new Flash at larger token lengths.

It's a little more expensive than previous Flash models - Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million for output, Gemini 3 Flash is $0.50/million and $3/million respectively. Google claim it may still end up cheaper though, due to more efficient output token usage:

> Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.

Here's a more extensive price comparison on my llm-prices.com site.

Generating some SVGs of pelicans

I released llm-gemini 0.28 this morning with support for the new model. You can try it out like this:

llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle"

According to the developer docs the new model supports four different thinking level options: minimal, low, medium, and high. This is different from Gemini 3 Pro, which only supported low and high. You can run those like this:

llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle"

Here are four pelicans, for thinking levels minimal, low, medium, and high:

I built the gallery component with Gemini 3 Flash

The gallery above uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this:

<image-gallery width="4">
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg" alt="A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background." />
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg" alt="Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background." />
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg" alt="A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background." />
  <img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg" alt="Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background." />
</image-gallery>

Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe:

llm -m gemini-3-flash-preview --system '
You write alt text for any image pasted in by the user. Alt text is always
presented in a fenced code block to make it easy to copy and paste out. It is
always presented on a single line so it can be used easily in Markdown images.
All text on the image (for screenshots etc) must be exactly included. A short
note describing the nature of the image itself should go first.' \
  -a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg

You can see the code that powers the image gallery Web Component here on GitHub. I built it by prompting Gemini 3 Flash via LLM like this:

llm -m gemini-3-flash-preview '
Build a Web Component that implements a simple image gallery. Usage is like this:

<image-gallery width="5">
  <img src="image1.jpg" alt="Image 1">
  <img src="image2.jpg" alt="Image 2" data-thumb="image2-thumb.jpg">
  <img src="image3.jpg" alt="Image 3">
</image-gallery>

If an image has a data-thumb= attribute that one is used instead, other images
are scaled down. The image gallery always takes up 100% of available width.
The width="5" attribute means that five images will be shown next to each other
in each row. The default is 3. There are gaps between the images. When an image
is clicked it opens a modal dialog with the full size image.

Return a complete HTML file with both the implementation of the Web Component
several example uses of it. Use https://picsum.photos/300/200 URLs for those
example images.'

It took a few follow-up prompts using llm -c:

llm -c 'Use a real modal such that keyboard shortcuts and accessibility features work without extra JS'
llm -c 'Use X for the close icon and make it a bit more subtle'
llm -c 'remove the hover effect entirely'
llm -c 'I want no border on the close icon even when it is focused'

Here's the full transcript, exported using llm logs -cue. Those five prompts took:

225 input, 3,269 output
2,243 input, 2,908 output
4,319 input, 2,516 output
6,376 input, 2,094 output
8,151 input, 1,806 output

Added together that's 21,314 input and 12,593 output for a grand total of 4.8436 cents.

The guide to migrating from Gemini 2.5 reveals one disappointment:

> Image segmentation: Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or Gemini Robotics-ER 1.5.

I wrote about this capability in Gemini 2.5 back in April. I hope it comes back in future models - it's a really neat capability that is unique to Gemini.

Tags: google, ai, web-components, generative-ai, llms, llm, gemini, llm-pricing, pelican-riding-a-bicycle, llm-release
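For anyone who wants to check that arithmetic, here's the calculation behind the 4.8436 cent total, using the $0.50/million input and $3/million output prices quoted above:

```python
# Token counts from the five prompts above, priced at Gemini 3 Flash's
# $0.50 per million input tokens and $3.00 per million output tokens.
inputs = [225, 2_243, 4_319, 6_376, 8_151]
outputs = [3_269, 2_908, 2_516, 2_094, 1_806]

total_in, total_out = sum(inputs), sum(outputs)        # 21,314 and 12,593
cost = total_in * 0.50 / 1_000_000 + total_out * 3.00 / 1_000_000
print(f"{total_in:,} input, {total_out:,} output -> ${cost:.6f}")
# 21,314 input, 12,593 output -> $0.048436
```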
The new ChatGPT Images is here

OpenAI shipped an update to their ChatGPT Images feature - the feature that gained them 100 million new users in a week when they first launched it back in March, but which has since been eclipsed by Google's Nano Banana and then further by Nano Banana Pro in November.

The focus for the new ChatGPT Images is speed and instruction following:

> It makes precise edits while keeping details intact, and generates images up to 4x faster

It's also a little cheaper: OpenAI say that the new gpt-image-1.5 API model makes image input and output "20% cheaper in GPT Image 1.5 as compared to GPT Image 1".

I tried a new test prompt against a photo I took of Natalie's ceramic stand at the farmers market a few weeks ago:

Add two kakapos inspecting the pots

Here's the result from the new ChatGPT Images model:

And here's what I got from Nano Banana Pro:

The ChatGPT Kākāpō are a little chonkier, which I think counts as a win.

I was a little less impressed by the result I got for an infographic from the prompt "Infographic explaining how the Datasette open source project works" followed by "Run some extensive searches and gather a bunch of relevant information and then try again" (transcript):

See my Nano Banana Pro post for comparison. Both models are clearly now usable for text-heavy graphics though, which makes them far more useful than previous generations of this technology.

Update 21st December 2025: I realized I already have a tool for accessing this new model via the API. Here's what I got from the following:

OPENAI_API_KEY="$(llm keys get openai)" \
  uv run openai_image.py -m gpt-image-1.5 \
  'a raccoon with a double bass in a jazz bar rocking out'

Total cost: $0.2041.

Tags: ai, kakapo, openai, generative-ai, text-to-image, nano-banana
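If you'd rather call the API directly than use the openai_image.py script above, a sketch with the OpenAI Python SDK looks something like this - I'm assuming gpt-image-1.5 takes the same parameters as gpt-image-1 and returns base64-encoded image data, so treat the details as unverified:

```python
# Hedged sketch: generate an image with the gpt-image-1.5 API model and
# save it to disk. Assumes the same call shape as gpt-image-1 (base64
# response data); parameters may differ in practice.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
result = client.images.generate(
    model="gpt-image-1.5",
    prompt="a raccoon with a double bass in a jazz bar rocking out",
    size="1024x1024",
)

with open("raccoon.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```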
ty: An extremely fast Python type checker and LSP

The team at Astral have been working on this for quite a long time, and are finally releasing the first beta. They have some big performance claims:

> Without caching, ty is consistently between 10x and 60x faster than mypy and Pyright. When run in an editor, the gap is even more dramatic. As an example, after editing a load-bearing file in the PyTorch repository, ty recomputes diagnostics in 4.7ms: 80x faster than Pyright (386ms) and 500x faster than Pyrefly (2.38 seconds).

ty is very fast! The easiest way to try it out is via uvx:

cd my-python-project/
uvx ty check

I tried it against sqlite-utils and it turns out I have quite a lot of work to do!

Astral also released a new VS Code extension adding ty-powered language server features like go to definition. I'm still getting my head around how this works and what it can do.

Via Hacker News

Tags: python, vs-code, astral
The new ChatGPT Images is powered by our flagship image generation model, delivering more precise edits, consistent details, and image generation up to 4× faster. The upgraded model is rolling out to all ChatGPT users today and is also available in the API as GPT-Image-1.5.
When the ABB FIA Formula E World Championship launched its first race through Beijing’s Olympic Park in 2014, the idea of all-electric motorsport still bordered on experimental. Batteries couldn’t yet last a full race, and drivers had to switch cars mid-competition. Just over a decade later, Formula E has evolved into a global entertainment brand…
I recently came across JustHTML, a new Python library for parsing HTML released by Emil Stenström. It's a very interesting piece of software, both as a useful library and as a case study in sophisticated AI-assisted programming.

First impressions of JustHTML

I didn't initially know that JustHTML had been written with AI assistance at all. The README caught my eye due to some attractive characteristics:

- It's pure Python. I like libraries that are pure Python (no C extensions or similar) because it makes them easy to use in less conventional Python environments, including Pyodide.
- "Passes all 9,200+ tests in the official html5lib-tests suite (used by browser vendors)" - this instantly caught my attention! HTML5 is a big, complicated but meticulously written specification.
- 100% test coverage. That's not something you see every day.
- CSS selector queries as a feature. I built a Python library for this many years ago and I'm always interested in seeing new implementations of that pattern.
- html5lib has been inconsistently maintained over the last few years, leaving me interested in potential alternatives.
- It's only 3,000 lines of implementation code (and another ~11,000 of tests).

I was out and about without a laptop, so I decided to put JustHTML through its paces on my phone. I prompted Claude Code for web on my phone and had it build this Pyodide-powered HTML tool for trying it out:

This was enough for me to convince myself that the core functionality worked as advertised. It's a neat piece of code!

Turns out it was almost all built by LLMs

At this point I went looking for some more background information on the library and found Emil's blog entry about it: How I wrote JustHTML using coding agents:

> Writing a full HTML5 parser is not a short one-shot problem. I have been working on this project for a couple of months on off-hours.
>
> Tooling: I used plain VS Code with Github Copilot in Agent mode. I enabled automatic approval of all commands, and then added a blacklist of commands that I always wanted to approve manually. I wrote an agent instruction that told it to keep working, and don't stop to ask questions. Worked well!

Emil used several different models - an advantage of working in VS Code Agent mode rather than a provider-locked coding agent like Claude Code or Codex CLI. Claude Sonnet 3.7, Gemini 3 Pro and Claude Opus all get a mention.

Vibe engineering, not vibe coding

What's most interesting about Emil's 17-step account covering those several months of work is how much software engineering was involved, independent of typing out the actual code.

I wrote about vibe engineering a while ago as an alternative to vibe coding. Vibe coding is when you have an LLM knock out code without any semblance of code review - great for prototypes and toy projects, definitely not an approach to use for serious libraries or production code. I proposed "vibe engineering" as the grown-up version of vibe coding, where expert programmers use coding agents in a professional and responsible way to produce high quality, reliable results.

You should absolutely read Emil's account in full. A few highlights:

- He hooked in the 9,200-test html5lib-tests conformance suite almost from the start. There's no better way to construct a new HTML5 parser than using the test suite that the browsers themselves use.
- He picked the core API design himself - a TagHandler base class with handle_start() etc. methods - and told the model to implement that.
- He added a comparative benchmark to track performance compared to existing libraries like html5lib, then experimented with a Rust optimization based on those initial numbers.
- He threw the original code away and started from scratch as a rough port of Servo's excellent html5ever Rust library.
- He built a custom profiler and new benchmark and let Gemini 3 Pro loose on it, finally achieving micro-optimizations to beat the existing pure Python libraries.
- He used coverage to identify and remove unnecessary code.
- He had his agent build a custom fuzzer to generate vast numbers of invalid HTML documents and harden the parser against them (a generic sketch of that pattern follows at the end of this post).

This represents a lot of sophisticated development practices, tapping into Emil's deep experience as a software engineer. As described, this feels to me more like a lead architect role than a hands-on coding one. It perfectly fits what I was thinking about when I described vibe engineering.

Setting the coding agent up with the html5lib-tests suite is also a great example of designing an agentic loop.

"The agent did the typing"

Emil concluded his article like this:

> JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn't have written it this quickly without the agent. But "quickly" doesn't mean "without thinking." I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking. That's probably the right division of labor.

I couldn't agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what's left to be a much more valuable use of my time.

Tags: html, python, ai, generative-ai, llms, ai-assisted-programming, vibe-coding, coding-agents
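The fuzzing step mentioned above is a pattern worth stealing for any parser project. Here's a generic sketch of the idea - not Emil's actual fuzzer, and using Python's built-in html.parser as a stand-in for whichever parser you're hardening: mutate a valid seed document at random and assert the parser survives whatever it's fed.

```python
# Generic HTML fuzzing sketch (not Emil's actual fuzzer): mutate a seed
# document at random and check that the parser never raises, however
# broken the input. html.parser stands in for the parser under test.
import random
from html.parser import HTMLParser

SEED = "<!DOCTYPE html><html><body><p class='x'>Hello <b>world</b></p></body></html>"
ALPHABET = list("<>&\"'=/abc ")

def mutate(doc: str, rng: random.Random) -> str:
    chars = list(doc)
    for _ in range(rng.randint(1, 10)):
        op = rng.choice(("insert", "delete", "replace"))
        pos = rng.randrange(len(chars)) if chars else 0
        if op == "insert":
            chars.insert(pos, rng.choice(ALPHABET))
        elif op == "delete" and chars:
            del chars[pos]
        elif chars:
            chars[pos] = rng.choice(ALPHABET)
    return "".join(chars)

def main(iterations: int = 10_000) -> None:
    rng = random.Random(0)
    for _ in range(iterations):
        doc = mutate(SEED, rng)
        parser = HTMLParser()
        parser.feed(doc)   # must not raise, no matter how mangled the input
        parser.close()

if __name__ == "__main__":
    main()
```

In a real setup you would run this for far more iterations and save any input that crashes the parser as a permanent regression test.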
OpenAI takes an ownership stake in Thrive Holdings to accelerate enterprise AI adoption, embedding frontier research and engineering directly into accounting and IT services to boost speed, accuracy, and efficiency while creating a scalable model for industry-wide transformation.
How Stripe built a payments foundation model, why stablecoins are powering more of the AI economy, and growing internal AI adoption to 8,500 employees daily.