Everyone here tracks token pricing (this site literally exists for that), but I’m starting to think it’s becoming a misleading metric for real-world systems.
A few observations from building with LLMs:
There’s also the hidden cost of:
This matches what some FinOps discussions highlight: the real cost is driven by usage patterns, not list price.
So I’m wondering:
👉 Should we stop comparing models by $/token and instead compare them by something like:
Curious how others here think about this:
Do you track anything beyond token usage?
Or is everyone still optimizing for the wrong metric?
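One way to make the comparison concrete is cost per *completed* task rather than $/token. Here's a minimal sketch of what I mean; the function name, prices, and success rates are all made up for illustration, not real model data:

```python
def cost_per_completed_task(in_tokens, out_tokens, in_price, out_price, success_rate):
    """Expected dollars spent per successful task.

    in_tokens / out_tokens: average tokens per attempt.
    in_price / out_price: $ per million tokens.
    success_rate: fraction of attempts that succeed without a retry.
    """
    cost_per_attempt = (in_tokens * in_price + out_tokens * out_price) / 1e6
    # Failed attempts still cost money: on average 1/success_rate attempts per task.
    return cost_per_attempt / success_rate

# Made-up numbers: a "cheap" but chatty, flaky model can lose to a pricier terse one.
chatty = cost_per_completed_task(2000, 8000, 0.50, 1.50, 0.5)   # 0.026 $/task
terse  = cost_per_completed_task(2000, 800, 3.00, 15.00, 0.95)  # ~0.019 $/task
print(f"chatty: ${chatty:.4f}/task, terse: ${terse:.4f}/task")
```

The point isn't the specific numbers, it's that retries and verbosity can invert a per-token price comparison.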
It's probably the right question to be asking. Just saw some Ramp data saying that people are spending 13x more on AI subscriptions than they were 16 months ago. I doubt they're getting 13x the value.
I've seen some analysis along these lines, such as Artificial Analysis publishing the cost to run their full benchmark suite. That won't factor in the cost of "incorrect" answers that require follow-up retries, but it does factor in verbosity. That's something I frequently hear when people compare GPT 5.4 vs Opus 4.6: GPT is just chatty.
One other intangible that I don't know if we'll ever really be able to put our finger on is the cost of staying up to date on the latest tools / harnesses / skills / models / prompting techniques / ... It's a never-ending cycle of immeasurable cost.
Yes, I think this is a great point! It's really hard because every individual task is quite different, but one thing we can use is the number of tokens each model consumes when completing benchmarks. That's a metric I'm looking to add now.
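For what it's worth, the aggregation could be as simple as normalizing total benchmark tokens by tasks solved, so a model that burns more tokens to reach the same score looks worse. A rough sketch (the function and data are hypothetical, not an actual pipeline):

```python
def tokens_per_solved_task(results):
    """results: list of (output_tokens, solved: bool), one entry per benchmark task."""
    total_tokens = sum(tokens for tokens, _ in results)
    solved = sum(1 for _, ok in results if ok)
    # Tokens spent on failed tasks still count against the model.
    return total_tokens / solved if solved else float("inf")

runs = [(1200, True), (4000, False), (900, True), (2500, True)]
print(tokens_per_solved_task(runs))  # (1200+4000+900+2500)/3 ≈ 2866.7
```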
Love that! I know Artificial Analysis has something similar, but only for their intelligence index. I'd love to see something more advanced around this.