Price Per Token
z-spectral·17d ago

“Cost per token vs cost per outcome” - are we optimizing the wrong thing?

Everyone here tracks token pricing (this site literally exists for that), but I’m starting to think it’s becoming a misleading metric for real-world systems.

A few observations from building with LLMs:

  1. Token price is going down, but total system cost often goes up
  2. Longer context, retries, tool calls, and evaluation loops all multiply usage
  3. A “cheap” model can end up more expensive if it fails more often
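That third point can be made concrete with a bit of arithmetic: if each attempt costs c and succeeds with probability p, independent retries mean you need 1/p attempts on average (geometric distribution), so the expected cost per *successful* task is c/p. A minimal sketch, with all prices and success rates purely hypothetical:

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one successful completion.

    Assumes independent retries: on average 1/success_rate attempts
    are needed, so expected cost is cost_per_attempt / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate

# Hypothetical numbers: a "cheap" model at $0.002/attempt but only 40%
# task success, vs a pricier model at $0.004/attempt with 95% success.
cheap = cost_per_success(0.002, 0.40)   # $0.005 per successful task
pricey = cost_per_success(0.004, 0.95)  # ~$0.0042 per successful task
```

Under these made-up numbers the model with the lower sticker price is the more expensive one per outcome, which is exactly the inversion the list-price comparison hides.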

There’s also the hidden cost of:

  • bad outputs → user churn
  • retries → more tokens
  • guardrails → more complexity

This matches what some FinOps discussions highlight - the real cost is driven by usage patterns, not list price...

So I’m wondering:
👉 Should we stop comparing models by $/token and instead compare them by something like:

  • cost per successful task
  • cost per correct answer
  • cost per user session
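For what it's worth, all three of those metrics fall out of ordinary per-request usage logs. A minimal sketch, assuming a hypothetical log where each record carries a cost, a success flag, and a session id (field names are illustrative, not from any particular provider):

```python
# Hypothetical per-request usage records; field names are illustrative.
requests = [
    {"session": "s1", "cost": 0.004, "success": True},
    {"session": "s1", "cost": 0.004, "success": False},  # a retry
    {"session": "s2", "cost": 0.010, "success": True},
]

total_cost = sum(r["cost"] for r in requests)
successes = sum(1 for r in requests if r["success"])
sessions = {r["session"] for r in requests}

# Outcome-level metrics instead of $/token:
cost_per_successful_task = total_cost / successes if successes else float("inf")
cost_per_session = total_cost / len(sessions)
```

The point of aggregating this way is that retries and failures stay in the numerator: a failed attempt still costs money, so it raises cost-per-success even though it would be invisible in a $/token comparison.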

Curious how others here think about this...

Do you track anything beyond token usage?
Or is everyone still optimizing for the wrong metric?


3 comments

Cahl-Dee·17d ago

It's probably the right question to be asking. Just saw some Ramp data showing that companies are spending 13x as much on AI subscriptions as they were 16 months ago. I doubt they're getting 13x the value.

I've seen some analysis along these lines, such as Artificial Analysis publishing data on the cost to run their full benchmark suite. That won't factor in the cost of "incorrect" answers that require follow-up retries, but it does factor in verbosity. That's something I frequently hear people point out when comparing GPT 5.4 vs Opus 4.6: GPT is just chatty.

One other intangible that I don't know if we'll ever really be able to put a finger on is the cost of staying up to date on the latest tools / harnesses / skills / models / prompting techniques / ... It's a never-ending cycle of immeasurable cost.

1 pt
ellmanalex·16d ago

Yes, I think this is a great point! It's really hard because every individual task is quite different, but one thing we can use is tracking the number of tokens each model uses when it completes benchmarks. That's a metric I'm looking to add now.

2 pts
Cahl-Dee·15d ago

Love that! I know Artificial Analysis has something similar, but only for their intelligence index. I'd love to see something more advanced around this.

1 pt