Disclaimer: This article is not about helping you choose the right LLM for your specific use case. Its purpose is to explain the metrics you should be comparing when making that decision. Once you understand what these metrics mean, you can ask ChatGPT, Claude, Gemini, Perplexity, Artificial Analysis, OpenRouter, or any benchmarking tool to provide the latest numbers for the models you’re evaluating. The goal is not to tell you which model wins, but to help you understand what game is being played and which scoreboards actually matter.
So…
We keep judging LLMs like football teams by goals scored.
Fast model.
High tokens per second.
Looks great on the scoreboard.
But even in football, goals are not everything.
What’s the value of scoring if:
No serious club is built on goals alone.
Yet in AI, we keep doing exactly that.
We look at average tokens per second and crown winners, as if we’re watching highlights — not running a full season.
That’s where the comparison should stop.
Because once LLMs move from demos to production, they stop being highlights — and start behaving like infrastructure.
Here’s why that’s the wrong starting point.
Most model comparisons stop at Avg tokens per second.
Fast model wins.
Slow model loses.
That works well for tweets.
It works terribly for products.
Once an LLM enters a real system, speed is just one variable among many, and each of them is a potential failure point.
Let’s go through the metrics that actually decide whether a model survives in production.
Average tokens per second (Avg t/s) is the metric everyone knows.
What it measures
How many tokens a model can generate per second after it has started responding.
Why it matters
What it does not tell you
Avg t/s is output speed — not user experience.
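To make the definition concrete, here is a minimal sketch of how output speed could be computed from token arrival times. The function name and the trace are hypothetical, and the timings are made up for illustration; real benchmarks may count tokens or time windows differently.

```python
def output_tokens_per_second(token_times: list[float]) -> float:
    """Output speed once generation has started.

    token_times holds the arrival time of each output token, in seconds
    since the request was sent. The wait for the first token is excluded.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    generation_time = token_times[-1] - token_times[0]
    if generation_time <= 0:
        raise ValueError("token times must be increasing")
    # Tokens produced after the first one, divided by the time they took.
    return (len(token_times) - 1) / generation_time


# Hypothetical trace: first token at 0.8 s, then one token every 20 ms.
trace = [0.8 + 0.02 * i for i in range(101)]
rate = output_tokens_per_second(trace)
print(f"{rate:.1f} tokens/s")
```

Note that the 0.8 seconds before the first token never enters the result. That wait is exactly the part this metric cannot see.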
Time to first token (TTFT) is the metric users feel immediately.
What it measures
The time between sending a prompt and receiving the first token of output.
Why it matters
Two models can have identical Avg t/s.
One feels instant.
The other feels broken.
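A rough sketch of why that happens, assuming total response time is simply the wait for the first token plus generation time. The `ModelProfile` class and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    ttft_s: float             # wait before the first token, in seconds
    tokens_per_second: float  # output speed after the first token


def time_to_full_response(model: ModelProfile, output_tokens: int) -> float:
    # Simplified model: first-token wait plus steady generation.
    return model.ttft_s + output_tokens / model.tokens_per_second


# Two hypothetical models with identical Avg t/s.
model_a = ModelProfile("A", ttft_s=0.3, tokens_per_second=50.0)
model_b = ModelProfile("B", ttft_s=4.0, tokens_per_second=50.0)

a_total = time_to_full_response(model_a, 200)
b_total = time_to_full_response(model_b, 200)
for model, total in ((model_a, a_total), (model_b, b_total)):
    print(f"Model {model.name}: first token after {model.ttft_s:.1f} s, "
          f"done after {total:.1f} s")
```

On a tokens-per-second leaderboard these two models tie. In a chat window, one starts typing almost at once and the other shows a blank screen for four seconds.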
Averages hide problems.
What these metrics measure
Why this matters
LLMs do not run in isolation.
They run:
When P95 or P99 latency spikes:
Nothing crashes.
The system just becomes unreliable.
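A small sketch of how an average can hide a slow tail, using a nearest-rank percentile. The latency sample is invented for illustration, and monitoring tools may use a different percentile method.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 94 requests take 1 s, 6 requests take 9 s.
latencies = [1.0] * 94 + [9.0] * 6

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(f"mean={mean:.2f}s  P50={p50:.1f}s  P95={p95:.1f}s  P99={p99:.1f}s")
```

The mean here is under 1.5 seconds, which looks healthy. P95 and P99 are both 9 seconds, which is what a timeout or an impatient user actually runs into.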
Most pricing discussions are misleading.
What people usually look at
What actually matters
How much does it cost to complete one task successfully?
A real task includes:
A cheap model that fails often is not cheap.
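The arithmetic behind that claim can be sketched as follows. It assumes a failed attempt is simply retried and that attempts succeed independently, so the expected number of attempts is one divided by the success rate. Prices, token counts, and success rates are hypothetical.

```python
def cost_per_attempt(input_tokens: int, output_tokens: int,
                     input_price_per_million: float,
                     output_price_per_million: float) -> float:
    """Cost in dollars of a single call at per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


def cost_per_successful_task(attempt_cost: float, success_rate: float) -> float:
    """Expected cost until one attempt succeeds (retry-until-success assumption)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return attempt_cost / success_rate


# Hypothetical task: 2,000 input tokens and 500 output tokens per attempt.
cheap = cost_per_successful_task(
    cost_per_attempt(2000, 500, 0.5, 1.5), success_rate=0.40)
pricey = cost_per_successful_task(
    cost_per_attempt(2000, 500, 1.0, 3.0), success_rate=0.95)
print(f"cheap model:  ${cheap:.6f} per successful task")
print(f"pricey model: ${pricey:.6f} per successful task")
```

With these made-up numbers, the model at half the list price ends up costing more per completed task, before counting the extra latency of each retry.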
This is where many systems fail silently.
What it measures
Most production failures are not wrong answers.
They are almost correct outputs.
Smart but sloppy models break pipelines.
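One way to make "almost correct" measurable, assuming the pipeline expects a JSON object with a fixed set of fields. The schema and sample outputs below are hypothetical.

```python
import json

# Hypothetical schema: the downstream code needs both fields.
REQUIRED_KEYS = {"name", "priority"}


def is_usable(raw: str) -> bool:
    """True only if the output parses as JSON and has every required field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS <= data.keys()


outputs = [
    '{"name": "fix login", "priority": 2}',        # correct
    '{"name": "fix login", "priority": 2,}',       # trailing comma
    'Sure! {"name": "fix login", "priority": 2}',  # chatty preamble
    '{"name": "fix login"}',                       # missing field
]
usable_rate = sum(is_usable(o) for o in outputs) / len(outputs)
print(f"usable outputs: {usable_rate:.0%}")
```

A human reading these four outputs would call all of them right. A parser accepts one.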
Same prompt.
Same context.
Same settings.
Do you get the same structure and behavior?
Why this matters
High variance makes systems unpredictable.
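A minimal sketch of how structural consistency could be scored, assuming JSON outputs and treating the set of top-level keys as the "structure". The signature scheme and sample runs are assumptions made for illustration.

```python
import json
from collections import Counter


def structure_signature(raw: str) -> str:
    """Reduce an output to its shape, ignoring the values inside it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "unparseable"
    if isinstance(data, dict):
        return "object:" + ",".join(sorted(data))
    return type(data).__name__


def structural_consistency(outputs: list[str]) -> float:
    """Share of runs that match the most common structure."""
    counts = Counter(structure_signature(o) for o in outputs)
    return counts.most_common(1)[0][1] / len(outputs)


# Hypothetical outputs from five runs of the same prompt and settings.
runs = [
    '{"a": 1, "b": 2}',
    '{"b": 5, "a": 3}',
    '{"a": 1}',
    'not json',
    '{"a": 9, "b": 0}',
]
score = structural_consistency(runs)
print(f"structural consistency: {score:.0%}")
```

Values are allowed to differ between runs. What the score penalizes is a change in shape, because that is what breaks the code reading the output.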
Big context windows are easy to advertise.
What actually matters
Large context without reasoning quality is just expensive memory.
For agentic systems, this is critical.
What it measures
A model can reason well and still fail as an agent.
Refusals are inevitable.
What matters is:
Bad refusals break flows.
Good refusals preserve them.
Often ignored. Always painful.
This includes:
Reliability is part of model quality.
There is no governing body for LLM metrics.
No ISO.
No regulator.
No standardized benchmarks across providers.
Why?
So we rely on community-driven measurement.
Artificial Analysis is one of the few structured attempts to evaluate models as systems.
They track:
Not perfect.
But consistent and transparent.
OpenRouter is a routing layer that exposes real-world behavior.
Because it sits between:
It reveals tradeoffs you only see in production.
Average tokens per second is not useless.
It’s just incomplete.
If you’re building real products, the real question is not:
“Which model is the fastest?”
It’s:
“Which model behaves predictably when things are not ideal?”
Growth has never been about picking the best tool.
It’s about understanding the trade-offs before they hurt you.
Theodore has 20 years of experience running successful and profitable software products. In his free time, he coaches and consults startups. His career includes managerial posts for companies in the UK and abroad, and he has significant skills in intrapreneurship and entrepreneurship.