LLM metrics

Disclaimer: This article is not about helping you choose the right LLM for your specific use case. Its purpose is to explain the metrics you should be comparing when making that decision. Once you understand what these metrics mean, you can ask ChatGPT, Claude, Gemini, Perplexity, Artificial Analysis, OpenRouter, or any benchmarking tool to provide the latest numbers for the models you’re evaluating. The goal is not to tell you which model wins, but to help you understand what game is being played and which scoreboards actually matter.

So…

We keep judging LLMs the way we judge football teams: by goals scored.

Fast model.
High tokens per second.
Looks great on the scoreboard.

But even in football, goals are not everything.

What’s the value of scoring if:

your star player gets injured every second match?
your coach loses the locker room halfway through the season?
you score once… but concede twice?
your defense collapses under pressure?

No serious club is built on goals alone.

Yet in AI, we keep doing exactly that.

We look at average tokens per second and crown winners, as if we’re watching highlights — not running a full season.

That’s where the comparison should stop.

Because once LLMs move from demos to production, they stop being highlights — and start behaving like infrastructure.


Everyone compares models by speed.

Here’s why that’s the wrong starting point.

Most model comparisons stop at average tokens per second.

Fast model wins.
Slow model loses.

That works well for tweets.
It works terribly for products.

Once an LLM enters a real system, speed becomes just one variable among many failure points.

Let’s go through the metrics that actually decide whether a model survives in production.

1. Average tokens per second (Avg t/s)

This is the metric everyone knows.

What it measures
How many output tokens a model generates per second once it has started responding, averaged over the response.

Why it matters

Faster streaming responses feel better
High throughput helps with scale
Batch jobs complete sooner

What it does not tell you

How long the model stays silent before responding
Whether the output is usable
Whether the model breaks mid-response

Avg t/s is output speed — not user experience.
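To make the definition concrete, here is a minimal sketch of how output speed could be computed from token arrival times. The function name and the trace are hypothetical, and real benchmarks may define the measurement window differently.

```python
def output_tokens_per_second(token_times: list[float]) -> float:
    """Decode throughput: tokens generated per second once streaming has started.

    token_times holds the arrival time (in seconds) of each output token,
    measured from the moment the request was sent.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    streaming_window = token_times[-1] - token_times[0]
    if streaming_window <= 0:
        raise ValueError("token times must be increasing")
    # The first token marks the start of the window, so it is not counted.
    return (len(token_times) - 1) / streaming_window


# Hypothetical trace: first token after 2.0 s, then one token every 20 ms.
arrivals = [2.0 + 0.02 * i for i in range(101)]
rate = output_tokens_per_second(arrivals)
print(f"{rate:.1f} tokens/s")
```

The two seconds of silence before the first token never enter the calculation, which is why this number alone says little about how the response feels.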

2. Time to First Token (TTFT)

This is the metric users feel immediately.

What it measures
The time between sending a prompt and receiving the first token of output.

Why it matters

Silence feels like failure
Streaming UX lives or dies here
Users judge responsiveness before quality

Two models can have identical Avg t/s.
One feels instant.
The other feels broken.
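A rough way to measure this yourself is to time how long a stream takes to yield its first item. The sketch below assumes the response arrives as a lazy Python iterable. `fake_stream` is a hypothetical stand-in for a real streaming API, not any provider's SDK.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_token(stream: Iterable[str],
                        clock: Callable[[], float] = time.perf_counter) -> float:
    """Seconds from starting to consume the stream until the first token arrives."""
    start = clock()
    for _ in stream:
        return clock() - start
    raise ValueError("stream produced no tokens")


def fake_stream(initial_delay: float, n_tokens: int = 5, gap: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API response."""
    time.sleep(initial_delay)  # queueing and prompt processing before any output
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(gap)  # same gap between tokens, so same tokens per second


fast_ttft = time_to_first_token(fake_stream(initial_delay=0.02))
slow_ttft = time_to_first_token(fake_stream(initial_delay=0.30))
print(f"fast: {fast_ttft:.2f}s, slow: {slow_ttft:.2f}s")
```

Both fake streams emit tokens at the same pace. Only the initial delay differs, and that is the part the user notices first.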

3. Latency distribution (P50 / P95 / P99)

Averages hide problems.

What these metrics measure

P50 (median): the typical experience
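P95: the latency that 95% of requests stay at or under; the slowest 5% are worse than this
P99: the latency that 99% of requests stay at or under; a view of the tail, not the absolute worst case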
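As an illustration, these percentiles can be computed from raw latency samples with the nearest-rank method. This is one of several common percentile definitions, and the latency numbers below are invented for the example.

```python
import math
import statistics


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in seconds: mostly fast, with a small slow tail.
latencies = [1.0] * 90 + [4.0] * 8 + [30.0] * 2

mean = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(mean, p50, p95, p99)
```

In this made-up sample the mean sits under two seconds while P99 is thirty. The average looks healthy and the tail does not.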

Why this matters
LLMs do not run in isolation.

They run:

in chains
with retries
under concurrency
inside agents

When P95 or P99 latency spikes:

agents stall
queues build
costs rise
trust erodes

Nothing crashes.
The system just becomes unreliable.
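A back-of-the-envelope sketch shows why chains make tail latency worse. It assumes each call is independent, which real systems only approximate, and the function name is made up for this example.

```python
def chance_of_tail_hit(n_calls: int, percentile: float = 95.0) -> float:
    """Probability that at least one of n independent calls is slower than the given percentile."""
    p_fast = percentile / 100
    return 1 - p_fast ** n_calls


for n in (1, 5, 10, 20):
    print(n, round(chance_of_tail_hit(n), 3))
```

Under that assumption, an agent that makes ten model calls has roughly a 40% chance of hitting at least one P95-slow response. A rare event per call becomes a common event per task.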

4. Cost per successful task

Most pricing discussions are misleading.

What people usually look at

Cost per input token
Cost per output token

What actually matters

How much does it cost to complete one task successfully?

A real task includes:

retries
validation
fallback models
hallucination cleanup

A cheap model that fails often is not cheap.
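Here is a minimal sketch of that arithmetic. It assumes every attempt costs the same, attempts succeed independently, and you retry until one succeeds, so the expected number of attempts is 1 / success rate. All prices and success rates are invented for illustration, not taken from any provider.

```python
def cost_per_attempt(input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Token cost of one attempt, with prices quoted per million tokens."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def cost_per_successful_task(attempt_cost: float, success_rate: float,
                             validation_cost: float = 0.0) -> float:
    """Expected spend to get one successful result when failed attempts are retried."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # Expected attempts until the first success is 1 / success_rate.
    return (attempt_cost + validation_cost) / success_rate


# Hypothetical models: same task size, different price and reliability.
cheap = cost_per_successful_task(cost_per_attempt(2000, 500, 0.50, 1.50), success_rate=0.40)
pricey = cost_per_successful_task(cost_per_attempt(2000, 500, 1.00, 3.00), success_rate=0.95)
print(f"cheap: ${cheap:.6f}  pricey: ${pricey:.6f}")
```

With these made-up numbers, the model that is half the price per token ends up costing more per completed task. Fallback models and cleanup work would widen the gap further.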

5. Instruction following & constraint adherence

This is where many systems fail silently.

What it measures

Respect for constraints
Format adherence
JSON correctness
Tool-call accuracy

Many production failures are not wrong answers.
They are almost-correct outputs: JSON with a trailing comma, a missing field, a tool call with one bad argument.

Smart but sloppy models break pipelines.
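Format adherence is easy to measure with a strict check. The sketch below assumes a task that must return a JSON object with two fields. The schema and the sample outputs are hypothetical.

```python
import json

# Hypothetical contract: field name -> accepted Python type(s) after parsing.
REQUIRED_FIELDS = {"sentiment": str, "confidence": (int, float)}


def follows_format(raw: str) -> bool:
    """True only if the output is a bare JSON object with every required field and type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in REQUIRED_FIELDS.items())


outputs = [
    '{"sentiment": "positive", "confidence": 0.9}',                # valid
    '```json\n{"sentiment": "positive", "confidence": 0.9}\n```',  # wrapped in a code fence
    '{"sentiment": "positive", "confidence": 0.9,}',               # trailing comma
    '{"sentiment": "positive"}',                                   # missing field
]
adherence_rate = sum(follows_format(o) for o in outputs) / len(outputs)
print(adherence_rate)
```

Three of the four samples carry the right answer. Only one would pass through a strict parser.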

6. Determinism & output stability

Same prompt.
Same context.
Same settings.

Do you get the same structure and behavior?

Why this matters

Debugging
QA
Auditing
Trust

High variance makes systems unpredictable.
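One way to put a number on this is to run the same prompt several times and compare the shape of the outputs rather than the exact text. The sketch below assumes JSON outputs and uses a made-up "structure signature" (field names plus value types). It is an illustration, not an established metric.

```python
import json
from collections import Counter


def structure_signature(raw: str):
    """Reduce an output to its shape: sorted (field, type) pairs, ignoring the values."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid-json"
    if isinstance(data, dict):
        return tuple(sorted((key, type(value).__name__) for key, value in data.items()))
    return type(data).__name__


def structural_stability(outputs: list[str]) -> float:
    """Share of runs that match the most common structure."""
    if not outputs:
        raise ValueError("no outputs")
    counts = Counter(structure_signature(o) for o in outputs)
    return counts.most_common(1)[0][1] / len(outputs)


# Hypothetical results of five runs with the same prompt and settings.
runs = [
    '{"title": "A", "score": 3}',
    '{"title": "B", "score": 4}',
    '{"score": 5, "title": "C"}',
    '{"title": "D", "score": 2}',
    '{"title": "E", "score": 4, "notes": "extra field"}',
]
stability = structural_stability(runs)
print(stability)
```

The values differ in every run and that is fine. The fifth run changes the structure, and that is what breaks downstream code.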

7. Long-context effectiveness

Big context windows are easy to advertise.

What actually matters

Can the model reason across long inputs?
Does attention degrade?
Does it focus on relevant information?

Large context without reasoning quality is just expensive memory.
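A simple probe for this is to bury one fact at different depths of a long input and check whether the model still finds it. The sketch below is a simplified version of that idea. It tests retrieval only, not multi-step reasoning, and `forgetful_stub` is a hypothetical stand-in you would replace with a real model call.

```python
from typing import Callable


def build_haystack(needle: str, depth: float, n_filler: int = 200,
                   filler: str = "The sky was grey that day.") -> str:
    """Place the needle sentence at a relative depth (0.0 = start, 1.0 = end) of filler text."""
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences)


def recall_by_depth(ask_model: Callable[[str], str], needle: str, question: str,
                    expected: str, depths: list[float]) -> dict[float, bool]:
    """For each depth, report whether the model's answer contains the expected string."""
    results = {}
    for depth in depths:
        prompt = build_haystack(needle, depth) + "\n\nQuestion: " + question
        results[depth] = expected.lower() in ask_model(prompt).lower()
    return results


def forgetful_stub(prompt: str) -> str:
    """Stand-in for a real model: it only 'sees' the last 1,500 characters of the prompt."""
    return "The code is 7421." if "7421" in prompt[-1500:] else "I don't know."


results = recall_by_depth(forgetful_stub, needle="The vault code is 7421.",
                          question="What is the vault code?", expected="7421",
                          depths=[0.0, 0.5, 1.0])
print(results)
```

The stub accepts the whole prompt and still misses anything outside its last stretch of text. A context window can be large on paper and behave the same way.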

8. Tool use & agent behavior

For agentic systems, this is critical.

What it measures

Correct tool selection
Correct arguments
Recovery after tool failure
Knowing when not to act

A model can reason well and still fail as an agent.
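The first, second, and fourth items can be scored against a small set of labeled cases, as sketched below. The `ToolCall` type, the tool names, and the cases are all hypothetical. Recovery after tool failure needs multi-turn traces and is not covered here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: Optional[str]  # None means "no tool call"
    arguments: dict = field(default_factory=dict)


def score_tool_calls(expected: list[ToolCall], actual: list[ToolCall]) -> dict[str, float]:
    """Compare what the model did against what it should have done, case by case."""
    if not expected or len(expected) != len(actual):
        raise ValueError("need matching, non-empty lists")
    pairs = list(zip(expected, actual))
    right_tool = sum(e.name == a.name for e, a in pairs)
    exact = sum(e == a for e, a in pairs)  # same tool and same arguments
    return {"tool_selection": right_tool / len(pairs), "exact_call": exact / len(pairs)}


expected = [
    ToolCall("get_weather", {"city": "Athens"}),
    ToolCall("get_weather", {"city": "Paris"}),
    ToolCall("search_docs", {"query": "refund policy"}),
    ToolCall(None),  # the right move is to answer without a tool
]
actual = [
    ToolCall("get_weather", {"city": "Athens"}),  # correct
    ToolCall("get_weather", {"city": "London"}),  # right tool, wrong argument
    ToolCall("get_weather", {"city": "refund"}),  # wrong tool
    ToolCall(None),                               # correctly did nothing
]
scores = score_tool_calls(expected, actual)
print(scores)
```

Keeping tool selection and exact-call accuracy apart shows whether a model picks the wrong tool or picks the right one and fills it in badly.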

9. Safety, refusals & guardrails

Refusals are inevitable.

What matters is:

false positives
refusal style
recovery paths

Bad refusals break flows.
Good refusals preserve them.

10. Reliability & operational consistency

Often ignored. Always painful.

This includes:

uptime
rate limits
throttling
breaking updates
SDK stability

Reliability is part of model quality.
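Some of this can be tracked from your own request logs. The sketch below assumes you record the HTTP status code of every API call. HTTP 429 (Too Many Requests) is treated as throttling, and the log itself is invented.

```python
from collections import Counter


def reliability_summary(status_codes: list[int]) -> dict[str, float]:
    """Summarize a request log as success, throttling, and server-error rates."""
    if not status_codes:
        raise ValueError("empty log")
    counts = Counter(status_codes)
    total = len(status_codes)
    ok = sum(n for code, n in counts.items() if 200 <= code <= 299)
    server_errors = sum(n for code, n in counts.items() if 500 <= code <= 599)
    return {
        "success_rate": ok / total,
        "throttle_rate": counts[429] / total,
        "server_error_rate": server_errors / total,
    }


# Hypothetical log of 100 requests.
log = [200] * 94 + [429] * 4 + [503] * 2
summary = reliability_summary(log)
print(summary)
```

Tracked per provider and per week, numbers like these show drift that a one-off benchmark cannot.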

Why there is no official scoreboard

There is no governing body for LLM metrics.

No ISO.
No regulator.
No standardized benchmarks across providers.

Why?

Models evolve constantly
Infrastructure affects results
Prompts shape performance
Vendors avoid exposing weaknesses

So we rely on community-driven measurement.

Two unofficial sources worth watching

Artificial Analysis

One of the few structured attempts to evaluate models as systems.

They track:

speed
latency
cost
benchmark performance

Not perfect.
But consistent and transparent.

OpenRouter

A routing layer that exposes real-world behavior.

It sits between:

users
workloads
multiple providers

That position reveals trade-offs you only see in production.

Final thought

Average tokens per second is not useless.

It’s just incomplete.

If you’re building real products, the real question is not:

“Which model is the fastest?”

It’s:

“Which model behaves predictably when things are not ideal?”

Growth has never been about picking the best tool.
It’s about understanding the trade-offs before they hurt you.

Theodore Moulos

Theodore has 20 years of experience running successful and profitable software products. In his free time, he coaches and consults startups. His career includes managerial posts for companies in the UK and abroad, and he has significant skills in intrapreneurship and entrepreneurship.


Tags: cheatsheet, evaluation, llms, metrics

3 hours ago
