LLM metrics

18 June 2026
• Theodore Moulos

Disclaimer: This article is not about helping you choose the right LLM for your specific use case. Its purpose is to explain the metrics you should be comparing when making that decision. Once you understand what these metrics mean, you can ask ChatGPT, Claude, Gemini, Perplexity, Artificial Analysis, OpenRouter, or any benchmarking tool to provide the latest numbers for the models you’re evaluating. The goal is not to tell you which model wins, but to help you understand what game is being played and which scoreboards actually matter.

So…

We keep judging LLMs like football teams by goals scored.

Fast model.
High tokens per second.
Looks great on the scoreboard.

But even in football, goals are not everything.

What’s the value of scoring if:

your star player gets injured every second match?
your coach loses the locker room halfway through the season?
you score once… but concede twice?
your defense collapses under pressure?

No serious club is built on goals alone.

Yet in AI, we keep doing exactly that.

We look at average tokens per second and crown winners, as if we’re watching highlights — not running a full season.

That’s where the comparison should stop.

Because once LLMs move from demos to production, they stop being highlights — and start behaving like infrastructure.

Everyone compares models by speed.

Here’s why that’s the wrong starting point.

Most model comparisons stop at average tokens per second.

Fast model wins.
Slow model loses.

That works well for tweets.
It works terribly for products.

Once an LLM enters a real system, speed becomes just one variable among many, and every one of them is a potential failure point.

Let’s go through the metrics that actually decide whether a model survives in production.

1. Average tokens per second (Avg t/s)

This is the metric everyone knows.

What it measures
How many tokens a model can generate per second after it has started responding.

Why it matters

Faster streaming responses feel better
High throughput helps with scale
Batch jobs complete sooner

What it does not tell you

How long the model stays silent before responding
Whether the output is usable
Whether the model breaks mid-response

Avg t/s is output speed — not user experience.
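
As a rough illustration, output speed can be computed from the arrival times of streamed tokens. The sketch below assumes you have recorded one timestamp (in seconds) per token; the function name and the sample numbers are hypothetical.

```python
def avg_tokens_per_second(token_timestamps: list[float]) -> float:
    """Output speed between the first and the last streamed token."""
    if len(token_timestamps) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    elapsed = token_timestamps[-1] - token_timestamps[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    # The first token starts the clock, so only the tokens after it are counted.
    return (len(token_timestamps) - 1) / elapsed


# Two hypothetical responses: same streaming rate, very different waits.
quick_start = [0.25, 0.75, 1.25, 1.75, 2.25]  # first token after 0.25 s
slow_start = [6.0, 6.5, 7.0, 7.5, 8.0]        # first token after 6.0 s
```

Both samples produce the same rate, because the wait before the first token never enters the calculation.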

2. Time to First Token (TTFT)

This is the metric users feel immediately.

What it measures
The time between sending a prompt and receiving the first token of output.

Why it matters

Silence feels like failure
Streaming UX lives or dies here
Users judge responsiveness before quality

Two models can have identical Avg t/s.
One feels instant.
The other feels broken.
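
A minimal sketch of how TTFT could be measured on the client side. It assumes the response arrives as a lazy iterable of tokens, so that the request is effectively sent when iteration begins; `time_to_first_token` and its `clock` parameter are illustrative names, not part of any provider SDK.

```python
import time
from typing import Callable, Iterable


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Seconds between starting to read the stream and its first token."""
    started_at = clock()
    for _ in stream:
        # Stop at the first token; the rest of the response is irrelevant here.
        return clock() - started_at
    raise ValueError("stream ended without producing a token")
```

The `clock` argument is there only so the function can be checked with fake time instead of a real network call.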

3. Latency distribution (P50 / P95 / P99)

Averages hide problems.

What these metrics measure

P50 (median): the typical experience
P95: the latency that 95% of requests stay under; the slowest 5% are worse
P99: the latency that 99% of requests stay under; the near-worst-case tail

Why this matters
LLMs do not run in isolation.

They run:

in chains
with retries
under concurrency
inside agents

When P95 or P99 latency spikes:

agents stall
queues build
costs rise
trust erodes

Nothing crashes.
The system just becomes unreliable.
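
To make the point about averages concrete, here is a small sketch using the nearest-rank definition of a percentile (one of several definitions in use). The latency sample is invented for illustration.

```python
def percentile(latencies: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not latencies:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(latencies)
    rank = -(-p * len(ordered) // 100)  # integer ceiling of p * n / 100
    return ordered[rank - 1]


# Hypothetical sample: 95 fast requests and 5 very slow ones (seconds).
latencies = [1.0] * 95 + [20.0] * 5
mean = sum(latencies) / len(latencies)
summary = {p: percentile(latencies, p) for p in (50, 95, 99)}
```

In this sample the mean is 1.95 s, P50 and P95 are both 1.0 s, and only P99 (20.0 s) exposes the slow tail.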

4. Cost per successful task

Most pricing discussions are misleading.

What people usually look at

Cost per input token
Cost per output token

What actually matters

How much does it cost to complete one task successfully?

A real task includes:

retries
validation
fallback models
hallucination cleanup

A cheap model that fails often is not cheap.
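
A back-of-the-envelope sketch of that arithmetic. It assumes failed attempts are simply retried until one succeeds and that attempts are independent, so the expected number of tries is 1 / success rate. The prices, success rates, and overhead figure are made up, not real vendor numbers.

```python
def expected_cost_per_success(
    cost_per_attempt: float,
    success_rate: float,
    overhead_per_attempt: float = 0.0,
) -> float:
    """Expected spend to get one usable result when failed attempts are retried."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # With independent attempts, the expected number of tries is 1 / success_rate.
    return (cost_per_attempt + overhead_per_attempt) / success_rate


# Hypothetical numbers: overhead covers validation and cleanup per attempt.
cheap = expected_cost_per_success(0.001, success_rate=0.40, overhead_per_attempt=0.0005)
strong = expected_cost_per_success(0.002, success_rate=0.95, overhead_per_attempt=0.0005)
```

With these assumed inputs, the model that costs twice as much per attempt ends up cheaper per successful task.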

5. Instruction following & constraint adherence

This is where many systems fail silently.

What it measures

Respect for constraints
Format adherence
JSON correctness
Tool-call accuracy

Most production failures are not wrong answers.
They are almost-correct outputs: valid-looking JSON with a missing field, or the right answer in the wrong format.

Smart but sloppy models break pipelines.
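
One way to catch these failures before they reach the rest of a pipeline is a strict check on every output. This is only a sketch: `find_problems` and the two-field `schema` are hypothetical, and a real system would likely use a full schema validator.

```python
import json


def find_problems(raw_output: str, required_fields: dict[str, type]) -> list[str]:
    """Return the reasons a model output would break a JSON-consuming pipeline."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for name, expected_type in required_fields.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    return problems


schema = {"title": str, "priority": int}  # hypothetical output contract
```

An output such as `Sure! {"title": "Fix login", "priority": 2}` contains the right answer and still fails the check, because the friendly prefix makes it invalid JSON.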

6. Determinism & output stability

Same prompt.
Same context.
Same settings.

Do you get the same structure and behavior?

Why this matters

Debugging
QA
Auditing
Trust

High variance makes systems unpredictable.
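
A simple way to put a number on this: send the same prompt several times, reduce each output to its structure, and see how often the most common structure appears. The sketch assumes JSON outputs and compares only top-level keys; both function names are illustrative.

```python
import json
from collections import Counter


def structure_of(raw_output: str) -> tuple:
    """Reduce an output to its shape: the sorted top-level JSON keys."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ("<invalid JSON>",)
    if not isinstance(data, dict):
        return ("<not an object>",)
    return tuple(sorted(data))


def structural_stability(outputs: list[str]) -> float:
    """Share of runs that produced the most common structure (1.0 = fully stable)."""
    if not outputs:
        raise ValueError("no outputs to compare")
    counts = Counter(structure_of(o) for o in outputs)
    return counts.most_common(1)[0][1] / len(outputs)
```

Values are allowed to differ between runs here; only a change in shape counts against the score.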

7. Long-context effectiveness

Big context windows are easy to advertise.

What actually matters

Can the model reason across long inputs?
Does attention degrade?
Does it focus on relevant information?

Large context without reasoning quality is just expensive memory.
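
A common way to probe this is to plant one fact at different depths of a long input and check whether the model can still retrieve it. The sketch below assumes `ask` is your own function that sends a prompt to a model and returns its text; the planted fact, filler sentence, and function names are all invented for illustration. Retrieval is only a proxy: it does not test reasoning across the input.

```python
from typing import Callable

NEEDLE = "The access code is 7421."  # hypothetical fact planted in the input
QUESTION = "What is the access code?"
FILLER = "The weather report was unremarkable that day."


def build_prompt(depth: float, total_sentences: int = 200) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * total_sentences
    sentences.insert(round(depth * total_sentences), NEEDLE)
    return " ".join(sentences) + "\n\n" + QUESTION


def recall_by_depth(ask: Callable[[str], str], depths: list[float]) -> dict[float, bool]:
    """For each depth, record whether the model's answer contains the planted code."""
    return {depth: "7421" in ask(build_prompt(depth)) for depth in depths}
```

Running it over several depths and several input lengths shows where, if anywhere, retrieval starts to fail.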

8. Tool use & agent behavior

For agentic systems, this is critical.

What it measures

Correct tool selection
Correct arguments
Recovery after tool failure
Knowing when not to act

A model can reason well and still fail as an agent.
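
Three of the four checks above can be scored against hand-labelled expectations. In this sketch a tool call is modelled as a `(name, arguments)` pair and `None` means the agent should do nothing; that representation and the labels are assumptions for illustration. Recovery after tool failure needs multi-step traces and is not covered here.

```python
from typing import Optional

ToolCall = Optional[tuple[str, dict]]  # (tool name, arguments), or None for "do nothing"


def grade_tool_call(expected: ToolCall, actual: ToolCall) -> str:
    """Classify one agent step against a hand-labelled expectation."""
    if expected is None:
        return "correct" if actual is None else "acted when it should not"
    if actual is None:
        return "missed action"
    if actual[0] != expected[0]:
        return "wrong tool"
    if actual[1] != expected[1]:
        return "wrong arguments"
    return "correct"
```

Counting these labels over a test set gives a per-category error rate rather than one blended accuracy figure.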

9. Safety, refusals & guardrails

Refusals are inevitable.

What matters is:

false positives
refusal style
recovery paths

Bad refusals break flows.
Good refusals preserve them.
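
False positives are the easiest of the three to quantify. The sketch assumes a reviewed sample where each prompt has been labelled benign or not, and each response labelled refused or not; the function name is hypothetical.

```python
def false_refusal_rate(results: list[tuple[bool, bool]]) -> float:
    """Share of benign prompts that were refused.

    Each result is (is_benign, was_refused), labelled by a human reviewer.
    """
    refusals_on_benign = [refused for benign, refused in results if benign]
    if not refusals_on_benign:
        raise ValueError("no benign prompts in the sample")
    return sum(refusals_on_benign) / len(refusals_on_benign)
```

Refusals of prompts that were not benign are ignored, since those are the refusals you want.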

10. Reliability & operational consistency

Often ignored. Always painful.

This includes:

uptime
rate limits
throttling
breaking updates
SDK stability

Reliability is part of model quality.
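
Part of this can be read straight from your own request logs. The sketch below assumes you log the HTTP status code of every call and that the provider signals rate limiting with status 429, which is common but not guaranteed; the bucket names are arbitrary.

```python
from collections import Counter


def summarize_statuses(status_codes: list[int]) -> dict[str, float]:
    """Turn a log of HTTP status codes into the share of requests per outcome."""
    if not status_codes:
        raise ValueError("empty log")

    def bucket(code: int) -> str:
        if 200 <= code < 300:
            return "ok"
        if code == 429:  # HTTP 429 Too Many Requests
            return "rate_limited"
        if 500 <= code < 600:
            return "server_error"
        return "other"

    counts = Counter(bucket(code) for code in status_codes)
    total = len(status_codes)
    names = ("ok", "rate_limited", "server_error", "other")
    return {name: counts[name] / total for name in names}
```

Tracked per day or per model version, these shares make throttling and breaking updates visible as trends instead of anecdotes.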

Why there is no official scoreboard

There is no governing body for LLM metrics.

No ISO.
No regulator.
No standardized benchmarks across providers.

Why?

Models evolve constantly
Infrastructure affects results
Prompts shape performance
Vendors avoid exposing weaknesses

So we rely on community-driven measurement.

Two unofficial sources worth watching

Artificial Analysis

One of the few structured attempts to evaluate models as systems.

They track:

speed
latency
cost
benchmark performance

Not perfect.
But consistent and transparent.

OpenRouter

A routing layer that exposes real-world behavior.

It sits between:

users
workloads
multiple providers

That position reveals trade-offs you only see in production.

Final thought

Average tokens per second is not useless.

It’s just incomplete.

If you’re building real products, the real question is not:

“Which model is the fastest?”

It’s:

“Which model behaves predictably when things are not ideal?”

Growth has never been about picking the best tool.
It’s about understanding the trade-offs before they hurt you.

Theodore Moulos

Theodore has 20 years of experience running successful and profitable software products. In his free time, he coaches and consults startups. His career includes managerial posts for companies in the UK and abroad, and he has significant skills in intrapreneurship and entrepreneurship.

Topics: Artificial Intelligence, cheatsheet, evaluation, llms, metrics
