AI detectors have a certain kind of power.
They decide whether your article gets published, whether a student gets flagged, and whether a freelance writer gets paid.
But as our favorite Marvel hero used to say: With great power comes great responsibility.
And that responsibility is often missing.
Most AI detectors speak in percentages, confidence scores, and definitive labels, like “92% AI” or “Human-written.” And for editors under pressure, those labels often end the conversation before it even starts.
So is that enough? Or fair, even?
To find out, we had to put them to the test. Literally.
Before We Begin: The Truth About AI Detectors
Here’s the thing: AI detectors aren’t truth machines. They’re probability engines.
At the end of the day, a tool that flags AI-generated text doesn’t actually determine authorship; it estimates likelihoods based on patterns.
AI detectors don’t know who wrote a text or how it was created. They compare it against what “AI-written” and “human-written” text usually look like and make an educated guess.
The problem is that those two categories overlap. A lot. Humans can write robotic content. AI can write natural language. And once editing is involved, the line gets even blurrier.
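To make the “patterns” idea concrete, here’s a minimal sketch of one signal detectors are widely believed to rely on: perplexity, or how predictable a text is to a language model. The model choice (GPT-2) and the cutoff are illustrative assumptions on our part, not the internals of any tool tested below.

```python
# A minimal sketch of perplexity as a pattern-based "AI-likeness" signal.
# GPT-2 and the 30.0 cutoff are illustrative assumptions, not the internals
# of any detector tested in this article.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """How predictable the text is to GPT-2 (lower = more predictable)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # With labels equal to input_ids, the model returns the mean
        # cross-entropy loss; exp(loss) is the perplexity.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def ai_likeness_guess(text: str, cutoff: float = 30.0) -> str:
    # Unusually low perplexity is one pattern associated with AI output.
    # Real detectors combine many such signals; this is the core idea only.
    return "looks AI-ish" if perplexity(text) < cutoff else "looks human-ish"

print(ai_likeness_guess("The quick brown fox jumps over the lazy dog."))
```

This is also why the overlap problem is structural: a human writing plain, formulaic prose produces low perplexity too, and an AI prompted toward quirky phrasing produces high perplexity.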
Methodology
Background
To test AI detectors fairly, we kept things simple and realistic.
Instead of synthetic examples or cherry-picked prompts, we used one real blog post and evaluated it in three different versions. Every detector saw the exact same inputs. No regeneration. No rewriting per tool. No detector-specific optimization.
Although we use both ChatGPT and Claude in our own writing workflow, we ran the experiment with ChatGPT, as it remains the most widely used AI writing tool.
The source content
We started with an older GrowthRocks article that involved zero AI assistance: *Discord Marketing: The Complete Guide*.
This article serves as our baseline for fully human-written content.
The 3 versions we tested
Each AI detector was tested against the following three drafts:
Version 1: Human-Written, Original: The original article, published before AI writing tools were part of the workflow.
Version 2: AI-Generated w/ Prompts, Minor Edits: A version written with AI using structured prompting and then refined lightly by me.
Version 3: AI-Generated, Unedited: A version produced by vanilla ChatGPT, written section by section, with no human edits or rewrites after generation.
For full transparency, we’ve included the screenshots from every scan across all detectors and all three versions, exactly as they appeared at the time of testing.
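Before the results, a note on scoring: each detector earns one point per version whose verdict matches the known ground truth, for a maximum of 3/3. Here’s a minimal sketch of that tally; the 50% cutoff is our own simplification, since each tool phrases its verdict slightly differently.

```python
# How the per-detector scores below are tallied: one point per version
# whose verdict matches the known ground truth. The 50% cutoff is our own
# simplification; each tool phrases its verdict slightly differently.
GROUND_TRUTH = {
    "v1_human_original": "human",
    "v2_ai_prompted_minor_edits": "ai",
    "v3_ai_unedited": "ai",
}

def verdict(ai_probability: float, cutoff: float = 50.0) -> str:
    return "ai" if ai_probability >= cutoff else "human"

def score_detector(reported_ai_pct: dict) -> str:
    points = sum(verdict(p) == GROUND_TRUTH[v] for v, p in reported_ai_pct.items())
    return f"{points}/{len(GROUND_TRUTH)}"

# Example with ZeroGPT's reported AI percentages from the results below:
print(score_detector({
    "v1_human_original": 21.6,
    "v2_ai_prompted_minor_edits": 26.59,
    "v3_ai_unedited": 96.22,
}))  # -> "2/3"
```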
1. GPTZero
GPTZero is one of the earliest AI detectors to gain mainstream attention, initially built with education and academic integrity as its primary use case. It launched in early 2023 and quickly spread beyond universities into newsrooms and content teams looking for a fast, opinionated AI verdict.
Score: 3/3
GPTZero delivered perfect accuracy across all three versions. It correctly identified human writing (97% human), detected AI with minor edits (99% AI), and flagged pure AI output (100% AI).
The 1-percentage-point difference between versions 2 and 3 suggests GPTZero can detect minor human intervention, but appropriately treats both as predominantly AI-generated.
2. ZeroGPT
Not to be confused with the previous AI detector, ZeroGPT positions itself as a lightweight, web-based checker designed to flag AI-generated text from popular language models like ChatGPT. ZeroGPT is primarily used for quick, surface-level assessments, offering probability scores through a simple paste-and-scan interface rather than deep editorial analysis.

Score: 2/3
ZeroGPT correctly identified version 1 as human (21.6% AI) and version 3 as AI-generated (96.22% AI). However, it failed on version 2, scoring it at only 26.59% AI and labeling it “Most Likely Human” despite being AI-generated with minor edits.
This suggests ZeroGPT’s detection threshold is too permissive. The small difference between versions 1 and 2 (21.6% vs 26.59%) indicates that light editing can push AI content below its detection threshold.
3. Originality.ai
Originality.ai is a commercial AI content detection and plagiarism platform popular with publishers, SEO teams, and professional content creators looking to verify authenticity before publication. What’s more, it combines an AI detector with a full plagiarism scanner and additional content quality checks, all in one workflow.
Score: 3/3
Originality.ai achieved perfect detection across all three versions. It correctly identified version 1 as 99% original (human), and flagged both versions 2 and 3 as 100% AI-generated.
The detector showed no sensitivity to minor human edits. Both the lightly edited AI (version 2) and pure AI (version 3) received identical 100% scores.
4. Humanize AI Detector
Humanize AI Detector is part of a class of tools focused on both detecting AI text and supporting workflows around “human-like” adjustments. It aggregates signals from several backend models to give users a broader sense of whether content might be machine-generated.
Score: 1/3
Humanize AI Detector only succeeded on version 1, correctly identifying it as 0% AI (human-written). It failed on versions 2 and 3, scoring version 2 at 0% AI and version 3 at just 2% AI, both labeled as “human-written.”
The detector couldn’t identify pure, unedited ChatGPT output, which is the easiest possible test case. It also failed to detect AI content with minor edits.
5. Copyleaks
Copyleaks is a content authenticity platform best known for its plagiarism detection technology, used widely by educators, institutions, and publishers to identify copied content. As generative AI became more prevalent, Copyleaks expanded its offerings to also include AI text detection.
Score: 3/3
Copyleaks delivered perfect accuracy across all three versions. It correctly identified version 1 as human (0% AI) and flagged both versions 2 and 3 as 100% AI-generated.
What sets Copyleaks apart is its “AI Phrases Detected” metric. Version 2 showed 103 AI phrases, while version 3 showed 206. This doubling suggests the detector can identify granular differences between edited and unedited AI content, even when both receive the same 100% AI classification.
Final Results
| AI Detector | Human | AI prompting + minor editing | Pure AI | Score |
|---|---|---|---|---|
| GPTZero | ✅ 97% Human | ✅ 99% AI | ✅ 100% AI | 3/3 |
| ZeroGPT | ✅ 21.6% AI | ❌ 26.59% AI | ✅ 96.22% AI | 2/3 |
| Originality.ai | ✅ 99% Original | ✅ 100% AI | ✅ 100% AI | 3/3 |
| Humanize AI | ✅ 0% AI | ❌ 0% AI | ❌ 2% AI | 1/3 |
| Copyleaks | ✅ 0% AI | ✅ 100% AI | ✅ 100% AI | 3/3 |
Honorable mentions
1. AI Detector by Grammarly
Grammarly’s AI detector is a recent addition to a platform best known for grammar, style, and writing assistance. Rather than positioning itself as a standalone AI detection product, Grammarly presents AI detection as a supportive signal within its broader writing ecosystem.
Score: 1/3
Grammarly correctly identified version 1 as human (0% AI) but failed on both AI-generated versions. Version 2 scored 47% AI, and version 3 scored just 37% AI, both below typical detection thresholds.
The results are not just inaccurate but inverted. Pure, unedited AI (version 3) scored lower than AI with minor edits (version 2).
2. JustDone
JustDone is an AI content platform that offers detection, rewriting, and content enhancement tools aimed at everyday creators and marketers. Its AI detector is designed to present results as probability signals rather than absolute verdicts.
Score: 2/3
JustDone correctly identified both AI-generated versions, scoring version 2 at 82% AI and version 3 at 88% AI. However, it flagged version 1 (the original human-written article) as 74% AI.
While it can detect actual AI content and is sensitive to the gradient between versions (74% to 82% to 88%), its baseline calibration needs refinement.
3. GPTinf
GPTinf is an AI detection and “humanization” tool that focuses on estimating how likely a piece of content is to be AI-generated and then offering ways to reduce that likelihood. It presents results as percentages, positioning itself as a practical tool for users worried about AI flags.
Score: 2/3
GPTinf accurately identified the extremes: version 1 scored 1% AI (human) and version 3 scored 100% AI. However, it failed on version 2, scoring it at only 15% AI even though it was AI-generated with only minor edits.
Minor edits and, most importantly, custom prompting were enough to drop the score from 100% to 15%, allowing lightly edited AI content to potentially slip through detection thresholds.
Final Results: Honorable Mentions
| AI Detector | Human | AI prompting + minor editing | Pure AI | Score |
|---|---|---|---|---|
| Grammarly | ✅ 0% AI | ❌ 47% AI | ❌ 37% AI | 1/3 |
| JustDone | ❌ 74% AI | ✅ 82% AI | ✅ 88% AI | 2/3 |
| GPTinf | ✅ 1% AI | ❌ 15% AI | ✅ 100% AI | 2/3 |
Conclusion
This was a controlled test using one article and three versions. It’s not comprehensive, and no single experiment can definitively judge the reliability of any AI detector. Further testing across different content types, writing styles, and use cases would be needed before drawing firm conclusions about any of these tools.
That said, the results from this experiment are worth examining. Out of eight AI detectors tested, only three achieved perfect accuracy. The results expose a potential problem with how these tools are marketed and used. They speak in percentages and confidence scores that suggest precision, but in this test, half of them couldn’t reliably tell the difference between human and AI writing.
We need to keep in mind that these tools have real consequences.
If you’re a marketer, an editor, a teacher, or anyone making decisions based on these tools, the takeaway is clear: take them with a pinch (or rather, a bag) of salt. Treat these scores as one data point, not the final word.