On AI Text Detection

This article provides a technical audit of AI text detection service Pangram, highlighting that its accuracy drops drastically on mixed human-AI text, false positive rates vary greatly by individual and genre, and the company’s incentives promote overconfidence in results, risking witch hunts and damaging trust.

Source: Hacker News AI | Author: dvrp

AI Text Detection: Arms Dealers in a War on Truth

A technical audit of Pangram, commentary on the incentives of businesses that sell authority on truth, and how such a business may add fuel to the epidemic of AI-generated content rather than quell it.

Ethan Smith

Jul 02, 2026

Pangram, at face value, comes off as performing a morally good duty: cleaning up the internet, detecting deceit, and pushing back against the slop pouring out of models made by big companies. And I agree with this mission for the most part.

However, doing this effectively and fairly hinges on both the reliability of the detector and an accurate understanding of where it falls short.

Pangram is ultimately a revenue-generating startup, and if leveraging a “for the people” aura means growth and marketing, they are incentivized to lean into it.

If it were a perfect detector, I’d have no beef. I have no qualms with pointing out AI writing, particularly if the author is dishonest about their usage.

The problem is that they’ve garnered a sense of authority to deliver a final verdict on the authorship of a piece of text. Posts and marketing from their employees go around deeming content AI-generated. Again, if this could be known with certainty, fine, but otherwise it’s witch-hunting based on noisy and fallible metrics that are repeatedly reported as ground truth.

I believe you can be right most of the time, but if we’re publicly shaming people or punishing students and academics over something that could be wrong, it’s a risk we should take seriously. This is similar to the reason polygraph lie detectors are controversial in courts and inadmissible as evidence in many jurisdictions.

Lately, my X feed has been filled with claims that various writings were authored by AI, to the point where it’s become an epidemic. Maybe they’re true. Maybe they’re not.

This article aims not to adjudicate these accusations directly, but to highlight:

Points of fragility with AI text detection, and what the research shows.

An alignment and incentives issue with AI text detection as a service.

An epidemic around a lack of falsifiability, where providing hard proof is often hopeless.

Note: All text here is human-written, though Claude was used to help with research, interpreting results, stress-testing the claims and ideas in this piece, and playing devil’s advocate against my arguments. Additionally, there are a few direct quotes from Claude where I felt the statement did a good job of summarizing an idea; these are explicitly labeled.

This in itself, transparency about how AI was used, is something I hope catches on as a norm. In the best case, I think AI can be an effective thought-sparring partner and help keep authors from overclaiming by providing opposing perspectives or searching for evidence.

The point here is not that AI text detection is useless or even that the product is “broken.” I have good reason to believe Pangram is one of the top options. But its claims are overstated in a way that creates risk, some of its use cases have potentially introduced new problems and drama, and the company may have incentives to double down on these fiascos rather than pull back.

I’d like to give an overview of the main points in case there isn’t time to read the whole thing. If you read anything in full without going too deep into the technicals, I’d suggest:

Section 1, for understanding classifiers and what an output tells us.

Section 8, for understanding company positioning.

On Efficacy

False positive rate varies enormously based on how the text was created

Pangram shines the most when dealing with purely human or purely LLM articles. This is the closest we get to the regime where the 1-in-10,000 FPR holds (with caveats).

Hybrid authorship, the common workflow of mixing one’s own writing with AI edits, is a different story. The EditLens report, which estimates different levels of AI involvement and is the only available reference here, shows substantial drops in classification accuracy, both on the “home-court” lab-created examples it was validated on and in testing on external datasets.

Prose polishing, rewording, and the like can trigger AI detection far more often. After a single AI edit, or even a paraphrasing request through the Grammarly app, a substantial portion of examples still lands in the fully-human range, and even more land past the threshold for fully AI. Box plots unfortunately do not show us frequency, but assuming a loosely Gaussian spread, I’d estimate as much as ~20% could be classified as fully AI.

The number-of-edits chart, read in terms of standard deviations (which can be distorted for bounded, asymmetric data), suggests that after 1 edit ~15% of examples can remain in fully-human space, while the region 1.5 standard deviations above the mean suggests 1 edit can cause about 6.5% of examples to be classified as fully AI.

But ultimately, nothing in the UI tells you that the text you put in falls into this category and that you should recalibrate your expectations.
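The standard-deviation estimates above lean on a normal approximation. As a rough sketch of that arithmetic, assuming the detector scores are approximately Gaussian (an assumption that a bounded, asymmetric score range may well violate), the share of examples beyond a given number of standard deviations can be computed directly. The function names and the one-standard-deviation placement of the "fully human" cutoff are mine, chosen for illustration, not taken from the report.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def share_above(z: float) -> float:
    """Share of a normal population more than z standard deviations above the mean."""
    return 1.0 - STANDARD_NORMAL.cdf(z)

def share_below(z: float) -> float:
    """Share of a normal population more than z standard deviations below the mean."""
    return STANDARD_NORMAL.cdf(-z)

# Assumption: the "fully AI" cutoff sits about 1.5 SD above the mean after one edit.
print(f"above +1.5 SD: {share_above(1.5):.1%}")
# Assumption: the "fully human" cutoff sits about 1 SD below the mean.
print(f"below -1.0 SD: {share_below(1.0):.1%}")
```

Under these assumptions the upper tail comes out near 6.7% and the lower tail near 15.9%, in the same ballpark as the chart-based estimates. If the real score distribution is skewed or piles up at the bounds, the true shares could differ noticeably in either direction.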

Benchmarks may not reflect real world conditions

The repeating theme is training and testing on human text from before the LLM chatbot era, set against “lab-grown AI text” created with structured methods that may not reflect the real-world creation process.

Substantial drops in performance are observed in areas of text less familiar to the model:

Genres/domains less familiar to the model (research papers, creative writing, etc. are examples of domains, not necessarily the specific cases affected).

Different LLMs not seen in the training set.

Deviations from the distributions of text seen in training (human text from other years, “in-the-wild” AI text).

These scenarios are reported at a fixed FPR, so we can’t see failure rates on human text specifically, but accuracy aggregated across human and AI text shows a failure rate around 1%.

False positive rates vary by genre or type of text

Poetry notably is around 1-in-200

Different domains can be more or less structured or have more variance in how they show up, inducing a higher error rate.

Failure rates/FPR assume population uniformity - very common misconception

1-in-10,000 error rate is what comes out of computing over the pre-LLM human text in aggregate.

Not all texts are equally challenging, and certain writers may naturally sit closer to LLM-speak.

Some writers might have a misclassification rate closer to 1 in 100, some maybe around 1 in 1 million, just by nature of their writing style.

Some writers sit in the “worst case scenario” and pay the cost more than others, possibly by orders of magnitude.

Pangram did establish reduced bias against English-as-a-second-language (ESL) writers on text from before the LLM era, but there are no benchmarks at present covering current-day ESL writers, who may be learning their target language from LLMs.
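To make the population-versus-individual point concrete, here is a minimal sketch of how an aggregate rate near 1-in-10,000 can coexist with very different per-writer rates. The group shares and the middle group's rate are hypothetical numbers I picked so the average lands near the advertised figure; only the 1-in-100 and 1-in-1-million endpoints echo the ranges mentioned above.

```python
# HYPOTHETICAL subpopulations: (share of all writers, false positive rate for that group).
groups = {
    "style close to LLM-speak": (0.005, 1 / 100),
    "somewhat similar style": (0.095, 1 / 2_000),
    "clearly distinct style": (0.900, 1 / 1_000_000),
}

def aggregate_fpr(groups: dict) -> float:
    """Population-level FPR: the share-weighted average of the group FPRs."""
    return sum(share * fpr for share, fpr in groups.values())

def burden_share(groups: dict) -> dict:
    """Fraction of all false positives that falls on each group."""
    total = aggregate_fpr(groups)
    return {name: share * fpr / total for name, (share, fpr) in groups.items()}

overall = aggregate_fpr(groups)
print(f"aggregate FPR: about 1-in-{round(1 / overall):,}")
for name, fraction in burden_share(groups).items():
    print(f"{name}: {fraction:.1%} of all false positives")
```

In this made-up population, the half-percent of writers nearest to LLM style absorb roughly half of all false accusations, even though the headline average still reads as roughly 1-in-10,000.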

Research shows humans today are taking on AI language patterns, which risks pushing their writing toward a higher chance of misclassification

We naturally pick up and adapt our language based on what we’re exposed to.

This equally happens when reading AI-generated text all over the internet or talking with chatbots.

This suggests two things:

In aggregate, the 1-in-10,000 failure rate might not reflect real-world conditions.

It varies per genre, per person, per scenario of how that person writes text. We cannot naively apply the 1-in-10,000 population average to individuals when some cases are far more likely to trigger AI detection.

In other words, we’re working off a number that signals the most ideal, optimistic case, when it’s hard to look at a text and know the expected false positive rate.

We can recognize poetry and calibrate our FPR to expect a 1-in-200 error rate.

We cannot, a priori, look at a text, know that it’s a blend of human writing and light AI adjustments, and conclude that we should now expect an error rate closer to the 1-in-10 to 1-in-50 range.

When you feed a text into Pangram, nothing signals, “ah, this is a hybrid text, note that the failure rate is actually X for this scenario.”

The realistic treatment is to assume the best real-world case is somewhat worse than 1-in-10,000 and the worst case is far worse than that, and to use both when considering how to interpret a result. Easier said than done.
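A quick way to feel the gap between these rates is to ask how likely at least one false flag becomes when a batch of texts is scanned. The sketch below assumes each text is flagged independently at the stated rate, and the batch size of 30 is a hypothetical choice (think one classroom of essays), not a figure from any report.

```python
def p_at_least_one_false_flag(fpr: float, n_texts: int) -> float:
    """Probability that at least one of n independent texts is wrongly flagged."""
    return 1.0 - (1.0 - fpr) ** n_texts

# Rates quoted in the discussion above; which one applies to a given text is unknown in practice.
scenarios = {
    "aggregate claim": 1 / 10_000,
    "poetry": 1 / 200,
    "hybrid text, optimistic": 1 / 50,
    "hybrid text, pessimistic": 1 / 10,
}

N_TEXTS = 30  # hypothetical batch size

for label, fpr in scenarios.items():
    p = p_at_least_one_false_flag(fpr, N_TEXTS)
    print(f"{label:<26} P(at least one false flag in {N_TEXTS}) = {p:.1%}")
```

At the aggregate rate a false flag in such a batch is a rare event, while at the hybrid-text rates it ranges from close to a coin flip to nearly certain. The independence assumption is a simplification, since texts from the same writer or genre would tend to fail together.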

On what happens after the classification result returns

Falsifiability and verification of results

Many real-world scenarios don’t give us access to investigate further. Although Pangram states that their result should not be the final arbiter, it is treated as one a lot in practice, and the hedging done here is minimal.

We end up having to choose between the author’s word (and they don’t really have a good way to defend themselves) and the detector’s verdict, and we can’t get anything more out of the detector.

In a way, this creates its own breed of so-called “LLM psychosis”: treating a fallible metric as truth or reality.

On Company Positioning

Pangram’s social accounts and employees have on numerous occasions scanned articles in the wild and publicly called them out, which sets a bad example for how we should interpret results and conduct discourse around this matter. And arguably, the headlines this makes have been free promotion/marketing.

The company is incentivized to report best-case-scenario reliability rates as opposed to realistic rates. People want a clear truth and answer, and there are numerous reasons why we can’t get that. As with just about any product, stating its best aspects and doing a little sugarcoating is natural practice, but it understates the risk.

Table of Contents

Section 1 - Deep-Learning Classification: Basics, Intuitions, and what they actually tell us.

Section 2 - Pangram Product/UI, Model Cards, Usage + Cleaning Text

Section 3 - Evidence Tab, Primer on Population vs Individual

Section 4 - Pangram Claims, Technical Report, External Audit - Binary Cases

Section 5 - Pangram Mixed Human+AI Text and Attempts to Avoid Detection

Section 6 - The Ceiling of Classification Performance | Fundamental Statistical Limits

Section 7 - Difficulties and Eases Specific to AI-Text Detection

Section 8 - Alignment and Incentives: When being an arbiter of truth becomes a business | Discourse episodes, AI-use philosophy, Documented misclassifications

Section 9 - NeurIPS PPT 2026 Drama

Section 10 - Proposed Advancements: Refining predictions with respect to subpopulations and supplemental signals

Overview of Sections

Section 1 - Deep-Learning Classification: Basics, Intuitions, and what they actually tell us.

Covers ML classification basics, but also some bits recommended even for avid practitioners

P-value framing: how do we gather evidence towards an established truth?

Prosecutor’s fallacy and base-rate problems
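The prosecutor’s fallacy here is reading the false positive rate, P(flagged | human), as if it were P(human | flagged). A minimal Bayes’ rule sketch shows how far apart the two can be. The 10% base rate of AI text and the 99% detection rate are hypothetical values chosen for illustration, and the function name is my own.

```python
def p_ai_given_flag(base_rate_ai: float, detection_rate: float, fpr: float) -> float:
    """Bayes' rule: probability a flagged text is really AI-written."""
    flagged_ai = base_rate_ai * detection_rate
    flagged_human = (1.0 - base_rate_ai) * fpr
    return flagged_ai / (flagged_ai + flagged_human)

BASE_RATE_AI = 0.10    # HYPOTHETICAL: 1 in 10 scanned texts is AI-written
DETECTION_RATE = 0.99  # HYPOTHETICAL: share of AI texts the detector catches

for label, fpr in [("1-in-10,000", 1 / 10_000), ("1-in-200", 1 / 200), ("1-in-10", 1 / 10)]:
    posterior = p_ai_given_flag(BASE_RATE_AI, DETECTION_RATE, fpr)
    print(f"FPR {label:<12} P(AI | flagged) = {posterior:.1%}, "
          f"P(human | flagged) = {1 - posterior:.1%}")
```

With these inputs, a flag at the advertised rate is almost always right, but at a 1-in-10 false positive rate close to half of the flagged texts would be human-written. The answer also moves with the base rate, which is rarely known for the population actually being scanned.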

Section 2 - Pangram Product/UI, Model Cards, Usage + Cleaning Text

Product Surface

Model cards, subversions, benchmarks, and reports

The prediction of a model on a given piece of text may change as the detector is updated. Whether you run the detector on one day or another could be the difference between an accusation being made or not.

UI around cleaning text

Section 3 - Evidence Tab, Primer on Population vs Individual

Evidence tab singles out isolated, innocuous cases

It creates confirmation bias rather than causal explanations of why the detector made a decision

The rate mismatches between human and AI use of certain phrases (and of other features like bullet points) misleadingly use global/aggregate statistics rather than genre-level or individual statistics.

Bullet points might be 9x more common in LLM writing than in human writing in aggregate, but that does not necessarily hold when comparing LLM-written research papers against human-written research papers.
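The aggregate-versus-genre gap can be reproduced with a toy corpus. Every count below is invented for illustration and is not drawn from Pangram’s data; the numbers are chosen so that the aggregate ratio comes out at 9x while the research-paper ratio is 1x, simply because the two corpora contain different mixes of genres.

```python
# HYPOTHETICAL counts: (documents using bullet points, total documents).
counts = {
    "human": {"research": (30, 50), "casual": (19, 950)},
    "llm": {"research": (420, 700), "casual": (21, 300)},
}

def genre_rate(source: str, genre: str) -> float:
    """Share of documents with bullet points within one source and genre."""
    hits, total = counts[source][genre]
    return hits / total

def aggregate_rate(source: str) -> float:
    """Share of documents with bullet points across all genres of one source."""
    hits = sum(h for h, _ in counts[source].values())
    total = sum(t for _, t in counts[source].values())
    return hits / total

aggregate_ratio = aggregate_rate("llm") / aggregate_rate("human")
research_ratio = genre_rate("llm", "research") / genre_rate("human", "research")
casual_ratio = genre_rate("llm", "casual") / genre_rate("human", "casual")

print(f"aggregate ratio: {aggregate_ratio:.1f}x")
print(f"research papers: {research_ratio:.1f}x")
print(f"casual writing:  {casual_ratio:.1f}x")
```

In this toy setup, telling the author of a research paper that bullet points are "9x more common in AI text" would be quoting a number that says nothing about their genre, where the rates are identical.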

Section 4 - Pangram Claims, Technical Report, External Audit - Binary Cases

Technical report

Datasets are pre-2022 human-text corpora and “lab-grown AI text” created by their mirror prompting technique.

Evaluations are performed on a held-out validation set.
