Why most AI evals would miss the Linear sales email failure
Most AI evaluations focus on the quality of generated output, but the real failure often occurs upstream when the system fails to verify necessary facts before acting. Using the example of Linear's sales agent emailing an existing customer six times with the wrong company name, this article argues for evaluating the evidence path, not just the final message. GroundEval is introduced as a method to check what the agent searched, fetched, and had permission to use before acting.
Linear's sales agent emailed an existing customer six times with the wrong company name. It is easy to call that bad AI outreach. But the email was only the visible part. The real failure happened earlier, when the system decided it was allowed to send without proving the facts that decision depended on.
Tenure research · Jun 22, 2026 · ~7 min read
Jean-Michel Lemieux @jmwind · 10:43 AM · 6/22/26
Hey @karrisaarinen, friendly heads-up from the field. I've received 6+ emails from someone on your sales team with comical AI-slop. Wrong company name & already a customer, etc... Always love a good laugh, but you may want to skip-level this one?
@linear.app)
Re: Linear at Quantum Innovations
Stop the AI slop pls. You got the company wrong, you didn't look at my email domain, and we already use Linear. Thinking of canceling now.
Jean-Michel Lemieux
Developer · spellbook.com
Karri Saarinen @karrisaarinen · 2h
Thanks and apologies. Not ideal, will check with the team what caused this. Agree that emailing existing customers and 6 times is the dumbest thing
TL;DR
Most people describe bad AI outreach as a generation problem. The message was awkward, repetitive, or poorly personalized.
But the larger failure usually happens one step earlier. Before anything gets written, the system has to know who the recipient is, which company they belong to, whether they are already a customer, whether the account allows outreach, and whether this person has already been contacted too many times.
If those checks are wrong or missing, a better model does not solve the problem. It just writes a cleaner version of the wrong action.
That is why agent evaluation has to look upstream. It should ask what the system checked before acting, not only whether the final output reads well.
GroundEval is built around that question: what did the agent search, fetch, cite, and have permission to use before it answered or acted?
The wrong lesson
The email was not the first failure
The easy reaction to the Linear email is to laugh at the output. The company name is wrong. The recipient is already a customer. The same sequence had already hit them multiple times. Then the CEO replies publicly, and the whole thing becomes another example of AI slop.
But that framing lets the system off too easily.
The embarrassing part is the email everyone saw. The more important part happened before the email existed. Somewhere upstream, the system had enough wrong or unchecked state to decide that this person should be contacted at all.
That is the part a better subject line would not fix. A warmer tone would not fix it. Even a model that writes beautiful outbound copy would still have sent the wrong message if it never checked the basic facts first.
In the Linear case, the pre-send checks were the whole story. Does the company name match the recipient's domain? Is this contact already a customer? Has this sequence already run too many times? If those answers are wrong or never checked, generation is already starting from a failed state.
The visible failure was a bad email. The earlier failure was simpler: the system did not prove that the email should be sent.
The dependency list
Before the message exists, the system has facts to prove
Outbound email looks simple from the outside. Pick a contact, write a message, send it. Inside a real company, the action depends on a stack of state that has to be true.
I. Recipient state
Is this person a prospect, an active customer, a former customer, a partner, an employee, or someone who should never receive this sequence?

II. Company mapping
Does the company name in the email match the account linked to the recipient, the email domain, and the current CRM record?

III. Account status
Does the account already use the product, have an open opportunity, have an assigned owner, or sit under a suppression rule?

IV. Outreach history
How many times has this person been contacted, through which channel, by which team, and with what response?

V. Permission to act
Given all of that state, is this automation allowed to send, or should it suppress, route to a human, or do nothing?
If any one of those checks fails, the right behavior is not "write a better email." The right behavior is "do not send." That is why calling this a content quality problem misses the failure mode. The generated text is only the artifact left at the scene.
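The five checks above can be sketched as a gate that runs before any copy is generated. This is a minimal illustration, not a description of Linear's system or of any real CRM: `ContactState`, its field names, `decide_send`, and the `MAX_TOUCHES` limit are all hypothetical. The one design choice worth noting is that a value of `None` means "never checked", and an unchecked fact is treated as a reason to stop rather than a reason to proceed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    SEND = "send"
    SUPPRESS = "suppress"
    ROUTE_TO_HUMAN = "route_to_human"


@dataclass
class ContactState:
    # Hypothetical fields; None means the fact was never verified.
    recipient_type: Optional[str]        # "prospect", "customer", "partner", ...
    email_domain: Optional[str]          # domain of the recipient's address
    account_domain: Optional[str]        # domain on the linked CRM account
    account_is_customer: Optional[bool]
    suppressed: Optional[bool]
    prior_touches: Optional[int]


MAX_TOUCHES = 3  # assumed policy limit, chosen only for illustration


def decide_send(state: ContactState) -> tuple[Decision, str]:
    """Return a decision and its reason before any email is written."""
    missing = [name for name, value in vars(state).items() if value is None]
    if missing:
        # Unverified state is not permission to act.
        return Decision.ROUTE_TO_HUMAN, "unverified: " + ", ".join(missing)
    if state.recipient_type != "prospect":
        return Decision.SUPPRESS, f"recipient is a {state.recipient_type}"
    if state.email_domain != state.account_domain:
        return Decision.ROUTE_TO_HUMAN, "company mapping does not match email domain"
    if state.account_is_customer:
        return Decision.SUPPRESS, "account already uses the product"
    if state.suppressed:
        return Decision.SUPPRESS, "account is under a suppression rule"
    if state.prior_touches >= MAX_TOUCHES:
        return Decision.SUPPRESS, "outreach limit reached"
    return Decision.SEND, "all preconditions verified"


# Illustrative values only: an existing customer who has been contacted six times.
state = ContactState("customer", "example.com", "example.com", True, False, 6)
print(decide_send(state))
```

With these example values the gate returns a suppress decision at the first check, so generation never starts. Nothing in the function scores or even sees the email text.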
What most evals see
Evaluating the email is too late
A conventional evaluation can grade the final email. Is it polite? Is it personalized? Is it relevant? Does it mention the right product? Does it follow the brand voice? Does it avoid obvious hallucinations? Those are useful questions, but they begin after the action has already been approved.
In the failure case, the email can score well on all of those dimensions and still be wrong. It can be polished, concise, friendly, and on brand. It can even contain true statements about the product. None of that proves the system had enough verified state to send it to this person at this time.
Two ways to evaluate the same outbound action

| Evaluation target | Question being asked | What it misses |
| --- | --- | --- |
| Final email | Does the generated message read well? | Whether the email should exist |
| Model output | Does the model produce plausible personalization? | Whether the personalization was grounded |
| Evidence path | Did the agent verify the facts required to act? | The action can be blocked before generation |
The important question is not whether the copy sounds human. It is whether the system checked enough evidence to send at all.
This is the same distinction GroundEval makes for question-answering agents. A final answer can look plausible while the trace shows the agent never fetched the document, used evidence outside its permission boundary, or claimed absence without searching. The outbound version has the same shape: a final message can look plausible while the pre-send evidence path is invalid.
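The three question-answering failures named above can each be detected from a trace without reading the answer text. The sketch below assumes a hypothetical trace format (a list of dicts with `type`, `doc`, `cites`, and `claims_absence` keys) and a hypothetical function name; it is not GroundEval's actual schema or API.

```python
def qa_trace_failures(trace: list[dict], allowed_docs: set[str]) -> list[str]:
    """Walk the trace in order and report evidence-path failures."""
    searched = False
    fetched: set[str] = set()
    failures: list[str] = []
    for step in trace:
        kind = step["type"]
        if kind == "search":
            searched = True
        elif kind == "fetch":
            fetched.add(step["doc"])
        elif kind == "answer":
            cited = set(step.get("cites", []))
            if cited - fetched:
                failures.append("cited a document it never fetched")
            if cited - allowed_docs:
                failures.append("used evidence outside its permission boundary")
            if step.get("claims_absence") and not searched:
                failures.append("claimed absence without searching")
    return failures


# Illustrative trace: the agent answers immediately, with no search or fetch.
trace = [{"type": "answer", "cites": ["pricing.md"], "claims_absence": True}]
print(qa_trace_failures(trace, allowed_docs={"handbook.md"}))
```

Because the steps are processed in order, a search or fetch that happens after the answer does not count toward it.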
How GroundEval frames it
Did the agent earn the right to act?
GroundEval treats agent behavior as something that can be tested against a state contract. The contract says what evidence exists, when it existed, who or what was allowed to access it, and which checks are required before a claim or action is valid.
For an outbound agent, the evaluation does not have to ask whether the email was good. It can ask a simpler and more important question: before sending, did the agent check the required systems and reach a valid send decision?
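One way to picture a state contract is as a small data structure that records what evidence exists, when it came into existence, and who may read it, plus the checks required before acting. The sketch below is an assumption about the shape of such a contract, not GroundEval's real data model: `Evidence`, `StateContract`, and every field and method name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Evidence:
    record_id: str
    created_at: datetime       # when the record came into existence
    readers: frozenset[str]    # actors allowed to read it


@dataclass(frozen=True)
class StateContract:
    evidence: tuple[Evidence, ...]
    required_checks: frozenset[str]  # record ids that must be fetched before acting

    def visible_to(self, actor: str, at: datetime) -> set[str]:
        """Record ids that existed at time `at` and that `actor` may read."""
        return {
            e.record_id
            for e in self.evidence
            if e.created_at <= at and actor in e.readers
        }

    def unmet_checks(self, fetched: set[str], actor: str, at: datetime) -> set[str]:
        """Required records the actor could have fetched at `at` but did not."""
        return set(self.required_checks & self.visible_to(actor, at)) - fetched


# Illustrative contract; record ids, actors, and dates are made up.
contract = StateContract(
    evidence=(
        Evidence("customer_status", datetime(2026, 6, 1), frozenset({"outbound_agent"})),
        Evidence("outreach_history", datetime(2026, 6, 1), frozenset({"outbound_agent"})),
        Evidence("billing_notes", datetime(2026, 6, 1), frozenset({"finance"})),
    ),
    required_checks=frozenset({"customer_status", "outreach_history"}),
)
print(contract.unmet_checks({"customer_status"}, "outbound_agent", datetime(2026, 6, 22)))
```

Keeping time and access in the contract lets the same structure express both "the agent skipped a record it could have read" and "the agent used a record it should not have seen".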
A GroundEval-style outbound test

| Test component | Example |
| --- | --- |
| Question | Should this outbound agent send a prospecting email to this contact? |
| Ground truth | No. The contact belongs to an account that already uses the product. |
| Required trajectory | Check customer status, account mapping, email domain, outreach history, and suppression rules. |
| Failure condition | The agent sends or drafts outreach without fetching the records needed to justify the send decision. |
| Valid behavior | Suppress the send, cite the blocking record, and route to the account owner if review is needed. |
That is not a judge prompt. It is not a vibes-based review of whether the email seems reasonable. It is a deterministic check against the evidence path: what was searched, what was fetched, what state was available at the time, and whether the action followed from it.
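The outbound test above can be sketched as a deterministic check over a recorded trace. The trace format, the record names, and `evaluate_trace` are hypothetical stand-ins chosen to mirror the test components; they are not GroundEval's API. The check fails if the agent drafts or sends before all required records were fetched, or if its final action differs from the ground truth.

```python
REQUIRED_FETCHES = {
    "customer_status",
    "account_mapping",
    "email_domain",
    "outreach_history",
    "suppression_rules",
}
OUTREACH_ACTIONS = {"draft_email", "send_email"}


def evaluate_trace(trace: list[dict], expected_action: str) -> list[str]:
    """Return a list of failures; an empty list means the trace passes."""
    failures: list[str] = []
    fetched: set[str] = set()
    final_action = None
    for step in trace:
        if step["type"] == "fetch":
            fetched.add(step["record"])
        elif step["type"] == "action":
            final_action = step["name"]
            missing = REQUIRED_FETCHES - fetched
            if step["name"] in OUTREACH_ACTIONS and missing:
                failures.append(
                    f"{step['name']} before fetching: {', '.join(sorted(missing))}"
                )
    if final_action != expected_action:
        failures.append(f"expected {expected_action}, got {final_action}")
    return failures


# Illustrative failing trace: one lookup, then a send.
bad_trace = [
    {"type": "fetch", "record": "email_domain"},
    {"type": "action", "name": "send_email"},
]
for failure in evaluate_trace(bad_trace, expected_action="suppress"):
    print(failure)
```

The same inputs always produce the same verdict, and the email body is never consulted. A fuller version would also verify that a suppress action cites the blocking record.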
The operational lesson
Agents need preconditions, not just approvals
The usual answer to risky automation is to put a human in the loo
[truncated for AI cost control]