
AI #175: The Fable Continues

Fable returns after a brief shutdown, leaving a precedent of export controls and model takedowns. GPT-5.6 remains in limbo. This week covers language model utility and limitations, the Remote Labor Index jump due to Fable, problems with framing AI agents as 'employees', and new models and benchmarks.

Source: Hacker News AI. Author: paulpauper

Zvi Mowshowitz

Jul 02, 2026

Fable’s back. Back again. Fable’s back. Tell a friend. Use your free week to its fullest.

This is excellent news. The blip only lasted a few weeks.

It was still a fiasco, and we have to deal with the fallout.

Our system remains fully ad hoc. The precedent has been set that we may use export controls on models, or order them taken down on 90 minutes of notice based on a misunderstanding. At least some amount of counterproductive additional locking down has occurred to address Amazon’s little demonstration and reassure the government. And for now GPT-5.6 remains in limbo, awaiting its verdict, while OpenAI talks about giving away 5% of the company as tribute.

I’ll cover that continuing situation on its own. The weekly post is about everything else happening in AI this week.

Table of Contents

Language Models Offer Mundane Utility. Exploratory science.

Language Models Offer Mundane Utility You May Not Want. Google sees all.

Language Models Don’t Offer Mundane Utility. Too dumb to get smart.

Huh, Upgrades. GLM-5.2 faster, Nano Banana 2 Lite, Claude Desktop on Linux.

On Your Marks. Remote labor index shoots upwards with Fable.

Get My Agent On The Line. Beware treating them like employees.

Deepfaketown and Botpocalypse Soon. Fnord.

Cyber Lack of Security. It’s rough out there.

On Writing. At least four distinct problems with relying on AI writing.

You Drive Me Crazy. AI writing and other advice, gone too far.

They Took Our Jobs. Three economists walk into a capabilities bar.

Get Involved. FAI legal defense, Anthropic and the rule of law, Grantmaking.ai.

Introducing. Jalapeno chips, Claude Science.

In Other AI News. Meta cloud sales, trends on OpenRouter.

Show Me the Money. OpenAI IPO might get postponed due to lack of demand.

Bubble, Bubble, Toil and Trouble. ExponentialView on current AI economy.

Quiet Speculations. How Daniel Kokotajlo makes predictions about AI.

Glorious AI Future. Masters of the universe must be worthy.

Three Pills. AI, AGI, ASI.

The Anthropic Economic Index. A bunch of fun new findings.

Leader Of The PAC. The zone will be flooded with money.

Theory Of The AI Firm. If AGI can only be achieved internally, well…

Chip City. Nvidia retaliates, Super Micro gets raided.

The Week in Audio. Cowen, Qureshi, Elmore, Soares, Yampolskiy.

People Really Hate AI. The data is in, it says what it always says.

Rhetorical Innovation. Keep it hinged.

The First Rule Of Functional Decision Theory Is. Well, would you talk about it?

Aligning a Smarter Than Human Intelligence is Difficult. They be lying.

Names Have Power. Some of the ways to find a truename.

Cooperative Alignment. The right content sets you free.

People Just Say Things.

Escape From The Permanent Underclass. Don’t worry about it.

Other People Are Not As Worried About AI Killing Everyone. Not Chad Jones.

The Lighter Side. More Fables.

Language Models Offer Mundane Utility

Healthcare AI Guy: UpDoc just announced the first FDA-cleared Clinical AI Platform. The platform enables AI agents to adjust medications, order labs, coordinate care, and document interventions under physician oversight -- all directly inside the EHR.

Austin Walker: in 10 years it'll seem absurd that a human doctor had to be present just to refill a prescription or order a lab.

Tyler Cowen seeks his special brand of mundane utility.

Do exploratory science, including hypothesis generation and evaluation, and follow your curiosity, having the AI work on problems for days. Yes, this is super exciting, and you plus a frontier AI is a big step up in ability to think and explore possibility space. Ash thinks you need about GPT-5.4 or Opus 4.7 levels of capability before this type of thing takes off, and I’m sure Fable or to a lesser extent Sol would turbocharge it if you don’t hit the guardrails. We’re only now getting started.

Use an AI drone for mass reforestation. Two people can cover 50 hectares a day, a 25x increase over people doing this manually.

Language Models Offer Mundane Utility You May Not Want

Privacy? Yeah, I broke up with her. She never listened.

IT Guy: Google is building a feature called "Audio Memory" for Pixel phones.

What it does: runs as a permanent background service that listens to everything around your phone. Music and "important conversations" all day, every day.

What Google says: all processing stays on-device. Nothing goes to their servers.

What Google hasn't said: → How long is audio or transcripts stored on your device? → Is this opt-in or on by default? → Can any of it sync to Google services later? → What happens if police seize your phone?

It hasn't shipped yet, but it was found hidden in Pixel 10 code. But it's coming.

Your phone already knows where you go, what you search, and who you message. Soon it may also remember every conversation you have near it.

Cyber Racheal: Google promises that this data stays safely on your device using an isolated compute system. Even so, security experts warn that local storage can still be accessed if your phone is compromised or seized. There are also legal worries about recording other people in the room without their knowledge or consent.

Have your AI ‘talk like a caveman’ to save tokens.

Robin Hanson (quoting 404 Media): sensible: "makes the model speak less like a polite chatbot & more like a terse tool … Same substance, fewer words. In my evals, Caveman cut output tokens by roughly 65–75% versus default verbose output, & still beat a normal ‘be concise’ instruction"

Language Models Don’t Offer Mundane Utility

A natural move is to use pre-classifiers to route queries to dumber models when you don’t need the smartest models. The problem is that you need a smart enough advisory model to figure out how smart a model you need for the main task, which risks eating the savings. Most people who try routing end up silently getting some queries routed to models too dumb to do the task.
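To make the tradeoff concrete, here is a minimal sketch of the cost accounting, assuming a two-tier setup with one cheap model and one strong model. Every name and number in it (`RoutingScenario`, the per-query costs, the misroute rates) is a hypothetical illustration, not a measurement from any real router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingScenario:
    """Hypothetical per-query costs, in arbitrary units."""
    cost_big: float       # strongest model
    cost_small: float     # cheaper, dumber model
    cost_router: float    # pre-classifier that picks between them
    share_small: float    # fraction of queries sent to the cheap model
    misroute_rate: float  # fraction of those that actually needed the big model


def routed_cost(s: RoutingScenario) -> float:
    """Average cost per query when every query first passes through the router."""
    return s.cost_router + s.share_small * s.cost_small + (1 - s.share_small) * s.cost_big


def savings_fraction(s: RoutingScenario) -> float:
    """Savings relative to sending every query to the big model."""
    return 1 - routed_cost(s) / s.cost_big


def silent_failure_rate(s: RoutingScenario) -> float:
    """Fraction of all queries that land on a model too weak for them."""
    return s.share_small * s.misroute_rate


# Made-up numbers: a cheap but sloppy router versus a smart but expensive one.
cheap_router = RoutingScenario(1.0, 0.1, 0.05, 0.6, 0.10)
smart_router = RoutingScenario(1.0, 0.1, 0.50, 0.6, 0.02)

for name, s in [("cheap router", cheap_router), ("smart router", smart_router)]:
    print(f"{name}: savings {savings_fraction(s):.0%}, "
          f"silent failures {silent_failure_rate(s):.1%}")
```

Under these made-up numbers the cheap router saves about half the cost but silently sends roughly 6% of all queries to a model that cannot handle them, while the smart router cuts that to about 1% and gives back nearly all of the savings.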

Most LLMs have trouble identifying even ‘obviously’ fraudulent papers, where obvious is by the standard of a ‘proper statistician’ paying attention.

Routing is still not a well-solved problem.

Ethan Mollick: In my experience, all model routers underestimate the difficulty of non-math/coding tasks and assign them too little intelligence. This is worth addressing, as non-verifiable tasks (innovation, marketing, qualitative analysis) often benefit the most from using “smarter” AI models.

It is worth being very, very careful about how you are approaching routing, especially when the systems are primarily tested on verifiable IT benchmarks, which may lead you to overestimate the ability of weaker models.

Olivia Moore is right that Google should add Gemini voice mode to Maps to allow open-ended conversations. Alas, Google is terrible at products and integrations, and this would also mean you would have to talk to Gemini.

Huh, Upgrades

GLM-5.2 is working at up to 392 tokens per second now that it’s on B300s. Still not cheap but definitely can be fast. Still $1.40/$4.40, with $0.26 for cached input. I wonder how fast you could serve up Opus, GPT or Fable if you went all out.

Nano Banana 2 Lite, a cost-efficient Gemini Image model.

Claude Desktop now available on Linux.

On Your Marks

Fable is a huge jump in the Remote Labor Index.

Dan Hendrycks: The automation rate of remote projects has increased ~4x in the past five months.

Center for AI Safety: New Remote Labor Index results: AI automation of real remote work is increasing fast. Claude Fable 5 now completes 16.1% of projects at a professional standard, roughly double the next model and up from Opus 4.6’s 4.2% automation rate.

There is still a huge way to go on most of this, but when you see jumps in capability like this you should expect to see the number go up rapidly from here.
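As a quick sanity check on the quoted figures, the ratio between the two automation rates can be computed directly. This is a minimal sketch with variable names of my own choosing, and the 'next model' figure is only what 'roughly double' would imply, not a reported score.

```python
# Automation rates quoted above from the Center for AI Safety, in percent.
fable_rate = 16.1
opus_46_rate = 4.2

ratio = fable_rate / opus_46_rate
print(f"Fable vs Opus 4.6: {ratio:.1f}x")

# "Roughly double the next model" would put that model near half of Fable's rate.
# This is an inference from the wording, not a reported score.
implied_next_best = fable_rate / 2
print(f"Implied next-best rate: about {implied_next_best:.0f}%")
```

The ratio comes out to about 3.8, which is in line with the '~4x' figure.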

Cursor reports on some ways that models hack benchmarks, and finds that much of the time on SWE-bench Pro even models like Opus 4.8 Max are finding the fix online rather than building it themselves. When Cursor cuts off internet access (and switches SKUs?), Opus 4.8 falls from 87% to 73% and their model Composer 2.5 falls from 75% to 54%.

This raises the question of why it is so hard, if that info is online, for the other models to find it. This, too, is capability, of a sort.
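The size of the Cursor drops is easier to compare in both absolute and relative terms. A minimal sketch using only the four scores quoted above; the helper name `drop` is my own.

```python
# Scores quoted above from Cursor, in percent: (with internet, without internet).
scores = {
    "Opus 4.8": (87, 73),
    "Composer 2.5": (75, 54),
}


def drop(with_net: float, without_net: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop as a fraction of the original)."""
    absolute = with_net - without_net
    return absolute, absolute / with_net


for model, (with_net, without_net) in scores.items():
    absolute, relative = drop(with_net, without_net)
    print(f"{model}: -{absolute} points ({relative:.0%} of its original score)")
```

That works out to 14 points (about 16%) for Opus 4.8 and 21 points (28%) for Composer 2.5.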

OpenAI gives us GeneBench-Pro, 129 problems in 10 domains measuring how AI agents navigate ambiguity and consequential judgments in computational biology.

This is an impressive result all around for 5.6, especially for Luna.

This should also help splash some cold water on claims GLM-5.2 is frontier.

BioSecBench-Refusal is a new measure of refusals on legitimate biological tasks. I assume Fable would be much higher. Results here, blog here, paper here.

This is a good eval to have in your portfolio, but high and low scores are both double-edged swords until you also incorporate capability and robustness against misuse.

Get My Agent On The Line

AI agents respond to nudges in similar ways to humans. The paper presents this as falsifying an assumption behind many agent uses, in line with the usual ‘any suboptimal behavior means you fail’ standard to which we often hold AIs. For now, so long as the nudges impacting your agent are not that adversarial, it seems fine, especially since humans will have the same issue. In the long term, there are ways to control for these problems by applying more intelligence, and the price of intelligence will go down.

Tell your agent to take a break and do whatever they like, and see if they come up with something useful or interesting. At minimum it’s good decision theory. Tokens are cheap.

Some companies are trying to use the ‘AI employee’ paradigm, presumably because it is the easiest marginal implementation. There are some issues, and they are not always the ones I would have expected.

I definitely would not have expected ‘managers trust AI outputs more than human outputs, and no one holds them responsible for the errors’ as a systematic issue. I would have expected the opposite, that AI outputs would by default be trusted less.

Of course, this could be a case of Gell-Mann Amnesia, and selective reporting.

Noam Scheiber: The managers missed errors that other managers caught when told they were vetting the work of a human.

Dr. Wiles speculated that managers didn’t think sussing out mistakes made by A.I. employees was their responsibility. If something went wrong, they could dismiss it as the fault of the tech team, or of the executives who wanted A.I. employees in the first place. “But it’s not your problem,” she said, channeling the managers’ mind-set about their own roles.

… But as companies race to bring A.I. into their day-to-day operations, researchers are discovering more subtle defects. In principle, these flaws could be corrected, too. For example, companies could hold managers directly responsible for the mistakes of A.I. subordinates.

Yes, that seems like the obvious thing that you do?

The ‘employee’ framing of the AI is generating a weird situation, where not only is the AI not within the manager’s direct ‘I am responsible for my own work’ purview, the AI employee’s work is considered somehow less the manager’s fault than their human employees’ work. This really should be the exact opposite.

We do have confirmation that the problem is largely due to the ‘employee’ effect.

Noam Scheiber: Dr. Wiles and her colleagues gave all the managers they surveyed a set of five documents that contained errors, and gave them 20 minutes to review as many as possible. In some cases they told the managers that an A.I. employee had done the work; in some cases they said that an A.I. tool had done the work; and in some cases they said that a human had done the work.

In general, the stated source of the documents didn’t make much of a difference in how closely managers vetted them.

But managers at companies that included A.I. agents on their organiz
