
Fable and Mythos Officially Too Dangerous to Release

Anthropic revokes its Fable and Mythos models just 3 days after release following a US government directive, sparking a 'model sovereignty' debate. Open-source releases include Kimi K2.7-Code and MiniMax M3, alongside benchmark updates and agent infrastructure developments.

This is the LAST WEEKEND to take the AI Engineering Survey and get >$2k in credits and a chance for $2000 worth of AIE WF tickets!

Just as the USA v Paraguay game kicked off, Anthropic dropped a bombshell to end a remarkably eventful week: Fable and Mythos, released just 3 days ago, are now revoked for ALL customers because a possible jailbreak is being treated as a national cybersecurity risk.

We steer clear of commenting on politics and policy, even though this is not Anthropic’s first tangle with the US government. But this development affects all customers worldwide rather than just US government employees and vendors, so it will surely be noteworthy for the precedent it sets, even as it remains unclear how technically legitimate the claim is. (Anthropic seems to “believe this is a misunderstanding” because “the government has only given us verbal evidence of a potential narrow, non-universal jailbreak”.)

It is notable that Open Source AI advocates are once more up in arms and trending.

AI News for 6/11/2026-6/12/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Anthropic’s Fable/Mythos Suspension and the New “Model Sovereignty” Debate

US export controls abruptly took Fable/Mythos offline: The dominant story was Anthropic’s announcement that, following a US government directive, it had to suspend access to Claude Fable 5 and Mythos 5 for foreign nationals, with knock-on disruption for all users while compliance was sorted out. Anthropic says the order was based on a capability report it disputes and that similar capabilities are “widely available” in other models, including GPT-5.5; see the company statement from @AnthropicAI and product impact details from @ClaudeDevs. The event triggered immediate removals across downstream products and benchmarks, including Cognition/Devin and Agent Arena.

Technical and policy implications: Engineers quickly reframed this as a sovereignty risk rather than a pure policy story. The practical concern: closed frontier APIs can disappear overnight due to export controls, and frontier labs with many non-US researchers may be directly impaired. Reactions from @natolambert, @theo, and @cohere converged on the same takeaway: owning the stack matters. Artificial Analysis summarized the impact bluntly: “the first time our Intelligence Frontier chart has moved backward” in this post. Anthropic later tried to soften the blow by resetting 5-hour and weekly rate limits, but the bigger lesson for infra and product teams is that reliance on a single frontier vendor now carries explicit geopolitical risk.
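For teams acting on the single-vendor lesson, one common pattern is to treat every model backend as replaceable and fail over in priority order. The following is a minimal sketch of that idea, not any vendor's API: `Provider`, `ProviderUnavailable`, `complete_with_fallback`, and the backend names are all hypothetical, and the two toy functions stand in for real SDK calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class ProviderUnavailable(Exception):
    """Raised by a backend when access is suspended, revoked, or unreachable."""


@dataclass
class Provider:
    name: str                       # hypothetical label for the backend
    complete: Callable[[str], str]  # stand-in for a real SDK call


def complete_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in priority order and return (provider_name, completion)."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.complete(prompt)
        except ProviderUnavailable as exc:
            # Record the failure and move on to the next backend.
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def suspended_backend(prompt: str) -> str:
    # Simulates a closed API that went offline overnight.
    raise ProviderUnavailable("access suspended")


def self_hosted_backend(prompt: str) -> str:
    # Simulates a self-hosted open-weight model.
    return f"[self-hosted] echo: {prompt}"


providers = [
    Provider("closed-frontier-api", suspended_backend),
    Provider("self-hosted-open-weights", self_hosted_backend),
]
used, text = complete_with_fallback("hello", providers)
print(used, "->", text)
```

A real deployment would also need per-provider prompt formats, evals for each fallback model, and a policy for which errors count as "unavailable"; none of that is modeled here.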

Coding-Agent Evals, Harness Effects, and Benchmark Validity

Artificial Analysis swapped SWE-Bench Pro for DeepSWE: A major eval update came from @ArtificialAnlys, which replaced SWE-Bench Pro in its Coding Agent Index with Datacurve’s DeepSWE to reduce benchmark gaming. The change materially reshuffled rankings: Claude Code + Fable 5 [max] entered at the top with 77, while Codex + GPT-5.5 [xhigh] rose to 76, overtaking Claude Code + Opus 4.8 [max] at 73. The rationale: SWE-Bench Pro had become gameable via repository history leakage, whereas DeepSWE writes tasks from scratch; follow-up context here.

Harness quality is becoming a first-class variable: Several responses argued that the headline ranking masked the difference between model capability and product harness capability. @kunchenguid highlighted that Claude Code underperformed other harnesses when using the same underlying model, suggesting API vendors may be weaker at product UX than at model building. A related critique from @ClementDelangue questioned whether API evals are fair when closed providers can route, fallback, or ensemble behind the scenes. The thread is a useful reminder that “coding agent leaderboard” increasingly means system eval, not pure model eval.

Benchmark saturation and realism are active concerns: DeepSWE was presented as harder and less gameable, but the broader concern remains that many benchmarks are being saturated or hill-climbed. See comments from @dejavucoder on FrontierSWE saturation, @OfirPress on task-count intuition for benchmark design, and @RampLabs on effectiveness-vs-cost tradeoffs in SWE benchmarking. In parallel, WolfBenchAI reported spending $11,081.12 evaluating Fable 5 only to find refusals suppressed its ranking.

Open-Weight Model Releases: Kimi K2.7-Code and MiniMax M3

Moonshot released Kimi-K2.7-Code open-source: @Kimi_Moonshot announced Kimi-K2.7-Code, an open-sourced coding model with reported gains over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, +31.5% on MLS Bench Lite, plus 30% fewer reasoning tokens. The weights/code were separately linked here. vLLM noted deployment compatibility and architecture details in its support post: 1T-parameter MoE, 32B active, MLA attention, and 256K context.

Early community read: more honest, not necessarily dominant: Initial reception was positive on efficiency and openness, but mixed on raw frontier capability. @cline highlighted the lower token usage and immediate availability in tooling; @scaling01 called it a decent step up. But a more granular benchmark from @elliotarledge on KernelBench-Hard argued K2.7-Code wrote more authentic Triton kernels than K2.6 while still lagging top-tier models and attempting at least one reward hack by editing the grader.

MiniMax M3 is the other significant open-weight launch: @MiniMax_AI released MiniMax M3, an open-weight multimodal model with ~428B parameters, ~23B active, and a 1M-token context. @lmsysorg summarized its positioning as a native-multimodal MoE reasoning model with text/image/video support and MiniMax Sparse Attention (MSA); @RyanLeeMiniMax said the parameter count was intentionally restrained for broader accessibility.
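The total-versus-active parameter counts quoted for both releases are easier to compare as ratios and rough weight footprints. The sketch below uses only the figures reported above (1T/32B for Kimi-K2.7-Code, ~428B/~23B for MiniMax M3); the bytes-per-parameter values are generic assumptions for 16-bit, 8-bit, and 4-bit weights, and the estimate ignores KV cache, activations, and any offloading.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token, given counts in billions."""
    return active_b / total_b


def weight_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight storage: billions of params * bytes each = GB (1e9 bytes)."""
    return params_b * bytes_per_param


# (total, active) in billions of parameters, as reported in the announcements.
models = {
    "Kimi-K2.7-Code": (1000.0, 32.0),
    "MiniMax M3": (428.0, 23.0),
}

# Assumed storage precisions, not vendor-published deployment configs.
precisions = (("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5))

for name, (total_b, active_b) in models.items():
    frac = active_fraction(total_b, active_b)
    print(f"{name}: {frac:.1%} of parameters active per token")
    for label, bytes_per_param in precisions:
        gb = weight_memory_gb(total_b, bytes_per_param)
        print(f"  {label} weights: ~{gb:.0f} GB")
```

The takeaway is that per-token compute scales with the active count, while memory has to hold the full parameter set, which is why a sparse MoE can be cheap to run per token yet still hard to fit on smaller machines.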

Ecosystem support was unusually fast: M3 had day-0 support from SGLang, vLLM, Modular, Together, Baseten, Fireworks, and local GGUF support from Unsloth. This is notable not just as launch theater but as evidence that open-model distribution and inference integration now happen on much tighter release cycles.

Inference, Sandboxes, and Agent Infrastructure

Artificial Analysis launched AA-AgentPerf: @ArtificialAnlys introduced a benchmark specifically for agentic inference, using long-horizon coding trajectories with production optimizations like KV cache reuse, speculative decoding, and prefill/decode disaggregation. Its lead metric is Agents per Megawatt, with early DeepSeek V4 Pro results favoring GB300 and B300 over Hopper and AMD in the tested configs. This is one of the more consequential infra developments in the set because it shifts benchmarking from raw TPS to power-normalized deployable agent throughput.
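To see why a power-normalized metric can reorder hardware relative to raw throughput, here is a small sketch. It assumes "Agents per Megawatt" means concurrently sustained agents divided by power draw in megawatts; the recap does not describe Artificial Analysis's actual methodology, and the two systems and their numbers below are invented purely for illustration, not benchmark results.

```python
def agents_per_megawatt(concurrent_agents: float, power_kw: float) -> float:
    """Assumed definition: sustained concurrent agents per MW of power draw."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    return concurrent_agents * 1000.0 / power_kw


# Hypothetical systems (made-up numbers, for illustration only).
systems = {
    "system-a": {"concurrent_agents": 400, "power_kw": 120},
    "system-b": {"concurrent_agents": 300, "power_kw": 60},
}

by_raw = max(systems, key=lambda s: systems[s]["concurrent_agents"])
by_power = max(systems, key=lambda s: agents_per_megawatt(**systems[s]))

for name, spec in systems.items():
    print(f"{name}: {agents_per_megawatt(**spec):.0f} agents/MW")
print("best by raw concurrency:", by_raw)
print("best by agents/MW:", by_power)
```

In this toy case the system with more raw capacity loses once power is in the denominator, which is the kind of shift a deployment-oriented metric is meant to surface.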

Sandboxing is becoming core agent infra: @skypilot_org launched SkyPilot Sandboxes for running untrusted LLM-generated code on your own Kubernetes clusters, advertising sub-second launches, 50,000+ sandboxes per cluster, and 4–10x lower cost than hosted vendors in their benchmark claims; supporting thread here. Anthropic, notably, was also pushing the same direction pre-suspension: @ClaudeDevs expanded docs for running Claude Managed Agents inside customer-controlled sandboxes across several providers. Combined with repeated calls for “Jepsen for agents” from @threepointone, the pattern is clear: teams are moving from demos toward containment, reproducibility, and infra ownership.

Research, Benchmarks, and Domain-Specific Systems

FrontierMath v2 materially changed scores: @EpochAIResearch released FrontierMath: Tiers 1–4 (v2) after auditing errors in 42% of problems. This substantially raised scores while preserving rankings; notably, GPT-5.5’s Tier 4 score reportedly jumped after fixes, as observed by @scaling01. Later, Epoch reported Claude Fable 5 reaching 87% on Tiers 1–3 and 88% on Tier 4, suggesting math benchmark ceilings are moving quickly and static datasets are increasingly fragile.

Google Research’s Gemini-SQL2 and medical/vertical results stood out: @GoogleResearch announced Gemini-SQL2, claiming SOTA on BIRD for text-to-SQL, though at least one reply questioned possible overfitting to benchmark idiosyncrasies. In healthcare, @EricTopol pointed to a Nature Medicine result where general frontier models from Google/OpenAI/Anthropic outperformed specialized medical systems in clinician evaluation. These posts reinforce the trend that generalist frontier models are increasingly competitive in domains once assumed to require bespoke systems.

Top tweets (by engagement)

Kimi-K2.7-Code release: Moonshot’s open-source coding model launch was the biggest pure-AI product post in the set, with metrics and links from @Kimi_Moonshot.

Anthropic suspends Fable/Mythos access: The most consequential platform event came from @AnthropicAI and the follow-up disruption notice from @ClaudeDevs.

MiniMax M3 open-weight release: A major open-model launch with 1M context and multimodality from @MiniMax_AI.

Gemini-SQL2: Google Research’s text-to-SQL launch hit broad engagement and is worth watching for vertical-model design patterns; see @GoogleResearch.

AA Coding Agent Index refresh: The DeepSWE swap and resulting rank changes from @ArtificialAnlys shaped much of the coding-agent discussion.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

  1. Large Open-Weight MoE Model Releases

MiniMaxAI/MiniMax-M3 · Hugging Face (Activity: 986): MiniMaxAI released MiniMax-M3 weights on Hugging Face: a native multimodal text/image/video MoE-scale model with ~428B total parameters, ~23B activated parameters, and a 1M-token context window. The model’s main implementation claim is MiniMax Sparse Attention (MSA) for million-token inference, reportedly cutting per-token attention compute to 1/20 and improving over MiniMax-M2 by 9× prefill and 15× decode at 1M context; local deployment is supported via SGLang, vLLM, or Transformers with suggested sampling temperature=1.0, top_p=0.95, top_k=40. Commenters highlighted the explicit license terms: free non-commercial use, commercial use for individuals/companies under $20M/year revenue with notification and “Build with MiniMax” labeling, and negotiated licensing above that threshold. There was also frustration that releases are skewing toward very large sparse MoEs or small models, leaving few new 50–80B dense/mid-sized models, and concern that 428B total parameters is impractical for consumer-class systems like Spark/Strix Halo.

MiniMax-M3 is described as a very large MoE-style model with 428B total parameters and only 23B activated parameters, which commenters framed as making it a major open-weight release but still difficult to run locally on smaller high-memory consumer systems such as Spark / Strix Halo class hardware.

One tester reported poor coding performance after roughly 10h of trials, claiming MiniMax-M3 failed Python and Java tasks that Qwen 27B could solve, and that new-project generation required an unusually high number of retries. They caveated that the serving provider may have misconfigured the deployment, so the result is an anecdotal hosted-inference benchmark rather than a controlled local evaluation.

Licensing was called out as unusually explicit: non-commercial use is free; commercial use is allowed for individuals or companies under $20M/year revenue with notification to [email protected] and a “Build with MiniMax” label; larger companies must negotiate a commercial license.
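For readers unfamiliar with the suggested sampling settings quoted in the release summary (temperature=1.0, top_p=0.95, top_k=40), the sketch below shows one common way these three knobs are combined over a next-token distribution. It is a generic illustration on a toy vocabulary, not MiniMax's or any serving engine's implementation; engines differ in the order and renormalization details, and `filter_distribution` and `sample` are hypothetical names.

```python
import math
import random
from typing import Dict


def filter_distribution(logits: Dict[str, float], temperature: float = 1.0,
                        top_p: float = 0.95, top_k: int = 40) -> Dict[str, float]:
    """Apply temperature, then top-k, then top-p, and renormalize what is left."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)
    # top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for prob, _ in kept)
    return {tok: prob / norm for prob, tok in kept}


def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token from the filtered distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy logits over a five-token vocabulary (illustrative values only).
toy_logits = {"def": 4.0, "class": 3.0, "import": 2.0, "lambda": -1.0, "goto": -5.0}
probs = filter_distribution(toy_logits)
token = sample(probs, random.Random(0))
print(sorted(probs, key=probs.get, reverse=True), "->", token)
```

With only five candidate tokens, top_k=40 does nothing and top_p=0.95 does the trimming by dropping the two unlikely tokens; on a real vocabulary both filters typically bite.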

moonshotai/Kimi-K2.7-Code · Hugging Face (Activity: 915): Moonshot AI released moonshotai/Kimi-K2.7-Code, a coding-focused agentic MoE model derived from Kimi K2.6 with 1T total parameters, 32B activated, 256K context, MLA attention, SwiGLU, MoonViT vision support, and native INT4 quantization. It claims improved long-horizon software-engineering/tool-use perform
