2026-06-23 06:19 UTCIn-site rewrite6 min readUpdated: 2026-06-23 13:43 UTC

SpaceX is already a $28B/yr Neocloud

This issue covers SpaceX's third GPU rental deal with Reflection AI, OpenAI Daybreak's expanded cyber security program, Sakana Fugu's orchestration release and the benchmark transparency backlash, GLM-5.2's breakthrough as an open-weight agent-competent model, Google's Interactions API GA, Baseten's $1.5B Series F, and the growing emphasis on evaluating agents as systems.

SourceLatent Space

Congrats due to Baseten, who officially announced their leaked $13B Series F.

Today had a smattering of midsize news across OpenAI Daybreak and Gemini Interactions and Sakana Fugu, but probably the trend to watch and hang your hat on is SpaceX’s THIRD GPU rental deal, this time with Reflection AI:

Combined with the well publicized Anthropic and Google deals (hmmm… who’s missing from this customer list? Why?), one might be wondering just how far SpaceX has to go. Jamin Ball from already tallied up like for like:

In Summary, $2.32B / month, >$10 / hour for Blackwells (which is a very high rate)

That annualizes to $28B a year, roughly twice the current revenue of Coreweave, which is holding strong at a $60B valuation today a year after their IPO.

AI News for 6/20/2026-6/22/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI Daybreak, GPT-5.5-Cyber, and the policy/security split

OpenAI expanded its cyber stack beyond vuln discovery into remediation: OpenAI announced an expanded Daybreak program with a Codex Security plugin, the full GPT-5.5-Cyber model for trusted defenders, a Cyber Partner Program, and Patch the Planet for securing critical OSS. Follow-on posts added concrete scope: 30M+ commits scanned, 30K+ codebases covered, 70K+ reviewer-marked fixes, and 500K+ additional fixes detected automatically; major projects like cURL, Go, Python, Sigstore, and pyca/cryptography are in scope; and the plugin supports deep scans, threat modeling, patch generation, and export into existing workflows. The notable shift is from “find bugs” to closed-loop patch generation with human review.

Capability claims are colliding with export-control logic: OpenAI is explicitly claiming SOTA on CyberGym for GPT-5.5-Cyber via @sama, while the public debate around Anthropic’s restricted Mythos/Fable access continued. @BlackHC asked the obvious policy question: if OpenAI’s latest cyber model is stronger, why is it not under equivalent controls? @shashj also added an important correction to the Mythos story: NSA references to “hours, not weeks” were tied to red-teaming efforts with initial access assumptions, and those red teams reportedly no longer have Mythos access. The result is a widening gap between model capability reporting and coherent governance criteria.

Sakana Fugu’s orchestration release and the benchmark transparency backlash

Fugu reframes “model release” as learned orchestration over a model pool: Sakana introduced Fugu, presenting it as a single API that learns model selection, delegation, verification, and synthesis across multiple frontier models; Vercel quickly added Fugu Ultra to AI Gateway. The product thesis resonated with engineers who already see real systems moving toward orchestration layers: @levie called routing/orchestration a likely high-value layer, and @audreyt reported Fugu Ultra working well as a planner/advisor paired with a fast driver loop. Sakana then published a sequence of use cases—autoresearch, finance, blindfold chess, CAD—arguing that test-time coordination can beat monolithic calls on long-horizon tasks (1, 2, 3, 4).

The critique was immediate: opaque baselines, missing cost accounting, and questionable reporting: The most detailed teardown came from @eliebakouch, who argues Fugu is essentially a router/classifier plus a preplanned multi-step workflow system, with several core issues: it trails Opus on SWE-Bench Pro by ~10 points, compares against anonymized “Model A/B/C,” omits token/cost reporting for best-of-N style orchestration, and should be compared against other test-time scaling setups rather than plain base models. Skepticism escalated further with @BlancheMinerva, who challenged Sakana’s trustworthiness based on prior incidents and alleged impossible performance claims in earlier work. The release still matters technically, but the discussion shifted from “is orchestration useful?” to “how should we evaluate and disclose orchestration systems?”

GLM-5.2’s breakout: open-weight agents, infra adoption, and real-harness wins

GLM-5.2 is emerging as the first open-weight model broadly treated as frontier-adjacent for agentic work: Multiple posts converged on the same story. Artificial Analysis put GLM-5.2 at #3 overall on GDPval-AA at 1524 Elo, behind only Claude Fable 5 and Opus 4.8, and level with or ahead of some proprietary models; they also highlighted GLM as the leading open-weight model and a strong point on the AA-Briefcase cost/performance frontier. @natolambert called it a possible “DeepSeek moment” for agents, while @AravSrinivas argued it revives serious interest in open source because it “passes the blind test” on median production knowledge work.

The strongest evidence came from actual harnesses, not abstract benchmark charts: Cline tested GLM-5.2 and Opus 4.8 on a real bug in the Cline repo using the same harness and found GLM was slower and more tool-call-heavy, but cheaper ($0.41 vs $0.81) and more robust in verification: it cleaned up dead code and confirmed the production build, while Opus left type errors that passed tests. @askalphaxiv said GLM-5.2 is the first open-weights model they’ve tried that can do real autoresearch tasks, including async vs colocated RL training runs over two 8xH100 nodes. At the tooling layer, @_xjdr described promoting GLM to the default model in ncode, after spending the weekend hardening capacity, parsing tool streams, and splitting endpoints for standard vs 1M context sessions; a second thread details the surprisingly large amount of model-specific parser and harness work needed to onboard an OSS model cleanly (details).

Distribution and serving velocity were unusually high: GLM-5.2 landed on AWS Marketplace, in Baseten’s library with >280 tok/s and <0.8s TTFT, in Droid via Fireworks, in LangChain’s deepagents code, and across many providers—one count put it at 20. There is also a growing ecosystem of practical guides, like running GLM-5.2 inside Claude Code via Baseten’s OpenAI-compatible endpoint. The meta-point is that open model quality now clears the threshold where inference vendors and agent tool builders will optimize aggressively around it.

Agent infrastructure: Gemini Interactions API, Hermes expansion, and harness-first engineering

Google promoted the Interactions API to its primary Gemini interface for agents: Google and @OfficialLoganK announced the Interactions API is now GA and the new default for Gemini models and agents. The feature set is notable: one API for models and agents, background async execution, expanded tool support, multimodal generation, managed agents, and an isolated remote Linux sandbox called Antigravity per @_philschmid. That makes Google’s stack look increasingly like a first-party answer to the “agent harness” problem, not just a model endpoint.

Skills, communication protocols, and stateful sessions are becoming first-class infra concerns: To smooth migration, Google shipped an installable Gemini Interactions skill that teaches coding agents the new SDK patterns and current model versions. In parallel, @omarsar0 highlighted a useful survey of nine open-source agent communication protocols, noting an emerging standard around hybrid payloads plus session-state persistence, while decentralized discovery remains immature. The common theme: teams are standardizing around stateful, tool-rich, long-running agent workflows, but not yet on the full protocol stack.

Hermes continues to gain surface area as a local/personal agent platform: Hermes updates included iMessage access without a Mac, Raft integration as an external agent in a shared workspace, and most significantly GUI control for Windows or Linux desktop apps with any model. The repo also crossed 200K stars, reinforcing that a lot of developer energy is going into agent UX and harness ergonomics, not just base model quality.

Inference economics, infrastructure scale, and the shift toward “owned intelligence”

Baseten’s $1.5B Series F is a direct bet on post-trained open models and inference as the enterprise control plane: Baseten and CEO @amiruci argued that companies increasingly want to own their intelligence layer: run open or specialized models, post-train on their own data/evals, and retain control over continual learning. Their customer list—Abridge, Cursor, Decagon, Harvey, Notion, OpenEvidence, etc.—shows this is already happening at the application layer. This aligns with the day’s broader evidence: stronger open models plus better infra are turning post-training from a frontier-lab specialty into an app-company competency.

Compute leasing is becoming a strategic market of its own: Reports that Reflection signed a $6.3B compute deal with SpaceX for GB300 access were widely discussed; @jaminball contextualized it alongside SpaceX/xAI’s other large compute deals with Anthropic and Google, noting implied Blackwell pricing above $10/hour and 90-day out clauses. If accurate, this makes “neocloud” capacity and GPU brokerage an increasingly important strategic layer between model builders and hardware supply.

Top tweets (by engagement):

OpenAI Daybreak / GPT-5.5-Cyber: @OpenAI, @sama

GLM-5.2 real-world validation: @cline

Google’s Interactions API GA: @Google

Baseten Series F / owned intelligence thesis: @amiruci

Sakana Fugu release: @SakanaAILabs

Benchmarks, eval methodology, and the move from static scores to real workflows

Judge reliability is under fresh scrutiny: @dair_ai summarized a large LLM-as-a-Judge audit across 21 judges, nine providers, and about 541K judgments. The key result is methodological: exact-match agreement materially overstates judge quality, while switching to Cohen’s kappa deflates agreement by 33–41 points on MT-Bench, with judge rankings shifting significantly. That’s a strong warning for teams using judge models as internal eval infrastructure.

There is increasing pressure to evaluate agents as systems, not chatbots: Jules framed this explicitly: the goal is not just an agent that reacts, but one that notices, anticipates, and partners. Relatedly, @rseroter highlighted the distinction between using a coding agent and engineering an autonomous coding harness. The most substantive posts of the day—GLM in Cline, OpenAI Daybreak, Fugu criticism—were all really about system behavior under tools, memory, verification, and long-horizon execution, not raw single-turn IQ.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

GLM-5.2 Price/Performance and Homelab Deployment

GLM-5.2 is on DeepSWE (Activity: 606): The image is a DeepSWE cost-vs-score benchmark chart for coding agents/models, linked here: image. It highlights GLM-5.2 [max] at 44% DeepSWE with an average cost of $3.92/task, placing it below top closed models like GPT-5.x/Claude variants in score but in a relatively strong cost-performance position, especially given the post’s note that DeepSeek pricing may be outdated due to a later 75% discount. The post contextualizes DeepSWE against ArtificialAnalysis coding-agent scores and SWE-rebench, while noting prior DeepSWE criticism was partly retracted by its original author. Commenters were cautiously positive about GLM-5.2, arguing it “feels” competitive with Sonnet/Kimi and notable for being an open-weight model in the same broad conversation as Opus/GPT-class systems. There was also criticism of the chart design—especially the reversed cost axis with zero on the right—and some amusement that Gemini appears to underperform open models on this benchmark.

A commenter interprets the DeepSWE result as roughly matching hands-on experience: GLM-5.2 feels stronger than Claude Sonnet and Kimi, but still behind Opus 4.8/GPT-5.5. They emphasize the technical significance that GLM-5.2 is an open-weight frontier-adjacent model that can be

[truncated for AI cost control]