2026-06-05 06:44 UTC | In-site rewrite | 6 min read | Updated: 2026-06-30 13:03 UTC

AI News: Not Much Happened Today

Today's AI news covers NVIDIA's Nemotron 3 Ultra and 3.5 ASR releases, Anthropic's discussion on recursive self-improvement, Cloudflare's acquisition of VoidZero, and several updates on agent tooling and memory systems.

Source: Latent Space

Anthropic is seeing Sparks of RSI, OpenAI’s ChatGPT has finally crossed 1B MAU (~5 months behind schedule) and improved its memory, and SpaceXAI is explaining its IPO to people who might not know they will be forced into buying it.

None of which are as important as getting your AIEWF tickets and hotels and tuning in to the latest pod with Andon Labs!

Latest pod: "Reality: The Final Eval" with Lukas Petersson and Axel Backlund of Andon Labs (https://www.latent.space/p/andon)

AI News for 6/3/2026-6/4/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

NVIDIA’s Nemotron 3 Ultra and 3.5 ASR Release

Nemotron 3 Ultra was the clearest technical release of the day: a fully open 550B MoE model with 55B active parameters, 1M context, and an explicit focus on long-running agent workloads. NVIDIA says it is up to 5x faster and 30% lower cost for agentic tasks, with weights, synthetic data, reward checkpoints, quantized variants, and training recipes released under OpenMDW 1.1 (NVIDIA launch, NVIDIAAI open artifacts, Pavlo Molchanov thread). The architecture combines hybrid Mamba/attention, LatentMoE, and native MTP, with pretraining done in NVFP4 over 20T tokens—notable because it pushes low-precision pretraining into a new scale regime (tech notes, scaling discussion).
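For a rough sense of scale, the parameter counts above can be turned into back-of-envelope numbers. This is only a sketch: the 550B and 55B figures come from the text, while the bits-per-weight values are nominal assumptions (4 bits for NVFP4, 16 for BF16) that ignore scaling-factor metadata, activations, and KV cache.

```python
# Back-of-envelope sizing for a sparse MoE model.
# Parameter counts are taken from the text; everything else is an assumption.
TOTAL_PARAMS = 550e9   # total parameters
ACTIVE_PARAMS = 55e9   # parameters active per token


def active_fraction(total: float, active: float) -> float:
    """Share of the model's weights used for any single token."""
    return active / total


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Nominal weight storage, ignoring per-block scales and other metadata."""
    return params * bits_per_weight / 8


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
nvfp4_gb = weight_bytes(TOTAL_PARAMS, 4) / 1e9    # assumed 4 bits per weight
bf16_gb = weight_bytes(TOTAL_PARAMS, 16) / 1e9    # 16 bits per weight

print(f"active fraction: {frac:.0%}")
print(f"nominal weights at 4 bits: {nvfp4_gb:.0f} GB")
print(f"nominal weights at 16 bits: {bf16_gb:.0f} GB")
```

Under these assumptions, only about a tenth of the weights participate in each token, and the 4-bit weights take roughly a quarter of the BF16 footprint. Real checkpoints will be somewhat larger than the nominal figure.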

The benchmark and serving story was unusually strong for an open release. @ArtificialAnlys measured 47.7 on its Intelligence Index using NVIDIA’s recommended NVFP4 inference weights (48.2 in BF16), making it the strongest US open-weights model they’ve tested, though still behind Kimi K2.6. More interestingly, they reported 400+ output tok/s via BlackBox, and separately showed Nemotron 3 Ultra sitting on the Pareto frontier for task latency vs. performance on Terminal-Bench-style evaluations under turn limits (latency analysis, BlackBox throughput). The model shipped day 0 across the stack: vLLM, Modal, Together, Fireworks, Ollama cloud, Baseten, CoreWeave/W&B, Cline, Prime Intellect, and Nous Portal.
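"Sitting on the Pareto frontier" has a precise meaning: no other model is at least as fast and at least as accurate while being strictly better on one of the two. The sketch below shows that check in isolation. The model names and numbers are hypothetical placeholders, not figures from the cited analysis.

```python
from typing import List, Tuple

# (name, task latency in seconds, benchmark score); all values are made up.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[str]:
    """Return names of points not dominated on (lower latency, higher score)."""
    frontier = []
    for name, latency, score in points:
        dominated = any(
            other_latency <= latency
            and other_score >= score
            and (other_latency < latency or other_score > score)
            for _, other_latency, other_score in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


models: List[Point] = [
    ("model_a", 10.0, 0.40),
    ("model_b", 25.0, 0.55),
    ("model_c", 30.0, 0.50),  # slower and worse than model_b, so dominated
    ("model_d", 60.0, 0.70),
]

frontier_names = pareto_frontier(models)
print(frontier_names)
```

In this toy data, model_c drops out because model_b beats it on both axes. The other three each represent a different latency/quality trade-off.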

Nemotron 3.5 ASR was the quieter but practical companion release: an open streaming ASR model with a single 0.6B checkpoint, 40 language-locale combinations, and sub-100ms latency, built on a cache-aware FastConformer / RNN-T style design optimized for voice agents and streaming speech workloads (Piotr Zelasko, Together, fal availability).

Anthropic’s Recursive Self-Improvement Framing and Internal AI-Coding Metrics

Anthropic published the most-discussed policy/research note of the day, arguing that current systems show early signs of recursive self-improvement (RSI)—not yet full autonomy in research direction, but clear evidence that AI is accelerating AI development (Anthropic post). The headline operational claims were concrete: 80%+ of merged code at Anthropic is now authored by Claude, the typical engineer ships 8x more code per quarter than in prior years, and on internal open-ended engineering tasks Claude’s success rate rose from roughly 26% to 76% in six months (code metric, Alex Albert summary).

The most striking empirical datapoint was Anthropic’s recurring “speed up a small model training script” test: Claude Opus 4 averaged about 3x speedup, while Mythos Preview reportedly achieved ~52x (Anthropic benchmark claim, correction on dates). Anthropic also says Mythos gave better “what to do next” research suggestions than humans 64% of the time in sessions where the researcher had taken a wrong turn (research-next-step result). Their broader thesis: automating problem selection is still unresolved, but automating large portions of implementation and iteration is already happening.

The governance angle mattered as much as the productivity claims. Anthropic explicitly wrote that “it would be good for the world to have the option to slow or temporarily pause frontier AI development,” framing verification and coordination mechanisms as increasingly urgent if RSI-like dynamics continue (Anthropic governance statement, discussion, commentary). This landed amid criticism that Anthropic recently weakened parts of its Responsible Scaling Policy thresholds around bio/chemical risk, according to @CRSegerie. Separately, a coalition including Altman, Amodei, Hassabis, and Baker backed mandatory DNA synthesis screening and recordkeeping in the US, arguing AI is eroding biological knowledge barriers (letter summary).

Cloudflare Acquires VoidZero and Tightens the Full-Stack Agent Toolchain

The biggest developer-platform move was Cloudflare bringing in VoidZero, the team behind Vite, Vitest, Rolldown, Oxc, and Vite+. Cloudflare and VoidZero emphasized that Vite remains open source, MIT, and vendor-neutral, with Cloudflare also committing $1M to a fund for independent Vite ecosystem development (Cloudflare, Vite statement, Evan You).

The strategic read from developers was that this gives Cloudflare tighter control over an increasingly agent-friendly application stack: frontend/build tooling, runtime, storage, inference, deployment primitives, and security in one place. @wesbos framed it as Cloudflare assembling “a tidy package they can hand to an LLM to make a site,” which is directionally consistent with Cloudflare’s own push on agents, MCP, sandboxes, AI search, payments, and observability in a unified platform (Cloudflare agents docs overview).

Agents, Harnesses, Memory, and Evaluation Infrastructure

Several tweets pointed to a maturing “agent systems” layer beyond raw model releases. A recurring theme was that the bottleneck is increasingly the harness/orchestrator, not just prompting. A popular clip summarized the Claude Code workflow as “I don’t prompt Claude anymore, I write loops,” while @omarsar0 described reverse-engineering dynamic workflows into his own orchestrator for branching research, verification, triage, data synthesis, and eval generation. The common idea: higher-order control loops, not one-shot prompts, are becoming the real unit of work.
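The "write loops, not prompts" idea can be sketched as a propose/verify/retry cycle. This is a minimal illustration, not any particular product's workflow: `propose` and `verify` are hypothetical stand-ins for a model call and a checker (tests, linters, a reviewer model), and here they are deterministic stubs so the example runs on its own.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: a real harness would wrap model and tool calls here.
Propose = Callable[[str, Optional[str]], str]
Verify = Callable[[str], Tuple[bool, Optional[str]]]


def run_loop(
    task: str, propose: Propose, verify: Verify, max_iters: int = 5
) -> Tuple[Optional[str], int]:
    """Ask for a candidate, check it, and feed failures back until it passes."""
    feedback: Optional[str] = None
    for attempt in range(1, max_iters + 1):
        candidate = propose(task, feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate, attempt
    return None, max_iters  # gave up within the iteration budget


def stub_propose(task: str, feedback: Optional[str]) -> str:
    # Stand-in for a model call: the first draft ignores edge cases.
    return f"{task}: draft" if feedback is None else f"{task}: draft+fix"


def stub_verify(candidate: str) -> Tuple[bool, Optional[str]]:
    # Stand-in for a test suite: only the revised draft passes.
    if "fix" in candidate:
        return True, None
    return False, "edge case failing"


result, attempts = run_loop("parse config", stub_propose, stub_verify)
print(result, attempts)
```

The design point is that the verifier's feedback, not a hand-edited prompt, drives the next attempt, and the iteration budget bounds the cost when the checker never passes.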

Tooling around those loops also improved. LangSmith Sandboxes reached GA with Dockerfile snapshots, interactive consoles, TCP tunneling, and standard Linux tooling. Hugging Face pushed two adjacent ideas: a Kernels distribution path for custom kernels on the Hub (announcement) and stronger support for storing agent traces as first-class artifacts, echoed by @ClementDelangue. @julien_c released SynthTraces, a minimal harness that generated 2,000+ synthetic coding-agent session traces by having an open model play the coding agent and a local model simulate the user.

Evaluation also shifted toward real-world agent work. Arena launched Agent Arena / Agent Mode, measuring agentic performance from millions of live sessions with tools like web search, filesystem, bash, and image generation. Their current ranking puts GPT-5.5 first, followed by Claude Opus 4.7, GLM-5.1, Gemini 3.1 Pro, and Kimi-K2.6, with methodology based on task success, steerability, recovery, user praise/complaint, and tool hallucination across 300K+ tasks, 2M+ tool calls, and 40M lines of code (launch, methodology). On the enterprise side, Cognition introduced an AI Productivity Guarantee for Devin—up to $10M in covered usage if the product doesn’t produce positive engineering value—backed by an internal measurement system over 258 enterprise sessions spanning tasks up to 64+ hours (guarantee, technical writeup).

Memory, Multimodality, and Model/Benchmark Updates

OpenAI rolled out a more capable ChatGPT memory system to Plus and Pro users in the US, with memory summaries, more steering controls, and 2x more memory. The company framed this as a longer-running research arc from saved memory to “dreaming” to the current system (OpenAI, controls, Christina Kim explanation). Related developer-side updates included moderation scores in the Responses and Completions APIs (OpenAIDevs) and a heavily shared demo of the new Codex iOS app plugin for viewing and testing apps in-browser with hot reload (OpenAIDevs demo).

A few other model/data releases are worth noting. Gemma 4 12B continued to draw attention both as a local coding model replacement and in highly compressed form: Unsloth released a 2-bit GGUF at 4.66 GB. @_philschmid highlighted an architectural explainer on how Gemma 4 handles text/images/audio without separate encoders. In multimodal research, @skalskip92 flagged Molmo2 as a strong open VLM candidate at CVPR, supporting video pointing, tracking, counting, and multi-image reasoning. For document understanding, ParseBench from LlamaIndex introduced an open benchmark with 2,000+ human-verified pages and 167K+ test rules across tables, charts, faithfulness, formatting, and grounding (benchmark announcement).
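The 4.66 GB figure for a "2-bit" file is worth a quick sanity check. The sketch below assumes exactly 12e9 parameters and 1 GB = 1e9 bytes, neither of which is confirmed by the source, so treat the result as approximate.

```python
def effective_bits_per_weight(file_bytes: float, n_params: float) -> float:
    """Average storage per parameter implied by a file size."""
    return file_bytes * 8 / n_params


N_PARAMS = 12e9      # assumption: "12B" taken as exactly 12e9 parameters
FILE_BYTES = 4.66e9  # assumption: 4.66 GB read as decimal gigabytes

bpw = effective_bits_per_weight(FILE_BYTES, N_PARAMS)
nominal_2bit_gb = N_PARAMS * 2 / 8 / 1e9  # size if every weight took exactly 2 bits

print(f"effective bits per weight: {bpw:.2f}")
print(f"pure 2-bit size: {nominal_2bit_gb:.1f} GB")
```

Under these assumptions the file averages a little over 3 bits per weight rather than 2. That is consistent with how low-bit quantized files are commonly built, with some tensors kept at higher precision plus scale metadata, though the exact layout of this release is not stated in the source.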

Top Tweets (by engagement, filtered for technical relevance)

Anthropic on RSI and internal automation: Claude now writes 80%+ of merged code at Anthropic, engineers ship 8x more code, and the company says AI accelerating AI development is becoming plausible (Anthropic).

OpenAI memory upgrade: a more capable ChatGPT memory system with summaries, steering controls, and 2x more memory for Plus/Pro users in the US (OpenAI).

Cloudflare + VoidZero: Cloudflare brings in the VoidZero team while keeping Vite MIT and vendor-neutral, plus a $1M OSS fund for the ecosystem (Cloudflare, Vite).

Nemotron 3 Ultra launch: open 550B/55B-active hybrid MoE for long-running agents, with full recipes and unusually strong speed claims (NVIDIA).

Cursor canvases + context explorer: sharable canvases for apps/reports/internal tools and an interactive breakdown of where agent context is spent (Cursor).

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

Gemma 4 12B Release and Benchmarks

google/gemma-4-12B · Hugging Face (Activity: 1610): Google DeepMind released google/gemma-4-12B as part of the Gemma 4 open-weights family, spanning E2B, E4B, 12B, 26B A4B, and 31B variants with dense and MoE architectures, instruction-tuned/pretrained checkpoints, multimodal input, multilingual support across 140+ languages, and context windows up to 256K tokens. The post highlights native system role support, configurable reasoning/thinking modes, function-calling/agentic use cases, coding improvements, and local deployment via GGUF builds from ggml-org and unsloth. A top comment links Maarten Grootendorst’s visual guide, specifically calling out the model’s “encoder-free architecture.” Commenters are mainly interested in empirical coding performance, with one explicitly wanting to test whether Gemma 4 12B can beat Qwen 3.5 9B on coding tasks. No concrete benchmark results were provided in the comments.

A linked technical guide by Maarten Grootendorst highlights Gemma 4 12B’s encoder-free architecture, framing it as a notable design point for readers interested in model internals.

Several commenters positioned Gemma 4 12B as a practical size tier between smaller Gemma variants like E4B and larger models such as 26B, with one user also noting interest in whether it can outperform Qwen 3.5 9B on coding tas

[truncated for AI cost control]