2026-05-28 07:26 UTC | In-site rewrite | 6 min read | Updated: 2026-06-30 13:03 UTC

[AINews] Cognition raises $1B in $26B Series D

Cognition raises $1B at a $26B valuation, projecting >$1B ARR by year-end. The article covers inference efficiency trends, agent engineering, continual learning, new benchmarks, model releases, and coding agent productization.

Source: Latent Space

We last wrote about Cognition at September’s $10B Series C, when Smol.ai also joined Cognition and AINews was eventually moved here to Latent Space. Eight months later, the company is worth roughly 2.6x as much and is officially the largest remaining independent agent lab in AI, a thesis we mapped out last year. With official ARR disclosures (now projecting >$1B ARR by EOY), you can map out the growth, which looks oddly similar to the WTF Happened in 2025 charts (this isn’t a coincidence):

In enterprise SaaS, ARR is a trailing indicator of utilization, as are the logos of some of the toughest, most discerning customers in the enterprise and startup ecosystem (including Exa and Modal, featured last week).

We will release more on the Cognition podcast tomorrow.

AI News for 5/26/2026-5/27/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Inference Efficiency, Serving Architectures, and Cost Curves

Inference optimization is increasingly architectural, not just kernel-level: EAGLE 3.1 improves speculative decoding robustness by stabilizing hidden-state feedback and reducing attention drift at deeper decode steps, with explicit emphasis on long-context acceptance length and real-world serving reliability; the team also highlighted collaboration with vLLM and TorchSpec. At the kernel/system layer, Perplexity open-sourced a rebuilt Unigram tokenizer that cuts CPU utilization 5–6× and reaches 63 µs at 514 tokens with zero heap allocations, while Qwen3.5 on TokenSpeed reportedly hits 580 tokens/s for agentic workloads via joint optimization across Alibaba, LightSeek, NVIDIA, Mooncake, and FlashAttention-4 contributors. Supporting libraries also improved: MaxSim v2 adds backprop and reports speedups of 10.33× on H200 and 11.94× on A100 over naïve PyTorch.
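For readers less familiar with why "acceptance length" is the metric that matters here, the following is a minimal sketch of the greedy verification step in generic speculative decoding. It is not EAGLE 3.1's implementation; the function names `verify_draft` and `mean_acceptance_length` are hypothetical, and real systems verify all draft positions in a single batched forward pass of the target model.

```python
from typing import List, Tuple


def verify_draft(draft: List[int], target: List[int]) -> Tuple[List[int], int]:
    """Greedy verification step of speculative decoding (illustrative only).

    draft:  k tokens proposed by the small draft model.
    target: k + 1 tokens the target model picks greedily at the same
            positions; the last one is the "bonus" token after a full match.
    Returns the tokens emitted this step and the number of accepted draft tokens.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must contain exactly one more token than draft")

    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1

    # Emit the accepted prefix plus one target token: the correction at the
    # first mismatch, or the bonus token when every draft token matched.
    emitted = draft[:accepted] + [target[accepted]]
    return emitted, accepted


def mean_acceptance_length(accepted_per_step: List[int]) -> float:
    """Average number of accepted draft tokens per verification step."""
    if not accepted_per_step:
        return 0.0
    return sum(accepted_per_step) / len(accepted_per_step)
```

Each verification step costs one target-model pass but emits `accepted + 1` tokens, so a longer average acceptance length translates directly into fewer expensive passes per generated token.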

Price cuts are being justified by structural KV-cache and attention changes: Several posts converged on the same theme: recent API price cuts from Chinese labs look sustainable because they reflect lower serving cost per token, not temporary subsidy. @kimmonismus summarized how DeepSeek V4-Pro uses hybrid attention with Compressed Sparse Attention and Heavily Compressed Attention to bring 1M-token KV cache to ~10% of V3.2 and single-token inference FLOPs to 27%, while still routing 49B active params out of 1.6T total. Xiaomi’s MiMo similarly reduces cache traffic using SWA plus hierarchical cache management. That was corroborated directly by @_LuoFuli, who said MiMo’s deepest input-cache-hit price cut comes from 5× cached token capacity, roughly 80% lower caching cost, and an architectural 1:7 Full:SWA sparsity ratio. The broader takeaway: long-context inference economics are now being pushed by attention design + cache hierarchy + routing, not just cheaper hardware.
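To make the cache argument concrete, here is a back-of-the-envelope sketch of per-sequence KV-cache size when most layers use sliding-window attention (SWA). Every number in the example configuration (layer count, KV heads, head dimension, window size, 2-byte elements) is a made-up assumption, not a published MiMo or DeepSeek spec, and `kv_cache_bytes` is a hypothetical helper.

```python
from typing import Optional


def kv_cache_bytes(
    context_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    full_every: int = 1,
    window: Optional[int] = None,
    bytes_per_elem: int = 2,
) -> int:
    """Per-sequence KV-cache size for a stack mixing full and SWA layers.

    full_every=8 means one full-attention layer per group of 8 layers,
    i.e. a 1:7 Full:SWA mix. window=None means every layer is full attention.
    """
    total = 0
    for layer in range(n_layers):
        is_full = window is None or layer % full_every == 0
        # Full layers cache every token; SWA layers cache at most `window`.
        cached_tokens = context_len if is_full else min(context_len, window)
        # Factor 2 accounts for storing both keys and values.
        total += 2 * cached_tokens * n_kv_heads * head_dim * bytes_per_elem
    return total


# Hypothetical configuration, chosen only for illustration.
full = kv_cache_bytes(1_000_000, n_layers=48, n_kv_heads=8, head_dim=128)
hybrid = kv_cache_bytes(
    1_000_000, n_layers=48, n_kv_heads=8, head_dim=128,
    full_every=8, window=4096,
)
print(f"full: {full / 1e9:.1f} GB, hybrid: {hybrid / 1e9:.1f} GB, "
      f"ratio: {hybrid / full:.3f}")
```

With these invented numbers the hybrid stack needs roughly 13% of the all-full-attention cache at 1M tokens. The exact figure depends entirely on the assumed configuration, but it shows why the Full:SWA ratio dominates long-context caching cost.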

Agents, Harnesses, Memory, and Continual Learning

The stack is shifting from “model quality” to “model-harness-memory fit”: A substantial cluster of tweets focused on practical agent engineering. LangChain shipped Deep Agents v0.6 with Delta Channels, cutting checkpoint storage for a 200-turn coding session from 5.3 GB to 129 MB, and also launched computer use in Fleet, plus Context Hub for versioned agent context/skills. LangSmith Engine was framed as automating the eval → diagnosis → fix loop, with multiple practitioners emphasizing its value for turning trace feedback into reusable online/offline evaluators. In parallel, @Vtrivedy10 made the clearest formulation of the day: task-harness fit matters as much as model quality, and bespoke vertical systems outperform generic harnesses by narrowing tools, prompts, and context to the task.
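The reported drop from 5.3 GB to 129 MB is roughly a 40× reduction. The source does not describe how Delta Channels work internally, so the following is only a generic sketch of why delta-based checkpointing helps: writing the full history at every turn grows quadratically with session length, while writing only what changed grows linearly. All function names here are hypothetical.

```python
import json
from typing import List


def full_snapshot_bytes(turns: List[str]) -> int:
    """Bytes written if every checkpoint stores the entire message history."""
    total = 0
    history: List[str] = []
    for message in turns:
        history.append(message)
        total += len(json.dumps(history).encode("utf-8"))
    return total


def delta_bytes(turns: List[str]) -> int:
    """Bytes written if every checkpoint stores only the newly added message."""
    return sum(len(json.dumps([m]).encode("utf-8")) for m in turns)


def replay(deltas: List[List[str]]) -> List[str]:
    """Rebuild the full history by applying the stored deltas in order."""
    history: List[str] = []
    for delta in deltas:
        history.extend(delta)
    return history


# A toy 200-turn session with 100-character messages.
session = ["x" * 100] * 200
print(full_snapshot_bytes(session), delta_bytes(session))
```

The trade-off is that restoring a checkpoint requires replaying deltas rather than reading a single blob, which is why real systems typically mix periodic full snapshots with deltas.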

Continual learning is re-emerging as a product category, not just a research topic: The biggest announcement here was Trajectory’s launch: a platform for using product usage signals and agent traces to continuously post-train large agentic models, with $15M in funding and design partners including Clay, Harvey, Decagon, Mercor, and Rogo. Baseten said it supports these deployments with FP8/NVFP4 quantization and autoscaled H100 infra, including a cited overnight deployment of a 397B-parameter model. The same trend appeared in open tooling: an open-source memory-centric agent built on LangChain/LangGraph was praised by multiple builders for explicit retrieval/storage/reasoning/learning separation, and RLM’s minimal training harness shows small teams can now RL-tune long-context agents in a day on 8×A100. The throughline is that “post-deployment learning” is moving from aspiration to infra.
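As a rough sense of scale for why quantization matters when deploying a 397B-parameter model, here is a weights-only memory estimate. It is a simplification under stated assumptions: it ignores quantization scale factors, KV cache, and activations, and the 80 GB per-GPU figure is an assumed H100 configuration rather than anything stated in the source. `weight_memory_gb` and `min_gpus` are hypothetical helpers.

```python
import math


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Weights-only footprint in decimal GB (ignores scales, KV cache, activations)."""
    return n_params * bits_per_param / 8 / 1e9


def min_gpus(n_params: float, bits_per_param: float, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPUs needed just to hold the weights."""
    return math.ceil(weight_memory_gb(n_params, bits_per_param) / gpu_memory_gb)


for label, bits in [("16-bit", 16), ("8-bit (FP8-like)", 8), ("4-bit (NVFP4-like)", 4)]:
    print(f"{label}: {weight_memory_gb(397e9, bits):.1f} GB, "
          f">= {min_gpus(397e9, bits)} GPUs")
```

These are lower bounds only; real deployments need extra headroom for the KV cache and runtime buffers.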

Benchmarks, Scaling Laws, and Training Methods

New benchmarks are increasingly about long-horizon, messy, real-world workflows: DeepSWE was highlighted as a SWE/agent benchmark with 113 tasks across 91 repos in 5 languages, using a minimalist bash-only harness and shorter prompts whose tasks nevertheless require 5.5× more code than SWE-Bench Pro and touch 7 files on average. In enterprise operations, Artificial Analysis and IBM launched ITBench-AA, an SRE benchmark over Kubernetes incident response where all frontier models scored below 50%; Claude Opus 4.7 led at 47%, GPT-5.5 followed at 46%, and GLM-5.1 Reasoning led open weights at 40%. Another useful reliability angle came from AgingBench, which frames deployed agent degradation as a lifespan problem caused by compression, interference, and memory updates.

Training efficiency research remains active across both theory and systems: Sakana AI’s DiffusionBlocks was one of the most technically interesting releases: it reinterprets forward passes as diffusion-like denoising steps so deep nets can be trained one block at a time, dramatically reducing memory while matching end-to-end performance across ViTs, DiTs, masked diffusion, autoregressive transformers, and recurrent-depth transformers. On the RL systems side, Snowflake introduced ZoRRo, claiming up to 3.5× faster long-context RL and 3.2× longer context windows by eliminating redundant rollout computation, alongside the specialized Arctic-Text2SQL-R2 enterprise SQL model. On the theory front, Tiberiu Musat’s preprint argues minimum neural weight norm matches minimum program length up to a log factor for fixed-precision networks, while Unified Neural Scaling Law proposes a multivariate functional form intended to extrapolate neural scaling behavior more accurately than prior fits.

Model and Modality Releases: Biology, Vision, OCR, and Embedded AI

Protein modeling had a standout day: ESMFold2 was announced as an open scientific engine for protein structure prediction and design, with strong reported results on protein interactions and antibodies, plus an accompanying atlas of 6.8B proteins and 1.1B predicted structures. The release emphasized both practical design outcomes—miniprotein binders and single-chain antibodies across five therapeutic targets—and mechanistic interpretability findings about emergent protein representations. The release was echoed by @proteinrosh and contextualized by @cgeorgiaw, who noted the atlas exceeds AlphaFold DB in scale.

A wave of smaller but practical multimodal/open releases landed: Google DeepMind shared the white paper for Gemini Embedding 2, described as a native multimodal embedding model supporting unified representations over text, image, audio, and video. NVIDIA’s LocateAnything combines Qwen2.5-3B + Moon-ViT for high-speed grounding, with a claimed 10× speedup for dense object detection. Hugging Face integrated Roboflow’s RF-DETR, positioning it as real-time detection/segmentation that outperforms YOLO-style systems. For document pipelines, Surya OCR 2 ships as a 650M model with 83.3% OLMOCR bench, 87% on an internal 91-language benchmark, and 5 pages/s on RTX 5090; LiteParse v2 rewrites parsing in Rust for up to 100× speedups and edge/browser deployment via WASM. On-device AI also got a nod with Google’s new Coral board for local speech, vision, and control demos.

Developer Platforms, Enterprise Controls, and Coding-Agent Productization

Coding agents are consolidating into full product stacks with enterprise controls: OpenAI continued tightening Codex’s product surface: GPT-5.2 and GPT-5.3-Codex are being sunset in Codex in favor of GPT-5.5, while enterprise features now include private MCP connectivity over outbound-only HTTPS, Workload Identity Federation, and expanded Admin API controls for spend alerts, allowlists, retention policies, and hosted tool management. OpenAI also published a concrete case study on self-improving tax agents with Codex, centered on tracing reviewer corrections back into evals and fixes.

Competition in coding agents is now visibly about reliability, workflow breadth, and enterprise adoption: Claude Code shared a reliability/performance update and easier bug-report capture, while GitHub kept pushing the “agentized IDE” direction with Copilot Dev Days and MCP positioning. The biggest commercial datapoint was Cognition: >$1B raised at a $26B valuation, enterprise usage up >10× YTD, and $492M run-rate revenue, paired with a growing customer list and strong endorsements from users like Exa. Meanwhile, smaller infra/product moves suggest the ecosystem is broadening: Cua Driver for Windows brings background computer use to Windows agents; Cloudflare’s agent platform was repeatedly praised for “fractional computing” economics; and Grok Build’s worktree support targets multi-agent code swarms at repo scale.

Top tweets (by engagement)

Cognition’s scale-up: Cognition announced >$1B raised, $26B valuation, and $492M run-rate revenue, one of the clearest signals yet that coding agents are converting into large enterprise businesses.

Claude Code reliability push: Anthropic’s ClaudeDevs posted a high-engagement update on responsiveness, reliability, and better feedback collection—evidence that product quality and trust are now central battlegrounds.

Sakana AI’s DiffusionBlocks: @hardmaru drew major attention to block-wise training that can match end-to-end performance while dramatically lowering memory requirements.

ESMFold2 release: @alexrives announced one of the day’s most substantive science releases: open protein modeling at atlas scale with therapeutic design implications.

OpenAI enterprise controls + MCP: @OpenAIDevs’ update on private MCP and related admin/security controls shows where frontier APIs are competing for large-org adoption.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

Low-Bit Local AI on Consumer Hardware

PrismML just released Binary and Ternary Bonsai Image 4B: 1-bit/ternary text-to-image diffusion transformers that can even run 100% locally in your browser on WebGPU. (Activity: 759): PrismML released Binary and Ternary Bonsai Image 4B, described as 1-bit/ternary text-to-image diffusion-transformer variants with ~3GB checkpoints, Apache-2.0 licensing, and a WebGPU browser demo (HF collection, demo). The post compares them to FLUX.2 Klein 4B at ~16GB; a top technical comment claims Bonsai Image is primarily a quantized/post-trained derivative of FLUX.2 Klein 4B, with insufficient attribution outside the whitepaper. The main debate is attribution/branding: one commenter argues PrismML is rebranding quantized/fine-tuned base models as “Bonsai” while minimizing credit to original labs, comparing it to releasing a quant of Qwen as a new model. Another commenter asks whether it can run on CPU with 16GB RAM, but no technical answer is provided in the supplied comments.
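For context on what "ternary" means at the weight level, here is a minimal sketch of absmean ternary quantization, a generic technique in which each weight is mapped to -1, 0, or +1 with one shared scale. This is an illustration only: the post does not describe PrismML's actual procedure, and production models use per-block scales and post-training to recover quality. The function names are hypothetical.

```python
from typing import List, Tuple


def ternary_quantize(weights: List[float]) -> Tuple[List[int], float]:
    """Map weights to {-1, 0, +1} with a single absmean scale (illustrative)."""
    if not weights:
        return [], 0.0
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clip into the ternary range.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original weights."""
    return [q * scale for q in quantized]


q, s = ternary_quantize([0.9, -0.05, -1.2, 0.3])
print(q, s, dequantize(q, s))
```

Because each weight carries at most log2(3) ≈ 1.58 bits of information instead of 16, checkpoints shrink substantially, at the cost of reconstruction error that post-training has to compensate for.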

A commenter alleges PrismML’s “Bonsai-Image” is not a newly trained base model, but a binary/ternary quantization of FLUX.2 Klein 4B with additional post-training to recover quality. They argue the project’s HF demo/model pages and GitHub omit clear attribution to the original FLUX model/team, with the original model reportedly mentioned only in the whitepaper.

A technical usability note says th
