AI News Hub (live)
Public articles: 11; Collected articles: 12; Trust: 87; Refresh: 720 min
Health: Healthy; Source type: Research; Full-text rights: Full text allowed; Last ingested: 2026-06-06; ID: ahead-of-ai; Status: Enabled

Public Substack newsletter; free posts allowed.

Latest public articles

LLM Research Papers: The 2026 List (January to May)

The author continues an annual tradition by curating and categorizing notable LLM research papers from January to May 2026, covering architecture, training, inference efficiency, reasoning, reinforcement learning, agent systems, and more, with emphasis on hybrid architecture trends and representative works like Nemotron 3.

  • The list focuses on reasoning models, RL, efficient inference, agent systems, and other trending areas
  • Hybrid architectures (e.g., alternating Mamba and attention layers) are a key 2026 trend
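
To make the "alternating Mamba and attention layers" idea concrete, here is a minimal sketch of how such a layer schedule might be laid out. The function name `hybrid_layer_schedule` and the `attention_every` ratio are hypothetical and chosen for illustration; they are not taken from Nemotron 3 or any specific model config.

```python
def hybrid_layer_schedule(num_layers, attention_every=4):
    """Return a list of layer types for a hybrid stack.

    Assumption for illustration: every `attention_every`-th layer is a
    full attention layer and all other layers are Mamba-style layers.
    """
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(num_layers)
    ]


schedule = hybrid_layer_schedule(num_layers=8, attention_every=4)
print(schedule)
```

Real models choose their own ratios and placements, so the released config files are the reliable source for the actual layout.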

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

From Gemma 4 to DeepSeek V4, this article explores how new open-weight LLMs are reducing long-context costs through architectures like cross-layer KV sharing, per-layer embeddings, attention budgeting, compressed convolutional attention, and mHC.

  • Gemma 4 introduces cross-layer KV sharing, cutting KV cache size in half while maintaining quality.
  • Per-layer embeddings boost model capacity with minimal computational overhead.
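
The halving of the KV cache can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes that every pair of adjacent layers shares one set of keys and values. That pairing, the function name `kv_cache_bytes`, and the configuration numbers are all assumptions made to reproduce the "half" figure, not a description of Gemma 4's actual layout.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, share_group=1):
    """Approximate KV cache size in bytes for one sequence.

    share_group is the number of consecutive layers that reuse one set
    of keys/values (1 = no sharing, 2 = every pair of layers shares).
    """
    # Number of layers that store their own K and V tensors (ceiling division).
    unique_kv_layers = -(-num_layers // share_group)
    # The factor 2 accounts for storing both keys and values.
    return 2 * unique_kv_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration, chosen only for illustration (16-bit cache entries).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131_072)

baseline = kv_cache_bytes(**cfg)
shared = kv_cache_bytes(**cfg, share_group=2)
print(f"baseline: {baseline / 2**30:.1f} GiB, shared: {shared / 2**30:.1f} GiB")
# prints "baseline: 16.0 GiB, shared: 8.0 GiB"
```

Because the cache grows linearly with the number of layers that store their own keys and values, sharing across pairs of layers cuts it in half regardless of the context length.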

My Workflow for Understanding LLM Architectures

A learning-oriented workflow for understanding new open-weight model releases. It starts with the official reports but relies more heavily on config files and reference code, because papers have become less detailed.

  • Start with official technical reports, but papers are often less detailed now
  • Inspect config files and reference implementations on Hugging Face Model Hub

Components of A Coding Agent

An overview of the six core components that make coding agents effective: live repo context, prompt caching, tool access, context reduction, session memory, and subagent delegation. The article explains how these harness features improve LLM performance on coding tasks.

  • Coding agents improve LLM performance through a harness that provides live repo context, structured tools, and memory management.
  • The six components are live repo context, prompt caching, tool access, context reduction, session memory, and subagent delegation.

A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026

A comprehensive review and comparison of ten open-weight large language model releases from January to February 2026: Arcee Trinity, Kimi K2.5, Step 3.5 Flash, Qwen3-Coder-Next, GLM-5, MiniMax M2.5, Nanbeige 4.1, Qwen3.5, Ling 2.5, and Tiny Aya, plus an update on Sarvam. The article focuses on architectural similarities and differences, highlighting trends like hybrid attention, multi-token prediction, and mixture-of-experts.

  • Ten open-weight LLMs released in Jan-Feb 2026 compared with architectural focus
  • Hybrid attention and multi-token prediction emerge as key efficiency trends

Categories of Inference-Time Scaling for Improved LLM Reasoning

Inference-time scaling is one of the most effective ways to improve answer quality in deployed LLMs. This article categorizes various inference-time scaling techniques and provides an overview of recent papers, including chain-of-thought prompting, self-consistency, best-of-N ranking, rejection sampling with a verifier, self-refinement, and search over solution paths. The author shares personal experiments from drafting a book chapter on the topic.

  • Inference-time scaling improves model performance by allocating more compute and time during inference
  • Key methods include chain-of-thought, self-consistency, best-of-N, rejection sampling, and more
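
Two of the listed methods, self-consistency and best-of-N, can be sketched in a few lines. This is a minimal illustration and not the article's code: the sampled answers are hard-coded stand-ins for real model outputs, and the length-based scorer is a hypothetical placeholder for a reward model or verifier.

```python
from collections import Counter


def self_consistency(answers):
    """Majority vote over the final answers of several sampled completions."""
    counts = Counter(a.strip() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def best_of_n(candidates, score_fn):
    """Return the candidate that scores highest under score_fn."""
    return max(candidates, key=score_fn)


# Hypothetical final answers extracted from five sampled reasoning chains.
sampled = ["42", "41", "42", " 42", "40"]
answer, agreement = self_consistency(sampled)
print(answer, agreement)  # prints "42 0.6"

# Hypothetical scorer that prefers shorter answers; a real setup would
# plug in a reward model or a task-specific verifier here.
best = best_of_n(["a long rambling answer", "short answer"],
                 score_fn=lambda s: -len(s))
print(best)  # prints "short answer"
```

Both methods spend extra compute by sampling several completions. Self-consistency needs no scorer but only works when final answers can be compared directly, while best-of-N depends on the quality of the scoring function.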

The State Of LLMs 2025: Progress, Problems, and Predictions

A comprehensive review of large language models in 2025, covering key developments like DeepSeek R1's reasoning via RLVR/GRPO, the rise of inference-time scaling and tool use, the problem of benchmark overfitting (benchmaxxing), and predictions for 2026 including diffusion models and broader RLVR applications.

  • DeepSeek R1, an open-weight reasoning model trained with RLVR/GRPO, dominated the year and shifted the focus to post-training scaling.
  • Inference-time scaling and tool use emerged as major drivers of LLM progress beyond traditional pre-training scaling.

LLM Research Papers: The 2025 List (July to December)

The author shares a curated list of interesting research papers from July to December 2025, categorized by topics like reasoning models, reinforcement learning, and architectures, as a thank-you to supporters.

  • Curated list of research papers from July to December 2025
  • Categorized into reasoning models, RL, architectures, etc.

From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates

This article provides an in-depth analysis of DeepSeek V3.2's technical evolution, covering architectural changes (including the sparse attention mechanism DSA), reinforcement learning updates (such as GRPO improvements, self-verification and self-refinement), and the development of hybrid reasoning models. V3.2 matches the performance of GPT-5 and Gemini 3.0 Pro and is released as an open-weight model, making it a significant milestone.

  • DeepSeek V3.2 adopts the same sparse attention mechanism (DSA) as V3.2-Exp, greatly improving long-context efficiency.
  • Self-verification and self-refinement techniques from DeepSeekMath V2 are integrated, substantially enhancing mathematical reasoning capabilities.

Beyond Standard LLMs

This article explores alternatives to standard autoregressive decoder-style transformers for large language models, including linear attention hybrids, text diffusion models, code world models, and small recursive transformers. It analyzes the strengths and limitations of each approach and discusses their potential impact on efficiency, reasoning, and modeling performance.

  • Linear attention hybrids like Qwen3-Next and Kimi Linear use Gated DeltaNet to reduce computational complexity but must balance efficiency with reasoning accuracy.
  • Text diffusion models enable parallel token generation but suffer quality degradation and tool integration challenges, making them unlikely to replace autoregressive models soon.

Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

This article explains four main approaches to evaluating large language models: multiple-choice benchmarks (like MMLU), verifiers for free-form answers, leaderboards based on user preferences (like Chatbot Arena), and LLM-as-a-judge evaluations. It includes from-scratch code implementations and discusses the trade-offs of each method.

  • Multiple-choice benchmarks test knowledge recall but don't reflect real-world use.
  • Verifiers allow free-form answers but require verifiable domains like math.
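
The first two approaches can be sketched as follows. This is a simplified illustration and not the article's from-scratch implementation: the helper names (`multiple_choice_accuracy`, `extract_final_number`, `verify`) are hypothetical, and the verifier assumes that the last number in a response is the model's final answer.

```python
import re


def multiple_choice_accuracy(predictions, gold):
    """Fraction of predicted answer letters that match the gold letters."""
    correct = sum(p.strip().upper() == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def extract_final_number(text):
    """Return the last number in the text as a float, or None if absent."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def verify(response, reference, tol=1e-6):
    """Check a free-form response against a numeric reference answer."""
    predicted = extract_final_number(response)
    return predicted is not None and abs(predicted - reference) <= tol


# Hypothetical model outputs, used only to exercise the helpers.
acc = multiple_choice_accuracy(["a", "C ", "B"], ["A", "B", "B"])
ok = verify("First 12 * 4 = 48, then 48 + 2 = 50. The answer is 50.", 50)
print(round(acc, 3), ok)  # prints "0.667 True"
```

The verifier only works because the reference answer is a single number that can be checked mechanically, which is why this approach is limited to verifiable domains such as math.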
