The author continues an annual tradition by curating and categorizing notable LLM research papers from January to May 2026, covering architecture, training, inference efficiency, reasoning, reinforcement learning, agent systems, and more, with emphasis on hybrid architecture trends and representative works like Nemotron 3.
The list focuses on reasoning models, RL, efficient inference, agent systems, and other trending areas.
Hybrid architectures (e.g., alternating Mamba and attention layers) are a key 2026 trend.
From Gemma 4 to DeepSeek V4, this article explores how new open-weight LLMs are reducing long-context costs through architectures like cross-layer KV sharing, per-layer embeddings, attention budgeting, compressed convolutional attention, and mHC.
Gemma 4 introduces cross-layer KV sharing, cutting KV cache size in half while maintaining quality.
Per-layer embeddings boost model capacity with minimal computational overhead.
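The "half the KV cache" claim can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: the layer count, head count, head dimension, and context length are made-up values (not Gemma 4's actual config), and `share_group=2` assumes that pairs of consecutive layers reuse one set of keys and values, which is one plausible way to get a 2x reduction.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, share_group=1):
    """Estimate the KV cache size in bytes for a single sequence.

    share_group is a hypothetical knob: the number of consecutive layers
    that reuse one set of keys/values (1 means no cross-layer sharing).
    """
    # Number of layers that actually store their own K and V (ceil division)
    n_kv_layers = -(-n_layers // share_group)
    # Factor 2 accounts for storing both keys and values
    return 2 * n_kv_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Illustrative values only; bytes_per_elem=2 corresponds to 16-bit storage
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=131072)
shared = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=131072,
                        share_group=2)

print(f"no sharing:   {baseline / 2**30:.1f} GiB")
print(f"pairs shared: {shared / 2**30:.1f} GiB")
```

Because the cache grows linearly with the number of layers that store their own keys and values, halving that number halves the cache regardless of the other settings.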
A learning-oriented workflow for understanding new open-weight model releases. It starts with the official technical reports but, because these have become less detailed, relies more heavily on config files and reference code.
Start with the official technical report, keeping in mind that these reports are often less detailed than they used to be.
Inspect the config files and reference implementations on the Hugging Face Model Hub.
An overview of the six core components that make coding agents effective, including live repo context, prompt caching, tool use, context management, session memory, and subagent delegation, explaining how these harness features enhance LLM performance in coding tasks.
Coding agents improve LLM performance through a harness that provides live repo context, structured tools, and memory management.
The six components are live repo context, prompt caching, tool access, context reduction, session memory, and subagent delegation.
A comprehensive review and comparison of ten open-weight large language model releases from January to February 2026, including Arcee Trinity, Kimi K2.5, Step 3.5 Flash, Qwen3-Coder-Next, GLM-5, MiniMax M2.5, Nanbeige 4.1, Qwen3.5, Ling 2.5, Tiny Aya, and an update on Sarvam. The article focuses on architectural similarities and differences, highlighting trends like hybrid attention, multi-token prediction, and mixture-of-experts.
Ten open-weight LLMs released in January and February 2026 are compared, with a focus on architecture.
Hybrid attention and multi-token prediction emerge as key efficiency trends.
Inference-time scaling is one of the most effective ways to improve answer quality in deployed LLMs. This article categorizes various inference-time scaling techniques and provides an overview of recent papers, including chain-of-thought prompting, self-consistency, best-of-N ranking, rejection sampling with a verifier, self-refinement, and search over solution paths. The author shares personal experiments from drafting a book chapter on the topic.
Inference-time scaling improves model performance by allocating more compute and time during inference.
Key methods include chain-of-thought prompting, self-consistency, best-of-N ranking, rejection sampling with a verifier, self-refinement, and search over solution paths.
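Two of the methods named above, self-consistency and best-of-N, are simple enough to sketch in a few lines. This is a minimal illustration, not the article's code: the sampled answers are hard-coded stand-ins for model outputs, and `toy_score` is a hypothetical placeholder for a real reward model or verifier.

```python
from collections import Counter


def self_consistency(final_answers):
    """Majority vote over the final answers of several sampled completions."""
    counts = Counter(final_answers)
    # most_common(1) returns [(answer, count)] for the most frequent answer
    return counts.most_common(1)[0][0]


def best_of_n(candidates, score_fn):
    """Return the candidate that the scoring function rates highest."""
    return max(candidates, key=score_fn)


# Stand-ins for the final answers extracted from five sampled completions
sampled = ["42", "41", "42", "42", "7"]
voted = self_consistency(sampled)


def toy_score(text):
    """Hypothetical scorer: prefers responses that state an explicit answer."""
    return 1.0 if "Answer:" in text else 0.0


picked = best_of_n(["I think it is 42", "Answer: 42"], toy_score)
print(voted, "|", picked)
```

Both spend extra inference compute on multiple samples; they differ in how the final output is chosen (voting versus scoring).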
A comprehensive review of large language models in 2025, covering key developments like DeepSeek R1's reasoning via RLVR/GRPO, the rise of inference-time scaling and tool use, the problem of benchmark overfitting (benchmaxxing), and predictions for 2026 including diffusion models and broader RLVR applications.
DeepSeek R1, an open-weight reasoning model trained with RLVR/GRPO, dominated the year and shifted the focus to post-training scaling.
Inference-time scaling and tool use emerged as major drivers of LLM progress beyond traditional pre-training scaling.
The author shares a curated list of interesting research papers from July to December 2025, categorized by topics like reasoning models, reinforcement learning, and architectures, as a thank-you to supporters.
A curated list of research papers published from July to December 2025.
Papers are grouped by topic, including reasoning models, reinforcement learning, and architectures.
This article provides an in-depth analysis of DeepSeek V3.2's technical evolution, covering architectural changes (including the sparse attention mechanism DSA), reinforcement learning updates (such as GRPO improvements, self-verification and self-refinement), and the development of hybrid reasoning models. V3.2 matches the performance of GPT-5 and Gemini 3.0 Pro and is released as an open-weight model, making it a significant milestone.
DeepSeek V3.2 adopts the same sparse attention mechanism (DSA) as V3.2-Exp, greatly improving long-context efficiency.
Self-verification and self-refinement techniques from DeepSeekMath V2 are integrated, substantially enhancing mathematical reasoning capabilities.
This article explores alternatives to standard autoregressive decoder-style transformers for large language models, including linear attention hybrids, text diffusion models, code world models, and small recursive transformers. It analyzes the strengths and limitations of each approach and discusses their potential impact on efficiency, reasoning, and modeling performance.
Linear attention hybrids like Qwen3-Next and Kimi Linear use Gated DeltaNet to reduce computational complexity but must balance efficiency with reasoning accuracy.
Text diffusion models enable parallel token generation but suffer quality degradation and tool integration challenges, making them unlikely to replace autoregressive models soon.
This article explains four main approaches to evaluating large language models: multiple-choice benchmarks (like MMLU), verifiers for free-form answers, leaderboards based on user preferences (like Chatbot Arena), and LLM-as-a-judge evaluations. It includes from-scratch code implementations and discusses the trade-offs of each method.
Multiple-choice benchmarks test knowledge recall but don't reflect real-world use.
Verifiers allow free-form answers but require verifiable domains like math.
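To make the verifier idea concrete, here is a minimal sketch of a math answer checker. It is not the article's from-scratch implementation: the `Answer:` marker is an assumed output convention, and `extract_final_answer` and `verify` are hypothetical helper names. The check compares values rather than strings, so `0.5` and `1/2` count as the same answer.

```python
from fractions import Fraction


def extract_final_answer(response, marker="Answer:"):
    """Return the text after the last marker, or None if the marker is absent."""
    idx = response.rfind(marker)
    if idx == -1:
        return None
    # Drop surrounding whitespace and a trailing sentence period
    return response[idx + len(marker):].strip().rstrip(".")


def verify(response, reference):
    """Check a free-form response against a reference answer."""
    answer = extract_final_answer(response)
    if answer is None:
        return False
    try:
        # Fraction parses integers, decimals, and "a/b" strings exactly
        return Fraction(answer) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to an exact string comparison
        return answer == reference.strip()


print(verify("Half of one is one half. Answer: 0.5", "1/2"))
print(verify("Answer: 3", "4"))
```

This only works because the domain has a checkable ground truth, which is the limitation the takeaway above points to.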