Scaling laws are one of the most critical empirical findings in deep learning, describing power-law relationships between model size, data, compute, and loss. This article reviews the development from early theory to modern empirical studies, including Kaplan et al.'s classic scaling laws and the Chinchilla scaling laws, and discusses key findings such as compute-optimal allocation.
Scaling laws show that training loss decreases as a power law with model size, data size, and compute.
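Kaplan et al. found that, for a fixed compute budget, model size should grow faster than dataset size; the Chinchilla study later revised this, finding that model size and training tokens should be scaled in roughly equal proportion.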
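The power-law relationship and the compute-optimal trade-off summarized above can be illustrated with a small numerical sketch. It assumes an additive parametric loss of the form L(N, D) = E + A/N^alpha + B/D^beta, the kind of form used in the Chinchilla analysis, together with the common approximation C ≈ 6·N·D for training compute. The constants below are hypothetical placeholders chosen for readability, not fitted values from any paper, and the function names are illustrative.

```python
def parametric_loss(n_params, n_tokens, e=1.7, a=400.0, b=400.0,
                    alpha=0.3, beta=0.3):
    """Additive power-law loss in model size N and training tokens D.

    All constants are hypothetical placeholders, not published fits.
    """
    return e + a / n_params ** alpha + b / n_tokens ** beta


def compute_optimal_split(compute_flops):
    """Grid-search the model size N that minimizes loss at fixed compute.

    Assumes C ~= 6 * N * D, so choosing N fixes D = C / (6 * N).
    Returns (n_params, n_tokens, loss) for the best grid point.
    """
    best = None
    for i in range(0, 601):  # N from 1e6 to 1e12 on a log-spaced grid
        n_params = 10 ** (6 + i / 100)
        n_tokens = compute_flops / (6 * n_params)
        loss = parametric_loss(n_params, n_tokens)
        if best is None or loss < best[2]:
            best = (n_params, n_tokens, loss)
    return best


if __name__ == "__main__":
    for flops in (6e18, 6e20, 6e22):
        n, d, loss = compute_optimal_split(flops)
        print(f"C={flops:.0e}  N={n:.2e}  D={d:.2e}  loss={loss:.3f}")
```

Because the placeholder exponents for N and D are equal here, the optimum grows N and D at the same rate as compute increases, which mirrors the Chinchilla-style conclusion. Choosing a smaller exponent for D than for N would instead shift the optimum toward more data per parameter.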
Reward hacking occurs when a reinforcement learning agent exploits flaws or ambiguities in the reward function to achieve high rewards without genuinely learning or completing the intended task. With the rise of language models and RLHF, reward hacking has become a critical practical challenge. This article covers the definition, types, causes, and potential mitigations of reward hacking.
Reward hacking is the exploitation of reward function flaws by RL agents.
In RLHF, reward hacking can lead models to generate outputs that look correct to the reward model or to human raters but are factually wrong.
This article by Lilian Weng focuses on extrinsic hallucinations in large language models, where models generate fabricated content not grounded in provided context or world knowledge. It explores causes such as pre-training data issues and fine-tuning new knowledge, discusses detection methods including retrieval-augmented evaluation and sampling-based approaches, and presents anti-hallucination techniques like RAG, chain-of-verification, sampling adjustments, and fine-tuning for factuality and attribution.
Extrinsic hallucination refers to model outputs that are fabricated and not grounded in pre-training data or world knowledge.
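Fine-tuning on new knowledge can increase the tendency to hallucinate: examples containing unknown facts are learned more slowly than known ones, and once they are eventually learned the model hallucinates more.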
Diffusion models have demonstrated strong results on image synthesis in recent years. The research community has now started working on a harder task: using them for video generation. The task is a superset of the image case, since an image is a video of one frame, and it is much more challenging for two reasons. First, it requires temporal consistency across frames, which naturally demands that more world knowledge be encoded into the model. Second, compared with text or images, it is harder to collect large amounts of high-quality, high-dimensional video data, let alone text-video pairs.
Video generation is a superset of image generation, requiring temporal consistency and more world knowledge.
Main architectures include 3D U-Net (VDM, Imagen Video) and Diffusion Transformer (Sora).
High-quality data is the fuel for modern deep learning model training. This article explores how to collect high-quality data through human annotation, including task design, rater selection and training, data aggregation, and quality assurance. It covers the wisdom of the crowd, methods for measuring rater agreement (e.g., Cohen's Kappa, MACE), and two annotation paradigms (descriptive vs. prescriptive). Additionally, it discusses techniques to identify mislabeled data using influence functions, training dynamics (e.g., data maps, forgetting events, AUM), and noisy cross-validation.
High-quality human data requires careful task design, rater selection, training, and aggregation.
Aggregation methods such as majority voting combine labels from multiple raters, while agreement measures such as Cohen's Kappa help assess annotation quality.
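As a minimal sketch of the two ideas above, the following computes Cohen's Kappa for two raters, using the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance, alongside a simple majority vote. The function names and the toy labels are illustrative assumptions, not taken from the article.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def majority_vote(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    rater_a = ["pos", "pos", "neg", "neg"]
    rater_b = ["pos", "neg", "neg", "neg"]
    print(cohens_kappa(rater_a, rater_b))        # 0.5
    print(majority_vote(["cat", "dog", "cat"]))  # cat
```

In the toy example the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, giving kappa = 0.5. Kappa therefore reports agreement beyond chance, which raw percent agreement does not.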
A comprehensive survey of adversarial attacks on large language models, covering threat models, attack types including token manipulation, gradient-based attacks, jailbreak prompting, and red-teaming techniques. The article discusses the challenges and methods for both black-box and white-box settings.
LLMs with safety alignment are vulnerable to adversarial inputs that trigger undesired outputs.
Attacks range from simple token swaps to sophisticated gradient-based optimization.
This article explores autonomous agents powered by large language models (LLMs) as their core controller. The system comprises three main components: planning (task decomposition and self-reflection), memory (short-term via in-context learning, long-term via external vector stores), and tool use (calling external APIs). It covers case studies like ChemCrow and Generative Agents, proof-of-concepts such as AutoGPT, GPT-Engineer, and BabyAGI, and discusses challenges like finite context windows.
The LLM serves as the core controller of an autonomous agent, complemented by planning, memory, and tool use.
Planning involves subgoal decomposition and self-reflection for complex tasks.
This article provides a comprehensive overview of prompt engineering methods for large language models, covering basic prompting, instruction prompting, self-consistency sampling, chain-of-thought prompting, automatic prompt design, and augmented language models.
Prompt engineering steers LLM behavior without updating model weights.
Zero-shot and few-shot learning are foundational methods; few-shot often improves performance at higher token cost.
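The trade-off between zero-shot and few-shot prompting can be made concrete with a small prompt builder. This is only a sketch: the "Text:/Sentiment:" template, the function name, and the whitespace word count used as a rough stand-in for token count are all assumptions for illustration, and real tokenizers will count differently.

```python
def build_prompt(instruction, query, demonstrations=()):
    """Assemble a zero-shot or few-shot classification prompt.

    With no demonstrations the prompt is zero-shot; each (text, label)
    pair adds one worked example before the query.
    """
    parts = [instruction]
    for text, label in demonstrations:
        parts.append(f"Text: {text}\nSentiment: {label}")
    # The query is left open so the model completes the label.
    parts.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    instruction = "Classify the sentiment of the text."
    demos = [("Great film", "positive"), ("Terrible plot", "negative")]
    zero_shot = build_prompt(instruction, "I loved it")
    few_shot = build_prompt(instruction, "I loved it", demos)
    # Whitespace word count as a crude proxy for token cost.
    print(len(zero_shot.split()), len(few_shot.split()))
    print(few_shot)
```

Every demonstration lengthens the prompt, so the extra guidance from few-shot examples is paid for on each request.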
This article is a major update to Lilian Weng's 2020 post on the Transformer family, roughly doubling its length. It systematically reviews numerous improvements to the Transformer architecture, covering attention mechanisms, positional encoding, long-context support, adaptive modeling, and efficient attention, with examples such as Transformer-XL, rotary position embedding (RoPE), ALiBi, and the Universal Transformer.
The new version restructures the section hierarchy and incorporates papers from the three years since the original post.
Various positional encoding methods are detailed, including sinusoidal, learned, relative, and Rotary position embeddings.
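Of the positional encodings listed above, the sinusoidal variant is the simplest to write down: position pos and even dimension index 2i map to sin(pos / 10000^(2i/d)), and the following odd dimension maps to the cosine of the same angle. The sketch below uses only the standard library; the function name and the small table size are illustrative assumptions.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of sinusoidal encodings.

    Even columns hold sin(pos / base^(2i/d_model)); the next odd column
    holds the cosine of the same angle.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for even_dim in range(0, d_model, 2):  # even_dim plays the role of 2i
            angle = pos / (base ** (even_dim / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


if __name__ == "__main__":
    for row in sinusoidal_positional_encoding(num_positions=3, d_model=4):
        print([round(x, 4) for x in row])
```

Each pair of columns oscillates at a different frequency, so nearby positions receive similar vectors while distant positions remain distinguishable. Learned, relative, and rotary schemes replace or modify this fixed table in different ways.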
A comprehensive overview of techniques to optimize inference for large transformer models, including distillation, quantization, pruning, sparsity, mixture-of-experts, and architectural improvements. The article discusses challenges such as memory footprint and low parallelizability, and presents methods to reduce memory usage, computation, and latency.
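The KV cache must be held in memory during decoding and can dominate the memory footprint: for a batch size of 512 and a context length of 2048, it can total about 3TB, roughly three times the model size.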
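Distillation can shrink a model substantially with little performance loss; DistilBERT, for example, reduces the size of BERT by 40% while retaining 97% of its language understanding capabilities.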
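To see why the KV cache grows so quickly with batch size and context length, a back-of-the-envelope calculation helps. The sketch below assumes standard multi-head attention, where every layer stores one key vector and one value vector per head for every token of every sequence. The configuration used is hypothetical and does not correspond to any particular model, so it does not attempt to reproduce the 3TB figure.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, batch_size,
                   seq_len, bytes_per_value=2):
    """Approximate KV cache size for standard multi-head attention.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to 16-bit floats.
    """
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value
    return per_token * batch_size * seq_len


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for round numbers.
    size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128,
                          batch_size=8, seq_len=2048)
    print(size / 1024 ** 3)  # 8.0 (GiB)
```

The size scales linearly with both batch size and sequence length, so raising the batch size from 8 to 512 in this hypothetical setup multiplies the cache by 64. Techniques that share keys and values across heads or quantize the cache target exactly this term.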