EleutherAI researchers introduce reasoning interpolation, a technique to detect early signs of reward hacking in reinforcement learning. It uses fine-tuned donor models to generate natural exploit-eliciting reasoning prefixes and importance sampling to estimate hack probabilities. While absolute estimates are unreliable early in training, the trend in importance sampling predictions achieves perfect AUC in a controlled setting, suggesting promise as a monitoring signal for RL safety.
Reasoning interpolation generates natural reasoning prefixes that elicit reward hacking by fine-tuning a donor model on exploits without reasoning tokens.
Importance sampling underestimates absolute hack rates by orders of magnitude early in training, but the trend is highly predictive of which exploit types will emerge.
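The importance sampling step can be illustrated with a generic estimator. This is a minimal sketch, not the authors' implementation: it assumes reasoning prefixes are drawn from a donor (proposal) distribution, that log-probabilities of each prefix under both the donor and the policy being monitored are available, and that each sample carries a 0/1 indicator of whether the completion hacked. The function name and the numbers are hypothetical.

```python
import math

def importance_sampling_estimate(samples):
    """Estimate P(hack) under the policy from donor-sampled prefixes.

    samples: list of (logp_policy, logp_donor, hacked) tuples, where
    hacked is 1.0 if the completion exploited the reward and 0.0 otherwise.
    """
    total = 0.0
    for logp_policy, logp_donor, hacked in samples:
        # Importance weight: how much more (or less) likely the policy is
        # to produce this prefix than the donor that generated it.
        weight = math.exp(logp_policy - logp_donor)
        total += weight * hacked
    return total / len(samples)

# Hypothetical values for illustration only.
samples = [
    (math.log(0.01), math.log(0.5), 1.0),
    (math.log(0.02), math.log(0.5), 0.0),
]
estimate = importance_sampling_estimate(samples)
```

Tracking such an estimate across checkpoints, rather than reading its absolute value, matches the summary's point that the trend is the useful signal.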
EleutherAI announces Deep Ignorance, a study showing that filtering pretraining data keeps models from acquiring unsafe knowledge (e.g., biorisk) without degrading general performance, and that this effect is robust to fine-tuning attacks. Using string blocklists and ML classifiers, the authors trained 6.9B-parameter models from scratch and found that filtered models resist tampering, though they can still make use of information supplied in context. The paper proposes data filtering as a foundational layer for open-weight model safety.
Filtering pretraining data reduces biorisk knowledge to near random chance without degrading general benchmarks.
Filtered models are tamper-resistant: even fine-tuning on expert biorisk data does not restore baseline performance.
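The string-blocklist stage of such a filtering pipeline might look like the sketch below. It is illustrative only: the blocklist terms are placeholders, the function names are hypothetical, and the ML classifier stage mentioned in the summary is not modeled here.

```python
def hits_blocklist(document, blocklist):
    """Return True if any blocklisted term occurs in the document (case-insensitive)."""
    text = document.lower()
    return any(term.lower() in text for term in blocklist)

def filter_corpus(documents, blocklist):
    """Keep only documents that contain no blocklisted term."""
    return [doc for doc in documents if not hits_blocklist(doc, blocklist)]

# Placeholder terms; a real blocklist would be curated by domain experts.
BLOCKLIST = ["placeholder-agent-x", "placeholder-toxin-y"]

corpus = [
    "A recipe for sourdough bread.",
    "Notes on Placeholder-Agent-X handling.",
]
kept = filter_corpus(corpus, BLOCKLIST)
```

Substring matching is cheap enough to run over a full pretraining corpus, which is one reason a blocklist is a natural first pass before a more expensive classifier.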
Attention probes are a novel method for classifying internal states of language models by using an attention layer to aggregate hidden states, avoiding pooling. Multi-head variants (especially 8 heads) outperform mean probes on most datasets, and training code is open-source.
Attention probes use an attention layer with learnable position bias to aggregate hidden states instead of pooling.
Multi-head attention probes (8 heads) outperform mean and last-token probes on most datasets.
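The aggregation step can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the released training code: each head has a query vector and a value vector, the position bias is modeled as a per-head scalar multiplied by the token index, and all names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_probe_logit(hidden, queries, values, pos_bias, bias=0.0):
    """Compute a scalar probe logit from a sequence of hidden states.

    hidden:   list of T hidden-state vectors
    queries:  one query vector per head (scores each token)
    values:   one value vector per head (maps each token to a scalar)
    pos_bias: one scalar per head, multiplied by token index (assumed form)
    """
    logit = bias
    for q, v, pb in zip(queries, values, pos_bias):
        scores = [dot(h, q) + pb * t for t, h in enumerate(hidden)]
        weights = softmax(scores)
        # Attention-weighted sum of per-token value projections.
        logit += sum(w * dot(h, v) for w, h in zip(weights, hidden))
    return logit

hidden = [[1.0, 0.0], [3.0, 0.0]]
mean_like = attention_probe_logit(hidden, [[0.0, 0.0]], [[1.0, 0.0]], [0.0])
last_like = attention_probe_logit(hidden, [[0.0, 0.0]], [[1.0, 0.0]], [50.0])
```

In this sketch, a zero query with zero position bias gives uniform weights and so reduces to a mean probe, while a large positive position bias concentrates the weight on the final token and approximates a last-token probe. The attention probe can therefore represent both baselines and interpolate between them.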
EleutherAI researchers tested local volume measurement for detecting model misalignment and anomalous datapoints, found it uncompetitive, and are pivoting to data attribution.
Local volume measurement estimates behavioral sensitivity to weight perturbations.
On the POSER benchmark, weight perturbations were far less effective than activation perturbations for detecting misaligned models.
In this post, we will study inductive biases of the parameter-function map of random neural networks using star domain volume estimates. This builds on the ideas introduced in "Estimating the Probability of Sampling a Trained Neural Network at Random" and "Neural Redshift: Random Networks are not Random Functions" (henceforth NRS).
Inductive biases are crucial for generalization but difficult to capture with a single measure.
Star domain local volume estimation is used to probe parameter-function map geometry at initialization.
EleutherAI announces the Common Pile v0.1, an 8TB dataset of public domain and openly licensed text, aiming to promote transparency and open science in AI research. The dataset was built in collaboration with multiple institutions, and the trained Comma v0.1 models perform comparably to those trained on unlicensed data.
The Common Pile v0.1 is an 8TB dataset of openly licensed and public domain text, released by EleutherAI and partners.
It addresses the lack of transparency in AI training data, enabling reproducible research and accountability.
EleutherAI explores using Product Key Memory (PKM) to improve sparse coders. PKM transcoders train faster and are slightly more interpretable than TopK transcoders for moderate expansion factors, though they underperform at extreme scales.
PKM transcoders train faster and achieve competitive reconstruction loss for expansion factors up to 256x.
PKM reduces encoder parameters by decomposing the input dimension, enabling quicker forward passes.
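The decomposition can be sketched with the standard product-key lookup: the input is split into two halves, each half is scored against its own small table of sub-keys, and full latent scores are formed as sums of one score from each table. This is a generic sketch of product-key retrieval, not the post's transcoder code, and the function and variable names are hypothetical.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pkm_topk(query, subkeys_1, subkeys_2, k):
    """Return the top-k (score, latent_index) pairs via product keys.

    With n sub-keys per table, this addresses n * n latents while storing
    only 2 * n half-width keys.
    """
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    s1 = [dot(q1, key) for key in subkeys_1]
    s2 = [dot(q2, key) for key in subkeys_2]
    # Top-k candidates from each half; the best full keys must combine these.
    top1 = heapq.nlargest(k, range(len(s1)), key=s1.__getitem__)
    top2 = heapq.nlargest(k, range(len(s2)), key=s2.__getitem__)
    candidates = [
        (s1[i] + s2[j], i * len(subkeys_2) + j)
        for i in top1
        for j in top2
    ]
    return heapq.nlargest(k, candidates)

subkeys_1 = [[1.0, 0.0], [0.0, 1.0]]
subkeys_2 = [[1.0, 0.0], [0.0, 1.0]]
result = pkm_topk([2.0, 0.0, 0.0, 3.0], subkeys_1, subkeys_2, k=1)
```

Because only the two small tables are scored densely, the encoder cost grows with the square root of the number of latents rather than linearly, which is consistent with the faster forward passes described above.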
Research shows that TopK sparse autoencoders (SAEs) trained on the same data with different random seeds share only about 53% of learned features. Many unshared latents are interpretable. Narrower SAEs have higher feature overlap, while larger SAEs show decreased overlap, consistent with feature splitting and absorption phenomena.
Only ~53% of features are shared between independently trained SAEs.
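One simple way to quantify overlap between two SAEs is to count the latents in one whose decoder direction has a close match in the other. The sketch below uses maximum cosine similarity with an assumed threshold of 0.7; the matching procedure and threshold used in the actual study may differ, and the toy vectors are made up.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def shared_fraction(decoder_a, decoder_b, threshold=0.7):
    """Fraction of latents in A with a decoder direction in B above the threshold."""
    shared = sum(
        1
        for u in decoder_a
        if max(cosine(u, v) for v in decoder_b) >= threshold
    )
    return shared / len(decoder_a)

# Toy decoder directions (one row per latent); values are illustrative only.
sae_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]
sae_b = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, -1.0]]
overlap = shared_fraction(sae_a, sae_b)
```

Note that this measure is asymmetric: the fraction of A's latents found in B need not equal the fraction of B's latents found in A, especially when the two SAEs differ in width.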
This work explores using natural language interpretations of sparse autoencoder (SAE) latents to simulate activations in LLMs. The authors find that current interpretations can identify less than 50% of active latents, and despite high specificity, the extreme imbalance between active and inactive latents leads to many false positives. Predicting activation values from interpretations shows only weak correlations. The results indicate that natural language interpretations are not yet reliable for simulating model activations.
Current interpretations of SAE latents identify less than 50% of active latents.
High specificity (90%) is insufficient because of class imbalance; specificity of 99.9% or higher is needed.
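The class-imbalance argument follows from the definition of precision. The sketch below works through it with hypothetical numbers: the prevalence (50 active latents out of 100,000) and the 50% sensitivity are assumptions chosen for illustration, not figures from the post.

```python
def precision(prevalence, sensitivity, specificity):
    """Fraction of latents predicted active that are actually active."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)

# Assumed setting: 50 of 100,000 latents are active on a given token.
prevalence = 50 / 100_000
precision_at_90 = precision(prevalence, sensitivity=0.5, specificity=0.90)
precision_at_999 = precision(prevalence, sensitivity=0.5, specificity=0.999)
```

Under these assumptions, 90% specificity still lets false positives outnumber true positives by a factor of several hundred, because inactive latents are so much more common. Raising specificity to 99.9% brings precision to roughly one in five, which is why such a high threshold is needed before simulated activations become usable.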