Research shows that LLM-as-a-judge panels suffer from correlated errors, drastically reducing their informational value. Tests with 9 frontier models from 7 families found only 2 effective independent votes, with accuracy 8-22 points lower than the ideal. The best single model matches or outperforms the full panel, and adding judges or using better aggregation helps little.
A panel of 9 LLMs effectively provides only about 2 independent votes; roughly 75% of nominal independence is lost due to correlated mistakes on the same items.
Actual accuracy falls 8-22 percentage points short of what independent voting would achieve; the best single judge matches or outperforms the full panel.
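The summary does not say how the "effective independent votes" figure was computed. As a rough illustration only, the sketch below uses the standard design-effect formula for equally correlated voters, n_eff = n / (1 + (n - 1) * rho). The formula choice and the function names are assumptions, not the paper's method.

```python
def effective_votes(n_judges: int, mean_correlation: float) -> float:
    """Design-effect estimate: how many independent votes n correlated judges are worth."""
    return n_judges / (1.0 + (n_judges - 1) * mean_correlation)


def implied_correlation(n_judges: int, n_effective: float) -> float:
    """Invert the formula above: average pairwise error correlation implied by n_eff."""
    return (n_judges / n_effective - 1.0) / (n_judges - 1)


# Under this (assumed) model, 9 judges worth about 2 votes imply rho of roughly 0.44.
rho = implied_correlation(9, 2.0)
print(round(rho, 4))
print(effective_votes(9, rho))
```

Under this model, adding judges helps little once the correlation is high, because n_eff approaches 1 / rho as n grows.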
When annotators disagree on a label, the disagreement itself carries signal—and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI and find that entropy correlation requires 20-50 annotators to converge, while KL divergence saturates at 10. Soft labels capture item-specific signal that label smoothing cannot replicate.
Fine-tuning NLI models on label distributions reveals metric-dependent saturation.
Entropy correlation converges with 20-50 annotators; KL divergence saturates at 10.
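A minimal sketch of the subsampling idea, assuming a soft label is the normalized vote histogram of k annotators drawn from a larger pool. The votes here are synthetic and the helper names are illustrative; this is not the authors' pipeline and uses no ChaosNLI data.

```python
import numpy as np


def soft_label(votes: np.ndarray, n_classes: int, eps: float = 1e-6) -> np.ndarray:
    """Turn integer class votes into a smoothed probability distribution."""
    counts = np.bincount(votes, minlength=n_classes).astype(float) + eps
    return counts / counts.sum()


def entropy(p: np.ndarray) -> float:
    return float(-(p * np.log(p)).sum())


def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    return float((p * np.log(p / q)).sum())


rng = np.random.default_rng(0)
# Synthetic pool of 100 votes over 3 NLI classes (assumed proportions).
full_votes = rng.choice(3, size=100, p=[0.6, 0.3, 0.1])
reference = soft_label(full_votes, 3)

for k in (5, 10, 20, 50):
    sample = rng.choice(full_votes, size=k, replace=False)
    estimate = soft_label(sample, 3)
    # Distance of the k-annotator estimate from the full-pool distribution.
    print(k, kl_divergence(reference, estimate), entropy(estimate))
```

Repeating the loop over many items and seeds would show how each metric stabilizes as k grows.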
Apple announces its third generation of Foundation Models, a family of five models built with Google, including on-device and server-based models, with a focus on privacy and new architectures like sparsely activated models and instruction-following pruning. The models power new Siri and intelligent tools, and show significant quality improvements in evaluations.
Apple introduces five new foundation models: two on-device (AFM 3 Core and AFM 3 Core Advanced) and three server-based (AFM 3 Cloud, ADM 3 Cloud for images, and AFM 3 Cloud Pro).
AFM 3 Core Advanced features a novel sparsely activated architecture that stores most weights in flash memory, enabling larger effective model sizes on-device.
Apple is showcasing new research at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 in Denver, June 3-7. The company is sponsoring the conference and presenting work on video generation, multimodal understanding, image compression, and more.
Apple will present multiple research papers at CVPR 2026, including STARFlow-V, AToken, and Velox.
Scheduled activities include keynote talks, invited talks, poster sessions, and booth presentations.
Streaming vision-language models (VLMs) continuously generate responses from an instruction and an online frame stream, a capability crucial for real-time visual assistants. Existing benchmarks focus on offline evaluation and miss metrics like proactiveness and consistency. VSAS-Bench introduces a new framework with over 18,000 dense annotations and synchronous/asynchronous protocols, and reveals that conventional VLMs can be adapted without extra training to outperform dedicated streaming models.
VSAS-Bench is the first benchmark to comprehensively evaluate streaming VLMs in real-time, emphasizing proactiveness and consistency.
It features over 18,000 temporally dense annotations across diverse domains and tasks.
Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model’s memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire context, incurring unbounded peak memory usage. Additionally, query-dependent eviction narrows the cache semantics to a single query, leading to failure cases in multi-turn conversations. In this paper, we introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and performs episode-specific KV cache eviction. Across three LongConvQA benchmarks (LongMemEval, Realtalk, and LoCoMo), EpiCache improves accuracy by up to 30%, achieves near full-cache accuracy under 4-6x compression, and reduces latency and peak memory by up to 2.4x and 3.7x, respectively.
EpiCache is a training-free KV cache management framework for long conversational QA under fixed memory budgets.
It limits cache growth via block-wise prefill and preserves topic context via episodic KV compression.
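The bounded-memory effect of block-wise prefill can be sketched as follows: process the context one block at a time and evict down to the budget after each block, so the cache never holds more than budget + block_size entries. The per-token scores are a placeholder assumption; EpiCache's episode clustering and episode-specific eviction are not reproduced here.

```python
def blockwise_prefill(token_scores, block_size, budget):
    """Return (kept positions, peak cache size) under a fixed cache budget.

    token_scores: one hypothetical importance score per token position.
    """
    cache = []  # list of (position, score) pairs
    peak = 0
    for start in range(0, len(token_scores), block_size):
        block = token_scores[start:start + block_size]
        cache.extend((start + i, score) for i, score in enumerate(block))
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Keep the highest-scoring entries, then restore positional order.
            cache = sorted(cache, key=lambda entry: entry[1], reverse=True)[:budget]
            cache.sort(key=lambda entry: entry[0])
    return [position for position, _ in cache], peak


scores = [float(i) for i in range(20)]  # toy scores: later tokens matter more
kept, peak = blockwise_prefill(scores, block_size=4, budget=6)
print(kept, peak)
```

Evicting only after the whole context is processed would instead let the peak grow with the context length, which is the failure mode the abstract describes.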
Apple researchers propose BalCapRL, a balanced reinforcement learning framework that jointly optimizes correctness, coverage, and linguistic quality for MLLM image captioning. By introducing GDPO-style reward-decoupled normalization and length-conditional reward masking, BalCapRL achieves significant gains: +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across LLaVA-1.5 and Qwen2.5-VL models.
Existing RL captioning methods trade off utility, coverage, and linguistic quality.
BalCapRL proposes multi-objective optimization across three core dimensions.
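One way to read "reward-decoupled normalization" is that each reward is standardized within the rollout group before the rewards are combined, so a reward with a large numeric range cannot dominate. The sketch below assumes that reading. The function names and toy rewards are illustrative, and length-conditional reward masking is not shown.

```python
import numpy as np


def decoupled_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Standardize each reward column within the group, then sum per rollout."""
    r = np.asarray(rewards, dtype=float)  # shape: (group_size, n_rewards)
    z = (r - r.mean(axis=0)) / (r.std(axis=0) + eps)
    return z.sum(axis=1)


def coupled_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Baseline: sum raw rewards first, then standardize the totals."""
    total = np.asarray(rewards, dtype=float).sum(axis=1)
    return (total - total.mean()) / (total.std() + eps)


# Toy group of 4 captions: column 0 is correctness in [0, 1],
# column 1 is a coverage score on a much larger scale.
group = [[1.0, 10.0], [0.0, 30.0], [1.0, 20.0], [0.0, 20.0]]
print(decoupled_advantages(group))
print(coupled_advantages(group))
```

In this toy group the coupled version prefers the incorrect caption with the highest coverage, while the decoupled version prefers a correct caption with average coverage.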
Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing “bottleneck” rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from “maximize sum” to “maximize consistency.” We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from neglecting difficult constraints to exploit easier objectives, RVPO improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, p < 0.001) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating that variance regularization mitigates constraint neglect across model scales without sacrificing general capabilities.
Current multi-objective RLHF methods use arithmetic mean reward aggregation, leading to constraint neglect.
RVPO penalizes reward variance via a SoftMin operator, encouraging consistency over sum maximization.
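The Taylor-expansion claim can be checked numerically. The sketch assumes the normalized form SoftMin_beta(r) = -(1/beta) * log(mean(exp(-beta * r))), whose second-order expansion is mean(r) - (beta/2) * var(r). The exact operator and its use inside RVPO's advantage aggregation may differ from this sketch.

```python
import numpy as np


def softmin(rewards, beta: float) -> float:
    """Normalized LogSumExp soft minimum, computed in a numerically stable way."""
    z = -beta * np.asarray(rewards, dtype=float)
    m = z.max()
    return float(-(m + np.log(np.mean(np.exp(z - m)))) / beta)


def mean_minus_variance(rewards, beta: float) -> float:
    """Second-order approximation: mean reward minus a variance penalty."""
    r = np.asarray(rewards, dtype=float)
    return float(r.mean() - 0.5 * beta * r.var())


balanced = [0.6, 0.6, 0.6]   # every objective satisfied equally
lopsided = [1.0, 1.0, -0.2]  # same mean, one neglected objective

for beta in (0.1, 1.0, 5.0):
    print(beta, softmin(balanced, beta), softmin(lopsided, beta),
          mean_minus_variance(lopsided, beta))
```

An arithmetic mean scores both reward vectors identically, while the soft minimum scores the lopsided one lower, and increasingly so as beta grows.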
Velox is a framework for learning latent representations of 4D objects from unstructured dynamic point clouds. It compresses spatiotemporal color point clouds into dynamic shape tokens, supervised by a 4D surface decoder for geometry and a Gaussian decoder for appearance. Strong performance is demonstrated on video-to-4D generation, 3D tracking, and cloth simulation.
Velox learns compressed 4D representations from unstructured dynamic point clouds.
It uses dynamic shape tokens with a 4D surface decoder for geometry and a Gaussian decoder for appearance.
Apple hosted a two-day workshop in early 2026 focusing on privacy-preserving machine learning and AI, bringing together researchers to discuss advances in federated learning, foundation model privacy, attacks and security, and more.
Apple emphasizes privacy as a fundamental right and the need for privacy-preserving AI research.
Workshop covered three key areas: Private Learning and Statistics, Foundation Models and Privacy, and Attacks and Security.
We propose HeadsUp, a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups. Our method employs an efficient encoder-decoder architecture that compresses input views into a compact latent representation, which is then decoded into UV-parameterized 3D Gaussians anchored to a neutral head template. This UV representation decouples the number of 3D Gaussians from the number and resolution of input images, enabling training with many high-resolution views. Trained on an internal dataset of over 10,000 subjects—an order of magnitude larger than existing datasets—HeadsUp achieves state-of-the-art reconstruction quality and generalizes to novel identities without test-time optimization. We analyze scaling behavior and demonstrate downstream applications for novel identity generation and expression-based animation.
HeadsUp uses a UV-parameterized 3D Gaussian representation to efficiently reconstruct heads from multi-view images.
An encoder-decoder architecture compresses the input views into a latent code and decodes it to Gaussians on a template, which decouples the Gaussian count from the number and resolution of input views.
Apple ML research proposes Text-Conditional JEPA (TC-JEPA), which uses image captions to reduce prediction uncertainty in masked feature prediction, leading to more semantically meaningful visual representations. It outperforms contrastive methods on diverse tasks, especially those requiring fine-grained understanding and reasoning.
TC-JEPA conditions on image captions to lower the uncertainty of predicting masked features, encouraging semantic representations.
A fine-grained text conditioner modulates predicted patch features via sparse cross-attention over text tokens.
Apple ML Research presents a paper at CVPR 2026 that systematically studies key modeling choices for practical learned image codecs jointly optimized for perceptual quality and runtime. Using performance-aware neural architecture search, the authors build a new codec that achieves 2.3–3x bitrate savings over traditional codecs like AV1 and VVC, and 20–40% over learned alternatives, with encoding/decoding times of 230 ms/150 ms on iPhone 17 Pro Max for 12MP images.
Comprehensive study of key modeling choices including novel techniques for practical learned image codecs.
Performance-aware neural architecture search over millions of backbones to optimize for on-device runtime.
SpecMD is a standardized framework developed by Apple researchers to benchmark and evaluate expert caching policies for Mixture-of-Experts (MoE) models. The study reveals that MoE expert access patterns do not follow temporal locality, leading to the proposal of a new eviction policy called Least-Stale, which reduces collision misses by up to 85× compared to LRU and achieves 88% hit rates with 34.7% reduction in time-to-first-token on OLMoE.
SpecMD provides a standardized benchmarking framework for MoE expert caching policies across different hardware configurations.
The study finds that MoE expert access patterns are inconsistent with the temporal locality assumptions behind policies like LRU and LFU.
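To see why a recency-based policy depends on temporal locality, the sketch below simulates a plain LRU cache on two synthetic access traces. The traces and the capacity are assumptions for illustration; this is not SpecMD, and the Least-Stale policy is not reproduced because the summary does not specify it.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity: int) -> float:
    """Fraction of accesses served from an LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    for expert in accesses:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[expert] = True
    return hits / len(accesses)


# Trace with strong temporal locality: each expert is reused 20 times in a row.
local_trace = [i // 20 for i in range(100)]
# Trace without it: 5 experts visited round-robin, one more than the cache holds.
cyclic_trace = [i % 5 for i in range(100)]

print(lru_hit_rate(local_trace, capacity=4))
print(lru_hit_rate(cyclic_trace, capacity=4))
```

On the round-robin trace LRU always evicts the expert that is needed next, so every access misses even though the cache holds four of the five experts.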
Apple ML Research introduces iTARFlow, an iterative denoising approach that enhances Normalizing Flows for image generation, achieving competitive results on ImageNet at multiple resolutions.
iTARFlow combines autoregressive generation with iterative denoising.
It maintains a likelihood-based objective during training, unlike diffusion models.
Apple ML Research introduces SFI-Bench, a video-based benchmark with over 1,700 questions to evaluate multimodal LLMs on spatial and functional reasoning. Experiments show current models struggle to integrate spatial memory with functional knowledge, highlighting a critical bottleneck.
SFI-Bench goes beyond geometric perception to evaluate functional understanding.
Tasks include conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting.
Apple ML Research proposes Stochastic KV Routing, a training technique where layers randomly attend to their own or a preceding layer's KV states, enabling adaptive depth-wise cache sharing. This reduces KV cache memory footprint without throughput loss and can even improve performance in data-constrained settings.
KV cache memory is a major bottleneck for serving LLMs.
Previous work focuses on temporal compression; this work exploits the depth dimension.
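A minimal sketch of the routing decision alone, assuming each layer independently reads either its own KV states or those of the immediately preceding layer with a fixed probability during training. The probability, the restriction to the immediate predecessor, and the function name are all assumptions; no attention computation is shown.

```python
import random


def sample_kv_sources(n_layers: int, share_prob: float, rng: random.Random) -> list:
    """For each layer, pick the layer whose KV states it attends to this step."""
    sources = [0]  # the first layer has no predecessor to borrow from
    for layer in range(1, n_layers):
        use_previous = rng.random() < share_prob
        sources.append(layer - 1 if use_previous else layer)
    return sources


rng = random.Random(0)
for step in range(3):
    # A fresh routing pattern would be drawn at every training step.
    print(sample_kv_sources(6, share_prob=0.5, rng=rng))
```

Because a model trained this way has seen both options at every layer, a fixed sharing pattern can be chosen at serving time, and layers that borrow KV states need no cache of their own.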
Apple and Purdue University propose PORTool, a policy optimization algorithm that uses a rewarded rollout tree and step-level importance estimates to address credit assignment in multi-tool reasoning, improving accuracy while reducing tool calls.
PORTool generates a rewarded rollout tree to compare alternative tool-use decisions in the same context.
Step importance is estimated via a correctness-dominant signal and auxiliary execution success signal.
Apple researchers propose an inference-time evaluation method that introduces a reviewer agent to assess provisional tool calls before execution, enabling real-time error correction. Evaluated on BFCL and τ2-Bench, the method yields improvements of 5.5% and 7.1%, respectively. The work also introduces Helpfulness-Harmfulness metrics to quantify the tradeoff of corrections.
Moves evaluation from post-hoc analysis into the inference-time execution loop for real-time correction.
Introduces Helpfulness-Harmfulness metrics to quantify net benefits of reviewer feedback.
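The control flow of reviewing a provisional tool call before execution can be sketched as a small loop. Everything here is hypothetical: `ToolCall`, `run_with_review`, the revision limit, and the toy proposer, reviewer, and executor stand in for model calls and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict


def run_with_review(
    propose: Callable[[Optional[str]], ToolCall],
    review: Callable[[ToolCall], Optional[str]],
    execute: Callable[[ToolCall], str],
    max_revisions: int = 2,
) -> str:
    """Let a reviewer inspect each provisional call; execute once it is accepted."""
    call = propose(None)
    for _ in range(max_revisions):
        feedback = review(call)  # None means the reviewer accepts the call
        if feedback is None:
            break
        call = propose(feedback)  # revise using the reviewer's feedback
    # Note: if revisions run out, the last proposal is executed unreviewed.
    return execute(call)


# Toy stand-ins for the agent, the reviewer, and the tool runtime.
def propose(feedback: Optional[str]) -> ToolCall:
    return ToolCall("get_weather", {"city": "Paris" if feedback else ""})


def review(call: ToolCall) -> Optional[str]:
    return None if call.arguments.get("city") else "missing required argument: city"


def execute(call: ToolCall) -> str:
    return f"{call.name}({call.arguments})"


result = run_with_review(propose, review, execute)
print(result)
```

A reviewer can also turn a correct call into a wrong one, which is the tradeoff the Helpfulness-Harmfulness metrics are meant to quantify.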
Apple is presenting new research at the annual ICASSP, taking place in Barcelona, Spain, from May 4 to 8, 2026. The company is sponsoring the conference and will showcase three papers covering audio-visual speech models, spatial audio generation, and speculative decoding.
Apple will present three research papers at ICASSP 2026 covering multilingual self-supervised speech models, object-aware stereo audio generation, and principled coarse-grained acceptance for speculative decoding.
Apple's booth #P2 will be open on May 4 from 19:00-21:30 and May 5-8 from 09:00-17:00 (CEST).
Researchers at Apple and Gallaudet University developed a pseudo-annotation pipeline to address the scarcity of high-quality annotated sign language data. Their approach uses a fingerspelling recognizer, isolated sign recognizer, and a K-Shot LLM to generate likely annotations from signed video and English input. They achieved state-of-the-art results on FSBoard (6.7% CER) and ASL Citizen (74% top-1 accuracy) and are releasing nearly 500 human-annotated videos and over 300 hours of pseudo-annotations.
Lack of annotated data limits AI sign language interpretation; new datasets like ASL STEM Wiki and FLEURS-ASL have hundreds of hours but are underutilized due to annotation costs.
The pipeline uses a fingerspelling recognizer, isolated sign recognizer (ISR), and K-Shot LLM to produce ranked annotations with time intervals.
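For readers unfamiliar with the fingerspelling metric, character error rate (CER) is conventionally the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below shows that standard definition; the paper's exact normalization and text preprocessing are not stated in the summary, and the example strings are made up.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_char != hyp_char),   # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / len(reference)


# One dropped letter out of nine reference characters.
print(character_error_rate("GALLAUDET", "GALAUDET"))
```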
Apple ML Research presents STARFlow-V, a normalizing flow-based video generator offering end-to-end learning, robust causal prediction, and native likelihood estimation. It uses a global-local architecture in spatiotemporal latent space, flow-score matching, and video-aware Jacobi iteration to achieve strong visual fidelity and temporal consistency, providing the first evidence that normalizing flows can produce high-quality autoregressive video generation.
STARFlow-V is a normalizing flow-based video generation model challenging diffusion model dominance.
It employs a global-local architecture to reduce error accumulation and supports text/image/video-to-video generation.
Apple ML Research introduces DSO (Direct Steering Optimization), a reinforcement learning-based method that learns linear transformations to steer model activations, effectively mitigating bias in VLMs and LLMs. It achieves state-of-the-art fairness-capabilities trade-off with inference-time control.
DSO uses reinforcement learning to learn linear transformations for steering activations at inference time to mitigate bias.
It achieves state-of-the-art trade-off between fairness and performance on both VLMs and LLMs.
Apple Machine Learning Research introduces Sonata, a lightweight adapter that uses self-consistency prediction to dynamically allocate thinking budgets during inference, reducing thinking tokens by 20-80% while maintaining accuracy, or improving accuracy by up to 5% with the same token cost.
Sonata uses self-consistency as a proxy to determine when extended thinking is needed.
It is a lightweight adapter that predicts self-consistency during query prefilling to allocate thinking budgets dynamically.
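A minimal sketch of the two ingredients, assuming self-consistency is the share of sampled answers that agree with the majority answer, and that a lower predicted value earns a larger thinking budget. The linear allocation rule and the token limits are hypothetical, and the adapter that predicts self-consistency from the prefill is not modeled.

```python
from collections import Counter


def self_consistency(answers) -> float:
    """Share of sampled answers that match the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def thinking_budget(predicted_consistency: float,
                    min_tokens: int = 256, max_tokens: int = 4096) -> int:
    """Hypothetical rule: interpolate linearly, more tokens when consistency is low."""
    c = min(max(predicted_consistency, 0.0), 1.0)
    return int(round(max_tokens - c * (max_tokens - min_tokens)))


samples = ["42", "42", "41", "42"]  # toy sampled answers to one query
score = self_consistency(samples)
print(score, thinking_budget(score))
```

In the described setup the score would be predicted before generation rather than measured from samples, which is what avoids the cost of sampling several answers per query.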
LaDiR combines a VAE with latent diffusion to enable iterative refinement of reasoning trajectories, improving accuracy, diversity, and interpretability over autoregressive methods.
LaDiR uses a VAE to encode reasoning steps into latent thought blocks.
Latent diffusion with blockwise bidirectional attention enables holistic refinement.
Apple researchers present StereoFoley, a framework that generates semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz from video. Using a synthetic data pipeline and fine-tuning, it achieves object-aware stereo imaging.
StereoFoley is the first end-to-end framework for object-aware stereo video-to-audio generation.
It overcomes dataset limitations via a synthetic data pipeline with object tracking and dynamic panning.
The paper investigates how conditional diffusion models achieve compositional generalization, specifically length generalization—generating images with more objects than seen during training. Experiments on CLEVR show success in some cases but not all, pointing to the importance of local conditional scores. The authors prove an equivalence between compositional structure and local conditional scores, and demonstrate that enforcing locality enables generalization in failing models. Analysis of SDXL reveals spatial locality but absent conditional locality in pixel space, yet evidence of local conditional scores in feature space.
Length generalization in conditional diffusion is achievable in some settings but not universally.
Local conditional scores are a key mechanism for compositional generalization, equivalent to conditional projective composition.