A study audits video, image, and audio deepfake benchmarks using linear probes on frozen self-supervised representations, finding that general-purpose representations can approach bespoke detector performance, suggesting benchmarks may reward general modality understanding rather than forensic skills.
Deepfake detectors excel on benchmarks but fail in real-world scenarios.
Linear probes on frozen self-supervised representations approximate bespoke detector performance.
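As an illustration of the probing setup described above, here is a minimal sketch of a linear probe: a logistic-regression classifier trained on fixed feature vectors while the encoder that produced them is never updated. The encoder is not modeled; `fake_embedding` is a hypothetical stand-in that draws synthetic vectors, and the learning rate, epoch count, and dimensions are arbitrary assumptions, not values from the study.

```python
import math
import random


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights on fixed (frozen) feature vectors."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = p - y                     # gradient of the log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Only the probe parameters are updated; the features never change.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (fake) or 0 (real) from the sign of the linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical stand-in for frozen self-supervised embeddings.
rng = random.Random(0)


def fake_embedding(label):
    center = 1.0 if label == 1 else -1.0
    return [center + rng.gauss(0, 0.5) for _ in range(4)]


labels = [i % 2 for i in range(200)]
features = [fake_embedding(y) for y in labels]

w, b = train_linear_probe(features[:150], labels[:150])
held_out = list(zip(features[150:], labels[150:]))
accuracy = sum(predict(w, b, x) == y for x, y in held_out) / len(held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Because the probe has only one weight per feature dimension plus a bias, any accuracy it reaches has to come from information already present in the frozen representation.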
This paper proposes a differentiable architecture search method to automatically discover the optimal fusion scheme for combining image and prompt tokens in visual prompt tuning. The approach jointly optimizes learnable prompts and their fusion mechanisms, introducing affine transformation and cross-attention as new fusion schemes. Experiments on 34 datasets demonstrate consistent improvements over baselines, revealing that hybrid fusion better leverages layer semantics in Vision Transformers.
Formulates prompt fusion scheme selection as a bi-level optimization problem solved via differentiable architecture search.
Introduces affine transformation and cross-attention as two new fusion mechanisms to enrich the search space.
Researchers introduce the Turbid Underwater Baseline (TUB) dataset and a new metric, PCD, to quantify information loss in underwater scenes with extreme turbidity. PCD correlates strongly with instance segmentation performance, outperforming common metrics.
TUB dataset includes 1,320 images under extreme turbidity with over 16,000 segmentation masks.
Proposed PCD metric is contrast-invariant and based on phase congruency maps.
GeMoE models token routing as a Minimum Description Length problem, using gating entropy to adaptively select experts, achieving 99.5% performance retention while increasing expert activation sparsity by 36.5%.
Static Top-k routing in MoE wastes resources by not adapting to input complexity.
GeMoE frames routing as an MDL problem and uses gating entropy to measure token complexity.
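The idea of using gating entropy as a proxy for token complexity can be sketched as follows. This is not GeMoE's actual routing rule: the linear mapping from normalized entropy to an expert count, and the `k_min` and `k_max` bounds, are assumptions chosen only to show how a low-entropy (confident) gate can activate fewer experts than a high-entropy one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gating_entropy(probs):
    """Shannon entropy (in nats) of the gate distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def adaptive_top_k(logits, k_min=1, k_max=4):
    """Pick more experts when the gate is uncertain (assumed linear mapping)."""
    probs = softmax(logits)
    h = gating_entropy(probs)
    h_max = math.log(len(probs))  # entropy of a uniform gate
    k = k_min + round((k_max - k_min) * h / h_max)
    k = max(k_min, min(k_max, k))
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k], h


confident_experts, confident_h = adaptive_top_k([10.0, 0, 0, 0, 0, 0, 0, 0])
uncertain_experts, uncertain_h = adaptive_top_k([0.0] * 8)
print(len(confident_experts), len(uncertain_experts))
```

Static Top-k would return the same number of experts for both tokens; here the confident token uses one expert and the uniform-gate token uses the maximum.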
This study extends fMRI cognitive taskonomy from single-source to multi-source transfer across 23 Human Connectome Project task states, using Boolean Integer Programming (BIP) for budget-constrained task allocation. Training 1,127 models reveals directional, paradigm-structured single-source transfer and composition-dependent multi-source transfer. BIP prioritizes working-memory states (0-back and 2-back) under budget constraints, reflecting integrated perceptual, attentional, and executive processes. Findings highlight a cross-paradigm-limited motor cluster and high-priority working-memory states.
Extends fMRI taskonomy from one-to-one to many-to-one transfer with budget-constrained dependencies.
Uses Boolean Integer Programming to analyze budget-constrained task allocation across 23 task states.
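A budget-constrained Boolean program of this general kind can be sketched with brute-force enumeration over binary selection vectors. The gains, costs, and budget below are made-up numbers, and the additive objective is a simplifying assumption: the summary notes that multi-source transfer is composition-dependent, so the study's actual formulation is likely richer than this.

```python
from itertools import product


def select_sources(gains, costs, budget):
    """Maximize total gain over x in {0,1}^n subject to sum(cost * x) <= budget."""
    best_value, best_choice = float("-inf"), None
    for choice in product((0, 1), repeat=len(gains)):
        cost = sum(c * x for c, x in zip(costs, choice))
        if cost > budget:
            continue  # infeasible under the budget
        value = sum(g * x for g, x in zip(gains, choice))
        if value > best_value:
            best_value, best_choice = value, choice
    return best_choice, best_value


# Hypothetical transfer gains for four candidate source states.
gains = [0.30, 0.25, 0.10, 0.05]
costs = [1, 1, 1, 1]
choice, value = select_sources(gains, costs, budget=2)
print(choice, value)
```

Enumeration is exponential in the number of candidates, so it only suits small illustrations; a dedicated integer-programming solver would be the practical choice at the scale of 23 task states.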
This paper introduces a multi-task deep learning model that predicts penetration state, depth, and weld seam morphology in laser penetration welding. The model uses weld pool images from a CMOS camera and welding parameters, integrating spatiotemporal features via CNNs and state space models. Test results show 99.35% accuracy for penetration state, 1.79 mm error for depth, and 95.65% accuracy for weld cross-section reconstruction.
Multi-task deep learning model integrates spatiotemporal features and welding parameters to predict penetration state, depth, and morphology.
CMOS camera captures weld pool images; model uses CNNs and state space models for spatiotemporal feature extraction.
Researchers developed a self-supervised framework using airborne LiDAR and optical imagery to estimate tree-level above-ground biomass in urban areas. The method achieved high accuracy in crown delineation and biomass estimation, revealing urban carbon stocks and changes over time without manual annotation.
Framework uses leaf-off airborne LiDAR and near-infrared orthophotography to estimate tree biomass at crown level.
Dual-stream cross-attention network with pseudo-labels achieved 0.84 Dice score for tree segmentation.
This paper proposes LCG, a framework for long-context multi-image generation using Sparse Relational Attention (SRA) and Routing Consistency Constraint (RCC), along with a large synthetic dataset LCCD. Experiments show LCG outperforms baselines in prompt alignment and character consistency.
LCG uses Sparse Relational Attention (SRA) to selectively attend to core features, ensuring efficient propagation of semantic and layout information.
Routing Consistency Constraint (RCC) aligns structural patterns via identity-aware masks, reducing appearance drift in complex multi-character scenes.
A hybrid approach combines image processing and deep learning to assess fruit freshness. An image processing algorithm quantifies spoilage on a 0-100 scale, and a CNN performs binary classification. Logistic regression fuses both outputs, which later enables the image processing algorithm to classify without the CNN. The approach achieves >90% accuracy on apples and oranges with real-time performance and low computational requirements. Its main limitation is that it requires an isolated fruit on a white or transparent background.
Image processing algorithm scores spoilage from 0 to 100.
CNN trained for binary fresh/rotten classification.
DocArena is a fully automated data curation pipeline that uses multimodal large language models (MLLMs) to transform raw documents into controllable, scalable training environments for document search agents. It requires no human annotation, generates reasoning-intensive QA pairs, and produces the DocArena-79K dataset spanning 8,336 documents across 16 domains and 49 languages. Experiments show that agents trained on DocArena achieve state-of-the-art performance on both retrieval accuracy and QA quality.
DocArena automates the construction of training environments from raw documents using MLLM-based visual perception without human annotation.
The DocArena-79K dataset covers 16 domains and 49 languages with 8,336 documents.
Most vision-language-action (VLA) models are reactive, predicting the next action from the current observation alone, which limits generalization under distribution shift. This paper proposes Reflective VLA, which conditions decisions on a context of observation-action-consequence triplets, exposing deployment-specific action-effect mappings. On LIBERO-Plus and LIBERO-Plus-Hard, it improves success rates by 5.4 and 4.2 percentage points, respectively, and ablations identify action consequences as the key component.
Proposes Reflective VLA using observation-action-consequence triplets as context.
Routes all modalities through a VLM with shared attention for reasoning.
This paper proposes a novel neural network quantization method that learns quantization-aware linear paths to find midpoints in low-loss subspaces, achieving performance comparable to quantization-aware training without using the straight-through estimator or explicit discretization during training.
Quantization degrades performance because discrete constraints perturb parameters away from the optimum.
Low-loss full-precision solutions belong to connected low-loss subspaces.
This study evaluates Multimodal Large Language Models (MLLMs) on assistive AI tasks including currency recognition, scene text QA, and multilingual reading. The authors built NetraLink, a system using a head-mounted GoPro to collect real-world egocentric data, and created a benchmark. Findings reveal strengths and limitations of current MLLMs for vision-language assistive technologies.
MLLMs show promise for assistive AI but have limitations in complex scenarios.
NetraLink system uses a head-mounted GoPro to capture real-world egocentric data.
Visual storytelling requires image sequences aligned with narrative prompts and consistent characters. Existing training-free methods rely on structured prompts that repeat full descriptions each time, deviating from natural storytelling. FreeStory introduces entity-grounded feature reuse to maintain character consistency under free-form prompts, and presents FreeStoryBench, achieving state-of-the-art performance among training-free methods.
FreeStory is training-free and uses entity-grounded feature reuse for character consistency under free-form prompts.
Introduces FreeStoryBench, a benchmark covering single- and multi-character stories.
Wan-Streamer is a native-streaming, end-to-end interactive foundation model designed for real-time, low-latency, full-duplex audio-visual interaction. It seamlessly models language, audio, and video within a single Transformer using block-causal attention for incremental streaming, without external modules. It achieves ~200ms model-side and ~550ms total interaction latency, enabling sub-second duplex communication.
Single Transformer unifies language, audio, and video input/output for end-to-end interaction.
Block-causal attention and low-latency multimodal token scheduling enable streaming units as short as 160ms at 25fps.
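The streaming-unit figure follows from simple arithmetic: at 25 fps each frame spans 40 ms, so a 160 ms unit covers 4 video frames. The sketch below checks that and builds a generic block-causal attention mask, in which a token may attend to every token in its own block and in earlier blocks. The block size and token count are hypothetical; the model's actual token layout across language, audio, and video is not described in the summary.

```python
def frames_per_unit(unit_ms, fps):
    """Number of whole video frames covered by one streaming unit."""
    return unit_ms * fps // 1000


def block_causal_mask(num_tokens, block_size):
    """mask[q][k] is True when query token q may attend to key token k.

    Attention is bidirectional inside a block and causal across blocks.
    """
    return [
        [(k // block_size) <= (q // block_size) for k in range(num_tokens)]
        for q in range(num_tokens)
    ]


print(frames_per_unit(160, 25))  # frames in a 160 ms unit at 25 fps

# Hypothetical layout: 6 tokens grouped into blocks of 2.
mask = block_causal_mask(num_tokens=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no block ever attends to a later block, each new streaming unit can be processed incrementally as it arrives, reusing the computation for earlier blocks.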
Chorus II introduces a cross-request sparsity reuse framework that reuses sparse attention masks from similar historical requests to avoid online mask prediction, with optional feature reuse and guidance enhancement, achieving a 2.16× speedup while preserving generation quality.
Addresses the computational cost of diffusion models for image-to-video generation by reusing sparse attention patterns across similar requests.
Yuvion VL is a family of multimodal large language models purpose-built for content and AI safety, treating safety as an inherently adversarial and multimodal problem. It features an automated data pipeline with adversarial-aware synthesis and multi-stage quality control, a three-stage training pipeline including continued pretraining for cross-modal alignment, instruction post-training, and reasoning post-training, plus a novel Confuse-then-Contrast Fine-Tuning framework. The YVRE benchmark set evaluates safety, adversarial robustness, and real-world capabilities. Yuvion VL-32B achieves industry-leading safety performance, surpassing open-source and closed-source models while maintaining general capabilities.
Yuvion VL is designed from the ground up for adversarial robustness in multimodal safety tasks.
Training includes continued pretraining, instruction post-training, reasoning post-training, and Confuse-then-Contrast Fine-Tuning.
This paper proposes a Noise-Aware Boundary-Enhanced Generative Learning (NBGL) framework for ultrasound speckle reduction. It integrates a speckle reduction branch and a boundary enhancement branch, with a noise-aware interaction weight generation module that uses 3D Laplacian filtering and median absolute deviation estimation to adapt to varying noise levels. Evaluated on 141 3D transvaginal ultrasound volumes, NBGL outperforms state-of-the-art methods across six noise levels.
NBGL combines generative learning and boundary enhancement to suppress speckle while preserving anatomical boundaries.
A noise-aware module estimates noise level via 3D Laplacian filtering to generate adaptive interaction weights.
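One common way to combine the two ingredients named above is to high-pass the volume with a discrete Laplacian and take the median absolute deviation (MAD) of the response as a robust noise score. The sketch below assumes a 6-neighbour Laplacian and a toy nested-list volume with additive Gaussian noise; the paper's exact filter, noise model, and the mapping from this score to interaction weights are not given in the summary and are not reproduced here.

```python
import random
from statistics import median


def laplacian_3d(vol):
    """6-neighbour discrete Laplacian on the interior voxels of a 3D volume."""
    d, h, w = len(vol), len(vol[0]), len(vol[0][0])
    out = []
    for z in range(1, d - 1):
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                out.append(
                    vol[z - 1][y][x] + vol[z + 1][y][x]
                    + vol[z][y - 1][x] + vol[z][y + 1][x]
                    + vol[z][y][x - 1] + vol[z][y][x + 1]
                    - 6 * vol[z][y][x]
                )
    return out


def mad(values):
    """Median absolute deviation around the median."""
    values = list(values)
    m = median(values)
    return median(abs(v - m) for v in values)


def noise_score(vol):
    """Robust noise score: MAD of the Laplacian response."""
    return mad(laplacian_3d(vol))


def toy_volume(sigma, size=8, seed=0):
    """Hypothetical test volume: constant signal plus Gaussian noise."""
    rng = random.Random(seed)
    return [[[1.0 + rng.gauss(0, sigma) for _ in range(size)]
             for _ in range(size)] for _ in range(size)]


low = noise_score(toy_volume(sigma=0.1))
high = noise_score(toy_volume(sigma=1.0))
print(low, high)
```

The Laplacian suppresses smooth anatomy and leaves mostly voxel-scale fluctuation, and the median makes the score insensitive to the few strong responses that real boundaries produce.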
Advances in generative AI have made image falsification highly realistic, demanding trustworthy authentication systems. Existing forensic detectors lack interpretability, while vision-language models (VLMs) provide explanations but cannot exploit forensic traces for reliable detection. We propose Forensic Knowledge Graphs (FKGs), a unified framework integrating forensic evidence extraction, structured reasoning, and human-interpretable explanation. Our FKG structure encodes forensic traces along with their causal dependencies and links to scene content. To generate accurate FKGs, we introduce a novel forensic authentication network and an Iterative Context Refinement strategy that guides VLMs to produce faithful, grounded explanations. We also present FKG-50K, a dataset of 50,000 realistic forgeries with ground-truth FKGs. Experiments demonstrate that FKG outperforms both forensic detectors and VLMs in detection, forgery identification and localization, and forensic justification.
Researchers propose TheProfessor, a multi-teacher extension of PromptKD for distilling vision-language models. Using an ensemble of a domain-finetuned teacher and a zero-shot teacher, confidence-weighted ensembling improves harmonic mean accuracy by 1.77 points on average, with significant gains on domain-shifted datasets like EuroSAT.
TheProfessor extends PromptKD with a two-teacher ensemble: domain-finetuned PromptSRC ViT-L/14 and zero-shot EVA-CLIP-L/14.
Confidence-weighted ensembling achieves 89.28 average HM across four datasets, up from 87.52.
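A minimal sketch of confidence-weighted ensembling for two teachers, together with the harmonic mean (HM) metric, is given below. Using each teacher's maximum class probability as its confidence weight is an assumption made for illustration, and treating HM as the harmonic mean of base-class and novel-class accuracy follows the usual convention in prompt-learning evaluations rather than anything stated in the summary. The probability vectors are made up.

```python
def confidence_weighted_ensemble(probs_a, probs_b):
    """Blend two teachers' class distributions, weighting each by its confidence.

    Confidence is taken to be the teacher's maximum class probability
    (an assumption for this sketch).
    """
    conf_a, conf_b = max(probs_a), max(probs_b)
    total = conf_a + conf_b
    return [(conf_a * pa + conf_b * pb) / total
            for pa, pb in zip(probs_a, probs_b)]


def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical per-sample outputs of a finetuned and a zero-shot teacher.
finetuned = [0.9, 0.1]
zero_shot = [0.4, 0.6]
target = confidence_weighted_ensemble(finetuned, zero_shot)
print(target)

print(harmonic_mean(90.0, 60.0))
```

When the two teachers disagree, the more confident one dominates the distillation target, and the blended vector remains a valid probability distribution because the weights sum to one.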
REALM provides the first unified red-teaming benchmark for physical-world vision-language models, integrating 12 attack methods, 3 defenses, and 13 models to enable fair comparison of vulnerabilities. Key findings include the effectiveness of text and typographic attacks and the limited protective role of model scale.
REALM is the first unified benchmark for red-teaming physical-world VLMs.
It integrates 12 attack methods, 3 defenses, and 13 models under a black-box threat model.
Vision-Language Models (VLMs) are brittle to negation, relying on shallow co-occurrence and susceptible to misleading cues. HANCLIP restructures the embedding space with hyperbolic geometry and an angular triplet objective to encode negations, trained on 20K quadruplets, improving negation benchmarks without degrading standard performance.
VLMs exhibit fragility to negation due to shallow word co-occurrence and misleading textual cues.
HANCLIP uses hyperbolic formulation and angular triplet objective to explicitly model negative semantics.
ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without any benchmark-specific training. Built on a 3B-parameter foundation model, it introduces three key innovations: density-aware adaptive zooming with objectness maps, a boundary-aware count policy via GRPO, and a cycle-consistent GRPO strategy. It achieves state-of-the-art results across seven benchmarks, outperforming task-specific specialists and larger generalist models.
ABACUS is a unified vision-language model for multiple counting tasks and count-faithful image generation. It requires no benchmark-specific training.
Built on a 3B-parameter foundation model, it incorporates density-aware adaptive zooming, boundary-aware GRPO policy, and cycle-consistent GRPO strategy.
This work shifts small object detection from spatial to spectral feature processing, introducing a frequency-guided framework with three lightweight modules that achieves superior performance with one-sixth the parameters of YOLOv11.
Small object detection is hindered by spatial-domain detectors discarding high-frequency details.
Proposes a shift from spatial to spectral processing with a frequency-guided framework.
Recent work finds that attention distributions used for vision-language consistency in VLMs suffer from decoding drift and structural token biases. To address this, researchers propose PV-TAM, which leverages prompt-side semantics and peak attention distributions to evaluate alignment, outperforming answer-side baselines across multiple datasets.
Decoding drift and structural tokens cause attention misalignment in VLM consistency evaluation.
PV-TAM uses prompt-side semantics and peak attention to measure alignment, filtering modality boundary markers.
Modern video diffusion models achieve higher generation quality through scaling, but this also increases inference cost. The Sol Video Inference Engine is a training-free, agent-based acceleration framework that integrates five techniques (caching, sparse attention, token pruning, quantization, and kernel fusion) for instance-specific optimization. Tested on three models of varying sizes, it achieves over 2x end-to-end acceleration with near-lossless VBench quality.
Video diffusion inference acceleration faces instance-specific challenges; optimal strategies vary by model, hardware, and configuration.
Sol Engine uses an agent architecture with parallel skill agents optimizing five techniques, integrated into a global stack.
This paper presents a geometry-informed computer vision pipeline that automates overtaking event detection from a single bicycle-mounted camera without multi-sensor configurations or explicit calibration. Validated on 315 real-world events, it achieves 97.8% recall with zero false positives. The system identifies overtaking intentions a mean of 2.44 seconds before vehicle passage, with 84.1% of events exceeding the 1.5-second human reaction time threshold. A preliminary calibration-free lateral distance estimation approach achieves mean absolute errors of 13-14 cm, sufficient for safety categorization.
Proposes a geometry-informed computer vision pipeline for automated overtaking detection from a single bicycle camera.
Uses RT-DETR and ByteTrack with a three-stage geometric validation module.
Researchers propose TeleMorpher, a one-shot framework for simultaneous motion and location editing in videos using diffusion models. It disentangles protagonist and background, uses pose warping with motion priors, and introduces new evaluation metrics. Experiments show superior performance on in-the-wild videos and the TaiChi dataset.
TeleMorpher enables simultaneous motion and location editing in a single shot.
The framework leverages motion priors and ground-truth motion for precise editing.
The paper proposes learning an asynchronous schedule for denoising in multi-representation latent diffusion models. It introduces a schedule-corrected objective and a flexible parametric class that is convex and monotone. The schedule is learned with minimal additional compute (<1%). On ImageNet 256x256, the method achieves FID 1.05 in 200 epochs (matching an 800-epoch baseline) and FID 1.02 in 600 epochs (outperforming a 1B-parameter model). Unguided results also show significant improvements.
Proposes learning asynchronous schedules for multi-representation diffusion.
Introduces a schedule-corrected objective and a convex, monotone parameterization.