2026-06-17 · Original · 2 min read · Updated: 2026-06-17

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur high GPU memory costs, making compression essential. Existing expert-level mixed-precision quantization degrades on MoE-MLLMs because of two biases in how expert importance is estimated: vision tokens dominate cross-modal expert selection frequency, and redundant vision tokens skew the intra-vision frequency statistics. MODE addresses both by decomposing selection frequency by modality, filtering redundant vision tokens, and evaluating quantization sensitivity per modality, then using integer linear programming to assign a bit-width to each expert under a given budget. At W3A16, average performance loss stays within 2.9%.
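To make the frequency decomposition concrete, here is a minimal sketch of counting expert selections separately per modality while skipping vision tokens flagged as redundant. The function name `modality_frequencies`, the `keep` mask, and the toy routing data are all assumptions for illustration; the summary does not say how MODE identifies redundant tokens, so the mask is simply taken as given.

```python
from collections import Counter

def modality_frequencies(routes, modalities, keep, num_experts):
    """Per-modality normalized expert selection frequencies.

    routes[i]     : expert ids selected for token i (top-k routing)
    modalities[i] : "text" or "vision"
    keep[i]       : False for vision tokens judged redundant
                    (hypothetical filter; the criterion is not given here)
    """
    counts = {"text": Counter(), "vision": Counter()}
    for experts, modality, kept in zip(routes, modalities, keep):
        if modality == "vision" and not kept:
            continue  # drop redundant vision tokens before counting
        counts[modality].update(experts)

    freqs = {}
    for modality, counter in counts.items():
        total = sum(counter.values())
        freqs[modality] = [
            counter[e] / total if total else 0.0 for e in range(num_experts)
        ]
    return freqs

# Toy batch: four vision tokens (two flagged redundant) and one text token.
routes = [[0, 1], [0, 1], [0, 1], [0, 2], [3, 2]]
modalities = ["vision", "vision", "vision", "vision", "text"]
keep = [True, False, False, True, True]

freqs = modality_frequencies(routes, modalities, keep, num_experts=4)
print(freqs["vision"])  # [0.5, 0.25, 0.25, 0.0]
print(freqs["text"])    # [0.0, 0.0, 0.5, 0.5]
```

In a pooled count over all five tokens, expert 3 would receive only 1 of 10 selections and look unimportant, even though it handles half of the text routing in this toy batch. That is the cross-modal bias the summary describes.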

Source: arXiv Machine Learning. Authors: Yuanteng Chen, Peisong Wang, Zhilei Liu, Nanxin Zeng, Yuantian Shao, Shiqiang Lang, Tao Liu, Chuangyi Li, Qinghao Hu, Gang Li, Jing Liu, Jian Cheng


[Submitted on 15 Jun 2026]


Abstract: Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among post-training quantization (PTQ) methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet suffers notable degradation on MoE-MLLMs due to two overlooked biases in expert importance estimation. (1) At the cross-modal level, the numerical dominance of vision tokens causes expert selection frequency to be dominated by vision tokens, masking experts that are critical to the text modality; (2) at the intra-vision level, the large proportion of redundant vision tokens further skews frequency statistics, obscuring experts critical for informative visual content. To bridge these gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs that decomposes expert selection frequency by modality, filters redundant vision tokens to obtain denoised visual frequency, and further evaluates quantization sensitivity per modality as a complementary signal to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation to assign per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16, with larger gains at the extreme 2-bit setting.
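The bit-width assignment step can be illustrated with a small sketch. The abstract does not state the objective or constraints of the Integer Linear Programming formulation, so everything below is an assumption: each expert gets a single importance score (standing in for the combined frequency and sensitivity signals), each bit-width has a made-up error penalty, and the budget is an average number of bits per expert. Exhaustive enumeration replaces an ILP solver here, which is only practical for a handful of experts.

```python
from itertools import product

def assign_bits(importance, penalty, avg_bits):
    """Pick one bit-width per expert, minimizing importance-weighted penalty.

    importance : per-expert importance scores (assumed, hypothetical values)
    penalty    : dict mapping bit-width -> error penalty (assumed values)
    avg_bits   : budget expressed as average bits per expert
    """
    n = len(importance)
    budget = avg_bits * n
    best, best_cost = None, float("inf")
    # Enumerate every assignment; a real ILP solver would replace this loop.
    for bits in product(sorted(penalty), repeat=n):
        if sum(bits) > budget:
            continue  # violates the memory budget
        cost = sum(w * penalty[b] for w, b in zip(importance, bits))
        if cost < best_cost:
            best, best_cost = bits, cost
    return best, best_cost

importance = [0.5, 0.05, 0.3, 0.15]   # hypothetical expert importance
penalty = {2: 4.0, 3: 1.0, 4: 0.25}   # hypothetical error per bit-width

best, best_cost = assign_bits(importance, penalty, avg_bits=3)
print(best)  # (4, 2, 3, 3)
```

With these toy numbers the most important expert is promoted to 4 bits and the least important one drops to 2 bits, while the average stays at 3 bits. This is the kind of trade-off an expert-level mixed-precision scheme targets at W3A16.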

Comments: 18 pages, 8 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2606.17118 [cs.LG]

(or arXiv:2606.17118v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2606.17118

arXiv-issued DOI via DataCite

Submission history

From: Yuanteng Chen
[v1] Mon, 15 Jun 2026 10:59:11 UTC (1,302 KB)
