Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Ajay Jaiswal, Li Shen, Xiaolong Ma, Shiwei Liu, Lu Yin^✉

University of Surrey, Clemson University, Meituan, University of Texas at Austin, Sun Yat-sen University, The University of Arizona, University of Oxford, ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center
TMLR
^✉ Indicates corresponding author

Overview

The core idea of CD-MoE is to Condense multiple routed experts in MoE models into several fixed experts, transforming sparse layers back into dense layers to achieve inference acceleration and memory savings. Our key contributions are:

We propose the Condense concept, which consolidates multiple experts (e.g., 64) in selected layers into a smaller number (e.g., 6).
We use a greedy search based on Jensen-Shannon (JS) divergence to select the most suitable experts and layers for condensation, maintaining 90%+ of the original accuracy.
With further lightweight fine-tuning, we restore model quality to 95%+ of the pre-pruning level.
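
To make the selection step concrete, here is a minimal Python sketch of a greedy search guided by JS divergence. It assumes that each expert's output can be summarized as a discrete probability distribution and that a condensed subset is approximated by the mean of its members. Both are simplifications for illustration, and the names `js_divergence`, `subset_mean`, `greedy_select`, and the toy data are hypothetical rather than taken from the released code.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def subset_mean(distributions, subset):
    """Assumption: a condensed subset is approximated by the mean of its members."""
    size = len(subset)
    dim = len(distributions[0])
    return [sum(distributions[i][d] for i in subset) / size for d in range(dim)]

def greedy_select(reference, distributions, k):
    """Add one expert per step, choosing the one that minimizes JS divergence to the reference."""
    chosen = []
    for _ in range(k):
        remaining = [i for i in range(len(distributions)) if i not in chosen]
        best = min(
            remaining,
            key=lambda i: js_divergence(reference, subset_mean(distributions, chosen + [i])),
        )
        chosen.append(best)
    return chosen

# Toy data (hypothetical): output distribution of the original layer and of four experts.
reference = [0.5, 0.3, 0.2]
expert_outputs = [
    [0.9, 0.05, 0.05],
    [0.5, 0.3, 0.2],
    [0.1, 0.1, 0.8],
    [0.4, 0.4, 0.2],
]
selected = greedy_select(reference, expert_outputs, k=2)
```

In this toy setup the search first picks the expert whose distribution matches the reference exactly, then the expert that keeps the subset mean closest to it. A real search would score candidates on model outputs over calibration data instead of hand-written distributions.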

Method Illustration

Left: the original DeepSeekMoE layer, where tokens are dynamically routed to different experts by the gating network.

Right: our ConDense-MoE layer, where all tokens are routed to the same set of condensed experts, achieving:

Parameter & memory savings via significant expert reduction
Inference acceleration by removing the routing overhead
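
The contrast between the two layers can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: experts are scalar functions, the top-k gate scores are renormalized, and the condensed layer combines its fixed experts with constant weights. All of these are assumptions, as are the names `routed_forward` and `condensed_forward`.

```python
def routed_forward(x, experts, gate_scores, top_k):
    """Sparse MoE: the token uses its own top_k experts, weighted by renormalized gate scores."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:top_k]
    total = sum(gate_scores[i] for i in ranked)
    return sum(gate_scores[i] / total * experts[i](x) for i in ranked)

def condensed_forward(x, experts, kept, weights):
    """Condensed layer: every token uses the same fixed experts; no gate is evaluated."""
    return sum(w * experts[i](x) for i, w in zip(kept, weights))

# Toy experts (hypothetical): expert i multiplies its input by i + 1.
experts = [lambda x, s=i + 1: s * x for i in range(4)]

# Routed path: gate scores would come from the gating network, one vector per token.
routed_out = routed_forward(2.0, experts, gate_scores=[0.1, 0.6, 0.3, 0.0], top_k=2)

# Condensed path: experts 0 and 1 are fixed for all tokens, with constant weights.
condensed_out = condensed_forward(2.0, experts, kept=[0, 1], weights=[0.5, 0.5])

# Fraction of routed experts kept when 64 experts are condensed into 6.
kept_fraction = 6 / 64
```

With the example figures from the overview (64 routed experts condensed into 6), and assuming all routed experts have the same size, a condensed layer keeps 6/64, roughly 9.4%, of the routed-expert parameters, and the per-token gate computation is no longer needed.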

Key Findings

We observe that layers differ in how much their output embeddings shift after Condense, with the shallowest and deepest layers affected most. We therefore use greedy search to select the layers best suited to Condense.
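
As a rough illustration of this layer-level search, the Python sketch below condenses one additional layer per step, always choosing the layer whose addition keeps a measured shift lowest. The shift function here is a hypothetical stand-in: it sums made-up per-layer values (higher at both ends, mirroring the observation above) and adds a small penalty for adjacent condensed layers. In practice the shift would be measured on model outputs.

```python
def greedy_layer_selection(num_layers, shift_of, max_layers):
    """Condense one more layer per step, picking the layer that keeps the measured shift lowest."""
    condensed = []
    trace = []
    while len(condensed) < max_layers:
        candidates = [layer for layer in range(num_layers) if layer not in condensed]
        best = min(candidates, key=lambda layer: shift_of(condensed + [layer]))
        condensed.append(best)
        trace.append(shift_of(condensed))
    return condensed, trace

# Hypothetical per-layer shift values: the first and last layers are most sensitive.
PER_LAYER_SHIFT = [0.9, 0.4, 0.1, 0.2, 0.3, 0.8]

def toy_shift(layers):
    """Toy stand-in for a measured output shift of a set of condensed layers."""
    base = sum(PER_LAYER_SHIFT[layer] for layer in layers)
    adjacent_pairs = sum(1 for layer in layers if layer + 1 in layers)
    return base + 0.05 * adjacent_pairs

selected_layers, shift_trace = greedy_layer_selection(
    num_layers=len(PER_LAYER_SHIFT), shift_of=toy_shift, max_layers=3
)
```

Because the shift is evaluated on the whole set of condensed layers at each step, the search can account for interactions between layers, which a one-off ranking of individual layers would miss.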

Main Results

Upper: Results on DeepSeekMoE-16B; Lower: Results on Qwen1.5-MoE-A2.7B

CD-MoE-S: Only shared experts retained; CD-MoE-SR: Both shared experts and condensed routed experts retained.

Baseline methods Block Drop and Layer Drop aggressively prune all experts from selected layers. As a result, quality degradation becomes more severe as the pruning ratio increases.

In contrast, Condense preserves quality much more effectively by retaining and consolidating important experts rather than completely discarding them.

Lightweight Supervised Fine-tuning Results

SFT Results on LLaDA-8B-Base and Dream-7B-Base

After lightweight fine-tuning targeting only the Condense layers, model quality is further restored.
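
One simple way to restrict fine-tuning to the condensed layers is to select trainable parameters by layer index and freeze everything else. The Python sketch below only performs the name filtering. The naming pattern `layers.<index>.mlp.` and the example parameter names are assumptions for illustration and will differ between model implementations.

```python
import re

def trainable_names(param_names, condensed_layers):
    """Return the parameter names in the MLP blocks of condensed layers; all others stay frozen."""
    pattern = re.compile(r"layers\.(\d+)\.mlp\.")
    keep = []
    for name in param_names:
        match = pattern.search(name)
        if match and int(match.group(1)) in condensed_layers:
            keep.append(name)
    return keep

# Hypothetical parameter names, loosely following a common decoder naming scheme.
param_names = [
    "model.embed_tokens.weight",
    "model.layers.2.self_attn.q_proj.weight",
    "model.layers.2.mlp.experts.0.up_proj.weight",
    "model.layers.3.self_attn.q_proj.weight",
    "model.layers.3.mlp.experts.0.up_proj.weight",
    "model.layers.3.mlp.shared_experts.up_proj.weight",
    "model.layers.13.mlp.experts.0.up_proj.weight",
]
trainable = trainable_names(param_names, condensed_layers={3})
```

In a training framework, the returned names would be the only parameters with gradients enabled, which keeps the fine-tuning step small relative to updating the full model.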

BibTeX

@misc{cao2025condensedontjustprune,
      title={Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning}, 
      author={Mingyu Cao and Gen Li and Jie Ji and Jiaqi Zhang and Xiaolong Ma and Shiwei Liu and Lu Yin},
      year={2025},
      eprint={2412.00069},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2412.00069}, 
}