Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Lijiang Li1, Zuwei Long2, Yunhang Shen2, Heting Gao2, Haoyu Cao2, Xing Sun2,
Caifeng Shan1, Ran He3, Chaoyou Fu1, †
1Nanjing University, 2Tencent Youtu Lab, 3CASIA
†Corresponding Author

Introduction


While recent multimodal large language models (MLLMs) have made impressive strides, most adopt a conventional autoregressive architecture as their backbone, leaving significant room for exploring effective and efficient architectural alternatives. We introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, unifying understanding and generation across text, speech, and images. Omni-Diffusion employs a single mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens, which enables support not only for bimodal tasks but also for more complex scenarios involving multiple modalities. Our main contributions are:
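To make the core idea concrete, below is a minimal sketch of one training step of mask-based (absorbing-state) discrete diffusion over a flat sequence of discrete tokens. The token ids, `MASK_ID`, the toy vocabulary size, and the `model` interface are all illustrative assumptions, not the paper's actual implementation; the 1/t reweighting of the masked cross-entropy is the standard masked-diffusion objective.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0   # hypothetical id of the special [MASK] token
VOCAB = 128   # toy vocabulary covering text/image/audio tokens

def diffusion_loss(model, tokens):
    """One training step of masked discrete diffusion (illustrative).

    tokens: (batch, seq_len) discrete token ids from any modality.
    A masking ratio t ~ U(0, 1) is drawn per sample; masked positions
    are replaced with MASK_ID, the model predicts the originals, and
    the cross-entropy on masked positions is reweighted by 1/t.
    """
    b, n = tokens.shape
    t = torch.rand(b, 1).clamp(min=1e-3)            # masking ratio per sample
    mask = torch.rand(b, n) < t                     # positions to corrupt
    noisy = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(noisy)                           # (b, n, VOCAB)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    # average the reweighted loss over masked positions only
    loss = ((ce * mask) / t).sum() / mask.sum().clamp(min=1)
    return loss
```

Because every position is predicted in parallel rather than left to right, the same objective applies uniformly to text, image, and speech tokens once they share a discrete vocabulary.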

  • First Any-to-Any Mask-based Discrete Diffusion Model. Omni-Diffusion is the pioneering any-to-any multimodal language model constructed purely from mask-based discrete diffusion models. It demonstrates robust performance on a variety of tasks involving multiple modalities, illustrating the potential of discrete diffusion models in multimodal intelligence systems.
  • Diffusion-Centric Training and Inference Framework. We develop specialized training and inference techniques tailored to the characteristics of mask-based diffusion models. For training, we implement an attenuated tail-pad masking strategy to enable variable-length generation and a three-stage progressive training pipeline for effective multi-modality alignment. For inference, we introduce a position penalty to constrain the generation order and enhance visual quality, alongside a special-token pre-infilling strategy to improve spoken dialogue performance.
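As a rough illustration of the position-penalty idea at inference time, the sketch below performs confidence-based iterative unmasking where each candidate's confidence is reduced in proportion to its position index, biasing the model to commit earlier positions first. The penalty form (`alpha * index / seq_len`), the commit schedule, `MASK_ID`, and the `model` interface are assumptions for illustration only, not the paper's exact formulation.

```python
import torch

MASK_ID = 0  # hypothetical [MASK] token id

@torch.no_grad()
def decode_with_position_penalty(model, tokens, steps=8, alpha=0.1):
    """Iterative unmasking with a position penalty (illustrative sketch).

    tokens: (seq_len,) ids, with masked positions set to MASK_ID.
    At each step the model predicts all masked positions; per-position
    confidence is lowered by alpha * (index / seq_len), so earlier
    positions tend to be committed first, constraining generation order.
    """
    tokens = tokens.clone()
    n = tokens.shape[-1]
    penalty = alpha * torch.arange(n, dtype=torch.float) / n
    for step in range(steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        probs = model(tokens.unsqueeze(0)).softmax(-1).squeeze(0)  # (n, vocab)
        conf, pred = probs.max(-1)
        score = conf - penalty                  # penalized confidence
        score[~masked] = float("-inf")          # only masked positions compete
        # commit a fraction of the remaining masked tokens this step
        k = max(1, masked.sum().item() // (steps - step))
        idx = score.topk(k).indices
        tokens[idx] = pred[idx]
    return tokens
```

In practice such a penalty interpolates between fully parallel confidence-based decoding (alpha = 0) and a near left-to-right order (large alpha).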


Teaser

Figure 1: Overview of Omni-Diffusion.

Examples



• Example-1: Text → Text+Image+Audio

• Example-2: Speech → Image

• Example-3: Text+Image → Text

• Example-4: Text → Image


Performance

• Visual Tasks


• Speech Tasks


BibTeX


@article{li2026omni,
    title={Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion},
    author={Li, Lijiang and Long, Zuwei and Shen, Yunhang and Gao, Heting and Cao, Haoyu and Sun, Xing and Shan, Caifeng and He, Ran and Fu, Chaoyou},
    journal={arXiv preprint arXiv:2603.06577},
    year={2026}
}