Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Lijiang Li1, Zuwei Long2, Yunhang Shen2, Heting Gao2, Haoyu Cao2, Xing Sun2,
Caifeng Shan1, Ran He3, Chaoyou Fu1, †
1Nanjing University, 2Tencent Youtu Lab, 3CASIA
†Corresponding Author

Introduction


While recent multimodal large language models (MLLMs) have made impressive strides, most adopt a conventional autoregressive architecture as their backbone, leaving significant room for exploring effective and efficient architectural alternatives. We introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, unifying understanding and generation across text, speech, and images. Omni-Diffusion employs a single mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens, which enables support not only for bimodal tasks but also for more complex scenarios involving multiple modalities. Our main contributions are:
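To make the core idea concrete, below is a minimal sketch of one training step of mask-based (absorbing-state) discrete diffusion over a flat sequence of discrete tokens. The token ids, `MASK_ID`, the toy vocabulary size, and the `model` interface are all illustrative assumptions, not the paper's actual implementation; the 1/t reweighting of the masked cross-entropy is the standard masked-diffusion objective.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0   # hypothetical id of the special [MASK] token
VOCAB = 128   # toy vocabulary covering text/image/audio tokens

def diffusion_loss(model, tokens):
    """One training step of masked discrete diffusion (illustrative).

    tokens: (batch, seq_len) discrete token ids from any modality.
    A masking ratio t ~ U(0, 1) is drawn per sample; masked positions
    are replaced with MASK_ID, the model predicts the originals, and
    the cross-entropy on masked positions is reweighted by 1/t.
    """
    b, n = tokens.shape
    t = torch.rand(b, 1).clamp(min=1e-3)            # masking ratio per sample
    mask = torch.rand(b, n) < t                     # positions to corrupt
    noisy = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(noisy)                           # (b, n, VOCAB)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    # average the reweighted loss over masked positions only
    loss = ((ce * mask) / t).sum() / mask.sum().clamp(min=1)
    return loss
```

Because every position is predicted in parallel rather than left to right, the same objective applies uniformly to text, image, and speech tokens once they share a discrete vocabulary.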

  • First Any-to-Any Mask-based Discrete Diffusion Model. Omni-Diffusion is the pioneering any-to-any multimodal language model constructed purely from mask-based discrete diffusion models. It demonstrates robust performance on a variety of tasks involving multiple modalities, illustrating the potential of discrete diffusion models in multimodal intelligence systems.
  • Diffusion-Centric Training and Inference Framework. We develop specialized training and inference techniques tailored to the characteristics of mask-based diffusion models. For training, we implement an attenuated tail-pad masking strategy to enable variable-length generation and a three-stage progressive training pipeline for effective multi-modality alignment. For inference, we introduce a position penalty to constrain the generation order and enhance visual quality, alongside a special-token pre-infilling strategy to improve spoken dialogue performance.
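As a rough illustration of the position-penalty idea at inference time, the sketch below performs confidence-based iterative unmasking where each candidate's confidence is reduced in proportion to its position index, biasing the model to commit earlier positions first. The penalty form (`alpha * index / seq_len`), the commit schedule, `MASK_ID`, and the `model` interface are assumptions for illustration only, not the paper's exact formulation.

```python
import torch

MASK_ID = 0  # hypothetical [MASK] token id

@torch.no_grad()
def decode_with_position_penalty(model, tokens, steps=8, alpha=0.1):
    """Iterative unmasking with a position penalty (illustrative sketch).

    tokens: (seq_len,) ids, with masked positions set to MASK_ID.
    At each step the model predicts all masked positions; per-position
    confidence is lowered by alpha * (index / seq_len), so earlier
    positions tend to be committed first, constraining generation order.
    """
    tokens = tokens.clone()
    n = tokens.shape[-1]
    penalty = alpha * torch.arange(n, dtype=torch.float) / n
    for step in range(steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        probs = model(tokens.unsqueeze(0)).softmax(-1).squeeze(0)  # (n, vocab)
        conf, pred = probs.max(-1)
        score = conf - penalty                  # penalized confidence
        score[~masked] = float("-inf")          # only masked positions compete
        # commit a fraction of the remaining masked tokens this step
        k = max(1, masked.sum().item() // (steps - step))
        idx = score.topk(k).indices
        tokens[idx] = pred[idx]
    return tokens
```

In practice such a penalty interpolates between fully parallel confidence-based decoding (alpha = 0) and a near left-to-right order (large alpha).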


Teaser

Figure 1: Overview of Omni-Diffusion.

Examples



• Example-1: Text → Text+Image+Audio

• Example-2: Speech → Image

• Example-3: Text+Image → Text

• Example-4: Text → Image


Performance

• Visual Tasks


• Speech Tasks


BibTeX


@article{li2026omni,
    title={Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion},
    author={Li, Lijiang and Long, Zuwei and Shen, Yunhang and Gao, Heting and Cao, Haoyu and Sun, Xing and Shan, Caifeng and He, Ran and Fu, Chaoyou},
    journal={arXiv preprint arXiv:2603.06577},
    year={2026}
}