Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
Abstract
The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, such as Canny edge maps and depth maps, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images during generation. To bridge this gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space in which image and text can be well aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models such as SD3.5, we replace the original text-only encoders with versatile multimodal information encoders such as QwenVL. Our approach uses a two-stage training paradigm consisting of joint text-image alignment followed by multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving an overall score of 0.69 on the GenEval benchmark and matching the performance of state-of-the-art text-to-image models such as SD3.5 and FLUX.
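To make the architecture concrete, here is a minimal sketch (not the authors' released code) of the core wiring the abstract describes: a frozen QwenVL-style LMM encodes an interleaved text-image prompt, and a small trainable projector maps its hidden states into the conditioning space of a diffusion transformer such as SD3.5. All class names, the `denoising_loss` call, and the hidden dimensions are hypothetical placeholders, and the freezing scheme assumes the first (alignment) training stage.

```python
# Sketch of the Dream Engine idea under the stated assumptions:
# hidden states from a frozen large multimodal model (LMM) encode an
# interleaved text-image prompt; a trainable projector maps them into
# the conditioning space of a diffusion transformer (DiT).
import torch
import torch.nn as nn


class MultimodalConditionProjector(nn.Module):
    """Maps LMM hidden states to the DiT's conditioning dimension.

    The dimensions below are illustrative guesses, not values from the paper.
    """

    def __init__(self, lmm_dim: int = 3584, dit_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(lmm_dim, dit_dim),
            nn.GELU(),
            nn.Linear(dit_dim, dit_dim),
        )

    def forward(self, lmm_hidden: torch.Tensor) -> torch.Tensor:
        # lmm_hidden: (batch, seq_len, lmm_dim) -- final-layer states of the
        # LMM over the interleaved text/image token sequence.
        return self.proj(lmm_hidden)


def encode_interleaved_prompt(lmm, processor, interleaved_prompt):
    """Run the frozen LMM over an interleaved prompt; return hidden states.

    `lmm` / `processor` stand in for a QwenVL-style model and its
    preprocessor; `interleaved_prompt` is whatever mixed text/image
    format that preprocessor accepts.
    """
    inputs = processor(interleaved_prompt)  # tokenize text + patchify images
    with torch.no_grad():                   # LMM stays frozen in this sketch
        out = lmm(**inputs, output_hidden_states=True)
    return out.hidden_states[-1]            # (batch, seq_len, lmm_dim)


def alignment_step(dit, projector, lmm, processor, prompt, latents, optimizer):
    """One stage-1 (joint text-image alignment) step: only the projector
    receives gradients; `dit.denoising_loss` is a hypothetical stand-in for
    the diffusion training objective."""
    cond = projector(encode_interleaved_prompt(lmm, processor, prompt))
    loss = dit.denoising_loss(latents, cond)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design choice this sketch highlights is that only a lightweight projector needs training to align the two representation spaces: the LMM already places text and image tokens in a shared space, so the diffusion backbone can consume interleaved conditions without retraining its own encoders from scratch.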
Community
Librarian Bot found the following similar papers, recommended via the Semantic Scholar API:
- Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models (2025)
- I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models (2025)
- Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens (2025)
- EditAR: Unified Conditional Generation with Autoregressive Models (2025)
- EliGen: Entity-Level Controlled Image Generation with Regional Attention (2025)
- Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks (2025)
- LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation (2025)