HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation
Abstract
Despite the latest remarkable advances in generative modeling, efficient generation of high-quality 3D assets from textual prompts remains a difficult task. A key challenge lies in data scarcity: the most extensive 3D datasets encompass merely millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach that harnesses the power of large, pretrained 2D diffusion models. More specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict 6 orthographic projections and the corresponding latent triplane. We then decode these latents to generate a textured mesh. HexaGen3D does not require per-sample optimization and can infer high-quality, diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs compared to existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions.
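To make the described pipeline concrete, here is a minimal sketch of what single-pass inference could look like, based only on the abstract. All names (`text_encoder`, `unet`, `triplane_decoder`, the latent layout, and the diffusers-style scheduler interface) are assumptions for illustration, not the authors' actual implementation or API.

```python
# Hypothetical sketch of HexaGen3D-style inference: a fine-tuned text-to-image
# UNet jointly denoises latents for 6 orthographic views plus a latent triplane,
# and a separate decoder turns the triplane into a textured mesh.
# All module names and tensor shapes below are assumptions, not the paper's code.
import torch

def generate_3d_asset(prompt: str,
                      text_encoder,      # frozen text encoder from the 2D model (assumed)
                      unet,              # fine-tuned StableDiffusion-style UNet (assumed)
                      triplane_decoder,  # maps latent triplane -> textured mesh (assumed)
                      scheduler,         # diffusers-style scheduler interface (assumed)
                      num_steps: int = 30,
                      device: str = "cuda"):
    """Single feed-forward generation: no per-sample optimization."""
    # 1. Encode the text prompt once.
    cond = text_encoder(prompt)

    # 2. Jointly denoise one latent tensor packing the 6 orthographic views
    #    and the 3 triplane planes side by side (assumed layout).
    latents = torch.randn(1, 4, 64, 64 * 9, device=device)
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=cond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 3. Split off the triplane latents and decode them into a textured mesh.
    view_latents, triplane_latents = latents[..., : 64 * 6], latents[..., 64 * 6 :]
    mesh = triplane_decoder(triplane_latents)
    return mesh
```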
Community
Librarian Bot (automated message): the following similar papers were recommended by the Semantic Scholar API.
- ET3D: Efficient Text-to-3D Generation via Multi-View Distillation (2023)
- Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion (2023)
- ControlDreamer: Stylized 3D Generation with Multi-View ControlNet (2023)
- A Unified Approach for Text- and Image-guided 4D Scene Generation (2023)
- PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion (2023)