Diffusers documentation

Lumina2

You are viewing the main version, which requires installation from source. For a regular pip install, check out the latest stable version (v0.32.2).

Lumina Image 2.0: A Unified and Efficient Image Generative Model is a 2 billion parameter flow-based diffusion transformer capable of generating diverse images from text descriptions.

The abstract from the paper is:

We introduce Lumina-Image 2.0, an advanced text-to-image model that surpasses previous state-of-the-art methods across multiple benchmarks, while also shedding light on its potential to evolve into a generalist vision intelligence model. Lumina-Image 2.0 exhibits three key properties: (1) Unification – it adopts a unified architecture that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and facilitating task expansion. Besides, since high-quality captioners can provide semantically better-aligned text-image training pairs, we introduce a unified captioning system, UniCaptioner, which generates comprehensive and precise captions for the model. This not only accelerates model convergence but also enhances prompt adherence, variable-length prompt handling, and task generalization via prompt templates. (2) Efficiency – to improve the efficiency of the unified architecture, we develop a set of optimization techniques that improve semantic learning and fine-grained texture generation during training while incorporating inference-time acceleration strategies without compromising image quality. (3) Transparency – we open-source all training details, code, and models to ensure full reproducibility, aiming to bridge the gap between well-resourced closed-source research teams and independent developers.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

Using Single File loading with Lumina Image 2.0

Single file loading for Lumina Image 2.0 is available for the Lumina2Transformer2DModel.

import torch
from diffusers import Lumina2Transformer2DModel, Lumina2Text2ImgPipeline

ckpt_path = "https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0/blob/main/consolidated.00-of-01.pth"
transformer = Lumina2Transformer2DModel.from_single_file(
    ckpt_path, torch_dtype=torch.bfloat16
)

pipe = Lumina2Text2ImgPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
image = pipe(
    "a cat holding a sign that says hello",
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("lumina-single-file.png")

Using GGUF Quantized Checkpoints with Lumina Image 2.0

GGUF quantized checkpoints for the Lumina2Transformer2DModel can be loaded via from_single_file with the GGUFQuantizationConfig.

import torch
from diffusers import Lumina2Transformer2DModel, Lumina2Text2ImgPipeline, GGUFQuantizationConfig

ckpt_path = "https://huggingface.co/calcuis/lumina-gguf/blob/main/lumina2-q4_0.gguf"
transformer = Lumina2Transformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = Lumina2Text2ImgPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
image = pipe(
    "a cat holding a sign that says hello",
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("lumina-gguf.png")

Lumina2Text2ImgPipeline

class diffusers.Lumina2Text2ImgPipeline

( transformer: Lumina2Transformer2DModel scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL text_encoder: AutoModel tokenizer: AutoTokenizer )

Parameters

  • vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
  • text_encoder (AutoModel) — Frozen text encoder, loaded through AutoModel. Lumina Image 2.0 uses a Gemma 2 model.
  • tokenizer (AutoTokenizer) — Tokenizer of class AutoTokenizer.
  • transformer (Lumina2Transformer2DModel) — A text-conditioned Lumina2Transformer2DModel to denoise the encoded image latents.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.

Pipeline for text-to-image generation using Lumina Image 2.0.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).

__call__

( prompt: typing.Union[str, typing.List[str]] = None width: typing.Optional[int] = None height: typing.Optional[int] = None num_inference_steps: int = 30 guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str]] = None sigmas: typing.List[float] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] system_prompt: typing.Optional[str] = None cfg_trunc_ratio: float = 1.0 cfg_normalization: bool = True max_sequence_length: int = 256 ) ImagePipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is not greater than 1).
  • num_inference_steps (int, optional, defaults to 30) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
  • guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • height (int, optional, defaults to None) — The height in pixels of the generated image. If not provided, the pipeline's default resolution is used.
  • width (int, optional, defaults to None) — The width in pixels of the generated image. If not provided, the pipeline's default resolution is used.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For Lumina-T2I this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
  • attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • callback_on_step_end (Callable, optional) — A function called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
  • system_prompt (str, optional) — The system prompt to use for the image generation.
  • cfg_trunc_ratio (float, optional, defaults to 1.0) — The ratio of the timestep interval to apply normalization-based guidance scale.
  • cfg_normalization (bool, optional, defaults to True) — Whether to apply normalization-based guidance scale.
  • max_sequence_length (int, defaults to 256) — Maximum sequence length to use with the prompt.
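As a rough illustration of what guidance_scale controls, the sketch below combines a conditional and an unconditional prediction in the classifier-free guidance form (w of equation 2 in the Imagen paper). It uses plain Python lists in place of tensors. The function name apply_guidance is hypothetical and not part of Diffusers, and the sketch does not model cfg_trunc_ratio or cfg_normalization.

```python
from typing import List


def apply_guidance(cond: List[float], uncond: List[float], guidance_scale: float) -> List[float]:
    """Classifier-free guidance: uncond + w * (cond - uncond)."""
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]
uncond = [0.0, 1.0]

# A scale of 1.0 reproduces the conditional prediction (no extra guidance).
print(apply_guidance(cond, uncond, 1.0))
# The default scale of 4.0 pushes the result further along (cond - uncond).
print(apply_guidance(cond, uncond, 4.0))
```

With a scale of 1.0 the unconditional term cancels out, which is why guidance only takes effect for values above 1.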

Returns

ImagePipelineOutput or tuple

If return_dict is True, an ImagePipelineOutput is returned; otherwise a tuple is returned where the first element is a list with the generated images.
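The two return forms can be handled uniformly with a small helper. This is a minimal sketch: FakeOutput is a hypothetical stand-in for ImagePipelineOutput (it only mimics the images attribute) so the snippet runs without loading a model, and extract_images is not a Diffusers function.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple, Union


@dataclass
class FakeOutput:
    """Hypothetical stand-in that mimics the `images` attribute of ImagePipelineOutput."""
    images: List[Any]


def extract_images(result: Union[FakeOutput, Tuple[List[Any], ...]]) -> List[Any]:
    """Return the list of images from either return form."""
    if isinstance(result, tuple):
        # return_dict=False: the first tuple element is the list of images.
        return result[0]
    # return_dict=True: images live on the `images` attribute.
    return result.images


print(extract_images(FakeOutput(images=["img0"])))  # object form
print(extract_images((["img0"],)))                  # tuple form
```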

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import Lumina2Text2ImgPipeline

>>> pipe = Lumina2Text2ImgPipeline.from_pretrained("Alpha-VLLM/Lumina-Image-2.0", torch_dtype=torch.bfloat16)
>>> # Enable memory optimizations.
>>> pipe.enable_model_cpu_offload()

>>> prompt = "Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps. Background shows an industrial revolution cityscape with smoky skies and tall, metal structures"
>>> image = pipe(prompt).images[0]
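A callback for callback_on_step_end follows the signature given in the parameter list. The sketch below records progress and returns callback_kwargs unchanged. Returning the dictionary follows the convention used by other Diffusers pipelines and is an assumption here. The make_step_logger helper and the dummy invocation are hypothetical, so the snippet runs without a model.

```python
from typing import Any, Callable, Dict, List


def make_step_logger(log: List[str]) -> Callable[[Any, int, int, Dict[str, Any]], Dict[str, Any]]:
    """Build a callback that records each step and the tensor names it received."""

    def on_step_end(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
        names = ", ".join(sorted(callback_kwargs))
        log.append(f"step={step} timestep={timestep} tensors=[{names}]")
        # Hand the (possibly modified) tensors back to the caller.
        return callback_kwargs

    return on_step_end


log: List[str] = []
callback = make_step_logger(log)

# Dummy invocation standing in for what a pipeline would do at each step.
returned = callback(None, 0, 999, {"latents": [0.0, 0.1]})
print(log[0])
```

With a loaded pipeline, the callback would be passed as pipe(prompt, callback_on_step_end=callback, callback_on_step_end_tensor_inputs=["latents"]).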

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method will go back to computing decoding in one step.

enable_vae_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
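The idea behind sliced decoding can be sketched without a real VAE: decode one batch element at a time and concatenate the results, which lowers peak memory while producing the same output. Here toy_decode is a hypothetical stand-in for a decoder, and the code is an illustration rather than the AutoencoderKL implementation.

```python
from typing import Callable, List

Latent = List[float]


def toy_decode(batch: List[Latent]) -> List[Latent]:
    """Hypothetical decoder: doubles every value in every batch element."""
    return [[2.0 * v for v in sample] for sample in batch]


def sliced_decode(batch: List[Latent], decode: Callable[[List[Latent]], List[Latent]]) -> List[Latent]:
    """Decode one sample at a time, then stitch the batch back together."""
    out: List[Latent] = []
    for sample in batch:
        out.extend(decode([sample]))  # only one sample is processed per call
    return out


batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Slicing changes how the work is scheduled, not the result.
assert sliced_decode(batch, toy_decode) == toy_decode(batch)
```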

enable_vae_tiling

( )

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
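Tiling applies a similar idea along the spatial dimensions. The sketch below splits a 2D grid into non-overlapping tiles, processes each tile, and writes the results back. It is a simplified illustration: toy_process is hypothetical, and tiled VAE implementations typically overlap neighbouring tiles and blend the seams, which this sketch omits.

```python
from typing import Callable, List

Grid = List[List[float]]


def toy_process(tile: Grid) -> Grid:
    """Hypothetical per-tile operation: add 1 to every value."""
    return [[v + 1.0 for v in row] for row in tile]


def tiled_apply(grid: Grid, tile_size: int, process: Callable[[Grid], Grid]) -> Grid:
    """Apply `process` tile by tile and reassemble a grid of the same shape."""
    height, width = len(grid), len(grid[0])
    out: Grid = [[0.0] * width for _ in range(height)]
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            # Cut out one tile; edge tiles may be smaller than tile_size.
            tile = [row[left:left + tile_size] for row in grid[top:top + tile_size]]
            result = process(tile)
            for dy, row in enumerate(result):
                out[top + dy][left:left + len(row)] = row
    return out


grid = [[float(x + 4 * y) for x in range(4)] for y in range(3)]
# For an element-wise operation, tiling gives the same result as one pass.
assert tiled_apply(grid, 2, toy_process) == toy_process(grid)
```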

encode_prompt

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True negative_prompt: typing.Union[str, typing.List[str]] = None num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None system_prompt: typing.Optional[str] = None max_sequence_length: int = 256 )

Parameters

  • prompt (str or List[str], optional) — The prompt to be encoded.
  • negative_prompt (str or List[str], optional) — The prompt not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is not greater than 1). For Lumina-T2I, this should be "".
  • do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images that should be generated per prompt.
  • device (torch.device, optional) — The torch device to place the resulting embeddings on.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For Lumina-T2I, it should be the embeddings of the "" string.
  • max_sequence_length (int, defaults to 256) — Maximum sequence length to use for the prompt.

Encodes the prompt into text encoder hidden states.
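To make the role of max_sequence_length and the attention masks concrete, the sketch below truncates or pads a list of token ids to a fixed length and builds the matching mask (1 for real tokens, 0 for padding). This is an illustration under assumptions: pad_to_length is hypothetical, the pad id of 0 and the right-side padding are arbitrary choices, and the tokenizer settings the pipeline actually uses are not shown on this page.

```python
from typing import List, Tuple


def pad_to_length(token_ids: List[int], max_sequence_length: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad token ids and return (ids, attention_mask)."""
    kept = token_ids[:max_sequence_length]
    padding = max_sequence_length - len(kept)
    ids = kept + [pad_id] * padding
    mask = [1] * len(kept) + [0] * padding
    return ids, mask


ids, mask = pad_to_length([11, 12, 13], max_sequence_length=5)
print(ids)   # [11, 12, 13, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```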
