Marigold Computer Vision
Marigold was proposed in Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation, a CVPR 2024 Oral paper by Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. The core idea is to repurpose the generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional computer vision tasks. This approach was explored by fine-tuning Stable Diffusion for Monocular Depth Estimation, as demonstrated in the teaser above.
Marigold was later extended in the follow-up paper, Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis, authored by Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, and Konrad Schindler. This work expanded Marigold to support new modalities such as Surface Normals and Intrinsic Image Decomposition (IID), introduced a training protocol for Latent Consistency Models (LCM), and demonstrated High-Resolution (HR) processing capability.
The early Marigold models (v1-0 and earlier) were optimized for best results with at least 10 inference steps. LCM models were later developed to enable high-quality inference in just 1 to 4 steps. Marigold models v1-1 and later use the DDIM scheduler to achieve optimal results in as few as 1 to 4 steps.
Available Pipelines
Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a corresponding prediction. Currently, the following computer vision tasks are implemented:
Pipeline | Recommended Model Checkpoints | Spaces (Interactive Apps) | Predicted Modalities |
---|---|---|---|
MarigoldDepthPipeline | prs-eth/marigold-depth-v1-1 | Depth Estimation | Depth, Disparity |
MarigoldNormalsPipeline | prs-eth/marigold-normals-v1-1 | Surface Normals Estimation | Surface normals |
MarigoldIntrinsicsPipeline | prs-eth/marigold-iid-appearance-v1-1, prs-eth/marigold-iid-lighting-v1-1 | Intrinsic Image Decomposition | Albedo, Materials, Lighting |
Available Checkpoints
All original checkpoints are available under the PRS-ETH organization on Hugging Face. They are designed for use with diffusers pipelines and the original codebase, which can also be used to train new model checkpoints. The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps.
Checkpoint | Modality | Comment |
---|---|---|
prs-eth/marigold-depth-v1-1 | Depth | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference. |
prs-eth/marigold-normals-v1-1 | Normals | The surface normals predictions are unit-length 3D vectors in the screen space camera, with values in the range from -1 to 1. |
prs-eth/marigold-iid-appearance-v1-1 | Intrinsics | InteriorVerse decomposition is comprised of Albedo and two BRDF material properties: Roughness and Metallicity. |
prs-eth/marigold-iid-lighting-v1-1 | Intrinsics | HyperSim decomposition of an image \(I\) is comprised of Albedo \(A\), Diffuse shading \(S\), and Non-diffuse residual \(R\): \(I = A*S+R\). |
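The lighting decomposition in the last row can be checked on a single pixel. The following is a minimal sketch of the relation I = A*S+R in plain Python; the pixel values are made up for illustration, and whether real model outputs satisfy the identity exactly (and in which color space) is an assumption not covered here.

```python
def recompose_pixel(albedo, shading, residual):
    """Recombine one RGB pixel as I = A * S + R, channel by channel."""
    return tuple(a * s + r for a, s, r in zip(albedo, shading, residual))


# Hypothetical per-pixel values, chosen only to illustrate the formula.
albedo = (0.5, 0.25, 1.0)
shading = (0.5, 0.5, 0.5)
residual = (0.0, 0.125, 0.25)

pixel = recompose_pixel(albedo, shading, residual)
print(pixel)
```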
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines. To learn more about reducing the memory usage of this pipeline, refer to the “Reduce memory usage” section.
Marigold pipelines were designed and tested with the scheduler embedded in the model checkpoint. The optimal number of inference steps varies by scheduler, with no universal value that works best across all cases. To accommodate this, the num_inference_steps parameter in the pipeline’s __call__ method defaults to None (see the API reference). Unless set explicitly, it inherits the value from the default_denoising_steps field in the checkpoint configuration file (model_index.json). This ensures high-quality predictions when invoking the pipeline with only the image argument.
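The fallback described above can be sketched in a few lines. This is an illustration of the logic only, not the library's code: resolve_num_inference_steps is a hypothetical helper, and checkpoint_default stands in for the default_denoising_steps value read from model_index.json.

```python
from typing import Optional


def resolve_num_inference_steps(
    num_inference_steps: Optional[int],
    default_denoising_steps: Optional[int],
) -> int:
    """Use the explicit step count if given, otherwise the checkpoint default."""
    if num_inference_steps is None:
        if default_denoising_steps is None:
            raise ValueError("No step count given and the checkpoint defines no default.")
        num_inference_steps = default_denoising_steps
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be a positive integer.")
    return num_inference_steps


# Hypothetical value standing in for the checkpoint configuration.
checkpoint_default = 4
print(resolve_num_inference_steps(None, checkpoint_default))  # falls back to the default
print(resolve_num_inference_steps(10, checkpoint_default))    # explicit value wins
```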
See also Marigold usage examples.
Marigold Depth Prediction API
class diffusers.MarigoldDepthPipeline
< source >( unet: UNet2DConditionModel vae: AutoencoderKL scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_lcm.LCMScheduler] text_encoder: CLIPTextModel tokenizer: CLIPTokenizer prediction_type: typing.Optional[str] = None scale_invariant: typing.Optional[bool] = True shift_invariant: typing.Optional[bool] = True default_denoising_steps: typing.Optional[int] = None default_processing_resolution: typing.Optional[int] = None )
Parameters
- unet (UNet2DConditionModel) — Conditional U-Net to denoise the depth latent, conditioned on image latent.
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images and predictions to and from latent representations.
- scheduler (DDIMScheduler or LCMScheduler) — A scheduler to be used in combination with unet to denoise the encoded image latents.
- text_encoder (CLIPTextModel) — Text-encoder, for empty text embedding.
- tokenizer (CLIPTokenizer) — CLIP tokenizer.
- prediction_type (str, optional) — Type of predictions made by the model.
- scale_invariant (bool, optional) — A model property specifying whether the predicted depth maps are scale-invariant. This value must be set in the model config. When used together with the shift_invariant=True flag, the model is also called “affine-invariant”. NB: overriding this value is not supported.
- shift_invariant (bool, optional) — A model property specifying whether the predicted depth maps are shift-invariant. This value must be set in the model config. When used together with the scale_invariant=True flag, the model is also called “affine-invariant”. NB: overriding this value is not supported.
- default_denoising_steps (int, optional) — The minimum number of denoising diffusion steps that are required to produce a prediction of reasonable quality with the given model. This value must be set in the model config. When the pipeline is called without explicitly setting num_inference_steps, the default value is used. This is required to ensure reasonable results with various model flavors compatible with the pipeline, such as those relying on very short denoising schedules (LCMScheduler) and those with full diffusion schedules (DDIMScheduler).
- default_processing_resolution (int, optional) — The recommended value of the processing_resolution parameter of the pipeline. This value must be set in the model config. When the pipeline is called without explicitly setting processing_resolution, the default value is used. This is required to ensure reasonable results with various model flavors trained with varying optimal processing resolution values.
Pipeline for monocular depth estimation using the Marigold method: https://marigoldmonodepth.github.io.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
__call__
< source >( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] num_inference_steps: typing.Optional[int] = None ensemble_size: int = 1 processing_resolution: typing.Optional[int] = None match_input_resolution: bool = True resample_method_input: str = 'bilinear' resample_method_output: str = 'bilinear' batch_size: int = 1 ensembling_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None latents: typing.Union[torch.Tensor, typing.List[torch.Tensor], NoneType] = None generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None output_type: str = 'np' output_uncertainty: bool = False output_latent: bool = False return_dict: bool = True ) → MarigoldDepthOutput or tuple
Parameters
- image (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]) — An input image or images used as an input for the depth estimation task. For arrays and tensors, the expected value range is [0, 1]. Passing a batch of images is possible by providing a four-dimensional array or a tensor. Additionally, a list of images of two- or three-dimensional arrays or tensors can be passed. In the latter case, all list elements must have the same width and height.
- num_inference_steps (int, optional, defaults to None) — Number of denoising diffusion steps during inference. The default value None results in automatic selection.
- ensemble_size (int, defaults to 1) — Number of ensemble predictions. Higher values result in measurable improvements and visual degradation.
- processing_resolution (int, optional, defaults to None) — Effective processing resolution. When set to 0, matches the larger input image dimension. This produces crisper predictions, but may also lead to the overall loss of global context. The default value None resolves to the optimal value from the model config.
- match_input_resolution (bool, optional, defaults to True) — When enabled, the output prediction is resized to match the input dimensions. When disabled, the longer side of the output will be equal to processing_resolution.
- resample_method_input (str, optional, defaults to "bilinear") — Resampling method used to resize input images to processing_resolution. The accepted values are: "nearest", "nearest-exact", "bilinear", "bicubic", or "area".
- resample_method_output (str, optional, defaults to "bilinear") — Resampling method used to resize output predictions to match the input resolution. The accepted values are: "nearest", "nearest-exact", "bilinear", "bicubic", or "area".
- batch_size (int, optional, defaults to 1) — Batch size; only matters when setting ensemble_size or passing a tensor of images.
- ensembling_kwargs (dict, optional, defaults to None) — Extra dictionary with arguments for precise ensembling control. The following options are available:
  - reduction (str, optional, defaults to "median"): Defines the ensembling function applied in every pixel location, can be either "median" or "mean".
  - regularizer_strength (float, optional, defaults to 0.02): Strength of the regularizer that pulls the aligned predictions to the unit range from 0 to 1.
  - max_iter (int, optional, defaults to 2): Maximum number of the alignment solver steps. Refer to the scipy.optimize.minimize function, options argument.
  - tol (float, optional, defaults to 1e-3): Alignment solver tolerance. The solver stops when the tolerance is reached.
  - max_res (int, optional, defaults to None): Resolution at which the alignment is performed; None matches the processing_resolution.
- latents (torch.Tensor or List[torch.Tensor], optional, defaults to None) — Latent noise tensors to replace the random initialization. These can be taken from the previous function call’s output.
- generator (torch.Generator or List[torch.Generator], optional, defaults to None) — Random number generator object to ensure reproducibility.
- output_type (str, optional, defaults to "np") — Preferred format of the output’s prediction and the optional uncertainty fields. The accepted values are: "np" (numpy array) or "pt" (torch tensor).
- output_uncertainty (bool, optional, defaults to False) — When enabled, the output’s uncertainty field contains the predictive uncertainty map, provided that the ensemble_size argument is set to a value above 2.
- output_latent (bool, optional, defaults to False) — When enabled, the output’s latent field contains the latent codes corresponding to the predictions within the ensemble. These codes can be saved, modified, and used for subsequent calls with the latents argument.
- return_dict (bool, optional, defaults to True) — Whether or not to return a MarigoldDepthOutput instead of a plain tuple.
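The reduction option of ensembling_kwargs can be illustrated with a small sketch. It assumes the ensemble members are already aligned, so the alignment step controlled by regularizer_strength, max_iter, and tol is left out; reduce_ensemble is a hypothetical helper working on flat Python lists rather than the arrays the pipeline uses.

```python
from statistics import mean, median


def reduce_ensemble(predictions, reduction="median"):
    """Combine already-aligned ensemble members pixel by pixel.

    predictions: list of equally sized flat lists, one per ensemble member.
    """
    if reduction not in ("median", "mean"):
        raise ValueError(f"Unsupported reduction: {reduction}")
    combine = median if reduction == "median" else mean
    # zip(*predictions) groups the values that share a pixel location.
    return [combine(values) for values in zip(*predictions)]


# Three hypothetical ensemble members with three pixels each.
members = [
    [0.25, 0.5, 0.75],
    [0.25, 0.5, 0.5],
    [0.5, 1.0, 0.25],
]
print(reduce_ensemble(members, "median"))
print(reduce_ensemble(members, "mean"))
```

The median is less sensitive to a single outlying member than the mean, which is one plausible reason for it being the documented default.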
Returns
MarigoldDepthOutput or tuple
If return_dict is True, MarigoldDepthOutput is returned, otherwise a tuple is returned where the first element is the prediction, the second element is the uncertainty (or None), and the third is the latent (or None).
Function invoked when calling the pipeline.
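The ensembling-related arguments interact: an uncertainty map is only described for ensemble sizes above 2. A minimal sketch of assembling a consistent set of keyword arguments follows; build_call_kwargs is a hypothetical helper, and its check mirrors the parameter description rather than the library's own validation, which may differ.

```python
def build_call_kwargs(ensemble_size=1, output_uncertainty=False,
                      reduction="median", batch_size=1):
    """Assemble keyword arguments for a Marigold depth pipeline call."""
    if output_uncertainty and ensemble_size <= 2:
        # The parameter description asks for an ensemble size above 2.
        raise ValueError("output_uncertainty needs ensemble_size above 2.")
    kwargs = {
        "ensemble_size": ensemble_size,
        "batch_size": batch_size,
        "output_uncertainty": output_uncertainty,
    }
    if ensemble_size > 1:
        # Only pass ensembling options when ensembling actually happens.
        kwargs["ensembling_kwargs"] = {"reduction": reduction}
    return kwargs


kwargs = build_call_kwargs(ensemble_size=5, output_uncertainty=True)
print(kwargs)
# With a loaded pipeline and an image, this would be used as: pipe(image, **kwargs)
```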
Examples:
>>> import diffusers
>>> import torch
>>> pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
... "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")
>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
>>> depth = pipe(image)
>>> vis = pipe.image_processor.visualize_depth(depth.prediction)
>>> vis[0].save("einstein_depth.png")
>>> depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
>>> depth_16bit[0].save("einstein_depth_16bit.png")
class diffusers.pipelines.marigold.MarigoldDepthOutput
< source >( prediction: typing.Union[numpy.ndarray, torch.Tensor] uncertainty: typing.Union[NoneType, numpy.ndarray, torch.Tensor] latent: typing.Optional[torch.Tensor] )
Parameters
- prediction (np.ndarray, torch.Tensor) — Predicted depth maps with values in the range [0, 1]. The shape is $numimages \times 1 \times height \times width$ for torch.Tensor or $numimages \times height \times width \times 1$ for np.ndarray.
- uncertainty (None, np.ndarray, torch.Tensor) — Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $numimages \times 1 \times height \times width$ for torch.Tensor or $numimages \times height \times width \times 1$ for np.ndarray.
- latent (None, torch.Tensor) — Latent features corresponding to the predictions, compatible with the latents argument of the pipeline. The shape is $numimages * numensemble \times 4 \times latentheight \times latentwidth$.
Output class for Marigold monocular depth prediction pipeline.
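The shape conventions of the output fields can be written down as a small sketch. The helper names are hypothetical; the latent height and width are passed in explicitly because their relation to the image size is not stated here.

```python
def expected_depth_shape(num_images, height, width, output_type="np"):
    """Shape of the prediction field for a given output_type."""
    if output_type == "pt":
        # Channel-first layout for torch tensors.
        return (num_images, 1, height, width)
    if output_type == "np":
        # Channel-last layout for numpy arrays.
        return (num_images, height, width, 1)
    raise ValueError(f"Unsupported output_type: {output_type}")


def expected_latent_shape(num_images, ensemble_size, latent_height, latent_width):
    """Shape of the latent field: one 4-channel latent per ensemble member."""
    return (num_images * ensemble_size, 4, latent_height, latent_width)


print(expected_depth_shape(2, 480, 640, "pt"))
print(expected_depth_shape(2, 480, 640, "np"))
print(expected_latent_shape(2, 5, 60, 80))
```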
diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_depth
< source >( depth: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] val_min: float = 0.0 val_max: float = 1.0 color_map: str = 'Spectral' )
Parameters
- depth (Union[PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]]) — Depth maps.
- val_min (float, optional, defaults to 0.0) — Minimum value of the visualized depth range.
- val_max (float, optional, defaults to 1.0) — Maximum value of the visualized depth range.
- color_map (str, optional, defaults to "Spectral") — Color map used to convert a single-channel depth prediction into colored representation.
Visualizes depth maps, such as predictions of the MarigoldDepthPipeline.
Returns: List[PIL.Image.Image] with depth maps visualization.
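One way to read val_min and val_max is as the window that is stretched over the color map. The sketch below assumes a linear rescale with clipping to [0, 1] before the color lookup; this is an assumption about the behavior, not a description of the library's implementation, and normalize_depth is a hypothetical helper.

```python
def normalize_depth(values, val_min=0.0, val_max=1.0):
    """Map depth values from [val_min, val_max] to [0, 1], clipping outliers."""
    if val_max <= val_min:
        raise ValueError("val_max must be greater than val_min.")
    span = val_max - val_min
    return [min(max((v - val_min) / span, 0.0), 1.0) for v in values]


# Hypothetical depth samples; the last one lies outside the visualized range.
print(normalize_depth([0.0, 0.5, 1.0, 3.0], val_min=0.0, val_max=2.0))
```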
Marigold Normals Estimation API
class diffusers.MarigoldNormalsPipeline
< source >( unet: UNet2DConditionModel vae: AutoencoderKL scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_lcm.LCMScheduler] text_encoder: CLIPTextModel tokenizer: CLIPTokenizer prediction_type: typing.Optional[str] = None use_full_z_range: typing.Optional[bool] = True default_denoising_steps: typing.Optional[int] = None default_processing_resolution: typing.Optional[int] = None )
Parameters
- unet (UNet2DConditionModel) — Conditional U-Net to denoise the normals latent, conditioned on image latent.
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images and predictions to and from latent representations.
- scheduler (DDIMScheduler or LCMScheduler) — A scheduler to be used in combination with unet to denoise the encoded image latents.
- text_encoder (CLIPTextModel) — Text-encoder, for empty text embedding.
- tokenizer (CLIPTokenizer) — CLIP tokenizer.
- prediction_type (str, optional) — Type of predictions made by the model.
- use_full_z_range (bool, optional) — Whether the normals predicted by this model utilize the full range of the Z dimension, or only its positive half.
- default_denoising_steps (int, optional) — The minimum number of denoising diffusion steps that are required to produce a prediction of reasonable quality with the given model. This value must be set in the model config. When the pipeline is called without explicitly setting num_inference_steps, the default value is used. This is required to ensure reasonable results with various model flavors compatible with the pipeline, such as those relying on very short denoising schedules (LCMScheduler) and those with full diffusion schedules (DDIMScheduler).
- default_processing_resolution (int, optional) — The recommended value of the processing_resolution parameter of the pipeline. This value must be set in the model config. When the pipeline is called without explicitly setting processing_resolution, the default value is used. This is required to ensure reasonable results with various model flavors trained with varying optimal processing resolution values.
Pipeline for monocular normals estimation using the Marigold method: https://marigoldmonodepth.github.io.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
__call__
< source >( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] num_inference_steps: typing.Optional[int] = None ensemble_size: int = 1 processing_resolution: typing.Optional[int] = None match_input_resolution: bool = True resample_method_input: str = 'bilinear' resample_method_output: str = 'bilinear' batch_size: int = 1 ensembling_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None latents: typing.Union[torch.Tensor, typing.List[torch.Tensor], NoneType] = None generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None output_type: str = 'np' output_uncertainty: bool = False output_latent: bool = False return_dict: bool = True ) → MarigoldNormalsOutput or tuple
Parameters
- image (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]) — An input image or images used as an input for the normals estimation task. For arrays and tensors, the expected value range is [0, 1]. Passing a batch of images is possible by providing a four-dimensional array or a tensor. Additionally, a list of images of two- or three-dimensional arrays or tensors can be passed. In the latter case, all list elements must have the same width and height.
- num_inference_steps (int, optional, defaults to None) — Number of denoising diffusion steps during inference. The default value None results in automatic selection.
- ensemble_size (int, defaults to 1) — Number of ensemble predictions. Higher values result in measurable improvements and visual degradation.
- processing_resolution (int, optional, defaults to None) — Effective processing resolution. When set to 0, matches the larger input image dimension. This produces crisper predictions, but may also lead to the overall loss of global context. The default value None resolves to the optimal value from the model config.
- match_input_resolution (bool, optional, defaults to True) — When enabled, the output prediction is resized to match the input dimensions. When disabled, the longer side of the output will be equal to processing_resolution.
- resample_method_input (str, optional, defaults to "bilinear") — Resampling method used to resize input images to processing_resolution. The accepted values are: "nearest", "nearest-exact", "bilinear", "bicubic", or "area".
- resample_method_output (str, optional, defaults to "bilinear") — Resampling method used to resize output predictions to match the input resolution. The accepted values are: "nearest", "nearest-exact", "bilinear", "bicubic", or "area".
- batch_size (int, optional, defaults to 1) — Batch size; only matters when setting ensemble_size or passing a tensor of images.
- ensembling_kwargs (dict, optional, defaults to None) — Extra dictionary with arguments for precise ensembling control. The following options are available:
  - reduction (str, optional, defaults to "closest"): Defines the ensembling function applied in every pixel location, can be either "closest" or "mean".
- latents (torch.Tensor, optional, defaults to None) — Latent noise tensors to replace the random initialization. These can be taken from the previous function call’s output.
- generator (torch.Generator or List[torch.Generator], optional, defaults to None) — Random number generator object to ensure reproducibility.
- output_type (str, optional, defaults to "np") — Preferred format of the output’s prediction and the optional uncertainty fields. The accepted values are: "np" (numpy array) or "pt" (torch tensor).
- output_uncertainty (bool, optional, defaults to False) — When enabled, the output’s uncertainty field contains the predictive uncertainty map, provided that the ensemble_size argument is set to a value above 2.
- output_latent (bool, optional, defaults to False) — When enabled, the output’s latent field contains the latent codes corresponding to the predictions within the ensemble. These codes can be saved, modified, and used for subsequent calls with the latents argument.
- return_dict (bool, optional, defaults to True) — Whether or not to return a MarigoldNormalsOutput instead of a plain tuple.
Returns
MarigoldNormalsOutput or tuple
If return_dict is True, MarigoldNormalsOutput is returned, otherwise a tuple is returned where the first element is the prediction, the second element is the uncertainty (or None), and the third is the latent (or None).
Function invoked when calling the pipeline.
Examples:
>>> import diffusers
>>> import torch
>>> pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
... "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")
>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
>>> normals = pipe(image)
>>> vis = pipe.image_processor.visualize_normals(normals.prediction)
>>> vis[0].save("einstein_normals.png")
class diffusers.pipelines.marigold.MarigoldNormalsOutput
< source >( prediction: typing.Union[numpy.ndarray, torch.Tensor] uncertainty: typing.Union[NoneType, numpy.ndarray, torch.Tensor] latent: typing.Optional[torch.Tensor] )
Parameters
- prediction (np.ndarray, torch.Tensor) — Predicted normals with values in the range [-1, 1]. The shape is $numimages \times 3 \times height \times width$ for torch.Tensor or $numimages \times height \times width \times 3$ for np.ndarray.
- uncertainty (None, np.ndarray, torch.Tensor) — Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $numimages \times 1 \times height \times width$ for torch.Tensor or $numimages \times height \times width \times 1$ for np.ndarray.
- latent (None, torch.Tensor) — Latent features corresponding to the predictions, compatible with the latents argument of the pipeline. The shape is $numimages * numensemble \times 4 \times latentheight \times latentwidth$.
Output class for Marigold monocular normals prediction pipeline.
diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_normals
< source >( normals: typing.Union[numpy.ndarray, torch.Tensor, typing.List[numpy.ndarray], typing.List[torch.Tensor]] flip_x: bool = False flip_y: bool = False flip_z: bool = False )
Parameters
- normals (Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]]) — Surface normals.
- flip_x (bool, optional, defaults to False) — Flips the X axis of the normals frame of reference. Default direction is right.
- flip_y (bool, optional, defaults to False) — Flips the Y axis of the normals frame of reference. Default direction is top.
- flip_z (bool, optional, defaults to False) — Flips the Z axis of the normals frame of reference. Default direction is facing the observer.
Visualizes surface normals, such as predictions of the MarigoldNormalsPipeline.
Returns: List[PIL.Image.Image] with surface normals visualization.
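The effect of the flip flags can be illustrated with the common normal-map color encoding, in which each component in [-1, 1] is mapped linearly to an 8-bit channel. Whether the library uses exactly this mapping is an assumption; normal_to_rgb is a hypothetical helper operating on a single normal vector.

```python
def normal_to_rgb(normal, flip_x=False, flip_y=False, flip_z=False):
    """Encode one unit normal (x, y, z) in [-1, 1] as an 8-bit RGB triple."""
    x, y, z = normal
    # Flipping an axis negates the corresponding component.
    if flip_x:
        x = -x
    if flip_y:
        y = -y
    if flip_z:
        z = -z
    # Linear map from [-1, 1] to [0, 255].
    return tuple(int(round((c + 1.0) * 0.5 * 255)) for c in (x, y, z))


# A hypothetical normal pointing right, down, and toward the observer.
print(normal_to_rgb((1.0, -1.0, 1.0)))
print(normal_to_rgb((1.0, -1.0, 1.0), flip_y=True))
```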
Marigold Intrinsic Image Decomposition API
class diffusers.MarigoldIntrinsicsPipeline
< source >( unet: UNet2DConditionModel vae: AutoencoderKL scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_lcm.LCMScheduler] text_encoder: CLIPTextModel tokenizer: CLIPTokenizer prediction_type: typing.Optional[str] = None target_properties: typing.Optional[typing.Dict[str, typing.Any]] = None default_denoising_steps: typing.Optional[int] = None default_processing_resolution: typing.Optional[int] = None )
Parameters
- unet (UNet2DConditionModel) — Conditional U-Net to denoise the targets latent, conditioned on image latent.
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images and predictions to and from latent representations.
- scheduler (DDIMScheduler or LCMScheduler) — A scheduler to be used in combination with unet to denoise the encoded image latents.
- text_encoder (CLIPTextModel) — Text-encoder, for empty text embedding.
- tokenizer (CLIPTokenizer) — CLIP tokenizer.
- prediction_type (str, optional) — Type of predictions made by the model.
- target_properties (Dict[str, Any], optional) — Properties of the predicted modalities, such as target_names, a List[str] used to define the number, order and names of the predicted modalities, and any other metadata that may be required to interpret the predictions.
- default_denoising_steps (int, optional) — The minimum number of denoising diffusion steps that are required to produce a prediction of reasonable quality with the given model. This value must be set in the model config. When the pipeline is called without explicitly setting num_inference_steps, the default value is used. This is required to ensure reasonable results with various model flavors compatible with the pipeline, such as those relying on very short denoising schedules (LCMScheduler) and those with full diffusion schedules (DDIMScheduler).
- default_processing_resolution (int, optional) — The recommended value of the processing_resolution parameter of the pipeline. This value must be set in the model config. When the pipeline is called without explicitly setting processing_resolution, the default value is used. This is required to ensure reasonable results with various model flavors trained with varying optimal processing resolution values.
Pipeline for Intrinsic Image Decomposition (IID) using the Marigold method: https://marigoldcomputervision.github.io.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
__call__
< source >( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] num_inference_steps: typing.Optional[int] = None ensemble_size: int = 1 processing_resolution: typing.Optional[int] = None match_input_resolution: bool = True resample_method_input: str = 'bilinear' resample_method_output: str = 'bilinear' batch_size: int = 1 ensembling_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None latents: typing.Union[torch.Tensor, typing.List[torch.Tensor], NoneType] = None generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None output_type: str = 'np' output_uncertainty: bool = False output_latent: bool = False return_dict: bool = True ) → MarigoldIntrinsicsOutput or tuple
Parameters
- image (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]) — An input image or images used as an input for the intrinsic decomposition task. For arrays and tensors, the expected value range is [0, 1]. Passing a batch of images is possible by providing a four-dimensional array or a tensor. Additionally, a list of images of two- or three-dimensional arrays or tensors can be passed. In the latter case, all list elements must have the same width and height.
- num_inference_steps (int, optional, defaults to None) — Number of denoising diffusion steps during inference. The default value None results in automatic selection.
- ensemble_size (int, defaults to 1) — Number of ensemble predictions. Higher values result in measurable improvements and visual degradation.
- processing_resolution (int, optional, defaults to None) — Effective processing resolution. When set to 0, matches the larger input image dimension. This produces crisper predictions, but may also lead to the overall loss of global context. The default value None resolves to the optimal value from the model config.
- match_input_resolution (bool, optional, defaults to True) — When enabled, the output prediction is resized to match the input dimensions. When disabled, the longer side of the output will be equal to processing_resolution.
- resample_method_input (str, optional, defaults to "bilinear") — Resampling method used to resize input images to processing_resolution. The accepted values are: "nearest", "nearest-exact", "bilinear", "bicubic", or "area".
- resample_method_output (str, optional, defaults to "bilinear") — Resampling method used to resize output predictions to match the input resolution. The accepted values are: "nearest", "nearest-exact", "bilinear", "bicubic", or "area".
- batch_size (int, optional, defaults to 1) — Batch size; only matters when setting ensemble_size or passing a tensor of images.
- ensembling_kwargs (dict, optional, defaults to None) — Extra dictionary with arguments for precise ensembling control. The following options are available:
  - reduction (str, optional, defaults to "median"): Defines the ensembling function applied in every pixel location, can be either "median" or "mean".
- latents (torch.Tensor, optional, defaults to None) — Latent noise tensors to replace the random initialization. These can be taken from the previous function call’s output.
- generator (torch.Generator or List[torch.Generator], optional, defaults to None) — Random number generator object to ensure reproducibility.
- output_type (str, optional, defaults to "np") — Preferred format of the output’s prediction and the optional uncertainty fields. The accepted values are: "np" (numpy array) or "pt" (torch tensor).
- output_uncertainty (bool, optional, defaults to False) — When enabled, the output’s uncertainty field contains the predictive uncertainty map, provided that the ensemble_size argument is set to a value above 2.
- output_latent (bool, optional, defaults to False) — When enabled, the output’s latent field contains the latent codes corresponding to the predictions within the ensemble. These codes can be saved, modified, and used for subsequent calls with the latents argument.
- return_dict (bool, optional, defaults to True) — Whether or not to return a MarigoldIntrinsicsOutput instead of a plain tuple.
Returns
MarigoldIntrinsicsOutput or tuple
If return_dict is True, MarigoldIntrinsicsOutput is returned, otherwise a tuple is returned where the first element is the prediction, the second element is the uncertainty (or None), and the third is the latent (or None).
Function invoked when calling the pipeline.
Examples:
>>> import diffusers
>>> import torch
>>> pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
... "prs-eth/marigold-iid-appearance-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")
>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
>>> intrinsics = pipe(image)
>>> vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
>>> vis[0]["albedo"].save("einstein_albedo.png")
>>> vis[0]["roughness"].save("einstein_roughness.png")
>>> vis[0]["metallicity"].save("einstein_metallicity.png")
>>> import diffusers
>>> import torch
>>> pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
... "prs-eth/marigold-iid-lighting-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")
>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
>>> intrinsics = pipe(image)
>>> vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
>>> vis[0]["albedo"].save("einstein_albedo.png")
>>> vis[0]["shading"].save("einstein_shading.png")
>>> vis[0]["residual"].save("einstein_residual.png")
class diffusers.pipelines.marigold.MarigoldIntrinsicsOutput
< source >( prediction: typing.Union[numpy.ndarray, torch.Tensor] uncertainty: typing.Union[NoneType, numpy.ndarray, torch.Tensor] latent: typing.Optional[torch.Tensor] )
Parameters
- prediction (np.ndarray, torch.Tensor) — Predicted image intrinsics with values in the range [0, 1]. The shape is $(numimages * numtargets) \times 3 \times height \times width$ for torch.Tensor or $(numimages * numtargets) \times height \times width \times 3$ for np.ndarray, where numtargets corresponds to the number of predicted target modalities of the intrinsic image decomposition.
- uncertainty (None, np.ndarray, torch.Tensor) — Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $(numimages * numtargets) \times 3 \times height \times width$ for torch.Tensor or $(numimages * numtargets) \times height \times width \times 3$ for np.ndarray.
- latent (None, torch.Tensor) — Latent features corresponding to the predictions, compatible with the latents argument of the pipeline. The shape is $(numimages * numensemble) \times (numtargets * 4) \times latentheight \times latentwidth$.
Output class for Marigold Intrinsic Image Decomposition pipeline.
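Because the first dimension of the prediction stacks images and targets together, a consumer has to regroup the entries per image. The sketch below assumes that the targets of one image are stored contiguously (image-major order), which the shape description alone does not confirm; split_by_target is a hypothetical helper and the string entries stand in for per-target arrays.

```python
def split_by_target(predictions, target_names):
    """Group a flat stack of (numimages * numtargets) entries per image."""
    n = len(target_names)
    if n == 0 or len(predictions) % n != 0:
        raise ValueError("Stack length must be a multiple of the number of targets.")
    # Assumed layout: [img0/target0, img0/target1, ..., img1/target0, ...]
    return [
        dict(zip(target_names, predictions[i:i + n]))
        for i in range(0, len(predictions), n)
    ]


# Placeholder entries for two images and three targets.
stack = ["a0", "r0", "m0", "a1", "r1", "m1"]
names = ["albedo", "roughness", "metallicity"]
per_image = split_by_target(stack, names)
print(per_image[1]["roughness"])
```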
diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_intrinsics
< source >( prediction: typing.Union[numpy.ndarray, torch.Tensor, typing.List[numpy.ndarray], typing.List[torch.Tensor]] target_properties: typing.Dict[str, typing.Any] color_map: typing.Union[str, typing.Dict[str, str]] = 'binary' )
Parameters
- prediction (Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]]) — Intrinsic image decomposition.
- target_properties (Dict[str, Any]) — Decomposition properties. Expected entries: target_names: List[str] and a dictionary with keys prediction_space: str, sub_target_names: List[Union[str, Null]] (must have 3 entries, null for missing modalities), up_to_scale: bool, one for each target and sub-target.
- color_map (Union[str, Dict[str, str]], optional, defaults to "binary") — Color map used to convert single-channel predictions into colored representations. When a dictionary is passed, each modality can be colored with its own color map.
Visualizes intrinsic image decomposition, such as predictions of the MarigoldIntrinsicsPipeline.
Returns: List[Dict[str, PIL.Image.Image]] with intrinsic image decomposition visualization.