Question about the VAE's input/output channel dimensions
Hi, thank you for sharing this interesting work!
I have a question about the provided VAE's input and output channels.
In ldm3d-4c/vae/config.json the VAE is configured with four input channels, which differs from the six channels described in the paper (RGB + RGB-like depth map). Ignoring this, I tried encoding my data with the provided VAE anyway, but ran into the following error:
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 242, in encode
h = self.encoder(x)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/vae.py", line 111, in forward
sample = self.conv_in(sample)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [128, 4, 3, 3], expected input[1, 6, 512, 512] to have 4 channels, but got 6 channels instead
So I converted my depth map into an 8-bit image and merged it with the RGB image, but then I was faced with yet another error:
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 242, in encode
h = self.encoder(x)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/vae.py", line 140, in forward
sample = down_block(sample)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 1214, in forward
hidden_states = resnet(hidden_states, temb=None)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/resnet.py", line 597, in forward
hidden_states = self.norm1(hidden_states)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 273, in forward
return F.group_norm(
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/functional.py", line 2530, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected weight to be a vector of size equal to the number of channels in input, but got weight of shape [128] and input of shape [128, 512, 512]
I guess I have misunderstood something? Also, did you use bicubic upsampling to scale the depth map resolution from 384 x 384 to 512 x 512?
Hi @terryryu !!
In ldm3d-4c/vae/config.json it is written that the VAE has four input channels, which is different from the expected six channels written in the paper (RGB + RGB-like depth map).
Indeed! For this version of LDM3D, which we called "ldm3d-4c", we map the depth to a single channel, making the input a 4-channel tensor (3 for RGB and 1 for depth).
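For reference, here is a minimal sketch of how such a 4-channel RGBD input can be assembled; the arrays and value ranges below are placeholders, not the exact preprocessing used for training:

import numpy as np

# Minimal sketch (illustrative only): pack RGB (H, W, 3) and a single-channel
# depth map (H, W) into one 4-channel RGBD array for the ldm3d-4c VAE.
rgb = np.random.rand(512, 512, 3).astype(np.float32)     # placeholder RGB in [0, 1]
depth = np.random.rand(512, 512).astype(np.float32)      # placeholder depth in [0, 1]
rgbd = np.concatenate([rgb, depth[..., None]], axis=-1)   # shape (512, 512, 4)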
Also, did you use bicubic upsampling to scale the depthmap resolution from 384 x 384 to 512 x 512?
We are using dpt-512 with a 512 input size, so the output is automatically at 512 resolution and no extra upsampling is needed.
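In case it helps, a depth map matching the RGB resolution can be obtained with the transformers depth-estimation pipeline; the checkpoint name below ("Intel/dpt-large") is only a stand-in assumption, not necessarily the exact dpt-512 model used for LDM3D:

from PIL import Image
from transformers import pipeline

# Assumption: "Intel/dpt-large" stands in for the dpt-512 checkpoint mentioned above.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
image = Image.open("lemons_ldm3d_rgb.jpg").resize((512, 512))
result = depth_estimator(image)
result["depth"].save("lemons_depth_estimate.png")  # PIL image at the input resolution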
As for the error, I am not sure; it is a bit hard to answer without more context. Do you have a short snippet so I can try to reproduce it?
Best
Estelle
Thank you for your help @estellea !
I've come up with a minimal example that encodes and decodes the RGBD output of the lemon example.
import torch
import cv2
import numpy as np
from einops import rearrange
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessorLDM3D

def load_images(rgb_path, depth_path):
    rgb_img = cv2.imread(rgb_path) / 255.
    depth_img = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED)  # ensures 16-bit is preserved
    if depth_img.dtype != np.uint16:
        raise ValueError("Depth image is not 16-bit!")
    depth_img = depth_img / 65536.
    depth_img_expanded = np.expand_dims(depth_img, axis=-1)
    merged_img = np.concatenate([rgb_img, depth_img_expanded], axis=-1)
    return merged_img

with torch.no_grad():
    test_rgbd = load_images("/home/terryryu/Experiments/LDM3D/lemons_ldm3d_rgb.jpg", "/home/terryryu/Experiments/LDM3D/lemons_ldm3d_depth.png")
    vae = AutoencoderKL.from_pretrained("/home/terryryu/Weights/LDM3D/vae/", local_files_only=True, torch_dtype=torch.float16).cuda()
    vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)
    processor = VaeImageProcessorLDM3D(vae_scale_factor=vae_scale_factor)
    test_rgbd = rearrange(test_rgbd, "h w c -> 1 c h w")
    test_rgbd = torch.cuda.HalfTensor(test_rgbd)
    latents = vae.encode(test_rgbd).latent_dist.mode()
    image = vae.decode(latents / vae.config.scaling_factor, return_dict=False)[0]

    output_type = "pil"
    do_denormalize = [True] * image.shape[0]
    rgb, depth = processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
    rgb[0].save("./minimal_test_rgb.png")
    depth[0].save("./minimal_test_depth.png")
Strangely, the colors of the decoded result are broken. Maybe I've made a minor mistake somewhere?
Best,
Ryu
I have the same problem. How can I use the ldm3d-4c VAE to reconstruct the RGB and depth images?
Before encoding, make sure to normalize both the image and the depth to the range [-1, 1] by adding this to load_images:
rgb_img = 2. * rgb_img - 1.
depth_img = 2. * depth_img - 1.
Also, there is no need to divide the latents by the scaling factor during a plain reconstruction, since the latent space is not scaled at this stage. However, when running the diffusion part of the pipeline, make sure to scale the latents before the diffusion and unscale them afterwards to get the desired results.
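To make the scaling convention concrete, here is a rough sketch reusing the vae and test_rgbd names from the examples in this thread (the denoising loop itself is omitted):

# Plain encode/decode round trip: no scaling needed.
latents = vae.encode(test_rgbd).latent_dist.mode()
image = vae.decode(latents, return_dict=False)[0]

# Diffusion in latent space: scale before, unscale after.
latents = vae.encode(test_rgbd).latent_dist.sample() * vae.config.scaling_factor
# ... run the denoising loop on the scaled latents here ...
image = vae.decode(latents / vae.config.scaling_factor, return_dict=False)[0]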
@terryryu, attaching an updated example:
import torch
import cv2
import numpy as np
from einops import rearrange
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessorLDM3D

def load_images(rgb_path, depth_path):
    rgb_img = cv2.imread(rgb_path) / 255.
    rgb_img = 2. * rgb_img - 1.  # normalize RGB to [-1, 1]
    depth_img = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED)  # ensures 16-bit is preserved
    if depth_img.dtype != np.uint16:
        raise ValueError("Depth image is not 16-bit!")
    depth_img = depth_img / 65536.
    depth_img = 2. * depth_img - 1.  # normalize depth to [-1, 1]
    depth_img_expanded = np.expand_dims(depth_img, axis=-1)
    merged_img = np.concatenate([rgb_img, depth_img_expanded], axis=-1)  # 4-channel RGBD
    return merged_img

with torch.no_grad():
    test_rgbd = load_images("/home/terryryu/Experiments/LDM3D/lemons_ldm3d_rgb.jpg", "/home/terryryu/Experiments/LDM3D/lemons_ldm3d_depth.png")
    vae = AutoencoderKL.from_pretrained("/home/terryryu/Weights/LDM3D/vae/", local_files_only=True, torch_dtype=torch.float16).cuda()
    vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)
    processor = VaeImageProcessorLDM3D(vae_scale_factor=vae_scale_factor)
    test_rgbd = rearrange(test_rgbd, "h w c -> 1 c h w")
    test_rgbd = torch.cuda.HalfTensor(test_rgbd)
    latents = vae.encode(test_rgbd).latent_dist.mode()
    # No scaling factor for a plain reconstruction
    image = vae.decode(latents, return_dict=False)[0]

    output_type = "pil"
    do_denormalize = [True] * image.shape[0]
    rgb, depth = processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
    rgb[0].save("./minimal_test_rgb.png")
    depth[0].save("./minimal_test_depth.png")
Best,
Gabi