|
--- |
|
license: apache-2.0 |
|
tags: |
|
- yahoo-open-source-software-incubator |
|
--- |
|
# Salient Object-Aware Background Generation using Text-Guided Diffusion Models [![Paper](assets/arxiv.svg)](https://arxiv.org/pdf/2404.10157.pdf) |
|
This repository accompanies our paper, [Salient Object-Aware Background Generation using Text-Guided Diffusion Models](https://arxiv.org/abs/2404.10157), which was accepted to the [CVPR 2024 Generative Models for Computer Vision](https://generative-vision.github.io/workshop-CVPR-24/) workshop.
|
|
|
The paper addresses an issue we call "object expansion" that arises when generating backgrounds for salient objects with inpainting diffusion models. We show that models such as [Stable Inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) can arbitrarily expand or distort the salient object, which is undesirable in applications where the object's identity must be preserved, such as e-commerce ads. Examples of object expansion are shown below:
|
|
|
<div align="center"> |
|
<img src="assets/fig.jpg"> |
|
</div> |
|
|
|
## Inference
|
|
|
### Load pipeline |
|
```py
import torch
from diffusers import (
    AutoencoderKL,
    ControlNetModel,
    StableDiffusionControlNetInpaintPipeline,
    UNet2DConditionModel,
    UniPCMultistepScheduler,
)
from transformers import AutoTokenizer, CLIPTextModel

# Load our pretrained ControlNet
controlnet = ControlNetModel.from_pretrained('yahoo-inc/photo-background-generation')

# Load the components of Stable Inpainting 2.0
sd_inpainting_model_name = "stabilityai/stable-diffusion-2-inpainting"
tokenizer = AutoTokenizer.from_pretrained(
    sd_inpainting_model_name,
    subfolder="tokenizer",
    use_fast=False,
)
text_encoder = CLIPTextModel.from_pretrained(sd_inpainting_model_name, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(sd_inpainting_model_name, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(sd_inpainting_model_name, subfolder="unet")

# Assemble the Stable Diffusion inpainting pipeline with our ControlNet
pipeline = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    sd_inpainting_model_name,
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet,
    controlnet=controlnet,
    safety_checker=None,
    torch_dtype=torch.float32,
)
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config)
pipeline = pipeline.to('cuda')
pipeline.set_progress_bar_config(disable=True)
```
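
On GPUs with limited memory, the pipeline can optionally be loaded in half precision with CPU offloading instead. This variant is not part of the original recipe; it is a minimal sketch that assumes the `accelerate` package is installed. Note that `enable_model_cpu_offload()` replaces the `pipeline.to('cuda')` call above.

```py
# Optional low-memory variant (assumes `accelerate` is installed)
controlnet_fp16 = ControlNetModel.from_pretrained(
    'yahoo-inc/photo-background-generation', torch_dtype=torch.float16
)
pipeline = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    sd_inpainting_model_name,
    controlnet=controlnet_fp16,
    safety_checker=None,
    torch_dtype=torch.float16,  # half precision roughly halves GPU memory
)
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()  # submodules move to the GPU only while in use
```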
|
|
|
### Load an image and extract its foreground mask
|
```py
from io import BytesIO

import requests
from PIL import Image, ImageOps
from transparent_background import Remover


def resize_with_padding(img, expected_size):
    # Shrink the image to fit within expected_size, then pad it to the exact size
    img.thumbnail((expected_size[0], expected_size[1]))
    delta_width = expected_size[0] - img.size[0]
    delta_height = expected_size[1] - img.size[1]
    pad_width = delta_width // 2
    pad_height = delta_height // 2
    padding = (pad_width, pad_height, delta_width - pad_width, delta_height - pad_height)
    return ImageOps.expand(img, padding)


image_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/16/Granja_comary_Cisne_-_Escalavrado_e_Dedo_De_Deus_ao_fundo_-Teres%C3%B3polis.jpg/2560px-Granja_comary_Cisne_-_Escalavrado_e_Dedo_De_Deus_ao_fundo_-Teres%C3%B3polis.jpg'
response = requests.get(image_url)
img = Image.open(BytesIO(response.content))
img = resize_with_padding(img, (512, 512))

# Load the salient object detection model
remover = Remover(mode='base')  # 'base' is the standard checkpoint

# Get the foreground saliency map (white = salient object)
fg_mask = remover.process(img, type='map')
```
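
Foreground extraction relies on the [`transparent-background`](https://github.com/plemeri/transparent-background) package (installable with `pip install transparent-background`), which wraps the InSPyReNet salient object detector. Before generating, it can be useful to sanity-check the mask visually:

```py
# Optional: save the padded input and its saliency map for visual inspection
img.save('input_padded.png')
fg_mask.save('foreground_mask.png')
```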
|
|
|
### Background generation |
|
```py
seed = 13
generator = torch.Generator(device='cuda').manual_seed(seed)

# Inpainting masks are white where new content should be generated,
# so invert the foreground map to mask out the background
mask = ImageOps.invert(fg_mask)

prompt = 'A dark swan in a bedroom'
cond_scale = 1.0
with torch.autocast("cuda"):
    controlnet_image = pipeline(
        prompt=prompt,
        image=img,
        mask_image=mask,
        control_image=mask,
        num_images_per_prompt=1,
        generator=generator,
        num_inference_steps=20,
        guess_mode=False,
        controlnet_conditioning_scale=cond_scale,
    ).images[0]
controlnet_image  # displays the generated image in a notebook
```
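
Even with the ControlNet constraint, inpainting can slightly alter pixels near the object boundary. One optional post-processing step (not part of the pipeline above) is to composite the original foreground back over the generated background, using the soft saliency map as the alpha channel:

```py
# Optional: paste the original salient object back over the generated background,
# using the soft foreground map as the alpha channel
final_image = Image.composite(img, controlnet_image, fg_mask.convert('L'))
final_image.save('swan_in_bedroom.png')
```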
|
## Citations |
|
|
|
If you find our work useful, please consider citing our paper:
|
|
|
```bibtex |
|
@misc{eshratifar2024salient, |
|
title={Salient Object-Aware Background Generation using Text-Guided Diffusion Models}, |
|
author={Amir Erfan Eshratifar and Joao V. B. Soares and Kapil Thadani and Shaunak Mishra and Mikhail Kuznetsov and Yueh-Ning Ku and Paloma de Juan}, |
|
year={2024}, |
|
eprint={2404.10157}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV} |
|
} |
|
``` |
|
|
|
## Maintainers |
|
|
|
- Erfan Eshratifar: [email protected] |
|
- Joao Soares: [email protected] |
|
|
|
## License |
|
|
|
This project is licensed under the terms of the [Apache 2.0](LICENSE) open source license. Please refer to [LICENSE](LICENSE) for the full terms. |