# [CVPR2024] StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

This repository is the official implementation of [StableVITON](https://arxiv.org/abs/2312.01725).

> **StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On**
> [Jeongho Kim](https://scholar.google.co.kr/citations?user=ucoiLHQAAAAJ&hl=ko), [Gyojung Gu](https://www.linkedin.com/in/gyojung-gu-29033118b/), [Minho Park](https://pmh9960.github.io/), [Sunghyun Park](https://psh01087.github.io/), [Jaegul Choo](https://sites.google.com/site/jaegulchoo/)

[[Arxiv Paper](https://arxiv.org/abs/2312.01725)] [[Website Page](https://rlawjdghek.github.io/StableVITON/)]

![teaser](assets/teaser.png)

## TODO List
- [x] ~~Inference code~~
- [x] ~~Release model weights~~
- [x] ~~Training code~~

## Environments
```bash
git clone https://github.com/rlawjdghek/StableVITON
cd StableVITON

conda create --name StableVITON python=3.10 -y
conda activate StableVITON

# install packages
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install pytorch-lightning==1.5.0
pip install einops
pip install opencv-python==4.7.0.72
pip install matplotlib
pip install omegaconf
pip install albumentations
pip install transformers==4.33.2
pip install xformers==0.0.19
pip install triton==2.0.0
pip install open-clip-torch==2.19.0
pip install diffusers==0.20.2
pip install scipy==1.10.1
conda install -c anaconda ipython -y
```
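As a quick sanity check after installation (our suggestion, not part of the repository), you can confirm that the pinned CUDA wheels and xformers import cleanly:

```python
# Illustrative environment check -- not part of the official codebase.
import torch
import xformers
import pytorch_lightning as pl

print("torch:", torch.__version__)            # expected: 2.0.0+cu117
print("CUDA available:", torch.cuda.is_available())
print("xformers:", xformers.__version__)      # expected: 0.0.19
print("pytorch-lightning:", pl.__version__)   # expected: 1.5.0
```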
## Weights and Data
Our [checkpoint](https://kaistackr-my.sharepoint.com/:f:/g/personal/rlawjdghek_kaist_ac_kr/EjzAZHJu9MlEoKIxG4tqPr0BM_Ry20NHyNw5Sic2vItxiA?e=5mGa1c) trained on VITON-HD has been released!
You can download the VITON-HD dataset from [here](https://github.com/shadow2496/VITON-HD).

For both training and inference, the following dataset structure is required:

```
train
|-- image
|-- image-densepose
|-- agnostic
|-- agnostic-mask
|-- cloth
|-- cloth_mask
|-- gt_cloth_warped_mask (for ATV loss)

test
|-- image
|-- image-densepose
|-- agnostic
|-- agnostic-mask
|-- cloth
|-- cloth_mask
```

## Preprocessing
The VITON-HD dataset serves as a benchmark and provides an agnostic mask. However, you can attempt virtual try-on on **arbitrary images** using segmentation tools like [SAM](https://github.com/facebookresearch/segment-anything); a rough sketch is given below. Please note that for DensePose, you should use the same DensePose model as used in VITON-HD.
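For illustration only, the snippet below shows one way a garment mask could be obtained with SAM from a single point prompt. It is not part of this repository, and the image path, output path, and click coordinates are placeholders:

```python
# Hypothetical sketch: derive a mask for an arbitrary image with SAM.
# Assumes `pip install segment-anything` and the released ViT-H checkpoint.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("person.jpg"), cv2.COLOR_BGR2RGB)  # SAM expects RGB
predictor.set_image(image)

# One positive click on the garment region (coordinates are illustrative).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[384, 300]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best = masks[np.argmax(scores)]  # (H, W) boolean mask with the highest score
cv2.imwrite("mask.png", best.astype(np.uint8) * 255)
```

The resulting mask would still need to be converted into the agnostic / agnostic-mask format listed above before StableVITON can consume it.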
## Inference
```bash
#### paired
CUDA_VISIBLE_DEVICES=4 python inference.py \
 --config_path ./configs/VITONHD.yaml \
 --batch_size 4 \
 --model_load_path <model weight path> \
 --save_dir <save directory>

#### unpaired
CUDA_VISIBLE_DEVICES=4 python inference.py \
 --config_path ./configs/VITONHD.yaml \
 --batch_size 4 \
 --model_load_path <model weight path> \
 --unpair \
 --save_dir <save directory>

#### paired repaint
CUDA_VISIBLE_DEVICES=4 python inference.py \
 --config_path ./configs/VITONHD.yaml \
 --batch_size 4 \
 --model_load_path <model weight path> \
 --repaint \
 --save_dir <save directory>

#### unpaired repaint
CUDA_VISIBLE_DEVICES=4 python inference.py \
 --config_path ./configs/VITONHD.yaml \
 --batch_size 4 \
 --model_load_path <model weight path> \
 --unpair \
 --repaint \
 --save_dir <save directory>
```

The `--repaint` option preserves the unmasked region of the input image in the output.

## Training
For VITON-HD training, we increased the first block of the U-Net from 9 to 13 channels (adding zero-initialized convolutions) based on the Paint-by-Example (PBE) model. Therefore, you should first download the modified checkpoint (named 'VITONHD_PBE_pose.ckpt') from the [Link](https://kaistackr-my.sharepoint.com/:f:/g/personal/rlawjdghek_kaist_ac_kr/EjzAZHJu9MlEoKIxG4tqPr0BM_Ry20NHyNw5Sic2vItxiA?e=5mGa1c) and place it in the './ckpts/' folder. Additionally, for more refined person texture, we utilized a VAE fine-tuned on the VITON-HD dataset. You should also download that checkpoint (named 'VITONHD_VAE_finetuning.ckpt') from the [Link](https://kaistackr-my.sharepoint.com/:f:/g/personal/rlawjdghek_kaist_ac_kr/EjzAZHJu9MlEoKIxG4tqPr0BM_Ry20NHyNw5Sic2vItxiA?e=5mGa1c) and place it in the './ckpts/' folder.

```bash
### Base model training
CUDA_VISIBLE_DEVICES=3,4 python train.py \
 --config_name VITONHD \
 --transform_size shiftscale3 hflip \
 --transform_color hsv bright_contrast \
 --save_name Base_test

### ATV loss finetuning
CUDA_VISIBLE_DEVICES=5,6 python train.py \
 --config_name VITONHD \
 --transform_size shiftscale3 hflip \
 --transform_color hsv bright_contrast \
 --use_atv_loss \
 --resume_path <base model path> \
 --save_name ATVloss_test
```

## Citation
If you find our work useful for your research, please cite us:
```
@article{kim2023stableviton,
  title={StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On},
  author={Kim, Jeongho and Gu, Gyojung and Park, Minho and Park, Sunghyun and Choo, Jaegul},
  journal={arXiv preprint arXiv:2312.01725},
  year={2023}
}
```

**Acknowledgements** Sunghyun Park is the corresponding author.

## License
Licensed under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode) license.