Distill CLOOB-conditioned Latent Diffusion trained on WikiArt

Model description

This is a smaller version of the teacher model huggan/ccld_wa, a CLOOB-conditioned latent diffusion model fine-tuned on the WikiArt dataset. Knowledge distillation reduces the latent diffusion model from 1.2B parameters to 105M parameters.
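As a rough illustration of what this reduction means in practice, the sketch below computes the compression ratio and an approximate size for the raw weights. The 4 bytes per parameter (fp32) is an assumption, since the card does not state the checkpoint precision, and actual checkpoint files may also hold optimizer state or other data.

```python
# Parameter counts taken from the model description above.
TEACHER_PARAMS = 1.2e9
STUDENT_PARAMS = 105e6

# Assumption: weights stored as 32-bit floats (4 bytes each).
BYTES_PER_PARAM = 4


def weights_size_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


compression_ratio = TEACHER_PARAMS / STUDENT_PARAMS

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"teacher weights:   {weights_size_gb(TEACHER_PARAMS):.2f} GB")
print(f"student weights:   {weights_size_gb(STUDENT_PARAMS):.2f} GB")
```

Under these assumptions the student is roughly eleven times smaller than the teacher.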

CLOOB is a model that encodes images and text in a unified latent space, similar to OpenAI's CLIP. The latent diffusion model takes a CLOOB-encoded latent vector as its condition; this vector can come from a text prompt or from an image.
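The idea of a shared latent space can be sketched without the real models. In the toy example below, the embedding values and the `make_condition` helper are hypothetical placeholders, not part of CLOOB or of this repository. It only shows that a text embedding and an image embedding have the same shape, so either can be used as the conditioning vector.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def make_condition(text_embedding=None, image_embedding=None):
    """Hypothetical helper: return a unit-length conditioning vector
    from exactly one of the two embedding sources."""
    if (text_embedding is None) == (image_embedding is None):
        raise ValueError("provide exactly one of text_embedding or image_embedding")
    source = text_embedding if text_embedding is not None else image_embedding
    return normalize(source)


# Placeholder embeddings standing in for real encoder outputs.
text_emb = [0.9, 0.1, 0.4, 0.0]   # e.g. an encoded text prompt
image_emb = [0.8, 0.2, 0.5, 0.1]  # e.g. an encoded painting

cond_from_text = make_condition(text_embedding=text_emb)
cond_from_image = make_condition(image_embedding=image_emb)

# Both conditions live in the same space, so they can be compared directly.
print(len(cond_from_text) == len(cond_from_image))
print(cosine_similarity(cond_from_text, cond_from_image))
```

In the real pipeline the embeddings come from the pretrained CLOOB text and image encoders; see the Colab notebook and demo source mentioned in this card for the actual calls.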

Intended uses & limitations

The latent diffusion model is the only difference from the teacher model. Neither the autoencoder nor the CLOOB model was changed, so they are not provided in this repository.

model_student.ckpt: The latent diffusion model checkpoint

How to use

You need dependencies from several repositories, all linked from the CLOOB latent diffusion repository:

CLIP
CLOOB: the model that encodes images and text in a unified latent space, used for conditioning the latent diffusion.
Latent Diffusion: the latent diffusion model definition.
Taming Transformers: a pretrained convolutional VQGAN is used as an autoencoder to go from image space to the latent space in which the diffusion is done.
v-diffusion: contains functions for sampling from a diffusion model with text and/or image prompts.

Example code for sampling images from a text prompt is available in a Colab notebook, and in the app source code of the Gradio demo on this Space.

Limitations and bias

The student latent diffusion model was trained only on images from the WikiArt dataset, but the VQGAN autoencoder, the CLOOB model and the teacher latent diffusion model all come from pretrained checkpoints and were trained on images and texts from the internet.

According to the Latent Diffusion paper: “Deep learning modules tend to reproduce or exacerbate biases that are already present in the data”.

Training data

This model was trained on the WikiArt dataset only. Only the images were used during training, without text prompts, so the style, genre, and artist information was not used.

Training procedure

This latent diffusion model was trained with a knowledge distillation process, with huggan/ccld_wa as the teacher model. Training of the teacher model largely followed the guidelines in JD-P's GitHub repo. The teacher was fine-tuned on the WikiArt dataset for ~12 hours on 2 A6000 GPUs kindly provided by Paperspace. The knowledge distillation was done on the WikiArt dataset as well. Training the student model took 17 hours on 1 A6000 GPU provided by Paperspace. A wandb report is available for this training run.
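The card does not specify the distillation loss, so the following is only a minimal, generic sketch of output-matching knowledge distillation, not the training code used here. A frozen "teacher" function stands in for the large model, and a small "student" is fitted by gradient descent to reproduce the teacher's outputs with a mean squared error loss. The linear teacher, the function names, and the hyperparameters are all illustrative assumptions.

```python
import random


def teacher(x):
    # Stand-in for the frozen teacher network (hypothetical toy function;
    # the real teacher is a 1.2B-parameter latent diffusion model).
    return 2.0 * x + 1.0


def distill(steps=2000, lr=0.05, batch_size=16, seed=0):
    """Fit a linear student w * x + b to the teacher's outputs with MSE."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # student parameters, trained from scratch
    for _ in range(steps):
        # Sample a batch of inputs; no labels are needed, only teacher outputs.
        xs = [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]
        grad_w = 0.0
        grad_b = 0.0
        for x in xs:
            err = (w * x + b) - teacher(x)  # student output minus teacher target
            grad_w += 2.0 * err * x / batch_size
            grad_b += 2.0 * err / batch_size
        # Plain gradient descent step on the distillation loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


student_w, student_b = distill()
print(student_w, student_b)
```

The same pattern applies at scale: the teacher's weights stay fixed, and only the student's parameters are updated to match the teacher's predictions on the training inputs.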

Links

Model card for the teacher model on Hugging Face, trained by Jonathan Whitaker. He described the model and training procedure in a blog post.
Model card for the student model on Hugging Face, trained by me. You can check my WandB report. This version has 105M parameters, versus 1.2B parameters for the teacher version. It is lighter and allows faster inference, while retaining some of the original model's ability to generate paintings from prompts.
Gradio demo app on Hugging Face Spaces to try out the model online
IPython notebook to use the model in Python
WikiArt dataset on datasets hub
GitHub repository
