Question about the text encoder setting
Hi,
I think there may be a problem in the way the text encoder is set up, and I'm not sure why this occurs...
In particular, the number of hidden layers in the text encoder is set to 23 (https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/text_encoder/config.json#L19). However, in the official OpenCLIP ViT-H-14 config, the number of hidden layers is 24 (https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/ViT-H-14.json#L15). This can also be confirmed from the number of layers in the LAION CLIP ViT-H-14 repo: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/blob/main/config.json#L54
Does anyone know why the Hugging Face repo sets the number of hidden layers to 23? Is this a bug, or a small trick to improve sampling performance?
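For reference, here is a minimal sketch of how the layer count shows up in a `transformers` model. It builds a toy-sized `CLIPTextModel` locally instead of downloading the SD 2.1 weights (the small `hidden_size` and other dims are arbitrary; only `num_hidden_layers=23`, which mirrors the value in the SD 2.1 `text_encoder/config.json`, matters here):

```python
from transformers import CLIPTextConfig, CLIPTextModel

# Toy dims for speed; only the layer count is relevant.
# num_hidden_layers=23 mirrors stable-diffusion-2-1/text_encoder/config.json.
cfg = CLIPTextConfig(
    num_hidden_layers=23,
    hidden_size=64,
    num_attention_heads=4,
    intermediate_size=128,
)
model = CLIPTextModel(cfg)

# The encoder really is built with 23 transformer layers.
print(len(model.text_model.encoder.layers))  # 23
```

Loading the real config with `CLIPTextConfig.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="text_encoder")` should show the same `num_hidden_layers=23`.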
Thanks
Could this possibly be because the final projection layer is removed from (or simply not used in) SD? SD takes the 77x1024 per-token text embeddings as input, not the final pooled CLIP projection of dim 1024.
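To illustrate that point: in `transformers`, SD-style pipelines consume `last_hidden_state` (shape `batch x 77 x hidden`), not the pooled/projected sentence vector. A tiny stand-in model (hypothetical dims; the real SD 2.1 encoder uses `hidden_size=1024`) shows the shape being discussed:

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel

# Tiny stand-in encoder; the real one has hidden_size=1024 and 23 layers.
cfg = CLIPTextConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
    max_position_embeddings=77,
)
model = CLIPTextModel(cfg).eval()

ids = torch.randint(0, 1000, (1, 77))  # dummy token ids, full 77-token context
with torch.no_grad():
    out = model(ids)

# SD conditions the UNet on the per-token hidden states, one vector per token,
# not on a single pooled embedding.
print(out.last_hidden_state.shape)  # torch.Size([1, 77, 64])
```

So any weights that only feed the pooled CLIP output (like the final text projection) would indeed be dead weight for SD, though that alone doesn't explain a whole missing transformer layer.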