Training details
Hi,
I'm impressed by your amazing work.
Could you describe the training details (e.g., batch size, learning rate, scheduler, etc.) for the lcm-lora-sdxl model?
When I tried to train an lcm-lora-sdxl model with the official diffusers training script, the intermediate validation images were not as good as yours.
Thanks in advance.
Did you try the exact same training setup? Dataset, hyperparameters, etc?
Thank you for your quick response.
Yes, except for the training data.
I used a subset of the laion-aesthetic dataset (11K text-image pairs) provided by BK-SDM.
I have shared the validation images generated at 700 iterations.
These are the hyperparameters:
--train_data_dir=./data/laion_aes/preprocessed_11k
--pretrained_teacher_model=stabilityai/stable-diffusion-xl-base-1.0
--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix
--output_dir=./results/TOY_LCM_LORA_LAION/lcm_lora_sdxl_base_24x1x1_lr_1e-4
--tracker_project_name=TOY_LCM_LORA_LAION
--tracker_output_name=lcm_lora_sdxl_base_24x1x1_lr_1e-4
--mixed_precision=fp16
--resolution=1024
--train_batch_size=24
--gradient_accumulation_steps=1
--gradient_checkpointing
--use_8bit_adam
--lora_rank=64
--learning_rate=1e-4
--lr_scheduler=constant
--lr_warmup_steps=0
--max_train_steps=100000
--checkpointing_steps=2000
--validation_steps=20
--seed=0
--report_to=wandb
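For reference, this is roughly how I sanity-check the distilled LoRA outside of the wandb validation logs. It is only a sketch: the local path below just mirrors my --output_dir above, and the exact filename/layout depends on how the training script saves its LoRA weights, so adjust it to your run.

import torch
from diffusers import AutoPipelineForText2Image, LCMScheduler

# Load the SDXL teacher in fp16 and swap in the LCM scheduler.
pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

# Load the LoRA weights saved by the training run (illustrative path; point it at
# the folder that actually contains pytorch_lora_weights.safetensors for your run).
pipe.load_lora_weights("./results/TOY_LCM_LORA_LAION/lcm_lora_sdxl_base_24x1x1_lr_1e-4")
pipe.fuse_lora()

image = pipe(prompt="a photo of an astronaut riding a horse", num_inference_steps=4, guidance_scale=1).images[0]
image.save("lcm_lora_validation.png")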
I have another question.
Could you let me know what data was used to train the lcm-lora-ssd-1b and lcm-lora-sdxl models?
When I generated some samples, the result of lcm-lora-ssd-1b showed better quality than that of lcm-lora-sdxl.
I wonder if this difference in generation quality is caused by differences in the data used for training.
For the sake of the community, it would be very helpful if you could share the training details of the lcm-lora-sdxl and lcm-lora-ssd-1b models.
In my case, I'm trying to create an lcm-lora version of the koala model, which is a lightweight T2I model like ssd-1b.
Thanks in advance.
For lcm-lora-ssd-1b:
import torch
from diffusers import AutoPipelineForText2Image, LCMScheduler

model_id = "segmind/SSD-1B"
adapter_id = "latent-consistency/lcm-lora-ssd-1b"

# load the SSD-1B base pipeline in fp16 and switch to the LCM scheduler
pipe = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

# load and fuse the LCM LoRA
pipe.load_lora_weights(adapter_id)
pipe.fuse_lora()

prompt = "Portrait photo of a standing girl, photograph, golden hair, depth of field, moody light, golden hour, centered, extremely detailed, award winning photography, realistic."
image = pipe(prompt=prompt, num_inference_steps=4, guidance_scale=1).images[0]
For lcm-lora-sdxl:
# note: no torch_dtype is passed here, so this pipeline loads in float32 by default
pipe2 = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
pipe2.scheduler = LCMScheduler.from_config(pipe2.scheduler.config)
pipe2.to("cuda:4")

# load and fuse the LCM LoRA
pipe2.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe2.fuse_lora()

prompt = "Portrait photo of a standing girl, photograph, golden hair, depth of field, moody light, golden hour, centered, extremely detailed, award winning photography, realistic."
image2 = pipe2(prompt=prompt, num_inference_steps=4, guidance_scale=1).images[0]
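Side note: to make this a like-for-like comparison, the same seed can be passed to both pipelines through a generator. A minimal sketch (the seed value 0 is arbitrary):

import torch

# Reuse the same fixed seed for both pipelines so that differences come from
# the adapters rather than from the sampled initial noise.
image = pipe(prompt=prompt, num_inference_steps=4, guidance_scale=1, generator=torch.Generator().manual_seed(0)).images[0]
image2 = pipe2(prompt=prompt, num_inference_steps=4, guidance_scale=1, generator=torch.Generator().manual_seed(0)).images[0]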
How many inference steps were used for the example images you shared?