Apology and Notification

#2
by John6666 - opened

You may already have noticed and fixed this yourself, but I introduced an extremely simple but fatal bug. My apologies. 😔

  • Wrong code (this breaks CLIP)
k.replace("vae.", "").replace("model.diffusion_model.", "")\
        .replace("text_encoders.clip_l.transformer.text_model.", "")\
        .replace("text_encoders.t5xxl.transformer.", "")
  • Correct code
k.replace("vae.", "").replace("model.diffusion_model.", "")\
        .replace("text_encoders.clip_l.transformer.", "")\
        .replace("text_encoders.t5xxl.transformer.", "")
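To illustrate the difference between the two versions, here is a minimal sketch. The sample key is only an assumed example of a CLIP-L key in a single-file checkpoint, and the helper names `strip_wrong` and `strip_correct` are hypothetical. The point is that the wrong version also strips the `text_model.` prefix, which the CLIP text encoder's own state dict keys normally keep.

```python
def strip_wrong(k: str) -> str:
    # Buggy variant: also removes the "text_model." prefix from CLIP keys.
    return (k.replace("vae.", "").replace("model.diffusion_model.", "")
            .replace("text_encoders.clip_l.transformer.text_model.", "")
            .replace("text_encoders.t5xxl.transformer.", ""))


def strip_correct(k: str) -> str:
    # Fixed variant: removes only the wrapper prefix and keeps "text_model.".
    return (k.replace("vae.", "").replace("model.diffusion_model.", "")
            .replace("text_encoders.clip_l.transformer.", "")
            .replace("text_encoders.t5xxl.transformer.", ""))


# Assumed example key, for illustration only.
key = "text_encoders.clip_l.transformer.text_model.embeddings.token_embedding.weight"

wrong = strip_wrong(key)
correct = strip_correct(key)
print(wrong)    # embeddings.token_embedding.weight
print(correct)  # text_model.embeddings.token_embedding.weight
```

With the wrong version, the renamed keys presumably no longer match what the CLIP text model expects, so the weights would fail to load into it.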

Also, the FLUX.1 model is too large to run in my local environment, so these results come only from testing on HF's free CPU Space, but I noticed some behavior that made me curious.

On some models, huggingface_hub.save_torch_state_dict freezes without reporting an error or raising an exception.
It also definitely happens only when saving the transformer (unet).
I traced it with print(f""), a paleolithic method, and RAM, CPU, or disk usage is unlikely to be the cause; I confirmed that everything works fine up to huggingface_hub.split_torch_state_dict_into_shards.
So it is probably failing in the internal safetensors.torch.save_model part.
I hope it's just insufficient specs and that it's getting stuck in some weird place...
If you have problems with save_pretrained, suspect this.
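As a possible alternative to print tracing for a silent freeze like this, Python's standard `faulthandler` module can dump the tracebacks of all threads if a call runs longer than a timeout. The following is only a sketch: `run_with_hang_watchdog` and `fake_save` are hypothetical names, and `fake_save` stands in for the real save call (for example huggingface_hub.save_torch_state_dict), which is not invoked here.

```python
import faulthandler
import os
import tempfile
import time


def run_with_hang_watchdog(fn, timeout_s, trace_path, *args, **kwargs):
    """Call fn; while it runs, dump all thread tracebacks to trace_path
    every timeout_s seconds so a silent hang can be located afterwards."""
    with open(trace_path, "w") as trace_file:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=trace_file)
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
        finally:
            # Always cancel the watchdog before the trace file is closed.
            faulthandler.cancel_dump_traceback_later()
    return result, time.monotonic() - start


def fake_save(state_dict, save_directory):
    # Hypothetical stand-in for the real (possibly hanging) save call.
    time.sleep(0.05)
    return len(state_dict)


trace_path = os.path.join(tempfile.gettempdir(), "save_hang_trace.log")
result, elapsed = run_with_hang_watchdog(
    fake_save, 60, trace_path, {"a": 1, "b": 2}, "out_dir"
)
print(result, f"{elapsed:.2f}s")
```

If the wrapped call really hangs, the trace file should show which frame it is stuck in, which may help confirm whether the stall is inside the safetensors write or somewhere else.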

Specifically, I have seen this occur when saving with torch.float8_e4m3fn on the following model.
https://huggingface.co/datasets/John6666/flux1-backup-202408/blob/main/theAraminta_flux1A1.safetensors

Thanks for the discussion

Sorry for reporting this in an unrelated repo (or maybe not?); it was an emergency.
Thank you for your constant development. 🤗

P.S.

Regarding the above problem, I ran it experimentally in a Zero GPU Space (without any code changes and without using the GPU directly) and the problem did not occur.
I am relieved to know that it is not a bug in the library (and/or my code) but a lack of VM specs.
Although I did not see any difference in RAM (including page file) usage, it may be due to a difference in the VM's underlying performance apart from the GPU, or some of the load may be implicitly offloaded to the GPU or VRAM.
Anyway, sorry for the trouble.
