awoo
Signed-off-by: Balazs Horvath <[email protected]>
README.md
CHANGED
@@ -63,14 +63,13 @@ The Yiff Toolkit is a comprehensive set of tools designed to enhance your creati
- [`--network_dropout`](#--network_dropout)
- [`--lr_scheduler`](#--lr_scheduler)
- [`--lr_scheduler_num_cycles`](#--lr_scheduler_num_cycles)
- - [`--learning_rate`](#--learning_rate)
- - [`--unet_lr`](#--unet_lr)
- - [`--text_encoder_lr`](#--text_encoder_lr)
- [`--network_dim`](#--network_dim)
- [`--output_name`](#--output_name)
- [`--scale_weight_norms`](#--scale_weight_norms)
- [`--no_half_vae`](#--no_half_vae)
- - [`--save_every_n_epochs`](#--save_every_n_epochs)
- [`--mixed_precision`](#--mixed_precision)
- [`--save_precision`](#--save_precision)
- [`--caption_extension`](#--caption_extension)
@@ -80,6 +79,7 @@ The Yiff Toolkit is a comprehensive set of tools designed to enhance your creati
- [`--max_train_steps`](#--max_train_steps)
- [`--shuffle_caption`](#--shuffle_caption)
- [`--sdpa` or `--xformers` or `--mem_eff_attn`](#--sdpa-or---xformers-or---mem_eff_attn)
- [`--sample_prompts` and `--sample_sampler` and `--sample_every_n_steps`](#--sample_prompts-and---sample_sampler-and---sample_every_n_steps)
- [CosXL Training](#cosxl-training)
- [Embeddings for 1.5 and SDXL](#embeddings-for-15-and-sdxl)
@@ -485,6 +485,8 @@ If you are training with multiple GPUs, ensure that the total number of prompts
<details>
<summary>Click to reveal training commands.</summary>

##### `accelerate launch`

For two GPUs:
@@ -499,6 +501,8 @@ Single GPU:
accelerate launch --num_processes=1 --num_machines=1 --gpu_ids=0 --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
```

##### `--lowram`

If you are running out of system memory, like I do with 2 GPUs and a really fat model that gets loaded into memory once per GPU, this option will help you save a bit of it and might get you out of OOM hell.
@@ -660,6 +664,8 @@ conv_block_alphas = [conv_alpha] * num_total_blocks

###### `module_dropout` and `dropout` and `rank_dropout`

`rank_dropout` is a form of dropout, which is a regularization technique used in neural networks to prevent overfitting and improve generalization. However, unlike traditional dropout, which randomly sets a proportion of inputs to zero, `rank_dropout` operates on the rank of the input tensor `lx`. First, a binary mask is created with the same rank as `lx`, with each element set to `True` with probability `1 - rank_dropout` and `False` otherwise. Then the `mask` is applied to `lx` to randomly set some of its elements to zero. After applying the dropout, a scaling factor is applied to `lx` to compensate for the dropped elements, ensuring that the expected sum of `lx` remains the same before and after dropout. The scaling factor is `1.0 / (1.0 - self.rank_dropout)`.

It’s called “rank” dropout because it operates on the rank of the input tensor, rather than its individual elements. This can be particularly useful in tasks where the rank of the input is important.
@@ -699,6 +705,8 @@ def forward(self, x):
return org_forwarded + lx * self.multiplier * scale
```

---

###### `use_tucker`
@@ -771,9 +779,13 @@ Specifies the alpha of each block, this too also takes 25 numbers if you don't s

---

##### `--network_dropout`

- Using `weight_decompose=True` will ignore `network_dropout` and only rank and module dropout will be applied.

```python
--network_dropout=0 \
@@ -783,7 +795,11 @@ Using `weight_decompose=True` will ignore `network_dropout` and only rank and mo

##### `--lr_scheduler`

-

```python
--lr_scheduler="cosine" \
@@ -793,7 +809,7 @@ Using `weight_decompose=True` will ignore `network_dropout` and only rank and mo

##### `--lr_scheduler_num_cycles`

-

```py
--lr_scheduler_num_cycles=1 \
@@ -801,31 +817,15 @@ Using `weight_decompose=True` will ignore `network_dropout` and only rank and mo

---

- ##### `--learning_rate`
-
- ⚠️
-
- ```py
- --learning_rate=0.0001 \
- ```
-
- ---

- ##### `--unet_lr`

- ⚠️

```py
--unet_lr=0.0001 \
- ```
-
- ---
-
- ##### `--text_encoder_lr`
-
- ⚠️
-
- ```py
--text_encoder_lr=0.0001 \
```

@@ -855,14 +855,22 @@ Specify the output name excluding the file extension.

##### `--scale_weight_norms`

-

-

-

```py
- --
```

---
@@ -877,12 +885,15 @@ Disables mixed precision for the SDXL VAE and sets it to `float32`. Very useful

---

- ##### `--save_every_n_epochs`

-

```py
- --save_every_n_epochs=
```

---
@@ -951,7 +962,7 @@ Repeats the dataset when training with captions, by default it is set to `1` so
Specify the number of steps or epochs to train. If both `--max_train_steps` and `--max_train_epochs` are specified, the number of epochs takes precedence.

```py
- --max_train_steps=
```

---
@@ -974,6 +985,17 @@ The choice between `--xformers` or `--mem_eff_attn` and `--spda` will depend on

---

##### `--sample_prompts` and `--sample_sampler` and `--sample_every_n_steps`

You have the option of generating images during training so you can check the progress; the argument lets you pick between different samplers. By default it is `ddim`, so you better change it!
@@ -996,6 +1018,12 @@ ddim, pndm, lms, euler, euler_a, heun, dpm_2, dpm_2_a, dpmsolver, dpmsolver++, d

---

</details>
</div>

- [`--network_dropout`](#--network_dropout)
- [`--lr_scheduler`](#--lr_scheduler)
- [`--lr_scheduler_num_cycles`](#--lr_scheduler_num_cycles)
+ - [`--learning_rate` and `--unet_lr` and `--text_encoder_lr`](#--learning_rate-and---unet_lr-and---text_encoder_lr)
- [`--network_dim`](#--network_dim)
- [`--output_name`](#--output_name)
- [`--scale_weight_norms`](#--scale_weight_norms)
+ - [`--max_grad_norm`](#--max_grad_norm)
- [`--no_half_vae`](#--no_half_vae)
+ - [`--save_every_n_epochs` and `--save_last_n_epochs` or `--save_every_n_steps` and `--save_last_n_steps`](#--save_every_n_epochs-and---save_last_n_epochs-or---save_every_n_steps-and---save_last_n_steps)
- [`--mixed_precision`](#--mixed_precision)
- [`--save_precision`](#--save_precision)
- [`--caption_extension`](#--caption_extension)
- [`--max_train_steps`](#--max_train_steps)
- [`--shuffle_caption`](#--shuffle_caption)
- [`--sdpa` or `--xformers` or `--mem_eff_attn`](#--sdpa-or---xformers-or---mem_eff_attn)
+ - [`--multires_noise_iterations` and `--multires_noise_discount`](#--multires_noise_iterations-and---multires_noise_discount)
- [`--sample_prompts` and `--sample_sampler` and `--sample_every_n_steps`](#--sample_prompts-and---sample_sampler-and---sample_every_n_steps)
- [CosXL Training](#cosxl-training)
- [Embeddings for 1.5 and SDXL](#embeddings-for-15-and-sdxl)
<details>
<summary>Click to reveal training commands.</summary>

+ ---
+
##### `accelerate launch`

For two GPUs:
accelerate launch --num_processes=1 --num_machines=1 --gpu_ids=0 --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
```

+ ---
+
##### `--lowram`

If you are running out of system memory, like I do with 2 GPUs and a really fat model that gets loaded into memory once per GPU, this option will help you save a bit of it and might get you out of OOM hell.

###### `module_dropout` and `dropout` and `rank_dropout`

+ [![An AI generated image.](https://huggingface.co/k4d3/yiff_toolkit/resolve/main/static/tutorial/dropout1.png)](https://huggingface.co/k4d3/yiff_toolkit/resolve/main/static/tutorial/dropout1.png)
+
`rank_dropout` is a form of dropout, which is a regularization technique used in neural networks to prevent overfitting and improve generalization. However, unlike traditional dropout, which randomly sets a proportion of inputs to zero, `rank_dropout` operates on the rank of the input tensor `lx`. First, a binary mask is created with the same rank as `lx`, with each element set to `True` with probability `1 - rank_dropout` and `False` otherwise. Then the `mask` is applied to `lx` to randomly set some of its elements to zero. After applying the dropout, a scaling factor is applied to `lx` to compensate for the dropped elements, ensuring that the expected sum of `lx` remains the same before and after dropout. The scaling factor is `1.0 / (1.0 - self.rank_dropout)`.

It’s called “rank” dropout because it operates on the rank of the input tensor, rather than its individual elements. This can be particularly useful in tasks where the rank of the input is important.
return org_forwarded + lx * self.multiplier * scale
```

+ The network you are training needs to support it though! See [PR#545](https://github.com/kohya-ss/sd-scripts/pull/545) for more details.
+
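To make the masking and rescaling described above concrete, here is a tiny standalone sketch in plain PyTorch. It is illustrative only, not the sd-scripts implementation, and the tensor shape and dropout value are made up.

```python
import torch

# Illustrative only: the rank_dropout masking and rescaling described above,
# applied to a made-up tensor. Not the sd-scripts implementation.
rank_dropout = 0.3
lx = torch.randn(4, 16)  # stand-in for the LoRA "down" output

# Binary mask: True with probability 1 - rank_dropout, False otherwise.
mask = torch.rand(lx.shape) > rank_dropout

# Zero out the masked elements, then rescale the survivors so the expected
# magnitude of lx stays the same: 1.0 / (1.0 - rank_dropout).
scale = 1.0 / (1.0 - rank_dropout)
lx = lx * mask * scale
```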
---

###### `use_tucker`

---

+ That concludes the `network_args`.
+
+ ---
+
##### `--network_dropout`

+ This float controls the fraction of neurons dropped out of training at every step: `0` or `None` is the default behavior (no dropout), while `1` would drop all neurons. Using `weight_decompose=True` will ignore `network_dropout`; only rank and module dropout will be applied.

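If you want to see what that probability means in practice, here is a quick plain-PyTorch illustration; it is unrelated to how sd-scripts wires the flag internally.

```python
import torch
import torch.nn.functional as F

# Purely illustrative: what a dropout probability does to a tensor of activations.
x = torch.ones(10)

print(F.dropout(x, p=0.0, training=True))  # p=0: everything kept, unchanged
print(F.dropout(x, p=0.3, training=True))  # p=0.3: ~30% zeros, survivors scaled to 1/0.7
print(F.dropout(x, p=1.0, training=True))  # p=1: every element dropped (all zeros)
```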
```python
--network_dropout=0 \

##### `--lr_scheduler`

+ A learning rate scheduler in PyTorch is a tool that adjusts the learning rate during the training process. It’s used to modulate the learning rate in response to how the model is performing, which can lead to increased performance and reduced training time.
+
+ Possible values: `linear`, `cosine`, `cosine_with_restarts`, `polynomial`, `constant` (default), `constant_with_warmup`, `adafactor`
+
+ Note: the `adafactor` scheduler can only be used with the `adafactor` optimizer!
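As a minimal sketch of what a cosine schedule actually does to the learning rate (plain PyTorch, not how sd-scripts constructs its schedulers; the model and step count are placeholders):

```python
import torch

# Placeholder model and step count; this only visualizes the cosine decay curve.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400)

lrs = []
for _ in range(400):
    optimizer.step()      # a real loop would compute a loss and call loss.backward() first
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

print(lrs[0], lrs[199], lrs[-1])  # decays from ~1e-4 toward 0 along a cosine curve
```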

```python
--lr_scheduler="cosine" \

##### `--lr_scheduler_num_cycles`

+ The number of restarts for the cosine-with-restarts scheduler (`--lr_scheduler="cosine_with_restarts"`). It isn't used by any other scheduler.

```py
--lr_scheduler_num_cycles=1 \

---

+ ##### `--learning_rate` and `--unet_lr` and `--text_encoder_lr`

+ The learning rate determines how much the weights of the network are updated in response to the estimated error each time the weights are updated. If the learning rate is too large, the weights may overshoot the optimal solution. If it’s too small, the weights may get stuck in a suboptimal solution.

+ For AdamW the optimal LR seems to be `0.0001` or `1e-4` if you want to impress your friends.

```py
+ --learning_rate=0.0001 \
--unet_lr=0.0001 \
--text_encoder_lr=0.0001 \
```

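The reason there are three separate flags is that the UNet and the text encoder can be trained at different speeds. In plain PyTorch terms this corresponds to optimizer parameter groups, roughly as sketched below; the modules are placeholders, not the sd-scripts internals.

```python
import torch

# Placeholder modules standing in for the UNet and the text encoder.
unet = torch.nn.Linear(16, 16)
text_encoder = torch.nn.Linear(16, 16)

optimizer = torch.optim.AdamW(
    [
        {"params": unet.parameters(), "lr": 1e-4},          # analogous to --unet_lr
        {"params": text_encoder.parameters(), "lr": 1e-4},  # analogous to --text_encoder_lr
    ],
    lr=1e-4,  # default for any group without its own lr, analogous to --learning_rate
)
```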

##### `--scale_weight_norms`

+ Max-norm regularization is a technique that constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant. It prevents the weights from growing too large and helps improve the performance of stochastic gradient descent training of deep neural nets.

+ Dropout affects the network architecture without changing the weights, while Max-Norm Regularization directly modifies the weights of the network. Both techniques are used to prevent overfitting and improve the generalization of the model. You can learn more about both in this [research paper](https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf).

+ ```py
+ --scale_weight_norms=1.0 \
+ ```
+
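A minimal sketch of the max-norm idea on a single weight matrix, illustrative only and not the sd-scripts implementation:

```python
import torch

# Illustrative only: cap the norm of one weight matrix at max_norm.
max_norm = 1.0
weight = torch.randn(32, 32) * 5         # deliberately oversized weights

norm = weight.norm()
if norm > max_norm:
    weight = weight * (max_norm / norm)  # same direction, magnitude capped at max_norm

print(weight.norm())                     # ~1.0
```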
+ ---
+
+ ##### `--max_grad_norm`
+
+ Also known as gradient clipping. If you notice that gradients are exploding during training (the loss becomes NaN or very large), consider adjusting the `--max_grad_norm` parameter. It operates on the gradients during backpropagation, while `--scale_weight_norms` operates on the weights of the network, so the two complement each other and provide a more robust approach to stabilizing the learning process and improving model performance.

```py
+ --max_grad_norm=1.0 \
```

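In plain PyTorch, gradient clipping looks roughly like this (an illustrative sketch, not the sd-scripts training loop):

```python
import torch

# Illustrative sketch: cap the global gradient norm before the optimizer update.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(4, 8)).pow(2).sum() * 1e6  # exaggerated loss -> very large gradients
loss.backward()

# clip_grad_norm_ returns the total norm measured before clipping and rescales
# the gradients in place if that norm exceeds max_norm.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm)

optimizer.step()
optimizer.zero_grad()
```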
---

---

+ ##### `--save_every_n_epochs` and `--save_last_n_epochs` or `--save_every_n_steps` and `--save_last_n_steps`

+ - `--save_every_n_steps` and `--save_every_n_epochs`: A LoRA file will be created at each n-th step or epoch specified here.
+ - `--save_last_n_steps` and `--save_last_n_epochs`: Discards every saved file except for the last `n` you specify here.
+
+ Training will always end at whatever you specify in `--max_train_epochs` or `--max_train_steps`.

```py
+ --save_every_n_epochs=50 \
```

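For example, a combination like the one below (the numbers are only an illustration) would write a LoRA file every 50 epochs and keep just the two most recent of those saves:

```py
--save_every_n_epochs=50 \
--save_last_n_epochs=2 \
```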
---
Specify the number of steps or epochs to train. If both `--max_train_steps` and `--max_train_epochs` are specified, the number of epochs takes precedence.

```py
+ --max_train_steps=400 \
```

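As a rough rule of thumb (an estimate only; it ignores gradient accumulation and multi-GPU splitting), one epoch is about the number of images times the dataset repeats divided by the batch size, which lets you convert between steps and epochs:

```python
# Rough estimate only: ignores gradient accumulation and multi-GPU splitting.
images = 100      # images in the dataset (placeholder)
repeats = 2       # dataset repeats (placeholder)
batch_size = 1

steps_per_epoch = (images * repeats) // batch_size
print(steps_per_epoch)        # 200
print(400 / steps_per_epoch)  # --max_train_steps=400 would then be about 2 epochs
```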
---

---

+ ##### `--multires_noise_iterations` and `--multires_noise_discount`
+
+ ⚠️
+
+ ```python
+ --multires_noise_iterations=10 \
+ --multires_noise_discount=0.1 \
+ ```
+
+ ---
+
##### `--sample_prompts` and `--sample_sampler` and `--sample_every_n_steps`

You have the option of generating images during training so you can check the progress; the argument lets you pick between different samplers. By default it is `ddim`, so you better change it!

---

+ So, the whole thing would look something like this:
+
+ ```python
+
+ ```
+
</details>
</div>
