k4d3 committed on
Commit f5ac613
1 Parent(s): bff67d4

Signed-off-by: Balazs Horvath <[email protected]>

Files changed (1)
  1. README.md +63 -35
README.md CHANGED
@@ -63,14 +63,13 @@ The Yiff Toolkit is a comprehensive set of tools designed to enhance your creati
63
  - [`--network_dropout`](#--network_dropout)
64
  - [`--lr_scheduler`](#--lr_scheduler)
65
  - [`--lr_scheduler_num_cycles`](#--lr_scheduler_num_cycles)
66
- - [`--learning_rate`](#--learning_rate)
67
- - [`--unet_lr`](#--unet_lr)
68
- - [`--text_encoder_lr`](#--text_encoder_lr)
69
  - [`--network_dim`](#--network_dim)
70
  - [`--output_name`](#--output_name)
71
  - [`--scale_weight_norms`](#--scale_weight_norms)
 
72
  - [`--no_half_vae`](#--no_half_vae)
73
- - [`--save_every_n_epochs`](#--save_every_n_epochs)
74
  - [`--mixed_precision`](#--mixed_precision)
75
  - [`--save_precision`](#--save_precision)
76
  - [`--caption_extension`](#--caption_extension)
@@ -80,6 +79,7 @@ The Yiff Toolkit is a comprehensive set of tools designed to enhance your creati
80
  - [`--max_train_steps`](#--max_train_steps)
81
  - [`--shuffle_caption`](#--shuffle_caption)
82
  - [`--sdpa` or `--xformers` or `--mem_eff_attn`](#--sdpa-or---xformers-or---mem_eff_attn)
 
83
  - [`--sample_prompts` and `--sample_sampler` and `--sample_every_n_steps`](#--sample_prompts-and---sample_sampler-and---sample_every_n_steps)
84
  - [CosXL Training](#cosxl-training)
85
  - [Embeddings for 1.5 and SDXL](#embeddings-for-15-and-sdxl)
@@ -485,6 +485,8 @@ If you are training with multiple GPUs, ensure that the total number of prompts
485
  <details>
486
  <summary>Click to reveal training commands.</summary>
487
 
 
 
488
  ##### `accelerate launch`
489
 
490
  For two GPUs:
@@ -499,6 +501,8 @@ Single GPU:
499
  accelerate launch --num_processes=1 --num_machines=1 --gpu_ids=0 --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
500
  ```
501
 
 
 
502
  ##### `--lowram`
503
 
504
  If you are running out of system memory, like I do with 2 GPUs and a really fat model that gets loaded into it per GPU, this option will help you save a bit of it and might get you out of OOM hell.
@@ -660,6 +664,8 @@ conv_block_alphas = [conv_alpha] * num_total_blocks
660
 
661
  ###### `module_dropout` and `dropout` and `rank_dropout`
662
 
 
 
663
  `rank_dropout` is a form of dropout, which is a regularization technique used in neural networks to prevent overfitting and improve generalization. However, unlike traditional dropout, which randomly sets a proportion of individual inputs to zero, `rank_dropout` operates on the rank of the input tensor `lx`. First, a binary mask with the same rank as `lx` is created, with each element set to `True` with probability `1 - rank_dropout` and `False` otherwise. Then the `mask` is applied to `lx` to randomly set some of its elements to zero. After applying the dropout, a scaling factor is applied to `lx` to compensate for the dropped-out elements. This is done to ensure that the expected sum of `lx` remains the same before and after dropout. The scaling factor is `1.0 / (1.0 - self.rank_dropout)`.
664
 
665
  It’s called “rank” dropout because it operates on the rank of the input tensor, rather than its individual elements. This can be particularly useful in tasks where the rank of the input is important.
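For intuition, here is a minimal PyTorch sketch of the idea (an illustration only, not the exact sd-scripts implementation; the function name and tensor shapes are assumptions):

```python
import torch

def rank_dropout(lx: torch.Tensor, p: float) -> torch.Tensor:
    """Illustrative sketch: drop whole ranks of `lx` (its last dimension) with probability `p`."""
    if p <= 0.0:
        return lx
    # Keep each rank with probability 1 - p.
    mask = (torch.rand(lx.size(0), lx.size(-1), device=lx.device) > p).to(lx.dtype)
    # Broadcast the per-rank mask over any middle dimensions (e.g. sequence length).
    mask = mask.view((lx.size(0),) + (1,) * (lx.dim() - 2) + (lx.size(-1),))
    # Zero the dropped ranks, then rescale so the expected sum of lx is unchanged.
    return lx * mask * (1.0 / (1.0 - p))
```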
@@ -699,6 +705,8 @@ def forward(self, x):
699
  return org_forwarded + lx * self.multiplier * scale
700
  ```
701
 
 
 
702
  ---
703
 
704
  ###### `use_tucker`
@@ -771,9 +779,13 @@ Specifies the alpha of each block, this too also takes 25 numbers if you don't s
771
 
772
  ---
773
 
774
  ##### `--network_dropout`
775
 
776
- Using `weight_decompose=True` will ignore `network_dropout` and only rank and module dropout will be applied.
777
 
778
  ```python
779
  --network_dropout=0 \
@@ -783,7 +795,11 @@ Using `weight_decompose=True` will ignore `network_dropout` and only rank and mo
783
 
784
  ##### `--lr_scheduler`
785
 
786
- ⚠️
787
 
788
  ```python
789
  --lr_scheduler="cosine" \
@@ -793,7 +809,7 @@ Using `weight_decompose=True` will ignore `network_dropout` and only rank and mo
793
 
794
  ##### `--lr_scheduler_num_cycles`
795
 
796
- ⚠️
797
 
798
  ```py
799
  --lr_scheduler_num_cycles=1 \
@@ -801,31 +817,15 @@ Using `weight_decompose=True` will ignore `network_dropout` and only rank and mo
801
 
802
  ---
803
 
804
- ##### `--learning_rate`
805
-
806
- ⚠️
807
-
808
- ```py
809
- --learning_rate=0.0001 \
810
- ```
811
-
812
- ---
813
 
814
- ##### `--unet_lr`
815
 
816
- ⚠️
817
 
818
  ```py
 
819
  --unet_lr=0.0001 \
820
- ```
821
-
822
- ---
823
-
824
- ##### `--text_encoder_lr`
825
-
826
- ⚠️
827
-
828
- ```py
829
  --text_encoder_lr=0.0001 \
830
  ```
831
 
@@ -855,14 +855,22 @@ Specify the output name excluding the file extension.
855
 
856
  ##### `--scale_weight_norms`
857
 
858
- [![An AI generated image.](https://huggingface.co/k4d3/yiff_toolkit/resolve/main/static/tutorial/dropout1.png)](https://huggingface.co/k4d3/yiff_toolkit/resolve/main/static/tutorial/dropout1.png)
859
 
860
- Encourages the LoRA to diversify it's training by randomly removing some weights to prevent overfitting, in the real world this is called Max-norm regularization.
861
 
862
- The network you are training needs to support it though! See [PR#545](https://github.com/kohya-ss/sd-scripts/pull/545) for more details.
863
 
864
  ```py
865
- --scale_weight_norms=1 \
866
  ```
867
 
868
  ---
@@ -877,12 +885,15 @@ Disables mixed precision for the SDXL VAE and sets it to `float32`. Very useful
877
 
878
  ---
879
 
880
- ##### `--save_every_n_epochs`
881
 
882
- ⚠️
883
 
884
  ```py
885
- --save_every_n_epochs=10 \
886
  ```
887
 
888
  ---
@@ -951,7 +962,7 @@ Repeats the dataset when training with captions, by default it is set to `1` so
951
  Specify the number of steps or epochs to train. If both `--max_train_steps` and `--max_train_epochs` are specified, the number of epochs takes precedence.
952
 
953
  ```py
954
- --max_train_steps=500 \
955
  ```
956
 
957
  ---
@@ -974,6 +985,17 @@ The choice between `--xformers` or `--mem_eff_attn` and `--spda` will depend on
974
 
975
  ---
976
977
  ##### `--sample_prompts` and `--sample_sampler` and `--sample_every_n_steps`
978
 
979
  You have the option of generating images during training so you can check the progress. The argument lets you pick between different samplers; by default it is on `ddim`, so you better change it!
@@ -996,6 +1018,12 @@ ddim, pndm, lms, euler, euler_a, heun, dpm_2, dpm_2_a, dpmsolver, dpmsolver++, d
996
 
997
  ---
998
999
  </details>
1000
  </div>
1001
 
 
63
  - [`--network_dropout`](#--network_dropout)
64
  - [`--lr_scheduler`](#--lr_scheduler)
65
  - [`--lr_scheduler_num_cycles`](#--lr_scheduler_num_cycles)
66
+ - [`--learning_rate` and `--unet_lr` and `--text_encoder_lr`](#--learning_rate-and---unet_lr-and---text_encoder_lr)
 
 
67
  - [`--network_dim`](#--network_dim)
68
  - [`--output_name`](#--output_name)
69
  - [`--scale_weight_norms`](#--scale_weight_norms)
70
+ - [`--max_grad_norm`](#--max_grad_norm)
71
  - [`--no_half_vae`](#--no_half_vae)
72
+ - [`--save_every_n_epochs` and `--save_last_n_epochs` or `--save_every_n_steps` and `--save_last_n_steps`](#--save_every_n_epochs-and---save_last_n_epochs-or---save_every_n_steps-and---save_last_n_steps)
73
  - [`--mixed_precision`](#--mixed_precision)
74
  - [`--save_precision`](#--save_precision)
75
  - [`--caption_extension`](#--caption_extension)
 
79
  - [`--max_train_steps`](#--max_train_steps)
80
  - [`--shuffle_caption`](#--shuffle_caption)
81
  - [`--sdpa` or `--xformers` or `--mem_eff_attn`](#--sdpa-or---xformers-or---mem_eff_attn)
82
+ - [`--multires_noise_iterations` and `--multires_noise_discount`](#--multires_noise_iterations-and---multires_noise_discount)
83
  - [`--sample_prompts` and `--sample_sampler` and `--sample_every_n_steps`](#--sample_prompts-and---sample_sampler-and---sample_every_n_steps)
84
  - [CosXL Training](#cosxl-training)
85
  - [Embeddings for 1.5 and SDXL](#embeddings-for-15-and-sdxl)
 
485
  <details>
486
  <summary>Click to reveal training commands.</summary>
487
 
488
+ ---
489
+
490
  ##### `accelerate launch`
491
 
492
  For two GPUs:
 
501
  accelerate launch --num_processes=1 --num_machines=1 --gpu_ids=0 --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
502
  ```
503
 
504
+ ---
505
+
506
  ##### `--lowram`
507
 
508
  If you are running out of system memory, like I do with 2 GPUs and a really fat model that gets loaded into it per GPU, this option will help you save a bit of it and might get you out of OOM hell.
 
664
 
665
  ###### `module_dropout` and `dropout` and `rank_dropout`
666
 
667
+ [![An AI generated image.](https://huggingface.co/k4d3/yiff_toolkit/resolve/main/static/tutorial/dropout1.png)](https://huggingface.co/k4d3/yiff_toolkit/resolve/main/static/tutorial/dropout1.png)
668
+
669
  `rank_dropout` is a form of dropout, which is a regularization technique used in neural networks to prevent overfitting and improve generalization. However, unlike traditional dropout, which randomly sets a proportion of individual inputs to zero, `rank_dropout` operates on the rank of the input tensor `lx`. First, a binary mask with the same rank as `lx` is created, with each element set to `True` with probability `1 - rank_dropout` and `False` otherwise. Then the `mask` is applied to `lx` to randomly set some of its elements to zero. After applying the dropout, a scaling factor is applied to `lx` to compensate for the dropped-out elements. This is done to ensure that the expected sum of `lx` remains the same before and after dropout. The scaling factor is `1.0 / (1.0 - self.rank_dropout)`.
670
 
671
  It’s called “rank” dropout because it operates on the rank of the input tensor, rather than its individual elements. This can be particularly useful in tasks where the rank of the input is important.
 
705
  return org_forwarded + lx * self.multiplier * scale
706
  ```
707
 
708
+ The network you are training needs to support it though! See [PR#545](https://github.com/kohya-ss/sd-scripts/pull/545) for more details.
709
+
710
  ---
711
 
712
  ###### `use_tucker`
 
779
 
780
  ---
781
 
782
+ That concludes the `network_args`.
783
+
784
+ ---
785
+
786
  ##### `--network_dropout`
787
 
788
+ This float controls the fraction of neurons dropped out of training at every step; `0` or `None` is the default behavior (no dropout), while `1` would drop all neurons. Using `weight_decompose=True` will ignore `network_dropout`, and only rank and module dropout will be applied.
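For intuition, this is what ordinary neuron dropout does to a tensor in PyTorch (purely illustrative, not the sd-scripts internals):

```python
import torch
import torch.nn.functional as F

x = torch.ones(8)
# With p=0.25, roughly a quarter of the elements are zeroed each step and the
# survivors are scaled by 1 / (1 - p) so the expected sum stays the same.
print(F.dropout(x, p=0.25, training=True))
```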
789
 
790
  ```python
791
  --network_dropout=0 \
 
795
 
796
  ##### `--lr_scheduler`
797
 
798
+ A learning rate scheduler in PyTorch is a tool that adjusts the learning rate during the training process. It’s used to modulate the learning rate in response to how the model is performing, which can lead to increased performance and reduced training time.
799
+
800
+ Possible values: `linear`, `cosine`, `cosine_with_restarts`, `polynomial`, `constant` (default), `constant_with_warmup`, `adafactor`
801
+
802
+ Note, `adafactor` scheduler can only be used with the `adafactor` optimizer!
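If you want to see what a scheduler actually does to the learning rate, here is a small stand-alone PyTorch sketch (illustrative only; sd-scripts wires the scheduler up for you):

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-4)
# Cosine decay from the initial LR down towards zero over 1000 steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    optimizer.step()      # normally preceded by loss.backward()
    scheduler.step()      # adjust the LR according to the schedule
    if step % 250 == 0:
        print(step, scheduler.get_last_lr())
```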
803
 
804
  ```python
805
  --lr_scheduler="cosine" \
 
809
 
810
  ##### `--lr_scheduler_num_cycles`
811
 
812
+ Number of restarts for the `cosine_with_restarts` scheduler; it isn't used by any other scheduler.
813
 
814
  ```py
815
  --lr_scheduler_num_cycles=1 \
 
817
 
818
  ---
819
 
820
+ ##### `--learning_rate` and `--unet_lr` and `--text_encoder_lr`
821
 
822
+ The learning rate determines how much the weights of the network are adjusted in response to the estimated error each time they are updated. If the learning rate is too large, the weights may overshoot the optimal solution. If it’s too small, the weights may get stuck in a suboptimal solution.
823
 
824
+ For AdamW the optimal LR seems to be `0.0001` or `1e-4` if you want to impress your friends.
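To make that concrete, the plain gradient-descent view of a weight update is just `w - lr * gradient` (a toy sketch, not how the trainer is implemented):

```python
w, grad = 1.0, 0.5

for lr in (1e-2, 1e-4):
    # The larger learning rate moves the weight 100x further per step.
    print(lr, w - lr * grad)
```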
825
 
826
  ```py
827
+ --learning_rate=0.0001 \
828
  --unet_lr=0.0001 \
829
  --text_encoder_lr=0.0001 \
830
  ```
831
 
 
855
 
856
  ##### `--scale_weight_norms`
857
 
858
+ Max-norm regularization is a technique that constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant. It prevents the weights from growing too large and helps improve the performance of stochastic gradient descent training of deep neural nets.
859
 
860
+ Dropout affects the network architecture without changing the weights, while Max-Norm Regularization directly modifies the weights of the network. Both techniques are used to prevent overfitting and improve the generalization of the model. You can learn more about both in this [research paper](https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf).
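Conceptually, max-norm regularization just rescales a weight tensor whenever its norm exceeds the chosen constant; a rough PyTorch sketch of the idea (the actual flag applies this inside the LoRA network, not via a helper like this):

```python
import torch

def apply_max_norm(weight: torch.Tensor, max_norm: float = 1.0) -> torch.Tensor:
    # If the norm of the weight tensor exceeds max_norm, scale it back down.
    norm = weight.norm()
    if norm > max_norm:
        weight = weight * (max_norm / norm)
    return weight
```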
861
 
862
+ ```py
863
+ --scale_weight_norms=1.0 \
864
+ ```
865
+
866
+ ---
867
+
868
+ ##### `--max_grad_norm`
869
+
870
+ Also known as gradient clipping. If you notice that gradients are exploding during training (the loss becomes NaN or very large), consider adjusting the `--max_grad_norm` parameter. It operates on the gradients during backpropagation, while `--scale_weight_norms` operates on the weights of the neural network, so the two complement each other and provide a more robust approach to stabilizing the learning process and improving model performance.
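Under the hood this amounts to one call in the training step; a hedged sketch of where gradient clipping sits (the training script handles this for you when the flag is set):

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(2, 4)).pow(2).mean()
loss.backward()
# Rescale all gradients so their combined norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```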
871
 
872
  ```py
873
+ --max_grad_norm=1.0 \
874
  ```
875
 
876
  ---
 
885
 
886
  ---
887
 
888
+ ##### `--save_every_n_epochs` and `--save_last_n_epochs` or `--save_every_n_steps` and `--save_last_n_steps`
889
 
890
+ - `--save_every_n_steps` and `--save_every_n_epochs`: A LoRA file will be created at each n-th step or epoch specified here.
891
+ - `--save_last_n_steps` and `--save_last_n_epochs`: Discards every saved file except for the last `n` you specify here.
892
+
893
+ Learning will always end with what you specify in `--max_train_epochs` or `--max_train_steps`.
894
 
895
  ```py
896
+ --save_every_n_epochs=50 \
897
  ```
898
 
899
  ---
 
962
  Specify the number of steps or epochs to train. If both `--max_train_steps` and `--max_train_epochs` are specified, the number of epochs takes precedence.
963
 
964
  ```py
965
+ --max_train_steps=400 \
966
  ```
967
 
968
  ---
 
985
 
986
  ---
987
 
988
+ ##### `--multires_noise_iterations` and `--multires_noise_discount`
989
+
990
+ ⚠️
991
+
992
+ ```python
993
+ --multires_noise_iterations=10 \
994
+ --multires_noise_discount=0.1 \
995
+ ```
996
+
997
+ ---
998
+
999
  ##### `--sample_prompts` and `--sample_sampler` and `--sample_every_n_steps`
1000
 
1001
  You have the option of generating images during training so you can check the progress. The argument lets you pick between different samplers; by default it is on `ddim`, so you better change it!
 
1018
 
1019
  ---
1020
 
1021
+ So, the whole thing would look something like this:
1022
+
1023
+ ```python
1024
+
1025
+ ```
1026
+
1027
  </details>
1028
  </div>
1029