k4d3 committed on
Commit 1024384
1 Parent(s): 0c4f2f7

Signed-off-by: Balazs Horvath <[email protected]>

Files changed (1)
  1. README.md +30 -1
README.md CHANGED
 
@@ -79,6 +79,7 @@ The Yiff Toolkit is a comprehensive set of tools designed to enhance your creati
  - [`--shuffle_caption`](#--shuffle_caption)
  - [`--sdpa` or `--xformers` or `--mem_eff_attn`](#--sdpa-or---xformers-or---mem_eff_attn)
  - [`--multires_noise_iterations` and `--multires_noise_discount`](#--multires_noise_iterations-and---multires_noise_discount)
+ - [Implementation Details](#implementation-details)
  - [`--sample_prompts` and `--sample_sampler` and `--sample_every_n_steps`](#--sample_prompts-and---sample_sampler-and---sample_every_n_steps)
  - [Embeddings for 1.5 and SDXL](#embeddings-for-15-and-sdxl)
  - [ComfyUI Walkthrough any%](#comfyui-walkthrough-any)
 
@@ -745,28 +746,44 @@ If `use_tucker` is `False` or not set, or if the kernel size k_size is `(1, 1)`,
  An additional learned parameter that scales the contribution of the low-rank weights before they are added to the original weights. This scalar can control the extent to which the low-rank adaptation modifies the original weights. By training this scalar, the model can learn the optimal balance between preserving the original pre-trained weights and allowing for low-rank adaptation.

  ```python
+ # Check if the 'use_scalar' flag is set to True.
  if use_scalar:
+     # If True, initialize a learnable parameter 'scalar' with a starting value of 0.0.
+     # This parameter will be optimized during the training process.
      self.scalar = nn.Parameter(torch.tensor(0.0))
  else:
+     # If the 'use_scalar' flag is False, set 'scalar' to a fixed value of 1.0.
+     # This means the low-rank weights will be added to the original weights without scaling.
      self.scalar = torch.tensor(1.0)
  ```

+ The `use_scalar` flag lets the model decide how much influence the low-rank weights should have on the final weights. If `use_scalar` is `True`, the model learns the optimal value for `self.scalar` during training; this scalar multiplies the low-rank weights before they are added to the original weights, balancing the original pre-trained weights against the new low-rank adaptation, which can lead to better performance and more efficient training. The initial value of `0.0` means the model starts with no contribution from the low-rank weights and learns the appropriate scale as training progresses.
+
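To make the effect concrete, here is a minimal, self-contained sketch (not the LyCORIS source; the layer sizes and variable names are made up) of how a learned scalar can gate a low-rank update before it is merged into a frozen weight:

```python
import torch
import torch.nn as nn

# Hypothetical shapes for a linear layer with a rank-4 adaptation.
out_features, in_features, rank = 320, 320, 4

orig_weight = torch.randn(out_features, in_features)        # frozen pre-trained weight
lora_down = nn.Parameter(torch.randn(rank, in_features))    # low-rank factors
lora_up = nn.Parameter(torch.zeros(out_features, rank))
scalar = nn.Parameter(torch.tensor(0.0))                     # as with use_scalar=True: starts at 0, learned

# The scalar gates how much of the low-rank update reaches the merged weight.
effective_weight = orig_weight + scalar * (lora_up @ lora_down)
print(effective_weight.shape)  # torch.Size([320, 320])
```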
  ---

  ###### `rank_dropout_scale`

- A boolean flag that determines whether to scale the dropout mask to have an average value of `1` or not. This can be useful in certain situations to maintain the scale of the tensor after dropout is applied.
+ A boolean flag that determines whether to scale the dropout mask so that its average value is `1`. This is particularly useful when you want to maintain the original scale of the tensor values after applying dropout, which can be important for the stability of the training process.

  ```python
  def forward(self, orig_weight, org_bias, new_weight, new_bias, *args, **kwargs):
+     # Retrieve the device that the 'oft_blocks' tensor is on, so that any new tensors are created on the same device.
      device = self.oft_blocks.device
+
+     # Check if rank dropout is enabled and the model is in training mode.
      if self.rank_dropout and self.training:
+         # Draw a tensor of uniform random values and build the dropout mask by checking
+         # whether each value is less than the 'self.rank_dropout' probability.
          drop = (torch.rand(self.oft_blocks, device=device) < self.rank_dropout).to(
              self.oft_blocks.dtype
          )
+
+         # If 'rank_dropout_scale' is True, scale the dropout mask to have an average value of 1.
+         # This helps maintain the scale of the tensor's values after dropout is applied.
          if self.rank_dropout_scale:
              drop /= drop.mean()
      else:
+         # If rank dropout is not enabled or the model is not in training mode, set 'drop' to 1 (no dropout).
          drop = 1
  ```
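As a quick illustration of why the rescaling matters, here is a tiny, self-contained example (toy values, not taken from the library) showing that dividing the mask by its mean restores an average of 1:

```python
import torch

torch.manual_seed(0)
rank_dropout = 0.5

# Toy dropout mask over 8 values: entries are kept with probability 'rank_dropout'.
drop = (torch.rand(8) < rank_dropout).float()
print(drop.mean())    # below 1, so a masked tensor is scaled down on average

# Dividing by the mean rescales the surviving entries so the mask averages to 1.
drop = drop / drop.mean()
print(drop.mean())    # tensor(1.)
```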
 
 
@@ -999,6 +1016,10 @@ Each of these options modifies the attention mechanism used in the model, which
  - `--mem_eff_attn`: This flag enables the use of memory-efficient attention mechanisms in the model. The memory-efficient attention is designed to reduce the memory footprint during the training of transformer models, which can be particularly beneficial when working with large models or datasets.
  - `--sdpa`: This option enables the use of Scaled Dot-Product Attention (SDPA) within the model. SDPA is a fundamental component of transformer models that calculates the attention scores between queries and keys. It scales the dot products by the square root of the key dimensionality to stabilize gradients during training. This mechanism is particularly useful for handling long sequences and can potentially improve the model’s ability to capture long-range dependencies.

+ ```python
+ --sdpa
+ ```
+
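For reference, the operation these flags select between can be reproduced with PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`. The following is a minimal sketch with made-up tensor shapes, not the trainer's internal code:

```python
import torch
import torch.nn.functional as F

# Toy tensors: batch of 2, 8 attention heads, sequence length 16, head dimension 64.
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)

# Computes softmax(q @ k^T / sqrt(head_dim)) @ v, dispatching to a fused
# (memory-efficient or flash) kernel when one is available for the device and dtype.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```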
  ---

  ##### `--multires_noise_iterations` and `--multires_noise_discount`
 
@@ -1013,6 +1034,14 @@ The `--multires_noise_discount` parameter controls the extent to which the noise

  Please note that `--multires_noise_discount` has no effect without `--multires_noise_iterations`.

+ ###### Implementation Details
+
+ The `get_noise_noisy_latents_and_timesteps` function samples the noise that will be added to the latents. If `args.noise_offset` is set, it applies a noise offset; if `args.multires_noise_iterations` is set, it applies multi-resolution noise to the sampled noise.
+
+ The function then samples a random timestep for each image and adds noise to the latents according to the noise magnitude at each timestep. This is the forward diffusion process.
+
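A rough sketch of that noising step is shown below. It is an approximation for illustration, not the exact `sd-scripts` implementation, and it assumes a `diffusers`-style noise scheduler with an `add_noise` method:

```python
import torch

def get_noisy_latents_sketch(latents, noise_scheduler, noise_offset=0.0):
    """Illustrative forward-diffusion noising step for a batch of latents."""
    noise = torch.randn_like(latents)

    if noise_offset:
        # Shift the noise per sample and channel (the "noise offset" technique).
        noise = noise + noise_offset * torch.randn(
            latents.shape[0], latents.shape[1], 1, 1, device=latents.device
        )
    # (Multi-resolution noise would replace or augment 'noise' here; see the pyramid sketch below.)

    # Sample a random timestep for every image in the batch.
    batch_size = latents.shape[0]
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (batch_size,), device=latents.device
    ).long()

    # Add noise to the latents according to the noise magnitude at each timestep.
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    return noise, noisy_latents, timesteps
```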
+ The `pyramid_noise_like` function generates noise with a pyramid structure. It starts with the original noise and adds noise generated at progressively lower resolutions and upscaled back to full size. The noise at each level is scaled by the discount factor raised to the power of the level, and the result is rescaled to roughly unit variance. This function implements the multi-resolution noise.
+
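The following is a compact sketch of that idea, written as an illustration of the technique rather than a verbatim copy of the `sd-scripts` function; the helper name and defaults are made up:

```python
import torch
import torch.nn.functional as F

def pyramid_noise_sketch(noise, iterations=10, discount=0.1):
    """Add progressively coarser, upscaled noise to 'noise' (shape: batch, channels, h, w)."""
    b, c, h, w = noise.shape
    for level in range(iterations):
        scale = 2 ** (level + 1)                   # each level halves the resolution again
        if h // scale < 1 or w // scale < 1:
            break                                  # stop once the low-resolution grid disappears
        low_res = torch.randn(b, c, h // scale, w // scale, device=noise.device)
        # Upscale the coarse noise to full size and add it, attenuated by discount**level.
        noise = noise + F.interpolate(low_res, size=(h, w), mode="bilinear") * discount ** level
    return noise / noise.std()                     # rescale back to roughly unit variance
```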
  ```python
  --multires_noise_iterations=10 --multires_noise_discount=0.1
  ```