Classifier-free guidance resolution weighting
#51, opened by snatchysquid
Section 3.4 of the ControlNet paper describes classifier-free guidance resolution weighting (CFG-RW). Quoting:
"In challenging cases, e.g., when no prompts are given, adding it to both ϵ_uc and ϵ_c will completely remove CFG guidance (Figure 5b);
using only ϵ_c will make the guidance very strong (Figure 5c).
Our solution is to first add the conditioning image to ϵ_c and then multiply a weight w_i to each connection between Stable Diffusion and ControlNet according to the resolution of each block w_i = 64/h_i, where h_i is the size of the i-th block, e.g., h_1 = 8, h_2 = 16, ..., h_13 = 64"
I don't quite understand what this means or where it is implemented in the code in the GitHub repository. My questions are:
- Is my understanding correct that the model is trained without any weighting, and that at inference ϵ_uc comes from unconditional SD without ControlNet, while ϵ_c uses ControlNet with each block's output multiplied by w_i before it is added to the skip connection, i.e. SD_Layer_i_final_output = SD_Layer_i_output + (w_i * ControlNet_Layer_i_output)?
- If so, what is the logic and motivation for doing that? It is not obvious to me that this is what we want to do.
- Finally, where is this implemented in the code?
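
To make the first question concrete, here is a minimal sketch of how I currently read the quoted passage. It is only my interpretation, not code from the repository: the function names, the four toy block sizes, and the use of plain floats in place of feature maps are all assumptions made for illustration. The last function is the standard classifier-free guidance combination.

```python
# Sketch of my reading of CFG-RW. All names here are hypothetical and
# scalars stand in for the per-block feature maps.

def resolution_weights(block_sizes, base=64):
    """Return w_i = base / h_i for each block size h_i."""
    return [base / h for h in block_sizes]


def add_weighted_control(sd_outputs, control_outputs, weights):
    """Per block: SD output + w_i * ControlNet output."""
    return [sd + w * c for sd, c, w in zip(sd_outputs, control_outputs, weights)]


def cfg_combine(eps_uc, eps_c, guidance_scale):
    """Standard classifier-free guidance combination of the two predictions."""
    return eps_uc + guidance_scale * (eps_c - eps_uc)


# Toy block sizes chosen for illustration only; the paper lists 13 blocks.
toy_sizes = [8, 16, 32, 64]
weights = resolution_weights(toy_sizes)  # [8.0, 4.0, 2.0, 1.0]

# Conditional branch: ControlNet outputs are scaled, then added to SD outputs.
conditional = add_weighted_control([1.0] * 4, [0.5] * 4, weights)

# Unconditional branch (as I understand it): SD outputs with no ControlNet term.
unconditional = [1.0] * 4
```

Under this reading, the smallest block (h = 8) gets the largest weight and the full-resolution block (h = 64) gets weight 1. Whether that matches what the repository actually does is exactly what I am asking.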