xywwww committed
Commit a779c13 · verified · 1 Parent(s): dd8aed0

Upload 4 files

Files changed (4)
  1. docs/annotator.md +49 -0
  2. docs/faq.md +21 -0
  3. docs/low_vram.md +15 -0
  4. docs/train.md +276 -0
docs/annotator.md ADDED

# Automatic Annotations

We provide gradio examples to obtain annotations that are aligned to our pretrained production-ready models.

Just run

    python gradio_annotator.py

Since everyone organizes their datasets differently, we do not hard-code any scripts for batch processing. But "gradio_annotator.py" is written in a very readable way, and modifying it to annotate your own images should be easy.
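
For example, a minimal batch-processing sketch for the Canny case could look like the following. The folder names, the thresholds, and the use of OpenCV's own Canny (rather than the gradio annotator code) are all assumptions for illustration, not part of this repository:

```python
# Hypothetical batch Canny annotation: read every image in an input folder and
# write a white-edge-on-black-background map to an output folder.
import os
import cv2

input_dir = './my_images'          # assumed input folder
output_dir = './my_annotations'    # assumed output folder
os.makedirs(output_dir, exist_ok=True)

for name in sorted(os.listdir(input_dir)):
    if not name.lower().endswith(('.png', '.jpg', '.jpeg')):
        continue
    img = cv2.imread(os.path.join(input_dir, name), cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(img, 100, 200)   # white edges on a black background
    cv2.imwrite(os.path.join(output_dir, name), edges)
```
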
In the gradio UI of "gradio_annotator.py" we have the following interfaces:

### Canny Edge

Be careful about whether you want "black edges on a white background" or "white edges on a black background".

![p](../github_page/a1.png)

### HED Edge

Be careful about whether you want "black edges on a white background" or "white edges on a black background".

![p](../github_page/a2.png)

### MLSD Edge

Be careful about whether you want "black edges on a white background" or "white edges on a black background".

![p](../github_page/a3.png)

### MIDAS Depth and Normal

Be careful about RGB vs. BGR channel order in normal maps.

![p](../github_page/a4.png)

### Openpose

Be careful about RGB vs. BGR channel order in pose maps.

For our production-ready model, the hand pose option is turned off.

![p](../github_page/a5.png)

### Uniformer Segmentation

Be careful about RGB vs. BGR channel order in segmentation maps.

![p](../github_page/a6.png)

docs/faq.md ADDED

# FAQs

**Q:** If the weight of a conv layer is zero, the gradient will also be zero, and the network will not learn anything. Why does "zero convolution" work?

**A:** This is wrong. Let us consider a very simple case

$$y=wx+b$$

for which we have

$$\partial y/\partial w=x,\quad \partial y/\partial x=w,\quad \partial y/\partial b=1.$$

If $w=0$ and $x \neq 0$, then

$$\partial y/\partial w \neq 0,\quad \partial y/\partial x=0,\quad \partial y/\partial b\neq 0,$$

which means that as long as $x \neq 0$, one gradient descent iteration will make $w$ non-zero. Then

$$\partial y/\partial x\neq 0,$$

so the zero convolutions progressively become a common conv layer with non-zero weights.
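
If you prefer to see this numerically, here is a small standalone PyTorch sketch of the same argument. The scalar names mirror the equations above; this snippet is not part of the ControlNet codebase:

```python
import torch

w = torch.zeros(1, requires_grad=True)   # the "zero" weight
b = torch.zeros(1, requires_grad=True)
x = torch.tensor([2.0])                  # any non-zero input

y = w * x + b
y.backward()

print(w.grad)   # equals x (non-zero) even though w == 0
print(b.grad)   # equals 1

with torch.no_grad():
    w -= 0.1 * w.grad                    # one gradient descent step
print(w)        # now non-zero, so dy/dx = w is non-zero from here on
```
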
docs/low_vram.md ADDED

# Enable Low VRAM Mode

If you are using an 8GB GPU card (or if you want a larger batch size), please open "config.py" and set

```python
save_memory = True
```

This feature is still being tested - not all graphics cards are guaranteed to work.

But it should be neat, as I can now diffuse at a batch size of 12.

(prompt "man")

![p](../github_page/ram12.jpg)

docs/train.md ADDED

# Train a ControlNet to Control SD

You are here because you want to control SD in your own way: maybe you have an idea for your perfect research project, and you will annotate some data or have already annotated your own dataset automatically or manually. Here, the control can be anything that can be converted to images, such as edges, keypoints, segmentations, etc.

Before moving on to your own dataset, we highly recommend first trying the toy dataset, Fill50K, as a sanity check. This will help you get a "feeling" for the training: you will know how long it takes for the model to converge, whether your device can complete the training in an acceptable amount of time, and what it "feels" like when the model converges.

We hope that after you read this page, you will find that training a ControlNet is as easy as (or easier than) training a pix2pix.

## Step 0 - Design your control

Let us take a look at a very simple task: controlling SD to fill circles with color.

![p](../github_page/t1.png)

This is simple: we want to control SD to fill a circle with colors, and the prompt contains some description of our target.

Stable Diffusion is trained on billions of images, and it already knows what "cyan" is, what a "circle" is, what "pink" is, and what a "background" is.

But it does not know the meaning of that "Control Image (Source Image)". Our goal is to teach it.

## Step 1 - Get a dataset

Just download the Fill50K dataset from [our huggingface page](https://huggingface.co/lllyasviel/ControlNet) (training/fill50k.zip, the file is only 200M!). Make sure that the data is decompressed as

    ControlNet/training/fill50k/prompt.json
    ControlNet/training/fill50k/source/X.png
    ControlNet/training/fill50k/target/X.png

In the folder "fill50k/source", you will have 50k images of circle outlines.

![p](../github_page/t2.png)

In the folder "fill50k/target", you will have 50k images of filled circles.

![p](../github_page/t3.png)

In "fill50k/prompt.json", you will have their filenames and prompts. Each prompt is like "a balabala color circle in some other color background".

![p](../github_page/t4.png)
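
For reference, "prompt.json" is a JSON-lines file: each line is one standalone JSON object with "source", "target", and "prompt" keys, which is exactly what the dataset script below reads. A line looks roughly like this (the filenames here are illustrative):

    {"source": "source/0.png", "target": "target/0.png", "prompt": "burly wood circle with orange background"}
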
## Step 2 - Load the dataset

Then you need to write a simple script to read this dataset for pytorch. (In fact we have written it for you in "tutorial_dataset.py".)

```python
import json
import cv2
import numpy as np

from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self):
        self.data = []
        with open('./training/fill50k/prompt.json', 'rt') as f:
            for line in f:
                self.data.append(json.loads(line))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        source_filename = item['source']
        target_filename = item['target']
        prompt = item['prompt']

        source = cv2.imread('./training/fill50k/' + source_filename)
        target = cv2.imread('./training/fill50k/' + target_filename)

        # Do not forget that OpenCV reads images in BGR order.
        source = cv2.cvtColor(source, cv2.COLOR_BGR2RGB)
        target = cv2.cvtColor(target, cv2.COLOR_BGR2RGB)

        # Normalize source images to [0, 1].
        source = source.astype(np.float32) / 255.0

        # Normalize target images to [-1, 1].
        target = (target.astype(np.float32) / 127.5) - 1.0

        return dict(jpg=target, txt=prompt, hint=source)
```

This will make your dataset into an array-like object in python. You can test this dataset simply by accessing it like an array:

```python
from tutorial_dataset import MyDataset

dataset = MyDataset()
print(len(dataset))

item = dataset[1234]
jpg = item['jpg']
txt = item['txt']
hint = item['hint']
print(txt)
print(jpg.shape)
print(hint.shape)
```

The outputs of this simple test on my machine are

    50000
    burly wood circle with orange background
    (512, 512, 3)
    (512, 512, 3)

And this code is in "tutorial_dataset_test.py".

In this way, the dataset is an array-like object with 50000 items. Each item is a dict with three entries: "jpg", "txt", and "hint". The "jpg" is the target image, the "hint" is the control image, and the "txt" is the prompt.

Do not ask us why we use these three names - this is related to the dark history of a library called LDM.

## Step 3 - What SD model do you want to control?

Then you need to decide which Stable Diffusion model you want to control. In this example, we will just use standard SD1.5. You can download it from the [official SD 1.5 model page](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main). You want the file ["v1-5-pruned.ckpt"](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main).

(Or ["v2-1_512-ema-pruned.ckpt"](https://huggingface.co/stabilityai/stable-diffusion-2-1-base/tree/main) if you are using SD2.)

Then you need to attach a ControlNet to the SD model. The architecture is

![img](../github_page/sd.png)

Note that all weights inside the ControlNet are also copied from SD, so that no layer is trained from scratch and you are still finetuning the entire model.

We provide a simple script for you to achieve this easily. If your SD filename is "./models/v1-5-pruned.ckpt" and you want the script to save the processed model (SD+ControlNet) at "./models/control_sd15_ini.ckpt", you can just run:

    python tool_add_control.py ./models/v1-5-pruned.ckpt ./models/control_sd15_ini.ckpt

Or, if you are using SD2:

    python tool_add_control_sd21.py ./models/v2-1_512-ema-pruned.ckpt ./models/control_sd21_ini.ckpt

You may also use other filenames as long as the command is "python tool_add_control.py input_path output_path".

This is the correct output from my machine:

![img](../github_page/t5.png)
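
Optionally, you can sanity-check the merged checkpoint by loading it the same way "tutorial_train.py" (shown in the next step) does. This is a hedged sketch, not an official tool:

```python
# Load the merged SD+ControlNet checkpoint created above and confirm it fits the
# model definition. create_model / load_state_dict are the cldm.model helpers
# that tutorial_train.py itself uses.
from cldm.model import create_model, load_state_dict

model = create_model('./models/cldm_v15.yaml').cpu()
keys = model.load_state_dict(
    load_state_dict('./models/control_sd15_ini.ckpt', location='cpu'), strict=False)
print(keys.missing_keys)     # should be empty if the merge worked
print(keys.unexpected_keys)  # likewise
```
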
## Step 4 - Train!

Happy! We finally come to the most exciting part: training!

The training code in "tutorial_train.py" is actually surprisingly simple:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tutorial_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict


# Configs
resume_path = './models/control_sd15_ini.ckpt'
batch_size = 4
logger_freq = 300
learning_rate = 1e-5
sd_locked = True
only_mid_control = False


# First use cpu to load models. Pytorch Lightning will automatically move it to GPUs.
model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict(resume_path, location='cpu'))
model.learning_rate = learning_rate
model.sd_locked = sd_locked
model.only_mid_control = only_mid_control


# Misc
dataset = MyDataset()
dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
logger = ImageLogger(batch_frequency=logger_freq)
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger])


# Train!
trainer.fit(model, dataloader)
```

(or "tutorial_train_sd21.py" if you are using SD2)

Thanks to our organized dataset object and the power of pytorch_lightning, the entire code is super short.

Now, you may take a look at [PyTorch Lightning's official Trainer docs](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.trainer.trainer.Trainer.html#trainer) to find out how to enable many useful features like gradient accumulation, multiple GPU training, accelerated dataset loading, flexible checkpoint saving, etc. All of these need only about one line of code. Great!
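
As one hedged illustration (the argument values are made up, and the exact API depends on your installed Lightning version), multi-GPU training plus periodic checkpoint saving could look roughly like this variation of the "Misc" block above:

```python
# Hypothetical variation of the "Misc" block: two GPUs, more dataloader workers,
# and a checkpoint every 1000 training steps.
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(every_n_train_steps=1000, save_top_k=-1)
dataloader = DataLoader(dataset, num_workers=8, batch_size=batch_size, shuffle=True)
trainer = pl.Trainer(gpus=2, precision=32, callbacks=[logger, checkpoint_callback])
```
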
Note that if you hit OOM, you may need to enable [Low VRAM mode](low_vram.md), and you may also need to use a smaller batch size with gradient accumulation. Or you may want to use some "advanced" tricks like sliced attention or xformers. For example:

```python
# Configs
batch_size = 1

# Misc
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger], accumulate_grad_batches=4)  # But this will be 4x slower
```

Note that training with an 8 GB laptop GPU is challenging. It would need GPU memory optimizations at least as good as automatic1111's UI, which may require expert modifications to the code.

### Screenshots

The training is fast. After 4000 steps (batch size 4, learning rate 1e-5, about 50 minutes on a PCIe 40G GPU), the results on my machine (in the output folder "image_log") are:

Control:

![img](../github_page/t/ip.png)

Prompt:

![img](../github_page/t/t.png)

Prediction:

![img](../github_page/t/op.png)

Ground Truth:

![img](../github_page/t/gt.png)

Note that SD's capability is preserved. Even when training on this super aligned dataset, it still draws some random textures and those snow decorations. (Besides, note that the ground truth looks a bit modified because it is converted from SD's latent image.)

Larger batch sizes and longer training will further improve this. Adequate training will make the filling perfect.

Of course, training SD to fill circles is meaningless, but this is a successful beginning of your story.

Let us work together to control large models more and more.

## Other options

Beyond the standard settings, we also provide two important parameters, "sd_locked" and "only_mid_control", that you need to know about.

### only_mid_control

By default, only_mid_control is False. When it is True, you will train the architecture below.

![img](../github_page/t6.png)

This can be helpful when your computation power is limited and you want to speed up training, or when you want to facilitate "global" context learning. Note that you can also pause training, set it to True, resume, pause again, set it back, and resume again.

If your computation device is good, perhaps you do not need this. But I also know some artists are willing to train a model on their laptop for a month - in that case, perhaps this option can be useful.

### sd_locked

By default, sd_locked is True. When it is False, you will train the architecture below.

![img](../github_page/t7.png)

This will unlock some layers in SD, and you will train them together with the ControlNet.

This option is DANGEROUS! If your dataset is not good enough, this may downgrade the capability of your SD model.

However, this option is also very useful when you are training on images with some specific style, or when you are training with special datasets (like a medical dataset with X-ray images or a geographic dataset with lots of Google Maps imagery). You can understand this as simultaneously training the ControlNet and something like a DreamBooth.

Also, if your dataset is large, you may want to end the training with a few thousand steps with those layers unlocked. This usually improves the "problem-specific" results a little. You may try it yourself to feel the difference.

Also, if you unlock some original layers, you may want a lower learning rate, like 2e-6.
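
For instance, a hedged sketch of the config changes for such an end-of-training phase might be (the resume path is only a placeholder for whichever checkpoint you want to continue from):

```python
# Hypothetical config for a short finetuning phase with SD layers unlocked.
# Only these values change relative to tutorial_train.py.
resume_path = './path/to/your/latest/checkpoint.ckpt'  # placeholder path
sd_locked = False
learning_rate = 2e-6
```
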
## More Considerations: The Sudden Convergence Phenomenon and Gradient Accumulation

![img](../github_page/ex1.jpg)

Because we use zero convolutions, SD should always be able to predict meaningful images. (If it cannot, the training has already failed.)

You will always find that at some point, the model "suddenly" becomes able to fit some training conditions. This means that you will get a basically usable model at about 3k to 7k steps (further training will improve it, but the model after this first "sudden convergence" should be basically functional).

Note that 3k to 7k steps is not very many, and you should consider a larger batch size rather than more training steps. If you can observe the "sudden convergence" at 3k steps with batch size 4, then, rather than training for 300k further steps, a better idea is to use 100x gradient accumulation to re-train those 3k steps with a 100x batch size. Note that we probably should not push this *too* far (100x accumulation is likely too extreme), but since the "sudden convergence" will *always* happen at some certain point, getting a better convergence there is more important.

Because that "sudden convergence" always happens, let's say it happens at 3k steps and our budget allows 90k steps. Then we have two options: (1) train 3k steps, hit the sudden convergence, then train 87k more steps; or (2) use 30x gradient accumulation and train 3k steps (90k real computation steps), then hit the sudden convergence.

In my experiments, (2) is usually better than (1). However, in real cases you may need to balance the steps before and after the "sudden convergence" yourself. The training after the "sudden convergence" is also important.

But usually, if your logical batch size is already bigger than 256, extending it further is not very meaningful. In that case, perhaps a better idea is to train more steps. I tried some "common" logical batch sizes of 64, 96, and 128 (via gradient accumulation), and it seems that many complicated conditions can already be handled very well.
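
To make the arithmetic concrete, here is a hedged sketch of how a logical batch size of 96 could be reached with the Lightning setup above (the accumulation factor is illustrative):

```python
# With batch_size = 4 as in tutorial_train.py, accumulating 24 batches gives a
# logical batch size of 4 x 24 = 96, one of the "common" sizes mentioned above.
# Each optimizer step then costs 24 real forward/backward passes.
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger],
                     accumulate_grad_batches=24)
```
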