Commit d6a2e05 by BestWishYsh · verified · 1 Parent(s): 60490d3

Update Diffusers API

Files changed (1)
  1. README.md +126 -3
README.md CHANGED
@@ -14,17 +14,140 @@ short_description: Identity-Preserving Text-to-Video Generation
  disable_embedding: false
  ---

  <h1 align="center"> <a href="https://pku-yuangroup.github.io/ConsisID">Identity-Preserving Text-to-Video Generation by Frequency Decomposition</a></h1>

  <h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest update. </h5>

  ## 😍 Gallery

- Identity-Preserving Text-to-Video Generation.
  [![Demo Video of ConsisID](https://github.com/user-attachments/assets/634248f6-1b54-4963-88d6-34fa7263750b)](https://www.youtube.com/watch?v=PhlgC-bI5SQ)
  or you can click <a href="https://github.com/SHYuanBest/shyuanbest_media/raw/refs/heads/main/ConsisID/showcase_videos.mp4">here</a> to watch the video.

- ## Space Description
  - **Repository:** [Code](https://github.com/PKU-YuanGroup/ConsisID), [Page](https://pku-yuangroup.github.io/ConsisID/), [Data](https://huggingface.co/datasets/BestWishYsh/ConsisID-preview-Data)
  - **Paper:** arxiv.org/abs/2411.17440
- - **Point of Contact:** [Shenghai Yuan]([email protected])

  disable_embedding: false
  ---

+ <div align=center>
+ <img src="https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/ConsisID_logo.png?raw=true" width="150px">
+ </div>
+
  <h1 align="center"> <a href="https://pku-yuangroup.github.io/ConsisID">Identity-Preserving Text-to-Video Generation by Frequency Decomposition</a></h1>

+ <p style="text-align: center;">
+ <a href="https://huggingface.co/spaces/BestWishYsh/ConsisID-preview-Space">🤗 Hugging Face Space</a> |
+ <a href="https://pku-yuangroup.github.io/ConsisID">📄 Page </a> |
+ <a href="https://github.com/PKU-YuanGroup/ConsisID">🌐 GitHub </a> |
+ <a href="https://arxiv.org/abs/2411.17440">📜 arXiv </a> |
+ <a href="https://huggingface.co/datasets/BestWishYsh/ConsisID-preview-Data">🐳 Dataset</a>
+ </p>
+ <p align="center">
  <h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest update. </h5>

+
  ## 😍 Gallery

+ Identity-Preserving Text-to-Video Generation. (Some of the best prompts are [here](https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/prompt.xlsx).)
  [![Demo Video of ConsisID](https://github.com/user-attachments/assets/634248f6-1b54-4963-88d6-34fa7263750b)](https://www.youtube.com/watch?v=PhlgC-bI5SQ)
  or you can click <a href="https://github.com/SHYuanBest/shyuanbest_media/raw/refs/heads/main/ConsisID/showcase_videos.mp4">here</a> to watch the video.

+ ## 🤗 Quick Start
+
+ This model can be deployed with the Hugging Face diffusers library by following the steps below.
+
+ **We recommend visiting our [GitHub](https://github.com/PKU-YuanGroup/ConsisID) repository for the relevant prompt
+ optimizations and conversions to get better results.**
+
+ 1. Install the required dependencies
+
+ ```shell
+ # ConsisID will be merged into diffusers in the next version, so for now install it from source.
+ pip install --upgrade consisid_eva_clip pyfacer insightface facexlib transformers accelerate imageio-ffmpeg
+ pip install git+https://github.com/SHYuanBest/ConsisID_diffusers.git
+ ```
+
+ 2. Run the code
+
+ ```python
+ import torch
+ from diffusers import ConsisIDPipeline
+ from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
+ from diffusers.utils import export_to_video
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
+ face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = (
+     prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
+ )
+ pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
+ pipe.to("cuda")
+
+ # ConsisID works well with long and well-described prompts. Make sure the face in the image is clearly visible (e.g., preferably half-body or full-body).
+ prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
+ image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/2.png?raw=true"
+
+ id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
+     face_helper_1,
+     face_clip_model,
+     face_helper_2,
+     eva_transform_mean,
+     eva_transform_std,
+     face_main_model,
+     "cuda",
+     torch.bfloat16,
+     image,
+     is_align_face=True,
+ )
+
+ video = pipe(
+     image=image,
+     prompt=prompt,
+     num_inference_steps=50,
+     guidance_scale=6.0,
+     use_dynamic_cfg=False,
+     id_vit_hidden=id_vit_hidden,
+     id_cond=id_cond,
+     kps_cond=face_kps,
+     generator=torch.Generator("cuda").manual_seed(42),
+ )
+ export_to_video(video.frames[0], "output.mp4", fps=8)
+ ```
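
If the script above fails at import time, it can help to confirm that the packages from step 1 are importable in the current environment. The following is a minimal sketch, not part of ConsisID or diffusers: `find_missing` and `REQUIRED_MODULES` are hypothetical names, and the module names (for example `facer` for `pyfacer` and `imageio_ffmpeg` for `imageio-ffmpeg`) are assumptions about how these packages are usually imported.

```python
import importlib.util

# Assumed mapping from pip package name (step 1) to importable module name.
REQUIRED_MODULES = {
    "consisid_eva_clip": "consisid_eva_clip",
    "pyfacer": "facer",
    "insightface": "insightface",
    "facexlib": "facexlib",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "imageio-ffmpeg": "imageio_ffmpeg",
    "diffusers (installed from source)": "diffusers",
}


def find_missing(modules):
    """Return the pip names whose module cannot be located in this environment."""
    return [
        pip_name
        for pip_name, module_name in modules.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check only confirms that each module can be found; it does not verify versions or that the source build of diffusers actually contains `ConsisIDPipeline`.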
+
+ ## 🛠️ Prompt Refiner
+
+ ConsisID is sensitive to prompt quality. You can use [GPT-4o](https://chatgpt.com/) to refine the input text prompt; an example follows (original prompt: "a man is playing guitar."):
+ ```text
+ a man is playing guitar.
+
+ Change the sentence above to something like this (add some facial changes, even if they are minor. Don't make the sentence too long):
+
+ The video features a man standing next to an airplane, engaged in a conversation on his cell phone. He is wearing sunglasses and a black top, and he appears to be talking seriously. The airplane has a green stripe running along its side, and there is a large engine visible behind him. The man seems to be standing near the entrance of the airplane, possibly preparing to board or just having disembarked. The setting suggests that he might be at an airport or a private airfield. The overall atmosphere of the video is professional and focused, with the man's attire and the presence of the airplane indicating a business or travel context.
+ ```
+
+ Some sample prompts are available [here](https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/prompt.xlsx).
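
The refinement request above has three parts: the original prompt, the instruction sentence, and a style example. A minimal sketch of assembling it programmatically is shown below. `build_refiner_request` is a hypothetical helper rather than part of ConsisID, and the style example is shortened here to the first sentence of the airplane paragraph; in practice the full paragraph would be passed in.

```python
# Instruction sentence copied from the template in this section.
INSTRUCTION = (
    "Change the sentence above to something like this "
    "(add some facial changes, even if they are minor. "
    "Don't make the sentence too long):"
)


def build_refiner_request(original_prompt: str, style_example: str) -> str:
    """Join prompt, instruction, and style example with blank lines between them."""
    parts = [original_prompt.strip(), INSTRUCTION, style_example.strip()]
    return "\n\n".join(parts)


request = build_refiner_request(
    "a man is playing guitar.",
    # Shortened style example (assumption: the full paragraph is used in practice).
    "The video features a man standing next to an airplane, "
    "engaged in a conversation on his cell phone.",
)
print(request)
```

The resulting text can be pasted into a chat model as is; sending it through an API is left out because the README does not specify one.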
+
+ ### 💡 GPU Memory Optimization
+
+ ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) at an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or a free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. To replicate these measurements, refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script.
+
+ | Feature (applied on top of the previous rows) | Max Memory Allocated | Max Memory Reserved |
+ | :-------------------------------------------- | :------------------- | :------------------ |
+ | -                                             | 37 GB                | 44 GB               |
+ | enable_model_cpu_offload                      | 22 GB                | 25 GB               |
+ | enable_sequential_cpu_offload                 | 16 GB                | 22 GB               |
+ | vae.enable_slicing                            | 16 GB                | 22 GB               |
+ | vae.enable_tiling                             | 5 GB                 | 7 GB                |
+
+ ```python
+ # Turn these on if you don't have multiple GPUs or a GPU with enough memory (such as an H100).
+ pipe.enable_model_cpu_offload()
+ pipe.enable_sequential_cpu_offload()
+ pipe.vae.enable_slicing()
+ pipe.vae.enable_tiling()
+ ```
+
+ Warning: these optimizations increase inference time and may also reduce output quality.
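
Because each row of the table builds on the rows above it, the smallest set of optimizations for a given memory budget can be read off from the "Max Memory Reserved" column. The sketch below encodes that lookup. `pick_optimizations` and `STEPS` are hypothetical names, and the figures are the ones reported in the table for 49 frames at 720x480, so other settings or hardware may behave differently.

```python
# (optimization name, reserved GB once it and all earlier rows are enabled),
# copied from the "Max Memory Reserved" column of the table above.
STEPS = [
    (None, 44),  # baseline, nothing enabled
    ("enable_model_cpu_offload", 25),
    ("enable_sequential_cpu_offload", 22),
    ("vae.enable_slicing", 22),
    ("vae.enable_tiling", 7),
]


def pick_optimizations(budget_gb: float) -> list[str]:
    """Return the shortest cumulative list of optimizations that fits the budget."""
    enabled = []
    for name, reserved_gb in STEPS:
        if name is not None:
            enabled.append(name)
        if reserved_gb <= budget_gb:
            return enabled
    raise ValueError(f"no configuration in the table fits within {budget_gb} GB")


print(pick_optimizations(80))  # nothing needed
print(pick_optimizations(24))  # model and sequential CPU offload
print(pick_optimizations(16))  # all four optimizations
```

Since the warning above notes that these options cost speed and possibly quality, enabling only what the budget requires is one reasonable policy; leaving some headroom below the card's nominal capacity is advisable.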
+
+ ## 🙌 Description
+
  - **Repository:** [Code](https://github.com/PKU-YuanGroup/ConsisID), [Page](https://pku-yuangroup.github.io/ConsisID/), [Data](https://huggingface.co/datasets/BestWishYsh/ConsisID-preview-Data)
  - **Paper:** arxiv.org/abs/2411.17440
+ - **Point of Contact:** [Shenghai Yuan]([email protected])
+
+ ## ✏️ Citation
+ If you find our paper and code useful in your research, please consider giving us a star and citing the paper.
+
+ ```BibTeX
+ @article{yuan2024identity,
+   title={Identity-Preserving Text-to-Video Generation by Frequency Decomposition},
+   author={Yuan, Shenghai and Huang, Jinfa and He, Xianyi and Ge, Yunyuan and Shi, Yujun and Chen, Liuhan and Luo, Jiebo and Yuan, Li},
+   journal={arXiv preprint arXiv:2411.17440},
+   year={2024}
+ }
+ ```