ckczzj
committed
Commit · 3a6a5fa · 1 Parent(s): b104b34
Update README.md
README.md CHANGED
@@ -21,7 +21,7 @@ This repo contains PyTorch model definitions, pre-trained weights and inference/
-##
+## News!!
 * Jan 13, 2025: We release the [Penguin Video Benchmark](https://github.com/Tencent/HunyuanVideo/blob/main/assets/PenguinVideoBenchmark.csv).
 * Dec 18, 2024: We release the [FP8 model weights](https://huggingface.co/tencent/HunyuanVideo/blob/main/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt) of HunyuanVideo to save more GPU memory.
@@ -31,7 +31,7 @@ This repo contains PyTorch model definitions, pre-trained weights and inference/
-##
+## Open-source Plan
 - HunyuanVideo (Text-to-Video Model)
   - [x] Inference
@@ -52,34 +52,31 @@ This repo contains PyTorch model definitions, pre-trained weights and inference/
 ## Contents
 - [HunyuanVideo: A Systematic Framework For Large Video Generation Model](#hunyuanvideo-a-systematic-framework-for-large-video-generation-model)
-  - [
-  - [
-  - [Community Contributions](#-community-contributions)
-  - [Open-source Plan](#-open-source-plan)
+  - [News!!](#news)
+  - [Open-source Plan](#open-source-plan)
   - [Contents](#contents)
   - [**Abstract**](#abstract)
   - [**HunyuanVideo Overall Architecture**](#hunyuanvideo-overall-architecture)
-  - [
+  - [**HunyuanVideo Key Features**](#hunyuanvideo-key-features)
   - [**Unified Image and Video Generative Architecture**](#unified-image-and-video-generative-architecture)
   - [**MLLM Text Encoder**](#mllm-text-encoder)
   - [**3D VAE**](#3d-vae)
   - [**Prompt Rewrite**](#prompt-rewrite)
-  - [
-  - [
-  - [
+  - [Comparisons](#comparisons)
+  - [Requirements](#requirements)
+  - [Dependencies and Installation](#dependencies-and-installation)
   - [Installation Guide for Linux](#installation-guide-for-linux)
-  - [
-  - [
+  - [Download Pretrained Models](#download-pretrained-models)
+  - [Single-gpu Inference](#single-gpu-inference)
   - [Using Command Line](#using-command-line)
   - [Run a Gradio Server](#run-a-gradio-server)
   - [More Configurations](#more-configurations)
-  - [
+  - [Parallel Inference on Multiple GPUs by xDiT](#parallel-inference-on-multiple-gpus-by-xdit)
   - [Using Command Line](#using-command-line-1)
-  - [
+  - [FP8 Inference](#fp8-inference)
   - [Using Command Line](#using-command-line-2)
-  - [
+  - [BibTeX](#bibtex)
   - [Acknowledgements](#acknowledgements)
-  - [Star History](#star-history)
 ---
@@ -105,7 +102,7 @@ the 3D VAE decoder.
-##
+## **HunyuanVideo Key Features**
 ### **Unified Image and Video Generative Architecture**
@@ -151,7 +148,7 @@ The Prompt Rewrite Model can be directly deployed and inferred using the [Hunyua
-##
+## Comparisons
 To evaluate the performance of HunyuanVideo, we selected five strong baselines from closed-source video generation models. In total, we utilized 1,533 text prompts, generating an equal number of video samples with HunyuanVideo in a single run. For a fair comparison, we conducted inference only once, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models, ensuring consistent video resolution. Videos were assessed on three criteria: Text Alignment, Motion Quality, and Visual Quality. More than 60 professional evaluators performed the evaluation. Notably, HunyuanVideo demonstrated the best overall performance, particularly excelling in motion quality. Note that the evaluation is based on HunyuanVideo's high-quality version, which differs from the currently released fast version.
@@ -187,7 +184,7 @@ To evaluate the performance of HunyuanVideo, we selected five strong baselines f
-##
+## Requirements
 The following table shows the requirements for running the HunyuanVideo model (batch size = 1) to generate videos:
@@ -204,7 +201,7 @@ The following table shows the requirements for running HunyuanVideo model (batch
-##
+## Dependencies and Installation
 Begin by cloning the repository:
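For orientation, the installation flow this section introduces can be sketched as follows; the conda environment name, Python version, and `requirements.txt` path are assumptions rather than values taken from this excerpt:

```bash
# Clone the repository referenced throughout this README
git clone https://github.com/Tencent/HunyuanVideo.git
cd HunyuanVideo

# Create an isolated environment and install the Python dependencies
# (environment name, Python version, and requirements file are assumed here;
# follow the Installation Guide for Linux for the exact, supported steps)
conda create -n HunyuanVideo python=3.10 -y
conda activate HunyuanVideo
pip install -r requirements.txt
```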
@@ -273,13 +270,14 @@ docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyua
 ```
-
+
+## Download Pretrained Models
 Details on downloading the pretrained models are shown [here](ckpts/README.md).
-##
+## Single-gpu Inference
 We list the height/width/frame settings we support in the following table.
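A minimal end-to-end sketch of the two steps above, a checkpoint download followed by a single-GPU run; the `--local-dir` layout and the sampling flags are illustrative assumptions, with `ckpts/README.md` and the More Configurations section as the authoritative references:

```bash
# Fetch the pretrained checkpoints into ./ckpts (assumed layout; see ckpts/README.md)
huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts

# Single-GPU text-to-video sampling with the sample_video.py entry point;
# resolution, length, and prompt below are example values only
python3 sample_video.py \
    --video-size 720 1280 \
    --video-length 129 \
    --prompt "A cat walks on the grass, realistic style." \
    --save-path ./results
```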
@@ -331,7 +329,7 @@ We list some more useful configurations for easy usage:
-##
+## Parallel Inference on Multiple GPUs by xDiT
 [xDiT](https://github.com/xdit-project/xDiT) is a scalable inference engine for Diffusion Transformers (DiTs) on multi-GPU clusters.
 It has successfully provided low-latency parallel inference solutions for a variety of DiT models, including mochi-1, CogVideoX, Flux.1, SD3, etc. This repo adopts the [Unified Sequence Parallelism (USP)](https://arxiv.org/abs/2405.07719) APIs for parallel inference of the HunyuanVideo model.
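As a rough illustration of the xDiT-based multi-GPU path, assuming a torchrun launch and eight GPUs; only `--ulysses-degree` and `--ring-degree` are taken from this README, and the remaining flags mirror the single-GPU example above:

```bash
# Multi-GPU parallel inference via xDiT / Unified Sequence Parallelism (assumed torchrun launcher);
# the product of --ulysses-degree and --ring-degree typically matches the number of GPUs
torchrun --nproc_per_node=8 sample_video.py \
    --video-size 720 1280 \
    --video-length 129 \
    --prompt "A cat walks on the grass, realistic style." \
    --ulysses-degree 4 \
    --ring-degree 2 \
    --save-path ./results
```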
@@ -416,7 +414,7 @@ You can change the `--ulysses-degree` and `--ring-degree` to control the parallel
-##
+## FP8 Inference
 HunyuanVideo can be run with FP8 quantized weights, which saves about 10 GB of GPU memory. You can download the [weights](https://huggingface.co/tencent/HunyuanVideo/blob/main/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt) and [weight scales](https://huggingface.co/tencent/HunyuanVideo/blob/main/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8_map.pt) from Hugging Face.
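A hedged sketch of running with the FP8 checkpoint linked above, assuming it has been downloaded under `ckpts/`; the `--dit-weight` and `--use-fp8` flags are assumptions, so check the full FP8 Inference section for the exact options:

```bash
# Inference with the FP8-quantized transformer weights (file path from the Hugging Face links above);
# --dit-weight and --use-fp8 are assumed flag names, not copied from this excerpt
python3 sample_video.py \
    --dit-weight ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt \
    --use-fp8 \
    --video-size 720 1280 \
    --video-length 129 \
    --prompt "A cat walks on the grass, realistic style." \
    --save-path ./results
```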
@@ -446,7 +444,7 @@ python3 sample_video.py \
-##
+## BibTeX
 If you find [HunyuanVideo](https://arxiv.org/abs/2412.03603) useful for your research and applications, please cite using this BibTeX: