sayakpaul committed on
Commit 9d30a7a · verified · 1 parent(s): 8cf98bd

Update README.md

Files changed (1)
  1. README.md +12 -132
README.md CHANGED
@@ -1,132 +1,12 @@
- # Q8 LTX-Video optimized for Ada
-
- This repository shows how to use the Q8 kernels from [`KONAKONA666/q8_kernels`](https://github.com/KONAKONA666/q8_kernels) with `diffusers` to optimize inference of [LTX-Video](https://huggingface.co/Lightricks/LTX-Video) on Ada GPUs. Inference time drops from 16.192 seconds to 9.572 seconds while memory drops from roughly 7GB to 5.4GB, with no quality loss 🤪 With `torch.compile()`, the time drops further to 6.747 seconds 🔥
-
- The Q8 transformer checkpoint is available here: [`sayakpaul/q8-ltx-video`](https://hf.co/sayakpaul/q8-ltx-video).
-
- ## Getting started
-
- Install the dependencies:
-
- ```bash
- pip install -U transformers accelerate
- git clone https://github.com/huggingface/diffusers && cd diffusers && pip install -e . && cd ..
- ```
-
- Then install `q8_kernels`, following the instructions from [here](https://github.com/KONAKONA666/q8_kernels/?tab=readme-ov-file#installation).
-
- To run inference with the Q8 kernels, we need some minor changes in `diffusers`. Apply [this patch](https://github.com/sayakpaul/q8-ltx-video/blob/368f549ca5136daf89049c9efe32748e73aca317/updates.patch) to take them into account:
-
- ```bash
- git apply updates.patch
- ```
-
- Now we can run inference:
-
- ```bash
- python inference.py \
- --prompt="A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage" \
- --negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted" \
- --q8_transformer_path="sayakpaul/q8-ltx-video"
- ```
-
- ## Why does this repo exist?
-
- There already exists [`KONAKONA666/LTX-Video`](https://github.com/KONAKONA666/LTX-Video). Then why this repo?
-
- That repo uses custom implementations of the LTX-Video pipeline components, which makes it hard to use directly with `diffusers`. This repo instead applies the kernels from `q8_kernels` to the components that come directly from `diffusers`.
-
- <details>
- <summary>More details</summary>
-
- We do this by first converting the state dict of the original [LTX-Video transformer](https://huggingface.co/Lightricks/LTX-Video/tree/main/transformer). This includes FP8 quantization. Before the converted state dict is loaded into the model, the process also requires replacing (see the sketch below):
-
- * the linear layers of the model
- * the RMSNorms of the model
- * the GELUs of the model
-
- Some layer params are kept in FP32 and some layers are not quantized at all. The replacement utilities are in [`q8_ltx.py`](./q8_ltx.py).
-
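The real replacement utilities live in [`q8_ltx.py`](./q8_ltx.py); the snippet below is only a minimal sketch of the recursive module-swapping pattern, with a placeholder replacement instead of the actual Q8 linear/RMSNorm/GELU modules from `q8_kernels`:

```python
# Illustrative only: recursively swap child modules by type before loading a converted state dict.
# The replacement factory used below is a placeholder, not the repo's Q8 modules.
from typing import Callable, Dict, Type

import torch.nn as nn


def swap_modules(root: nn.Module, replacements: Dict[Type[nn.Module], Callable[[nn.Module], nn.Module]]) -> nn.Module:
    for name, child in root.named_children():
        for layer_type, factory in replacements.items():
            if isinstance(child, layer_type):
                setattr(root, name, factory(child))
                break
        else:
            swap_modules(child, replacements)  # recurse only if nothing matched at this level
    return root


if __name__ == "__main__":
    model = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 4))
    # Placeholder replacement: swap every GELU for its tanh approximation.
    swap_modules(model, {nn.GELU: lambda _: nn.GELU(approximate="tanh")})
    print(model)
```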
- The model can then be serialized. The conversion and serialization are coded in [`conversion_utils.py`](./conversion_utils.py).
-
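For intuition, FP8 quantization of a weight boils down to choosing a scale and casting; the snippet below shows a generic per-tensor version and is not the exact recipe implemented in [`conversion_utils.py`](./conversion_utils.py):

```python
# Generic per-tensor FP8 (e4m3) quantization sketch; requires PyTorch >= 2.1 for torch.float8_e4m3fn.
import torch


def quantize_fp8(weight: torch.Tensor):
    # Map the largest magnitude onto the e4m3 maximum (448.0), then cast down.
    scale = weight.abs().max().clamp(min=1e-12) / 448.0
    quantized = (weight / scale).to(torch.float8_e4m3fn)
    return quantized, scale  # the scale stays in higher precision for use at kernel time


if __name__ == "__main__":
    q, s = quantize_fp8(torch.randn(128, 64))
    print(q.dtype, s.item())
```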
- When loading the model and using it for inference, we:
-
- * initialize the transformer model on the "meta" device
- * follow the same layer replacement scheme as detailed above
- * populate the converted state dict
- * replace the attention processors with [the flash attention implementation](https://github.com/KONAKONA666/q8_kernels/blob/9cee3f3d4ca5ec8ab463179be32c8001e31f8f33/q8_kernels/functional/flash_attention.py) from `q8_kernels`
-
- Refer [here](https://github.com/sayakpaul/q8-ltx-video/blob/368f549ca5136daf89049c9efe32748e73aca317/inference.py#L48) for more details; a rough loading sketch also follows below. Additionally, we leverage the [flash-attention-based attention processors](https://github.com/sayakpaul/q8-ltx-video/blob/368f549ca5136daf89049c9efe32748e73aca317/q8_attention_processors.py#L44) built on `q8_kernels`, which provide a further speedup.
-
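A rough sketch of that loading flow, assuming `diffusers`' `LTXVideoTransformer3DModel` and a locally saved converted state dict; the layer replacements and attention-processor swap are elided, and the real logic is in `inference.py`:

```python
# Rough sketch only: meta-device init + state-dict materialization, mirroring the steps listed above.
import torch
from diffusers import LTXVideoTransformer3DModel

# 1. Build the architecture on the "meta" device so no real memory is allocated yet.
config = LTXVideoTransformer3DModel.load_config("Lightricks/LTX-Video", subfolder="transformer")
with torch.device("meta"):
    transformer = LTXVideoTransformer3DModel.from_config(config)

# 2. (Elided) apply the same layer replacement scheme used during conversion.

# 3. Materialize the converted state dict onto the meta-initialized module.
state_dict = torch.load("q8_transformer_state_dict.pt")  # placeholder path, not the repo's filename
transformer.load_state_dict(state_dict, assign=True)

# 4. (Elided) swap in the q8_kernels-based attention processors, e.g. via transformer.set_attn_processor(...).
```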
- </details>
-
-
- ## Performance
-
- The numbers below were obtained with `max_sequence_length=512`, `num_inference_steps=50`, `num_frames=81`, and `resolution=480x704`. The rest of the arguments were kept at their default values, as seen in the [pipeline call signature of LTX-Video](https://github.com/huggingface/diffusers/blob/4b9f1c7d8c2e476eed38af3144b79105a5efcd93/src/diffusers/pipelines/ltx/pipeline_ltx.py#L496). The numbers don't include VAE decoding time, so they focus solely on the transformer.
-
-
- | **Variant** | **Time (secs)** | **Memory (MB)** |
- |:-----------:|:-----------:|:-----------:|
- | Non-Q8 | 16.192 | 7172.86 |
- | Non-Q8 (+ compile) | 16.205 | - |
- | Q8 | 9.572 | 5413.51 |
- | Q8 (+ compile) | 6.747 | - |
-
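The "+ compile" rows presumably correspond to compiling just the transformer before timing it; a minimal sketch of that setup (the exact configuration lives in [`benchmark.py`](./benchmark.py)):

```python
# Minimal sketch: compile only the transformer, which is the component being benchmarked.
import torch
from diffusers import LTXPipeline

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")
pipe.transformer = torch.compile(pipe.transformer)
```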
- The benchmarking script is available in [`benchmark.py`](./benchmark.py). You will need to download the precomputed prompt embeddings from [here](https://huggingface.co/sayakpaul/q8-ltx-video/blob/main/prompt_embeds.pt) before running the benchmark.
-
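For reference, numbers like these are typically collected with CUDA events and peak-memory stats. The helper below is a generic sketch of that pattern, not a copy of `benchmark.py`:

```python
# Generic GPU timing/memory helper (illustrative; benchmark.py is the source of truth for the table above).
import torch


def benchmark(fn, warmup: int = 2, iters: int = 5) -> float:
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    print(f"peak memory: {torch.cuda.max_memory_allocated() / 1024**2:.2f} MB")
    return start.elapsed_time(end) / (1000 * iters)  # average seconds per call


if __name__ == "__main__" and torch.cuda.is_available():
    layer = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)
    x = torch.randn(64, 4096, device="cuda", dtype=torch.bfloat16)
    print(f"{benchmark(lambda: layer(x)):.6f} s/iter")
```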
-
- <details>
- <summary>Env</summary>
-
- ```bash
- +-----------------------------------------------------------------------------------------+
- | NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
- |-----------------------------------------+------------------------+----------------------+
- | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
- | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
- | | | MIG M. |
- |=========================================+========================+======================|
- | 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 Off | Off |
- | 0% 46C P8 18W / 450W | 2MiB / 24564MiB | 0% Default |
- | | | N/A |
- +-----------------------------------------+------------------------+----------------------+
- ```
-
- `diffusers-cli env`:
-
- ```bash
- - 🤗 Diffusers version: 0.33.0.dev0
- - Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.39
- - Running on Google Colab?: No
- - Python version: 3.10.12
- - PyTorch version (GPU?): 2.5.1+cu124 (True)
- - Flax version (CPU?/GPU?/TPU?): not installed (NA)
- - Jax version: not installed
- - JaxLib version: not installed
- - Huggingface_hub version: 0.27.0
- - Transformers version: 4.47.1
- - Accelerate version: 1.2.1
- - PEFT version: 0.13.2
- - Bitsandbytes version: 0.44.1
- - Safetensors version: 0.4.4
- - xFormers version: 0.0.29.post1
- - Accelerator: NVIDIA GeForce RTX 4090, 24564 MiB
-   NVIDIA GeForce RTX 4090, 24564 MiB
- - Using GPU in script?: <fill in>
- - Using distributed or parallel set-up in script?: <fill in>
- ```
-
- </details>
-
- > [!NOTE]
- > The RoPE implementation from `q8_kernels` [isn't usable as of 1st Jan 2025](https://github.com/KONAKONA666/q8_kernels/blob/9cee3f3d4ca5ec8ab463179be32c8001e31f8f33/q8_kernels/functional/rope.py#L26). So, we resort to using [the one](https://github.com/huggingface/diffusers/blob/91008aabc4b8dbd96a356ab6f457f3bd84b10e8b/src/diffusers/models/transformers/transformer_ltx.py#L464) from `diffusers`.
-
- ## Comparison
-
- Check out [this page](https://wandb.ai/sayakpaul/q8-ltx-video/runs/89h6ac5) on Weights and Biases, which provides some comparative results. Generated videos are also available [here](./videos/).
-
- ## Acknowledgement
-
- Thanks to KONAKONA666 for the work on [`KONAKONA666/q8_kernels`](https://github.com/KONAKONA666/q8_kernels) and [`KONAKONA666/LTX-Video`](https://github.com/KONAKONA666/LTX-Video).
-
 
+ ---
+ title: Q8-LTX-Video-Playground
+ emoji: 🧨
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio # Specify the SDK, e.g., gradio or streamlit
+ sdk_version: "4.44.1" # Specify the SDK version if needed
+ app_file: app.py
+ pinned: false # Set to true if you want to pin this Space
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
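The config above points at `app.py` as the Space entry point; a minimal placeholder of what such a Gradio app could look like (purely illustrative — the real Space presumably wraps the Q8 LTX-Video pipeline):

```python
# app.py — minimal placeholder consistent with the Space config above (sdk: gradio, app_file: app.py).
import gradio as gr


def generate(prompt: str) -> str:
    # The real app would run the Q8 LTX-Video pipeline here and return a video file path.
    return f"Would generate a video for: {prompt!r}"


demo = gr.Interface(
    fn=generate,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Textbox(label="Status"),
    title="Q8 LTX-Video Playground",
)

if __name__ == "__main__":
    demo.launch()
```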