# Q8 LTX-Video optimized for Ada
This repository shows how to use the Q8 kernels from KONAKONA666/q8_kernels with `diffusers` to optimize inference of LTX-Video on Ada GPUs. Inference time goes from 16.192 secs to 9.572 secs while memory drops from ~7 GB to ~5 GB, without quality loss 🤪 With `torch.compile()`, the time reduces further to 6.747 secs 🔥
The Q8 transformer checkpoint is available here: sayakpaul/q8-ltx-video.
## Getting started
Install the dependencies:
```bash
pip install -U transformers accelerate
git clone https://github.com/huggingface/diffusers && cd diffusers && pip install -e . && cd ..
```
Then install `q8_kernels`, following the instructions from here.
To run inference with the Q8 kernels, we need some minor changes in `diffusers`. Apply this patch to take those into account:
```bash
git apply updates.patch
```
Now we can run inference:
```bash
python inference.py \
  --prompt="A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage" \
  --negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted" \
  --q8_transformer_path="sayakpaul/q8-ltx-video"
```
## Why does this repo exist?
KONAKONA666/LTX-Video already exists, so why this repo? That repo uses custom implementations of the LTX-Video pipeline components, which can be hard to use directly with `diffusers`. This repo applies the kernels from `q8_kernels` to the components that come straight from `diffusers`.
### More details
We do this by first converting the state dict of the original LTX-Video transformer. This includes FP8 quantization. Before the converted state dict is loaded into the model, the process also requires replacing:

- the linear layers of the model
- the RMSNorms of the model
- the GELUs of the model

Some layer params are kept in FP32 and some layers are not quantized at all. The replacement utilities are in `q8_ltx.py`.
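As an illustration (not the actual code in `q8_ltx.py`), such a module-replacement pass could look like the sketch below; the `Q8Linear`, `Q8RMSNorm`, and `Q8GELU` classes, their constructors, and the skip list are hypothetical stand-ins for the real replacement utilities:

```python
import torch.nn as nn

# Hypothetical imports standing in for the actual replacement modules in q8_ltx.py.
from q8_ltx import Q8Linear, Q8RMSNorm, Q8GELU


def replace_modules(model: nn.Module, skip_substrings=("proj_out",)) -> nn.Module:
    """Recursively swap Linear/RMSNorm/GELU submodules for Q8 counterparts.

    Modules whose name contains one of `skip_substrings` are left untouched,
    mirroring the fact that some layers are intentionally not quantized.
    The substring list here is illustrative only. Note that the diffusers
    model may use its own RMSNorm class rather than nn.RMSNorm.
    """
    for name, child in model.named_children():
        if any(s in name for s in skip_substrings):
            continue
        if isinstance(child, nn.Linear):
            setattr(model, name, Q8Linear.from_linear(child))
        elif isinstance(child, nn.RMSNorm):
            setattr(model, name, Q8RMSNorm.from_norm(child))
        elif isinstance(child, nn.GELU):
            setattr(model, name, Q8GELU())
        else:
            replace_modules(child, skip_substrings)
    return model
```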
The model can then be serialized. The conversion and serialization steps are implemented in `conversion_utils.py`.
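As a rough sketch of the conversion step (the authoritative version is `conversion_utils.py`), casting the quantizable weights to FP8 while keeping norm parameters in FP32 and then serializing could look like this; the key-matching heuristic and the bare dtype cast (the real scheme also involves scaling) are assumptions:

```python
import torch
from safetensors.torch import save_file


def convert_state_dict(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Cast weights of quantizable layers to FP8; keep norms/biases in FP32.

    The "norm"/"bias" heuristic and the plain dtype cast are illustrative;
    conversion_utils.py defines the exact keys and quantization scheme.
    """
    converted = {}
    for key, value in state_dict.items():
        if "norm" in key or key.endswith(".bias"):
            converted[key] = value.float()
        else:
            converted[key] = value.to(torch.float8_e4m3fn)
    return converted


# converted = convert_state_dict(transformer.state_dict())
# save_file(converted, "q8-transformer.safetensors")
```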
When loading the model and using it for inference, we:

- initialize the transformer model on the "meta" device
- apply the same layer replacement scheme as detailed above
- populate the converted state dict
- replace the attention processors with the flash-attention implementation from `q8_kernels`, which provides a further speedup

Refer here for more details.
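A minimal sketch of that loading flow is shown below. It assumes the `replace_modules` helper sketched earlier, a hypothetical `Q8FlashAttnProcessor` class, and an illustrative checkpoint filename; the actual logic lives in `inference.py` and `q8_ltx.py`:

```python
import torch
from diffusers import LTXVideoTransformer3DModel
from safetensors.torch import load_file

# from q8_ltx import Q8FlashAttnProcessor, replace_modules  # hypothetical names

# 1. Initialize the transformer on the "meta" device, so no real memory is allocated yet.
config = LTXVideoTransformer3DModel.load_config("Lightricks/LTX-Video", subfolder="transformer")
with torch.device("meta"):
    transformer = LTXVideoTransformer3DModel.from_config(config)

# 2. Apply the same Linear/RMSNorm/GELU replacements used during conversion.
transformer = replace_modules(transformer)

# 3. Populate the converted (FP8) state dict; `assign=True` takes the stored tensors as-is
#    instead of copying them into the meta-initialized parameters.
state_dict = load_file("q8-transformer.safetensors")
transformer.load_state_dict(state_dict, assign=True)

# 4. Swap the attention processors for the flash-attention one built on q8_kernels.
transformer.set_attn_processor(Q8FlashAttnProcessor())

# transformer.to("cuda")  # move to the GPU before wiring it into the pipeline
```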
## Performance
The numbers below were obtained with `max_sequence_length=512`, `num_inference_steps=50`, `num_frames=81`, and a resolution of 480x704. The rest of the arguments were kept at their default values from the LTX-Video pipeline call signature. The numbers also don't include the VAE decoding time, to focus solely on the transformer.
| | Time (secs) | Memory (MB) |
|---|---|---|
| Non Q8 | 16.192 | 7172.86 |
| Non Q8 (+ compile) | 16.205 | - |
| Q8 | 9.572 | 5413.51 |
| Q8 (+ compile) | 6.747 | - |
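For the "+ compile" rows, `torch.compile()` is applied to the transformer only; a typical way to do this with a loaded pipeline looks like the following (the compile settings shown are illustrative, not necessarily the ones used for the numbers above):

```python
import torch

# `pipe` is an already-loaded LTX-Video pipeline; compile just the transformer.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

# The first call pays the compilation cost; subsequent calls reuse the compiled graph.
```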
The benchmarking script is available in `benchmark.py`. You will need to download the precomputed prompt embeddings from here before running the benchmark.
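If you want to see the measurement pattern without reading the script, the timing and memory bookkeeping looks roughly like this (the warmup/iteration counts are illustrative, and `denoise` stands in for whatever callable you time, e.g. the transformer forward or the full denoising loop):

```python
import torch


def benchmark(denoise, warmup=3, iters=10):
    """Time a CUDA workload with events; report per-iteration latency and peak memory."""
    for _ in range(warmup):
        denoise()
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        denoise()
    end.record()
    torch.cuda.synchronize()

    time_s = start.elapsed_time(end) / 1000 / iters            # ms -> s, per iteration
    mem_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)   # bytes -> MB
    return time_s, mem_mb
```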
## Env
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   46C    P8             18W /  450W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```
`diffusers-cli env`:

```
- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.39
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.27.0
- Transformers version: 4.47.1
- Accelerate version: 1.2.1
- PEFT version: 0.13.2
- Bitsandbytes version: 0.44.1
- Safetensors version: 0.4.4
- xFormers version: 0.0.29.post1
- Accelerator: NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
```
The RoPE implementation from `q8_kernels` isn't usable as of January 1, 2025, so we resort to using the one from `diffusers`.
## Comparison
Check out this page on Weights & Biases for some comparative results. Generated videos are also available here.
## Acknowledgement
KONAKONA666's work on KONAKONA666/q8_kernels and KONAKONA666/LTX-Video.