---
license: openrail
---
This repository contains a pruned and isolated pipeline for Stage 2 of [StreamingT2V](https://streamingt2v.github.io/), dubbed "VidXTend."
This model's primary purpose is extending 16-frame, 256x256 pixel animations by 8 frames at a time (one second at 8 fps).

If you use this model, please cite the original StreamingT2V paper:
```
@article{henschel2024streamingt2v,
  title={StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text},
  author={Henschel, Roberto and Khachatryan, Levon and Hayrapetyan, Daniil and Poghosyan, Hayk and Tadevosyan, Vahram and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
  journal={arXiv preprint arXiv:2403.14773},
  year={2024}
}
```
# Usage
## Installation
First, install the VidXTend package into your Python environment. If you're creating a new environment for VidXTend, be sure to install a version of torch built with CUDA support first; otherwise the pipeline will run on CPU only.
```sh
pip install git+https://github.com/painebenjamin/vidxtend.git
```
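For example, on a system with CUDA 12.1 you might install a GPU-enabled torch build before VidXTend. The wheel index URL below is an illustrative assumption; check pytorch.org for the command matching your CUDA version.
```sh
# Example only: install a CUDA-enabled torch build first
# (cu121 is illustrative; use the index matching your CUDA version).
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/painebenjamin/vidxtend.git
```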
## Command-Line
A command-line utility, `vidxtend`, is installed with the package; its help output and an example invocation are shown below.
```sh
Usage: vidxtend [OPTIONS] VIDEO PROMPT
Run VidXtend on a video file, concatenating the generated frames to the end
of the video.
Options:
-fps, --frame-rate INTEGER Video FPS. Will default to the input FPS.
-s, --seconds FLOAT The total number of seconds to add to the
video. Multiply this number by frame rate to
determine total number of new frames
generated. [default: 1.0]
-np, --negative-prompt TEXT Negative prompt for the diffusion process.
-cfg, --guidance-scale FLOAT Guidance scale for the diffusion process.
[default: 7.5]
-ns, --num-inference-steps INTEGER
Number of diffusion steps. [default: 50]
-r, --seed INTEGER Random seed.
-m, --model TEXT HuggingFace model name.
-nh, --no-half Do not use half precision.
-no, --no-offload Do not offload to the CPU to preserve GPU
memory.
-ns, --no-slicing Do not use VAE slicing.
-g, --gpu-id INTEGER GPU ID to use.
-sf, --model-single-file Download and use a single file instead of a
directory.
-cf, --config-file TEXT Config file to use when using the model-
single-file option. Accepts a path or a
filename in the same directory as the single
file. Will download from the repository
passed in the model option if not provided.
[default: config.json]
-mf, --model-filename TEXT The model file to download when using the
model-single-file option. [default:
vidxtend.safetensors]
-rs, --remote-subfolder TEXT Remote subfolder to download from when using
the model-single-file option.
-cd, --cache-dir DIRECTORY Cache directory to download to. Default uses
the huggingface cache.
-o, --output FILE Output file. [default: output.mp4]
-f, --fit [actual|cover|contain|stretch]
Image fit mode. [default: cover]
-a, --anchor [top-left|top-center|top-right|center-left|center-center|center-right|bottom-left|bottom-center|bottom-right]
Image anchor point. [default: top-left]
--help Show this message and exit.
```
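For example, a hypothetical invocation that adds two seconds to a clip (the file names and prompt are placeholders):
```sh
# Extend input.mp4 by two seconds and write the result to extended.mp4
vidxtend input.mp4 "a sailboat drifting across a calm bay" \
    --seconds 2.0 \
    --guidance-scale 7.5 \
    --output extended.mp4
```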
## Python
You can create the pipeline, automatically pulling the weights from this repository, either as individual models:
```py
import torch

from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_pretrained(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
```
Or, as a single file:
```py
import torch

from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_single_file(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
```
Use these methods to reduce memory usage and improve performance:
```py
# Offload idle sub-models to the CPU to conserve GPU memory
pipeline.enable_model_cpu_offload()
# Decode the VAE in slices to lower peak memory
pipeline.enable_vae_slicing()
# Use xformers memory-efficient attention
pipeline.set_use_memory_efficient_attention_xformers()
```
Usage is as follows:
```py
# Assume `prompt` is a text description of the video and `images` is the
# full list of PIL Images making up the video to extend.
new_frames = pipeline(
    prompt=prompt,
    negative_prompt=None,  # Optionally pass a negative prompt
    image=images[-8:],  # Condition on the final 8 frames of the video
    input_frames_conditioning=images[:1],  # Condition on the first frame of the video
    eta=1.0,
    guidance_scale=7.5,
    output_type="pil",
).frames[8:]  # Discard the first 8 output frames; they correspond to the 8 guide frames
```
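Putting it together, here is a minimal end-to-end sketch that extends a short GIF by one second and saves the result. The file names, prompt, and use of Pillow for frame I/O are illustrative assumptions rather than part of the VidXTend API.
```py
import torch
from PIL import Image, ImageSequence
from vidxtend import VidXTendPipeline

# Load the pipeline as shown above
pipeline = VidXTendPipeline.from_pretrained(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()

# Hypothetical input: a 16-frame, 256x256 GIF
images = [
    frame.convert("RGB")
    for frame in ImageSequence.Iterator(Image.open("clip.gif"))
]

prompt = "a sailboat drifting across a calm bay"  # example prompt

new_frames = pipeline(
    prompt=prompt,
    image=images[-8:],
    input_frames_conditioning=images[:1],
    eta=1.0,
    guidance_scale=7.5,
    output_type="pil",
).frames[8:]

# Append the new second of video and save (8 fps -> 125 ms per frame)
images += new_frames
images[0].save(
    "extended.gif",
    save_all=True,
    append_images=images[1:],
    duration=125,
    loop=0,
)
```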