---
license: openrail
---

This repository contains a pruned and isolated pipeline for Stage 2 of [StreamingT2V](https://streamingt2v.github.io/), dubbed "VidXTend."

This model's primary purpose is extending 16-frame, 256×256 pixel animations by 8 frames at a time (one second at 8 fps).

```
@article{henschel2024streamingt2v,
  title={StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text},
  author={Henschel, Roberto and Khachatryan, Levon and Hayrapetyan, Daniil and Poghosyan, Hayk and Tadevosyan, Vahram and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
  journal={arXiv preprint arXiv:2403.14773},
  year={2024}
}
```


# Usage

## Installation

First, install the VidXTend package into your Python environment. If you're creating a new environment for VidXTend, be sure to also install the version of torch you want with CUDA support; otherwise, the pipeline will run on the CPU only.

```sh
pip install git+https://github.com/painebenjamin/vidxtend.git
```
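
For example, on a fresh environment you might install a CUDA-enabled build of torch first (the `cu121` index below is illustrative; match it to your CUDA version):

```sh
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/painebenjamin/vidxtend.git
```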

## Command-Line

A command-line utility `vidxtend` is installed with the package.

```sh
Usage: vidxtend [OPTIONS] VIDEO PROMPT

  Run VidXtend on a video file, concatenating the generated frames to the end
  of the video.

Options:
  -fps, --frame-rate INTEGER      Video FPS. Will default to the input FPS.
  -s, --seconds FLOAT             The total number of seconds to add to the
                                  video. Multiply this number by frame rate to
                                  determine total number of new frames
                                  generated.  [default: 1.0]
  -np, --negative-prompt TEXT     Negative prompt for the diffusion process.
  -cfg, --guidance-scale FLOAT    Guidance scale for the diffusion process.
                                  [default: 7.5]
  -ns, --num-inference-steps INTEGER
                                  Number of diffusion steps.  [default: 50]
  -r, --seed INTEGER              Random seed.
  -m, --model TEXT                HuggingFace model name.
  -nh, --no-half                  Do not use half precision.
  -no, --no-offload               Do not offload to the CPU to preserve GPU
                                  memory.
  -ns, --no-slicing               Do not use VAE slicing.
  -g, --gpu-id INTEGER            GPU ID to use.
  -sf, --model-single-file        Download and use a single file instead of a
                                  directory.
  -cf, --config-file TEXT         Config file to use when using the model-
                                  single-file option. Accepts a path or a
                                  filename in the same directory as the single
                                  file. Will download from the repository
                                  passed in the model option if not provided.
                                  [default: config.json]
  -mf, --model-filename TEXT      The model file to download when using the
                                  model-single-file option.  [default:
                                  vidxtend.safetensors]
  -rs, --remote-subfolder TEXT    Remote subfolder to download from when using
                                  the model-single-file option.
  -cd, --cache-dir DIRECTORY      Cache directory to download to. Default uses
                                  the huggingface cache.
  -o, --output FILE               Output file.  [default: output.mp4]
  -f, --fit [actual|cover|contain|stretch]
                                  Image fit mode.  [default: cover]
  -a, --anchor [top-left|top-center|top-right|center-left|center-center|center-right|bottom-left|bottom-center|bottom-right]
                                  Image anchor point.  [default: top-left]
  --help                          Show this message and exit.
```
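
For example, the following hypothetical invocation extends a clip by two seconds using the options above (the file names and prompt are placeholders):

```sh
vidxtend input.mp4 "a rocket launching into space" --seconds 2.0 --seed 12345 --output extended.mp4
```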

## Python

You can create the pipeline, automatically pulling the weights from this repository, either as a directory of individual model components:

```py
import torch
from vidxtend import VidXTendPipeline
pipeline = VidXTendPipeline.from_pretrained(
  "benjamin-paine/vidxtend",
  torch_dtype=torch.float16,
  variant="fp16",
)
```

Or, as a single file:

```py
import torch
from vidxtend import VidXTendPipeline
pipeline = VidXTendPipeline.from_single_file(
  "benjamin-paine/vidxtend",
  torch_dtype=torch.float16,
  variant="fp16",
)
```

Optionally, enable these methods to reduce memory usage and improve performance (memory-efficient attention requires xformers to be installed):

```py
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
pipeline.set_use_memory_efficient_attention_xformers()
```

Usage is as follows:

```py
# Assume `images` is a list of PIL images and `prompt` is a string

new_frames = pipeline(
    prompt=prompt,
    negative_prompt=None, # Optionally use negative prompt
    image=images[-8:], # Use final 8 frames of video
    input_frames_conditioning=images[:1], # Use first frame of video
    eta=1.0,
    guidance_scale=7.5,
    output_type="pil"
).frames[8:] # Drop the first 8 output frames; they duplicate the guide frames passed in
```
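
Putting it together, here is a minimal end-to-end sketch. The video I/O uses imageio and numpy, which are illustrative choices and not dependencies of the vidxtend package:

```py
# A minimal end-to-end sketch; imageio and numpy are assumed here for
# video I/O only and are not dependencies of vidxtend itself.
import torch
import numpy as np
import imageio.v3 as iio
from PIL import Image
from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_pretrained(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()

prompt = "a timelapse of clouds rolling over mountains"  # placeholder prompt

# Read the source video and resize each frame to 256x256
images = [
    Image.fromarray(frame).resize((256, 256))
    for frame in iio.imread("input.mp4")
]

# Each pipeline call appends 8 new frames (one second at 8 fps)
for _ in range(2):  # extend by 2 seconds total
    new_frames = pipeline(
        prompt=prompt,
        negative_prompt=None,
        image=images[-8:],                     # final 8 frames as guide
        input_frames_conditioning=images[:1],  # first frame for conditioning
        eta=1.0,
        guidance_scale=7.5,
        output_type="pil",
    ).frames[8:]  # drop the 8 guide frames from the output
    images.extend(new_frames)

# Write the extended video back out at 8 fps
iio.imwrite("output.mp4", np.stack([np.array(im) for im in images]), fps=8)
```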