CogVideoX-Fun-V1.1-Reward-LoRAs

Introduction

We explore the Reward Backpropagation technique [1][2] to optimize the videos generated by CogVideoX-Fun-V1.1 for better alignment with human preferences. We provide the following pre-trained models (i.e., LoRAs) along with the training script. You can use these LoRAs as plug-ins to enhance the corresponding base model, or train your own reward LoRA.
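
As a rough sketch of the idea in the cited works (the notation is ours and is not taken from the training script): let \theta be the LoRA parameters, c a text prompt, x_T the initial noise, \mathrm{sample}_{\theta} the differentiable sampling process, D the VAE decoder, and r a differentiable reward model such as HPS v2.1 or MPS. Reward backpropagation then maximizes the expected reward by propagating gradients through sampling and decoding:

```latex
\max_{\theta}\;
\mathbb{E}_{c,\; x_T \sim \mathcal{N}(0, I)}
\Big[\, r\big( D(\mathrm{sample}_{\theta}(c, x_T)),\; c \big) \Big]
```

In practice the cited papers truncate or checkpoint the gradient through the sampling steps to keep memory manageable; the exact strategy used here is described in the GitHub repo.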

For more details, please refer to our GitHub repo.

| Name | Base Model | Reward Model | Hugging Face | Description |
| --- | --- | --- | --- | --- |
| CogVideoX-Fun-V1.1-5b-InP-HPS2.1.safetensors | CogVideoX-Fun-V1.1-5b | HPS v2.1 | 🤗Link | Official HPS v2.1 reward LoRA (rank=128 and network_alpha=64) for CogVideoX-Fun-V1.1-5b-InP. It is trained with a batch size of 8 for 1,500 steps. |
| CogVideoX-Fun-V1.1-2b-InP-HPS2.1.safetensors | CogVideoX-Fun-V1.1-2b | HPS v2.1 | 🤗Link | Official HPS v2.1 reward LoRA (rank=128 and network_alpha=64) for CogVideoX-Fun-V1.1-2b-InP. It is trained with a batch size of 8 for 3,000 steps. |
| CogVideoX-Fun-V1.1-5b-InP-MPS.safetensors | CogVideoX-Fun-V1.1-5b | MPS | 🤗Link | Official MPS reward LoRA (rank=128 and network_alpha=64) for CogVideoX-Fun-V1.1-5b-InP. It is trained with a batch size of 8 for 5,500 steps. |
| CogVideoX-Fun-V1.1-2b-InP-MPS.safetensors | CogVideoX-Fun-V1.1-2b | MPS | 🤗Link | Official MPS reward LoRA (rank=128 and network_alpha=64) for CogVideoX-Fun-V1.1-2b-InP. It is trained with a batch size of 8 for 16,000 steps. |

Demo

CogVideoX-Fun-V1.1-5B

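| Prompt | CogVideoX-Fun-V1.1-5B | CogVideoX-Fun-V1.1-5B HPSv2.1 Reward LoRA | CogVideoX-Fun-V1.1-5B MPS Reward LoRA |
| --- | --- | --- | --- |
| Pig with wings flying above a diamond mountain | (video) | (video) | (video) |
| A dog runs through a field while a cat climbs a tree | (video) | (video) | (video) |
| Crystal cake shimmering beside a metal apple | (video) | (video) | (video) |
| Elderly artist with a white beard painting on a white canvas | (video) | (video) | (video) |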

CogVideoX-Fun-V1.1-2B

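| Prompt | CogVideoX-Fun-V1.1-2B | CogVideoX-Fun-V1.1-2B HPSv2.1 Reward LoRA | CogVideoX-Fun-V1.1-2B MPS Reward LoRA |
| --- | --- | --- | --- |
| A blue car drives past a white picket fence on a sunny day | (video) | (video) | (video) |
| Blue jay swooping near a red maple tree | (video) | (video) | (video) |
| Yellow curtains swaying near a blue sofa | (video) | (video) | (video) |
| White tractor plowing near a green farmhouse | (video) | (video) | (video) |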

The test prompts above are from T2V-CompBench. All videos are generated with a LoRA weight of 0.7.
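
To make the LoRA weight concrete, here is a minimal pure-Python sketch of how a LoRA update is commonly merged into a base weight matrix. It assumes the widespread convention that the low-rank update is scaled by `network_alpha / rank` and then by the user-chosen weight; whether the repo's `merge_lora` follows exactly this formula is an assumption, and the helper names `effective_scale` and `merge_lora_weight` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_scale(lora_weight, network_alpha, rank):
    """Scale applied to the low-rank update (assumed convention: weight * alpha / rank)."""
    return lora_weight * network_alpha / rank


def merge_lora_weight(base, up, down, lora_weight, network_alpha):
    """Return base + scale * (up @ down).

    base: out x in matrix, up: out x rank matrix, down: rank x in matrix.
    """
    rank = len(down)  # number of rows of `down` is the LoRA rank
    scale = effective_scale(lora_weight, network_alpha, rank)
    delta = matmul(up, down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# With the settings listed in the table (rank=128, network_alpha=64) and a
# LoRA weight of 0.7, the low-rank update is scaled by about 0.35.
scale = effective_scale(0.7, 64, 128)

# Toy rank-1 example on a 2x2 identity matrix.
merged = merge_lora_weight(
    base=[[1.0, 0.0], [0.0, 1.0]],
    up=[[1.0], [2.0]],
    down=[[3.0, 4.0]],
    lora_weight=0.7,
    network_alpha=0.5,
)
```

Under this convention, lowering the LoRA weight shrinks the update linearly, so a weight of 0 recovers the base model.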

Quick Start

We provide simple inference code to run CogVideoX-Fun-V1.1-5b-InP with its HPS v2.1 reward LoRA.

import torch
from diffusers import CogVideoXDDIMScheduler

from cogvideox.models.transformer3d import CogVideoXTransformer3DModel
from cogvideox.pipeline.pipeline_cogvideox_inpaint import CogVideoX_Fun_Pipeline_Inpaint
from cogvideox.utils.lora_utils import merge_lora
from cogvideox.utils.utils import get_image_to_video_latent, save_videos_grid

# Base model and reward LoRA.
# NOTE: if merge_lora expects a local file (an assumption here), download the
# .safetensors file first and point lora_path to the local copy.
model_path = "alibaba-pai/CogVideoX-Fun-V1.1-5b-InP"
lora_path = "alibaba-pai/CogVideoX-Fun-V1.1-Reward-LoRAs/CogVideoX-Fun-V1.1-5b-InP-HPS2.1.safetensors"
lora_weight = 0.7

# Generation settings.
prompt = "Pig with wings flying above a diamond mountain"
sample_size = [512, 512]  # [height, width]
video_length = 49  # number of frames

# Load the transformer and scheduler, then assemble the inpaint pipeline in bfloat16.
transformer = CogVideoXTransformer3DModel.from_pretrained_2d(model_path, subfolder="transformer").to(torch.bfloat16)
scheduler = CogVideoXDDIMScheduler.from_pretrained(model_path, subfolder="scheduler")
pipeline = CogVideoX_Fun_Pipeline_Inpaint.from_pretrained(
    model_path, transformer=transformer, scheduler=scheduler, torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()
# Merge the reward LoRA into the pipeline weights with the chosen strength.
pipeline = merge_lora(pipeline, lora_path, lora_weight)

generator = torch.Generator(device="cuda").manual_seed(42)
# No start or end image is given (both None), so generation is driven by the text prompt only.
input_video, input_video_mask, _ = get_image_to_video_latent(None, None, video_length=video_length, sample_size=sample_size)
sample = pipeline(
    prompt,
    num_frames=video_length,
    negative_prompt="bad detailed",
    height=sample_size[0],
    width=sample_size[1],
    generator=generator,
    guidance_scale=7.0,
    num_inference_steps=50,
    video=input_video,
    mask_video=input_video_mask,
).videos

save_videos_grid(sample, "samples/output.mp4", fps=8)
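
The `lora_path` above is written as `<org>/<repo>/<file>`. If `merge_lora` needs a local file (an assumption; check the repo's `lora_utils`), one way to get it is `hf_hub_download` from `huggingface_hub`. The sketch below splits the path into a repo id and a filename; `split_hub_path` and `download_lora` are hypothetical helper names, not part of the repo.

```python
from typing import Tuple


def split_hub_path(hub_path: str) -> Tuple[str, str]:
    """Split '<org>/<repo>/<file>' into ('<org>/<repo>', '<file>')."""
    parts = hub_path.split("/")
    if len(parts) < 3:
        raise ValueError(f"expected '<org>/<repo>/<file>', got {hub_path!r}")
    repo_id = "/".join(parts[:2])
    filename = "/".join(parts[2:])
    return repo_id, filename


def download_lora(hub_path: str) -> str:
    """Download the LoRA file from the Hugging Face Hub and return its local path."""
    # Imported lazily so that split_hub_path works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    repo_id, filename = split_hub_path(hub_path)
    return hf_hub_download(repo_id=repo_id, filename=filename)
```

With this helper, `lora_path = download_lora(lora_path)` before the `merge_lora` call would pass a cached local file path instead of the Hub-style string.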

Limitations

  1. We observe that after a certain amount of training, the reward continues to increase but the quality of the generated videos no longer improves. The model learns shortcuts (adding artifacts to the background, i.e., adversarial patches) to increase the reward.
  2. There is still a lack of suitable preference models for video generation. Image preference models cannot evaluate preferences along the temporal dimension (such as dynamism and consistency). Furthermore, we find that using image preference models decreases the dynamism of the generated videos. Computing the reward using only the first frame of the decoded video mitigates this, but the effect still persists.
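
The first-frame mitigation mentioned in item 2 can be sketched as follows. This is an illustrative pure-Python outline, not the training script's implementation: `image_reward` stands in for an image preference model such as HPS v2.1 or MPS, frames are represented by plain numbers, and all function names are hypothetical.

```python
from typing import Callable, Sequence


def all_frames_reward(frames: Sequence, prompt: str, image_reward: Callable) -> float:
    """Average an image-level reward over every decoded frame."""
    scores = [image_reward(frame, prompt) for frame in frames]
    return sum(scores) / len(scores)


def first_frame_reward(frames: Sequence, prompt: str, image_reward: Callable) -> float:
    """Score only the first decoded frame, so later frames get no direct reward signal."""
    return image_reward(frames[0], prompt)


def reward_loss(reward: float) -> float:
    """Reward backpropagation maximizes the reward, i.e. minimizes its negative."""
    return -reward


# Toy stand-in for a preference model: the "frame" is a number and the reward is that number.
def toy_reward(frame, prompt):
    return frame


frames = [0.2, 0.4, 0.9]
loss_first = reward_loss(first_frame_reward(frames, "a prompt", toy_reward))
loss_all = reward_loss(all_frames_reward(frames, "a prompt", toy_reward))
```

In a real training loop the frames would be tensors decoded by the VAE and the loss would be backpropagated into the LoRA parameters; only the choice of which frames are scored is illustrated here.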

Reference

  1. Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards." In ICLR 2024.
  2. Prabhudesai, Mihir, et al. "Aligning text-to-image diffusion models with reward backpropagation." arXiv preprint arXiv:2310.03739 (2023).