arXiv:2407.14505

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

Published on Jul 19
· Submitted by Kaiyue on Jul 24
Abstract

Text-to-video (T2V) generation models have advanced significantly, yet their ability to compose different objects, attributes, actions, and motions into a video remains unexplored. Previous text-to-video benchmarks have also neglected this important ability in their evaluations. In this work, we conduct the first systematic study of compositional text-to-video generation and propose T2V-CompBench, the first benchmark tailored for it. T2V-CompBench encompasses diverse aspects of compositionality, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy. We further carefully design evaluation metrics, comprising MLLM-based, detection-based, and tracking-based metrics, which better reflect compositional text-to-video generation quality across the seven proposed categories with 700 text prompts. The effectiveness of the proposed metrics is verified by correlation with human evaluations. We also benchmark various text-to-video generative models and conduct in-depth analysis across different models and compositional categories. We find that compositional text-to-video generation is highly challenging for current models, and we hope this work will shed light on future research in this direction.
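As an illustration of the detection-based metrics mentioned in the abstract (e.g., for generative numeracy), one can count detected objects in sampled frames and compare against the count the prompt implies. The sketch below is a hypothetical illustration, not the paper's implementation; `detect_objects` is a stand-in for any off-the-shelf detector.

```python
# Hypothetical detection-based numeracy check (illustrative only, not the
# paper's implementation). `detect_objects` stands in for a real detector.
from typing import List, Tuple

def detect_objects(frame, class_name: str) -> List[Tuple[float, float, float, float]]:
    """Hypothetical detector: returns bounding boxes of `class_name` in `frame`."""
    raise NotImplementedError("plug in an off-the-shelf detector here")

def numeracy_score(frames, class_name: str, expected_count: int) -> float:
    """Fraction of sampled frames whose detected object count matches the prompt."""
    if not frames:
        return 0.0
    matches = sum(
        1 for frame in frames
        if len(detect_objects(frame, class_name)) == expected_count
    )
    return matches / len(frames)

# Example: a prompt like "three dogs running on a beach" implies
# numeracy_score(sampled_frames, "dog", expected_count=3).
```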

Community

Paper author · Paper submitter

This paper introduces the first compositional text-to-video (T2V) generation benchmark, T2V-CompBench.
It evaluates diverse aspects of compositionality with specifically designed metrics, covering 7 categories with 700 text prompts.
We benchmark 20 T2V models, 13 open-source and 7 commercial, highlighting how challenging compositionality remains for current T2V generation (a sketch of the overall evaluation loop is given below).
(Teaser figure: teaser.png)
Project page: https://t2v-compbench.github.io/
Code available here: https://github.com/KaiyueSun98/T2V-CompBench
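For context, a benchmark run of this shape reduces to a loop over categories, prompts, and models. Here is a minimal sketch under assumed interfaces: `generate_video` and `score_video` are hypothetical callables (the repo's actual pipeline may differ), and the category names mirror the seven listed in the abstract.

```python
# Minimal sketch of a per-model evaluation loop (assumed interfaces, not the
# repo's actual API): score every prompt, then average within each category.
from collections import defaultdict
from statistics import mean

CATEGORIES = [
    "consistent_attribute_binding", "dynamic_attribute_binding",
    "spatial_relationships", "motion_binding", "action_binding",
    "object_interactions", "generative_numeracy",
]

def evaluate_model(generate_video, score_video, prompts_by_category):
    """`generate_video(prompt)` returns a video; `score_video(video, prompt,
    category)` returns a float from the category-appropriate metric."""
    scores = defaultdict(list)
    for category in CATEGORIES:
        for prompt in prompts_by_category[category]:  # 100 prompts per category
            video = generate_video(prompt)
            scores[category].append(score_video(video, prompt, category))
    return {category: mean(vals) for category, vals in scores.items()}
```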


What are some of the planned enhancements for T2V-CompBench?

Paper author

Hi,
Thank you for your interest in T2V-CompBench!
First, we aim to improve the evaluation metrics: the current ones can be further refined to better align with human perception (a minimal sketch of such an alignment check follows this comment).
Second, we plan to incorporate more categories into the benchmark. Expanding the range of categories will let us better evaluate models' capabilities on diverse and complex compositional scenarios, which is crucial for advancing the field.
I look forward to sharing more updates as these enhancements progress!

Kaiyue
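As a note on the human-alignment point above: checking that an automatic metric tracks human perception is typically done with rank correlation between metric scores and human ratings. A minimal sketch with made-up scores, using SciPy's `kendalltau` and `spearmanr`:

```python
# Rank correlation between an automatic metric and human ratings
# (the scores below are made up for illustration).
from scipy.stats import kendalltau, spearmanr

metric_scores = [0.62, 0.41, 0.77, 0.30, 0.55]  # automatic metric, per video
human_ratings = [3.8, 2.9, 4.5, 2.1, 3.5]       # human scores, same videos

tau, tau_p = kendalltau(metric_scores, human_ratings)
rho, rho_p = spearmanr(metric_scores, human_ratings)
print(f"Kendall tau = {tau:.3f} (p = {tau_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
```

Higher correlation with human ratings indicates the metric is a better proxy for perceived compositional quality.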


