StevenZhang committed
Commit 7200bf3 · Parent(s): ea3c0e6
Update README.md

README.md CHANGED
@@ -45,16 +45,14 @@ widgets:
 
 # Video-to-Video
 
-
 
 
-
-
+**MS-Vid2Vid-XL** aims to improve the spatiotemporal continuity and resolution of video generation. It serves as the second stage of I2VGen-XL to generate 720P videos, and it can also be used for tasks such as text-to-video synthesis and high-definition video conversion. Its training data comprises a curated, large-scale collection of high-definition videos and images (shortest side >= 720), allowing it to upscale low-resolution videos to a higher resolution (1280 * 720); it can handle videos of almost any resolution, though 16:9 widescreen input is recommended.
 <center>
 <p align="center">
 <img src="https://huggingface.co/damo-vilab/MS-Vid2Vid-XL/resolve/main/assets/images/Fig_1.png"/>
 <br/>
-Fig.1
+Fig.1 MS-Vid2Vid-XL
 </p></center>
 
 
@@ -64,11 +63,8 @@ The **MS-Vid2Vid** project is developed and trained by Damo Academy and is prima
 ## 模型介绍 (Introduction)
 
 
-**MS-
-The right side shows the high-resolution result (1280 * 720); overall it is much smoother, and in many cases it has a strong corrective effect.
-
 
-**MS-Vid2Vid-XL**
+**MS-Vid2Vid-XL**, like the first stage of I2VGen-XL, is a latent video diffusion model (VLDM), and the two stages share a spatiotemporal UNet (ST-UNet) with the same structure. Its design follows our in-house [VideoComposer](https://videocomposer.github.io); for details, please refer to its technical report.
 
 
 <center>
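The paragraph added in this commit describes the model's usage-facing capability: second-stage 720P generation and upscaling of low-resolution clips. Below is a minimal sketch of how a ModelScope-hosted video-to-video model of this kind is typically invoked. The task name `'video-to-video'`, the model ID `'damo/Video-to-Video'`, the input keys, and the `output_video` argument are assumptions drawn from the common ModelScope pipeline pattern, not from this README; check the official model card for the exact values.

```python
# Hypothetical usage sketch for MS-Vid2Vid-XL via ModelScope (pip install modelscope).
# Task name, model ID, input keys, and the output_video kwarg are assumptions;
# verify them against the official model card before use.
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# Build a video-to-video pipeline; weights are fetched on first use.
pipe = pipeline(task='video-to-video', model='damo/Video-to-Video')

# A low-resolution clip plus a text description of its content.
# MS-Vid2Vid-XL upscales toward 1280x720; 16:9 input is recommended.
inputs = {
    'video_path': 'input_low_res.mp4',           # hypothetical local file
    'text': 'A panda eating bamboo on a rock.',  # hypothetical caption
}

result = pipe(inputs, output_video='output_720p.mp4')
print(result[OutputKeys.OUTPUT_VIDEO])  # path of the generated 720P video
```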
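The added introduction describes a latent-space ST-UNet shared by both stages. As a purely illustrative sketch, the block below shows the factorized "spatial attention within each frame, then temporal attention at each spatial location" pattern that spatiotemporal UNets of this kind commonly use; it is not the actual ST-UNet or VideoComposer code, and every dimension and layer choice is an assumption.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Illustrative sketch of a factorized spatiotemporal block.

    Not the actual ST-UNet: it only demonstrates the common pattern of
    attending over space within each frame, then over time at each
    spatial location, with residual connections around both steps.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, spatial_positions, dim) latent video features.
        b, f, s, d = x.shape

        # Spatial attention: tokens are the positions within one frame.
        xs = self.norm_s(x).reshape(b * f, s, d)
        attn_s, _ = self.spatial_attn(xs, xs, xs)
        x = x + attn_s.reshape(b, f, s, d)

        # Temporal attention: tokens are the frames at one spatial location.
        xt = self.norm_t(x).permute(0, 2, 1, 3).reshape(b * s, f, d)
        attn_t, _ = self.temporal_attn(xt, xt, xt)
        x = x + attn_t.reshape(b, s, f, d).permute(0, 2, 1, 3)
        return x

# Toy check: 2 clips, 16 frames, an 8x8 latent grid, 320 channels.
feats = torch.randn(2, 16, 64, 320)
out = SpatioTemporalBlock(dim=320)(feats)
print(out.shape)  # torch.Size([2, 16, 64, 320])
```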