---
license: cc-by-nc-sa-4.0
pipeline_tag: image-to-video
tags:
- turing
- autonomous driving
- video generation
- world model
---
# Terra
**Terra** is a world model designed for autonomous driving and serves as a baseline model in the [ACT-Bench](https://github.com/turingmotors/ACT-Bench) framework.
Terra generates video continuations conditioned on a short video clip (approximately three frames) and a trajectory instruction.
A key feature of Terra is its **high adherence to trajectory instructions**, enabling accurate and reliable action-conditioned video generation.
We have developed two versions of the Terra model to date. The `v1` model, as detailed in the paper, exhibits a bias towards generating videos that veer to the right. To address this issue, we introduced the `v2` model, incorporating slight architectural modifications to mitigate this tendency and produce more balanced outputs. The performance of each model is outlined below.
| Metric | Vista | Terra (v1) | Terra (v2) |
|---|---|---|---|
| Accuracy (↑) | 0.307 | 0.441 | **0.632** |
| ADE (↓) | 4.50 | 3.98 | **3.86** |
| FDE (↓) | 8.66 | 8.21 | **8.05** |
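
The table does not spell out ADE and FDE. Assuming they carry their usual meaning in trajectory evaluation (average and final displacement error between an instructed trajectory and the one estimated from the generated video), a minimal sketch of the computation could look like the following. The function name and the sample waypoints are hypothetical and are not taken from the ACT-Bench code; see the paper for the exact evaluation protocol.

```python
import math


def displacement_errors(predicted, reference):
    """Return (ADE, FDE) for two equal-length lists of (x, y) waypoints."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    # Euclidean distance between corresponding waypoints.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    ade = sum(distances) / len(distances)  # mean error over all waypoints
    fde = distances[-1]                    # error at the final waypoint
    return ade, fde


# Hypothetical waypoints in arbitrary units, for illustration only.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
predicted = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(predicted, reference)
print(f"ADE={ade:.2f}, FDE={fde:.2f}")
```

Lower values mean the generated video follows the instructed trajectory more closely, which is why both rows are marked with ↓.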
## Related Links
For more technical details and discussions, please refer to:
- **Paper:** https://arxiv.org/abs/2412.05337
- **Code:** https://github.com/turingmotors/ACT-Bench
## How to use
We have verified the execution on a machine equipped with a single NVIDIA H100 80GB GPU. However, we believe it should be possible to run the model on any machine equipped with an NVIDIA GPU with 16GB or more of VRAM.
Terra consists of an Image Tokenizer, an Autoregressive Transformer, and a Video Refiner. Due to the complexity of setting up the Video Refiner, we have not included its implementation in this Hugging Face repository. Instead, **the implementation and setup instructions for the Video Refiner are provided in the [ACT-Bench repository](https://github.com/turingmotors/ACT-Bench)**. Here, we provide an example of generating video continuations using only the Image Tokenizer and the Autoregressive Transformer, conditioned on image frames and a template trajectory. The resulting video quality may seem suboptimal because each frame is decoded individually. To improve the visual quality, you can use the Video Refiner.
### Install Packages
We use [uv](https://docs.astral.sh/uv/) to manage Python packages. If you don't have uv installed in your environment, please refer to its documentation.
```shell
$ git clone https://huggingface.co/turing-motors/Terra
$ cd Terra
$ uv sync
```
### Action-Conditioned Video Generation without Video Refiner
```shell
$ python inference.py
```
This command generates a video using three image frames located in [`assets/conditioning_frames`](./assets/conditioning_frames/) and the `curving_to_left/curving_to_left_moderate` trajectory defined in the trajectory template file [`assets/template_trajectory.json`](./assets/template_trajectory.json).
You can find more details by referring to the [`inference.py`](./inference.py) script.
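
The trajectory name above reads like a `category/variant` path into the template file. As a rough sketch only, assuming the JSON is a nested mapping keyed first by category and then by variant (an assumption, not something confirmed here; check `inference.py` for the actual layout), a lookup might be written as below. `select_trajectory`, `load_templates`, and the stand-in data are hypothetical names and values introduced for illustration.

```python
import json
from pathlib import Path


def select_trajectory(templates: dict, name: str):
    """Walk a slash-separated name such as 'category/variant' through nested dicts."""
    node = templates
    for key in name.split("/"):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"'{key}' not found while resolving '{name}'")
        node = node[key]
    return node


def load_templates(path="assets/template_trajectory.json"):
    """Read the template file; the nested layout handled above is an assumption."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Stand-in data with an assumed layout; these are not the real file contents.
example = {"curving_to_left": {"curving_to_left_moderate": [[0.0, 0.0], [1.0, 0.1]]}}
waypoints = select_trajectory(example, "curving_to_left/curving_to_left_moderate")
print(len(waypoints))
```

If the real file uses a different structure, only `select_trajectory` would need to change.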
## Citation
```bibtex
@misc{arai2024actbench,
title={ACT-Bench: Towards Action Controllable World Models for Autonomous Driving},
author={Hidehisa Arai and Keishi Ishihara and Tsubasa Takahashi and Yu Yamaguchi},
year={2024},
eprint={2412.05337},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.05337},
}
```