---
license: other
datasets:
- imagenet-1k
---
[**FasterViT: Fast Vision Transformers with Hierarchical Attention**](https://arxiv.org/abs/2306.06189).
FasterViT achieves a new state-of-the-art Pareto front in terms of accuracy vs. image throughput, without extra training data!
Note: Please use the [**latest NVIDIA TensorRT release**](https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html) to enjoy the benefits of optimized FasterViT ops.
## Quick Start
We can import pre-trained FasterViT models with **1 line of code**. First, install FasterViT:
```bash
pip install fastervit
```
A pretrained FasterViT model with default hyper-parameters can be created as follows:
```python
>>> from fastervit import create_model
# Define fastervit-0 model with 224 x 224 resolution
>>> model = create_model('faster_vit_0_224',
...                      pretrained=True,
...                      model_path="/tmp/faster_vit_0.pth.tar")
```
`model_path` sets the destination path for the downloaded pretrained weights.
We can test the model by passing a dummy input image; the output is the logits:
```python
>>> import torch
>>> image = torch.rand(1, 3, 224, 224)
>>> output = model(image) # torch.Size([1, 1000])
```
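The logits span the 1,000 ImageNet-1K classes; a minimal sketch of converting them to probabilities and reading off the top-5 predictions:
```python
>>> import torch
>>> probs = torch.softmax(output, dim=-1)   # logits -> class probabilities
>>> top5_prob, top5_idx = probs.topk(5)     # indices follow the ImageNet-1K class order
```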
We can also use the any-resolution FasterViT model to accommodate arbitrary image resolutions. In the following, we define an any-resolution FasterViT-0 model with an input resolution of 576 x 960, window sizes of 12 and 6 in the 3rd and 4th stages, a carrier token size of 2, and an embedding dimension of 64:
```python
>>> from fastervit import create_model
# Define any-resolution FasterViT-0 model with 576 x 960 resolution
>>> model = create_model('faster_vit_0_any_res',
...                      resolution=[576, 960],
...                      window_size=[7, 7, 12, 6],
...                      ct_size=2,
...                      dim=64,
...                      pretrained=True)
```
Note that the above model is initialized from the original ImageNet pre-trained FasterViT, which uses the original resolution of 224 x 224. As a result, missing keys and mismatches are expected, since new layers (e.g. additional carrier tokens) are being added.
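`pretrained=True` handles this initialization internally, but if you load a checkpoint into the any-resolution model manually, a non-strict load tolerates the expected mismatches. A minimal sketch, assuming a timm-style checkpoint layout:
```python
>>> import torch
>>> ckpt = torch.load("/tmp/faster_vit_0.pth.tar", map_location="cpu")
>>> state_dict = ckpt.get("state_dict", ckpt)  # assumption: weights may be nested under "state_dict"
>>> missing, unexpected = model.load_state_dict(state_dict, strict=False)
>>> # missing/unexpected list the keys affected by the newly added layers
```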
As before, we can test the model with a dummy input image; the output is the logits:
```python
>>> import torch
>>> image = torch.rand(1, 3, 576, 960)
>>> output = model(image) # torch.Size([1, 1000])
```
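To benefit from the optimized TensorRT ops mentioned above, one common deployment path is exporting the model to ONNX and building an engine with `trtexec`. A minimal sketch, assuming the model traces cleanly with `torch.onnx.export` (the file name and opset version are illustrative):
```python
import torch
from fastervit import create_model

# Export a pretrained FasterViT-0 model to ONNX for TensorRT consumption
model = create_model('faster_vit_0_224', pretrained=True).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "faster_vit_0.onnx",
                  input_names=["input"], output_names=["logits"],
                  opset_version=17)
# Then build a TensorRT engine, e.g.:
#   trtexec --onnx=faster_vit_0.onnx --fp16
```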
---
## Results + Pretrained Models
### ImageNet-1K
**FasterViT ImageNet-1K Pretrained Models**
| Name | Acc@1(%) | Acc@5(%) | Throughput(Img/Sec) | Resolution | #Params(M) | FLOPs(G) | Download |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| FasterViT-0 | 82.1 | 95.9 | 5802 | 224x224 | 31.4 | 3.3 | model |
| FasterViT-1 | 83.2 | 96.5 | 4188 | 224x224 | 53.4 | 5.3 | model |
| FasterViT-2 | 84.2 | 96.8 | 3161 | 224x224 | 75.9 | 8.7 | model |
| FasterViT-3 | 84.9 | 97.2 | 1780 | 224x224 | 159.5 | 18.2 | model |
| FasterViT-4 | 85.4 | 97.3 | 849 | 224x224 | 424.6 | 36.6 | model |
| FasterViT-5 | 85.6 | 97.4 | 449 | 224x224 | 975.5 | 113.0 | model |
| FasterViT-6 | 85.8 | 97.4 | 352 | 224x224 | 1360.0 | 142.0 | model |
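Throughput depends heavily on hardware, batch size, and precision, so treat the numbers above as relative. A rough sketch for measuring image throughput on your own GPU (the batch size and iteration counts below are arbitrary choices):
```python
import time
import torch
from fastervit import create_model

model = create_model('faster_vit_0_224', pretrained=True).cuda().eval()
x = torch.randn(64, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):              # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start, iters = time.time(), 50
    for _ in range(iters):           # timed iterations
        model(x)
    torch.cuda.synchronize()

print(f"{iters * x.shape[0] / (time.time() - start):.1f} images/sec")
```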
### ImageNet-21K
**FasterViT ImageNet-21K Pretrained Models (ImageNet-1K Fine-tuned)**
| Name | Acc@1(%) | Acc@5(%) | Resolution | #Params(M) | FLOPs(G) | Download |
|:---|:---:|:---:|:---:|:---:|:---:|
| FasterViT-4-21K-224 | 86.6 | 97.8 | 224x224 | 271.9 | 40.8 | model |
| FasterViT-4-21K-384 | 87.6 | 98.3 | 384x384 | 271.9 | 120.1 | model |
| FasterViT-4-21K-512 | 87.8 | 98.4 | 512x512 | 271.9 | 213.5 | model |
| FasterViT-4-21K-768 | 87.9 | 98.5 | 768x768 | 271.9 | 480.4 | model |
### Robustness (ImageNet-A - ImageNet-R - ImageNet-V2)
All models use `crop_pct=0.875`. Results are obtained by running inference on ImageNet-1K pretrained models without fine-tuning; a preprocessing sketch follows the table.
| Name | A-Acc@1(%) | A-Acc@5(%) | R-Acc@1(%) | R-Acc@5(%) | V2-Acc@1(%) | V2-Acc@5(%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| FasterViT-0 | 23.9 | 57.6 | 45.9 | 60.4 | 70.9 | 90.0 |
| FasterViT-1 | 31.2 | 63.3 | 47.5 | 61.9 | 72.6 | 91.0 |
| FasterViT-2 | 38.2 | 68.9 | 49.6 | 63.4 | 73.7 | 91.6 |
| FasterViT-3 | 44.2 | 73.0 | 51.9 | 65.6 | 75.0 | 92.2 |
| FasterViT-4 | 49.0 | 75.4 | 56.0 | 69.6 | 75.7 | 92.7 |
| FasterViT-5 | 52.7 | 77.6 | 56.9 | 70.0 | 76.0 | 93.0 |
| FasterViT-6 | 53.7 | 78.4 | 57.1 | 70.1 | 76.1 | 93.0 |
A, R and V2 denote ImageNet-A, ImageNet-R and ImageNet-V2 respectively.
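For reference, `crop_pct=0.875` corresponds to the standard timm-style evaluation preprocessing: resize the short side to `resolution / crop_pct`, then center-crop. A minimal sketch at 224 x 224, assuming standard ImageNet normalization statistics:
```python
from torchvision import transforms

eval_tf = transforms.Compose([
    transforms.Resize(256),                  # int(224 / 0.875) = 256
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
```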
## Citation
Please consider citing FasterViT if this repository is useful for your work.
```bibtex
@article{hatamizadeh2023fastervit,
title={FasterViT: Fast Vision Transformers with Hierarchical Attention},
author={Hatamizadeh, Ali and Heinrich, Greg and Yin, Hongxu and Tao, Andrew and Alvarez, Jose M and Kautz, Jan and Molchanov, Pavlo},
journal={arXiv preprint arXiv:2306.06189},
year={2023}
}
```
## Licenses
Copyright © 2023, NVIDIA Corporation. All rights reserved.
This work is made available under the NVIDIA Source Code License-NC. Click [here](LICENSE) to view a copy of this license.
For license information regarding the timm repository, please refer to its [repository](https://github.com/rwightman/pytorch-image-models).
For license information regarding the ImageNet dataset, please see the [ImageNet official website](https://www.image-net.org/).
## Acknowledgement
This repository is built on top of the [timm](https://github.com/huggingface/pytorch-image-models) repository. We thank [Ross Wightman](https://rwightman.com/) for creating and maintaining this high-quality library.