---
license: other
tags:
- vision
- image-classification
datasets:
- imagenet-1k
metrics:
- accuracy

---

# TFaugvit

TFAugViT is a TensorFlow implementation of
[AugViT: Augmented Shortcuts for Vision Transformers](https://arxiv.org/pdf/2106.15941v1.pdf) by Yehui Tang, Kai Han, Chang Xu, An Xiao, Yiping Deng, Chao Xu and Yunhe Wang,
first released in [this repository](https://github.com/kingcong/augvit).


## Model description

Aug-ViT inserts additional paths with learnable parameters in parallel with the original shortcuts to alleviate feature collapse. The augmented shortcuts are implemented with block-circulant projections, which add only a negligible amount of computation.
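
To sketch the idea (this is illustrative, not the code in this repository: the paper uses block-circulant matrices, whereas this toy layer uses a single circulant block, and all names are made up), a circulant projection can serve as the cheap parallel path next to the identity shortcut:

```python
import tensorflow as tf

class CirculantProjection(tf.keras.layers.Layer):
    """Projects token features with a circulant matrix C defined by one
    learnable vector c, so the extra path costs O(d log d) per token
    instead of O(d^2) for a dense projection."""

    def __init__(self, dim, **kwargs):
        super().__init__(**kwargs)
        # Zero init keeps the extra path inactive at the start
        # (a choice made for this sketch).
        self.c = self.add_weight(name="c", shape=(dim,), initializer="zeros")

    def call(self, x):
        # Multiplying by a circulant matrix is a circular convolution,
        # so C @ x == IFFT(FFT(c) * FFT(x)) along the channel axis.
        fx = tf.signal.fft(tf.cast(x, tf.complex64))
        fc = tf.signal.fft(tf.cast(self.c, tf.complex64))
        return tf.cast(tf.math.real(tf.signal.ifft(fc * fx)), x.dtype)

# An augmented shortcut places this path next to the identity shortcut:
#   z = attention(x) + x + circulant_projection(x)
```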

## Intended uses & limitations

This model can be used for image classification and can easily be fine-tuned to suit your use case.

### How to use

Here is how to use this model to classify an image into one of the 1,000 ImageNet classes:

```python
from transformers import TFAutoModelForImageClassification
from PIL import Image
import requests
import numpy as np
import tensorflow as tf

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

model = TFAutoModelForImageClassification.from_pretrained(
    "tensorgirl/TFaugvit", trust_remote_code=True
)

# Build a (1, 224, 224, 3) float batch; adjust the preprocessing
# (normalization, channel layout) to whatever the checkpoint expects.
pixel_values = np.array(image.resize((224, 224)), dtype=np.float32)[None, ...]

outputs = model({"pixel_values": pixel_values})

# model predicts one of the 1000 ImageNet classes
predicted_class_idx = int(tf.argmax(outputs, axis=-1)[0])
```

## Training data

The TFAugViT model is trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset consisting of 1 million images and 1,000 classes. 

## Training procedure

Because the model uses the einops library, you cannot call `model.fit()` on it directly. Instead, either write a custom training loop that passes inputs as shown above (see the sketch below), or wrap the model in a Keras functional model with the batch_size fixed beforehand.
If you want to train the model on other data, either resize the images to 224x224 or change the model config's `image_size` to suit your requirements.
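
A minimal custom-loop sketch, assuming the model returns logits and that `dataset` is a `tf.data.Dataset` of `(pixel_values, labels)` batches you supply:

```python
import tensorflow as tf

# Assumptions: `model` is the TFAugViT loaded above, it returns logits,
# and `dataset` yields batches shaped (32, 224, 224, 3) and (32,).
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(pixel_values, labels):
    with tf.GradientTape() as tape:
        logits = model({"pixel_values": pixel_values}, training=True)
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

for pixel_values, labels in dataset:
    loss = train_step(pixel_values, labels)
```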


### Training hyperparameters

The following hyperparameters were used during training:
- optimizer: Adam
- batch_size: 32
- training_precision: float32

## Evaluation results

| Model                 | ImageNet top-1 accuracy | # params | Resolution |
|-----------------------|-------------------------|----------|------------|
| Aug-ViT-S             | 81                      | 22.2 M   | 224x224    |
| Aug-ViT-B             | 82.4                    | 86.5 M   | 224x224    |
| Aug-ViT-B (Upsampled) | 84.2                    | 86.5 M   | 384x384    |



### Framework versions

- Transformers 4.33.2
- TensorFlow 2.13.0
- Tokenizers 0.13.3

### BibTeX entry and citation info

```bibtex
@inproceedings{tang2021augvit,
  title  = {AugViT: Augmented Shortcuts for Vision Transformers},
  author = {Yehui Tang and Kai Han and Chang Xu and An Xiao and Yiping Deng and Chao Xu and Yunhe Wang},
  year   = {2021},
  url    = {https://arxiv.org/abs/2106.15941}
}
```