yourusername committed on
Commit · 9010034
1 Parent(s): c7ce0d9
:pencil: edit README.md
README.md CHANGED
@@ -16,6 +16,27 @@ Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 2
 
 Check out the code at my [my Github repo](https://github.com/nateraw/huggingface-vit-finetune).
 
+## Usage
+
+```python
+from transformers import ViTFeatureExtractor, ViTForImageClassification
+from PIL import Image
+import requests
+
+url = 'https://www.cs.toronto.edu/~kriz/cifar-10-sample/dog10.png'
+image = Image.open(requests.get(url, stream=True).raw)
+feature_extractor = ViTFeatureExtractor.from_pretrained('nateraw/vit-base-patch16-224-cifar10')
+model = ViTForImageClassification.from_pretrained('nateraw/vit-base-patch16-224-cifar10')
+inputs = feature_extractor(images=image, return_tensors="pt")
+outputs = model(**inputs)
+preds = outputs.logits.argmax(dim=1)
+
+classes = [
+    'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'
+]
+classes[preds[0]]
+```
+
 ## Model description
 
 The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels.
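
Note on the added Usage snippet: its last lines take the highest-scoring logit and look it up in the CIFAR-10 label list. The sketch below reproduces only that step in plain Python, so it can be tried without downloading the model. The helper name `top_label` and the logit values are assumptions made up for illustration; they are not part of the commit.

```python
import math

# CIFAR-10 labels, in the same order as the list in the Usage snippet
CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']


def top_label(logits, classes=CLASSES):
    """Return (label, probability) for the highest-scoring logit.

    Hypothetical helper: mirrors the argmax-then-lookup step of the
    snippet and adds a softmax so the score reads as a probability.
    """
    if len(logits) != len(classes):
        raise ValueError("expected one logit per class")
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    # Index of the largest logit (the argmax)
    best = max(range(len(logits)), key=lambda i: logits[i])
    return classes[best], exps[best] / total


# Made-up logits for a single image (illustrative values only)
fake_logits = [0.1, -1.2, 0.3, 2.0, 0.0, 4.5, -0.5, 0.7, -2.0, 0.2]
label, prob = top_label(fake_logits)
print(label, round(prob, 3))
```

With the real model, something like `outputs.logits[0].tolist()` would presumably supply the list passed to `top_label`.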