Vision Transformer fine-tuned on Matthijs/snacks dataset

This is a Vision Transformer (ViT) model pre-trained on ImageNet-21k and fine-tuned on the Matthijs/snacks dataset for 5 epochs, using data augmentation transformations from torchvision.

The model achieves 94.97% accuracy on the validation set and 94.43% on the test set.

Data augmentation pipeline

The code block below shows the transformations applied during pre-processing to augment the original dataset. The augmented images were generated on the fly with the set_transform method.

from transformers import ViTFeatureExtractor
from torchvision.transforms import (
    Compose,
    Normalize,
    Resize,
    RandomResizedCrop,
    RandomHorizontalFlip,
    RandomAdjustSharpness,
    ToTensor
)

checkpoint = 'google/vit-base-patch16-224-in21k'
feature_extractor = ViTFeatureExtractor.from_pretrained(checkpoint)

# `size` is an int in older transformers releases and a dict
# ({'height': ..., 'width': ...}) in newer ones; normalize it to an int
image_size = feature_extractor.size
if isinstance(image_size, dict):
    image_size = image_size['height']

# transformations on the training set (random, so each epoch sees new variants)
train_aug_transforms = Compose([
    RandomResizedCrop(size=image_size),
    RandomHorizontalFlip(p=0.5),
    RandomAdjustSharpness(sharpness_factor=5, p=0.5),
    ToTensor(),
    Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std),
])

# transformations on the validation/test set (deterministic, no augmentation)
valid_aug_transforms = Compose([
    Resize(size=(image_size, image_size)),
    ToTensor(),
    Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std),
])
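
As a minimal sketch of how transforms like those above can be attached on the fly, the helper below wraps a per-image transform into a batch-level function of the kind set_transform expects. The helper name `make_batch_transform` and the column names `image` and `pixel_values` are assumptions made for illustration; they are not taken from the original training code.

```python
def make_batch_transform(image_transform, image_key="image", output_key="pixel_values"):
    """Wrap a per-image transform into a function that processes a batch dict.

    Hypothetical helper: the name and the column keys are illustrative assumptions.
    """
    def apply(batch):
        # Convert to RGB first so grayscale or RGBA images yield 3 channels
        batch[output_key] = [
            image_transform(img.convert("RGB")) for img in batch[image_key]
        ]
        return batch

    return apply


# Assumed usage with the `datasets` library (not executed here):
#   from datasets import load_dataset
#   dataset = load_dataset("Matthijs/snacks")
#   dataset["train"].set_transform(make_batch_transform(train_aug_transforms))
#   dataset["validation"].set_transform(make_batch_transform(valid_aug_transforms))
#   dataset["test"].set_transform(make_batch_transform(valid_aug_transforms))
```

Because the function runs only when rows are accessed, the random training transforms are re-sampled on every pass over the data, and no augmented copies need to be stored.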

Dataset used to train matteopilotto/vit-base-patch16-224-in21k-snacks
