Indian Food Classification with Vision Transformer (ViT)

Overview

This model is a Vision Transformer (ViT) fine-tuned to classify images of Indian foods. It was trained on the Indian Foods Dataset from Hugging Face Datasets.

Dataset

The Indian Foods Dataset contains 4,770 images across 15 classes of popular Indian dishes. The dataset is split into:

  • Training: 3,047 images
  • Validation: 762 images
  • Testing: 961 images
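
As a quick sanity check, the split sizes above can be summed and turned into fractions of the full dataset. This is only a small arithmetic sketch; the dictionary name SPLITS is illustrative and not part of the dataset's API.

```python
# Split sizes as listed in this model card.
SPLITS = {"train": 3047, "validation": 762, "test": 961}

# The three splits should add up to the stated dataset size.
total = sum(SPLITS.values())

# Share of the full dataset held by each split.
fractions = {name: count / total for name, count in SPLITS.items()}

for name, count in SPLITS.items():
    print(f"{name:<10} {count:>5} images  ({fractions[name]:.1%})")
print(f"{'total':<10} {total:>5} images")
```

The counts sum to the stated 4,770 images, with roughly 64% used for training, 16% for validation, and 20% for testing.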

Model

The base model is google/vit-base-patch16-224-in21k, a ViT-Base checkpoint pre-trained on ImageNet-21k that takes 224x224 images split into 16x16 patches. It was fine-tuned on the Indian Foods Dataset for 10 epochs using the AdamW optimizer with a learning rate of 2e-4.
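
The setup described above could be sketched roughly as follows. Only the checkpoint name, the number of classes, the epoch count, the optimizer, and the learning rate come from this card; the names HPARAMS and build_model_and_optimizer are hypothetical, and anything not listed (batch size, weight decay, learning-rate schedule, augmentation) is unknown and left at library defaults here. This is not the author's training script.

```python
# Hyperparameters stated in the model card. Everything else is an assumption.
HPARAMS = {
    "base_checkpoint": "google/vit-base-patch16-224-in21k",
    "num_labels": 15,        # 15 classes of Indian dishes
    "epochs": 10,
    "learning_rate": 2e-4,
}


def build_model_and_optimizer(hparams=HPARAMS):
    """Load the base ViT with a fresh 15-way head and an AdamW optimizer.

    Imports are done lazily so this file can be read or imported
    without torch and transformers installed.
    """
    import torch
    from transformers import ViTForImageClassification

    # The in21k checkpoint ships without a task head, so a new
    # classification layer is initialized for the 15 classes.
    model = ViTForImageClassification.from_pretrained(
        hparams["base_checkpoint"],
        num_labels=hparams["num_labels"],
    )
    # AdamW with the stated learning rate; other optimizer settings
    # are PyTorch defaults, which the card does not confirm.
    optimizer = torch.optim.AdamW(model.parameters(), lr=hparams["learning_rate"])
    return model, optimizer
```

Calling build_model_and_optimizer() downloads the base checkpoint on first use; the returned pair can then be plugged into an ordinary training loop run for HPARAMS["epochs"] epochs.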

Evaluation

The model was evaluated on the test set and achieved the following metrics:

  • Accuracy: 0.9667
  • Precision: 0.9670
  • Recall: 0.9667
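
The card does not say how precision and recall were averaged over the 15 classes. Recall being identical to accuracy is consistent with support-weighted averaging, since weighted recall always equals accuracy, but that is an inference rather than something the card states. Under that assumption, the three metrics could be computed as in this dependency-free sketch; the function name weighted_metrics is illustrative.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision and recall.

    Assumes weighted averaging, which the model card does not confirm.
    """
    n = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = 0.0
    recall = 0.0
    for label, count in support.items():
        weight = count / n
        # A class that is never predicted contributes zero precision.
        if predicted[label]:
            precision += weight * correct[label] / predicted[label]
        recall += weight * correct[label] / count
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Toy labels for illustration only, not data from the actual test set.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]
toy_result = weighted_metrics(toy_true, toy_pred)
print(toy_result)
```

On real predictions, scikit-learn's precision_score and recall_score with average="weighted" compute the same quantities.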

Usage

You can load this fine-tuned model directly from the Hugging Face Hub under the ID therealcyberlord/vit-indian-food.
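
A minimal inference sketch using the transformers image-classification pipeline is shown below. The model ID is taken from the model page; the helper names classify_image and format_predictions and the image path in the usage note are hypothetical. The label strings returned depend on the label mapping stored with the model, which this card does not list.

```python
from typing import Dict, List

# Model ID as shown on the Hugging Face model page.
MODEL_ID = "therealcyberlord/vit-indian-food"


def classify_image(image_path: str, top_k: int = 3) -> List[Dict]:
    """Return the top_k predictions for one image as {label, score} dicts."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    classifier = pipeline("image-classification", model=MODEL_ID)
    # The pipeline accepts a local path, a URL, or a PIL image.
    return classifier(image_path, top_k=top_k)


def format_predictions(predictions: List[Dict]) -> List[str]:
    """Render pipeline output as 'label: score' strings."""
    return [f"{p['label']}: {p['score']:.3f}" for p in predictions]
```

For example, format_predictions(classify_image("my_dish.jpg")) would print the three most likely dishes for a local image named my_dish.jpg. The first call downloads the model weights, which requires network access along with the transformers, torch, and Pillow packages.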

Model size: 85.8M parameters (Safetensors, tensor type F32)