Model Card for ViT_Attempt_4 (Use this model for the Leaderboard)

This is a fine-tuned version of Google's Vision Transformer (ViT), trained on the collected training data to regress GPS coordinates. Compared to Attempt 1, this attempt uses the expanded dataset, trains for 20 epochs instead of 5, and updates only the classifier parameters during training. Compared to Attempt 3, this model was trained for 20 epochs and had a learning rate of

Base model: https://huggingface.co/google/vit-base-patch16-224-in21k
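The card does not show the training setup itself; below is a minimal sketch of the classifier-only fine-tuning described above, assuming the standard transformers API. The optimizer choice and learning rate are placeholders, since the card's learning-rate value is cut off.

```python
import torch
from transformers import AutoConfig, AutoModelForImageClassification

base_model = "google/vit-base-patch16-224-in21k"
config = AutoConfig.from_pretrained(base_model)
config.num_labels = 2  # regression head: latitude and longitude

model = AutoModelForImageClassification.from_pretrained(base_model, config=config)

# Freeze the ViT backbone so only the classifier head is updated
for param in model.vit.parameters():
    param.requires_grad = False

# Placeholder optimizer and learning rate; the card does not state the actual value
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```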

```python
# Training-set statistics used to normalize the GPS targets
lat_mean = 39.951640614844095
lat_std = 0.0007502796001097172
lon_mean = -75.19143196896502
lon_std = 0.0007452186171662059
```
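The model regresses z-scored coordinates, so raw (lat, lon) targets are normalized with these statistics during training and predictions are mapped back afterwards. A minimal sketch of the two helpers (the function names are hypothetical, not from the original card):

```python
import torch

# lat_mean, lat_std, lon_mean, lon_std are the statistics defined above
coord_mean = torch.tensor([lat_mean, lon_mean])
coord_std = torch.tensor([lat_std, lon_std])

def normalize_gps(gps: torch.Tensor) -> torch.Tensor:
    """Z-score raw (lat, lon) pairs given in degrees."""
    return (gps - coord_mean) / coord_std

def denormalize_gps(gps_norm: torch.Tensor) -> torch.Tensor:
    """Map normalized coordinates back to degrees."""
    return gps_norm * coord_std + coord_mean
```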

To reproduce validation predictions (this assumes a `val_dataloader` that yields batches of images and normalized GPS coordinates):

```python
import torch
from transformers import AutoConfig, AutoModelForImageClassification

model_name = "AppliedMLReedShreya/ViT_Attempt_4"
config = AutoConfig.from_pretrained(model_name)
config.num_labels = 2  # We need two outputs: latitude and longitude

# Load the fine-tuned ViT model from the Hub
vit_model = AutoModelForImageClassification.from_pretrained(model_name, config=config)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Using device: {device}')
vit_model = vit_model.to(device)

# Initialize lists to store predictions and actual values
all_preds = []
all_actuals = []

vit_model.eval()
with torch.no_grad():
    for images, gps_coords in val_dataloader:  # assumed to yield normalized batches
        images, gps_coords = images.to(device), gps_coords.to(device)

        outputs = vit_model(images).logits

        # Denormalize predictions and actual values back to degrees
        preds = outputs.cpu() * torch.tensor([lat_std, lon_std]) + torch.tensor([lat_mean, lon_mean])
        actuals = gps_coords.cpu() * torch.tensor([lat_std, lon_std]) + torch.tensor([lat_mean, lon_mean])

        all_preds.append(preds)
        all_actuals.append(actuals)

# Concatenate all batches into (N, 2) arrays
all_preds = torch.cat(all_preds).numpy()
all_actuals = torch.cat(all_actuals).numpy()
```
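The loop above only collects predictions; one way to summarize accuracy (an assumed follow-up, not part of the original card) is the mean great-circle error in meters via the haversine formula:

```python
import numpy as np

def haversine_m(pred: np.ndarray, actual: np.ndarray) -> np.ndarray:
    """Great-circle distance in meters between (lat, lon) pairs in degrees."""
    r = 6_371_000.0  # mean Earth radius in meters
    lat1, lon1 = np.radians(pred[:, 0]), np.radians(pred[:, 1])
    lat2, lon2 = np.radians(actual[:, 0]), np.radians(actual[:, 1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

errors_m = haversine_m(all_preds, all_actuals)
print(f"Mean error: {errors_m.mean():.1f} m, median: {np.median(errors_m):.1f} m")
```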
Model size: 85.8M parameters (F32, Safetensors)