Overfitting in VideoMAE Model Fine-Tuning for Binary Classification on Home Camera Footage
Description:
I'm fine-tuning a VideoMAE model for binary classification on home camera footage to distinguish between two actions. Here’s a summary of my setup and the issues I’m facing:
Dataset & Variations:
I have two primary datasets:
- Small Dataset: ~120 clips for quicker iteration.
- Full Dataset: ~3k clips.
All videos are 6 seconds long, though I've also tested with 3-second clips.
I've also created variations with blurred or blacked-out backgrounds so the model focuses on the action rather than the scene.
Model & Configuration:
The model classifies actions using 16 uniformly sampled frames per video.
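For reference, uniform sampling just picks evenly spaced frame indices across the clip. A minimal sketch mirroring pytorchvideo's UniformTemporalSubsample (the 30 fps figure is only an assumed example):

import torch

def uniform_frame_indices(total_frames: int, num_samples: int = 16) -> torch.Tensor:
    # Evenly spaced indices across the clip, mirroring
    # pytorchvideo's UniformTemporalSubsample.
    return torch.linspace(0, total_frames - 1, num_samples).long()

# A 6-second clip at an assumed 30 fps has 180 frames, so sampled
# frames end up roughly 0.4 s apart:
print(uniform_frame_indices(180))
# tensor([  0,  11,  23,  35,  47,  59,  71,  83,  95, 107, 119, 131, 143, 155, 167, 179])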
I’ve tried various base models, including small, base, large, and models fine-tuned on SSV2 and Kinetics.
Hyperparameters tested:
Batch sizes of 2, 4, and 8.
Epochs ranging from 4 to 16.
Learning rate set to 5e-5.
I removed the RandomCrop transformation, since it can crop the person out of the frame entirely.
I'm using the Hugging Face Video Classification Colab Notebook as a starting point: Training Notebook.
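For concreteness, a sketch of the corresponding TrainingArguments; the values shown are one tested combination, and "videomae-finetune" is a placeholder output directory:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="videomae-finetune",  # placeholder
    per_device_train_batch_size=8,   # also tried 2 and 4
    num_train_epochs=8,              # tried 4 to 16
    learning_rate=5e-5,
    remove_unused_columns=False,     # the notebook needs this for its video collator
)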
Problem: Despite these variations, the model overfits almost immediately. To rule out dataset-specific issues, I also tested on the UCF101 dataset and got results comparable to the Hugging Face VideoMAE Colab, so the code itself seems fine.
Request: Any advice on addressing this overfitting issue would be greatly appreciated. Specifically, I'm looking for guidance on:
- Additional hyperparameter adjustments.
- Potential model architecture changes (if applicable).
- Dataset augmentation techniques that might improve generalization.
Thank you for any help or insights you can provide!
I've encountered similar issues, so I would try a couple of things:
1.) If your videos are 6 seconds long and the model subsamples to 16 frames, visualize the videos after your transforms. Even with 3-second clips (I don't know what FPS your videos are recorded at), you could be dropping most of the frames, and the actions may become unrecognizable. A quick sanity check is sketched below.
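Something like this to eyeball the sampled frames; a sketch assuming the notebook-style pipeline, where samples are dicts with a normalized "video" tensor of shape (C, T, H, W), and where train_dataset and image_processor are placeholders for whatever your pipeline yields:

import matplotlib.pyplot as plt
import torch

def show_clip(video, mean, std):
    # video: (C, T, H, W), already normalized; un-normalize for display
    clip = video.permute(1, 2, 3, 0)  # -> (T, H, W, C)
    clip = (clip * torch.tensor(std) + torch.tensor(mean)).clamp(0, 1)
    fig, axes = plt.subplots(2, 8, figsize=(16, 4))
    for ax, frame in zip(axes.flat, clip):
        ax.imshow(frame)
        ax.axis("off")
    plt.show()

# train_dataset / image_processor: placeholders for your own objects
sample = next(iter(train_dataset))
show_clip(sample["video"], image_processor.image_mean, image_processor.image_std)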
2.) I would add a lot of regularization: weight decay, plus heavier transforms with color jitter and maybe blur. If your actions keep their signal under a horizontal flip, I would try that as well. Use a smaller learning rate and train for more epochs. A sketch of such a pipeline follows.
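For example, built on the pytorchvideo/torchvision transforms the notebook already uses (a sketch: image_processor is the notebook's VideoMAE image processor, and the Permute steps are needed because torchvision's ColorJitter and GaussianBlur expect channels before height/width on a per-frame basis):

from pytorchvideo.transforms import (
    ApplyTransformToKey, Normalize, Permute, UniformTemporalSubsample,
)
from torchvision.transforms import (
    ColorJitter, Compose, GaussianBlur, Lambda, RandomHorizontalFlip, Resize,
)

train_transform = ApplyTransformToKey(
    key="video",
    transform=Compose([
        UniformTemporalSubsample(16),
        Lambda(lambda x: x / 255.0),
        Permute((1, 0, 2, 3)),  # (C,T,H,W) -> (T,C,H,W) for per-frame image ops
        ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2),
        GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
        RandomHorizontalFlip(p=0.5),  # only if a flipped action keeps its label
        Permute((1, 0, 2, 3)),  # back to (C,T,H,W)
        Normalize(image_processor.image_mean, image_processor.image_std),
        Resize((224, 224)),  # fixed resize in place of the removed RandomCrop
    ]),
)

Note that ColorJitter samples its factors once per call, so the same jitter is applied to every frame of a clip, which keeps the video temporally consistent.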
I have had some success with:
config = {
    "weight_decay": 0.1,
    "gradient_accumulation_steps": 2,
    "fp16": True,
    "lr_scheduler_type": "linear",  # try linear instead of cosine
    "warmup_ratio": 0.1,
    "max_grad_norm": 1.0,
}
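These keys drop straight into the notebook's TrainingArguments; a sketch, where the lower learning rate and higher epoch count are just examples of point 2 and "videomae-finetune" is a placeholder:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="videomae-finetune",  # placeholder
    learning_rate=1e-5,
    num_train_epochs=30,
    **config,
)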
But I have seen this same issue and would love to know if you have any success! Good luck.