Model Card for CLIP_COCO

Model Description

Model Summary

CLIP_COCO is a model presented in the BiVLC paper for experimentation. It was fine-tuned with the OpenCLIP framework, starting from the CLIP ViT-B-32 model pre-trained by 'openai'. The purpose of this fine-tuning is to provide a baseline against which to compare the CLIP_TROHN-Text and CLIP_TROHN-Img models. Hyperparameters (see the configuration sketch after this list):

  • Learning rate: 1e-6.
  • Scheduler: Cosine scheduler with 50 warmup steps.
  • Optimizer: AdamW optimizer with beta1 = 0.9, beta2 = 0.98, eps = 1e-6 and weight decay = 0.1.
  • Loss function: InfoNCE Loss.
  • Batch size: 400, i.e. each batch contains 400 images and 400 captions (a 400 x 400 image-caption similarity matrix).
  • Epochs: We fine-tune all models for 10 epochs and use validation accuracy as the model selection criterion, i.e. we select the checkpoint with the highest accuracy on the corresponding validation set.
  • Data: The COCO 2017 train split.
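The snippet below is a minimal sketch of how these hyperparameters map onto an OpenCLIP fine-tuning step. It is not the authors' training script: the dataloader, total step count, and device handling are omitted or assumed, and only the optimizer, scheduler, and loss wiring follow the values listed above.

```python
import math
import torch
import open_clip
from open_clip.loss import ClipLoss

# Start from the OpenAI-pretrained ViT-B-32 checkpoint, as described in the model card.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# AdamW with the hyperparameters listed above.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-6, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.1
)

# Cosine schedule with 50 warmup steps. total_steps is an assumption: it depends
# on the size of the COCO 2017 train split and the batch size of 400.
warmup_steps, total_steps = 50, 10_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# InfoNCE (CLIP contrastive) loss over the 400 x 400 image-caption similarity matrix.
criterion = ClipLoss()

def training_step(images, texts):
    # images: (400, 3, 224, 224) tensor; texts: tokenized captions of shape (400, 77)
    image_features = model.encode_image(images, normalize=True)
    text_features = model.encode_text(texts, normalize=True)
    loss = criterion(image_features, text_features, model.logit_scale.exp())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```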

Evaluation Data

The model is evaluated on BiVLC.
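As a rough guide to evaluation, the fine-tuned weights can be loaded back into the same ViT-B-32 architecture and used to score image-caption pairs. This is only a sketch: the checkpoint filename ("CLIP_COCO.pt") and state-dict layout are assumptions, so check the repository files for the actual names. The scoring step ranks candidate captions against an image by cosine similarity, the standard CLIP retrieval setup.

```python
import torch
import open_clip
from PIL import Image

# Base architecture matching the fine-tuned model.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Hypothetical checkpoint name; replace with the actual file from this repository.
checkpoint = torch.load("CLIP_COCO.pt", map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)
model.load_state_dict(state_dict)
model.eval()

# Score an image against candidate captions (e.g. a positive and a hard negative).
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
texts = tokenizer(["a positive caption", "a hard negative caption"])
with torch.no_grad():
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(texts, normalize=True)
    scores = (image_features @ text_features.T).squeeze(0)  # higher = better match
print(scores)
```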

Licensing Information

This work is licensed under the MIT License.

Citation Information

If you find this model useful, please consider citing our paper:

@misc{miranda2024bivlc,
      title={BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval}, 
      author={Imanol Miranda and Ander Salaberria and Eneko Agirre and Gorka Azkune},
      year={2024},
      eprint={2406.09952},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}