---
tags:
- image-to-text
- image-captioning
license: apache-2.0
metrics:
- rouge
datasets:
- Mozilla/flickr30k-transformed-captions
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
base_model:
- google/vit-base-patch16-224-in21k

model-index:
- name: mozilla/distilvit
  results:
  - task:
      type: image-to-text
      name: Image To Text
    dataset:
      name: Mozilla/flickr30k-transformed-captions
      type: Mozilla/flickr30k-transformed-captions
    metrics:
    - name: ROUGE-1
      type: rouge
      value: 43.006
      verified: true
    - name: ROUGE-2
      type: rouge
      value: 16.9939
      verified: true
    - name: ROUGE-L
      type: rouge
      value: 38.8923
      verified: true
    - name: ROUGE-LSUM
      type: rouge
      value: 38.8877
      verified: true
    - name: loss
      type: loss
      value: 0.19939416646957397
    - name: gen_len
      type: gen_len
      value: 11.327256736227712
      verified: true
---

# distilvit

This model is a work in progress. It is a fine-tuned version of the following base models:

- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- a distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
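
The encoder/decoder pairing above can be tried through the `transformers` image-to-text pipeline. The following is a minimal sketch, not an official usage recipe: it assumes the checkpoint is published on the Hub as `Mozilla/distilvit`, and that the installed `transformers` version still ships the `image-to-text` pipeline. The helper names `build_captioner` and `caption` are hypothetical.

```python
from typing import Any, Optional

# Assumed Hub id for this checkpoint; adjust if the model lives elsewhere.
MODEL_ID = "Mozilla/distilvit"


def build_captioner(model_id: str = MODEL_ID) -> Any:
    """Create an image-to-text pipeline (downloads the weights on first use)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    return pipeline("image-to-text", model=model_id)


def caption(image: Any, captioner: Optional[Any] = None) -> str:
    """Return the generated caption for one image (path, URL, or PIL image)."""
    if captioner is None:
        captioner = build_captioner()
    # For a single image the pipeline returns a list of dicts
    # with a "generated_text" key.
    outputs = captioner(image)
    return outputs[0]["generated_text"].strip()
```

Calling `caption("https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg")` should return a short caption for the first widget image. Build the captioner once with `build_captioner()` and pass it in when captioning many images, so the model is not reloaded on every call.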

This model was trained on:

- [Flickr30k debiased](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions-gpt4o)
- [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
- [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)
- A debiased version of COCO 2017: https://cocodataset.org

You can find the code used to create the model here: https://github.com/mozilla/distilvit

# training results

- eval/gen_len 14.99729
- eval/loss 0.17093
- eval/meteor 0.51479
- eval/rouge1 57.8066
- eval/rouge2 35.0888
- eval/rougeL 52.9138
- eval/rougeLsum 52.9101
- eval/runtime 760.2135
- eval/samples_per_second 11.18
- eval/steps_per_second 0.112
- train/epoch 8.0
- train/global_step 11752
- train/learning_rate 0.0
- train/loss 0.1034
- train/total_flos 1.518634875573869e+20
- train/train_runtime 91405.9053
- train/train_loss 0.14875
- train/train_samples_per_second 12.855
- train/train_steps_per_second 0.129
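
A few quantities can be derived from the logged values above. The sketch below is only arithmetic on those numbers. It assumes the runtimes are reported in seconds, which is the usual convention for these logs but is not stated in this card. The samples-per-step figures are estimates from rounded throughput values, not a confirmed batch size.

```python
# Values copied from the training results list.
results = {
    "eval/samples_per_second": 11.18,
    "eval/steps_per_second": 0.112,
    "train/epoch": 8.0,
    "train/global_step": 11752,
    "train/train_runtime": 91405.9053,  # assumed to be in seconds
    "train/train_samples_per_second": 12.855,
    "train/train_steps_per_second": 0.129,
}


def derived_stats(r: dict) -> dict:
    """Compute wall-clock time and per-step figures from the logged metrics."""
    return {
        # Total training time in hours (seconds / 3600).
        "train_hours": r["train/train_runtime"] / 3600,
        # Optimizer steps per epoch.
        "steps_per_epoch": r["train/global_step"] / r["train/epoch"],
        # Samples processed per step, estimated from the two throughput rates.
        "train_samples_per_step": r["train/train_samples_per_second"]
        / r["train/train_steps_per_second"],
        "eval_samples_per_step": r["eval/samples_per_second"]
        / r["eval/steps_per_second"],
    }


stats = derived_stats(results)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the run took roughly 25.4 hours at 1469 steps per epoch, and both throughput ratios come out close to 100 samples per step.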