
GIT-base fine-tuned for Narrative Image Captioning

GIT-base fine-tuned on the HL Narratives dataset to generate high-level narrative descriptions of images.

Model fine-tuning 🏋️

  • Trained for 3 epochs
  • lr: 5e-5
  • Adam optimizer
  • half-precision (fp16)
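
The settings above could be wired into a plain PyTorch training step roughly as follows. This is a minimal sketch under stated assumptions, not the author's training script: `HPARAMS`, `build_optimizer` and `training_step` are hypothetical names, each batch is assumed to hold `pixel_values`, `input_ids` and `attention_mask`, and fp16 is approximated with `torch.autocast` plus a gradient scaler passed in by the caller.

```python
# Hyperparameters as listed in the card; the dict itself is illustrative.
HPARAMS = {"epochs": 3, "lr": 5e-5, "optimizer": "Adam", "fp16": True}


def build_optimizer(model):
    """Create the Adam optimizer with the learning rate from the card."""
    import torch  # imported lazily so the config above is usable without PyTorch

    return torch.optim.Adam(model.parameters(), lr=HPARAMS["lr"])


def training_step(model, batch, optimizer, scaler):
    """Run one mixed-precision update and return the scalar loss (assumed batch layout)."""
    import torch

    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=HPARAMS["fp16"]):
        # Captioning is trained as causal language modeling,
        # so the caption token ids are also passed as labels.
        loss = model(
            pixel_values=batch["pixel_values"],
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["input_ids"],
        ).loss
    # Scale the loss to avoid fp16 gradient underflow, then step and update the scaler.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

A caller would create the scaler once (for example with `torch.cuda.amp.GradScaler()`) and call `training_step` for every batch over `HPARAMS["epochs"]` passes of the training set.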

Test set metrics 🧾

| CIDEr  | SacreBLEU  | ROUGE-L|
|--------|------------|--------|
| 75.78  |   11.11    |  27.61 |
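
For readers unfamiliar with the ROUGE-L column, the sketch below shows the usual longest-common-subsequence F-measure computed on whitespace tokens. It is only an illustration: the card does not state which implementation or tokenization produced the reported scores, and `lcs_length` and `rouge_l_f1` are hypothetical helper names. Scores here fall in [0, 1], whereas the table appears to use a 0-100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two captions, using lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens that are in the LCS
    recall = lcs / len(ref)      # share of reference tokens that are in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("she is on the beach", "she is posing on the beach")
print(f"ROUGE-L F1: {score:.3f}")
```

A corpus-level figure like the one in the table would typically average such per-caption scores over the test set.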

Model in Action 🚀

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

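# Load the processor and model from the Hub using the full repository id
processor = AutoProcessor.from_pretrained("michelecafagna26/git-base-captioning-ft-hl-narratives")
model = AutoModelForCausalLM.from_pretrained("michelecafagna26/git-base-captioning-ft-hl-narratives").to("cuda")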

img_url = 'https://datasets-server.huggingface.co/assets/michelecafagna26/hl/--/default/train/0/image/image.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')


inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_length=50,
            do_sample=True,
            top_k=120,
            top_p=0.9,
            early_stopping=True,
            num_return_sequences=1)

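# batch_decode returns one string per generated sequence; take the first.
# Sampling is stochastic, so the caption varies between runs.
processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

>>> "she is posing for a photo on the beach, she wants to post on her social media."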

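The `top_k` and `top_p` arguments in the generation call restrict which tokens can be sampled at each step. The toy sketch below illustrates the idea on a hand-made distribution: keep the `top_k` most likely tokens, then keep the smallest prefix of them whose cumulative probability reaches `top_p`, and renormalize. The function names and the four-word vocabulary are hypothetical, and the library itself works on logits over the full vocabulary rather than on a small dict.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Return a renormalized distribution after top-k, then nucleus (top-p) filtering."""
    # Top-k: keep only the k most probable tokens and renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def sample_token(filtered, rng=random):
    """Draw one token from the filtered distribution."""
    tokens, weights = zip(*filtered.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution (made up for illustration).
next_token_probs = {"beach": 0.5, "photo": 0.3, "sea": 0.15, "dog": 0.05}
filtered = top_k_top_p_filter(next_token_probs, top_k=4, top_p=0.7)
print(sorted(filtered), sample_token(filtered))
```

With `top_k=120` and `top_p=0.9`, as in the call above, the candidate pool is much larger, which is why repeated runs can produce noticeably different captions.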
BibTeX and citation info

@inproceedings{cafagna2023hl,
  title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and {R}ationales},
  author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
  booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
  address={Prague, Czech Republic},
  year={2023}
}

Model size: 177M params (Safetensors; tensor types: I64, F32)