A Persian image captioning model built on a ViT + RoBERTa encoder-decoder architecture and trained on flickr30k-fa (created by Sajjad Ayoubi). The encoder (ViT) was initialized from google/vit-base-patch16-224 (https://huggingface.co/google/vit-base-patch16-224) and the decoder (RoBERTa) from HooshvareLab/roberta-fa-zwnj-base (https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base).
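
For readers unfamiliar with this kind of pairing, the following is a minimal sketch of how a ViT encoder and a RoBERTa decoder can be wired together using the `transformers` library's `VisionEncoderDecoderModel`. It is an illustration under stated assumptions, not the training code behind this model: the tiny layer sizes, vocabulary size, and dummy inputs are made up so that the example runs without downloading any weights.

```python
import torch
from transformers import (
    RobertaConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Assumption: toy-sized, randomly initialized configs for illustration only.
# The real model starts from the pretrained checkpoints named above, e.g. via
# VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_id, decoder_id).
encoder_config = ViTConfig(
    image_size=32,
    patch_size=16,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
decoder_config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=34,
)

# Combining the configs marks the RoBERTa side as a decoder and enables
# cross-attention over the image features produced by the ViT encoder.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)
model.eval()

# Dummy inputs: one 32x32 RGB image and a 4-token caption prefix (hypothetical ids).
pixel_values = torch.randn(1, 3, 32, 32)
decoder_input_ids = torch.tensor([[0, 5, 6, 7]])

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

# Logits have shape (batch, caption_length, vocab_size).
logits_shape = tuple(outputs.logits.shape)
print(logits_shape)
```

The decoder predicts one distribution over the vocabulary per caption position, conditioned on the image through cross-attention; at inference time these predictions are decoded token by token into a caption.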

Usage

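Install Hezar:

pip install hezar

Then load the model and caption an image:

from hezar.models import Model

# Download the model from the Hugging Face Hub and load it
model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")
# Generate a Persian caption for an image file
captions = model.predict("example_image.jpg")
print(captions)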