---
language:
- fa
library_name: hezar
tags:
- image-to-text
- hezar
metrics:
- wer
pipeline_tag: image-to-text
datasets:
- hezarai/flickr30k-fa
---

A Persian image captioning model built on a ViT + RoBERTa encoder-decoder architecture and trained on [flickr30k-fa](https://www.kaggle.com/datasets/sajjadayobi360/flickrfa) (created by Sajjad Ayoubi).

The encoder (ViT) was initialized from https://huggingface.co/google/vit-base-patch16-224 and the decoder (RoBERTa) was initialized from https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base.
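
For readers curious how a ViT encoder and a RoBERTa decoder can be wired together, here is a minimal sketch. It assumes the Hugging Face `transformers` library and its `VisionEncoderDecoderModel` class, not Hezar's own implementation, which may differ. The function name `build_vit_roberta_captioner` is hypothetical, and using the CLS token as the decoder start token is an assumption, not something this model card states.

```python
def build_vit_roberta_captioner(
    encoder_name: str = "google/vit-base-patch16-224",
    decoder_name: str = "HooshvareLab/roberta-fa-zwnj-base",
):
    """Combine a pretrained ViT encoder with a pretrained RoBERTa decoder.

    Hypothetical helper: illustrates the general recipe only.
    """
    # Imported lazily so defining this helper does not require transformers.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(decoder_name)

    # Loads both checkpoints and adds cross-attention layers to the decoder.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_name, decoder_name
    )

    # Generation needs to know how sequences start and how they are padded.
    # Assumption: start decoding from the tokenizer's CLS token.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

A model assembled this way has freshly initialized cross-attention weights, so it would still need fine-tuning on a captioning dataset before it produces useful captions.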

## Usage

```
pip install hezar
```

```python
from hezar.models import Model

# Load the pretrained captioning model from the Hugging Face Hub
model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")

# Generate a caption for a local image file
captions = model.predict("example_image.jpg")
print(captions)
```
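
The metadata above lists `wer` (word error rate) as the evaluation metric. As a rough illustration of what that number measures, here is a minimal pure-Python sketch that compares a predicted caption against a reference caption. The function `word_error_rate` is a hypothetical helper, not part of Hezar, and splitting on whitespace is an assumption; the actual evaluation may tokenize or normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 0.5. Lower values mean the predicted caption is closer to the reference.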