---
language:
  - en
  - zh
  - de
  - fr
library_name: sentence-transformers
license: apache-2.0
---

# ZeroNLG

Without any labeled downstream pairs for training, ZeroNLG is a unified framework that handles multiple natural language generation (NLG) tasks in a zero-shot manner, including image-to-text, video-to-text, and text-to-text generation across English, Chinese, German, and French.

**Pre-training data**: a machine-translated version of CC3M, including

- 1.1M English sentences
- 1.1M English-Chinese pairs
- 1.1M English-German pairs
- 1.1M English-French pairs

**Paper**: ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

**Authors**: Bang Yang\*, Fenglin Liu\*, Yuexian Zou, Xian Wu, Yaowei Wang, David A. Clifton

## Quick Start

Please follow our GitHub repo to prepare the environment first.

```python
from zeronlg import ZeroNLG

# Automatically download the model from the Hugging Face Hub
# Note: this model is specifically pre-trained for visual captioning
model = ZeroNLG('zeronlg-4langs-vc')

# `images` can be a remote image URL, a local image/video file, etc.
# `lang` should be one of English ('en'), Chinese ('zh'), German ('de'), and French ('fr')
url = 'https://img2.baidu.com/it/u=1856500011,1563285204&fm=253&fmt=auto&app=138&f=JPEG?w=667&h=500'
caption = model.forward(images=url, lang='en', num_beams=3, task='caption')
# caption = "dogs play in the snow"

caption = model.forward(images=url, lang='zh', num_beams=3, task='caption')
# caption = "狗 在 雪 地 里 玩 耍"

# Alternatively, you can call the specific forward function
caption = model.forward_caption(images=url, lang='en', num_beams=3)
```
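As noted in the comments above, `images` also accepts a local image or video file. A minimal follow-up sketch using the same `forward_caption` call; the file path below is a placeholder, not a file shipped with this repo:

```python
# Caption a local video file in German with the same API shown above.
video_caption = model.forward_caption(
    images='path/to/your_video.mp4',  # hypothetical local path (replace with your own file)
    lang='de',
    num_beams=3,
)
```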

## Zero-Shot Performance

### Visual captioning

**Model**: zeronlg-4langs-vc's multilingual decoder + CLIP's ViT-B-32 image encoder.

| Dataset | Language | Type | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | English | Image | 46.4 | 27.2 | 15.5 | 8.9 | 13.0 | 31.3 | 21.0 | 7.6 |
| Flickr30K | Chinese | Image | 45.3 | 25.5 | 14.6 | 8.4 | - | 31.8 | 18.0 | - |
| Flickr30K | German | Image | 41.9 | 21.1 | 11.2 | 5.7 | - | 21.2 | 17.1 | - |
| Flickr30K | French | Image | 19.8 | 9.5 | 5.0 | 2.8 | - | 18.6 | 24.8 | - |
| COCO | English | Image | 47.5 | 29.0 | 16.8 | 9.6 | 14.4 | 34.9 | 29.9 | 8.7 |
| MSR-VTT | English | Video | 52.2 | 31.9 | 16.6 | 8.7 | 15.0 | 35.4 | 9.9 | - |
| VATEX | English | Video | 42.2 | 24.6 | 12.5 | 6.3 | 11.7 | 29.3 | 9.1 | - |
| VATEX | Chinese | Video | 41.9 | 24.3 | 13.7 | 7.1 | - | 29.6 | 9.8 | - |

Notes:

- For non-English visual captioning, we do not report METEOR and SPICE, because they rely on English synonym matching and named entity recognition by default.
- For video captioning in English, we do not report SPICE, following common practice.
- Flickr30K-Chinese is known as Flickr30K-CN.
- Flickr30K-German and Flickr30K-French are introduced in Multi30K.
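The scores above are standard COCO-style captioning metrics. For reference, here is a minimal scoring sketch using the `pycocoevalcap` toolkit; this is our illustration only (the ZeroNLG repo ships its own evaluation scripts, which may tokenize and preprocess differently), and the captions below are placeholder data:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Toy references (gts) and predictions (res), keyed by example id.
gts = {
    '0': ['a dog plays in the snow', 'two dogs playing in snow'],
    '1': ['a man rides a bicycle down the street'],
}
res = {
    '0': ['dogs play in the snow'],
    '1': ['a man riding a bike on the road'],
}

bleu, _ = Bleu(4).compute_score(gts, res)    # list of BLEU@1..BLEU@4
rouge, _ = Rouge().compute_score(gts, res)   # ROUGE-L
cider, _ = Cider().compute_score(gts, res)   # CIDEr-D as implemented in the COCO caption toolkit
print(bleu, rouge, cider)
```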

### Cross-modal retrieval

**Model**: zeronlg-4langs-vc's multilingual encoder + CLIP's ViT-B-32 image encoder.

| Dataset | Language | Type | I2T R@1 | I2T R@5 | I2T R@10 | I2T Mean | T2I R@1 | T2I R@5 | T2I R@10 | T2I Mean | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | English | Image | 75.2 | 93.9 | 97.1 | 88.7 | 57.1 | 82.2 | 89.1 | 76.1 | 82.4 |
| Flickr30K | Chinese | Image | 75.0 | 93.0 | 96.7 | 88.2 | 53.8 | 79.8 | 87.1 | 73.6 | 80.9 |
| Flickr30K | German | Image | 70.9 | 91.1 | 95.7 | 85.9 | 47.5 | 74.1 | 83.1 | 68.2 | 77.1 |
| Flickr30K | French | Image | 55.8 | 83.4 | 91.5 | 76.9 | 56.6 | 81.2 | 88.4 | 75.4 | 76.2 |
| COCO 5K | English | Image | 45.0 | 71.1 | 80.3 | 65.5 | 28.2 | 53.3 | 64.5 | 48.7 | 57.1 |
| COCO 1K | English | Image | 66.0 | 89.1 | 94.6 | 83.2 | 47.5 | 77.5 | 87.9 | 71.0 | 77.1 |
| MSR-VTT | English | Video | 32.0 | 55.5 | 65.8 | 51.1 | 17.9 | 36.4 | 45.5 | 33.3 | 42.2 |
| VATEX | English | Video | 26.9 | 52.8 | 64.2 | 48.0 | 19.2 | 41.2 | 52.7 | 37.7 | 42.8 |
| VATEX | Chinese | Video | 40.6 | 70.9 | 82.7 | 64.7 | 28.8 | 58.0 | 70.1 | 52.3 | 58.5 |

Notes:

- I2T: image-to-text retrieval, with the image as the query to search for similar texts.
- T2I: text-to-image retrieval, with the text as the query to search for similar images.
- R@K: recall rate among the top-K retrieved candidates (a computation sketch follows these notes).
- Avg.: average of R@{1,5,10} over both directions.
- Retrieval uses the same test sets as visual captioning, except COCO 1K, which splits the original test set into 5 folds and reports performance averaged over the folds.
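To make the R@K columns concrete, here is a minimal Recall@K sketch over a precomputed image-text similarity matrix. It assumes a simplified one-to-one pairing (in practice Flickr30K/COCO pair each image with several captions), and in this model's setting `sim` would hold cosine similarities between CLIP ViT-B-32 image embeddings and the multilingual encoder's sentence embeddings:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@K for image-to-text retrieval.

    `sim` is a [num_images x num_texts] similarity matrix where the matching
    text for image i sits at column i (simplified one-to-one assumption).
    """
    ranks = np.argsort(-sim, axis=1)              # candidate texts sorted by similarity, best first
    gold = np.arange(sim.shape[0])[:, None]       # index of each image's matching text
    hits = (ranks[:, :k] == gold).any(axis=1)     # was the match inside the top-K?
    return 100.0 * hits.mean()

# Toy example with random similarities (placeholder data, not the paper's results).
sim = np.random.rand(100, 100)
print([round(recall_at_k(sim, k), 1) for k in (1, 5, 10)])
# Text-to-image retrieval is the same computation applied to sim.T
```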

## Citation

```bibtex
@article{Yang2023ZeroNLG,
   title={ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation},
   author={Yang, Bang and Liu, Fenglin and Zou, Yuexian and Wu, Xian and Wang, Yaowei and Clifton, David A.},
   journal={arXiv preprint arXiv:2303.06458},
   year={2023}
}
```