---
language:
  - en
  - zh
  - de
  - fr
library_name: sentence-transformers
license: apache-2.0
---

# ZeroNLG

Without any labeled downstream pairs for training, ZeroNLG is a unified framework that handles multiple natural language generation (NLG) tasks in a zero-shot manner, including image-to-text, video-to-text, and text-to-text generation across English, Chinese, German, and French.

**Pre-training data**: a machine-translated version of CC3M, including

- 1.1M English sentences
- 1.1M English-Chinese pairs
- 1.1M English-German pairs
- 1.1M English-French pairs

**Paper**: ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

**Authors**: Bang Yang\*, Fenglin Liu\*, Yuexian Zou, Xian Wu, Yaowei Wang, David A. Clifton

## Quick Start

Please follow our GitHub repo to prepare the environment first.

```python
from zeronlg import ZeroNLG

# Automatically download the model from the Hugging Face Hub
# Note: this model is specifically pre-trained for visual captioning
model = ZeroNLG('zeronlg-4langs-vc')

# `images` can be a remote image URL, a local image/video file, etc.
# `lang` should be one of English ('en'), Chinese ('zh'), German ('de'), and French ('fr')
url = 'https://img2.baidu.com/it/u=1856500011,1563285204&fm=253&fmt=auto&app=138&f=JPEG?w=667&h=500'
caption = model.forward(images=url, lang='en', num_beams=3, task='caption')
# caption = "dogs play in the snow"

caption = model.forward(images=url, lang='zh', num_beams=3, task='caption')
# caption = "狗 在 雪 地 里 玩 耍"

# Alternatively, you can call the specific forward function
caption = model.forward_caption(images=url, lang='en', num_beams=3)
```
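As noted in the comments above, `images` also accepts a local image or video file. A minimal follow-up sketch using the same `forward_caption` call; the file path below is a placeholder, not a file shipped with this repo:

```python
# Caption a local video file in German with the same API shown above.
video_caption = model.forward_caption(
    images='path/to/your_video.mp4',  # hypothetical local path (replace with your own file)
    lang='de',
    num_beams=3,
)
```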

## Zero-Shot Performance

### Visual captioning

**Model**: zeronlg-4langs-vc's multilingual decoder + CLIP's ViT-B-32 image encoder.

| Dataset | Language | Type | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | English | Image | 46.4 | 27.2 | 15.5 | 8.9 | 13.0 | 31.3 | 21.0 | 7.6 |
| Flickr30K | Chinese | Image | 45.3 | 25.5 | 14.6 | 8.4 | - | 31.8 | 18.0 | - |
| Flickr30K | German | Image | 41.9 | 21.1 | 11.2 | 5.7 | - | 21.2 | 17.1 | - |
| Flickr30K | French | Image | 19.8 | 9.5 | 5.0 | 2.8 | - | 18.6 | 24.8 | - |
| COCO | English | Image | 47.5 | 29.0 | 16.8 | 9.6 | 14.4 | 34.9 | 29.9 | 8.7 |
| MSR-VTT | English | Video | 52.2 | 31.9 | 16.6 | 8.7 | 15.0 | 35.4 | 9.9 | - |
| VATEX | English | Video | 42.2 | 24.6 | 12.5 | 6.3 | 11.7 | 29.3 | 9.1 | - |
| VATEX | Chinese | Video | 41.9 | 24.3 | 13.7 | 7.1 | - | 29.6 | 9.8 | - |

Notes:

- For non-English visual captioning, we do not report METEOR and SPICE, because they rely on English synonym matching and named entity recognition by default.
- For video captioning in English, we do not report SPICE, following common practice.
- Flickr30K-Chinese is known as Flickr30K-CN.
- Flickr30K-German and Flickr30K-French are introduced in Multi30K.
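The scores above are standard COCO-style captioning metrics. For reference, here is a minimal scoring sketch using the `pycocoevalcap` toolkit; this is our illustration only (the ZeroNLG repo ships its own evaluation scripts, which may tokenize and preprocess differently), and the captions below are placeholder data:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Toy references (gts) and predictions (res), keyed by example id.
gts = {
    '0': ['a dog plays in the snow', 'two dogs playing in snow'],
    '1': ['a man rides a bicycle down the street'],
}
res = {
    '0': ['dogs play in the snow'],
    '1': ['a man riding a bike on the road'],
}

bleu, _ = Bleu(4).compute_score(gts, res)    # list of BLEU@1..BLEU@4
rouge, _ = Rouge().compute_score(gts, res)   # ROUGE-L
cider, _ = Cider().compute_score(gts, res)   # CIDEr-D as implemented in the COCO caption toolkit
print(bleu, rouge, cider)
```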

### Cross-modal retrieval

**Model**: zeronlg-4langs-vc's multilingual encoder + CLIP's ViT-B-32 image encoder.

| Dataset | Language | Type | I2T R@1 | I2T R@5 | I2T R@10 | I2T Mean | T2I R@1 | T2I R@5 | T2I R@10 | T2I Mean | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | English | Image | 75.2 | 93.9 | 97.1 | 88.7 | 57.1 | 82.2 | 89.1 | 76.1 | 82.4 |
| Flickr30K | Chinese | Image | 75.0 | 93.0 | 96.7 | 88.2 | 53.8 | 79.8 | 87.1 | 73.6 | 80.9 |
| Flickr30K | German | Image | 70.9 | 91.1 | 95.7 | 85.9 | 47.5 | 74.1 | 83.1 | 68.2 | 77.1 |
| Flickr30K | French | Image | 55.8 | 83.4 | 91.5 | 76.9 | 56.6 | 81.2 | 88.4 | 75.4 | 76.2 |
| COCO 5K | English | Image | 45.0 | 71.1 | 80.3 | 65.5 | 28.2 | 53.3 | 64.5 | 48.7 | 57.1 |
| COCO 1K | English | Image | 66.0 | 89.1 | 94.6 | 83.2 | 47.5 | 77.5 | 87.9 | 71.0 | 77.1 |
| MSR-VTT | English | Video | 32.0 | 55.5 | 65.8 | 51.1 | 17.9 | 36.4 | 45.5 | 33.3 | 42.2 |
| VATEX | English | Video | 26.9 | 52.8 | 64.2 | 48.0 | 19.2 | 41.2 | 52.7 | 37.7 | 42.8 |
| VATEX | Chinese | Video | 40.6 | 70.9 | 82.7 | 64.7 | 28.8 | 58.0 | 70.1 | 52.3 | 58.5 |

Notes:

- I2T: image-to-text retrieval, with the image as the query to search for similar texts.
- T2I: text-to-image retrieval, with the text as the query to search for similar images.
- R@K: recall rate among the top-K retrieved candidates (a computation sketch follows these notes).
- Avg.: average of R@{1,5,10} over both directions.
- Retrieval uses the same test sets as visual captioning, except COCO 1K, which splits the original test set into 5 folds and reports performance averaged over the folds.
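To make the R@K columns concrete, here is a minimal Recall@K sketch over a precomputed image-text similarity matrix. It assumes a simplified one-to-one pairing (in practice Flickr30K/COCO pair each image with several captions), and in this model's setting `sim` would hold cosine similarities between CLIP ViT-B-32 image embeddings and the multilingual encoder's sentence embeddings:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@K for image-to-text retrieval.

    `sim` is a [num_images x num_texts] similarity matrix where the matching
    text for image i sits at column i (simplified one-to-one assumption).
    """
    ranks = np.argsort(-sim, axis=1)              # candidate texts sorted by similarity, best first
    gold = np.arange(sim.shape[0])[:, None]       # index of each image's matching text
    hits = (ranks[:, :k] == gold).any(axis=1)     # was the match inside the top-K?
    return 100.0 * hits.mean()

# Toy example with random similarities (placeholder data, not the paper's results).
sim = np.random.rand(100, 100)
print([round(recall_at_k(sim, k), 1) for k in (1, 5, 10)])
# Text-to-image retrieval is the same computation applied to sim.T
```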

## Citation

```bibtex
@article{Yang2023ZeroNLG,
   title={ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation},
   author={Yang, Bang and Liu, Fenglin and Zou, Yuexian and Wu, Xian and Wang, Yaowei and Clifton, David A.},
   journal={arXiv preprint arXiv:2303.06458},
   year={2023}
}
```