	---
	language:
	- en
	- zh
	- de
	- fr
	library_name: sentence-transformers
	license: apache-2.0
	---

	# ZeroNLG

	ZeroNLG is a unified framework that handles multiple natural language generation (NLG) tasks in a zero-shot manner, without any labeled downstream pairs for training. It covers image-to-text, video-to-text, and text-to-text generation across English, Chinese, German, and French.

	[Pre-training data](https://drive.google.com/file/d/1yCLpDLDO5TnoqfyHKwgi51Fw66QliOvM/view?usp=share_link): a machine-translated version of [CC3M](https://huggingface.co/datasets/conceptual_captions), including:
	- 1.1M English sentences
	- 1.1M English-Chinese pairs
	- 1.1M English-German pairs
	- 1.1M English-French pairs

	Paper: [ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation](https://arxiv.org/abs/2303.06458)

	Authors: Bang Yang, Fenglin Liu, Yuexian Zou, Xian Wu, Yaowei Wang, David A. Clifton



	## Quick Start
	Please follow our [GitHub repo](https://github.com/yangbang18/ZeroNLG) to prepare the environment first.

	```python
	from zeronlg import ZeroNLG

	# Automatically download the model from the Hugging Face Hub
	# Note: this model is specifically pre-trained for visual captioning
	model = ZeroNLG('zeronlg-4langs-vc')

	# `images` can be a remote image URL, a local image/video file, etc.
	# `lang` should be one of English ('en'), Chinese ('zh'), German ('de'), or French ('fr')
	url = 'https://img2.baidu.com/it/u=1856500011,1563285204&fm=253&fmt=auto&app=138&f=JPEG?w=667&h=500'
	caption = model.forward(images=url, lang='en', num_beams=3, task='caption')
	# caption = "dogs play in the snow"

	caption = model.forward(images=url, lang='zh', num_beams=3, task='caption')
	# caption = "狗在雪地里玩耍"

	# Alternatively, you can call the task-specific forward function
	caption = model.forward_caption(images=url, lang='en', num_beams=3)
	```
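To caption the same input in every supported language, the `forward` call above can be wrapped in a small loop. This is a minimal sketch: `caption_in_all_languages` and `SUPPORTED_LANGS` are hypothetical names introduced here for illustration and are not part of the `zeronlg` package. The sketch only assumes that `model.forward` accepts the `images`, `lang`, `num_beams`, and `task` arguments shown in the Quick Start.

```python
# Language codes listed in the Quick Start above.
SUPPORTED_LANGS = ('en', 'zh', 'de', 'fr')


def caption_in_all_languages(model, images, langs=SUPPORTED_LANGS, num_beams=3):
    """Hypothetical helper: return {lang: caption} for each requested language.

    `model` is any object exposing the `forward` call from the Quick Start.
    """
    captions = {}
    for lang in langs:
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language code: {lang!r}")
        # Same call as in the Quick Start, with only `lang` varying.
        captions[lang] = model.forward(
            images=images, lang=lang, num_beams=num_beams, task='caption'
        )
    return captions
```

With the `model` and `url` defined in the Quick Start, `caption_in_all_languages(model, url)` would return one caption per language code.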

	## Zero-Shot Performance
	### Visual captioning
	Model: [zeronlg-4langs-vc](https://huggingface.co/yangbang18/zeronlg-4langs-vc)'s multilingual decoder + CLIP's ViT-B-32 image encoder.

	| Dataset | Language | Type | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
	| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
	| [Flickr30K](https://paperswithcode.com/paper/from-image-descriptions-to-visual-denotations) | English | Image | 46.4 | 27.2 | 15.5 | 8.9 | 13.0 | 31.3 | 21.0 | 7.6 |
	| Flickr30K | [Chinese](https://dl.acm.org/doi/abs/10.1145/3123266.3123366) | Image | 45.3 | 25.5 | 14.6 | 8.4 | - | 31.8 | 18.0 | - |
	| Flickr30K | [German](https://github.com/multi30k/dataset) | Image | 41.9 | 21.1 | 11.2 | 5.7 | - | 21.2 | 17.1 | - |
	| Flickr30K | [French](https://github.com/multi30k/dataset) | Image | 19.8 | 9.5 | 5.0 | 2.8 | - | 18.6 | 24.8 | - |
	| [COCO](https://paperswithcode.com/paper/microsoft-coco-captions-data-collection-and) | English | Image | 47.5 | 29.0 | 16.8 | 9.6 | 14.4 | 34.9 | 29.9 | 8.7 |
	| [MSR-VTT](https://paperswithcode.com/paper/msr-vtt-a-large-video-description-dataset-for) | English | Video | 52.2 | 31.9 | 16.6 | 8.7 | 15.0 | 35.4 | 9.9 | - |
	| [VATEX](https://paperswithcode.com/paper/vatex-a-large-scale-high-quality-multilingual) | English | Video | 42.2 | 24.6 | 12.5 | 6.3 | 11.7 | 29.3 | 9.1 | - |
	| VATEX | Chinese | Video | 41.9 | 24.3 | 13.7 | 7.1 | - | 29.6 | 9.8 | - |

	Notes:
	- For non-English visual captioning, we do not report METEOR and SPICE, because by default they rely on synonym matching and named entity recognition in English.
	- For video captioning in English, we do not report SPICE, following common practice.
	- `Flickr30K-Chinese` is known as `Flickr30K-CN`.
	- `Flickr30K-German` and `Flickr30K-French` are introduced in `Multi30K`.

	### Cross-modal retrieval
	Model: [zeronlg-4langs-vc](https://huggingface.co/yangbang18/zeronlg-4langs-vc)'s multilingual encoder + CLIP's ViT-B-32 image encoder.

	| Dataset | Language | Type | I2T R@1 | I2T R@5 | I2T R@10 | I2T Mean | T2I R@1 | T2I R@5 | T2I R@10 | T2I Mean | Avg. |
	| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
	| [Flickr30K](https://paperswithcode.com/paper/from-image-descriptions-to-visual-denotations) | English | Image | 75.2 | 93.9 | 97.1 | 88.7 | 57.1 | 82.2 | 89.1 | 76.1 | 82.4 |
	| Flickr30K | [Chinese](https://dl.acm.org/doi/abs/10.1145/3123266.3123366) | Image | 75.0 | 93.0 | 96.7 | 88.2 | 53.8 | 79.8 | 87.1 | 73.6 | 80.9 |
	| Flickr30K | [German](https://github.com/multi30k/dataset) | Image | 70.9 | 91.1 | 95.7 | 85.9 | 47.5 | 74.1 | 83.1 | 68.2 | 77.1 |
	| Flickr30K | [French](https://github.com/multi30k/dataset) | Image | 55.8 | 83.4 | 91.5 | 76.9 | 56.6 | 81.2 | 88.4 | 75.4 | 76.2 |
	| [COCO 5K](https://paperswithcode.com/paper/microsoft-coco-captions-data-collection-and) | English | Image | 45.0 | 71.1 | 80.3 | 65.5 | 28.2 | 53.3 | 64.5 | 48.7 | 57.1 |
	| COCO 1K | English | Image | 66.0 | 89.1 | 94.6 | 83.2 | 47.5 | 77.5 | 87.9 | 71.0 | 77.1 |
	| [MSR-VTT](https://paperswithcode.com/paper/msr-vtt-a-large-video-description-dataset-for) | English | Video | 32.0 | 55.5 | 65.8 | 51.1 | 17.9 | 36.4 | 45.5 | 33.3 | 42.2 |
	| [VATEX](https://paperswithcode.com/paper/vatex-a-large-scale-high-quality-multilingual) | English | Video | 26.9 | 52.8 | 64.2 | 48.0 | 19.2 | 41.2 | 52.7 | 37.7 | 42.8 |
	| VATEX | Chinese | Video | 40.6 | 70.9 | 82.7 | 64.7 | 28.8 | 58.0 | 70.1 | 52.3 | 58.5 |

	Notes:
	- `I2T`: image-to-text retrieval, with an image as the query to search for similar texts
	- `T2I`: text-to-image retrieval, with a text as the query to search for similar images
	- `R@K`: recall rate within the top-K candidates
	- `Mean`: average of `R@{1,5,10}` in one direction
	- `Avg.`: average of `R@{1,5,10}` over both directions
	- Retrieval uses the same test sets as visual captioning, except `COCO-1K`, which splits the original test set into 5 folds and reports performance averaged over the 5 folds.
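The retrieval metrics in the notes above can be illustrated with a short sketch. It is not the evaluation code used for the table. It assumes a square similarity matrix in which the correct candidate for query `i` is candidate `i`, whereas the real benchmarks pair each image with several captions. The names `recall_at_k` and `summarize` are hypothetical.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the ground truth for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def summarize(i2t_recalls, t2i_recalls):
    """Return (I2T Mean, T2I Mean, Avg.) from per-direction R@{1,5,10} values."""
    i2t_mean = sum(i2t_recalls) / len(i2t_recalls)
    t2i_mean = sum(t2i_recalls) / len(t2i_recalls)
    all_recalls = list(i2t_recalls) + list(t2i_recalls)
    avg = sum(all_recalls) / len(all_recalls)
    return i2t_mean, t2i_mean, avg


# Flickr30K (English) row from the table above.
i2t_mean, t2i_mean, avg = summarize([75.2, 93.9, 97.1], [57.1, 82.2, 89.1])
print(round(i2t_mean, 1), round(t2i_mean, 1), round(avg, 1))  # 88.7 76.1 82.4
```

Applying `summarize` to the other rows reproduces their `Mean` and `Avg.` columns in the same way, up to rounding.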

	## Citation
	```bibtex
	@article{Yang2023ZeroNLG,
	title={ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation},
	author={Yang, Bang and Liu, Fenglin and Zou, Yuexian and Wu, Xian and Wang, Yaowei and Clifton, David A.},
	journal={arXiv preprint arXiv:2303.06458},
	year={2023}
	}
	```