

	<br />
	<p align="center">
	<h1 align="center">M-BERT Base ViT-B</h1>

	<p align="center">
	<a href="https://github.com/FreddeFrallan/Multilingual-CLIP/tree/main/Model%20Cards/M-BERT%20Base%20ViT-B">Github Model Card</a>
	</p>
	</p>

	## Usage
	To use this model together with the original CLIP vision encoder, you need to download the code and the additional linear weights from the [Multilingual-CLIP Github](https://github.com/FreddeFrallan/Multilingual-CLIP).

	Once this is done, you can load and use the model with the following code:
	```python
	from src import multilingual_clip

	model = multilingual_clip.load_model('M-BERT-Base-ViT')
	embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
	print(embeddings.shape)
	# Yields: torch.Size([3, 512])
	```
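
Because the text embeddings are tuned to live in the same space as the CLIP ViT-B/32 embeddings, a typical use is to rank captions against an image by cosine similarity. The following is a minimal sketch of that comparison using plain Python lists. The vectors are tiny made-up placeholders rather than real model outputs, and `cosine_similarity` and `rank_captions` are hypothetical helper names, not part of the Multilingual-CLIP code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_embedding, text_embeddings):
    """Return caption indices sorted from most to least similar, plus the scores."""
    scores = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Placeholder 3-dimensional vectors (assumption: real embeddings would come from
# `model([...])` for text and from the CLIP ViT-B/32 vision encoder for images).
image_embedding = [1.0, 0.0, 0.0]
text_embeddings = [
    [0.0, 1.0, 0.0],  # unrelated caption
    [2.0, 0.0, 0.0],  # same direction as the image
    [1.0, 1.0, 0.0],  # partially related caption
]

order, scores = rank_captions(image_embedding, text_embeddings)
print(order)  # [1, 2, 0]
```

Cosine similarity ignores vector length, so the second caption scores highest even though its norm differs from the image vector's.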

	<!-- ABOUT THE PROJECT -->
	## About
	A [BERT-base-multilingual](https://huggingface.co/bert-base-multilingual-cased) model fine-tuned so that its embeddings for [69 languages](https://github.com/FreddeFrallan/Multilingual-CLIP/blob/main/Model%20Cards/M-BERT%20Base%2069/Fine-Tune-Languages.md) match the embedding space of the CLIP text encoder that accompanies the ViT-B/32 vision encoder. <br>
	A full list of the 100 languages used during pre-training can be found [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages), and a list of the 69 languages used during fine-tuning can be found in [SupportedLanguages.md](https://github.com/FreddeFrallan/Multilingual-CLIP/blob/main/Model%20Cards/M-BERT%20Base%2069/Fine-Tune-Languages.md).
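
One way to picture matching the embedding space is as a teacher-student setup: the CLIP text encoder embeds an English caption, the multilingual model embeds its translation, and training reduces the distance between the two vectors. The sketch below assumes a mean squared error objective, which this card does not state. `mse_loss` and the toy vectors are illustrative only.

```python
def mse_loss(student_embedding, teacher_embedding):
    """Mean squared error between two equal-length embedding vectors."""
    if len(student_embedding) != len(teacher_embedding):
        raise ValueError("embeddings must have the same dimension")
    squared = [(s - t) ** 2 for s, t in zip(student_embedding, teacher_embedding)]
    return sum(squared) / len(squared)


# Toy vectors (assumption): the teacher vector stands in for the CLIP text
# encoder's embedding of an English caption, the student vector for the
# multilingual model's embedding of the translated caption.
teacher = [0.5, -1.0, 2.0, 0.0]
student = [0.0, -1.0, 1.0, 0.0]

loss = mse_loss(student, teacher)
print(loss)  # 0.3125
```

A loss of zero would mean the multilingual embedding of the translation coincides exactly with the CLIP embedding of the English original.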

	Training data pairs were generated by sampling 40k sentences for each language from the combined descriptions of [GCC](https://ai.google.com/research/ConceptualCaptions/) + [MSCOCO](https://cocodataset.org/#home) + [VizWiz](https://vizwiz.org/tasks-and-datasets/image-captioning/) and translating them into the corresponding language.
	All translation was done using the [AWS translate service](https://aws.amazon.com/translate/). The quality of these translations has not yet been analyzed, but it can be assumed to vary across the 69 languages.
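
The data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `build_training_pairs`, `translate_fn` and `fake_translate` are hypothetical names, the translation step is passed in as a plain callable so that any service could back it, and the sample size is shrunk from 40k for the demo.

```python
import random


def build_training_pairs(captions, languages, samples_per_language, translate_fn, seed=0):
    """Sample English captions per language and pair each with its translation."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    pairs = []
    for lang in languages:
        # Draw a separate sample without replacement for every target language.
        sampled = rng.sample(captions, samples_per_language)
        for english in sampled:
            pairs.append((lang, english, translate_fn(english, lang)))
    return pairs


def fake_translate(text, target_language):
    """Stand-in for a real translation call; only tags the text with the language."""
    return f"[{target_language}] {text}"


# Stand-in for the combined GCC + MSCOCO + VizWiz caption pool (assumption).
captions = [f"caption {i}" for i in range(100)]
pairs = build_training_pairs(captions, ["sv", "de", "ru"], 5, fake_translate)
print(len(pairs))  # 15
```

Each tuple holds the language code, the English caption, and its translation. In a teacher-student setup the English side would be embedded by the CLIP text encoder and the translated side by the multilingual model.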