|
--- |
|
tags: |
|
- clip |
|
- llm-jp-clip |
|
- japanese-clip |
|
library_name: open_clip |
|
pipeline_tag: zero-shot-image-classification |
|
license: |
|
- apache-2.0 |
|
datasets: |
|
- llm-jp/relaion2B-en-research-safe-japanese-translation |
|
language: |
|
- ja |
|
--- |
|
# Model Card for llm-jp-clip-vit-base-patch16 |
|
|
|
# Model Details |
|
|
|
A Japanese CLIP model trained with [OpenCLIP](https://github.com/mlfoundations/open_clip) on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation), a Japanese translation of the English subset of ReLAION-5B ([relaion2B-en-research-safe](https://huggingface.co/datasets/laion/relaion2B-en-research-safe)) produced with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).
|
|
|
The total number of parameters of this model is 248M. |
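With `open_clip_torch` installed (see How to Use below), this count can be checked against the released checkpoint. A minimal sketch; the text-tower attribute name depends on which open_clip class is instantiated, so it is looked up defensively:

```python
import open_clip

model, _ = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

def millions(module):
    # Parameter count of a module, in millions.
    return sum(p.numel() for p in module.parameters()) / 1e6

print(f"total:         {millions(model):.0f}M")     # expected to be roughly 248M
print(f"image encoder: {millions(model.visual):.0f}M")

# open_clip exposes the text tower as `.text` (CustomTextCLIP, used with HF text
# encoders) or `.transformer` (the built-in CLIP text tower); check both.
text_tower = getattr(model, 'text', None) or getattr(model, 'transformer', None)
if text_tower is not None:
    print(f"text encoder:  {millions(text_tower):.0f}M")
```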
|
|
|
# How to Use |
|
|
|
## Installation |
|
|
|
```bash |
|
$ pip install open_clip_torch |
|
``` |
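A quick way to confirm the installation (a minimal check; it assumes the installed `open_clip` package exports a `__version__` string, which recent releases do):

```python
import open_clip

# Any reasonably recent open_clip_torch release with hf-hub support should work.
print(open_clip.__version__)
```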
|
|
|
## Zero-shot Image Classification |
|
```python
import open_clip
import requests
import torch
from PIL import Image

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Example image (two cats from COCO) and candidate Japanese labels: "cat", "dog", "bird".
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
text = tokenizer(["猫", "犬", "鳥"])

with torch.no_grad(), torch.cuda.amp.autocast():
    # Encode both modalities and L2-normalize, so the dot product below is a cosine similarity.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities and softmax them into per-label probabilities.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])
```
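Because `text_probs` keeps the order of the input labels, it can be zipped back with them for readable output. This continues the snippet above; the commented-out prompt-template line ("{label}の写真", roughly "a photo of a {label}") is a common, model-agnostic CLIP trick rather than anything specific to this checkpoint:

```python
labels = ["猫", "犬", "鳥"]

# Optional: many CLIP models score short phrased prompts better than bare nouns.
# If used, re-run the tokenizer / encode_text / softmax steps with the new prompts.
# text = tokenizer([f"{label}の写真" for label in labels])

for label, prob in zip(labels, text_probs[0].tolist()):
    print(f"{label}: {prob:.4f}")  # e.g. 猫 ≈ 0.994, 犬 ≈ 0.005, 鳥 ≈ 0.001
```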
|
|
|
References:
|
- [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/en/open_clip), HuggingFace Docs |
|
- OpenCLIP [repository](https://github.com/mlfoundations/open_clip) |
|
|
|
|
|
# Training Details |
|
|
|
## Model Architecture |
|
|
|
- Text Encoder: RoBERTa base with llm-jp-tokenizer (see the tokenization sketch below)
|
- Image Encoder: ViT-B/16 |
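
A quick way to sanity-check the text side is to run the tokenizer returned by `open_clip.get_tokenizer`, which wraps the llm-jp tokenizer and pads/truncates to the model's context length. A minimal sketch; the exact context length comes from the checkpoint's config:

```python
import open_clip

tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Token ids are returned as a fixed-length tensor ready for model.encode_text.
tokens = tokenizer(["猫", "犬と猫が写った写真"])  # "cat", "a photo showing a dog and a cat"
print(tokens.shape)     # (2, context_length)
print(tokens[0][:10])   # first few token ids of the first text
```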
|
|
|
## Training Data |
|
|
|
This model was trained on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation).

Because the image download success rate was 70%, the dataset used for training contains 1.45 billion samples; we iterated over it for 9 epochs, so the model saw roughly 13 billion samples in total.
|
|
|
# Evaluation |
|
|
|
Evaluation Code: https://github.com/llm-jp/clip-eval |
|
|
|
**Table:** Performance of each model in zero-shot image classification and image-text retrieval tasks. **Bold** indicates first place, and _underline_ indicates second place. |
|
|
|
|
|
| Model | Params (M) | ImageNet | Recruit | CIFAR10 | CIFAR100 | Food101 | Caltech101 | XM3600 I → T | XM3600 T → I | Avg. | |
|
|-----------------------------|-------------|----------|---------|---------|----------|---------|------------|-------------|-------------|------| |
|
| **Japanese CLIP** | | | | | | | | | | | |
|
| [Rinna ViT-B/16](https://huggingface.co/rinna/japanese-clip-vit-b-16) | 196 | 50.6 | 39.9 | 90.7 | 64.0 | 53.2 | 84.6 | 53.8 | 54.0 | 61.4 | |
|
| [Rinna ViT-B/16 cloob](https://huggingface.co/rinna/japanese-cloob-vit-b-16) | 196 | 54.6 | 41.6 | 88.2 | 60.3 | 57.2 | 80.2 | 53.4 | 53.4 | 61.1 | |
|
| [LY ViT-B/16](https://huggingface.co/line-corporation/clip-japanese-base) | 196 | 52.0 | **83.8** | 96.3 | 76.7 | 73.9 | **88.4** | **76.9** | **78.0** | **78.3** | |
|
| [**llm-jp-ViT-B/16**](https://huggingface.co/llm-jp/llm-jp-clip-vit-base-patch16) | 248 | 54.2 | 59.4 | 91.8 | 69.2 | _82.2_ | 85.6 | 73.6 | 72.7 | 73.6 | |
|
| [StabilityAI ViT-L/16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) | 414 | **62.4** | 70.5 | _97.6_ | **84.1** | 74.0 | 86.7 | 67.3 | 66.0 | 76.1 | |
|
| [**llm-jp-ViT-L/14**](https://huggingface.co/llm-jp/llm-jp-clip-vit-large-patch14) | 467 | _59.5_ | 62.9 | 96.4 | 77.0 | **88.2** | _87.8_ | 74.1 | _74.1_ | _77.5_ | |
|
| **Multilingual CLIP** | | | | | | | | | | | |
|
| [SigLIP B/16-256 multi](https://huggingface.co/google/siglip-base-patch16-256-multilingual) | 370 | 51.9 | 71.2 | 92.4 | 65.8 | 78.6 | 85.6 | 45.9 | 43.0 | 66.8 | |
|
| [jina-clip-v2](https://huggingface.co/jinaai/jina-clip-v2) | 865 | 35.8 | 48.1 | 95.1 | 58.3 | 52.0 | 69.4 | 67.3 | 66.4 | 61.6 | |
|
| [LAION ViT-H/14 multi](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) | 1193 | 53.0 | _74.5_ | **97.9** | _78.4_ | 74.3 | 85.1 | _75.0_ | 72.0 | 76.3 | |
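
For context, the XM3600 I → T and T → I columns report cross-modal retrieval: each image queries the pool of captions (I → T) and each caption queries the pool of images (T → I). The sketch below shows how a recall@1-style score can be computed from CLIP features; it is not the official llm-jp/clip-eval implementation, the exact metric reported in the table may differ, and random features stand in for real encodings:

```python
import torch
import torch.nn.functional as F

def recall_at_k(sim: torch.Tensor, k: int = 1) -> float:
    # sim[i, j] = similarity of query i to candidate j; candidate i is the ground-truth match.
    topk = sim.topk(k, dim=1).indices                                  # (N, k)
    hit = (topk == torch.arange(sim.size(0)).unsqueeze(1)).any(dim=1)  # did query i retrieve item i?
    return hit.float().mean().item()

# Stand-in features; in practice use the L2-normalized outputs of
# model.encode_image / model.encode_text as in the zero-shot example above.
image_features = F.normalize(torch.randn(8, 512), dim=-1)
text_features = F.normalize(torch.randn(8, 512), dim=-1)

sim = image_features @ text_features.T           # (N_images, N_texts) cosine similarities
print("I -> T recall@1:", recall_at_k(sim))      # rank captions for each image
print("T -> I recall@1:", recall_at_k(sim.T))    # rank images for each caption
```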
|
|
|
|
|
# License
|
[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
|
|
Please also refer to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), as the training data was translated with gemma-2-9b-it. We use Gemma solely for translation. Under the definition of "Model Derivatives" in Section 1.1(e), our model was not trained "in order to cause that model to perform similarly to Gemma," so we have concluded that it does not need to inherit the Gemma license.
|
|
|
# Citation |
|
|
|
BibTeX:
|
```bibtex
|
@inproceedings{sugiura2025clip, |
|
author = {杉浦 一瑳 and 栗田 修平 and 小田 悠介 and 河原大輔 and 岡崎 直観}, |
|
month = mar, |
|
series = {言語処理学会第31回年次大会 (NLP2025)}, |
|
title = {オープンLLMによる翻訳を活用した日本語 CLIP の開発}, |
|
year = {2025} |
|
} |
|
|
|
``` |