|
--- |
|
tags: |
|
- clip |
|
- llm-jp-clip |
|
- japanese-clip |
|
library_name: open_clip |
|
pipeline_tag: zero-shot-image-classification |
|
license: |
|
- apache-2.0 |
|
datasets: |
|
- llm-jp/relaion2B-en-research-safe-japanese-translation |
|
language: |
|
- ja |
|
--- |
|
# Model Card for llm-jp-clip-vit-base-patch16 |
|
|
|
# Model Details |
|
|
|
A Japanese CLIP model trained with [OpenCLIP](https://github.com/mlfoundations/open_clip) on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation), a Japanese translation of the English subset of ReLAION-5B ([relaion2B-en-research-safe](https://huggingface.co/datasets/laion/relaion2B-en-research-safe)) produced with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).
|
|
|
The total number of parameters of this model is 248M. |
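With `open_clip_torch` installed (see How to Use below), this count can be checked against the released checkpoint. A minimal sketch; the text-tower attribute name depends on which open_clip class is instantiated, so it is looked up defensively:

```python
import open_clip

model, _ = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

def millions(module):
    # Parameter count of a module, in millions.
    return sum(p.numel() for p in module.parameters()) / 1e6

print(f"total:         {millions(model):.0f}M")     # expected to be roughly 248M
print(f"image encoder: {millions(model.visual):.0f}M")

# open_clip exposes the text tower as `.text` (CustomTextCLIP, used with HF text
# encoders) or `.transformer` (the built-in CLIP text tower); check both.
text_tower = getattr(model, 'text', None) or getattr(model, 'transformer', None)
if text_tower is not None:
    print(f"text encoder:  {millions(text_tower):.0f}M")
```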
|
|
|
# How to Use |
|
|
|
## Installation |
|
|
|
```bash |
|
$ pip install open_clip_torch |
|
``` |
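A quick way to confirm the installation (a minimal check; it assumes the installed `open_clip` package exports a `__version__` string, which recent releases do):

```python
import open_clip

# Any reasonably recent open_clip_torch release with hf-hub support should work.
print(open_clip.__version__)
```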
|
|
|
## Zero-shot Image Classification |
|
```python
import open_clip
import requests
import torch
from PIL import Image

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Example image (two cats from COCO) and candidate Japanese labels: "cat", "dog", "bird".
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
text = tokenizer(["猫", "犬", "鳥"])

with torch.no_grad(), torch.cuda.amp.autocast():
    # Encode both modalities and L2-normalize, so the dot product below is a cosine similarity.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities and softmax them into per-label probabilities.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])
```
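Because `text_probs` keeps the order of the input labels, it can be zipped back with them for readable output. This continues the snippet above; the commented-out prompt-template line ("{label}の写真", roughly "a photo of a {label}") is a common, model-agnostic CLIP trick rather than anything specific to this checkpoint:

```python
labels = ["猫", "犬", "鳥"]

# Optional: many CLIP models score short phrased prompts better than bare nouns.
# If used, re-run the tokenizer / encode_text / softmax steps with the new prompts.
# text = tokenizer([f"{label}の写真" for label in labels])

for label, prob in zip(labels, text_probs[0].tolist()):
    print(f"{label}: {prob:.4f}")  # e.g. 猫 ≈ 0.994, 犬 ≈ 0.005, 鳥 ≈ 0.001
```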
|
|
|
References:
|
- [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/en/open_clip), HuggingFace Docs |
|
- OpenCLIP [repository](https://github.com/mlfoundations/open_clip) |
|
|
|
|
|
# Training Details |
|
|
|
## Model Architecture |
|
|
|
- Text Encoder: RoBERTa base with llm-jp-tokenizer (see the tokenization sketch below)
|
- Image Encoder: ViT-B/16 |
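
A quick way to sanity-check the text side is to run the tokenizer returned by `open_clip.get_tokenizer`, which wraps the llm-jp tokenizer and pads/truncates to the model's context length. A minimal sketch; the exact context length comes from the checkpoint's config:

```python
import open_clip

tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Token ids are returned as a fixed-length tensor ready for model.encode_text.
tokens = tokenizer(["猫", "犬と猫が写った写真"])  # "cat", "a photo showing a dog and a cat"
print(tokens.shape)     # (2, context_length)
print(tokens[0][:10])   # first few token ids of the first text
```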
|
|
|
## Training Data |
|
|
|
This model was trained on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation).

Because the image download success rate was 70%, the dataset used for training contains 1.45 billion samples; we iterated over it for 9 epochs, so the model saw roughly 13 billion samples in total.
|
|
|
# Evaluation |
|
|
|
Evaluation Code: https://github.com/llm-jp/clip-eval |
|
|
|
**Table:** Performance of each model in zero-shot image classification and image-text retrieval tasks. **Bold** indicates first place, and _underline_ indicates second place. |
|
|
|
|
|
| Model | Params (M) | ImageNet | Recruit | CIFAR10 | CIFAR100 | Food101 | Caltech101 | XM3600 I → T | XM3600 T → I | Avg. | |
|
|-----------------------------|-------------|----------|---------|---------|----------|---------|------------|-------------|-------------|------| |
|
| **Japanese CLIP** | | | | | | | | | | | |
|
| [Rinna ViT-B/16](https://huggingface.co/rinna/japanese-clip-vit-b-16) | 196 | 50.6 | 39.9 | 90.7 | 64.0 | 53.2 | 84.6 | 53.8 | 54.0 | 61.4 | |
|
| [Rinna ViT-B/16 cloob](https://huggingface.co/rinna/japanese-cloob-vit-b-16) | 196 | 54.6 | 41.6 | 88.2 | 60.3 | 57.2 | 80.2 | 53.4 | 53.4 | 61.1 | |
|
| [LY ViT-B/16](https://huggingface.co/line-corporation/clip-japanese-base) | 196 | 52.0 | **83.8** | 96.3 | 76.7 | 73.9 | **88.4** | **76.9** | **78.0** | **78.3** | |
|
| [**llm-jp-ViT-B/16**](https://huggingface.co/llm-jp/llm-jp-clip-vit-base-patch16) | 248 | 54.2 | 59.4 | 91.8 | 69.2 | _82.2_ | 85.6 | 73.6 | 72.7 | 73.6 | |
|
| [StabilityAI ViT-L/16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) | 414 | **62.4** | 70.5 | _97.6_ | **84.1** | 74.0 | 86.7 | 67.3 | 66.0 | 76.1 | |
|
| [**llm-jp-ViT-L/14**](https://huggingface.co/llm-jp/llm-jp-clip-vit-large-patch14) | 467 | _59.5_ | 62.9 | 96.4 | 77.0 | **88.2** | _87.8_ | 74.1 | _74.1_ | _77.5_ | |
|
| **Multilingual CLIP** | | | | | | | | | | | |
|
| [SigLIP B/16-256 multi](https://huggingface.co/google/siglip-base-patch16-256-multilingual) | 370 | 51.9 | 71.2 | 92.4 | 65.8 | 78.6 | 85.6 | 45.9 | 43.0 | 66.8 | |
|
| [jina-clip-v2](https://huggingface.co/jinaai/jina-clip-v2) | 865 | 35.8 | 48.1 | 95.1 | 58.3 | 52.0 | 69.4 | 67.3 | 66.4 | 61.6 | |
|
| [LAION ViT-H/14 multi](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) | 1193 | 53.0 | _74.5_ | **97.9** | _78.4_ | 74.3 | 85.1 | _75.0_ | 72.0 | 76.3 | |
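
For context, the XM3600 I → T and T → I columns report cross-modal retrieval: each image queries the pool of captions (I → T) and each caption queries the pool of images (T → I). The sketch below shows how a recall@1-style score can be computed from CLIP features; it is not the official llm-jp/clip-eval implementation, the exact metric reported in the table may differ, and random features stand in for real encodings:

```python
import torch
import torch.nn.functional as F

def recall_at_k(sim: torch.Tensor, k: int = 1) -> float:
    # sim[i, j] = similarity of query i to candidate j; candidate i is the ground-truth match.
    topk = sim.topk(k, dim=1).indices                                  # (N, k)
    hit = (topk == torch.arange(sim.size(0)).unsqueeze(1)).any(dim=1)  # did query i retrieve item i?
    return hit.float().mean().item()

# Stand-in features; in practice use the L2-normalized outputs of
# model.encode_image / model.encode_text as in the zero-shot example above.
image_features = F.normalize(torch.randn(8, 512), dim=-1)
text_features = F.normalize(torch.randn(8, 512), dim=-1)

sim = image_features @ text_features.T           # (N_images, N_texts) cosine similarities
print("I -> T recall@1:", recall_at_k(sim))      # rank captions for each image
print("T -> I recall@1:", recall_at_k(sim.T))    # rank images for each caption
```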
|
|
|
|
|
# License
|
[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
|
|
Please also refer to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), as the training data was translated with gemma-2-9b-it. We use Gemma solely for translation. Under the definition of "Model Derivatives" in Section 1.1(e), our model was not trained "in order to cause that model to perform similarly to Gemma," so we have concluded that it does not need to inherit the Gemma license.
|
|
|
# Citation |
|
|
|
BibTeX:
|
```bibtex
|
@inproceedings{sugiura2025clip, |
|
author = {杉浦 一瑳 and 栗田 修平 and 小田 悠介 and 河原大輔 and 岡崎 直観}, |
|
month = mar, |
|
series = {言語処理学会第31回年次大会 (NLP2025)}, |
|
title = {オープンLLMによる翻訳を活用した日本語 CLIP の開発}, |
|
year = {2025} |
|
} |
|
|
|
``` |