---
tags:
- clip
- llm-jp-clip
- japanese-clip
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license:
- apache-2.0
datasets:
- llm-jp/relaion2B-en-research-safe-japanese-translation
language:
- ja
---
# Model Card for llm-jp-clip-vit-base-patch16

# Model Details

Japanese CLIP model trained with [OpenCLIP](https://github.com/mlfoundations/open_clip) on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation), a Japanese translation of [relaion2B-en-research-safe](https://huggingface.co/datasets/laion/relaion2B-en-research-safe) (the English subset of ReLAION-5B), produced with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).

The total number of parameters of this model is 248M.

# How to Use

## Installation

```bash
$ pip install open_clip_torch
```

## Zero-shot Image Classification
```python
import open_clip
import torch
from PIL import Image
import requests

# Load the pretrained model, its image preprocessing pipeline, and the tokenizer from the Hugging Face Hub
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Sample image (two cats) from the COCO val2017 set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)

# Candidate labels: "cat", "dog", "bird"
text = tokenizer(["猫", "犬", "鳥"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])
```

References:
- [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/en/open_clip), HuggingFace Docs
- OpenCLIP [repository](https://github.com/mlfoundations/open_clip)
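
The same encoders can also be used for image-text retrieval (as in the XM3600 results reported below): encode an image and a set of candidate captions, then rank the captions by cosine similarity. The snippet below is a minimal sketch; the caption strings are illustrative examples, not part of any benchmark.

```python
import open_clip
import torch
from PIL import Image
import requests

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0)

# Illustrative candidate captions: "two cats sleeping on a sofa",
# "a dog running in a park", "a bird perched on a branch"
captions = ["ソファで眠る2匹の猫", "公園を走る犬", "枝にとまる鳥"]
text = tokenizer(captions)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the image and each caption
    similarity = (image_features @ text_features.T).squeeze(0)

best = similarity.argmax().item()
print("Best caption:", captions[best])
```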


# Training Details

## Model Architecture

- Text Encoder: RoBERTa base with llm-jp-tokenizer
- Image Encoder: ViT-B/16
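
A quick way to check how the 248M parameters are split between these two towers is to inspect the loaded OpenCLIP model. The attribute names below (`model.visual` for the image encoder, `model.text` for a wrapped text encoder) reflect how OpenCLIP commonly structures its models and are an assumption, not something stated in this card.

```python
import open_clip

model, _ = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# model.visual holds the ViT-B/16 image encoder; the RoBERTa-base text tower
# is exposed as model.text when OpenCLIP keeps it as a separate module.
print(f"image encoder : {count_params(model.visual) / 1e6:.0f}M params")
if hasattr(model, "text"):
    print(f"text encoder  : {count_params(model.text) / 1e6:.0f}M params")
print(f"total         : {count_params(model) / 1e6:.0f}M params")
```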

## Training Data

This model is trained on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation).
Because roughly 70% of the image URLs could be downloaded successfully, the resulting training set contains 1.45 billion samples; the model was trained for 9 epochs, i.e. about 13 billion samples seen in total.

# Evaluation

Evaluation Code: https://github.com/llm-jp/clip-eval
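
For context, zero-shot classification accuracy on benchmarks like those in the table below is typically computed by encoding every class name once and assigning each image to the class with the most similar text embedding. The loop below is a generic sketch of that procedure with placeholder class names and data loader; the exact prompts, label sets, and protocol are defined in the clip-eval repository linked above.

```python
import open_clip
import torch

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
model.eval()

# Placeholder class names ("cat", "dog", "bird"); real benchmarks supply their own label sets.
class_names = ["猫", "犬", "鳥"]

with torch.no_grad():
    class_features = model.encode_text(tokenizer(class_names))
    class_features /= class_features.norm(dim=-1, keepdim=True)

def zero_shot_accuracy(dataloader):
    """dataloader yields (preprocessed image batch, integer label batch)."""
    correct = total = 0
    with torch.no_grad():
        for images, labels in dataloader:
            image_features = model.encode_image(images)
            image_features /= image_features.norm(dim=-1, keepdim=True)
            predictions = (image_features @ class_features.T).argmax(dim=-1)
            correct += (predictions == labels).sum().item()
            total += labels.numel()
    return correct / total
```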

**Table:** Performance of each model in zero-shot image classification and image-text retrieval tasks. **Bold** indicates first place, and _underline_ indicates second place.


| Model                        | Params (M) | ImageNet | Recruit | CIFAR10 | CIFAR100 | Food101 | Caltech101 | XM3600 I → T | XM3600 T → I | Avg.  |
|-----------------------------|-------------|----------|---------|---------|----------|---------|------------|-------------|-------------|------|
| **Japanese CLIP**           |             |          |         |         |          |         |            |             |             |      |
| [Rinna ViT-B/16](https://huggingface.co/rinna/japanese-clip-vit-b-16)              | 196         | 50.6     | 39.9    | 90.7    | 64.0     | 53.2    | 84.6       | 53.8        | 54.0        | 61.4 |
| [Rinna ViT-B/16 cloob](https://huggingface.co/rinna/japanese-cloob-vit-b-16)        | 196         | 54.6     | 41.6    | 88.2    | 60.3     | 57.2    | 80.2       | 53.4        | 53.4        | 61.1 |
| [LY ViT-B/16](https://huggingface.co/line-corporation/clip-japanese-base)                 | 196         | 52.0     | **83.8** | 96.3    | 76.7     | 73.9    | **88.4**   | **76.9**    | **78.0**    | **78.3** |
| [**llm-jp-ViT-B/16**](https://huggingface.co/llm-jp/llm-jp-clip-vit-base-patch16)        | 248         | 54.2     | 59.4    | 91.8    | 69.2     | _82.2_   | 85.6       | 73.6        | 72.7        | 73.6 |
| [StabilityAI ViT-L/16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16)        | 414         | **62.4** | 70.5    | _97.6_   | **84.1** | 74.0    | 86.7       | 67.3        | 66.0        | 76.1 |
| [**llm-jp-ViT-L/14**](https://huggingface.co/llm-jp/llm-jp-clip-vit-large-patch14)        | 467         | _59.5_   | 62.9    | 96.4    | 77.0     | **88.2** | _87.8_      | 74.1        | _74.1_      | _77.5_ |
| **Multilingual CLIP**       |             |          |         |         |          |         |            |             |             |      |
| [SigLIP B/16-256 multi](https://huggingface.co/google/siglip-base-patch16-256-multilingual)       | 370         | 51.9     | 71.2    | 92.4    | 65.8     | 78.6    | 85.6       | 45.9        | 43.0        | 66.8 |
| [jina-clip-v2](https://huggingface.co/jinaai/jina-clip-v2)                | 865         | 35.8     | 48.1    | 95.1    | 58.3     | 52.0    | 69.4       | 67.3        | 66.4        | 61.6 |
| [LAION ViT-H/14 multi](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k)        | 1193        | 53.0     | _74.5_   | **97.9** | _78.4_   | 74.3    | 85.1       | _75.0_      | 72.0        | 76.3 |


# LICENSE
[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)


Please also refer to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), as the training data was translated with gemma-2-9b-it. We used Gemma solely for translation. Under the definition of "Model Derivatives" in Section 1.1(e), this model was not trained "in order to cause that model to perform similarly to Gemma", so we have concluded that it is not necessary to inherit the Gemma license.

# Citation

BibTeX:
```
@inproceedings{sugiura2025clip,
  author = {Issa Sugiura and Shuhei Kurita and Yusuke Oda and Daisuke Kawahara and Naoaki Okazaki},
  title  = {Development of Japanese CLIP Leveraging Translation by an Open LLM (オープンLLMによる翻訳を活用した日本語CLIPの開発)},
  series = {Proceedings of the 31st Annual Meeting of the Association for Natural Language Processing (NLP2025)},
  month  = mar,
  year   = {2025}
}
```