|
--- |
|
tags: |
|
- clip |
|
- llm-jp-clip |
|
- japanese-clip |
|
library_name: open_clip |
|
pipeline_tag: zero-shot-image-classification |
|
license: |
|
- apache-2.0 |
|
datasets: |
|
- laion/relaion2B-en-research-safe |
|
language: |
|
- ja |
|
--- |
|
# Model Card for llm-jp-clip-vit-base-patch16 |
|
|
|
# Model Details |
|
|
|
A CLIP ViT-B/16 model trained with [OpenCLIP](https://github.com/mlfoundations/open_clip) on a Japanese translation of the English subset of ReLAION-5B ([relaion2B-en-research-safe](https://huggingface.co/datasets/laion/relaion2B-en-research-safe)); the translation was produced with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).
|
|
|
This model has 248M parameters in total.
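
As a rough sanity check, the parameter count can be reproduced by loading the checkpoint with `open_clip` (installed as described under "How to Use" below) and summing tensor sizes. This is a minimal sketch; the exact figure may vary slightly depending on which tensors are counted:

```python
import open_clip

# Load the published checkpoint from the Hugging Face Hub.
model, _ = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Sum the element counts of all parameters (expected to be roughly 248M).
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e6:.0f}M")
```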
|
|
|
# How to Use |
|
|
|
## Installation |
|
|
|
```bash |
|
$ pip install open_clip_torch |
|
``` |
|
|
|
## Zero-shot Image Classification |
|
```python |
|
import open_clip |
|
|
|
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16') |
|
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16') |
|
|
|
import torch |
|
from PIL import Image |
|
import requests |
|
|
|
url = 'http://images.cocodataset.org/val2017/000000039769.jpg' |
|
image = Image.open(requests.get(url, stream=True).raw) |
|
image = preprocess(image).unsqueeze(0) |
|
text = tokenizer(["猫", "犬", "鳥"]) |
|
|
|
with torch.no_grad(), torch.cuda.amp.autocast(): |
|
image_features = model.encode_image(image) |
|
text_features = model.encode_text(text) |
|
image_features /= image_features.norm(dim=-1, keepdim=True) |
|
text_features /= text_features.norm(dim=-1, keepdim=True) |
|
|
|
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1) |
|
|
|
print("Label probs:", text_probs) |
|
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]]) |
|
``` |
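
Zero-shot accuracy often improves when class names are wrapped in a prompt template. The sketch below is our illustration (not an official recommendation from the model authors) using a simple Japanese template, `{}の写真` ("a photo of a {}"); it reuses `model`, `tokenizer`, and the normalized `image_features` from the example above:

```python
# Wrap each class name in a simple Japanese prompt template before tokenizing.
labels = ["猫", "犬", "鳥"]
text = tokenizer([f"{label}の写真" for label in labels])

with torch.no_grad(), torch.cuda.amp.autocast():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs with template:", text_probs)
```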
|
|
|
References:
|
- [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/en/open_clip), HuggingFace Docs |
|
- OpenCLIP [repository](https://github.com/mlfoundations/open_clip) |
|
|
|
|
|
# Training Details |
|
|
|
## Model Architecture |
|
|
|
- Text Encoder: RoBERTa base with llm-jp-tokenizer |
|
- Image Encoder: ViT-B/16 (see the sketch below for inspecting both encoders)
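
A minimal sketch for inspecting the two encoders of the loaded model, reusing `model` from the "How to Use" section. The attribute names `model.visual` and `model.text` are an assumption based on open_clip's layout for models with a Hugging Face text tower and may differ between open_clip versions:

```python
# Per-encoder parameter counts; attribute names assume open_clip's
# CustomTextCLIP layout (image tower under `visual`, HF text tower under `text`).
image_params = sum(p.numel() for p in model.visual.parameters())
text_params = sum(p.numel() for p in model.text.parameters())
print(f"Image encoder (ViT-B/16): {image_params / 1e6:.0f}M parameters")
print(f"Text encoder (RoBERTa base): {text_params / 1e6:.0f}M parameters")
```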
|
|
|
## Training Data |
|
|
|
We used a Japanese-translated version of the relaion2B-en-research-safe dataset. |
|
The translation was performed using gemma-2-9b-it. |
|
Because the image download success rate was 70%, the resulting dataset contained 1.45 billion samples; we trained on it for 9 epochs (about 13 billion samples seen in total).
|
|
|
# Evaluation |
|
|
|
Evaluation Code: https://github.com/llm-jp/clip-eval |
|
|
|
TODO: |
|
|
|
# License
|
[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
Please also see the Gemma Terms of Use (https://ai.google.dev/gemma/terms), as the training data was translated by [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).
|
|
|
> 3.3 Generated Output |
|
> |
|
> Google claims no rights in Outputs you generate using Gemma. You and your users are solely responsible for Outputs and their subsequent uses. |
|
|
|
# Citation |
|
|
|
Bibtex: |
|
``` |
|
TODO: |
|
``` |