---
tags:
  - clip
  - llm-jp-clip
  - japanese-clip
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license:
  - apache-2.0
datasets:
  - laion/relaion2B-en-research-safe
language:
  - ja
---

# Model Card for llm-jp-clip-vit-base-patch16

## Model Details

A CLIP ViT-B/16 model trained with OpenCLIP on a Japanese translation of the English subset of ReLAION-5B (https://huggingface.co/datasets/laion/relaion2B-en-research-safe); the translation was produced by gemma-2-9b-it.

The model has 248M parameters in total.
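
As a quick sanity check, the total can be recomputed after loading the checkpoint; a minimal sketch (it downloads the weights on first run):

```python
import open_clip

# Load the pretrained model from the Hugging Face Hub.
model, _ = open_clip.create_model_from_pretrained(
    'hf-hub:llm-jp/llm-jp-clip-vit-base-patch16'
)

# Sum element counts over all parameter tensors; expect roughly 248M.
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.0f}M parameters")
```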

## How to Use

### Installation

```bash
pip install open_clip_torch
```

### Zero-shot Image Classification

```python
import torch
from PIL import Image
import requests
import open_clip

# Load the pretrained model, its image preprocessing pipeline, and the tokenizer.
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Fetch a test image (two cats from COCO) and tokenize candidate Japanese labels.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
text = tokenizer(["猫", "犬", "鳥"])  # "cat", "dog", "bird"

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so that the dot product below is cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale similarities and convert to probabilities over the labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])
```
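
Zero-shot accuracy often improves when class names are wrapped in a prompt template rather than used bare. The sketch below continues from the snippet above and reuses its `model`, `tokenizer`, `image_features`, and `torch` import; the template "{}の写真" ("a photo of {}") is an illustrative choice, not one prescribed by this card:

```python
# Hedged sketch: apply a Japanese prompt template to each label before tokenizing.
labels = ["猫", "犬", "鳥"]
text = tokenizer([f"{label}の写真" for label in labels])  # "a photo of a <label>"

with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Map each label to its probability for the test image.
print(dict(zip(labels, text_probs[0].tolist())))
```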

Reference:

## Training Details

### Model Architecture

- Text Encoder: RoBERTa base with llm-jp-tokenizer (see the tokenizer sketch below)
- Image Encoder: ViT-B/16
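
The text tower can be exercised on its own through the tokenizer; a minimal sketch (the padded sequence length is whatever context length the checkpoint's config defines):

```python
import open_clip

tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Tokenize two Japanese captions; the result is a LongTensor of token ids,
# padded to the model's context length.
tokens = tokenizer(["猫の写真", "犬の写真"])  # "a photo of a cat" / "a photo of a dog"
print(tokens.shape)  # torch.Size([2, <context_length>])
```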

### Training Data

We used a Japanese-translated version of the relaion2B-en-research-safe dataset, translated with gemma-2-9b-it. Because only 70% of the images could be downloaded successfully, the resulting dataset contains 1.45 billion samples, which we trained on for 9 epochs (about 13 billion samples seen in total).
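
As a back-of-the-envelope check, the totals are consistent:

```python
# Samples seen in training: dataset size × number of epochs.
dataset_size = 1.45e9  # pairs remaining after the ~70% download success rate
epochs = 9
print(f"{dataset_size * epochs / 1e9:.2f}B")  # 13.05B ≈ 13 billion samples
```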

## Evaluation

Evaluation Code: https://github.com/llm-jp/clip-eval

TODO:

## LICENSE

The Apache License, Version 2.0

Please also see the Gemma Terms of Use (https://ai.google.dev/gemma/terms), as the training data was translated by gemma-2-9b-it. In particular:

> **3.3 Generated Output**
>
> Google claims no rights in Outputs you generate using Gemma. You and your users are solely responsible for Outputs and their subsequent uses.

## Citation

BibTeX:

TODO: