speed committed on
Commit 88ac7a0 · verified · 1 Parent(s): 3611ecd

Update README.md

Files changed (1)
  1. README.md +91 -2

README.md CHANGED
@@ -1,8 +1,97 @@
  ---
  tags:
  - clip
  library_name: open_clip
  pipeline_tag: zero-shot-image-classification
- license: mit
  ---
- # Model card for llm-jp-roberta-ViT-B-16-relaion-1.5B-lr5e-4-bs8k-accum4-20241205-epoch90
  ---
  tags:
  - clip
+ - llm-jp-clip
+ - japanese-clip
  library_name: open_clip
  pipeline_tag: zero-shot-image-classification
+ license:
+ - apache-2.0
+ datasets:
+ - laion/relaion2B-en-research-safe
+ language:
+ - ja
  ---
+ # Model Card for llm-jp-clip-vit-base-patch16
+
+ # Model Details
+
+ A CLIP ViT-B/16 model trained with [OpenCLIP](https://github.com/mlfoundations/open_clip) on a Japanese translation of the English subset of ReLAION-5B ([relaion2B-en-research-safe](https://huggingface.co/datasets/laion/relaion2B-en-research-safe)), translated with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).
+
+ The model has 248M parameters in total.
+
+ # How to Use
+
+ ## Installation
+
+ ```bash
+ $ pip install open_clip_torch
+ ```
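+
+ Optionally, you can confirm that the package is importable; this assumes the installed `open_clip_torch` release exposes `open_clip.__version__` (recent releases do):
+
+ ```python
+ import open_clip
+
+ # Print the installed open_clip_torch version to confirm the install worked.
+ print(open_clip.__version__)
+ ```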
+
+ ## Zero-shot Image Classification
+ ```python
+ import torch
+ from PIL import Image
+ import requests
+ import open_clip
+
+ # Load the model, its preprocessing transform, and the matching tokenizer from the Hub.
+ model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
+ tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
+
+ # Example image (two cats from COCO) and Japanese candidate labels: cat, dog, bird.
+ url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
+ image = Image.open(requests.get(url, stream=True).raw)
+ image = preprocess(image).unsqueeze(0)
+ text = tokenizer(["猫", "犬", "鳥"])
+
+ with torch.no_grad(), torch.cuda.amp.autocast():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+     # Normalize the embeddings so the dot product is a cosine similarity.
+     image_features /= image_features.norm(dim=-1, keepdim=True)
+     text_features /= text_features.norm(dim=-1, keepdim=True)
+
+     text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+
+ print("Label probs:", text_probs)
+ # Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])
+ ```
+
+ References:
+ - [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/en/open_clip), Hugging Face Docs
+ - OpenCLIP [repository](https://github.com/mlfoundations/open_clip)
+
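+ As an additional illustration (not part of the original card), zero-shot labels are often wrapped in a prompt template; the template 「〜の写真」 ("a photo of ...") used below is an assumption for illustration, not a template prescribed by this model:
+
+ ```python
+ import torch
+ import open_clip
+
+ # Same model and tokenizer as in the example above.
+ model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
+ tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
+
+ labels = ["猫", "犬", "鳥"]                       # cat, dog, bird
+ prompts = [f"{label}の写真" for label in labels]  # "a photo of a <label>" (illustrative template)
+
+ with torch.no_grad():
+     text_features = model.encode_text(tokenizer(prompts))
+     # Normalize so the dot product with normalized image features is a cosine similarity.
+     text_features /= text_features.norm(dim=-1, keepdim=True)
+
+ # text_features can now replace the bare-label text features in the example above.
+ ```
+
+ Whether a template helps depends on the label set, so it is worth comparing bare labels against templated prompts on a small validation set.
+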
+ # Training Details
+
+ ## Model Architecture
+
+ - Text Encoder: RoBERTa base with llm-jp-tokenizer
+ - Image Encoder: ViT-B/16
+
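+ For reference, a minimal sketch for checking how the 248M parameters are split between the two towers; it assumes the usual OpenCLIP layout in which the image encoder is exposed as `model.visual`, with the remaining parameters belonging mostly to the text encoder:
+
+ ```python
+ import open_clip
+
+ model, _ = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
+
+ def n_params(module) -> int:
+     # Count all parameters (trainable or not) in a module.
+     return sum(p.numel() for p in module.parameters())
+
+ total = n_params(model)
+ visual = n_params(model.visual)  # ViT-B/16 image tower
+ print(f"total parameters:      {total / 1e6:.0f}M")
+ print(f"image encoder (ViT):   {visual / 1e6:.0f}M")
+ print(f"remainder (text side): {(total - visual) / 1e6:.0f}M")
+ ```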
+
+ ## Training Data
+
+ We used a Japanese-translated version of the relaion2B-en-research-safe dataset; the translation was produced with gemma-2-9b-it.
+ Because the image download success rate was about 70%, the resulting dataset contains 1.45 billion samples, which we trained on for 9 epochs (roughly 13 billion samples seen in total).
+
+ # Evaluation
+
+ Evaluation Code: https://github.com/llm-jp/clip-eval
+
+ TODO:
+
+ # LICENSE
+ [The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ Please also see the Gemma Terms of Use (https://ai.google.dev/gemma/terms), as the training data was translated with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).
+
+ > 3.3 Generated Output
+ >
+ > Google claims no rights in Outputs you generate using Gemma. You and your users are solely responsible for Outputs and their subsequent uses.
+
+ # Citation
+
+ BibTeX:
+ ```
+ TODO:
+ ```