speed committed on
Commit 88ac7a0 · verified · 1 Parent(s): 3611ecd

Update README.md

Files changed (1)
  1. README.md +91 -2

README.md CHANGED
@@ -1,8 +1,97 @@
  ---
  tags:
  - clip
  library_name: open_clip
  pipeline_tag: zero-shot-image-classification
- license: mit
  ---
- # Model card for llm-jp-roberta-ViT-B-16-relaion-1.5B-lr5e-4-bs8k-accum4-20241205-epoch90
  ---
  tags:
  - clip
+ - llm-jp-clip
+ - japanese-clip
  library_name: open_clip
  pipeline_tag: zero-shot-image-classification
+ license:
+ - apache-2.0
+ datasets:
+ - laion/relaion2B-en-research-safe
+ language:
+ - ja
  ---
+ # Model Card for llm-jp-clip-vit-base-patch16
+
+ # Model Details
+
+ A CLIP ViT-B/16 model trained with [OpenCLIP](https://github.com/mlfoundations/open_clip) on a Japanese translation of the English subset of ReLAION-5B ([relaion2B-en-research-safe](https://huggingface.co/datasets/laion/relaion2B-en-research-safe)), translated with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).
+
+ The model has 248M parameters in total.
+
+ # How to Use
+
+ ## Installation
+
+ ```bash
+ $ pip install open_clip_torch
+ ```
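+
+ Optionally, you can confirm that the package is importable; this assumes the installed `open_clip_torch` release exposes `open_clip.__version__` (recent releases do):
+
+ ```python
+ import open_clip
+
+ # Print the installed open_clip_torch version to confirm the install worked.
+ print(open_clip.__version__)
+ ```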
+
+ ## Zero-shot Image Classification
+ ```python
+ import torch
+ from PIL import Image
+ import requests
+ import open_clip
+
+ # Load the model, its preprocessing transform, and the matching tokenizer from the Hub.
+ model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
+ tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
+
+ # Example image (two cats from COCO) and Japanese candidate labels: cat, dog, bird.
+ url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
+ image = Image.open(requests.get(url, stream=True).raw)
+ image = preprocess(image).unsqueeze(0)
+ text = tokenizer(["猫", "犬", "鳥"])
+
+ with torch.no_grad(), torch.cuda.amp.autocast():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+     # Normalize the embeddings so the dot product is a cosine similarity.
+     image_features /= image_features.norm(dim=-1, keepdim=True)
+     text_features /= text_features.norm(dim=-1, keepdim=True)
+
+     text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+
+ print("Label probs:", text_probs)
+ # Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])
+ ```
+
+ References:
+ - [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/en/open_clip), Hugging Face Docs
+ - OpenCLIP [repository](https://github.com/mlfoundations/open_clip)
+
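+ As an additional illustration (not part of the original card), zero-shot labels are often wrapped in a prompt template; the template 「〜の写真」 ("a photo of ...") used below is an assumption for illustration, not a template prescribed by this model:
+
+ ```python
+ import torch
+ import open_clip
+
+ # Same model and tokenizer as in the example above.
+ model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
+ tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
+
+ labels = ["猫", "犬", "鳥"]                       # cat, dog, bird
+ prompts = [f"{label}の写真" for label in labels]  # "a photo of a <label>" (illustrative template)
+
+ with torch.no_grad():
+     text_features = model.encode_text(tokenizer(prompts))
+     # Normalize so the dot product with normalized image features is a cosine similarity.
+     text_features /= text_features.norm(dim=-1, keepdim=True)
+
+ # text_features can now replace the bare-label text features in the example above.
+ ```
+
+ Whether a template helps depends on the label set, so it is worth comparing bare labels against templated prompts on a small validation set.
+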
+ # Training Details
+
+ ## Model Architecture
+
+ - Text Encoder: RoBERTa base with llm-jp-tokenizer
+ - Image Encoder: ViT-B/16
+
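+ For reference, a minimal sketch for checking how the 248M parameters are split between the two towers; it assumes the usual OpenCLIP layout in which the image encoder is exposed as `model.visual`, with the remaining parameters belonging mostly to the text encoder:
+
+ ```python
+ import open_clip
+
+ model, _ = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
+
+ def n_params(module) -> int:
+     # Count all parameters (trainable or not) in a module.
+     return sum(p.numel() for p in module.parameters())
+
+ total = n_params(model)
+ visual = n_params(model.visual)  # ViT-B/16 image tower
+ print(f"total parameters:      {total / 1e6:.0f}M")
+ print(f"image encoder (ViT):   {visual / 1e6:.0f}M")
+ print(f"remainder (text side): {(total - visual) / 1e6:.0f}M")
+ ```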
+
+ ## Training Data
+
+ We used a Japanese-translated version of the relaion2B-en-research-safe dataset; the translation was produced with gemma-2-9b-it.
+ Because the image download success rate was about 70%, the resulting dataset contains 1.45 billion samples, which we trained on for 9 epochs (roughly 13 billion samples seen in total).
+
+ # Evaluation
+
+ Evaluation Code: https://github.com/llm-jp/clip-eval
+
+ TODO:
+
+ # LICENSE
+ [The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ Please also see the Gemma Terms of Use (https://ai.google.dev/gemma/terms), as the training data was translated with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).
+
+ > 3.3 Generated Output
+ >
+ > Google claims no rights in Outputs you generate using Gemma. You and your users are solely responsible for Outputs and their subsequent uses.
+
+ # Citation
+
+ BibTeX:
+ ```
+ TODO:
+ ```