Update README.md
README.md
# jina-clip-v1

Jina CLIP: your CLIP model is also your text retriever!

## Intended Usage & Model Info

## Data & Parameters

[Check out our paper](https://arxiv.org/abs/2405.20204)

## Usage

You can use Jina CLIP directly via the `transformers` package.
```python
!pip install transformers einops timm pillow
# ...
print(cos_sim(text_embeddings[0], text_embeddings[1]))  # text embedding similarity
print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-image cross-modal similarity
```
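The snippet above calls a `cos_sim` helper whose definition falls in the collapsed part of the diff; a minimal NumPy sketch of such a helper (the exact signature used in the model card is an assumption here):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cos_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```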
## Performance

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

If you find `jina-clip-v1` useful in your research, please cite the following paper:
```bibtex
@misc{2405.20204,
  Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
  Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
  Year = {2024},
  Eprint = {arXiv:2405.20204},
}
```
**Notice: our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity!**
If you want to merge the two scores, we recommend two ways:

1. Take a weighted average of the text-text and text-image similarities:

```python
# pseudo code
alpha = 0.6
beta = 0.4

combined_scores = alpha * sim(query, document) + beta * sim(text, image)
```
116 |
+
|
117 |
+
2. apply z-score normalization before merging scores:
|
118 |
+
|
119 |
+
```python
|
120 |
+
# pseudo code
|
121 |
+
query_document_mean = np.mean(cos_sim_query_documents)
|
122 |
+
query_document_std = np.std(cos_sim_query_documents)
|
123 |
+
text_image_mean = np.mean(cos_sim_text_images)
|
124 |
+
text_image_std = np.std(cos_sim_text_images)
|
125 |
+
|
126 |
+
query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
|
127 |
+
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
|
128 |
+
```
|
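A runnable sketch of the z-score approach, again with made-up scores for illustration; the equal 0.5/0.5 weighting of the normalized scores is an assumption, not something the model card prescribes:

```python
import numpy as np

def z_normalize(scores):
    """Standardize scores to zero mean and unit variance."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

# Hypothetical similarity scores (illustrative values).
cos_sim_query_documents = np.array([0.85, 0.72, 0.60])
cos_sim_text_images = np.array([0.30, 0.45, 0.25])

# After normalization both score sets live on a comparable scale, so they
# can be averaged without the (typically larger) text-text scores dominating.
combined = 0.5 * z_normalize(cos_sim_query_documents) + 0.5 * z_normalize(cos_sim_text_images)
```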