---
tags:
- feature-extraction
- sentence-similarity
- mteb
- clip
- vision
language: en
inference: false
license: apache-2.0
---
# jina-clip-v1
Jina CLIP: your CLIP model is also your text retriever!
## Intended Usage & Model Info
`jina-clip-v1` is a state-of-the-art English **multimodal (text-image) embedding model**.
Traditional text embedding models, such as [jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en), excel at text-to-text retrieval but are incapable of cross-modal tasks. Models like [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) effectively align image and text embeddings but are not optimized for text-to-text retrieval due to their training methodologies and context limitations.
`jina-clip-v1` bridges this gap by offering robust performance in both domains. Its text component matches the retrieval efficiency of `jina-embeddings-v2-base-en`, while its overall architecture sets a new benchmark for cross-modal retrieval. This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (M-RAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.
## Data & Parameters
[Check out our paper](https://arxiv.org/abs/2405.20204)
## Usage
You can use Jina CLIP directly via the `transformers` package.
```python
# Install dependencies first: pip install transformers einops timm pillow
from transformers import AutoModel
from numpy.linalg import norm
cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)
# New meaningful sentences
sentences = ['Bridge close-shot', 'Bridge in far away']
# Public image URLs
image_urls = [
'https://fastly.picsum.photos/id/74/4288/2848.jpg?hmac=q02MzzHG23nkhJYRXR-_RgKTr6fpfwRgcXgE0EKvNB8',
'https://fastly.picsum.photos/id/84/1280/848.jpg?hmac=YFRYDI4UsfbeTzI8ZakNOR98wVU7a-9a2tGF542539s'
]
# Encode text and images
text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(image_urls)  # also accepts PIL.Image objects, local filenames, and data URIs
# Compute similarities
print(cos_sim(text_embeddings[0], text_embeddings[1])) # text embedding similarity
print(cos_sim(text_embeddings[0], image_embeddings[0])) # text-image cross-modal similarity
print(cos_sim(text_embeddings[0], image_embeddings[1])) # text-image cross-modal similarity
print(cos_sim(text_embeddings[1], image_embeddings[0])) # text-image cross-modal similarity
print(cos_sim(text_embeddings[1], image_embeddings[1])) # text-image cross-modal similarity
```
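For retrieval, the pairwise comparisons above generalize to ranking a set of candidates against one query. The following is a minimal sketch, assuming the embeddings are NumPy arrays with one row per item. The toy vectors are made up for illustration and stand in for the outputs of `model.encode_text` and `model.encode_image`; `rank_by_cosine` is a hypothetical helper, not part of the model's API.

```python
import numpy as np

def rank_by_cosine(query, candidates):
    """Return candidate indices sorted from most to least similar, plus the scores."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Normalize so that the dot product equals the cosine similarity.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = candidates @ query
    order = np.argsort(-scores)  # descending by similarity
    return order, scores

# Toy 3-dimensional vectors standing in for real embeddings (illustrative only).
toy_query = np.array([1.0, 0.0, 0.0])
toy_candidates = np.array([
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
])
order, scores = rank_by_cosine(toy_query, toy_candidates)
print(order)   # indices of the candidates, best match first
print(scores)  # cosine similarity of each candidate to the query
```

Because text and image embeddings share one space, the same helper could rank image embeddings against a text query or text embeddings against another text embedding.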
## Performance
### Text-Image Retrieval
| Name | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
|------------------|-------------------------|-------------------------|-----------------------|-----------------------|
| ViT-B-32 | 0.597 | 0.8398 | 0.781 | 0.938 |
| ViT-B-16 | 0.6216 | 0.8572 | 0.822 | 0.966 |
| jina-clip | 0.6748 | 0.8902 | 0.811 | 0.965 |

| Name | MSCOCO Image Retr. R@1 | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
|------------------|-------------------------|-------------------------|-----------------------|-----------------------|
| ViT-B-32 | 0.342 | 0.6001 | 0.5234 | 0.7634 |
| ViT-B-16 | 0.3309 | 0.5842 | 0.5242 | 0.767 |
| jina-clip | 0.4111 | 0.6644 | 0.5544 | 0.7904 |
### Text-Text Retrieval
| Name | STS12 | STS15 | STS17 | STS13 | STS14 | STS16 | STS22 | STSBenchmark | SummEval |
|-----------------------|--------|--------|--------|--------|--------|--------|--------|--------------|----------|
| jina-embeddings-v2 | 0.7427 | 0.8755 | 0.8888 | 0.833 | 0.7917 | 0.836 | 0.6346 | 0.8404 | 0.3056 |
| jina-clip | 0.7352 | 0.8746 | 0.8976 | 0.8323 | 0.7868 | 0.8377 | 0.6583 | 0.8493 | 0.3048 |

| Name | ArguAna | FiQA2018 | NFCorpus | Quora | SCIDOCS | SciFact | TRECCOVID |
|--------------------|---------|----------|----------|-------|---------|---------|-----------|
| jina-embeddings-v2 | 0.4418 | 0.4158 | 0.3245 | 0.882 | 0.1986 | 0.6668 | 0.6591 |
| jina-clip | 0.4933 | 0.3827 | 0.3352 | 0.8789 | 0.2024 | 0.6734 | 0.7161 |
## Contact
Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
## Citation
If you find `jina-clip-v1` useful in your research, please cite the following paper:
```bibtex
@misc{2405.20204,
Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
Year = {2024},
Eprint = {arXiv:2405.20204},
}
```
## FAQ
### I encounter the following error. What should I do?
```
ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_clip.JinaCLIPConfig'> and you passed <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_cli.JinaCLIPConfig'>. Fix one of those so they match!
```
The `transformers` library had a bug in versions 4.40.x through 4.41.1. To work around it, either upgrade `transformers` to a version >4.41.2 or downgrade to a version <=4.40.0.
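If you want to check an installed version against these bounds programmatically, the following is a minimal sketch. Both helpers are hypothetical (they are not part of `transformers`), and the check applies the bounds literally as stated above (>4.41.2 or <=4.40.0).

```python
def parse_version(version):
    """Turn '4.41.2' or '4.42.0.dev0' into a tuple of its leading integer parts."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or '0rc1'
        parts.append(int(piece))
    # Pad to three components so that '4.40' compares like '4.40.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])

def outside_affected_range(version):
    """True if the version is >4.41.2 or <=4.40.0, the bounds given in this FAQ."""
    v = parse_version(version)
    return v > (4, 41, 2) or v <= (4, 40, 0)

print(outside_affected_range("4.41.0"))
print(outside_affected_range("4.42.0"))
```

In practice you would pass `transformers.__version__` to `outside_affected_range` after importing `transformers`.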
### Given one query, how can I merge its text-text and text-image cosine similarity?
Our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity!
If you want to merge the two scores, we recommend two approaches:
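1. Take a weighted sum of the text-text and text-image similarities:
```python
# pseudo code: the optimal weight depends on your dataset, but in general weight = 2 can be a good choice
# (the weight is not named `lambda` because that is a reserved word in Python)
combined_scores = sim(text, text) + weight * sim(text, image)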
```
2. Apply z-score normalization before merging the scores:
```python
# pseudo code: cos_sim_text_texts and cos_sim_text_images are arrays of similarity scores
import numpy as np

text_text_mean = np.mean(cos_sim_text_texts)
text_text_std = np.std(cos_sim_text_texts)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)
text_text_sim_normalized = (cos_sim_text_texts - text_text_mean) / text_text_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
```
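Both approaches can be sketched as runnable code. This is only an illustration under stated assumptions: each candidate has both a text-text and a text-image score, the two score arrays are aligned by candidate, and the z-normalized scores are merged by simple addition (the merge step is not specified above, so that choice is an assumption). The function names and the sample similarities are made up for the example.

```python
import numpy as np

def weighted_sum(text_text_sims, text_image_sims, weight=2.0):
    """Approach 1: text-text similarity plus a weighted text-image similarity."""
    text_text_sims = np.asarray(text_text_sims, dtype=float)
    text_image_sims = np.asarray(text_image_sims, dtype=float)
    return text_text_sims + weight * text_image_sims

def z_normalize(scores):
    """Shift and scale scores to zero mean and unit standard deviation."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # All scores are identical; avoid dividing by zero.
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std

def z_score_merge(text_text_sims, text_image_sims):
    """Approach 2: normalize each score set separately, then add them (assumed merge)."""
    return z_normalize(text_text_sims) + z_normalize(text_image_sims)

# Made-up similarities of one query against three candidates (illustrative only).
text_text = [0.80, 0.70, 0.60]
text_image = [0.20, 0.30, 0.25]
print(weighted_sum(text_text, text_image))
print(z_score_merge(text_text, text_image))
```

Z-score normalization puts both score sets on a comparable scale without a hand-tuned weight, at the cost of needing enough candidates per query to estimate a meaningful mean and standard deviation.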