jinaai/jina-clip-v1


bwang0911 committed on May 21

Commit 6417422 • 1 Parent(s): ba49c92

Update README.md

Files changed (1)

README.md +79 -3


@@ -1,3 +1,79 @@
----
-license: apache-2.0
----

+---
+tags:
+  - feature-extraction
+  - sentence-similarity
+  - mteb
+language: en
+inference: false
+license: apache-2.0
+---
+<!-- TODO: add evaluation results here -->
+<br><br>
+<p align="center">
+<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
+</p>
+<p align="center">
+<b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
+</p>
+## Quick Start
+The easiest way to start using `jina-clip-v1` is through Jina AI's [Embedding API](https://jina.ai/embeddings/).
+## Intended Usage & Model Info
+### `jina-clip-v1` Overview
+`jina-clip-v1` is an English, monolingual **multimodal (text-image) embedding model**.
+Traditional text embedding models, such as [jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en),
+excel in text-to-text retrieval but lack cross-modal retrieval capabilities.
+Conversely, CLIP-like models, such as [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32),
+align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.
+`jina-clip-v1` combines both strengths in a single **multimodal embedding model**.
+Its text component performs comparably to `jina-embeddings-v2-base-en` in text-to-text retrieval,
+while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks.
+This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications,
+allowing for both text-to-text and text-to-image searches with a single model.
+## Data & Parameters
+The Jina CLIP V1 [technical report]() is coming soon.
+## Usage
+You can use Jina CLIP directly from the `transformers` package.
+```python
+# Install the dependency first (notebook syntax: !pip install transformers).
+from numpy.linalg import norm
+from transformers import AutoModel
+# Cosine similarity between two 1-D embedding vectors.
+cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
+# Assumption: the repository ships custom model code, which transformers only loads with trust_remote_code=True.
+model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)
+text_embeddings = model.encode_text(['How is the weather today?', 'What is the current weather like today?'])
+image_embeddings = model.encode_image(['raindrop.png'])  # path to a local image file
+print(cos_sim(text_embeddings[0], text_embeddings[1]))  # text-to-text similarity
+print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-to-image cross-modal similarity
+```
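Because text and image embeddings share one vector space, a single query can be scored against candidates of both modalities. The following is a minimal sketch of that ranking step, not part of the README itself. It uses only NumPy, and the vectors, labels, and helper names (`cosine_similarity_matrix`, `rank_candidates`) are hypothetical stand-ins for the outputs of `model.encode_text` and `model.encode_image`.

```python
import numpy as np


def cosine_similarity_matrix(queries: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Return the (n_queries, n_candidates) matrix of cosine similarities."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return q @ c.T


def rank_candidates(query: np.ndarray, candidates: np.ndarray, labels: list[str]) -> list[tuple[str, float]]:
    """Sort candidate labels by descending cosine similarity to one query vector."""
    scores = cosine_similarity_matrix(query[None, :], candidates)[0]
    order = np.argsort(-scores)
    return [(labels[i], float(scores[i])) for i in order]


# Hypothetical 3-dimensional embeddings; real model outputs have many more dimensions.
query_embedding = np.array([1.0, 0.0, 0.0])
candidate_embeddings = np.array([
    [0.9, 0.1, 0.0],  # stand-in for a text passage embedding
    [0.0, 1.0, 0.0],  # stand-in for an unrelated image embedding
    [0.7, 0.0, 0.7],  # stand-in for a related image embedding
])
candidate_labels = ["weather.txt", "cat.png", "raindrop.png"]

ranking = rank_candidates(query_embedding, candidate_embeddings, candidate_labels)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Normalizing each row once and taking a matrix product avoids recomputing norms per pair, which matters when the candidate set is large.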
+## Contact
+Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
+## Citation
+If you find Jina CLIP useful in your research, please cite the following paper:
+```console
+TBD
+```