---
license: cc-by-nc-4.0
model-index:
- name: CondViT-B16-txt
  results:
  - dataset:
      name: LAION - Referred Visual Search - Fashion
      split: test
      type: Slep/LAION-RVS-Fashion
    metrics:
    - name: R@1 +10K Dist.
      type: recall_at_1|10000
      value: 94.18 ± 0.86
    - name: R@5 +10K Dist.
      type: recall_at_5|10000
      value: 98.78 ± 0.32
    - name: R@10 +10K Dist.
      type: recall_at_10|10000
      value: 99.25 ± 0.30
    - name: R@20 +10K Dist.
      type: recall_at_20|10000
      value: 99.71 ± 0.17
    - name: R@50 +10K Dist.
      type: recall_at_50|10000
      value: 99.79 ± 0.13
    - name: R@1 +100K Dist.
      type: recall_at_1|100000
      value: 87.07 ± 1.30
    - name: R@5 +100K Dist.
      type: recall_at_5|100000
      value: 95.28 ± 0.61
    - name: R@10 +100K Dist.
      type: recall_at_10|100000
      value: 96.99 ± 0.44
    - name: R@20 +100K Dist.
      type: recall_at_20|100000
      value: 98.04 ± 0.36
    - name: R@50 +100K Dist.
      type: recall_at_50|100000
      value: 98.98 ± 0.26
    - name: R@1 +500K Dist.
      type: recall_at_1|500000
      value: 79.41 ± 1.02
    - name: R@5 +500K Dist.
      type: recall_at_5|500000
      value: 89.65 ± 1.08
    - name: R@10 +500K Dist.
      type: recall_at_10|500000
      value: 92.72 ± 0.87
    - name: R@20 +500K Dist.
      type: recall_at_20|500000
      value: 94.88 ± 0.58
    - name: R@50 +500K Dist.
      type: recall_at_50|500000
      value: 97.13 ± 0.48
    - name: R@1 +1M Dist.
      type: recall_at_1|1000000
      value: 75.60 ± 1.40
    - name: R@5 +1M Dist.
      type: recall_at_5|1000000
      value: 86.62 ± 1.42
    - name: R@10 +1M Dist.
      type: recall_at_10|1000000
      value: 90.13 ± 1.06
    - name: R@20 +1M Dist.
      type: recall_at_20|1000000
      value: 92.82 ± 0.76
    - name: R@50 +1M Dist.
      type: recall_at_50|1000000
      value: 95.61 ± 0.62
    - name: Available Dists.
      type: n_dists
      value: 2000014
    - name: Embedding Dimension
      type: embedding_dim
      value: 512
    - name: Conditioning
      type: conditioning
      value: text
    source:
      name: LRVSF Leaderboard
      url: https://huggingface.co/spaces/Slep/LRVSF-Leaderboard
    task:
      type: Retrieval
tags:
- lrvsf-benchmark
datasets:
- Slep/LAION-RVS-Fashion
---
# Conditional ViT - B/16 - Text

*Introduced in <a href="https://arxiv.org/abs/2306.02928">**LRVS-Fashion: Extending Visual Search with Referring Instructions**</a>, Lepage et al., 2023*

<div align="center">
<div id="links">

|Data|Code|Models|Spaces|
|:-:|:-:|:-:|:-:|
|[Full Dataset](https://huggingface.co/datasets/Slep/LAION-RVS-Fashion)|[Training Code](https://github.com/Simon-Lepage/CondViT-LRVSF)|[Categorical Model](https://huggingface.co/Slep/CondViT-B16-cat)|[LRVS-F Leaderboard](https://huggingface.co/spaces/Slep/LRVSF-Leaderboard)|
|[Test set](https://zenodo.org/doi/10.5281/zenodo.11189942)|[Benchmark Code](https://github.com/Simon-Lepage/LRVSF-Benchmark)|[Textual Model](https://huggingface.co/Slep/CondViT-B16-txt)|[Demo](https://huggingface.co/spaces/Slep/CondViT-LRVSF-Demo)|

</div>
</div>

## General Information
This model is fine-tuned from CLIP ViT-B/16 on LAION-RVS-Fashion (LRVS-F) at 224×224 resolution. The conditioning text is preprocessed by a frozen [Sentence T5-XL](https://huggingface.co/sentence-transformers/sentence-t5-xl) encoder.
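For intuition, the conditioning embedding can be reproduced with the `sentence-transformers` library. This is a minimal sketch, assuming `sentence-transformers` is installed; it is purely illustrative, since the processor shown below already accepts raw text:

```python
# Illustrative only: the AutoProcessor in "How to Use" takes raw strings directly.
# Assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/sentence-t5-xl")  # frozen text encoder
condition = encoder.encode(["a brown bag"])  # numpy array of shape (1, embedding_dim)
```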
For research use only (released under CC BY-NC 4.0).
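The R@K figures in the model card header are recall scores on the LRVS-F test set, measured against pools of 10K to 1M added distractors (2,000,014 available in total). As a rough illustration of this kind of metric (the official benchmark code is linked above), recall@k over precomputed embeddings might look like the following sketch; all names here are hypothetical:

```python
import torch

def recall_at_k(query_emb: torch.Tensor,
                gallery_emb: torch.Tensor,
                target_idx: torch.Tensor,
                k: int) -> float:
    """Fraction of queries whose target appears in the top-k retrieved items.

    query_emb   -- (Q, D) L2-normalized query embeddings
    gallery_emb -- (G, D) L2-normalized gallery embeddings (targets + distractors)
    target_idx  -- (Q,) index of each query's ground-truth item in the gallery
    """
    scores = query_emb @ gallery_emb.T                 # cosine similarities, (Q, G)
    topk = scores.topk(k, dim=-1).indices              # (Q, k)
    hits = (topk == target_idx.unsqueeze(-1)).any(-1)  # (Q,) booleans
    return hits.float().mean().item()
```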
## How to Use

```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

# Load the model and its matching processor from the Hub.
model = AutoModel.from_pretrained("Slep/CondViT-B16-txt")
processor = AutoProcessor.from_pretrained("Slep/CondViT-B16-txt")

# Query image and conditioning text.
url = "https://huggingface.co/datasets/Slep/LAION-RVS-Fashion/resolve/main/assets/108856.0.jpg"
img = Image.open(requests.get(url, stream=True).raw)
txt = "a brown bag"

# The processor prepares both the image and the conditioning text.
inputs = processor(images=[img], texts=[txt])

# One 512-d embedding per (image, text) pair; normalize it before computing similarities.
raw_embedding = model(**inputs)
normalized_embedding = torch.nn.functional.normalize(raw_embedding, dim=-1)
```
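Since the embedding is L2-normalized, cosine similarity reduces to a dot product, so retrieval against a gallery is a single matrix multiplication. A minimal sketch continuing the snippet above, with a hypothetical `gallery_embeddings` tensor standing in for real precomputed item embeddings:

```python
# Hypothetical gallery: N precomputed, L2-normalized item embeddings of shape (N, 512).
gallery_embeddings = torch.nn.functional.normalize(torch.randn(1000, 512), dim=-1)

# On normalized vectors, cosine similarity is a plain dot product.
scores = normalized_embedding @ gallery_embeddings.T  # shape (1, N)
top5 = scores.topk(5, dim=-1).indices                 # indices of the 5 closest items
```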