---
license: cc-by-nc-4.0
model-index:
- name: CondViT-B16-txt
  results:
  - dataset:
      name: LAION - Referred Visual Search - Fashion
      split: test
      type: Slep/LAION-RVS-Fashion
    metrics:
    - name: R@1 +10K Dist.
      type: recall_at_1|10000
      value: 94.18 ± 0.86
    - name: R@5 +10K Dist.
      type: recall_at_5|10000
      value: 98.78 ± 0.32
    - name: R@10 +10K Dist.
      type: recall_at_10|10000
      value: 99.25 ± 0.30
    - name: R@20 +10K Dist.
      type: recall_at_20|10000
      value: 99.71 ± 0.17
    - name: R@50 +10K Dist.
      type: recall_at_50|10000
      value: 99.79 ± 0.13
    - name: R@1 +100K Dist.
      type: recall_at_1|100000
      value: 87.07 ± 1.30
    - name: R@5 +100K Dist.
      type: recall_at_5|100000
      value: 95.28 ± 0.61
    - name: R@10 +100K Dist.
      type: recall_at_10|100000
      value: 96.99 ± 0.44
    - name: R@20 +100K Dist.
      type: recall_at_20|100000
      value: 98.04 ± 0.36
    - name: R@50 +100K Dist.
      type: recall_at_50|100000
      value: 98.98 ± 0.26
    - name: R@1 +500K Dist.
      type: recall_at_1|500000
      value: 79.41 ± 1.02
    - name: R@5 +500K Dist.
      type: recall_at_5|500000
      value: 89.65 ± 1.08
    - name: R@10 +500K Dist.
      type: recall_at_10|500000
      value: 92.72 ± 0.87
    - name: R@20 +500K Dist.
      type: recall_at_20|500000
      value: 94.88 ± 0.58
    - name: R@50 +500K Dist.
      type: recall_at_50|500000
      value: 97.13 ± 0.48
    - name: R@1 +1M Dist.
      type: recall_at_1|1000000
      value: 75.60 ± 1.40
    - name: R@5 +1M Dist.
      type: recall_at_5|1000000
      value: 86.62 ± 1.42
    - name: R@10 +1M Dist.
      type: recall_at_10|1000000
      value: 90.13 ± 1.06
    - name: R@20 +1M Dist.
      type: recall_at_20|1000000
      value: 92.82 ± 0.76
    - name: R@50 +1M Dist.
      type: recall_at_50|1000000
      value: 95.61 ± 0.62
    - name: Available Dists.
      type: n_dists
      value: 2000014
    - name: Embedding Dimension
      type: embedding_dim
      value: 512
    - name: Conditioning
      type: conditioning
      value: text
    source:
      name: LRVSF Leaderboard
      url: https://huggingface.co/spaces/Slep/LRVSF-Leaderboard
    task:
      type: Retrieval
tags:
- lrvsf-benchmark
datasets:
- Slep/LAION-RVS-Fashion
---
# Conditional ViT - B/16 - Text

*Introduced in <a href="https://arxiv.org/abs/2306.02928">**LRVS-Fashion: Extending Visual Search with Referring Instructions**</a>, Lepage et al., 2023*

<div align="center">
<div id="links">

|Data|Code|Models|Spaces|
|:-:|:-:|:-:|:-:|
|[Full Dataset](https://huggingface.co/datasets/Slep/LAION-RVS-Fashion)|[Training Code](https://github.com/Simon-Lepage/CondViT-LRVSF)|[Categorical Model](https://huggingface.co/Slep/CondViT-B16-cat)|[LRVS-F Leaderboard](https://huggingface.co/spaces/Slep/LRVSF-Leaderboard)|
|[Test set](https://zenodo.org/doi/10.5281/zenodo.11189942)|[Benchmark Code](https://github.com/Simon-Lepage/LRVSF-Benchmark)|[Textual Model](https://huggingface.co/Slep/CondViT-B16-txt)|[Demo](https://huggingface.co/spaces/Slep/CondViT-LRVSF-Demo)|

</div>
</div>

## General Information
This model is fine-tuned from CLIP ViT-B/16 on LAION-RVS-Fashion (LRVS-F) at 224×224 resolution. The conditioning text is preprocessed by a frozen [Sentence T5-XL](https://huggingface.co/sentence-transformers/sentence-t5-xl) encoder.
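For intuition, the conditioning embedding can be reproduced with the `sentence-transformers` library. This is a minimal sketch, assuming `sentence-transformers` is installed; it is purely illustrative, since the processor shown below already accepts raw text:

```python
# Illustrative only: the AutoProcessor in "How to Use" takes raw strings directly.
# Assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/sentence-t5-xl")  # frozen text encoder
condition = encoder.encode(["a brown bag"])  # numpy array of shape (1, embedding_dim)
```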
For research use only (released under CC BY-NC 4.0).
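The R@K figures in the model card header are recall scores on the LRVS-F test set, measured against pools of 10K to 1M added distractors (2,000,014 available in total). As a rough illustration of this kind of metric (the official benchmark code is linked above), recall@k over precomputed embeddings might look like the following sketch; all names here are hypothetical:

```python
import torch

def recall_at_k(query_emb: torch.Tensor,
                gallery_emb: torch.Tensor,
                target_idx: torch.Tensor,
                k: int) -> float:
    """Fraction of queries whose target appears in the top-k retrieved items.

    query_emb   -- (Q, D) L2-normalized query embeddings
    gallery_emb -- (G, D) L2-normalized gallery embeddings (targets + distractors)
    target_idx  -- (Q,) index of each query's ground-truth item in the gallery
    """
    scores = query_emb @ gallery_emb.T                 # cosine similarities, (Q, G)
    topk = scores.topk(k, dim=-1).indices              # (Q, k)
    hits = (topk == target_idx.unsqueeze(-1)).any(-1)  # (Q,) booleans
    return hits.float().mean().item()
```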
## How to Use

```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

# Load the model and its matching processor from the Hub.
model = AutoModel.from_pretrained("Slep/CondViT-B16-txt")
processor = AutoProcessor.from_pretrained("Slep/CondViT-B16-txt")

# Query image and conditioning text.
url = "https://huggingface.co/datasets/Slep/LAION-RVS-Fashion/resolve/main/assets/108856.0.jpg"
img = Image.open(requests.get(url, stream=True).raw)
txt = "a brown bag"

# The processor prepares both the image and the conditioning text.
inputs = processor(images=[img], texts=[txt])

# One 512-d embedding per (image, text) pair; normalize it before computing similarities.
raw_embedding = model(**inputs)
normalized_embedding = torch.nn.functional.normalize(raw_embedding, dim=-1)
```
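Since the embedding is L2-normalized, cosine similarity reduces to a dot product, so retrieval against a gallery is a single matrix multiplication. A minimal sketch continuing the snippet above, with a hypothetical `gallery_embeddings` tensor standing in for real precomputed item embeddings:

```python
# Hypothetical gallery: N precomputed, L2-normalized item embeddings of shape (N, 512).
gallery_embeddings = torch.nn.functional.normalize(torch.randn(1000, 512), dim=-1)

# On normalized vectors, cosine similarity is a plain dot product.
scores = normalized_embedding @ gallery_embeddings.T  # shape (1, N)
top5 = scores.topk(5, dim=-1).indices                 # indices of the 5 closest items
```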