GME: General Multimodal Embedding
gme-Qwen2-VL-7B
We are excited to present the GME-Qwen2-VL series of unified multimodal embedding models, which are built on the advanced Qwen2-VL multimodal large language models (MLLMs). GME models support three types of input: text, image, and image-text pair, all of which are encoded into a universal vector representation with strong retrieval performance.
Key Enhancements of GME Models:
- Unified Multimodal Representation: GME models can process both single-modal and combined-modal inputs, producing a unified vector representation. This enables versatile retrieval scenarios (Any2Any Search), supporting tasks such as text retrieval, image retrieval from text, and image-to-image search; see the sketch after this list.
- High Performance: Achieves state-of-the-art (SOTA) results on our Universal Multimodal Retrieval Benchmark (UMRB) and demonstrates strong scores on the Massive Text Embedding Benchmark (MTEB).
- Dynamic Image Resolution: Benefiting from Qwen2-VL and our training data, GME models support dynamic-resolution image input.
- Strong Visual Retrieval Performance: Enhanced by the Qwen2-VL model series, our models excel in visual document retrieval tasks that require a nuanced understanding of document screenshots. This capability is particularly beneficial for complex document understanding scenarios, such as multimodal retrieval-augmented generation (RAG) applications focused on academic papers.
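To make Any2Any search concrete, here is a minimal sketch that embeds a mixed corpus of texts and images into a single index and retrieves from it with a text query. It relies on the GmeQwen2VL helper introduced in the Usage section below; the corpus items are placeholders, and the dot-product scoring simply mirrors the similarity computation used in the Usage examples.
import torch
from gme_inference import GmeQwen2VL  # helper script from the Usage section below
gme = GmeQwen2VL('Alibaba-NLP/gme-Qwen2-VL-7B-Instruct')
# Toy mixed-modal corpus: one text passage and one image URL (placeholders).
corpus_texts = ['The Tesla Cybertruck is a battery electric pickup truck.']
corpus_images = ['https://en.wikipedia.org/wiki/File:2024_Tesla_Cybertruck_Foundation_Series,_front_left_(Greenwich).jpg']
# Because every modality is mapped into the same vector space, the text and
# image embeddings can be stacked into one index.
index = torch.cat([
    gme.get_text_embeddings(texts=corpus_texts),
    gme.get_image_embeddings(images=corpus_images, is_query=False),
], dim=0)
# A query of any modality (text here) is embedded into the same space and
# scored against every indexed item.
query = gme.get_text_embeddings(texts=['an electric pickup truck'])
scores = (query * index).sum(-1)   # one score per corpus item
print(scores.argmax())             # position of the best-matching corpus item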
Developed by: Tongyi Lab, Alibaba Group
Paper: GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
Model List
Models | Model Size | Max Seq. Length | Dimension | MTEB-en | MTEB-zh | UMRB |
---|---|---|---|---|---|---|
gme-Qwen2-VL-2B | 2.21B | 32768 | 1536 | 65.27 | 66.92 | 64.45 |
gme-Qwen2-VL-7B | 8.29B | 32768 | 3584 | 67.48 | 69.73 | 67.44 |
Usage
Use with custom code
# You can find the script gme_inference.py at https://huggingface.co/Alibaba-NLP/gme-Qwen2VL-2B/blob/main/scripts/gme_inference.py
from gme_inference import GmeQwen2VL
gme = GmeQwen2VL('Alibaba-NLP/gme-Qwen2-VL-7B-Instruct')
texts = [
"What kind of car is this?",
"The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023."
]
images = [
'https://en.wikipedia.org/wiki/File:Tesla_Cybertruck_damaged_window.jpg',
'https://en.wikipedia.org/wiki/File:2024_Tesla_Cybertruck_Foundation_Series,_front_left_(Greenwich).jpg',
]
# Single-modal embedding
e_text = gme.get_text_embeddings(texts=texts)
e_image = gme.get_image_embeddings(images=images)
print((e_text * e_image).sum(-1))
## tensor([0.1702, 0.5278], dtype=torch.float16)
# How to set embedding instruction
e_query = gme.get_text_embeddings(texts=texts, instruction='Find an image that matches the given text.')
# If is_query=False, we always use the default instruction.
e_corpus = gme.get_image_embeddings(images=images, is_query=False)
print((e_query * e_corpus).sum(-1))
## tensor([0.2000, 0.5752], dtype=torch.float16)
# Fused-modal embedding
e_fused = gme.get_fused_embeddings(texts=texts, images=images)
print((e_fused[0] * e_fused[1]).sum())
## tensor(0.6826, dtype=torch.float16)
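The examples above score each text against the image at the same position. For actual retrieval you usually score every query against every corpus item and rank the results; below is a short continuation (a sketch, reusing the e_query and e_corpus tensors already computed above).
# Full query-by-corpus similarity matrix: one row per query, one column per image.
# Cast to float32 so the matmul also works if the embeddings are float16 on CPU.
scores = e_query.float() @ e_corpus.float().T   # shape: (2, 2)
top1 = scores.topk(k=1, dim=-1)                 # best image for each query
for text, idx, score in zip(texts, top1.indices[:, 0], top1.values[:, 0]):
    print(f'{text!r} -> {images[idx.item()]} (score={score.item():.4f})')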
Evaluation
We validated the performance of our models on our Universal Multimodal Retrieval Benchmark (UMRB), among other benchmarks.
Column groups: T→T and I→I are single-modal tasks; T→I, T→VD (visual document), and I→T are cross-modal; T→IT, IT→T, IT→I, and IT→IT are fused-modal. The number in parentheses is the number of tasks in each setting, and Avg. is over all 47.

Model | Size | T→T (16) | I→I (1) | T→I (4) | T→VD (10) | I→T (4) | T→IT (2) | IT→T (5) | IT→I (2) | IT→IT (3) | Avg. (47) |
---|---|---|---|---|---|---|---|---|---|---|---|
VISTA | 0.2B | 55.15 | 31.98 | 32.88 | 10.12 | 31.23 | 45.81 | 53.32 | 8.97 | 26.26 | 37.32 |
CLIP-SF | 0.4B | 39.75 | 31.42 | 59.05 | 24.09 | 62.95 | 66.41 | 53.32 | 34.9 | 55.65 | 43.66 |
One-Peace | 4B | 43.54 | 31.27 | 61.38 | 42.9 | 65.59 | 42.72 | 28.29 | 6.73 | 23.41 | 42.01 |
DSE | 4.2B | 48.94 | 27.92 | 40.75 | 78.21 | 52.54 | 49.62 | 35.44 | 8.36 | 40.18 | 50.04 |
E5-V | 8.4B | 52.41 | 27.36 | 46.56 | 41.22 | 47.95 | 54.13 | 32.9 | 23.17 | 7.23 | 42.52 |
GME-Qwen2-VL-2B | 2.2B | 55.93 | 29.86 | 57.36 | 87.84 | 61.93 | 76.47 | 64.58 | 37.02 | 66.47 | 64.45 |
GME-Qwen2-VL-7B | 8.3B | 58.19 | 31.89 | 61.35 | 89.92 | 65.83 | 80.94 | 66.18 | 42.56 | 73.62 | 67.44 |
The English tab of the MTEB Leaderboard shows the text embedding performance of our model.
More detailed experimental results can be found in the paper.
Limitations
- Single Image Input: In Qwen2-VL, an image can be converted into a very large number of visual tokens. To keep training efficient, we limit the number of visual tokens to 1024. Due to the lack of relevant data, our models and evaluations are restricted to a single image per input; see the sketch at the end of this section.
- English-only Training: Our models are trained on English data only. Although the underlying Qwen2-VL models are multilingual, multilingual multimodal embedding performance is not guaranteed.

We will extend the models to multi-image input, image-text interleaved data, and multilingual data in future versions.
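As a hedged workaround for the 1024-visual-token budget mentioned above: the number of visual tokens grows with image resolution, so one option is to downscale very large screenshots before embedding them. The sketch below uses Pillow and assumes the GmeQwen2VL helper accepts local file paths in addition to URLs (an assumption; the official examples only show URLs). The 1280-pixel cap and the file names are illustrative, not an exact token budget.
from PIL import Image

def shrink_image(src_path: str, dst_path: str, max_side: int = 1280) -> str:
    """Downscale an image so its longest side is at most max_side pixels,
    preserving the aspect ratio, then save it for embedding."""
    img = Image.open(src_path)
    img.thumbnail((max_side, max_side))   # in-place resize, keeps aspect ratio
    img.save(dst_path)
    return dst_path

# Hypothetical local screenshot; gme is the GmeQwen2VL instance from the Usage section.
small = shrink_image('paper_page.png', 'paper_page_small.png')
e_page = gme.get_image_embeddings(images=[small], is_query=False)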
Redistribution and Use
We encourage and value diverse applications of GME models and continuous enhancements to the models themselves.
If you distribute or make GME models (or any derivative works) available, or if you create a product or service (including another AI model) that incorporates them, you must prominently display "Built with GME" on your website, user interface, blog post, About page, or product documentation.

If you utilize GME models or their outputs to develop, train, fine-tune, or improve an AI model that is distributed or made available, you must prefix the name of any such AI model with "GME".
Cloud API Services
In addition to the open-source GME models, the GME series is also available as commercial API services on Alibaba Cloud.
- MultiModal Embedding Models: the multimodal-embedding-v1 model service is available.
Note that the models behind the commercial APIs are not entirely identical to the open-source models.
Hiring
We have open positions for Research Interns and Full-Time Researchers to join our team at Tongyi Lab. We are seeking passionate individuals with expertise in representation learning, LLM-driven information retrieval, Retrieval-Augmented Generation (RAG), and agent-based systems. Our team is located in the vibrant cities of Beijing and Hangzhou, offering a collaborative and dynamic work environment where you can contribute to cutting-edge advancements in artificial intelligence and machine learning. If you are driven by curiosity and eager to make a meaningful impact through your work, we would love to hear from you. Please submit your resume along with a brief introduction to [email protected].
Citation
If you find our paper or models helpful, please consider citing:
@misc{zhang2024gme,
title={GME: Improving Universal Multimodal Retrieval by Multimodal LLMs},
author={Zhang, Xin and Zhang, Yanzhao and Xie, Wen and Li, Mingxin and Dai, Ziqi and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Li, Wenjie and Zhang, Min},
year={2024},
eprint={2412.16855},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={http://arxiv.org/abs/2412.16855},
}