On the usage and creation of E5-instruct embeddings

#8
by jmaronasm - opened

I am reading papers about the E5 family and decoder based embeddings. References are:

https://arxiv.org/pdf/2402.05672.pdf
https://arxiv.org/pdf/2212.03533.pdf
https://arxiv.org/pdf/2401.00368.pdf

Regarding the instruction-based fine-tuning of the E5 family, I think I understand how training is performed. We have a siamese network with shared weights. We input an instruction + query to obtain one embedding, and the document to obtain another embedding, and then minimize a contrastive loss.

However, the FAQ says we need to provide instructions when generating embeddings for downstream tasks: https://huggingface.co/intfloat/multilingual-e5-large-instruct#faq

From my understanding and from what I see in the code, though, this is not necessary. My points are:

  1. If the model generates one embedding from instruction + query and another from the document (see section 4.1 at https://arxiv.org/pdf/2212.03533.pdf) for the contrastive loss, then the document embedding is not really informed by the instruction + query. The model weights are, but the embedding generation itself is not.

  2. Following the previous point, if we look at the code example on Hugging Face, https://huggingface.co/intfloat/multilingual-e5-large-instruct#transformers , we see a concatenation of queries + documents resulting in a batch of 4 samples (the first two samples are the English and Chinese queries and the final two are their corresponding documents). This generates 4 embeddings, none of which is informed by the other samples in the batch. Each embedding depends only on the tokens of its own sample. So the embedding for, e.g., the query should be the same whether the query is passed through the model together with the documents or on its own.

  3. I have checked point 2 and it is true. Here is the output of two runs of the HF code.

The embedding obtained for the input:

Instruct: Given a web search query, retrieve relevant passages that answer the query \n Query: how much protein should a female eat

Has values:

0.020, 0.0112, -0.0451...

If I input both the instruct + query and document, i.e.:
Instruct: Given a web search query, retrieve relevant passages that answer the query \n Query: how much protein should a female eat
As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 48 grams ....

I obtain two embeddings, the first one with exactly the same values as before:

0.020, 0.0112, -0.0451...
0.0327, 0.0041, -0.0503...

So in conclusion, what is this FAQ referring to? I can understand that the instruction is necessary for training, but not for obtaining embeddings for downstream applications.
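The batch-independence observation in point 2 can be illustrated without downloading the model. The following is only a toy sketch: `token_vector` is a hypothetical stand-in for the encoder's per-token hidden states (it is not the E5 model), and the pooling step is a masked average over padded positions followed by normalization. The document string is a placeholder, not the real passage. The point it shows is that padding positions are masked out, so a sample's embedding does not change when other samples share its batch.

```python
import math

DIM = 4        # toy embedding size (assumption, unrelated to the real model)
PAD = "<pad>"  # hypothetical padding token


def token_vector(token):
    # Hypothetical stand-in for a per-token hidden state; NOT the E5 encoder.
    seed = sum(ord(ch) for ch in token)
    return [math.sin(seed * (i + 1)) for i in range(DIM)]


def embed_batch(texts):
    """Pad to the longest sample, then masked-average-pool and L2-normalize each row."""
    tokenized = [text.split() for text in texts]
    max_len = max(len(tokens) for tokens in tokenized)
    embeddings = []
    for tokens in tokenized:
        padded = tokens + [PAD] * (max_len - len(tokens))
        mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
        pooled = [0.0] * DIM
        for token, keep in zip(padded, mask):
            vec = token_vector(token)
            for i in range(DIM):
                pooled[i] += keep * vec[i]  # padding positions contribute nothing
        pooled = [x / sum(mask) for x in pooled]
        norm = math.sqrt(sum(x * x for x in pooled))
        embeddings.append([x / norm for x in pooled])
    return embeddings


query = (
    "Instruct: Given a web search query, retrieve relevant passages "
    "that answer the query\nQuery: how much protein should a female eat"
)
document = "placeholder passage text standing in for the real and much longer document body here"

alone = embed_batch([query])
together = embed_batch([query, document])

# The query embedding is the same with or without the document in the batch.
same = all(math.isclose(a, b) for a, b in zip(alone[0], together[0]))
print(same)
```

With the real model the same reasoning applies through the attention mask, although tiny floating-point differences between batch sizes are possible in practice.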

Okay, after some thinking and reading, I think the point is that the instruction is only needed when generating the embedding of the query, whatever the task (retrieval, clustering, etc.). However, I don't see how this applies to clustering, where the goal is to group similar documents rather than to assign a query to a cluster.

  1. You are right that the documents are not informed by instructions; only the query side has instructions. This is because the documents are often pre-built into a vector index, and we would like that index to be independent of the query task. As a result, we can query a pre-built vector index with any task instruction we want.

  2. For clustering tasks, the query is simply the documents you want to cluster. Here are some instructions we used for clustering: https://github.com/microsoft/unilm/blob/78804b3640bf37efc5666893a0f9674443ae125b/e5/utils.py#L142-L152 The term "query" is a bit overloaded; it is not just a short phrase, as is the case in a search engine.
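As a rough sketch of the two points above, the snippet below only builds the input strings; it does not call the model. The helper name `get_detailed_instruct` and the `Instruct: ...\nQuery: ...` layout follow the input quoted earlier in this thread. The clustering instruction wording, the example texts, and the document placeholder are made up for illustration and are not taken from the linked `utils.py`.

```python
def get_detailed_instruct(task_description, query):
    # Prefix a text with its task instruction, in the layout quoted in this thread.
    return f"Instruct: {task_description}\nQuery: {query}"


# Retrieval: only the query side carries the instruction.
retrieval_task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [get_detailed_instruct(retrieval_task, "how much protein should a female eat")]

# Documents are embedded as-is, so a pre-built index stays independent of the task.
documents = ["<passage text, indexed without any instruction>"]  # placeholder
retrieval_inputs = queries + documents

# Clustering: every text to be clustered plays the role of the "query",
# so every text gets the instruction prefix.
clustering_task = "Identify the topic of the given text"  # hypothetical wording
texts_to_cluster = [
    "first text to cluster",   # illustrative placeholders
    "second text to cluster",
]
clustering_inputs = [get_detailed_instruct(clustering_task, text) for text in texts_to_cluster]

for text in retrieval_inputs + clustering_inputs:
    print(repr(text))
```

The lists `retrieval_inputs` and `clustering_inputs` would then be tokenized and passed to the model in the same way as the batch in the model card example.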
