Unclear query and passage prefix instructions
The query and passage prefix instructions are not clear. I understand using both "query: " and "passage: " for asymmetric tasks.
But I don't get why "query: " and "query: " should be used for semantic similarity. I would get it if the texts were like Quora duplicates.
But the paper "Text Embeddings by Weakly-Supervised Contrastive Pre-training" says:
"For the Quora duplicate retrieval task in the BEIR benchmark, we add prefix “query: ” to all the
questions. For other retrieval tasks, we use “query: ” and “passage: ” prefixes correspondingly."
Shouldn't I use the "passage: " prefix for regular passages?
Thank you for your help
Commenting because I'm curious too!
I currently have my text documents embedded with the "passage: " prefix, and it would be an inefficient use of resources to store an entirely new set of vectors whose only difference is the prefix, i.e. "query: ".
I'm also curious: why not use two embeddings that were both prefixed with "passage: " for symmetric tasks instead? Why must it be "query: "?
It's an empirical observation that using "query: " prefix for symmetric tasks performs slightly better than the "passage: " prefix.
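For readers skimming the thread, here is a minimal sketch of the prefix convention being discussed. The helper names (`add_prefix`, `prepare_symmetric`, `prepare_asymmetric`) are hypothetical and only illustrate the string handling; they are not part of any library, and the actual embedding call is left out.

```python
from typing import List, Tuple

# Prefix strings as quoted in this thread (note the trailing space).
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def add_prefix(texts: List[str], prefix: str) -> List[str]:
    """Prepend the given prefix to every text (hypothetical helper)."""
    return [prefix + text for text in texts]


def prepare_symmetric(texts: List[str]) -> List[str]:
    """Symmetric tasks (e.g. semantic similarity): every text gets "query: "."""
    return add_prefix(texts, QUERY_PREFIX)


def prepare_asymmetric(
    queries: List[str], passages: List[str]
) -> Tuple[List[str], List[str]]:
    """Asymmetric tasks (e.g. retrieval): queries get "query: ", documents get "passage: "."""
    return add_prefix(queries, QUERY_PREFIX), add_prefix(passages, PASSAGE_PREFIX)


if __name__ == "__main__":
    print(prepare_symmetric(["How do I reset my password?"]))
    print(prepare_asymmetric(["capital of France"], ["Paris is the capital of France."]))
```

The prefixed strings would then be passed to whatever embedding call you already use; nothing else about the pipeline changes.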
Thanks for your answer. How much better does it perform? Do we have metrics to back this claim?
I guess we can run the MTEB benchmark script using a "passage: " prefix to measure it.
Also, this means that if I want to perform both symmetric and asymmetric tasks with maximum performance, I would have to store the vectors with both the query and the passage prefix.
Hey, I just ran the numbers on 10 semantic textual similarity tasks with multilingual-e5-large. The results are as follows:
Prefix | BIOSSES | SICK-R | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | STS22 | STS-B | Average |
---|---|---|---|---|---|---|---|---|---|---|---|
w/ "query: " prefix | 82.49 | 80.23 | 80.02 | 81.55 | 77.72 | 89.31 | 85.78 | 88.11 | 63.04 | 87.3 | 81.55 |
w/ "passage: " prefix | 82.66 | 77.76 | 80.56 | 79.84 | 78.42 | 89.27 | 84.67 | 87.5 | 67.24 | 85.01 | 81.29 |
There is a difference, but it is small. If you have a validation set, it is best to test on your own data.
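To make the per-task picture easier to read, here is a small sketch that tallies the table above. The scores are copied from the two rows; the variable names are just illustrative.

```python
# Scores copied from the table above (multilingual-e5-large, 10 STS tasks).
tasks = ["BIOSSES", "SICK-R", "STS12", "STS13", "STS14",
         "STS15", "STS16", "STS17", "STS22", "STS-B"]
query_scores = [82.49, 80.23, 80.02, 81.55, 77.72, 89.31, 85.78, 88.11, 63.04, 87.3]
passage_scores = [82.66, 77.76, 80.56, 79.84, 78.42, 89.27, 84.67, 87.5, 67.24, 85.01]

# Positive difference: "query: " scored higher on that task.
diffs = {
    task: round(q - p, 2)
    for task, q, p in zip(tasks, query_scores, passage_scores)
}
query_wins = [task for task, d in diffs.items() if d > 0]
passage_wins = [task for task, d in diffs.items() if d < 0]

# Gap between the two averages, rounded like the table.
mean_diff = round(
    sum(query_scores) / len(tasks) - sum(passage_scores) / len(tasks), 2
)

if __name__ == "__main__":
    for task, d in diffs.items():
        print(f"{task:8s} {d:+.2f}")
    print("query ahead on:", query_wins)
    print("passage ahead on:", passage_wins)
    print("average gap:", mean_diff)
```

On these numbers, "query: " is ahead on 6 of the 10 tasks and "passage: " on the other 4, with an average gap of 0.26 points in favor of "query: ". The largest single gap is STS22, where "passage: " is ahead by 4.2 points, so the per-task variation is larger than the average suggests.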