Best instructions for clustering and semantic similarity
#29, opened by rmilliere
The model card gives an example instruction for retrieval.
What are the recommended instructions to get embeddings optimized for either clustering or sentence similarity instead of retrieval?
Thank you for asking the question. All instruction prefix examples (including clustering, STS, classification, etc.) are available in Table 7 of our NV-Embed paper: https://arxiv.org/pdf/2405.17428
Thanks, I missed that in the appendix.
If anyone else is looking for this information, here are the relevant instructions:
- STS: "Retrieve semantically similar text."
- Clustering (adjusted for a generic task): "Identify the topic or theme of X" (e.g., "Identify the topic or theme of the given sentences" for a corpus of sentences)
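
For anyone wiring these into code, here is a minimal sketch of turning the two instructions above into prefix strings. It assumes the same "Instruct: ...\nQuery: " template that the model card uses for its retrieval example; this thread does not confirm that the template carries over unchanged to STS or clustering, so treat it as an assumption. The names `INSTRUCTIONS` and `build_prefix` are hypothetical helpers, not part of the model's API.

```python
# Instructions quoted from this thread (originally from Table 7 of the paper).
# The clustering entry uses the generic "given sentences" wording from above.
INSTRUCTIONS = {
    "sts": "Retrieve semantically similar text.",
    "clustering": "Identify the topic or theme of the given sentences",
}


def build_prefix(task: str) -> str:
    """Return the instruction prefix for a task.

    Assumption: the "Instruct: <instruction>\\nQuery: " template from the
    model card's retrieval example also applies to these tasks.
    """
    return f"Instruct: {INSTRUCTIONS[task]}\nQuery: "


sts_prefix = build_prefix("sts")
clustering_prefix = build_prefix("clustering")
```

The resulting string would then be passed wherever the model card's retrieval example passes its query prefix. For symmetric tasks such as STS and clustering, it seems reasonable to apply the same prefix to every input text, but check the paper if this matters for your results.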