Knowledge distillation into a smaller model
Hello!
I've quite enjoyed watching this model release. In truth, I did not expect an LLM to be capable of producing such valuable embeddings. To me, this raises the question: could we distill these high-quality embeddings into a smaller model (e.g. bge-small) to 1) improve the performance of the smaller student model and 2) create longer-sequence-length models without requiring long-sequence labeled training data?
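To make that concrete, here is a minimal sketch of how such a distillation run could look with the current Sentence Transformers API, using MSELoss over teacher embeddings. The student repo id (BAAI/bge-small-en-v1.5), the Dense projection added to match the teacher's embedding dimension, and all hyperparameters are illustrative assumptions, not a tested recipe:

# Hedged sketch: distill e5-mistral-7b-instruct embeddings into bge-small via MSELoss.
# The Dense projection, batch size, and epoch count are illustrative assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, models, losses

teacher = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# Build the student with an extra Dense layer so its output matches the teacher's dimension.
transformer = models.Transformer("BAAI/bge-small-en-v1.5")
pooling = models.Pooling(transformer.get_word_embedding_dimension())
projection = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=teacher.get_sentence_embedding_dimension(),
)
student = SentenceTransformer(modules=[transformer, pooling, projection])

sentences = ["how much protein should a female eat"]  # in practice: a large unlabeled corpus
teacher_embeddings = teacher.encode(sentences)

# MSELoss trains the student to reproduce the teacher embeddings passed as labels.
train_examples = [InputExample(texts=[s], label=e) for s, e in zip(sentences, teacher_embeddings)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)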
Additionally, I'm very interested in implementing first-party support for models such as this one in Sentence Transformers. One key aspect that is currently lacking is prompt template support. I was picturing adding a "prompts" configuration option like so:
{
    ...
    "prompts": {
        "retrieval": "Retrieve semantically similar text. {}",
        "summarization": "Given a news summary, retrieve other semantically similar summaries. {}",
        ...
    },
    "default_prompt_key": "retrieval"
}
Then, models can be run like so:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")
# prompt_key selects which template from the model's "prompts" configuration to apply.
model.encode("how much protein should a female eat", prompt_key="query")
model.encode("As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.", prompt_key="passage")
The same would work for intfloat/e5-mistral-7b-instruct.
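For clarity, I would expect encode to resolve the template internally along these lines (a hypothetical sketch of the proposed behavior; self.prompts, self.default_prompt_key, and the helper name are assumptions and do not exist in Sentence Transformers today):

def _apply_prompt(self, text, prompt_key=None):
    # Fall back to the model's default prompt when no key is given.
    key = prompt_key or self.default_prompt_key
    template = self.prompts[key]   # e.g. "Retrieve semantically similar text. {}"
    return template.format(text)   # fill the {} placeholder with the input text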
I have this change planned alongside various other improvements to Sentence Transformers, such as a stronger Trainer, more modern objective functions (InfoNCE, AnglE), multi-GPU training, and more. I'd be very curious to hear your thoughts on all of this.
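For anyone following along, InfoNCE with in-batch negatives boils down to a cross-entropy over scaled similarities, roughly as below (a generic sketch, not code from this thread; the temperature value is an arbitrary illustration):

import torch
import torch.nn.functional as F

def info_nce(query_emb, passage_emb, temperature=0.05):
    # Cosine similarity between every query and every passage in the batch.
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    scores = query_emb @ passage_emb.T / temperature
    # The matching passage sits on the diagonal; all other passages act as in-batch negatives.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)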
Keep up the great work.
- Tom Aarsen
Hi @tomaarsen,
Wow, those are some really impressive improvements to Sentence Transformers!
I have been using Sentence Transformers for quite some time; its interface is simple and easy to understand. However, when it comes to customizations such as tweaking the tokenization a bit or adding instructions, some inelegant hack is often necessary. I very much look forward to the new releases.
As for distilling into smaller models, it certainly makes sense from the standpoint of inference efficiency and storage cost. The current e5-mistral-7b-instruct model is expensive to run, even on GPUs.
The ideas behind the Tightly Coupled Teacher paper might fit here. Another thought is that cross-encoders are usually more powerful than bi-encoders at the same model size, so distilling from LLM-based cross-encoders such as RankLLaMA is also worth trying.
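For instance, cross-encoder-to-bi-encoder distillation is already expressible with Sentence Transformers' MarginMSELoss; here is a minimal sketch, assuming a RankLLaMA-style reranker provides the (query, positive) vs. (query, negative) score margin. The student repo id, texts, and teacher scores below are placeholders:

# Hedged sketch: margin-MSE distillation from a cross-encoder reranker into a bi-encoder.
# The teacher scores are placeholders; in practice they would come from e.g. RankLLaMA.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

student = SentenceTransformer("BAAI/bge-small-en-v1.5")

query = "how much protein should a female eat"
positive = "The CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day."
negative = "Protein powders vary widely in price."
teacher_margin = 7.3 - 1.1  # cross-encoder score(query, positive) - score(query, negative)

# MarginMSELoss trains the bi-encoder so that its score margin matches the teacher's margin.
train_examples = [InputExample(texts=[query, positive, negative], label=teacher_margin)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MarginMSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)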
I have not run such distillation experiments yet, but this is definitely a promising research direction.
Liang
Hi, you have an interesting motivation. Have you published this model as a research paper for citation? Thanks.
Best regards