Knowledge distillation into a smaller model
Hello!
I've quite enjoyed watching this model release. In truth, I did not expect an LLM to be capable of producing such valuable embeddings. To me, this raises the question: could we distill these high-quality embeddings into a smaller model (e.g. bge-small) to 1) improve the performance of the smaller student model and 2) create longer-sequence-length models without requiring long-sequence labeled training data?
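To make that concrete, here is a minimal sketch of how such a distillation run could look with the current Sentence Transformers API, using MSELoss over teacher embeddings. The student repo id (BAAI/bge-small-en-v1.5), the Dense projection added to match the teacher's embedding dimension, and all hyperparameters are illustrative assumptions, not a tested recipe:

# Hedged sketch: distill e5-mistral-7b-instruct embeddings into bge-small via MSELoss.
# The Dense projection, batch size, and epoch count are illustrative assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, models, losses

teacher = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# Build the student with an extra Dense layer so its output matches the teacher's dimension.
transformer = models.Transformer("BAAI/bge-small-en-v1.5")
pooling = models.Pooling(transformer.get_word_embedding_dimension())
projection = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=teacher.get_sentence_embedding_dimension(),
)
student = SentenceTransformer(modules=[transformer, pooling, projection])

sentences = ["how much protein should a female eat"]  # in practice: a large unlabeled corpus
teacher_embeddings = teacher.encode(sentences)

# MSELoss trains the student to reproduce the teacher embeddings passed as labels.
train_examples = [InputExample(texts=[s], label=e) for s, e in zip(sentences, teacher_embeddings)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)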
Additionally, I'm very interested in implementing first-party support for models such as this one in Sentence Transformers. One key aspect that is currently lacking is prompt template support. I was picturing adding a "prompts" configuration option like so:
{
    ...
    "prompts": {
        "retrieval": "Retrieve semantically similar text. {}",
        "summarization": "Given a news summary, retrieve other semantically similar summaries. {}",
        ...
    },
    "default_prompt_key": "retrieval"
}
Then, models can be run like so:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")
# prompt_key selects which template from the model's "prompts" configuration to apply.
model.encode("how much protein should a female eat", prompt_key="query")
model.encode("As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.", prompt_key="passage")
The same would work for intfloat/e5-mistral-7b-instruct.
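For clarity, I would expect encode to resolve the template internally along these lines (a hypothetical sketch of the proposed behavior; self.prompts, self.default_prompt_key, and the helper name are assumptions and do not exist in Sentence Transformers today):

def _apply_prompt(self, text, prompt_key=None):
    # Fall back to the model's default prompt when no key is given.
    key = prompt_key or self.default_prompt_key
    template = self.prompts[key]   # e.g. "Retrieve semantically similar text. {}"
    return template.format(text)   # fill the {} placeholder with the input text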
I have this change planned alongside various other improvements to Sentence Transformers, such as a stronger Trainer, more modern objective functions (InfoNCE, AnglE), multi-GPU training, and more. I'd be very curious to hear your thoughts on all of this.
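For anyone following along, InfoNCE with in-batch negatives boils down to a cross-entropy over scaled similarities, roughly as below (a generic sketch, not code from this thread; the temperature value is an arbitrary illustration):

import torch
import torch.nn.functional as F

def info_nce(query_emb, passage_emb, temperature=0.05):
    # Cosine similarity between every query and every passage in the batch.
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    scores = query_emb @ passage_emb.T / temperature
    # The matching passage sits on the diagonal; all other passages act as in-batch negatives.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)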
Keep up the great work.
- Tom Aarsen
Hi @tomaarsen,
Wow, those are some really impressive improvements to Sentence Transformers!
I have been using Sentence Transformers for quite some time; its interface is simple and easy to understand. However, when it comes to customizations such as tweaking the tokenization a bit or adding instructions, some inelegant hack is often necessary. I very much look forward to the new releases.
As for distilling into smaller models, it certainly makes sense from the standpoint of inference efficiency and storage cost. The current e5-mistral-7b-instruct model is expensive to run, even on GPUs.
The ideas behind the Tightly Coupled Teacher paper might fit here. Another thought is that cross-encoders are usually more powerful than bi-encoders at the same model size, so distilling from LLM-based cross-encoders such as RankLLaMA is also worth trying.
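For instance, cross-encoder-to-bi-encoder distillation is already expressible with Sentence Transformers' MarginMSELoss; here is a minimal sketch, assuming a RankLLaMA-style reranker provides the (query, positive) vs. (query, negative) score margin. The student repo id, texts, and teacher scores below are placeholders:

# Hedged sketch: margin-MSE distillation from a cross-encoder reranker into a bi-encoder.
# The teacher scores are placeholders; in practice they would come from e.g. RankLLaMA.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

student = SentenceTransformer("BAAI/bge-small-en-v1.5")

query = "how much protein should a female eat"
positive = "The CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day."
negative = "Protein powders vary widely in price."
teacher_margin = 7.3 - 1.1  # cross-encoder score(query, positive) - score(query, negative)

# MarginMSELoss trains the bi-encoder so that its score margin matches the teacher's margin.
train_examples = [InputExample(texts=[query, positive, negative], label=teacher_margin)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MarginMSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)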
I have not run such distillation experiments yet, but this is definitely a promising research direction.
Liang
Hi, you have an interesting motivation. Have you published this model as a research paper for citation? Thanks.
Best regards