Training Data and Distillation
Hi. I'm Márton Kardos, maintainer of MTEB.
I am writing to you because our recent efforts have focused on reliably indicating to our users whether a model has been trained in-domain, or whether its scores on our benchmarks can be taken as an accurate indication of its generalization performance.
From your technical report we have learned that Jasper was distilled from models that were trained on multiple MTEB datasets, and we have been able to annotate this in our model metadata.
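For reference, the kind of record we keep is roughly the following (a minimal sketch, not the exact MTEB schema; the field names and the dataset names are illustrative placeholders):

```python
# Sketch of a training-data annotation in our model metadata.
# Field and dataset names are illustrative placeholders, not the actual
# MTEB schema or a verified list of in-domain datasets.
JASPER_ANNOTATION = {
    "model": "Jasper",
    # MTEB datasets seen during training, directly or via a teacher model;
    # the leaderboard can then flag the corresponding scores as in-domain.
    "training_datasets": {
        "MSMARCO": ["train"],  # placeholder entry
        "NQ": ["train"],       # placeholder entry
    },
}
```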
Your report, however, does not indicate whether the Stella models were trained on MTEB datasets or were finetuned/distilled from models that were.
The Stella models deliver performance very similar to that of models fine-tuned on MTEB tasks, and it seems reasonable to assume that Stella has likewise seen MTEB data in some form.
As a fellow scholar, I assume you share my dedication to open science, and it is in this spirit that I ask you to disclose these details to us and to both our users and yours.
Thanks in advance, Márton
@kardosdrur
Hi there
1) Jasper is distilled from stella_en_1.5B_v5 and nvidia/NV-Embed-v2
2) stella_en_1.5B_v5 is distilled from gte-Qwen2-7B-instruct and nvidia/NV-Embed-v1
3) When training (i.e. distilling) the Jasper and Stella models, we only use unsupervised texts; however, these training texts may slightly overlap with MTEB sentences.
So, I think the two models' zero-shot score (or ratio?) should be consistent with nvidia/NV-Embed-v1, nvidia/NV-Embed-v2 and gte-Qwen2-7B-instruct.
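(By the ratio I mean roughly the share of benchmark tasks a model was not trained on, something like the sketch below; this is just my understanding, not MTEB's exact computation.)

```python
# Rough sketch of a zero-shot ratio: the share of benchmark tasks whose
# training data the model (or its teacher models) never saw.
# This is my reading of the metric, not MTEB's exact implementation.
def zero_shot_ratio(benchmark_tasks: set[str], trained_on: set[str]) -> float:
    overlap = benchmark_tasks & trained_on
    return 1.0 - len(overlap) / len(benchmark_tasks)

# Hypothetical example: 3 of 56 benchmark tasks overlap with training data.
tasks = {f"task_{i}" for i in range(56)}
print(zero_shot_ratio(tasks, {"task_0", "task_1", "task_2"}))  # ~0.946
```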
Thanks for the MTEB maintainers' efforts. MTEB-2.0 is cool and I hope it becomes a perfect leaderboard >~<