Parameters for peak performance

#21
by cvdbdo - opened

Are there any stats on performance on the same dataset when changing the document chunk size, chunk strategy, languages, or model quantization?
By trial and error, it seems to me that smaller chunks (i.e., a few sentences at most) tend to perform better.
I am trying to compare different embedders at their best, using the proper parameters for each.
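For context, the sentence-level chunking strategy being compared can be sketched as follows. This is a minimal illustration (the regex-based sentence splitter and the `chunk_by_sentences` helper are hypothetical, not part of any embedder's tooling); a real pipeline would embed each chunk with the model under test:

```python
import re

def chunk_by_sentences(text, max_sentences=3):
    """Split text into chunks of at most `max_sentences` sentences each."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Group consecutive sentences into fixed-size chunks.
    return [
        " ".join(sentences[i : i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

doc = "First point. Second point. Third point. Fourth point. Fifth point."
print(chunk_by_sentences(doc, max_sentences=2))
```

Sweeping `max_sentences` over a few values and re-running retrieval evaluation per embedder would give the kind of comparison described above.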

Alibaba-NLP org

Firstly, thank you for your interest in the GTE series models. This is a very interesting question. We do not currently have such experimental data, and our previous experiments did not indicate that shorter texts perform better, largely because evaluation data for this comparison is lacking.

The retrieval effectiveness of the model can be influenced by various factors, such as text length and language. We speculate that the better performance you observe with shorter chunks may have two causes:

  • The training data for existing models predominantly consists of short texts, as it is relatively difficult to obtain relevance data for long texts.
  • The semantic expression of short texts is more precise and concise, which is more conducive to semantic search.