Size of the sentences

#2
by JuanBarragan - opened

I'd like to use this model for embedding documents, but I receive errors when a document is somewhat large, say more than 300 words. Is there any limit on this?

La Javaness org

Hi @JuanBarragan ,
Yes, max_seq_length must be less than 512 tokens, which is roughly 400 words.
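For example, you can check the token count of a document before encoding it; a minimal sketch using the model's own tokenizer (the sample text is just an illustration):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lajavaness/sentence-camembert-large")
text = "This is a test " * 100
# Count the tokens the model will actually see, special tokens included.
n_tokens = len(model.tokenizer(text)["input_ids"])
print(n_tokens)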

Thank you @dangvantuan for your answer.
I fixed that, yet I noticed that when I request an embedding with more than this number of tokens, the model gets corrupted: even if I catch the exception, I can no longer use it, even with the right number of tokens. Is that normal?

La Javaness org

@JuanBarragan
You may segment your document into chunks, compute the similarity between the query and each chunk, and then take the chunk with the highest similarity score.
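A rough sketch of that approach (the 300-word chunk size and the sample texts are illustrative choices, not values from the model card):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Lajavaness/sentence-camembert-large")

def chunk_words(text, size=300):
    # Split the document into word chunks small enough for the model.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

query = "Quelle est la capitale de la France ?"
document = "Paris est la capitale de la France. " * 200
chunks = chunk_words(document)

# Embed the query and every chunk, then keep the chunk with the
# highest cosine similarity to the query.
query_emb = model.encode(query, convert_to_tensor=True)
chunk_embs = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(query_emb, chunk_embs)[0]
best_chunk = chunks[int(scores.argmax())]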

Hi @dangvantuan, in sentence_bert_config.json the max_seq_length value is set to 514, and when loading the model with sentence_transformers, the result of model.get_max_seq_length() is 514 as well.
Yet when manually truncating an input sequence to exactly 514 tokens with the associated tokenizer, the model fails with an IndexError: index out of range in self at the embedding step.

When manually setting model.max_seq_length = 512 after loading the model, the issue disappears, and the model is also able to automatically truncate inputs to the correct length.

Small reproducible example:

  • This does not work
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lajavaness/sentence-camembert-large", device="cpu")
batch = ["This is a test " * 1000]
truncated_tokens = model.tokenizer(
    batch,
    truncation=True,
    max_length=model.max_seq_length,
    return_tensors="pt"
)
print(truncated_tokens["input_ids"].shape)
truncated_text = model.tokenizer.batch_decode(
    truncated_tokens["input_ids"], skip_special_tokens=True
)
print(truncated_text)
model.encode(truncated_text)  # fails here with IndexError: index out of range in self
  • This works
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lajavaness/sentence-camembert-large", device="cpu")
model.max_seq_length = 512
batch = ["This is a test " * 1000]
truncated_tokens = model.tokenizer(
    batch,
    truncation=True,
    max_length=model.max_seq_length,
    return_tensors="pt"
)
print(truncated_tokens["input_ids"].shape)
truncated_text = model.tokenizer.batch_decode(
    truncated_tokens["input_ids"], skip_special_tokens=True
)
print(truncated_text)
model.encode(truncated_text)
  • This works
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lajavaness/sentence-camembert-large", device="cpu")
model.max_seq_length = 512
batch = ["This is a test " * 1000]
model.encode(batch)
  • This works as well (and I cannot figure out why), but I need to use the device kwarg in my use case
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lajavaness/sentence-camembert-large")
batch = ["This is a test " * 1000]
model.encode(batch)

I believe there might be a small configuration issue, and the default max_seq_length needs to be decreased ever so slightly. (Presumably this is because CamemBERT is RoBERTa-based: the position embedding table has 514 entries, but position ids are offset by 2 to account for the padding token, leaving only 512 usable positions.)
Thanks for the upload, the model works great otherwise!
