Size of the sentences
I'd like to use this model for embedding documents, but I receive errors when a document is somewhat big, say more than 300 words. Is there any limit on this?
Hi @JuanBarragan,
Yes, max_seq_length must be less than 512 tokens, which is roughly 400 words.
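For example, you can check how many tokens a document produces before encoding it (a minimal sketch; the sample text is arbitrary):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Lajavaness/sentence-camembert-large")
doc = "word " * 500  # arbitrary long document
# Tokenize the same way the model does and compare against the limit
n_tokens = len(model.tokenizer(doc)["input_ids"])
print(n_tokens, model.max_seq_length)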
Thank you @dangvantuan for your answer.
I fixed that, but I noticed that when I request an embedding with more than this number of tokens, the model gets corrupted: even if I catch the exception, I can no longer use the model, even with inputs of the right number of tokens. Is that normal?
@JuanBarragan
You can segment your document into chunks, compute the similarity between the query and each chunk, and then take the chunk with the highest similarity score, as sketched below.
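A rough sketch of that approach (the naive word-count chunking and the placeholder strings are just illustrations, not a tested recipe):
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("Lajavaness/sentence-camembert-large")
document = "your long document ..."  # placeholder
query = "your query"                 # placeholder
# Split into chunks safely below the ~400-word limit
words = document.split()
chunks = [" ".join(words[i:i + 300]) for i in range(0, len(words), 300)]
# Embed the query and every chunk, keep the chunk with the highest cosine similarity
query_emb = model.encode(query, convert_to_tensor=True)
chunk_embs = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(query_emb, chunk_embs)[0]
best_chunk = chunks[int(scores.argmax())]
print(best_chunk)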
Hi @dangvantuan, in sentence_bert_config.json the max_seq_length value is set to 514, and when loading the model with sentence_transformers, model.get_max_seq_length() returns 514 as well.
Yet when manually truncating an input sequence to exactly 514 tokens with the associated tokenizer, the model fails at the embedding step with IndexError: index out of range in self.
When manually setting model.max_seq_length = 512 after loading, the issue disappears and the model is also able to automatically truncate inputs to the correct length.
Small reproducible examples:
- This does not work
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Lajavaness/sentence-camembert-large", device="cpu")
batch = ["This is a test " * 1000]
truncated_tokens = model.tokenizer(
    batch,
    truncation=True,
    max_length=model.max_seq_length,  # 514, as loaded from the config
    return_tensors="pt"
)
print(truncated_tokens["input_ids"].shape)
truncated_text = model.tokenizer.batch_decode(
    truncated_tokens["input_ids"], skip_special_tokens=True
)
print(truncated_text)
model.encode(truncated_text)  # raises IndexError: index out of range in self
- This works
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Lajavaness/sentence-camembert-large", device="cpu")
model.max_seq_length = 512
batch = ["This is a test " * 1000]
truncated_tokens = model.tokenizer(
    batch,
    truncation=True,
    max_length=model.max_seq_length,
    return_tensors="pt"
)
print(truncated_tokens["input_ids"].shape)
truncated_text = model.tokenizer.batch_decode(
    truncated_tokens["input_ids"], skip_special_tokens=True
)
print(truncated_text)
model.encode(truncated_text)
- This works
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Lajavaness/sentence-camembert-large", device="cpu")
model.max_seq_length = 512
batch = ["This is a test " * 1000]
model.encode(batch)
- This works as well (and I cannot figure out why), but I need to use the device kwarg in my use case
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Lajavaness/sentence-camembert-large")
batch = ["This is a test " * 1000]
model.encode(batch)
I believe there might be a small configuration issue, and the default max_seq_length needs to be decreased ever so slightly.
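For what it's worth, my guess (unverified) is that this comes from the RoBERTa-style position embeddings CamemBERT uses: max_position_embeddings is 514, but two position ids are reserved for the padding offset, so only 512 positions are actually usable. Assuming that is the cause, the safe value can be derived instead of hard-coded (model[0].auto_model is the underlying Hugging Face model wrapped by sentence_transformers):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Lajavaness/sentence-camembert-large", device="cpu")
# RoBERTa-like models reserve 2 position ids for the padding offset,
# so the usable length is max_position_embeddings - 2 (here 514 - 2 = 512)
config = model[0].auto_model.config
model.max_seq_length = config.max_position_embeddings - 2
print(model.max_seq_length)  # 512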
Thanks for the upload, the model works great otherwise!