Missing tokenize method?
Hey,
I tried to fine-tune that embedding model to my specific use case, basically putting the name of the model into my code, which works with various other models. It's basically, as you recommended, using the SentenceTransformer model.fit.
The error is:
File "xxx/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "xxx/sentence_transformers/SentenceTransformer.py", line 551, in smart_batching_collate
tokenized = self.tokenize(texts[idx])
File "xxx/sentence_transformers/SentenceTransformer.py", line 319, in tokenize
return self._first_module().tokenize(texts)
AttributeError: 'NoneType' object has no attribute 'tokenize'
Am I doing something wrong here, or is this method simply missing (as the error states)?
Do you have any recommendations?
Best,
Damian
hi @DamianS89, can you give more context, such as your fine-tuning code?
Sure,
I am using (most of the time) SentenceTransformer to fine-tune my embedding models:
Simplified code:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation

examples = []
for i in range(len(data['pos'])):
    examples.append(InputExample(texts=[data['query'], data['pos'][i], data['neg'][i]]))
train_examples = examples[:1000]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
model = SentenceTransformer(
    model_name,  # name of the embedding model being fine-tuned
    device=device,
)
evaluator = evaluation.TripletEvaluator.from_input_examples(eval_examples, name='eval', batch_size=batch_size)
train_loss = losses.TripletLoss(model=model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    steps_per_epoch=steps_per_epoch,
    optimizer_params={'lr': learning_rate},
    weight_decay=0,
    show_progress_bar=True,
    callback=val_callback,
    evaluator=evaluator,
    save_best_model=True,
)
model.save(f"{base_path}/fine-tuning/emb-models/{ft_model_id}")
Best,
Damian
hi @DamianS89,
after the sentence-transformers release on Jan 30th, 2024, jina-v2 is now officially supported by sbert. I'm not sure about the reason, but most likely it's because previous sbert versions did not support trust_remote_code. But since yesterday you can do:
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de",  # switch to en/zh for English or Chinese
    trust_remote_code=True,  # NEEDED
)
# control your input sequence length up to 8192
model.max_seq_length = 1024
embeddings = model.encode([
    'How is the weather today?',
    'Wie ist das Wetter heute?',
])
print(cos_sim(embeddings[0], embeddings[1]))
>>> tensor([[0.9602]])
so, please upgrade sbert, then set trust_remote_code=True and give it another try.
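Applied to your fine-tuning snippet, that just means passing the flag when constructing the model. A minimal sketch (the jina-v2 model name is an assumption based on this thread; substitute whatever model you actually fine-tune):

from sentence_transformers import SentenceTransformer

device = "cuda"  # or "cpu"
model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de",  # assumed model, per this thread
    device=device,
    trust_remote_code=True,  # loads the custom jina modules, so _first_module() is no longer None
)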
Hey,
yep, I know, I actually wanted to open an issue there, and while I was writing it, they released 2.3.0 ^^
Before that, I hacked the huggingface package and basically added "this.client.max_seq_length = xxx" statically for testing purposes.
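With 2.3.0 that hack isn't needed anymore; the model.max_seq_length attribute from your snippet does the same thing. A minimal sketch (model name assumed as above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de",
    trust_remote_code=True,
)
# supported way to cap the input length, instead of patching the package
model.max_seq_length = 1024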
Thank you for your responses.
Best,
Damian