Missing tokenize method?
Hey,
I tried to fine-tune that embedding model to my specific use case, basically putting the name of the model into my code, which works with various other models. It's basically, as you recommended, using the SentenceTransformer model.fit.
The error is:
File "xxx/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "xxx/sentence_transformers/SentenceTransformer.py", line 551, in smart_batching_collate
tokenized = self.tokenize(texts[idx])
File "xxx/sentence_transformers/SentenceTransformer.py", line 319, in tokenize
return self._first_module().tokenize(texts)
AttributeError: 'NoneType' object has no attribute 'tokenize'
Am I doing something wrong here, or is this method simply missing (as the error states)?
Do you have any recommendations?
Best,
Damian
hi @DamianS89, can you give more context, such as your fine-tuning code?
Sure,
I am using (most of the time) SentenceTransformer to fine-tune my embedding models:
Simplified code:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation

examples = []
for i in range(len(data['pos'])):
    examples.append(InputExample(texts=[data['query'], data['pos'][i], data['neg'][i]]))
train_examples = examples[:1000]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
model = SentenceTransformer(
    model_name,  # name of the embedding model being fine-tuned
    device=device,
)
evaluator = evaluation.TripletEvaluator.from_input_examples(eval_examples, name='eval', batch_size=batch_size)
train_loss = losses.TripletLoss(model=model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    steps_per_epoch=steps_per_epoch,
    optimizer_params={'lr': learning_rate},
    weight_decay=0,
    show_progress_bar=True,
    callback=val_callback,
    evaluator=evaluator,
    save_best_model=True,
)
model.save(f"{base_path}/fine-tuning/emb-models/{ft_model_id}")
Best,
Damian
hi @DamianS89,
after the sentence-transformers release on Jan 30th, 2024, jina-v2 is now officially supported by sbert. I'm not sure about the reason, but most likely it's because previous sbert versions did not support trust_remote_code. But since yesterday you can do:
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de",  # switch to en/zh for English or Chinese
    trust_remote_code=True,  # NEEDED
)
# control your input sequence length up to 8192
model.max_seq_length = 1024
embeddings = model.encode([
    'How is the weather today?',
    'Wie ist das Wetter heute?',
])
print(cos_sim(embeddings[0], embeddings[1]))
>>> tensor([[0.9602]])
so, please upgrade sbert, then set trust_remote_code=True and give it another try.
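Applied to your fine-tuning snippet, that just means passing the flag when constructing the model. A minimal sketch (the jina-v2 model name is an assumption based on this thread; substitute whatever model you actually fine-tune):

from sentence_transformers import SentenceTransformer

device = "cuda"  # or "cpu"
model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de",  # assumed model, per this thread
    device=device,
    trust_remote_code=True,  # loads the custom jina modules, so _first_module() is no longer None
)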
Hey,
yep, I know, I actually wanted to open an issue there, and while I was writing it, they released 2.3.0 ^^
Before that, I hacked the huggingface package and basically added "this.client.max_seq_length = xxx" statically for testing purposes.
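With 2.3.0 that hack isn't needed anymore; the model.max_seq_length attribute from your snippet does the same thing. A minimal sketch (model name assumed as above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de",
    trust_remote_code=True,
)
# supported way to cap the input length, instead of patching the package
model.max_seq_length = 1024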
Thank you for your responses.
Best,
Damian