Integrate Sentence Transformers; append EOS in the tokenizer rather than manually
Hello!
Congratulations on these model releases! Nice to see more strong, reasonably sized embedding models, especially with nice features like MRL. Well done!
Pull Request overview
- Integrate with Sentence Transformers (+ README updated, added a Sentence Transformers tag to make this model easier to find).
- Update the `tokenizer.json` `TemplateProcessing` so the EOS is always appended.
- Simplify `_tokenize` in `modeling_drama.py`, as the EOS is now handled automatically.
- Rename `self.forward` to `self.encode` in `modeling_drama.py`: this allows ST to work, as it uses its own pooling.
Details
I noticed that you're using the Llama tokenizer, which (in)famously struggles with placing the EOS after the tokenized sequence. This is due to the `TemplateProcessing`, which only contains the BOS and not the EOS. I used Arthur's recommendation here (https://github.com/huggingface/transformers/issues/22794#issuecomment-2092623992) to resolve it, i.e. I ran:
from transformers import AutoTokenizer
from tokenizers.processors import ByteLevel, Sequence, TemplateProcessing

tokenizer = AutoTokenizer.from_pretrained("facebook/drama-base")

bos = tokenizer.bos_token
eos = tokenizer.eos_token
# Replace the post-processor so that both BOS and EOS are added around the sequence(s)
tokenizer._tokenizer.post_processor = Sequence(
    [
        ByteLevel(add_prefix_space=True, trim_offsets=False, use_regex=True),
        TemplateProcessing(
            single=f"{bos}:0 $A:0 {eos}:0",
            pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
            special_tokens=[
                (f"{bos}", tokenizer.bos_token_id),
                (f"{eos}", tokenizer.eos_token_id),
            ],
        ),
    ]
)
and then saved that tokenizer. In `tokenizer.json`, the only updated lines are these:
...
  "post_processor": {
    "type": "Sequence",
    "processors": [
      {
        "type": "ByteLevel",
        "add_prefix_space": true,
        "trim_offsets": false,
        "use_regex": true
      },
      {
        "type": "TemplateProcessing",
        "single": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
+         },
+         {
+           "SpecialToken": {
+             "id": "<|end_of_text|>",
+             "type_id": 0
+           }
          }
        ],
        "pair": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
          },
+         {
+           "SpecialToken": {
+             "id": "<|end_of_text|>",
+             "type_id": 0
+           }
+         },
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 1
            }
          },
          {
            "Sequence": {
              "id": "B",
              "type_id": 1
            }
+         },
+         {
+           "SpecialToken": {
+             "id": "<|end_of_text|>",
+             "type_id": 1
+           }
          }
        ],
        "special_tokens": {
          "<|begin_of_text|>": {
            "id": "<|begin_of_text|>",
            "ids": [
              128000
            ],
            "tokens": [
              "<|begin_of_text|>"
            ]
+         },
+         "<|end_of_text|>": {
+           "id": "<|end_of_text|>",
+           "ids": [
+             128001
+           ],
+           "tokens": [
+             "<|end_of_text|>"
+           ]
          }
        }
      }
    ]
  },
...
Plus the corresponding updates, which you can see in `special_tokens_map.json`.
This allowed me to simplify the `_tokenize` method in your custom modeling code a lot. It should also be more efficient now.
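Roughly speaking, the simplification has this shape (a sketch with illustrative names and arguments, not the exact code in `modeling_drama.py`):

```python
# Before: the EOS token id had to be appended by hand, e.g. something like
#   tokenized = tokenizer(texts, truncation=True, max_length=max_length - 1)
#   tokenized["input_ids"] = [ids + [tokenizer.eos_token_id] for ids in tokenized["input_ids"]]
# followed by manual padding.

# After: one tokenizer call is enough, because the updated post-processor
# appends the EOS automatically.
def _tokenize(self, tokenizer, texts, max_length):
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
```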
I would recommend rerunning your code with this revision to experiment:
import torch
from transformers import AutoTokenizer, AutoModel
queries = [
'What percentage of the Earth\'s atmosphere is oxygen?',
'意大利首都是哪里?',
]
documents = [
"The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
"羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
]
model_name = "facebook/drama-base"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name, revision="refs/pr/1")
model = AutoModel.from_pretrained(model_name, revision="refs/pr/1", trust_remote_code=True).to(device)
query_embs = model.encode_queries(tokenizer, queries)
doc_embs = model.encode_documents(tokenizer, documents)
scores = query_embs @ doc_embs.T
print(scores.tolist())
# Expected output: [[0.5310, 0.0821], [0.1298, 0.6181]]
# An extra test:
tokenized = tokenizer("This is my text")
decoded = tokenizer.decode(tokenized["input_ids"])
print(decoded)
# <|begin_of_text|>This is my text<|end_of_text|>
You'll notice that the results are the same, and that the tokenizer automatically uses the EOS.
Beyond these changes, I added the following Sentence Transformers (ST) files:
- `modules.json`: Required, tells ST which "modules" to use. Here: Transformer, Pooling, and Normalize.
- `sentence_bert_config.json`: Optional, gives arguments for the Transformer module, notably the maximum sequence length of 8192.
- `config_sentence_transformers.json`: Optional, stores info about prompts and the default similarity function (cosine similarity; "dot" also works, as the embeddings are normalized).
- `1_Pooling/config.json`: Required, gives arguments to the Pooling module, telling it to use mean pooling.
This means that the model is now much easier to use with third-party libraries that integrate with Sentence Transformers, like LangChain, LlamaIndex, Haystack, etc.
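For example, usage could then look roughly like this (a sketch; the prompt name "query" is an assumption, check `config_sentence_transformers.json` for the exact prompt names):

```python
from sentence_transformers import SentenceTransformer

# Sketch only: load the model with the Sentence Transformers files from this PR.
model = SentenceTransformer("facebook/drama-base", revision="refs/pr/1", trust_remote_code=True)

queries = ["What percentage of the Earth's atmosphere is oxygen?"]
documents = [
    "The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, "
    "reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
]

# The prompt name "query" is an assumption based on the prompts stored in
# config_sentence_transformers.json; documents are encoded without a prompt.
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Cosine similarity is the configured default; dot product is equivalent here
# because the embeddings are normalized.
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```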
- Tom Aarsen
Also, I'd love to see these on MTEB. Note that there's an all-new way of submitting models since the MMTEB release from ~last week, described here: https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md
It consists of a simple PR to https://github.com/embeddings-benchmark/mteb and a PR to https://github.com/embeddings-benchmark/results. The first one should actually be easier once this PR is merged, as you can then use the SentenceTransformerLoader.
- Tom Aarsen
Hi @tomaarsen, thank you so much for sending this PR, very neat changes! We'll get back to you when we have a chance to test your PR! 😀
One quick question, since we've not used Sentence Transformers with our model yet. With
model = SentenceTransformer("facebook/drama-base", truncate_dim=256, trust_remote_code=True)
how does `SentenceTransformer` handle normalization in this case? Does normalization happen after the truncation?
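(A minimal sketch of one way to check this empirically, assuming the constructor call above:)

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("facebook/drama-base", truncate_dim=256, trust_remote_code=True)

embeddings = model.encode(["What percentage of the Earth's atmosphere is oxygen?"])
print(embeddings.shape)  # expected: (1, 256)
# If normalization is (re)applied after truncation, the norms should be 1.0.
print(np.linalg.norm(embeddings, axis=1))
```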