tokenizer offset_mapping is incorrect
#111, opened by Aflt98
I'm running this code:
from transformers import AutoTokenizer
# Initialize the tokenizer
model_path = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Tokenize the input text
text = 'The quick brown fox jumps over the lazy dog'
tokenized = tokenizer(
    text,
    return_tensors='pt',
    return_offsets_mapping=True
)
# Debugging: Print the tokenized output
print("Tokenized Output:", tokenized)
# Check offset mapping
offset_mapping = tokenized['offset_mapping'][0]
print("Offset Mapping:", offset_mapping)
# Extract tokens based on offset mapping
tokens = [text[s:e] for s, e in offset_mapping]
print("Tokens:", tokens)
and here is the output:
Tokenized Output: {'input_ids': tensor([[128000, 791, 4062, 14198, 39935, 35308, 927, 279, 16053, 5679]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'offset_mapping': tensor([[[ 0,  0],
         [ 0,  0],
         [ 3,  3],
         [ 9,  9],
         [15, 15],
         [19, 19],
         [25, 25],
         [30, 30],
         [34, 34],
         [39, 39]]])}
Offset Mapping: tensor([[ 0, 0],
[ 0, 0],
[ 3, 3],
[ 9, 9],
[15, 15],
[19, 19],
[25, 25],
[30, 30],
[34, 34],
[39, 39]])
Tokens: ['', '', '', '', '', '', '', '', '', '']
Why does every pair in the offset mapping have start == end ([0, 0], [3, 3], ...)? The start offsets look right, but the end offsets are missing, so text[s:e] yields an empty string for every token.
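For comparison, here is a minimal sketch of what a correct offset mapping looks like. It uses gpt2 purely as an example of a fast tokenizer whose offsets behave as expected; the model choice is an assumption for illustration:

from transformers import AutoTokenizer

# gpt2 is used only for comparison; any fast tokenizer without this bug would do
ref_tokenizer = AutoTokenizer.from_pretrained('gpt2')
ref = ref_tokenizer('The quick brown fox', return_offsets_mapping=True)
print(ref['offset_mapping'])
# Each pair should have end > start for real tokens, roughly
# [(0, 3), (3, 9), (9, 15), (15, 19)] for this input,
# so text[s:e] recovers each token's surface form.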
Related issue: https://github.com/huggingface/tokenizers/issues/1553
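Until the upstream fix lands, one possible workaround is to rebuild the spans by decoding tokens one at a time and scanning the original string with a cursor. This is only a sketch, under the assumption that each decoded token appears verbatim in the text (true for plain ASCII input like this, but not in general, e.g. with byte fallback or normalization):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
text = 'The quick brown fox jumps over the lazy dog'
ids = tokenizer(text, add_special_tokens=False)['input_ids']

# Walk a cursor through the text, locating each decoded token in turn.
offsets = []
cursor = 0
for token_id in ids:
    piece = tokenizer.decode([token_id])
    start = text.find(piece, cursor)
    if start == -1:
        # Decoded token not found verbatim (assumption violated); record an empty span
        offsets.append((cursor, cursor))
        continue
    end = start + len(piece)
    offsets.append((start, end))
    cursor = end

print(offsets)
# For this sentence the spans should come out roughly as
# [(0, 3), (3, 9), (9, 15), (15, 19), (19, 25), (25, 30), (30, 34), (34, 39), (39, 43)]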