Bug in tokenizer? #2
by birgitbrause - opened
I might have noticed a bug in the tokenizer. In its normalizer, strip_accents appears to be set to True automatically on load, because it is not explicitly set to False in tokenizer_config.json. As a result, the tokenizer strips the accents from German umlauts. In the tokenizer's vocabulary, however, I can see tokens that contain umlauts.
When loaded from the Hub, the tokenizer behaves like this:
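To illustrate the effect (this is not the exact code from the original post), the accent stripping that a BERT-style normalizer performs can be reproduced with the standard library alone: NFD-decompose the text, then drop the combining marks.

```python
import unicodedata

def bert_style_strip_accents(text: str) -> str:
    """Mimic what a BERT-style normalizer does when strip_accents is active:
    decompose to NFD, then drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(bert_style_strip_accents("Mädchen über Straße"))  # Madchen uber Straße
```

Note that the umlauts are lost ("Mädchen" becomes "Madchen"), while "ß" survives because it is not an accent.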
If I deactivate the normalization:
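As a sketch of the difference (again, not the original post's code), the two settings can be compared directly with the tokenizers library's BertNormalizer:

```python
from tokenizers.normalizers import BertNormalizer

# strip_accents=True: what the Hub-loaded tokenizer apparently ends up with
with_strip = BertNormalizer(lowercase=True, strip_accents=True)
# strip_accents=False: umlauts are preserved
no_strip = BertNormalizer(lowercase=True, strip_accents=False)

print(with_strip.normalize_str("Mädchen"))  # madchen
print(no_strip.normalize_str("Mädchen"))    # mädchen
```

On a loaded fast tokenizer, one way to deactivate the stripping is presumably to replace tokenizer.backend_tokenizer.normalizer with a BertNormalizer constructed with strip_accents=False.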
Given the umlauts in the vocabulary, I guess strip_accents == False is the originally intended behaviour?
Do you have an idea whether the model was trained with or without seeing umlauts?
I tested this with transformers versions 2.3.0, 4.6.1, and 4.25.1.