Exception: data did not match any variant of untagged enum PyDecoderWrapper

#1
by ymoslem - opened

Hello! Thanks for your efforts!

When I tried to load the tokenizer

tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_en")

I received the following error:

Exception: data did not match any variant of untagged enum PyDecoderWrapper at line 1130 column 3

I tried downloading the tokenizer locally too.

Thanks!

It seems the tokenizer was trained with an older version of tokenizers:

!pip install tokenizers==0.13.4rc2
!wget "https://huggingface.co/HPLT/hplt_bert_base_en/resolve/main/tokenizer.json?download=true" -O tokenizer.json

from tokenizers import Tokenizer

tok = Tokenizer.from_file('./tokenizer.json')
tok.get_vocab()

This works to a first approximation:

import json

import requests
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

lang = "en"
response = requests.get(f"https://huggingface.co/HPLT/hplt_bert_base_{lang}/resolve/main/tokenizer.json?download=true")
response.raise_for_status()

tokenizer_json = json.loads(response.content)


def migrate_metaspace(items):
    # Replace the legacy 'add_prefix_space' flag with 'prepend_scheme' on Metaspace components.
    for item in items:
        if item.get('type') == 'Metaspace' and 'add_prefix_space' in item:
            value = item.pop('add_prefix_space')
            item['prepend_scheme'] = 'always' if value else 'never'


migrate_metaspace(tokenizer_json['pre_tokenizer']['pretokenizers'])
migrate_metaspace(tokenizer_json['decoder']['decoders'])

# Use a distinct variable name so the Tokenizer class is not shadowed.
tokenizer_object = Tokenizer.from_str(json.dumps(tokenizer_json))
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer_object)

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# AutoTokenizer fails on recent tokenizers releases, so reuse the `tokenizer` built above.
#tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_en")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_en", trust_remote_code=True)

mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt")
output_p = model(**input_text)
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: '[CLS] It's a beautiful place.[SEP]'
assert tokenizer.decode(output_text[0].tolist()) == "[CLS] It's a beautiful place.[SEP]"
HPLT org

Hi, thank you very much for reporting this issue!

We still need to investigate this further, but it looks like a breaking change was introduced in a recent version of tokenizers. The issue seems to be with the Metaspace decoder, which is no longer recognized (that is the part of the error message that reads: data did not match any variant of untagged enum PyDecoderWrapper). I applied a quick fix for this English model by replacing the Metaspace decoder with a Replace decoder (they should be equivalent), but the problem ultimately seems to be caused by a bug in tokenizers, and I will ask the maintainers about it :)
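To see which components of a tokenizer.json are affected, one can walk the parsed JSON and list every Metaspace entry that still carries the legacy 'add_prefix_space' key. This is only a minimal sketch: `find_legacy_metaspace` and the `sample` dictionary are hypothetical, and the sample merely mimics the nesting that the workaround in this thread relies on (`pre_tokenizer.pretokenizers` and `decoder.decoders`).

```python
def find_legacy_metaspace(node, path="$"):
    """Yield JSON paths of Metaspace components that still have 'add_prefix_space'."""
    if isinstance(node, dict):
        if node.get("type") == "Metaspace" and "add_prefix_space" in node:
            yield path
        for key, value in node.items():
            yield from find_legacy_metaspace(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_legacy_metaspace(value, f"{path}[{index}]")


# Hypothetical, trimmed-down stand-in for a real tokenizer.json.
sample = {
    "pre_tokenizer": {
        "type": "Sequence",
        "pretokenizers": [
            {"type": "Whitespace"},
            {"type": "Metaspace", "replacement": "▁", "add_prefix_space": False},
        ],
    },
    "decoder": {
        "type": "Sequence",
        "decoders": [
            {"type": "Metaspace", "replacement": "▁", "add_prefix_space": False},
        ],
    },
}

legacy_paths = list(find_legacy_metaspace(sample))
for legacy_path in legacy_paths:
    print(legacy_path)
```

For a real file, the same function can be called on the result of `json.load` applied to the downloaded tokenizer.json.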

Newer versions of tokenizers do not have the Metaspace option 'add_prefix_space':

add_prefix_space (bool, optional, defaults to True) — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello.

but have 'prepend_scheme' instead:

prepend_scheme (str, optional, defaults to "always") — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello. Choices: “always”, “never”, “first”. First means the space is only added on the first token (relevant when special tokens are used or other pre_tokenizer are used).

The correct replacement for 'add_prefix_space'=True is probably 'prepend_scheme'='first'. But I accidentally replaced it with 'always', and that also works.
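Since it is not settled here whether 'first' or 'always' reproduces the old behavior, a migration helper could make that choice explicit instead of hard-coding it. The sketch below is an illustration under that assumption: `to_prepend_scheme` and the `legacy` dictionary are hypothetical names, and the code only rewrites a plain dictionary without touching the tokenizers library.

```python
def to_prepend_scheme(component, scheme_if_true="first"):
    """Return a copy of a Metaspace config with 'add_prefix_space' replaced."""
    if scheme_if_true not in ("always", "first"):
        raise ValueError("scheme_if_true must be 'always' or 'first'")
    # Leave anything that is not a legacy Metaspace config unchanged.
    if component.get("type") != "Metaspace" or "add_prefix_space" not in component:
        return dict(component)
    migrated = {k: v for k, v in component.items() if k != "add_prefix_space"}
    # True maps to the caller's choice; False always maps to 'never'.
    migrated["prepend_scheme"] = scheme_if_true if component["add_prefix_space"] else "never"
    return migrated


# Hypothetical legacy component, as it might appear in an old tokenizer.json.
legacy = {"type": "Metaspace", "replacement": "▁", "add_prefix_space": True}

as_first = to_prepend_scheme(legacy)
as_always = to_prepend_scheme(legacy, scheme_if_true="always")
print(as_first["prepend_scheme"], as_always["prepend_scheme"])
```

Comparing the tokenization of a few sentences under both settings against the output of an older tokenizers release would be one way to check which value actually preserves the original behavior.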

Thanks for the feedback! It looks like this PR is what causes the problems: https://github.com/huggingface/tokenizers/pull/1476

The issue is that the PR introduced a breaking change that alters the behavior of the Metaspace pre-tokenizer. This means that using the new tokenizers can lead to silent bugs. Therefore, I reverted my previous "fix" so that loading the model actually fails if you use the most recent versions of the libraries. We recommend using tokenizers <0.19 with the HPLT models.
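A script that depends on these models could check the installed version up front rather than fail later at load time. The following is a rough sketch, not an official check: `is_supported_tokenizers_version` is a hypothetical helper, and it only compares the major and minor numbers against the <0.19 bound recommended above.

```python
import re


def is_supported_tokenizers_version(version):
    """Return True if `version` is below 0.19, the bound recommended in this thread."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (0, 19)


print(is_supported_tokenizers_version("0.15.2"))
print(is_supported_tokenizers_version("0.19.0"))
```

In practice the argument would be `tokenizers.__version__`, read after importing the library.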

Thanks! These versions work well:

tokenizers==0.15.2
transformers==4.39.3
ymoslem changed discussion status to closed
