Switch from PreTrainedTokenizerFast to GPT2TokenizerFast and add eos_token & bos_token

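`PreTrainedTokenizerFast` returns `token_type_ids` by default, but santacoder was not trained with them, so passing the tokenizer output straight to the model, as in `model(**tokenizer(text, return_tensors="pt"))`, can produce unexpected behavior in some cases. We'll use `GPT2TokenizerFast` instead, which does not return `token_type_ids` by default.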
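For callers who are pinned to the old config, a caller-side workaround is to filter the tokenizer output before it reaches the model. The following is only a minimal sketch: `drop_token_type_ids` is a hypothetical helper (not part of `transformers`), and the `encoding` dict is a made-up stand-in for a real tokenizer output.

```python
def drop_token_type_ids(encoding):
    """Return a copy of a tokenizer output mapping without `token_type_ids`.

    Hypothetical helper for illustration; not part of any library.
    """
    return {key: value for key, value in encoding.items() if key != "token_type_ids"}


# Stand-in for a tokenizer output; the ids below are invented for the example.
encoding = {
    "input_ids": [[1, 2, 3]],
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
}

model_inputs = drop_token_type_ids(encoding)
print(sorted(model_inputs))
```

With a real tokenizer, passing `return_token_type_ids=False` in the tokenizer call should have the same effect without a helper.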

Files changed (1)

tokenizer_config.json +4 -2

@@ -1,5 +1,7 @@
 {
   "errors": "replace",
-  "tokenizer_class": "PreTrainedTokenizerFast",
+  "tokenizer_class": "GPT2TokenizerFast",
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
   "model_max_length": 2048
-}
+}
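
The same change can be expressed as a small transformation on the config contents. This is a rough sketch rather than the tooling used for this PR: `switch_tokenizer_class` is a hypothetical function name, and the config is held in memory instead of being read from `tokenizer_config.json`.

```python
import json


def switch_tokenizer_class(config):
    """Return a copy of a tokenizer config with the changes from this diff applied.

    Hypothetical helper for illustration only.
    """
    updated = dict(config)
    updated["tokenizer_class"] = "GPT2TokenizerFast"
    updated["bos_token"] = "<|endoftext|>"
    updated["eos_token"] = "<|endoftext|>"
    return updated


# Contents of the config before this change, as shown in the diff.
old_config = {
    "errors": "replace",
    "tokenizer_class": "PreTrainedTokenizerFast",
    "model_max_length": 2048,
}

new_config = switch_tokenizer_class(old_config)
print(json.dumps(new_config, indent=2))
```

The key order in the printed JSON differs from the diff (the new tokens are appended at the end), which does not matter for JSON objects.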