The model just prints <unk> tokens
I tried to generate a sentence using your sample code, but I got only <unk> tokens.
So I added `bad_words_ids = [[tokenizer.unk_token_id]]`, and the result is:
'Beijing is the capital of China. Translate this sentence from English to Chinese. [LEN0] [LEN1] [LEN2] [LEN3] [LEN4] [LEN5] [LEN6] [LEN7] [LEN8] [LEN9] [LEN10] [LEN11] [LEN12] [LEN13] [LEN14] [LEN15] [LEN16] [LEN17]'
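For context, this is roughly the generation call I used (a minimal sketch; the variable names follow the sample code, and the full version is in the Colab linked below):

```python
# Minimal sketch, assuming `model`, `tokenizer`, and `inputs` are set up
# as in the sample code (full version in the Colab link below).
outputs = model.generate(
    **inputs,
    max_new_tokens=128,  # assumed value
    bad_words_ids=[[tokenizer.unk_token_id]],  # ban <unk> during decoding
)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```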
What is wrong?
Here is my Colab code:
https://colab.research.google.com/drive/108YvdvdxzDN62TX9M0d6DsqztXSeLla4?usp=sharing
(I added the `torch_dtype=torch.float16` option due to Colab VRAM limits.)
PolyLM uses the bfloat16 numerical format; fp16 is likely to be problematic.
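Something along these lines should work (a minimal sketch; the checkpoint id `DAMO-NLP-MT/polylm-13b`, the loading flags, and the generation settings here are assumptions, so adjust them to match the sample code and your setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint id is an assumption; replace it with the model you are testing.
model_path = "DAMO-NLP-MT/polylm-13b"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # bfloat16, not float16
    trust_remote_code=True,      # as in the sample code (assumption)
)

prompt = "Beijing is the capital of China.\nTranslate this sentence from English to Chinese."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```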
Oh, I see :) I will test without that option.
Thank you!
This time I loaded the 1.7B model, but the result is as follows:
"Beijing is the capital of China.\nTranslate this sentence from English to Chinese.\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n"
Please check the same Colab link.
I am having the same problem with the 13B model: it only generates <unk> tokens.
It does not happen with the 1.7B. Could you help us, @pemywei?
Thanks!