[BUG/Help] The vocabulary size of ice_text.model is inconsistent with the value set in config
#65 · opened by Au3C2
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
The vocabulary size of ice_text.model is 130344:
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
>>> len(tokenizer.get_vocab())
130344
config:
"vocab_size": 130528
Model parameters:
transformer.word_embeddings.embedding_table torch.Size([130528, 4096]) torch.float16
lm_head.weight torch.Size([130528, 4096]) torch.float16
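For reference, a minimal sketch (same checkpoint and trust_remote_code as above; loading the full model is only needed for the embedding check, and it assumes the HF implementation exposes a standard nn.Embedding for the input embeddings) that prints the three sizes side by side:

from transformers import AutoConfig, AutoModel, AutoTokenizer

model_id = "THUDM/chatglm-6b"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).half()

print("tokenizer vocab  :", len(tokenizer.get_vocab()))                    # 130344
print("config vocab_size:", config.vocab_size)                             # 130528
print("embedding rows   :", model.get_input_embeddings().weight.shape[0])  # 130528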
Because the vocabulary sizes do not match, the model sometimes generates token ids outside the tokenizer's vocabulary, and decoding then exits with an index-out-of-range error.
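Not a real fix, but as a stop-gap the unusable ids (everything at or beyond the tokenizer length) can be masked out at generation time. A minimal sketch; MaskOutOfVocab is a hypothetical helper written for this report, not something shipped with the model:

import torch
from transformers import LogitsProcessor, LogitsProcessorList

class MaskOutOfVocab(LogitsProcessor):
    # Set the scores of ids the tokenizer cannot decode to -inf so they are never sampled.
    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        scores[:, self.vocab_size:] = float("-inf")
        return scores

# usage with the tokenizer/model loaded above (inputs prepared as usual):
# processors = LogitsProcessorList([MaskOutOfVocab(len(tokenizer.get_vocab()))])
# output_ids = model.generate(**inputs, logits_processor=processors)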
Expected Behavior
The tokenizer vocabulary size should match the config and the model parameters.
Steps To Reproduce
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
>>> len(tokenizer.get_vocab())
130344
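To see the failure without waiting for generation to hit it, one can probe an id in the padded range [130344, 130528); the exact exception depends on the tokenizer implementation, so this is only an assumption-based probe:

try:
    print(tokenizer.decode([130400]))   # id exists in the embedding table but not in ice_text.model
except Exception as exc:                # expected to fail with an out-of-range / index error
    print(type(exc).__name__, exc)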
Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
Anything else?
No response