---
library_name: transformers
license: mit
---

# Model Card for phi-2 with ChatML Special Tokens

A lightly modified copy of [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) with new special tokens added and the embedding and output layers resized to match the tokenizer vocabulary.

Based on: https://huggingface.co/microsoft/phi-2

Summary of changes:
1. Added new special tokens: a padding token (`[PAD]`) and the ChatML tokens (`<|im_start|>`, `<|im_end|>`) for further finetuning on instruction/chat datasets
2. Resized the embedding layer and the final output layer to match the tokenizer vocabulary (see the sketch after this list)
   - https://huggingface.co/microsoft/phi-2/discussions/22#659d8ba950c1bbee5be6f179
     - The original embedding size is 51200, but only 50295 tokens were used
     - Resized the embedding matrix so it aligns with the tokenizer vocabulary and avoids confusion
   - https://huggingface.co/microsoft/phi-2/discussions/43#659d8d3418dc7360290a4734
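
For reference, a minimal sketch that reproduces the original mismatch between the tokenizer and the embedding matrix (the printed numbers are as reported in the linked discussions):

```python
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("microsoft/phi-2")
model = transformers.AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", trust_remote_code=True
)

print(len(tokenizer))                                # 50295 tokens actually used
print(model.get_input_embeddings().weight.shape[0])  # 51200 rows in the embedding matrix
```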

# Code for Reproducibility
```python
import torch
import transformers

transformers.set_seed(42)
torch.set_default_device("cuda")

model_checkpoint = "microsoft/phi-2"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_checkpoint)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_checkpoint, torch_dtype=torch.float16, trust_remote_code=True
)

# Register the ChatML delimiters and a dedicated padding token.
num_added_tokens = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"], "pad_token": "[PAD]"}
)
print(f"Added {num_added_tokens} special tokens")

# Resize both the input embedding matrix and the output layer so their row
# count matches the tokenizer vocabulary (50295 + 3 added tokens = 50298,
# down from the original 51200).
model.resize_token_embeddings(len(tokenizer))
```
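
As a quick sanity check, and to show how the new tokens are meant to be used, here is a minimal sketch; the prompt format and the `"phi-2-chatml"` output directory are illustrative assumptions, not part of the original card:

```python
# Confirm the resized layers now line up with the tokenizer vocabulary.
assert model.get_input_embeddings().weight.shape[0] == len(tokenizer)
assert model.get_output_embeddings().weight.shape[0] == len(tokenizer)

# Format a single turn with the newly added ChatML tokens (illustrative prompt).
prompt = (
    "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
inputs = tokenizer(prompt, return_tensors="pt")

# Persist the modified model and tokenizer for later finetuning
# ("phi-2-chatml" is a hypothetical output directory).
model.save_pretrained("phi-2-chatml")
tokenizer.save_pretrained("phi-2-chatml")
```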