---
license: mit
library_name: transformers
tags:
- tokenizers
---

# Tiktoken cl100k_base/gpt4 Tokenizer

## Convert script

Modified from https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee

## Example usage

```py
import transformers

# Load the converted tokenizer from the Hub.
tokenizer = transformers.AutoTokenizer.from_pretrained("DWDMaiMai/tiktoken_cl100k_base")
assert [15339, 1917, 0] == tokenizer.encode("hello world!")

# Render a conversation with the tokenizer's chat template.
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
assert """<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing great. How can I help you today?<|im_end|>
<|im_start|>user
I'd like to show off how chat templating works!<|im_end|>
<|im_start|>assistant
""" == tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```
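
The string expected by the chat-template assertion above follows a simple per-turn layout. As a rough illustration only, the sketch below rebuilds that layout in plain Python, without `transformers`. `render_chatml` is a hypothetical helper introduced here, not part of any library, and the template shipped with the tokenizer may differ in details that this example does not exercise.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Hypothetical helper: format messages in the layout the assertion above expects."""
    parts = []
    for message in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so that a model would continue from this point.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
prompt = render_chatml(messages, add_generation_prompt=True)
print(prompt)
```

With `add_generation_prompt=True` the result ends with an open assistant turn, which is why the expected string in the example ends with `<|im_start|>assistant` followed by a newline.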

## Relevant