---
datasets:
- adrianhenkel/tokenized-total-512-reduced
---
This is the tokenizer used in the lucid prots project. The uppercase letters are the standard one-letter amino acid codes; the lowercase letters represent the 3Di structural states of a residue, introduced in the Foldseek paper.
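
A minimal usage sketch, assuming the tokenizer is published in a Hugging Face `transformers`-compatible format; the repository id below is a placeholder, not the real one. The full ID-to-token mapping is listed in the table below.

```python
from transformers import AutoTokenizer

# Placeholder repository id -- replace with the actual tokenizer repo.
tokenizer = AutoTokenizer.from_pretrained("adrianhenkel/lucid-prots-tokenizer")

# Uppercase = amino acids, lowercase = 3Di states (see the table below).
# Whether residues must be space-separated depends on the tokenizer's
# pre-tokenization settings; this assumes one character per token.
ids = tokenizer.encode("L A G V d p q r")
print(ids)  # e.g. [2, 5, 6, 7, 8, 32, 42, 43, 44, 3] if [CLS]/[SEP] are added
print(tokenizer.convert_ids_to_tokens(ids))
```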
Token ID | Token |
---|---|
0 | [PAD] |
1 | [UNK] |
2 | [CLS] |
3 | [SEP] |
4 | [MASK] |
5 | L |
6 | A |
7 | G |
8 | V |
9 | E |
10 | S |
11 | I |
12 | K |
13 | R |
14 | D |
15 | T |
16 | P |
17 | N |
18 | Q |
19 | F |
20 | Y |
21 | M |
22 | H |
23 | C |
24 | W |
25 | X |
26 | U |
27 | B |
28 | Z |
29 | O |
30 | a |
31 | c |
32 | d |
33 | e |
34 | f |
35 | g |
36 | h |
37 | i |
38 | k |
39 | l |
40 | m |
41 | n |
42 | p |
43 | q |
44 | r |
45 | s |
46 | t |
47 | v |
48 | w |
49 | y |
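
For a dependency-free reference, the same mapping can be reproduced directly from the table above; the `encode`/`decode` helpers below are illustrative, not part of the released tokenizer.

```python
# Vocabulary exactly as listed in the table above.
SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
AMINO_ACIDS = list("LAGVESIKRDTPNQFYMHCWXUBZO")   # uppercase: amino acid codes
THREE_DI = list("acdefghiklmnpqrstvwy")           # lowercase: 3Di states
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + AMINO_ACIDS + THREE_DI)}
ID2TOK = {i: tok for tok, i in VOCAB.items()}

def encode(sequence: str) -> list[int]:
    """Map each residue character to its ID, falling back to [UNK]."""
    return [VOCAB.get(ch, VOCAB["[UNK]"]) for ch in sequence]

def decode(ids: list[int]) -> str:
    """Map IDs back to their symbols."""
    return "".join(ID2TOK[i] for i in ids)

print(encode("LAGVdpqr"))    # [5, 6, 7, 8, 32, 42, 43, 44]
print(decode([5, 6, 7, 8]))  # "LAGV"
```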