---
datasets:
  - adrianhenkel/tokenized-total-512-reduced
---

This is the tokenizer used in the lucid prots project. Uppercase letters are one-letter amino acid codes. Lowercase letters represent the 3Di structural state of a residue, as introduced in the Foldseek paper.

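| Token ID | Word |
|---|---|
| 0 | [PAD] |
| 1 | [UNK] |
| 2 | [CLS] |
| 3 | [SEP] |
| 4 | [MASK] |
| 5 | L |
| 6 | A |
| 7 | G |
| 8 | V |
| 9 | E |
| 10 | S |
| 11 | I |
| 12 | K |
| 13 | R |
| 14 | D |
| 15 | T |
| 16 | P |
| 17 | N |
| 18 | Q |
| 19 | F |
| 20 | Y |
| 21 | M |
| 22 | H |
| 23 | C |
| 24 | W |
| 25 | X |
| 26 | U |
| 27 | B |
| 28 | Z |
| 29 | O |
| 30 | a |
| 31 | c |
| 32 | d |
| 33 | e |
| 34 | f |
| 35 | g |
| 36 | h |
| 37 | i |
| 38 | k |
| 39 | l |
| 40 | m |
| 41 | n |
| 42 | p |
| 43 | q |
| 44 | r |
| 45 | s |
| 46 | t |
| 47 | v |
| 48 | w |
| 49 | y |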
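
The vocabulary above can be reproduced in plain Python for quick inspection. This is a minimal sketch, not the project's own tokenizer code: the names `VOCAB`, `encode`, and `decode` are hypothetical, and it assumes one token per character with `[CLS]` and `[SEP]` wrapped around the sequence, which this card does not state.

```python
# Token order copied from the vocabulary listed in this card.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # IDs 0-4
AMINO_ACIDS = "LAGVESIKRDTPNQFYMHCWXUBZO"                        # IDs 5-29
THREE_DI = "acdefghiklmnpqrstvwy"                                # IDs 30-49

# Map each token string to its integer ID, and back.
VOCAB = {
    token: index
    for index, token in enumerate(
        SPECIAL_TOKENS + list(AMINO_ACIDS) + list(THREE_DI)
    )
}
ID_TO_TOKEN = {index: token for token, index in VOCAB.items()}


def encode(sequence, add_special_tokens=True):
    """Convert a string of residue / 3Di letters to token IDs.

    Assumption: one token per character; characters outside the
    vocabulary map to [UNK].
    """
    ids = [VOCAB.get(char, VOCAB["[UNK]"]) for char in sequence]
    if add_special_tokens:
        # Assumption: BERT-style [CLS] ... [SEP] wrapping.
        ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return ids


def decode(ids):
    """Convert token IDs back to a list of token strings."""
    return [ID_TO_TOKEN[i] for i in ids]


print(encode("MKv"))          # [2, 21, 12, 47, 3]
print(decode([2, 21, 12, 47, 3]))
```

Because case distinguishes the two alphabets, `"V"` (valine, ID 8) and `"v"` (a 3Di state, ID 47) are different tokens, so input should not be case-normalized before encoding.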