adrianhenkel committed on
Commit
38ffa9e
1 Parent(s): 2e9d126

Create README.md

  1. README.md +57 -0
README.md ADDED
@@ -0,0 +1,57 @@
+ ---
+ datasets:
+ - adrianhenkel/tokenized-total-512-reduced
+ ---
+ This is the tokenizer used in the lucid prots project. Uppercase letters are one-letter amino acid codes; lowercase letters represent the 3Di state of a residue, as introduced in the [Foldseek](https://www.nature.com/articles/s41587-023-01773-0) paper.
+ | Token | Word |
+ |-------|-------|
+ | 0 | [PAD] |
+ | 1 | [UNK] |
+ | 2 | [CLS] |
+ | 3 | [SEP] |
+ | 4 | [MASK] |
+ | 5 | L |
+ | 6 | A |
+ | 7 | G |
+ | 8 | V |
+ | 9 | E |
+ | 10 | S |
+ | 11 | I |
+ | 12 | K |
+ | 13 | R |
+ | 14 | D |
+ | 15 | T |
+ | 16 | P |
+ | 17 | N |
+ | 18 | Q |
+ | 19 | F |
+ | 20 | Y |
+ | 21 | M |
+ | 22 | H |
+ | 23 | C |
+ | 24 | W |
+ | 25 | X |
+ | 26 | U |
+ | 27 | B |
+ | 28 | Z |
+ | 29 | O |
+ | 30 | a |
+ | 31 | c |
+ | 32 | d |
+ | 33 | e |
+ | 34 | f |
+ | 35 | g |
+ | 36 | h |
+ | 37 | i |
+ | 38 | k |
+ | 39 | l |
+ | 40 | m |
+ | 41 | n |
+ | 42 | p |
+ | 43 | q |
+ | 44 | r |
+ | 45 | s |
+ | 46 | t |
+ | 47 | v |
+ | 48 | w |
+ | 49 | y |
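
As an illustration of how the table above could be used, here is a minimal Python sketch that rebuilds the mapping (the "Token" column is the integer ID, the "Word" column the symbol) and encodes a sequence one character at a time. The names `VOCAB`, `encode`, and `decode` are hypothetical, and the optional `[CLS]`/`[SEP]` wrapping is an assumption based on the special tokens listed; the README does not say how sequences are actually framed.

```python
# Minimal sketch of the vocabulary in the table; not the project's actual tokenizer code.
SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]   # IDs 0-4
AMINO_ACIDS = list("LAGVESIKRDTPNQFYMHCWXUBZO")            # IDs 5-29
THREE_DI = list("acdefghiklmnpqrstvwy")                    # IDs 30-49

# Token string -> integer ID, in the same order as the table.
VOCAB = {tok: idx for idx, tok in enumerate(SPECIAL + AMINO_ACIDS + THREE_DI)}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}


def encode(sequence: str, add_special_tokens: bool = False) -> list[int]:
    """Map each character to its ID; unknown characters become [UNK].

    Wrapping with [CLS]/[SEP] is an assumption (hypothetical option),
    not something the README specifies.
    """
    ids = [VOCAB.get(ch, VOCAB["[UNK]"]) for ch in sequence]
    if add_special_tokens:
        ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return ids


def decode(ids: list[int]) -> list[str]:
    """Map IDs back to token strings."""
    return [ID_TO_TOKEN[i] for i in ids]


if __name__ == "__main__":
    print(encode("MKv"))          # amino acids M, K and the 3Di state v
    print(decode(encode("MKv")))
```

Because amino acids are uppercase and 3Di states are lowercase, the two alphabets share one vocabulary without collisions, so a single character-level lookup suffices for either kind of sequence.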