---
datasets:
- adrianhenkel/tokenized-total-512-reduced
---

This is the tokenizer used in the lucid prots project. The uppercase letters are one-letter amino acid codes. The lowercase letters represent the 3Di state of a residue, as introduced in the [Foldseek](https://www.nature.com/articles/s41587-023-01773-0) paper.

| Token ID | Token |
|----------|--------|
| 0 | [PAD] |
| 1 | [UNK] |
| 2 | [CLS] |
| 3 | [SEP] |
| 4 | [MASK] |
| 5 | L |
| 6 | A |
| 7 | G |
| 8 | V |
| 9 | E |
| 10 | S |
| 11 | I |
| 12 | K |
| 13 | R |
| 14 | D |
| 15 | T |
| 16 | P |
| 17 | N |
| 18 | Q |
| 19 | F |
| 20 | Y |
| 21 | M |
| 22 | H |
| 23 | C |
| 24 | W |
| 25 | X |
| 26 | U |
| 27 | B |
| 28 | Z |
| 29 | O |
| 30 | a |
| 31 | c |
| 32 | d |
| 33 | e |
| 34 | f |
| 35 | g |
| 36 | h |
| 37 | i |
| 38 | k |
| 39 | l |
| 40 | m |
| 41 | n |
| 42 | p |
| 43 | q |
| 44 | r |
| 45 | s |
| 46 | t |
| 47 | v |
| 48 | w |
| 49 | y |
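
The mapping in the table above can be reproduced in plain Python, which may help when checking token IDs by hand. This is only a minimal sketch, not the project's actual tokenizer code. It assumes one token per character, and that sequences are wrapped in `[CLS]` and `[SEP]`; neither is stated in this card. The names `VOCAB` and `encode` are hypothetical helpers introduced for illustration.

```python
# Vocabulary in ID order, copied from the table above.
SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
AMINO_ACIDS = list("LAGVESIKRDTPNQFYMHCWXUBZO")  # IDs 5-29
STATES_3DI = list("acdefghiklmnpqrstvwy")        # IDs 30-49

# Token string -> integer ID.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + AMINO_ACIDS + STATES_3DI)}


def encode(sequence, add_special_tokens=True):
    """Map each character to its ID; unknown characters become [UNK].

    Assumption: character-level tokenization with [CLS] ... [SEP]
    wrapping. The real tokenizer may behave differently.
    """
    ids = [VOCAB.get(ch, VOCAB["[UNK]"]) for ch in sequence]
    if add_special_tokens:
        ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return ids


print(encode("MKv"))  # [2, 21, 12, 47, 3]
```

Here `M` and `K` are amino acid tokens and `v` is a 3Di state token, so the same helper covers both alphabets. Any character missing from the table, such as `J`, falls back to `[UNK]` (ID 1).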