t03i committed
Commit 94a6abc
Parent: 2646ade

Fix code example

Files changed (1)
README.md +18 -16
README.md CHANGED
@@ -1,5 +1,4 @@
  ---
- language: protein
  tags:
  - protein language model
  datasets:
@@ -38,26 +37,29 @@ An extensive, interactive example on how to use this model for common tasks can
  Here is how to use this model to extract the features of a given protein sequence in PyTorch:

  ```python
- from transformers import T5Tokenizer, T5EncoderModel
- import torch
+ sequence_examples = ["PRTEINO", "SEQWENCE"]
+ # this will replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
+ sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

- tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)
+ # tokenize sequences and pad up to the longest sequence in the batch
+ ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
+ input_ids = torch.tensor(ids['input_ids']).to(device)
+ attention_mask = torch.tensor(ids['attention_mask']).to(device)

- model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc", torch_dtype=torch.float16)
-
- sequences_Example = ["A E T C Z A O","S K T Z P"]
-
- sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
+ # generate embeddings
+ with torch.no_grad():
+     embedding_repr = model(input_ids=input_ids,attention_mask=attention_mask)

- ids = tokenizer.batch_encode_plus(seqs, add_special_tokens=True, padding="longest")
+ # extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
+ emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
+ print(f"Shape of per-residue embedding of first sequences: {emb_0.shape}")
+ # do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
+ emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)

- input_ids = torch.tensor(ids['input_ids'])
- attention_mask = torch.tensor(ids['attention_mask'])
+ # if you want to derive a single representation (per-protein embedding) for the whole protein
+ emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)

- with torch.no_grad():
-     embedding_rpr = model(input_ids=input_ids,attention_mask=attention_mask)
-     emb_0 = embedding_repr.last_hidden_state[0,:6]
-     emb_1 = embedding_repr.last_hidden_state[1,:4]
+ print(f"Shape of per-protein embedding of first sequences: {emb_0_per_protein.shape}")
  ```

  **NOTE**: Please make sure to explicitly set the model to `float16` (`T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', torch_dtype=torch.float16)`) otherwise, the generated embeddings will be full precision.
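For completeness: the added lines use `re`, `torch`, `tokenizer`, `model`, and `device` without defining them, so they are meant to follow the loading code earlier in the README. Below is a minimal setup sketch that makes the new snippet runnable on its own, reusing the model ID from the removed lines and the half-precision loading from the NOTE; the device selection and the `model.eval()` call are assumptions added here, not part of the commit.

```python
# Setup sketch (not part of the commit): defines the names the added snippet relies on.
# Requires transformers, sentencepiece and torch to be installed.
import re  # needed for the re.sub(...) call in the example

import torch
from transformers import T5Tokenizer, T5EncoderModel

# assumption: use a GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False
)

# per the NOTE: load the encoder explicitly in float16, otherwise embeddings are full precision
model = T5EncoderModel.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", torch_dtype=torch.float16
).to(device)
model.eval()

print(model.dtype)  # expected: torch.float16 (call model.float() if you must run on CPU)
```

Printing `model.dtype` is a quick way to confirm the weights were actually loaded in `float16`, which is exactly what the NOTE warns about.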