---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
- protein language model
license: cc
base_model: Rostlab/prot_bert_bfd
---

# Protein Matryoshka Embeddings

The model generates an embedding for each input protein. It was trained using [Matryoshka loss](https://huggingface.co/blog/matryoshka), so shortened embeddings can be used for faster search and other tasks.

Inputs use [IUPAC-IUB codes](https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation), where letters A-Z map to amino acids, written with a space between residues. For example:

"M A R N W S F R V"

The base model was [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd).
A [sentence-transformers](https://github.com/UKPLab/sentence-transformers) model was trained on the cosine similarity of embeddings from [UniProt](https://www.uniprot.org/help/downloads#embeddings).
For the train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot

## Usage

Install these dependencies:

```
pip install -U sentence-transformers datasets
```

Generating embeddings:

```python
from sentence_transformers import SentenceTransformer

# Amino acids are written as space-separated single-letter codes
sequences = ["M S L E Q K...",
             "M A R N W S F R V..."]

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')

# One embedding vector per input sequence
embeddings = model.encode(sequences)
print(embeddings)
```

## Training + Code

CoLab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing

Results on 1,000 protein pairs from the validation dataset, during training:

|steps|cosine_pearson|cosine_spearman|
|-----|--------------|---------------|
|3000|0.8598688660086558|0.8666855900999677|
|6000|0.8692703523988448|0.8615673651584274|
|9000|0.8779733537629968|0.8754158959780602|
|12000|0.8877422045031667|0.8881492475969834|
|15000|0.9027359688395733|0.899106724739699|
|18000|0.9046675789738002|0.9044183600191271|
|21000|0.9165801536390973|0.9061381997421003|
|24000|0.9128046401341833|0.9076748537082228|
|27000|0.918547416546341|0.9127677526055185|
|30000|0.9239429677657788|0.9187051589781693|

## Validation

Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing

## Finetuning / Tasks

One of the more popular evaluations is [Tasks Assessing Protein Embeddings (TAPE)](https://github.com/songlab-cal/tape).

Example using SciKit-Learn to train on Fluorescence, a regression task from TAPE: https://colab.research.google.com/drive/1cH9jOBSC56mqJHU_6ztQPp6qWJguNjAn?usp=sharing

Example using SciKit-Learn to train on a classification task from [greenbeing-binary](https://huggingface.co/datasets/monsoon-nlp/greenbeing-binary): https://colab.research.google.com/drive/1MCTn8f3oeIKpB6n_8mPumet3ukm7GD8a?usp=sharing

## Future

This page will be updated when I have examples using the model on protein classification tasks.

I'm interested in whether [embedding quantization](https://huggingface.co/blog/embedding-quantization) could be even more efficient.

If you want to collaborate on future projects or have resources to train longer on more embeddings, please get in touch.
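
## Example: shortened embeddings

The introduction says that shortened embeddings can be used for faster search. A minimal sketch of one way to do that is to keep the first components of each vector and re-normalize before comparing with cosine similarity. The helper names `truncate_embeddings` and `cosine_similarity_matrix` are hypothetical, not part of this model or of sentence-transformers. The random array stands in for the output of `model.encode(sequences)`, and its width of 1024 is an assumption about the full embedding size. The 128-dim cut matches the size used in the validation notebook.

```python
import numpy as np


def truncate_embeddings(embeddings, dim):
    """Keep the first `dim` components of each row, then L2-normalize (hypothetical helper)."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    if embeddings.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_sequences, n_dims)")
    if not 0 < dim <= embeddings.shape[1]:
        raise ValueError("dim must be between 1 and the full embedding size")
    shortened = embeddings[:, :dim]
    norms = np.linalg.norm(shortened, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    return shortened / np.clip(norms, 1e-12, None)


def cosine_similarity_matrix(unit_embeddings):
    """Pairwise cosine similarity for rows that are already unit length."""
    return unit_embeddings @ unit_embeddings.T


# Stand-in for `model.encode(sequences)`: random data, for illustration only.
# The width of 1024 is an assumption about the full embedding size.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))

short = truncate_embeddings(full, 128)
print(short.shape)  # (4, 128)
print(cosine_similarity_matrix(short).round(3))
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may make the manual slicing unnecessary. Check the documentation for the version you have installed.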