---
tags:
- biology
- small-molecules
- single-cell-genes
- drug-discovery
- protein-solubility
- ibm
- mammal
- pytorch
library_name: biomed-multi-alignment
license: apache-2.0
base_model:
- ibm/biomed.omics.bl.sm.ma-ted-458m
---
Protein solubility is a critical factor in both pharmaceutical research and production processes, as it can significantly impact the quality and function of a protein.
This is an example of fine-tuning `ibm/biomed.omics.bl.sm.ma-ted-458m` for protein solubility prediction (binary classification) based solely on the amino acid sequence.
The benchmark is defined in: https://academic.oup.com/bioinformatics/article/34/15/2605/4938490
The data was retrieved from: https://zenodo.org/records/1162886
## Model Summary
- **Developers:** IBM Research
- **GitHub Repository:** https://github.com/BiomedSciAI/biomed-multi-alignment
- **Paper:** https://arxiv.org/abs/2410.22367
- **Release Date:** Oct 28th, 2024
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Usage
Using `ibm/biomed.omics.bl.sm.ma-ted-458m` requires installing https://github.com/BiomedSciAI/biomed-multi-alignment
```
pip install git+https://github.com/BiomedSciAI/biomed-multi-alignment.git
```
A simple example for a task already supported by `ibm/biomed.omics.bl.sm.ma-ted-458m`:
```python
from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp
from mammal.examples.protein_solubility.task import ProteinSolubilityTask
from mammal.keys import CLS_PRED, SCORES
from mammal.model import Mammal

# Example input: an arbitrary amino acid sequence used for illustration;
# replace it with the protein you want to score
protein_seq = "NLMKRCTRGFRKLGKCTTLEEEKCKTLYPRGQCTCSDSKMNTHSCDCKSC"

# Load the fine-tuned model and switch to inference mode
model = Mammal.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-458m.protein_solubility")
model.eval()

# Load the matching tokenizer
tokenizer_op = ModularTokenizerOp.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-458m.protein_solubility")

# Convert the raw sequence to a MAMMAL-style sample
sample_dict = {"protein_seq": protein_seq}
sample_dict = ProteinSolubilityTask.data_preprocessing(
    sample_dict=sample_dict,
    protein_sequence_key="protein_seq",
    tokenizer_op=tokenizer_op,
    device=model.device,
)

# Run the model in generate mode
batch_dict = model.generate(
    [sample_dict],
    output_scores=True,
    return_dict_in_generate=True,
    max_new_tokens=5,
)

# Post-process the model's output
ans = ProteinSolubilityTask.process_model_output(
    tokenizer_op=tokenizer_op,
    decoder_output=batch_dict[CLS_PRED][0],
    decoder_output_scores=batch_dict[SCORES][0],
)

# Print the prediction
print(f"{ans=}")
```
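Because the prediction depends only on the amino acid sequence, it can help to sanity-check the `protein_seq` string before it reaches the tokenizer. The following is a minimal sketch, not part of `biomed-multi-alignment`: `clean_protein_sequence` is a hypothetical helper, and restricting input to the 20 standard one-letter residue codes is an assumption made here (the tokenizer itself may accept other symbols).

```python
# Hypothetical helper for illustration; not part of biomed-multi-alignment.
# Assumption: only the 20 standard one-letter amino acid codes are allowed.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(raw: str) -> str:
    """Remove whitespace, upper-case, and reject non-standard residue letters."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty protein sequence")
    unknown = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unexpected residue letters: {unknown}")
    return seq


# Usage: normalize a sequence pasted with line breaks and mixed case
protein_seq = clean_protein_sequence("mktay iakqr\nQISFV")
print(protein_seq)  # MKTAYIAKQRQISFV
```

Rejecting unexpected characters early gives a clearer error than a failure deep inside preprocessing; loosen the allowed alphabet if your data uses ambiguity codes such as `X`.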
For more advanced usage, see our detailed example at: https://github.com/BiomedSciAI/biomed-multi-alignment
## Citation
If you find our work useful, please consider starring the repo and citing our paper:
```
@misc{shoshan2024mammalmolecularaligned,
      title={MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language},
      author={Yoel Shoshan and Moshiko Raboh and Michal Ozery-Flato and Vadim Ratner and Alex Golts and Jeffrey K. Weber and Ella Barkan and Simona Rabinovici-Cohen and Sagi Polaczek and Ido Amos and Ben Shapira and Liam Hazan and Matan Ninio and Sivan Ravid and Michael M. Danziger and Joseph A. Morrone and Parthasarathy Suryanarayanan and Michal Rosen-Zvi and Efrat Hexter},
      year={2024},
      eprint={2410.22367},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM},
      url={https://arxiv.org/abs/2410.22367},
}
```