	---
	tags:
	- biology
	- small-molecules
	- single-cell-genes
	- drug-discovery
	- protein-solubility
	- ibm
	- mammal
	- pytorch

	library_name: biomed-multi-alignment
	license: apache-2.0
	base_model:
	- ibm/biomed.omics.bl.sm.ma-ted-458m
	---

	Protein solubility is a critical factor in both pharmaceutical research and production processes, as it can significantly impact the quality and function of a protein.
	This is an example of fine-tuning `ibm/biomed.omics.bl.sm.ma-ted-458m` for protein solubility prediction (binary classification) based solely on the amino acid sequence.

	- Benchmark definition: https://academic.oup.com/bioinformatics/article/34/15/2605/4938490
	- Data source: https://zenodo.org/records/1162886
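
Because the prediction depends only on the amino acid sequence, it can help to normalize raw input before passing it on. The following is a minimal sketch, not part of `biomed-multi-alignment`: the helper name `clean_protein_sequence` is hypothetical, and rejecting anything outside the 20 standard one-letter residue codes is an assumption made here, since this card does not state which symbols the tokenizer accepts.

```python
# Hypothetical helper for illustration; not part of biomed-multi-alignment.
# Assumption: only the 20 standard one-letter amino acid codes are allowed.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(raw: str) -> str:
    """Strip whitespace, uppercase, and reject non-standard residue codes."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty protein sequence")
    unknown = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residue codes: {unknown}")
    return seq
```

The cleaned string can then be used as the protein sequence in the usage example. Relax the check if your data contains ambiguity codes that you have confirmed the tokenizer handles.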

	## Model Summary

	- Developers: IBM Research
	- GitHub Repository: https://github.com/BiomedSciAI/biomed-multi-alignment
	- Paper: https://arxiv.org/abs/2410.22367
	- Release Date: Oct 28th, 2024
	- License: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

	## Usage

	Using `ibm/biomed.omics.bl.sm.ma-ted-458m` requires installing https://github.com/BiomedSciAI/biomed-multi-alignment

	```bash
	pip install git+https://github.com/BiomedSciAI/biomed-multi-alignment.git
	```
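
After installing, a quick import check can confirm that the environment is ready. This is a minimal sketch using only the Python standard library; the helper name `missing_packages` is hypothetical, and `mammal` and `fuse` are taken from the import statements in the usage example in this card.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Top-level packages imported by the usage example in this card.
missing = missing_packages(["mammal", "fuse"])
if missing:
    print(f"Missing packages: {missing}; install biomed-multi-alignment first.")
else:
    print("All required packages are importable.")
```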

	A simple example for a task already supported by `ibm/biomed.omics.bl.sm.ma-ted-458m`:
	```python
	from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp

	from mammal.examples.protein_solubility.task import ProteinSolubilityTask
	from mammal.keys import CLS_PRED, SCORES
	from mammal.model import Mammal

	# Load Model
	model = Mammal.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-458m.protein_solubility")
	model.eval()

	# Load Tokenizer
	tokenizer_op = ModularTokenizerOp.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-458m.protein_solubility")

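	# Amino acid sequence to classify (illustrative placeholder; replace with your own)
	protein_seq = "NLMKRCTRGFRKLGKCTTLEEEKCKTLYPRGQCTCSDSKMNTHSCDCKSC"

	# Convert the raw sequence into a MAMMAL-style sample
	sample_dict = {"protein_seq": protein_seq}
	sample_dict = ProteinSolubilityTask.data_preprocessing(
	    sample_dict=sample_dict,
	    protein_sequence_key="protein_seq",
	    tokenizer_op=tokenizer_op,
	    device=model.device,
	)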

	# Run the model in generate mode
	batch_dict = model.generate(
	    [sample_dict],
	    output_scores=True,
	    return_dict_in_generate=True,
	    max_new_tokens=5,
	)

	# Post-process the model's output
	ans = ProteinSolubilityTask.process_model_output(
	    tokenizer_op=tokenizer_op,
	    decoder_output=batch_dict[CLS_PRED][0],
	    decoder_output_scores=batch_dict[SCORES][0],
	)

	# Print prediction
	print(f"{ans=}")
	```
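
The example above prints `ans` without describing its structure, so inspect the printed value before relying on any particular field. As a minimal sketch of a downstream step, assuming you have extracted a soluble-class probability from `ans` as a plain float, the hypothetical helper below maps it to a label. The function name, the label strings, and the 0.5 threshold are all assumptions, not part of the library.

```python
def label_from_score(soluble_probability: float, threshold: float = 0.5) -> str:
    """Map a soluble-class probability to a label (assumed threshold: 0.5)."""
    if not 0.0 <= soluble_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return "soluble" if soluble_probability >= threshold else "insoluble"
```

Raising the threshold trades recall for precision on the soluble class; choose it on held-out data rather than keeping the default.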

	For more advanced usage, see the detailed examples at https://github.com/BiomedSciAI/biomed-multi-alignment


	## Citation

	If you find our work useful, please consider starring the repository and citing our paper:
	```bibtex
	@misc{shoshan2024mammalmolecularaligned,
	    title={MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language},
	    author={Yoel Shoshan and Moshiko Raboh and Michal Ozery-Flato and Vadim Ratner and Alex Golts and Jeffrey K. Weber and Ella Barkan and Simona Rabinovici-Cohen and Sagi Polaczek and Ido Amos and Ben Shapira and Liam Hazan and Matan Ninio and Sivan Ravid and Michael M. Danziger and Joseph A. Morrone and Parthasarathy Suryanarayanan and Michal Rosen-Zvi and Efrat Hexter},
	    year={2024},
	    eprint={2410.22367},
	    archivePrefix={arXiv},
	    primaryClass={q-bio.QM},
	    url={https://arxiv.org/abs/2410.22367},
	}
	```