---
license: apache-2.0
---

# SilkomeGPT: Generative strategies for modeling, design and analysis of spider silk protein sequences for enhanced mechanical properties

Generative strategies for modeling, design and analysis of silk protein sequences for enhanced mechanical properties

Wei Lu, David L. Kaplan, Markus J. Buehler

Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, USA

> Contact email: mbuehler@mit.edu

Abstract: Spider silks are remarkable materials characterized by superb mechanical properties such as strength, extensibility and lightweightedness. Yet, to date, few models are available to fully explore sequence-property relationships for analysis and design. Here a custom generative large-language model is proposed to enable design of novel spider silk protein sequences to meet complex combinations of target mechanical properties. The model, pretrained on a large set of protein sequences, is fine-tuned on ~1,000 major ampullate spidroin (MaSp) sequences for which associated fiber-level mechanical properties exist, to yield an end-to-end forward and inverse generative approach that is applied in a multi-agent strategy. Performance is assessed through: (1) a novelty analysis and protein type classification for generated spidroin sequences through Basic Local Alignment Search Tool (BLAST) searches, (2) property evaluation and comparison with similar sequences, (3) comparison of molecular structures, and (4) detailed sequence motif analyses. This work generates silk sequences with property combinations that do not exist in nature, and develops a deep understanding of the mechanistic roles of sequence patterns in achieving overarching key mechanical properties (elastic modulus, strength, toughness, failure strain).
The model provides an efficient approach to expand the silkome dataset, facilitating further sequence-structure analyses of silks, and establishes a foundation for synthetic silk design and optimization. This work not only shows the capacity of generative transformer models to design complex materials, but also illustrates an effective use of agentic modeling for self-improving design solutions.

Keywords: biomaterials; deep learning; generative autoregressive transformer; hierarchical; multiscale modeling; spider silk; spidroin

GitHub (more code, notebooks, etc.): https://github.com/lamm-mit/SilkomeGPT

# Trained model and inference

This is a GPT-style autoregressive transformer model, pretrained on a large number of silk and other protein sequences.

The pretraining task is defined as "Sequence<...>", where ... is an amino acid sequence.

Load pretrained model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a GPU if one is available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

trained_model_name = 'lamm-mit/SilkomeGPT'

tokenizer = AutoTokenizer.from_pretrained(trained_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

model = AutoModelForCausalLM.from_pretrained(
    trained_model_name,
    trust_remote_code=True
).to(device)
model.config.use_cache = False
```

Sample inference using the "GenerateSilkContent<...>" task, in which the model produces a silk sequence intended to meet the requested list of properties:

```python
prompt = "GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515>"

# Encode the prompt as a batch containing a single sequence
generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens=False)).unsqueeze(0).to(device)
print(generated.shape, generated)

# Sample three candidate sequences for the same prompt
sample_outputs = model.generate(
    inputs=generated,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=500,
    max_length=300,
    top_p=0.9,
    num_return_sequences=3,
    temperature=1,
)

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
```

Output (here, three candidate sequences):

```raw
torch.Size([1, 66]) tensor([[ 43, 299, 73, 86, 69, 88, 73, 55, 77, 80, 79, 39, 83, 82, 88, 299, 88, 32, 20, 18, 21, 27, 27, 16, 20, 18, 22, 22, 22, 16, 20, 18, 20, 28, 22, 16, 20, 18, 20, 26, 25, 16, 20, 18, 22, 22, 25, 16, 20, 18, 22, 24, 21, 16, 20, 18, 22, 26, 26, 16, 20, 18, 25, 21, 25, 34]], device='cuda:0')

0: GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515> [AAAAGGSGGSGGYGPGGYGPGGSGDAAAAAAAAGGSGGAGGYGPGGYGPGGFGPGGSGDAAAAAAAAAGGSGGSGGYGPGGYGPGGSGDAAAAAAAAGGSGGPGGYGPGGYGPGGFGLSGSGDAAAAAAAAAGGSGGSEGYGPGGYGPGGSGDAAAAAAAAAGGSGGPGGYGPGGYGPGGYGPGGSGDAAAAAAAAAGGSGGSGGYGPGGYGPGGSGDAAAAAAAAGGSGGPGGYGPGGYGPGGFGPGGSGDAAAAAAAAAGGSGGSGGYGPGGYGPGGSGAAVAAASAAGGSGGSGGYGPGGYGPGGSGAAAASAAASAISSPASTSRISFVASRLVSGGTANVSNLSNTIGTVMSQVRAGNPGASECEVVIQTLIELLAALIHILGSASIGNVNYGSTAQSAAVVSESFQSAFQ]

1: GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515> [MTLTIRLALSLLVAICTQSMFALGQSVSPWSSPDMAENFMSVFTDSLSQSGAFSYDQMDDISSIGDSIRSGVEKMARSGKTSANKLQAMNMAFASAVAEIAISEGGGQSAQVKTNAVADALSTAFLQTTGVVNTQFVNEIRSLISMFAQANSVSSSSASVSASAGGAGGYGPQAQGAAAVVAGGYGPGSQGPQSYGPGPQAQSSAVAVSAGSQGPQSYGPGPQGPGPQGPGPQGSGPQGPGPQGPGSQGPQSYGPGPQGPSSPGQSSYQYSVSITSQSGSQGTSGGLGSQGAGGADQGGYGNGQGGSGSAAAAAAAGGAGGAGQGGLGAGGAGQGYGAGLGRQGGSGQGGAAAAAAAAGGLGGQGGYGGQDSQGAGQGGYGSGQGGSGAAAAAAAAGGAGRGGLGSGGAGQGYGAGLGGQGGSGQGGQGGQQPGQSGYGRQGQGSGGAGQGGLGSGGAGQGYGAGLGGQGGSGQGGAAAAAAAAGGLGRQGPGSGGAGQGYGAGLGGQGGSGQGGAAAAAAAAGGLGGQGGYGGQGSQGAGQGGYGSGQGGSGAAAAAAAAGGAGQGGYGGQGSQGAGQGGYGSGQGGSGQGGAAAAAAAAGGLGGQGGYGGQGSQGAGQGGYGSGQGGSGQGGAAAAAAAAGGLGGQGGYGGQGSQGAGQGGYGSGQGGSGAAAAAAAAGGAGGAGRG]

2: GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515> [MNWSIRLALLGLVVLSTQTTFAFGQAATPWENTALAEAFINSFLDSIGRTGAFSLSQQDDMSTIGDTLKSAMEKMAQSRKSSKSKLQALNMAFASSMAEIAVAEEGGLSIQAKTEAIASSLSSAFLQTTGVVNYQFVNEIKSLIYMIAQATTNEVASSEASAGGGGGSGQGRYVSSSAAGTYGSAPQSTGENRPAPQGPPQQGPTYGPSAAVLVSAVGGYGQGPAAPSQQGPTGPSQQRQANQGPYGLSVQQEPESQGSYGPETNAAAAAAGGYGPGAVGQQGLGAGGQQGPGGQRP]
```

## Citation

To cite this work:

```
@article{WeiKaplanBuehler_2023,
    title   = {Generative Modeling, Design, and Analysis of Spider Silk Protein Sequences for Enhanced Mechanical Properties},
    author  = {W. Lu and D. L. Kaplan and M. J. Buehler},
    journal = {Adv. Funct. Mater.},
    year    = {2023},
    volume  = {},
    pages   = {},
    url     = {https://doi.org/10.1002/adfm.202311324}
}
```
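
## Appendix: prompt and output helpers (sketch)

The inference example above writes the `GenerateSilkContent<...>` prompt by hand and prints the decoded text, which contains the prompt followed by the sequence in square brackets. The following is a minimal sketch of two convenience helpers, one to build such a prompt and one to pull the sequence back out. The names `build_prompt` and `extract_sequence` are hypothetical and not part of the model or its repository. The three-decimal formatting is an assumption inferred from the example prompt, and the meaning, order and normalization of the eight values are not specified here, so consult the paper or the GitHub repository before choosing them.

```python
import re

# Task name as it appears in the example prompt above
TASK = "GenerateSilkContent"


def build_prompt(values):
    """Format property values as 'GenerateSilkContent<v1,v2,...>'.

    Assumption: values are written with three decimals, as in the example prompt.
    """
    return "{}<{}>".format(TASK, ",".join("{:.3f}".format(v) for v in values))


def extract_sequence(decoded):
    """Return the amino-acid letters after the first '[' in a decoded sample.

    The closing ']' is optional, so a sample truncated by max_length still
    yields the partial sequence. Returns None if no bracketed sequence is found.
    """
    match = re.search(r"\[([A-Z]+)\]?", decoded)
    return match.group(1) if match else None
```

The string returned by `build_prompt` can be passed to `tokenizer.encode` exactly as `prompt` is in the sample above, and `extract_sequence` can be applied to each result of `tokenizer.decode`.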