GPepT: A Language Model for Peptides and Peptidomimetics

GPepT is a cutting-edge language model designed to understand and generate sequences in the specialized domain of peptides and peptidomimetics. It serves as a powerful tool for de novo protein design and engineering. As demonstrated in our research, the incorporation of peptidomimetics significantly broadens the chemical space accessible through generated sequences, enabling innovative approaches to peptide-based therapeutics.

Model Overview

GPepT builds upon the GPT-2 Transformer architecture, comprising 36 layers and a model dimensionality of 1280, for a total of roughly 774 million parameters. This decoder-only model has been pre-trained on a curated dataset of peptides and peptidomimetics mined from bioactivity-annotated chemical structures in ChEMBL.
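
The architecture details above can be checked directly from the published checkpoint on the Hugging Face Hub; the snippet below is a minimal sketch, assuming the checkpoint ships a standard GPT-2 configuration.

from transformers import AutoConfig

# Load GPepT's configuration from the Hugging Face Hub
config = AutoConfig.from_pretrained("Playingyoyo/GPepT")

# GPT-2-style configs expose the layer count and hidden size as n_layer and n_embd
print(config.n_layer, config.n_embd)  # expected: 36 1280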

To leverage GPepT’s pre-trained weights, input molecules must be converted into a standardized sequence-like representation of peptidomimetics using Monomerizer (available on GitHub). Detailed insights into the training process and datasets are provided in our accompanying publication.

Unlike traditional protein design models, GPepT is trained in a self-supervised manner, using raw sequence data without explicit annotation. This design enables the model to generalize across diverse sequence spaces, producing functional antimicrobial peptidomimetics upon fine-tuning.


Using GPepT for Sequence Generation

GPepT is fully compatible with the Hugging Face Transformers Python library; installation instructions are available in the Transformers documentation.

The model excels at generating peptidomimetic sequences in a zero-shot fashion, but it can also be fine-tuned on custom datasets to generate sequences tailored to specific requirements.

Example 1: Zero-Shot Sequence Generation

GPepT generates sequences that extend from a specified input token (e.g., <|endoftext|>). If no input is provided, it selects the start token automatically and generates likely sequences. Here’s a Python example:

from transformers import pipeline

# Initialize GPepT for text generation
GPepT = pipeline('text-generation', model="Playingyoyo/GPepT")

# Generate sequences; max_length is counted in tokens (one token covers ~4 amino acids on average)
sequences = GPepT("<|endoftext|>",
                  max_length=25,              # length budget in tokens
                  do_sample=True,             # sample instead of greedy decoding
                  top_k=950,                  # sample from the 950 most likely tokens
                  repetition_penalty=1.5,     # discourage repeated monomers
                  num_return_sequences=5,     # number of candidate sequences to return
                  eos_token_id=0)             # token id treated as end-of-sequence

# Print generated sequences
for seq in sequences:
    print(seq['generated_text'])

Sample output:

<|endoftext|>R K A L E Z1649
<|endoftext|>G K A L Z341
<|endoftext|>G V A G K X4097 V A P
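
Each generated string begins with the <|endoftext|> prompt token followed by space-separated monomer symbols (standard one-letter amino acids plus numbered codes such as Z1649 for modified monomers). A minimal post-processing sketch for recovering the monomer list from the pipeline output:

# Strip the prompt token and split each generated sequence into its monomer tokens
for seq in sequences:
    monomers = seq['generated_text'].replace("<|endoftext|>", "").split()
    print(monomers)  # e.g., ['R', 'K', 'A', 'L', 'E', 'Z1649']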

Example 2: Fine-Tuning for Directed Sequence Generation

Fine-tuning enables GPepT to generate sequences with user-defined properties. To prepare training data:

  1. git clone https://github.com/tsudalab/Monomerizer
  2. cd Monomerizer
  3. python3 Monomerizer/run_pipeline.py --input_file path_to_your_smiles_file.txt (see the repository for the required input format).
     This step monomerizes the SMILES strings and splits the resulting sequences into a training file (output/datetime/for_GPepT/train90.txt) and a validation file (output/datetime/for_GPepT/val10.txt).

To fine-tune the model, use the run_clm.py causal language modeling script from the Hugging Face Transformers examples:

python run_clm.py --model_name_or_path Playingyoyo/GPepT \
                  --train_file path_to_train90.txt \
                  --validation_file path_to_val10.txt \
                  --tokenizer_name Playingyoyo/GPepT \
                  --do_train \
                  --do_eval \
                  --output_dir ./output \
                  --learning_rate 1e-5

The fine-tuned model will be saved in the ./output directory, ready to generate tailored sequences.
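
Once training finishes, the checkpoint in ./output can be loaded back into the same generation pipeline used in Example 1; the sketch below reuses those sampling parameters, which are illustrative rather than prescriptive.

from transformers import pipeline

# Point the pipeline at the fine-tuned checkpoint directory
finetuned = pipeline('text-generation', model="./output")

# Sample candidate sequences from the fine-tuned model, as in Example 1
sequences = finetuned("<|endoftext|>",
                      max_length=25,
                      do_sample=True,
                      top_k=950,
                      repetition_penalty=1.5,
                      num_return_sequences=5,
                      eos_token_id=0)

for seq in sequences:
    print(seq['generated_text'])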


Selecting Valid Sequences

While GPepT generates diverse peptidomimetic sequences, not all are chemically valid. For example:

  • Invalid sequences: those in which a terminal-modification token (e.g., a Z-prefixed monomer) appears in the middle of the sequence rather than at a terminus.
  • Valid sequences: those that follow standard peptidomimetic construction rules, with modified monomers placed only in chemically permissible positions.

By filtering out invalid sequences, GPepT users can ensure the generation of high-quality candidates for further study.
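
As an illustration, a simple post-generation filter could reject any sequence in which a terminal-modification token (Z followed by digits) appears away from the sequence ends. This is a minimal sketch covering only that one rule; the helper name is_valid is hypothetical, and the full set of peptidomimetic validity rules is not reproduced here.

def is_valid(generated_text):
    """Reject sequences where a terminal modification token (Z + digits) appears mid-sequence (assumed rule)."""
    tokens = generated_text.replace("<|endoftext|>", "").split()
    for i, token in enumerate(tokens):
        if token.startswith("Z") and token[1:].isdigit():
            # Terminal modifications are only permitted at the first or last position
            if i not in (0, len(tokens) - 1):
                return False
    return True

# Keep only candidates that pass the check
valid_sequences = [s['generated_text'] for s in sequences if is_valid(s['generated_text'])]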


GPepT stands as a powerful tool for researchers at the forefront of peptide and peptidomimetic innovation, enabling both exploration and application in vast chemical and biological spaces.
