|
--- |
|
tags: |
|
- antibody language model |
|
- antibody |
|
- protein language model |
|
base_model: Exscientia/IgBert_unpaired |
|
license: mit |
|
--- |
|
|
|
# IgBert |
|
|
|
IgBert is a model pretrained on protein and antibody sequences using a masked language modeling (MLM) objective. It was introduced in the paper [Large scale paired antibody language models](https://arxiv.org/abs/2403.17889).
|
|
|
The model is fine-tuned from [IgBert_unpaired](https://huggingface.co/Exscientia/IgBert_unpaired) using paired antibody sequences from the [Observed Antibody Space](https://opig.stats.ox.ac.uk/webapps/oas/).
|
|
|
# Use |
|
|
|
The model and tokeniser can be loaded using the `transformers` library:
|
|
|
```python |
|
from transformers import BertModel, BertTokenizer |
|
|
|
tokeniser = BertTokenizer.from_pretrained("Exscientia/IgBert", do_lower_case=False) |
|
model = BertModel.from_pretrained("Exscientia/IgBert", add_pooling_layer=False) |
|
``` |
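
The examples below only extract embeddings, so it can be helpful (though optional) to put the model in evaluation mode so that dropout is disabled:

```python
# optional: evaluation mode disables dropout so embeddings are deterministic
model.eval()
```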
|
|
|
The tokeniser is used to prepare batch inputs:
|
```python |
|
# heavy chain sequences |
|
sequences_heavy = [ |
|
"VQLAQSGSELRKPGASVKVSCDTSGHSFTSNAIHWVRQAPGQGLEWMGWINTDTGTPTYAQGFTGRFVFSLDTSARTAYLQISSLKADDTAVFYCARERDYSDYFFDYWGQGTLVTVSS", |
|
"QVQLVESGGGVVQPGRSLRLSCAASGFTFSNYAMYWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRFTISRDNSKNTLYLQMNSLRTEDTAVYYCASGSDYGDYLLVYWGQGTLVTVSS" |
|
] |
|
|
|
# light chain sequences |
|
sequences_light = [ |
|
"EVVMTQSPASLSVSPGERATLSCRARASLGISTDLAWYQQRPGQAPRLLIYGASTRATGIPARFSGSGSGTEFTLTISSLQSEDSAVYYCQQYSNWPLTFGGGTKVEIK", |
|
"ALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQSEDEADYYCNSLTSISTWVFGGGTKLTVL" |
|
] |
|
|
|
# The tokeniser expects input of the form ["V Q ... S S [SEP] E V ... I K", ...] |
|
paired_sequences = [] |
|
for sequence_heavy, sequence_light in zip(sequences_heavy, sequences_light): |
|
paired_sequences.append(' '.join(sequence_heavy)+' [SEP] '+' '.join(sequence_light)) |
|
|
|
tokens = tokeniser.batch_encode_plus( |
|
paired_sequences, |
|
add_special_tokens=True, |
|
padding=True,
|
return_tensors="pt", |
|
return_special_tokens_mask=True |
|
) |
|
``` |
|
|
|
Note that the tokeniser adds a `[CLS]` token at the beginning of each paired sequence, a `[SEP]` token at the end of each paired sequence, and pads shorter sequences in the batch with the `[PAD]` token. For example, a batch containing the sequences `V Q L [SEP] E V V` and `Q V [SEP] A L` will be tokenised to `[CLS] V Q L [SEP] E V V [SEP]` and `[CLS] Q V [SEP] A L [SEP] [PAD] [PAD]`.
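
As an optional sanity check, the token ids can be decoded back into tokens to confirm this behaviour:

```python
# decode each row of token ids to inspect the special tokens and padding
for ids in tokens['input_ids']:
    print(tokeniser.convert_ids_to_tokens(ids.tolist()))
```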
|
|
|
Residue-level embeddings are generated by feeding the tokens through the model:
|
|
|
```python |
|
output = model( |
|
input_ids=tokens['input_ids'], |
|
attention_mask=tokens['attention_mask'] |
|
) |
|
|
|
residue_embeddings = output.last_hidden_state |
|
``` |
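
The resulting tensor holds one embedding per token. A quick shape check (the hidden size depends on the model configuration):

```python
# (batch_size, number_of_tokens_in_longest_sequence, hidden_size)
print(residue_embeddings.shape)
```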
|
|
|
To obtain a sequence-level representation, the residue embeddings can be averaged as follows:
|
|
|
```python |
|
import torch |
|
|
|
# mask special tokens before summing over embeddings |
|
residue_embeddings[tokens["special_tokens_mask"] == 1] = 0 |
|
sequence_embeddings_sum = residue_embeddings.sum(1) |
|
|
|
# average embedding by dividing sum by sequence lengths |
|
sequence_lengths = torch.sum(tokens["special_tokens_mask"] == 0, dim=1) |
|
sequence_embeddings = sequence_embeddings_sum / sequence_lengths.unsqueeze(1) |
|
``` |
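
As a purely illustrative use of these sequence embeddings, the cosine similarity between the two paired sequences in the batch can be computed:

```python
import torch.nn.functional as F

# cosine similarity between the two sequence-level embeddings in the batch
similarity = F.cosine_similarity(sequence_embeddings[0], sequence_embeddings[1], dim=0)
print(similarity.item())
```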
|
|
|
For sequence-level fine-tuning, the model can be loaded with a pooling head by setting `add_pooling_layer=True` and using `output.pooler_output` in the downstream task.
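
A minimal sketch of such a setup is shown below; the linear task head and the two output classes are illustrative assumptions, not part of the released model:

```python
import torch.nn as nn
from transformers import BertModel

model = BertModel.from_pretrained("Exscientia/IgBert", add_pooling_layer=True)

# hypothetical task head: a linear layer on the pooled representation
# (two output classes chosen purely for illustration)
classifier = nn.Linear(model.config.hidden_size, 2)

output = model(
    input_ids=tokens['input_ids'],
    attention_mask=tokens['attention_mask']
)

logits = classifier(output.pooler_output)
```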