---
# language: protein
tags:
- protein language model
datasets:
- ProteinKG25
widget:
- text: "D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T"

---
	
# OntoProtein model
Pretrained model on protein sequences using masked language modeling (MLM) and knowledge embedding (KE) objectives. It was introduced in [this paper](https://openreview.net/pdf?id=yfe1VMYAXa4) and first released in [this repository](https://github.com/zjunlp/OntoProtein). The model is trained on uppercase amino acids, so it only works with capital-letter amino acid sequences.
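Because the model only accepts uppercase residues, raw sequences usually need a small normalization step before tokenization. The sketch below is one way to do that. It assumes the space-separated layout and the `[MASK]` token shown in the widget example of this card; `format_protein_input` is a hypothetical helper name, not part of the OntoProtein repository.

```python
from typing import Optional


def format_protein_input(
    sequence: str,
    mask_position: Optional[int] = None,
    mask_token: str = "[MASK]",
) -> str:
    """Uppercase a protein sequence and separate residues with spaces.

    Assumption: the tokenizer expects one residue per whitespace-separated
    token, as in the widget example of this model card.
    """
    # Drop any whitespace already present and force capital letters.
    residues = [ch.upper() for ch in sequence if not ch.isspace()]

    # Optionally replace one residue with the mask token for MLM-style queries.
    if mask_position is not None:
        if not 0 <= mask_position < len(residues):
            raise IndexError("mask_position is outside the sequence")
        residues[mask_position] = mask_token

    return " ".join(residues)


if __name__ == "__main__":
    print(format_protein_input("dliptssklvv", mask_position=3))
```

The returned string has the same shape as the widget text above and could then be passed to whichever tokenizer or masked-language-modeling pipeline is used to load the model.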

## Model description
OntoProtein is the first general framework that incorporates the structure of GO (Gene Ontology) into protein pre-training models. We construct a novel large-scale knowledge graph that consists of GO and its related proteins, in which every node is described by either a gene annotation text or a protein sequence. We propose a novel contrastive learning method with knowledge-aware negative sampling to jointly optimize the knowledge graph and protein embeddings during pre-training.
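To make the idea of knowledge-aware negative sampling more concrete, here is a minimal toy sketch. It is not the authors' implementation: it assumes that "knowledge-aware" means a corrupted entity is drawn from the same category as the entity it replaces, so that negatives are harder than uniformly random ones. The entity names, the categories, and the function `sample_negative_tail` are all hypothetical.

```python
import random
from typing import Dict, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def sample_negative_tail(
    triple: Triple,
    entity_type: Dict[str, str],
    rng: random.Random,
) -> Triple:
    """Corrupt the tail of a triple with another entity of the same type.

    Assumption: restricting candidates to the tail's own category is what
    makes the sampling "knowledge-aware" in this toy example.
    """
    head, relation, tail = triple
    wanted_type = entity_type[tail]

    # Candidates share the tail's type but differ from the true tail and the head.
    candidates = [
        entity
        for entity, kind in entity_type.items()
        if kind == wanted_type and entity not in (head, tail)
    ]
    if not candidates:
        raise ValueError("no same-type entity available for corruption")

    return (head, relation, rng.choice(candidates))


# Toy knowledge graph vocabulary (hypothetical names and categories).
entity_type = {
    "protein_1": "protein",
    "go_mf_1": "molecular_function",
    "go_mf_2": "molecular_function",
    "go_bp_1": "biological_process",
}

positive = ("protein_1", "enables", "go_mf_1")
negative = sample_negative_tail(positive, entity_type, random.Random(0))

if __name__ == "__main__":
    print(positive, "->", negative)
```

In a contrastive setup, the positive triple would be scored higher than such corrupted triples; the scoring function and loss are left out here because this card does not specify them.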

### BibTeX entry and citation info
```bibtex
@inproceedings{
zhang2022ontoprotein,
title={OntoProtein: Protein Pretraining With Gene Ontology Embedding},
author={Ningyu Zhang and Zhen Bi and Xiaozhuan Liang and Siyuan Cheng and Haosen Hong and Shumin Deng and Qiang Zhang and Jiazhang Lian and Huajun Chen},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=yfe1VMYAXa4}
}
```