---
tags:
- dna
- human_genome
---

# GENA-LM (gena-lm-bigbird-base-t2t)

GENA-LM is a family of open-source foundational models for long DNA sequences.

GENA-LM models are transformer masked language models trained on human DNA sequences.

`gena-lm-bigbird-base-t2t` follows the BigBird architecture and its HuggingFace implementation.

Differences between GENA-LM (`gena-lm-bigbird-base-t2t`) and DNABERT:
- BPE tokenization instead of k-mers (see the tokenization sketch after this list);
- an input size of about 36,000 nucleotides (4096 BPE tokens), compared to 512 nucleotides for DNABERT;
- pre-training on the T2T human genome assembly instead of GRCh38.p13.
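
A quick way to see the effect of BPE tokenization is to tokenize a short DNA fragment and inspect the tokens. The snippet below is a minimal sketch: the DNA string is made up, and token boundaries will vary with the input.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')

# A made-up DNA fragment; each BPE token typically covers several nucleotides,
# which is what stretches the 4096-token input to roughly 36,000 nucleotides.
seq = 'ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGC'
tokens = tokenizer.tokenize(seq)
print(len(seq), 'nucleotides ->', len(tokens), 'BPE tokens')
print(tokens)
```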

Source code and data: https://github.com/AIRI-Institute/GENA_LM

Paper: https://academic.oup.com/nar/article/53/2/gkae1310/7954523

This repository also contains models fine-tuned on downstream tasks, as well as models used in our [GENA-Web](https://dnalm.airi.net) web tool for genomic sequence annotation:
- splice site prediction (branch [gena_web_spliceai](https://huggingface.co/AIRI-Institute/gena-lm-bigbird-base-t2t/tree/gena_web_spliceai)); a loading sketch follows below.
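
Checkpoints stored on a branch can be loaded with the standard `revision` argument of `from_pretrained`. The snippet below is illustrative only: it uses `AutoModel` because the exact task-specific head class depends on the branch, so consult the branch's config before using it for inference.

```python
from transformers import AutoTokenizer, AutoModel

# Illustrative sketch: pull weights from the gena_web_spliceai branch.
# The appropriate task head class may differ; check the branch's config files.
tokenizer = AutoTokenizer.from_pretrained(
    'AIRI-Institute/gena-lm-bigbird-base-t2t', revision='gena_web_spliceai'
)
model = AutoModel.from_pretrained(
    'AIRI-Institute/gena-lm-bigbird-base-t2t', revision='gena_web_spliceai'
)
```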

## Examples

### Load pre-trained model
```python
from transformers import AutoTokenizer, BigBirdForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForMaskedLM.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
```
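
Once loaded, the MLM head can be exercised with a standard HuggingFace mask-and-predict loop, using the `tokenizer` and `model` from the block above. This is a minimal sketch, not an official inference recipe: the DNA fragment is made up and the masked position is chosen arbitrarily.

```python
import torch

# Tokenize a made-up DNA fragment and mask one token in the middle.
seq = 'ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGC'
inputs = tokenizer(seq, return_tensors='pt')
pos = inputs['input_ids'].shape[1] // 2
inputs['input_ids'][0, pos] = tokenizer.mask_token_id

# Predict the most likely token at the masked position.
with torch.no_grad():
    logits = model(**inputs).logits
predicted_id = logits[0, pos].argmax(-1).item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))
```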


### How to load the model to fine-tune it on a classification task
```python
from transformers import AutoTokenizer, BigBirdForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
```
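
The snippet below then runs a single illustrative training step on a toy two-example batch with made-up labels, using the `tokenizer` and `model` just loaded. A real fine-tuning run would use a proper dataset, a DataLoader, and a full training loop (or the HuggingFace `Trainer`).

```python
import torch
from torch.optim import AdamW

# Toy batch: made-up DNA fragments with hypothetical binary labels.
sequences = ['ATGGTGAGCAAGGGCGAGGAGCTGTTCACC', 'TTATTAGCATTAACGCGTTAGCCTAGGATC']
labels = torch.tensor([1, 0])

batch = tokenizer(sequences, padding=True, return_tensors='pt')
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # classification head is randomly initialized
outputs.loss.backward()                  # cross-entropy over the (default) 2 classes
optimizer.step()
optimizer.zero_grad()
print(float(outputs.loss))
```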

## Model description
The GENA-LM `gena-lm-bigbird-base-t2t` model is trained in a masked language model (MLM) fashion, following the method proposed in the BigBird paper, by masking 15% of tokens. The model config for `gena-lm-bigbird-base-t2t` is similar to `google/bigbird-roberta-base` (the sketch after this list shows how to inspect these values from the published config):

- maximum sequence length: 4096
- 12 layers, 12 attention heads
- hidden size: 768
- sparse attention config:
    - block size: 64
    - random blocks: 3
    - global blocks: 2
    - sliding window blocks: 3
- vocabulary size: 32k, tokenizer trained on DNA data
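
These hyperparameters can be read directly from the published config without downloading the weights. The sketch below assumes the attribute names of the standard HuggingFace `BigBirdConfig`.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')

# Values are expected to match the list above (sequence length, depth, width,
# sparse-attention block settings, vocabulary size).
print(config.max_position_embeddings, config.num_hidden_layers,
      config.num_attention_heads, config.hidden_size)
print(config.attention_type, config.block_size, config.num_random_blocks)
print(config.vocab_size)
```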

We pre-trained `gena-lm-bigbird-base-t2t` using the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling mutations from 1000-genome SNPs (gnomAD dataset). Pre-training was performed for 1,070,000 iterations with a batch size of 256.

## Evaluation
For evaluation results, see our paper: https://academic.oup.com/nar/article/53/2/gkae1310/7954523

## Citation
```bibtex
@article{GENA_LM,
    author = {Fishman, Veniamin and Kuratov, Yuri and Shmelev, Aleksei and Petrov, Maxim and Penzar, Dmitry and Shepelin, Denis and Chekanov, Nikolay and Kardymon, Olga and Burtsev, Mikhail},
    title = {GENA-LM: a family of open-source foundational DNA language models for long sequences},
    journal = {Nucleic Acids Research},
    volume = {53},
    number = {2},
    pages = {gkae1310},
    year = {2025},
    month = {01},
    issn = {0305-1048},
    doi = {10.1093/nar/gkae1310},
    url = {https://doi.org/10.1093/nar/gkae1310},
    eprint = {https://academic.oup.com/nar/article-pdf/53/2/gkae1310/61443229/gkae1310.pdf},
}
```