juhoinkinen and ltgoslo committed (verified)
Commit 4d1b312 · 0 parents

Duplicate from HPLT/hplt_bert_base_fi

Co-authored-by: Language Technology Group, University of Oslo, Norway <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED
@@ -0,0 +1,128 @@
---
language:
- fi
inference: false
tags:
- BERT
- HPLT
- encoder
license: apache-2.0
datasets:
- HPLT/hplt_monolingual_v1_2
---

# HPLT Bert for Finnish

<img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width=12.5%>

This is one of the encoder-only monolingual language models trained as a first release by the [HPLT project](https://hplt-project.org/).
It is a so-called masked language model. In particular, we used a modification of the classic BERT model named [LTG-BERT](https://aclanthology.org/2023.findings-eacl.146/).

A monolingual LTG-BERT model is trained for every major language in the [HPLT 1.2 data release](https://hplt-project.org/datasets/v1.2) (*75* models total).

All the HPLT encoder-only models use the same hyper-parameters, roughly following the BERT-base setup:
- hidden size: 768
- attention heads: 12
- layers: 12
- vocabulary size: 32768

Every model uses its own tokenizer trained on language-specific HPLT data.
See the sizes of the training corpora, evaluation results, and more in our [language model training report](https://hplt-project.org/HPLT_D4_1___First_language_models_trained.pdf).

The training code is available in the [HPLT-WP4 repository](https://github.com/hplt-project/HPLT-WP4).

The training statistics of all 75 runs are available [on Weights & Biases](https://api.wandb.ai/links/ltg/kduj7mjn).

## Example usage

This model currently needs a custom wrapper from `modeling_ltgbert.py`, so you should load it with `trust_remote_code=True`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_fi")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_fi", trust_remote_code=True)

mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt")
output_p = model(**input_text)
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: '[CLS] It's a beautiful place.[SEP]'
print(tokenizer.decode(output_text[0].tolist()))
```

The following classes are currently implemented: `AutoModel`, `AutoModelForMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForTokenClassification`, `AutoModelForQuestionAnswering` and `AutoModelForMultipleChoice`.
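
For the fine-tuning heads, here is a minimal sketch of loading the sequence-classification wrapper; the head weights are randomly initialized and need fine-tuning, and `num_labels=2` is only an illustrative choice:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_fi")
model = AutoModelForSequenceClassification.from_pretrained(
    "HPLT/hplt_bert_base_fi",
    num_labels=2,  # illustrative; set this to your task's label count
    trust_remote_code=True,
)

inputs = tokenizer("Tämä on esimerkkilause.", return_tensors="pt")
logits = model(**inputs).logits  # shape [1, num_labels]; head is untrained, so values are arbitrary
```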

## Intermediate checkpoints

We release 10 intermediate checkpoints for each model in separate branches, at intervals of 3125 training steps. The naming convention is `stepXXX`; for example, `step18750`.

You can load a specific model revision with `transformers` using the argument `revision`:
```python
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_fi", revision="step21875", trust_remote_code=True)
```

You can access all the revisions for the models with the following code:
```python
from huggingface_hub import list_repo_refs
out = list_repo_refs("HPLT/hplt_bert_base_fi")
print([b.name for b in out.branches])
```

## Cite us

```bibtex
@inproceedings{samuel-etal-2023-trained,
    title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
    author = "Samuel, David  and
      Kutuzov, Andrey  and
      {\O}vrelid, Lilja  and
      Velldal, Erik",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.146",
    doi = "10.18653/v1/2023.findings-eacl.146",
    pages = "1954--1974"
}
```

```bibtex
@inproceedings{de-gibert-etal-2024-new-massive,
    title = "A New Massive Multilingual Dataset for High-Performance Language Technologies",
    author = {de Gibert, Ona  and
      Nail, Graeme  and
      Arefyev, Nikolay  and
      Ba{\~n}{\'o}n, Marta  and
      van der Linde, Jelmer  and
      Ji, Shaoxiong  and
      Zaragoza-Bernabeu, Jaume  and
      Aulamo, Mikko  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Kutuzov, Andrey  and
      Pyysalo, Sampo  and
      Oepen, Stephan  and
      Tiedemann, J{\"o}rg},
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.100",
    pages = "1116--1128",
    abstract = "We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of {\mbox{$\approx$}} 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.",
}
```

__init__.py ADDED
File without changes
config.json ADDED
@@ -0,0 +1,25 @@
{
  "architectures": [
    "LtgbertForMaskedLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_ltgbert.LtgbertConfig",
    "AutoModel": "modeling_ltgbert.LtgbertModel",
    "AutoModelForMaskedLM": "modeling_ltgbert.LtgbertForMaskedLM",
    "AutoModelForSequenceClassification": "modeling_ltgbert.LtgbertForSequenceClassification",
    "AutoModelForTokenClassification": "modeling_ltgbert.LtgbertForTokenClassification",
    "AutoModelForQuestionAnswering": "modeling_ltgbert.LtgbertForQuestionAnswering",
    "AutoModelForMultipleChoice": "modeling_ltgbert.LtgbertForMultipleChoice"
  },
  "attention_probs_dropout_prob": 0.1,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "intermediate_size": 2560,
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "position_bucket_size": 32,
  "torch_dtype": "float32",
  "vocab_size": 32768
}
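
The `auto_map` block above is what routes the generic `Auto*` classes to the custom code shipped in this repository. As a minimal sketch (assuming network access to `HPLT/hplt_bert_base_fi`), loading the configuration with `trust_remote_code=True` resolves to the custom `LtgbertConfig` class:

```python
from transformers import AutoConfig

# AutoConfig follows config.json's auto_map to configuration_ltgbert.LtgbertConfig
config = AutoConfig.from_pretrained("HPLT/hplt_bert_base_fi", trust_remote_code=True)
print(type(config).__name__)                         # LtgbertConfig
print(config.hidden_size, config.intermediate_size)  # 768 2560
```
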
configuration_ltgbert.py ADDED
@@ -0,0 +1,34 @@
from transformers.configuration_utils import PretrainedConfig


class LtgbertConfig(PretrainedConfig):
    """Configuration class to store the configuration of a `LtgbertModel`.
    """
    def __init__(
        self,
        vocab_size=32768,
        attention_probs_dropout_prob=0.1,
        hidden_dropout_prob=0.1,
        hidden_size=768,
        intermediate_size=2048,
        max_position_embeddings=512,
        position_bucket_size=32,
        num_attention_heads=12,
        num_hidden_layers=12,
        layer_norm_eps=1.0e-7,
        output_all_encoded_layers=True,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.output_all_encoded_layers = output_all_encoded_layers
        self.position_bucket_size = position_bucket_size
        self.layer_norm_eps = layer_norm_eps
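
These constructor defaults are only fallbacks; when the checkpoint is loaded, the values stored in `config.json` above take precedence (for example, `intermediate_size` defaults to 2048 here but is 2560 in this checkpoint). A small sketch, assuming this file is importable from the working directory:

```python
from configuration_ltgbert import LtgbertConfig

print(LtgbertConfig().intermediate_size)                        # 2048 (class default)
print(LtgbertConfig(intermediate_size=2560).intermediate_size)  # 2560 (value from config.json)
```
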
modeling_ltgbert.py ADDED
@@ -0,0 +1,639 @@
import math
from typing import List, Optional, Tuple, Union

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils import checkpoint

from .configuration_ltgbert import LtgbertConfig
from transformers.modeling_utils import PreTrainedModel
from transformers.activations import gelu_new
from transformers.modeling_outputs import (
    MaskedLMOutput,
    MultipleChoiceModelOutput,
    QuestionAnsweringModelOutput,
    SequenceClassifierOutput,
    TokenClassifierOutput,
    BaseModelOutput
)
from transformers.pytorch_utils import softmax_backward_data


class Encoder(nn.Module):
    def __init__(self, config, activation_checkpointing=False):
        super().__init__()
        self.layers = nn.ModuleList([EncoderLayer(config) for _ in range(config.num_hidden_layers)])

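        # Depth-dependent initialization: the feed-forward projection weights of layer i are
        # scaled by sqrt(1 / (2 * (i + 1))), so deeper layers start with smaller residual updates.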
        for i, layer in enumerate(self.layers):
            layer.mlp.mlp[1].weight.data *= math.sqrt(1.0 / (2.0 * (1 + i)))
            layer.mlp.mlp[-2].weight.data *= math.sqrt(1.0 / (2.0 * (1 + i)))

        self.activation_checkpointing = activation_checkpointing

    def forward(self, hidden_states, attention_mask, relative_embedding):
        hidden_states, attention_probs = [hidden_states], []

        for layer in self.layers:
            if self.activation_checkpointing:
                hidden_state, attention_p = checkpoint.checkpoint(layer, hidden_states[-1], attention_mask, relative_embedding)
            else:
                hidden_state, attention_p = layer(hidden_states[-1], attention_mask, relative_embedding)

            hidden_states.append(hidden_state)
            attention_probs.append(attention_p)

        return hidden_states, attention_probs


class MaskClassifier(nn.Module):
    def __init__(self, config, subword_embedding):
        super().__init__()
        self.nonlinearity = nn.Sequential(
            nn.LayerNorm(config.hidden_size, config.layer_norm_eps, elementwise_affine=False),
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.GELU(),
            nn.LayerNorm(config.hidden_size, config.layer_norm_eps, elementwise_affine=False),
            nn.Dropout(config.hidden_dropout_prob),
            nn.Linear(subword_embedding.size(1), subword_embedding.size(0))
        )

    def forward(self, x, masked_lm_labels=None):
        if masked_lm_labels is not None:
            x = torch.index_select(x.flatten(0, 1), 0, torch.nonzero(masked_lm_labels.flatten() != -100).squeeze())
        x = self.nonlinearity(x)
        return x


class EncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = Attention(config)
        self.mlp = FeedForward(config)

    def forward(self, x, padding_mask, relative_embedding):
        attention_output, attention_probs = self.attention(x, padding_mask, relative_embedding)
        x = x + attention_output
        x = x + self.mlp(x)
        return x, attention_probs


class GeGLU(nn.Module):
    def forward(self, x):
        x, gate = x.chunk(2, dim=-1)
        x = x * gelu_new(gate)
        return x


class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps, elementwise_affine=False),
            nn.Linear(config.hidden_size, 2*config.intermediate_size, bias=False),
            GeGLU(),
            nn.LayerNorm(config.intermediate_size, eps=config.layer_norm_eps, elementwise_affine=False),
            nn.Linear(config.intermediate_size, config.hidden_size, bias=False),
            nn.Dropout(config.hidden_dropout_prob)
        )

    def forward(self, x):
        return self.mlp(x)


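# Softmax fused with the padding mask: masked positions are filled with -inf before the softmax
# and zeroed afterwards; the backward pass reuses transformers' softmax_backward_data.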
class MaskedSoftmax(torch.autograd.Function):
    @staticmethod
    def forward(self, x, mask, dim):
        self.dim = dim
        x.masked_fill_(mask, float('-inf'))
        x = torch.softmax(x, self.dim)
        x.masked_fill_(mask, 0.0)
        self.save_for_backward(x)
        return x

    @staticmethod
    def backward(self, grad_output):
        output, = self.saved_tensors
        input_grad = softmax_backward_data(self, grad_output, output, self.dim, output)
        return input_grad, None, None


class Attention(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.config = config

        if config.hidden_size % config.num_attention_heads != 0:
            raise ValueError(f"The hidden size {config.hidden_size} is not a multiple of the number of attention heads {config.num_attention_heads}")

        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_size = config.hidden_size // config.num_attention_heads

        self.in_proj_qk = nn.Linear(config.hidden_size, 2*config.hidden_size, bias=True)
        self.in_proj_v = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
        self.out_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)

        self.pre_layer_norm = nn.LayerNorm(config.hidden_size, config.layer_norm_eps, elementwise_affine=False)
        self.post_layer_norm = nn.LayerNorm(config.hidden_size, config.layer_norm_eps, elementwise_affine=True)

        position_indices = torch.arange(config.max_position_embeddings, dtype=torch.long).unsqueeze(1) \
            - torch.arange(config.max_position_embeddings, dtype=torch.long).unsqueeze(0)
        position_indices = self.make_log_bucket_position(position_indices, config.position_bucket_size, config.max_position_embeddings)
        position_indices = config.position_bucket_size - 1 + position_indices
        self.register_buffer("position_indices", position_indices, persistent=True)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.scale = 1.0 / math.sqrt(3 * self.head_size)

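    # DeBERTa-style relative position bucketing: distances within +/- bucket_size/2 keep their
    # exact value, while longer distances are binned logarithmically up to max_position.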
    def make_log_bucket_position(self, relative_pos, bucket_size, max_position):
        sign = torch.sign(relative_pos)
        mid = bucket_size // 2
        abs_pos = torch.where((relative_pos < mid) & (relative_pos > -mid), mid - 1, torch.abs(relative_pos).clamp(max=max_position - 1))
        log_pos = torch.ceil(torch.log(abs_pos / mid) / math.log((max_position-1) / mid) * (mid - 1)).int() + mid
        bucket_pos = torch.where(abs_pos <= mid, relative_pos, log_pos * sign).long()
        return bucket_pos

    def compute_attention_scores(self, hidden_states, relative_embedding):
        key_len, batch_size, _ = hidden_states.size()
        query_len = key_len

        if self.position_indices.size(0) < query_len:
            position_indices = torch.arange(query_len, dtype=torch.long).unsqueeze(1) \
                - torch.arange(query_len, dtype=torch.long).unsqueeze(0)
            position_indices = self.make_log_bucket_position(position_indices, self.config.position_bucket_size, 512)
            position_indices = self.config.position_bucket_size - 1 + position_indices
            self.position_indices = position_indices.to(hidden_states.device)

        hidden_states = self.pre_layer_norm(hidden_states)

        query, key = self.in_proj_qk(hidden_states).chunk(2, dim=2)  # shape: [T, B, D]
        value = self.in_proj_v(hidden_states)  # shape: [T, B, D]

        query = query.reshape(query_len, batch_size * self.num_heads, self.head_size).transpose(0, 1)
        key = key.reshape(key_len, batch_size * self.num_heads, self.head_size).transpose(0, 1)
        value = value.view(key_len, batch_size * self.num_heads, self.head_size).transpose(0, 1)

        attention_scores = torch.bmm(query, key.transpose(1, 2) * self.scale)

        query_pos, key_pos = self.in_proj_qk(self.dropout(relative_embedding)).chunk(2, dim=-1)  # shape: [2T-1, D]
        query_pos = query_pos.view(-1, self.num_heads, self.head_size)  # shape: [2T-1, H, D]
        key_pos = key_pos.view(-1, self.num_heads, self.head_size)  # shape: [2T-1, H, D]

        query = query.view(batch_size, self.num_heads, query_len, self.head_size)
        key = key.view(batch_size, self.num_heads, query_len, self.head_size)

        attention_c_p = torch.einsum("bhqd,khd->bhqk", query, key_pos.squeeze(1) * self.scale)
        attention_p_c = torch.einsum("bhkd,qhd->bhqk", key * self.scale, query_pos.squeeze(1))

        position_indices = self.position_indices[:query_len, :key_len].expand(batch_size, self.num_heads, -1, -1)
        attention_c_p = attention_c_p.gather(3, position_indices)
        attention_p_c = attention_p_c.gather(2, position_indices)

        attention_scores = attention_scores.view(batch_size, self.num_heads, query_len, key_len)
        attention_scores.add_(attention_c_p)
        attention_scores.add_(attention_p_c)

        return attention_scores, value

    def compute_output(self, attention_probs, value):
        attention_probs = self.dropout(attention_probs)
        context = torch.bmm(attention_probs.flatten(0, 1), value)  # shape: [B*H, Q, D]
        context = context.transpose(0, 1).reshape(context.size(1), -1, self.hidden_size)  # shape: [Q, B, H*D]
        context = self.out_proj(context)
        context = self.post_layer_norm(context)
        context = self.dropout(context)
        return context

    def forward(self, hidden_states, attention_mask, relative_embedding):
        attention_scores, value = self.compute_attention_scores(hidden_states, relative_embedding)
        attention_probs = MaskedSoftmax.apply(attention_scores, attention_mask, -1)
        return self.compute_output(attention_probs, value), attention_probs.detach()


class Embedding(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.word_embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.word_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps, elementwise_affine=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

        self.relative_embedding = nn.Parameter(torch.empty(2 * config.position_bucket_size - 1, config.hidden_size))
        self.relative_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    def forward(self, input_ids):
        word_embedding = self.dropout(self.word_layer_norm(self.word_embedding(input_ids)))
        relative_embeddings = self.relative_layer_norm(self.relative_embedding)
        return word_embedding, relative_embeddings


#
# HuggingFace wrappers
#

class LtgbertPreTrainedModel(PreTrainedModel):
    config_class = LtgbertConfig
    supports_gradient_checkpointing = True

    def _set_gradient_checkpointing(self, module, value=False):
        if isinstance(module, Encoder):
            module.activation_checkpointing = value

    def _init_weights(self, module):
        std = math.sqrt(2.0 / (5.0 * self.hidden_size))

        if isinstance(module, nn.Linear):
            nn.init.trunc_normal_(module.weight.data, mean=0.0, std=std, a=-2*std, b=2*std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            nn.init.trunc_normal_(module.weight.data, mean=0.0, std=std, a=-2*std, b=2*std)
        elif isinstance(module, nn.LayerNorm):
            if module.bias is not None:
                module.bias.data.zero_()
            if module.weight is not None:
                module.weight.data.fill_(1.0)


class LtgbertModel(LtgbertPreTrainedModel):
    def __init__(self, config, add_mlm_layer=False, gradient_checkpointing=False, **kwargs):
        super().__init__(config, **kwargs)
        self.config = config
        self.hidden_size = config.hidden_size

        self.embedding = Embedding(config)
        self.transformer = Encoder(config, activation_checkpointing=gradient_checkpointing)
        self.classifier = MaskClassifier(config, self.embedding.word_embedding.weight) if add_mlm_layer else None


    def get_input_embeddings(self):
        return self.embedding.word_embedding

    def set_input_embeddings(self, value):
        self.embedding.word_embedding = value

    def get_contextualized_embeddings(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None
    ) -> List[torch.Tensor]:
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            raise ValueError("You have to specify input_ids")

        batch_size, seq_length = input_shape
        device = input_ids.device

        if attention_mask is None:
            attention_mask = torch.zeros(batch_size, seq_length, dtype=torch.bool, device=device)
        else:
            attention_mask = ~attention_mask.bool()
        attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)

        static_embeddings, relative_embedding = self.embedding(input_ids.t())
        contextualized_embeddings, attention_probs = self.transformer(static_embeddings, attention_mask, relative_embedding)
        contextualized_embeddings = [e.transpose(0, 1) for e in contextualized_embeddings]
        last_layer = contextualized_embeddings[-1]
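        # Re-express the per-layer hidden states as increments over the previous layer;
        # a running (prefix) sum of the returned list reconstructs the absolute layer outputs.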
        contextualized_embeddings = [contextualized_embeddings[0]] + [
            contextualized_embeddings[i] - contextualized_embeddings[i - 1]
            for i in range(1, len(contextualized_embeddings))
        ]
        return last_layer, contextualized_embeddings, attention_probs

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs
    ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        sequence_output, contextualized_embeddings, attention_probs = self.get_contextualized_embeddings(input_ids, attention_mask)

        if not return_dict:
            return (
                sequence_output,
                *([contextualized_embeddings] if output_hidden_states else []),
                *([attention_probs] if output_attentions else [])
            )

        return BaseModelOutput(
            last_hidden_state=sequence_output,
            hidden_states=contextualized_embeddings if output_hidden_states else None,
            attentions=attention_probs if output_attentions else None
        )


class LtgbertForMaskedLM(LtgbertModel):
    _keys_to_ignore_on_load_unexpected = ["head"]

    def __init__(self, config, **kwargs):
        super().__init__(config, add_mlm_layer=True, **kwargs)

    def get_output_embeddings(self):
        return self.classifier.nonlinearity[-1].weight

    def set_output_embeddings(self, new_embeddings):
        self.classifier.nonlinearity[-1].weight = new_embeddings

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        labels: Optional[torch.LongTensor] = None,
        **kwargs
    ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        sequence_output, contextualized_embeddings, attention_probs = self.get_contextualized_embeddings(input_ids, attention_mask)
        subword_prediction = self.classifier(sequence_output)
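        # Never predict the first 107 vocabulary ids (presumably reserved/special tokens).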
        subword_prediction[:, :, :106+1] = float("-inf")

        masked_lm_loss = None
        if labels is not None:
            masked_lm_loss = F.cross_entropy(subword_prediction.flatten(0, 1), labels.flatten())

        if not return_dict:
            output = (
                subword_prediction,
                *([contextualized_embeddings] if output_hidden_states else []),
                *([attention_probs] if output_attentions else [])
            )
            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output

        return MaskedLMOutput(
            loss=masked_lm_loss,
            logits=subword_prediction,
            hidden_states=contextualized_embeddings if output_hidden_states else None,
            attentions=attention_probs if output_attentions else None
        )


class Classifier(nn.Module):
    def __init__(self, config, num_labels: int):
        super().__init__()

        drop_out = getattr(config, "cls_dropout", None)
        drop_out = config.hidden_dropout_prob if drop_out is None else drop_out

        self.nonlinearity = nn.Sequential(
            nn.LayerNorm(config.hidden_size, config.layer_norm_eps, elementwise_affine=False),
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.GELU(),
            nn.LayerNorm(config.hidden_size, config.layer_norm_eps, elementwise_affine=False),
            nn.Dropout(drop_out),
            nn.Linear(config.hidden_size, num_labels)
        )

    def forward(self, x):
        x = self.nonlinearity(x)
        return x


class LtgbertForSequenceClassification(LtgbertModel):
    _keys_to_ignore_on_load_unexpected = ["classifier"]
    _keys_to_ignore_on_load_missing = ["head"]

    def __init__(self, config, **kwargs):
        super().__init__(config, add_mlm_layer=False, **kwargs)

        self.num_labels = config.num_labels
        self.head = Classifier(config, self.num_labels)

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        labels: Optional[torch.LongTensor] = None,
        **kwargs
    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        sequence_output, contextualized_embeddings, attention_probs = self.get_contextualized_embeddings(input_ids, attention_mask)
        logits = self.head(sequence_output[:, 0, :])

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = nn.MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = nn.BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        if not return_dict:
            output = (
                logits,
                *([contextualized_embeddings] if output_hidden_states else []),
                *([attention_probs] if output_attentions else [])
            )
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=contextualized_embeddings if output_hidden_states else None,
            attentions=attention_probs if output_attentions else None
        )


class LtgbertForTokenClassification(LtgbertModel):
    _keys_to_ignore_on_load_unexpected = ["classifier"]
    _keys_to_ignore_on_load_missing = ["head"]

    def __init__(self, config, **kwargs):
        super().__init__(config, add_mlm_layer=False, **kwargs)

        self.num_labels = config.num_labels
        self.head = Classifier(config, self.num_labels)

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        labels: Optional[torch.LongTensor] = None,
        **kwargs
    ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        sequence_output, contextualized_embeddings, attention_probs = self.get_contextualized_embeddings(input_ids, attention_mask)
        logits = self.head(sequence_output)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (
                logits,
                *([contextualized_embeddings] if output_hidden_states else []),
                *([attention_probs] if output_attentions else [])
            )
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=contextualized_embeddings if output_hidden_states else None,
            attentions=attention_probs if output_attentions else None
        )


class LtgbertForQuestionAnswering(LtgbertModel):
    _keys_to_ignore_on_load_unexpected = ["classifier"]
    _keys_to_ignore_on_load_missing = ["head"]

    def __init__(self, config, **kwargs):
        super().__init__(config, add_mlm_layer=False, **kwargs)

        self.num_labels = config.num_labels
        self.head = Classifier(config, self.num_labels)

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        start_positions: Optional[torch.Tensor] = None,
        end_positions: Optional[torch.Tensor] = None,
        **kwargs
    ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        sequence_output, contextualized_embeddings, attention_probs = self.get_contextualized_embeddings(input_ids, attention_mask)
        logits = self.head(sequence_output)

        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1).contiguous()
        end_logits = end_logits.squeeze(-1).contiguous()

        total_loss = None
        if start_positions is not None and end_positions is not None:
            # If we are on multi-GPU, split add a dimension
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)

            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions = start_positions.clamp(0, ignored_index)
            end_positions = end_positions.clamp(0, ignored_index)

            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        if not return_dict:
            output = (
                start_logits,
                end_logits,
                *([contextualized_embeddings] if output_hidden_states else []),
                *([attention_probs] if output_attentions else [])
            )
            return ((total_loss,) + output) if total_loss is not None else output

        return QuestionAnsweringModelOutput(
            loss=total_loss,
            start_logits=start_logits,
            end_logits=end_logits,
            hidden_states=contextualized_embeddings if output_hidden_states else None,
            attentions=attention_probs if output_attentions else None
        )


class LtgbertForMultipleChoice(LtgbertModel):
    _keys_to_ignore_on_load_unexpected = ["classifier"]
    _keys_to_ignore_on_load_missing = ["head"]

    def __init__(self, config, **kwargs):
        super().__init__(config, add_mlm_layer=False, **kwargs)

        self.num_labels = getattr(config, "num_labels", 2)
        self.head = Classifier(config, self.num_labels)

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs
    ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        num_choices = input_ids.shape[1]

        flat_input_ids = input_ids.view(-1, input_ids.size(-1))
        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None

        sequence_output, contextualized_embeddings, attention_probs = self.get_contextualized_embeddings(flat_input_ids, flat_attention_mask)
        logits = self.head(sequence_output)
        reshaped_logits = logits.view(-1, num_choices)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(reshaped_logits, labels)

        if not return_dict:
            output = (
                reshaped_logits,
                *([contextualized_embeddings] if output_hidden_states else []),
                *([attention_probs] if output_attentions else [])
            )
            return ((loss,) + output) if loss is not None else output

        return MultipleChoiceModelOutput(
            loss=loss,
            logits=reshaped_logits,
            hidden_states=contextualized_embeddings if output_hidden_states else None,
            attentions=attention_probs if output_attentions else None
        )

pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9b8a2eb7cc776544e7862a5ecbd1a0523f67176b521c53d5a2b4c80a198cb208
size 525164345

spacial_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": "[BOS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,10 @@
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "bos_token": "[BOS]",
  "eos_token": "[EOS]",
  "unk_token": "[UNK]",
  "sep_token": "[SEP]",
  "pad_token": "[PAD]",
  "cls_token": "[CLS]",
  "mask_token": "[MASK]"
}
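
A quick sketch of checking that these special tokens are picked up when the tokenizer is loaded (assuming the `HPLT/hplt_bert_base_fi` repository id used throughout this card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_fi")
print(tokenizer.mask_token, tokenizer.pad_token)   # [MASK] [PAD]
print(tokenizer.convert_tokens_to_ids("[MASK]"))   # the id used in the masked-LM example above
```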