readme: add initial version

README.md
---
license: cc-by-sa-3.0
language:
- de
library_name: flair
---

# Flair xLSTM Embeddings (German Wikipedia, Forward)

Research & development of Flair xLSTM Embeddings (Forward), trained on a [German Wikipedia dump](https://huggingface.co/datasets/gwlms/dewiki-20230701-flair-corpus).

The Flair team is currently working on the integration of xLSTM, covering both language model training and fine-tuning of models for downstream tasks.
Check out the `xlstm` [branch in the Flair repository](https://github.com/flairNLP/flair/tree/xlstm) - many thanks to [Patrick Haller](https://huggingface.co/PatrickHaller) for his work on it.

# Training

The current model was trained with commit `18ef331` from the [`xlstm` branch](https://github.com/flairNLP/flair/tree/xlstm). The `xlstm` [library](https://github.com/NX-AI/xlstm) needs to be installed manually; also make sure that Ninja is installed (`pip3 install Ninja`), as it is used when compiling the sLSTM CUDA kernels.
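
As a quick sanity check before training, both packages should be importable - a hypothetical snippet, not part of the original instructions:

```python3
# Hypothetical sanity check (not from the original README): the manually
# installed `xlstm` package and the Flair checkout from the `xlstm` branch
# should both be importable before training.
import xlstm
import flair

from flair.models import xLSTMLanguageModel  # only exists on the `xlstm` branch
print(flair.__version__)
```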

The German Wikipedia dump from [this repository](https://huggingface.co/datasets/gwlms/dewiki-20230701-flair-corpus) is used, with the corpus sharded into a Flair-compatible layout (a minimal sharding sketch follows the list):

* `valid.txt` -> validation corpus
* `test.txt` -> test corpus
* `train` -> folder with text files used as the training corpus
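
The dataset repository already ships this layout; for reference, sharding a single large text file into it could look roughly like this. The input file name `dewiki.txt` and the split sizes are illustrative assumptions, not the values used for this model:

```python3
# Hypothetical sharding sketch (not the original preprocessing script):
# splits one large plain-text file into the layout Flair's TextCorpus expects.
from pathlib import Path

corpus_root = Path("/home/ubuntu/splitted_corpus")  # path used by the training script
(corpus_root / "train").mkdir(parents=True, exist_ok=True)

# Assumed input: one large plain-text file, one sentence per line.
lines = Path("dewiki.txt").read_text(encoding="utf-8").splitlines(keepends=True)

# Illustrative hold-out sizes for validation and test.
(corpus_root / "valid.txt").write_text("".join(lines[:10_000]), encoding="utf-8")
(corpus_root / "test.txt").write_text("".join(lines[10_000:20_000]), encoding="utf-8")

# Shard the remainder into text files inside the train/ folder.
train_lines = lines[20_000:]
shard_size = 1_000_000  # lines per training shard
for i in range(0, len(train_lines), shard_size):
    shard = corpus_root / "train" / f"train_split_{i // shard_size}.txt"
    shard.write_text("".join(train_lines[i:i + shard_size]), encoding="utf-8")
```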

The model was trained with the following parameters for 2 epochs:

```python3
import flair
import torch

from flair.data import SubTokenDictionary
from flair.models import xLSTMLanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

flair.device = torch.device('cuda:0')

is_forward_lm = True

# Subword vocabulary from a German BERT model trained on the same Wikipedia dump
dictionary = SubTokenDictionary.load("gwlms/bert-base-dewiki-v1")

# Sharded corpus layout: valid.txt, test.txt and a train/ folder with text shards
corpus = TextCorpus("/home/ubuntu/splitted_corpus",
                    dictionary,
                    is_forward_lm,
                    character_level=False,
                    random_case_flip=True,
                    )

# xLSTM architecture configuration (YAML, parsed by the xlstm library):
# 7 blocks in total, with one sLSTM block at position 1 and mLSTM blocks elsewhere
xlstm_ablation_1 = """
mlstm_block:
  mlstm:
    conv1d_kernel_size: 2
    qkv_proj_blocksize: 2
    num_heads: 2
slstm_block:
  slstm:
    backend: cuda
    num_heads: 2
    conv1d_kernel_size: 2
    bias_init: powerlaw_blockdependent
  feedforward:
    proj_factor: 1.3
    act_fn: gelu
context_length: 256
num_blocks: 7
embedding_dim: 128
slstm_at: [1]
"""

language_model = xLSTMLanguageModel(dictionary, xlstm_cfg=xlstm_ablation_1, is_forward_lm=True)
print(language_model)

trainer = LanguageModelTrainer(language_model, corpus)

trainer.train("xflair-german-wikipedia-xlstm_ablation_1-bs64-lr5-e2",
              sequence_length=256,
              mini_batch_size=64,
              learning_rate=5,
              patience=50,
              max_epochs=2,
              checkpoint=False,
              num_workers=4,
              )
```
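
A minimal loading sketch, assuming the `xlstm` branch keeps Flair's usual `LanguageModel` checkpoint handling, where `LanguageModelTrainer` writes the best model as `best-lm.pt` into the training directory - both the method name and the checkpoint path are assumptions, not confirmed by this README:

```python3
# Hypothetical loading sketch: assumes xLSTMLanguageModel mirrors Flair's
# standard LanguageModel API and the default `best-lm.pt` checkpoint naming.
from flair.models import xLSTMLanguageModel

lm = xLSTMLanguageModel.load_language_model(
    "xflair-german-wikipedia-xlstm_ablation_1-bs64-lr5-e2/best-lm.pt"
)
print(lm)
```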

# Caveats

Notice: this model integration is heavily under development, and the search for good hyper-parameters is still ongoing. Downstream experiments are coming very soon.