stefan-it committed on
Commit 609e6b7 · verified · 1 Parent(s): 2228461

readme: add initial version

  1. README.md +89 -3
README.md CHANGED
@@ -1,3 +1,89 @@
- ---
- license: cc-by-sa-3.0
- ---
+ ---
+ license: cc-by-sa-3.0
+ language:
+ - de
+ library_name: flair
+ ---
+
+ # Flair xLSTM Embeddings (German Wikipedia, Forward)
+
+ Research & development of Flair xLSTM Embeddings (Forward) trained on a [German Wikipedia dump](https://huggingface.co/datasets/gwlms/dewiki-20230701-flair-corpus).
+
+ The Flair team is currently working on the integration of xLSTM (both LM training and fine-tuning models for downstream tasks).
+ Check out the `xlstm` [branch in the Flair repository](https://github.com/flairNLP/flair/tree/xlstm); many thanks to [Patrick Haller](https://huggingface.co/PatrickHaller) for the work on it.
+
+ # Training
+
+ The current model was trained with commit `18ef331` from the [`xlstm` branch](https://github.com/flairNLP/flair/tree/xlstm). The `xlstm` [library](https://github.com/NX-AI/xlstm) needs to be installed manually; also make sure that Ninja is installed (`pip3 install Ninja`).
+
+ The German Wikipedia dump from [this repository](https://huggingface.co/datasets/gwlms/dewiki-20230701-flair-corpus) is used; the corpus is sharded into a Flair-compatible format:
+
+ * `valid.txt` -> Validation corpus
+ * `test.txt` -> Test corpus
+ * `train` -> Folder with text files as training corpus
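
The layout listed above can be produced from any list of text lines. The following is only a minimal sketch: the helper name `write_flair_corpus`, the shard file names, and the split sizes are assumptions made for illustration, not the procedure used to build the linked dataset.

```python
from pathlib import Path


def write_flair_corpus(lines, out_dir, lines_per_shard=2, holdout=1):
    """Write valid.txt, test.txt and a train/ folder of shards under out_dir.

    Hypothetical helper: shard naming and split sizes are illustrative only.
    """
    out = Path(out_dir)
    train_dir = out / "train"
    train_dir.mkdir(parents=True, exist_ok=True)

    # Hold out the last lines for validation and test (holdout must be >= 1).
    valid = lines[-2 * holdout:-holdout]
    test = lines[-holdout:]
    train = lines[:-2 * holdout]

    (out / "valid.txt").write_text("\n".join(valid) + "\n", encoding="utf-8")
    (out / "test.txt").write_text("\n".join(test) + "\n", encoding="utf-8")

    # Split the remaining lines into fixed-size shard files inside train/.
    shard_paths = []
    for start in range(0, len(train), lines_per_shard):
        shard = train[start:start + lines_per_shard]
        path = train_dir / f"train_split_{start // lines_per_shard:03d}.txt"
        path.write_text("\n".join(shard) + "\n", encoding="utf-8")
        shard_paths.append(path)
    return shard_paths
```

The resulting folder can then be passed as the first argument to `TextCorpus`, in place of the `/home/ubuntu/splitted_corpus` path used in the training script.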
+
+ The model was trained with the following parameters for 2 epochs:
+
+ ```python3
+ import flair
+ import torch
+
+ from flair.data import SubTokenDictionary
+ from flair.models import xLSTMLanguageModel
+ from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus
+
+ from transformers import AutoTokenizer
+
+ # Train on the first GPU
+ flair.device = torch.device('cuda:0')
+
+ # Forward language model (as opposed to a backward one)
+ is_forward_lm = True
+
+ dictionary = SubTokenDictionary.load("gwlms/bert-base-dewiki-v1")
+
+ corpus = TextCorpus("/home/ubuntu/splitted_corpus",
+                     dictionary,
+                     is_forward_lm,
+                     character_level=False,
+                     random_case_flip=True,
+                     )
+
+ # xLSTM configuration (YAML)
+ xlstm_ablation_1 = """
+ mlstm_block:
+   mlstm:
+     conv1d_kernel_size: 2
+     qkv_proj_blocksize: 2
+     num_heads: 2
+ slstm_block:
+   slstm:
+     backend: cuda
+     num_heads: 2
+     conv1d_kernel_size: 2
+     bias_init: powerlaw_blockdependent
+   feedforward:
+     proj_factor: 1.3
+     act_fn: gelu
+ context_length: 256
+ num_blocks: 7
+ embedding_dim: 128
+ slstm_at: [1]
+ """
+
+ language_model = xLSTMLanguageModel(dictionary, xlstm_cfg=xlstm_ablation_1, is_forward_lm=True)
+ print(language_model)
+
+ trainer = LanguageModelTrainer(language_model, corpus)
+
+ # Train for 2 epochs; the output folder name encodes batch size, learning rate and epochs
+ trainer.train("xflair-german-wikipedia-xlstm_ablation_1-bs64-lr5-e2",
+               sequence_length=256,
+               mini_batch_size=64,
+               learning_rate=5,
+               patience=50,
+               max_epochs=2,
+               checkpoint=False,
+               num_workers=4,
+               )
+ ```
+
87
+ # Caveats
88
+
89
+ Notice: this model integration is heavily under development. And in the process of finding good hyper-parameters. Also downstream experiments are coming very soon.