kiddothe2b committed
Commit cfdc1d3 · 1 Parent(s): f8ce38f

Update README.md

Files changed (1): README.md (+40 -8)
README.md CHANGED
@@ -1,23 +1,33 @@
 ---
 tags:
- - generated_from_trainer
 model-index:
- - name: roberta-large-cased
   results: []
 ---

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->

- # roberta-large-cased

- This model was trained from scratch on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.6314

 ## Model description

- More information needed

 ## Intended uses & limitations

@@ -25,7 +35,7 @@ More information needed

 ## Training and evaluation data

- More information needed

 ## Training procedure

@@ -78,3 +88,25 @@ The following hyperparameters were used during training:
 - Pytorch 1.12.0+cu102
 - Datasets 2.7.0
 - Tokenizers 0.12.0

 ---
+ language: en
+ pipeline_tag: fill-mask
+ license: cc-by-sa-4.0
 tags:
+ - legal
 model-index:
+ - name: lexlms/roberta-large
   results: []
+ widget:
+ - text: "The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of police."
+ datasets:
+ - lexlms/lexfiles
 ---

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->

+ # LexLM large

+ This model was initialized from RoBERTa large (https://huggingface.co/roberta-large) and further pre-trained on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lexfiles).
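The warm-start described in this card reuses the original RoBERTa embeddings for all lexically overlapping tokens, so only genuinely new vocabulary entries start from random vectors. A toy sketch of that idea (illustrative only, not the authors' code; the vocabularies and embedding values are hypothetical):

```python
import random

# Illustrative sketch only (not the authors' code): warm-start a new
# tokenizer's embedding table by copying the pretrained vectors of all
# lexically overlapping tokens; only genuinely new tokens are randomly
# initialized.
def warm_start_embeddings(old_vocab, old_emb, new_vocab, dim, seed=0):
    rng = random.Random(seed)
    new_emb = {}
    for token in new_vocab:
        if token in old_vocab:
            new_emb[token] = old_emb[old_vocab[token]]  # reuse pretrained row
        else:
            new_emb[token] = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    return new_emb

# Hypothetical toy vocabularies and embeddings, for illustration only.
old_vocab = {"law": 0, "court": 1, "the": 2}
old_emb = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
new_vocab = {"the": 0, "law": 1, "tort": 2}
new_emb = warm_start_embeddings(old_vocab, old_emb, new_vocab, dim=3)
```

Only the rows for tokens absent from the old vocabulary (here `tort`) need to be learned from scratch during continued pre-training.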
 
 

 ## Model description

+ LexLM (Base/Large) are our newly released RoBERTa models. We follow a series of best practices in language model development:
+ * We warm-start (initialize) our models from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
+ * We train a new tokenizer of 50k BPEs, but we reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021).
+ * We continue pre-training our models on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples and a 20%/30% masking rate (Wettig et al., 2022) for the base/large models, respectively.
+ * We use a sentence sampler with exponential smoothing of the sub-corpora sampling rates, following Conneau et al. (2019), since the proportion of tokens varies widely across sub-corpora and we aim to preserve per-corpus capacity (i.e., avoid overfitting).
+ * We consider mixed-case models, similar to all recently developed large PLMs.
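The exponentially smoothed sampling described above can be sketched as follows (an illustration of Conneau et al. (2019)-style smoothing, not the authors' sampler; the sub-corpus names and token counts are hypothetical):

```python
# Illustrative sketch only: exponential smoothing of sub-corpus sampling
# rates. Each sub-corpus i with raw token share p_i is sampled with
# probability proportional to p_i ** alpha; alpha < 1 flattens the
# distribution, up-weighting small sub-corpora.
def smoothed_rates(token_counts, alpha=0.5):
    total = sum(token_counts.values())
    weights = {name: (n / total) ** alpha for name, n in token_counts.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}

# Hypothetical token counts per sub-corpus, for illustration only.
counts = {"us_case_law": 900_000, "eu_legislation": 90_000, "uk_case_law": 10_000}
rates = smoothed_rates(counts, alpha=0.5)
```

With `alpha=0.5`, the smallest sub-corpus is sampled far more often than its raw 1% token share, while the relative ordering of the sub-corpora is preserved.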

 ## Intended uses & limitations

 ## Training and evaluation data

+ The model was trained on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lexfiles). For evaluation results, please refer to our work "LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development" (Chalkidis* et al., 2023).

 ## Training procedure
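The masked-language-modelling objective used here hides a fixed fraction of token positions (20%/30% for the base/large models, per the description above). A minimal sketch of the position-selection step (assumed whitespace tokenization, not the actual training code):

```python
import random

# Illustrative sketch only: randomly select a fraction of token positions
# and replace them with the mask token, as in masked language modelling.
def mask_tokens(tokens, mask_rate=0.3, mask_token="<mask>", seed=0):
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    return masked, positions

# Hypothetical tokenized sentence, for illustration only.
tokens = "the applicant submitted that her husband was subjected to treatment".split()
masked, positions = mask_tokens(tokens, mask_rate=0.3)
```

The model is then trained to recover the original tokens at the masked positions.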

 - Pytorch 1.12.0+cu102
 - Datasets 2.7.0
 - Tokenizers 0.12.0
+
+ ### Citation
+
+ [*Ilias Chalkidis\*, Nicolas Garneau\*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard.*
+ *LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development.*
+ *2023. In the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada.*](https://aclanthology.org/xxx/)
+ ```
+ @inproceedings{chalkidis-garneau-etal-2023-lexlms,
+     title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
+     author = "Chalkidis*, Ilias and
+       Garneau*, Nicolas and
+       Goanta, Catalina and
+       Katz, Daniel Martin and
+       Søgaard, Anders",
+     booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
+     month = jun,
+     year = "2023",
+     address = "Toronto, Canada",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/xxx",
+ }
+ ```