RomainDarous committed on
Commit c85df88 · verified · 1 Parent(s): ede6b98

Add new SentenceTransformer model
1_MultiHeadGeneralizedPooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+     "sentence_dim": 768,
+     "token_dim": 768,
+     "num_heads": 8,
+     "initialize": "random",
+     "pooling_type": 1
+ }
2_Dense/config.json ADDED
@@ -0,0 +1 @@
+ {"in_features": 768, "out_features": 512, "bias": true, "activation_function": "torch.nn.modules.activation.Tanh"}
2_Dense/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ff5f2ac27c53fec07c80c20bff5f26280603dac02c520f0abbae57b0886265bf
+ size 1575072
README.md ADDED
@@ -0,0 +1,911 @@
+ ---
+ language:
+ - bn
+ - cs
+ - de
+ - en
+ - et
+ - fi
+ - fr
+ - gu
+ - ha
+ - hi
+ - is
+ - ja
+ - kk
+ - km
+ - lt
+ - lv
+ - pl
+ - ps
+ - ru
+ - ta
+ - tr
+ - uk
+ - xh
+ - zh
+ - zu
+ - ne
+ - ro
+ - si
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:1327190
+ - loss:CoSENTLoss
+ base_model: sentence-transformers/distiluse-base-multilingual-cased-v2
+ widget:
+ - source_sentence: यहाँका केही धार्मिक सम्पदाहरू यस प्रकार रहेका छन्। ("Some of the religious heritage sites here are as follows.")
+   sentences:
+   - A party works journalists from advertisements about a massive Himalayan post.
+   - Some religious affiliations here remain.
+   - In Spain, the strict opposition of Roman Catholic churches is found to have assumed a marriage similar to male beach wives.
+ - source_sentence: Das Feuer konnte rasch wieder gelöscht werden. ("The fire was quickly extinguished again.")
+   sentences:
+   - In particular, Spot has an exclusive software platform that is only specially developed for Spot, and users can set up the Spot robot function themselves through a variety of applications.
+   - The fire was quickly extinguished.
+   - The PSG has made it clear that the Italian national will not be allowed to leave in any condition, and Barcelona feels the reflections of this interest by losing Neymar's greatest values.
+ - source_sentence: He possesses a pistol with silver bullets for protection from vampires and werewolves.
+   sentences:
+   - Er besitzt eine Pistole mit silbernen Kugeln zum Schutz vor Vampiren und Werwölfen.
+   - Bibimbap umfasst Reis, Spinat, Rettich, Bohnensprossen.
+   - BSAC profitierte auch von den großen, aber nicht unbeschränkten persönlichen Vermögen von Rhodos und Beit vor ihrem Tod.
+ - source_sentence: To the west of the Badger Head Inlier is the Port Sorell Formation, a tectonic mélange of marine sediments and dolerite.
+   sentences:
+   - Er brennt einen Speer und brennt Flammen aus seinem Mund, wenn er wütend ist.
+   - Westlich des Badger Head Inlier befindet sich die Port Sorell Formation, eine tektonische Mischung aus Sedimenten und Dolerit.
+   - Public Lynching and Mob Violence under Modi Government
+ - source_sentence: Garnizoana otomană se retrage în sudul Dunării, iar după 164 de ani cetatea intră din nou sub stăpânirea europenilor. ("The Ottoman garrison withdraws south of the Danube, and after 164 years the fortress again comes under European rule.")
+   sentences:
+   - This is because, once again, we have taken into account the fact that we have adopted a large number of legislative proposals.
+   - Helsinki University ranks 75th among universities for 2010.
+   - Ottoman garnisoana is withdrawing into the south of the Danube and, after 164 years, it is once again under the control of Europeans.
+ datasets:
+ - RicardoRei/wmt-da-human-evaluation
+ - wmt/wmt20_mlqe_task1
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - pearson_cosine
+ - spearman_cosine
+ model-index:
+ - name: SentenceTransformer based on sentence-transformers/distiluse-base-multilingual-cased-v2
+   results:
+   - task:
+       type: semantic-similarity
+       name: Semantic Similarity
+     dataset:
+       name: sts eval
+       type: sts-eval
+     metrics:
+     - type: pearson_cosine
+       value: 0.42072704811442524
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.41492248565322287
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.04798468697271309
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.09163381637023821
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.13419394852857455
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.14021002112020048
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.3686145842456057
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.37403547930478337
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.4036712785577461
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.40203424777388935
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.4765959009301104
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.45931707741919825
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.30588658376090044
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.26881979874382245
+       name: Spearman Cosine
+   - task:
+       type: semantic-similarity
+       name: Semantic Similarity
+     dataset:
+       name: sts test
+       type: sts-test
+     metrics:
+     - type: pearson_cosine
+       value: 0.41673846273409015
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.413125969680318
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.025760972016236502
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.06798878866242045
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.14352602331425646
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.19612784355376908
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.3719362123606391
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.37629168606256713
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.39800102996751985
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.40749186555429473
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.42084642716136017
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.4185137269420985
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.31870110899456183
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.2675729909480732
+       name: Spearman Cosine
+ ---
+ 
+ # SentenceTransformer based on sentence-transformers/distiluse-base-multilingual-cased-v2
+ 
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) on the [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation), [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) and [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) datasets. It maps sentences & paragraphs to a 512-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+ 
+ ## Model Details
+ 
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) <!-- at revision dad0fa1ee4fa6e982d3adbce87c73c02e6aee838 -->
+ - **Maximum Sequence Length:** 128 tokens
+ - **Output Dimensionality:** 512 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Training Datasets:**
+     - [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation)
+     - [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+     - [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+     - [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+     - [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+     - [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+     - [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+ - **Languages:** bn, cs, de, en, et, fi, fr, gu, ha, hi, is, ja, kk, km, lt, lv, pl, ps, ru, ta, tr, uk, xh, zh, zu, ne, ro, si
+ <!-- - **License:** Unknown -->
+ 
+ ### Model Sources
+ 
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+ 
+ ### Full Model Architecture
+ 
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel
+   (1): MultiHeadGeneralizedPooling(
+     (Q): ModuleList(
+       (0-7): 8 x Linear(in_features=96, out_features=1, bias=True)
+     )
+     (P_K): ModuleList(
+       (0-7): 8 x Linear(in_features=768, out_features=96, bias=True)
+     )
+   )
+   (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
+ )
+ ```
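The printed shapes suggest one plausible reading of the pooling layer: per head, `P_K` projects each 768-dim token embedding down to 96 dims, `Q` scores each projected token with a single linear unit, and the softmax-weighted sums of the 8 heads are concatenated back to 768 dims. A minimal NumPy sketch of that reading follows; the parameter names and the exact weighting scheme are assumptions, not the repository's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_HEADS, TOKEN_DIM, HEAD_DIM = 8, 768, 96  # 8 heads * 96 dims == 768

# Hypothetical parameters mirroring the printed module shapes:
# P_K: per-head projection 768 -> 96, Q: per-head scorer 96 -> 1.
P_K = [rng.standard_normal((TOKEN_DIM, HEAD_DIM)) * 0.02 for _ in range(NUM_HEADS)]
Q = [rng.standard_normal((HEAD_DIM, 1)) * 0.02 for _ in range(NUM_HEADS)]

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_generalized_pooling(token_embeddings):
    """token_embeddings: (seq_len, 768) -> sentence embedding (768,)."""
    heads = []
    for h in range(NUM_HEADS):
        projected = token_embeddings @ P_K[h]              # (seq_len, 96)
        weights = softmax((projected @ Q[h]).squeeze(-1))  # (seq_len,) attention
        heads.append(weights @ projected)                  # (96,) weighted sum
    return np.concatenate(heads)                           # (768,)

tokens = rng.standard_normal((10, TOKEN_DIM))
print(multi_head_generalized_pooling(tokens).shape)  # (768,)
```

The 768-dim pooled vector then feeds the Dense layer, which maps it to the final 512-dim embedding.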
+ 
+ ## Usage
+ 
+ ### Direct Usage (Sentence Transformers)
+ 
+ First install the Sentence Transformers library:
+ 
+ ```bash
+ pip install -U sentence-transformers
+ ```
+ 
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("RomainDarous/pre_training_dot_product_generalized_model")
+ # Run inference
+ sentences = [
+     'Garnizoana otomană se retrage în sudul Dunării, iar după 164 de ani cetatea intră din nou sub stăpânirea europenilor.',
+     'Ottoman garnisoana is withdrawing into the south of the Danube and, after 164 years, it is once again under the control of Europeans.',
+     'This is because, once again, we have taken into account the fact that we have adopted a large number of legislative proposals.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 512]
+ 
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
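`model.similarity` defaults to cosine similarity over the encoded vectors. For intuition, here is a self-contained NumPy equivalent of the 3×3 similarity matrix computed above; it mirrors, but is not, the library call.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity: normalize rows, then take dot products."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

# Stand-in for model.encode(...) output: 3 sentences, 512 dims each.
emb = np.random.default_rng(0).standard_normal((3, 512))
sims = cosine_similarity_matrix(emb)
print(sims.shape)  # (3, 3)
```

Each diagonal entry is 1.0 (a vector compared with itself), and the matrix is symmetric.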
+ 
+ <!--
+ ### Direct Usage (Transformers)
+ 
+ <details><summary>Click to see the direct usage in Transformers</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+ 
+ You can finetune this model on your own dataset.
+ 
+ <details><summary>Click to expand</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Out-of-Scope Use
+ 
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+ 
+ ## Evaluation
+ 
+ ### Metrics
+ 
+ #### Semantic Similarity
+ 
+ * Datasets: `sts-eval` and `sts-test`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+ 
+ | Metric              | sts-eval   | sts-test   |
+ |:--------------------|:-----------|:-----------|
+ | pearson_cosine      | 0.4207     | 0.3187     |
+ | **spearman_cosine** | **0.4149** | **0.2676** |
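The reported Pearson and Spearman cosine scores are plain correlation coefficients between the model's predicted cosine similarities and the human scores; Spearman is Pearson applied to ranks. A dependency-free sketch of both follows (the evaluator itself relies on scipy, and this simplified rank step ignores ties):

```python
def pearson(x, y):
    """Pearson correlation: covariance divided by the product of std devs."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def rank(xs):
    """Position of each value in sorted order (no tie handling in this sketch)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for pos, i in enumerate(order):
        ranks[i] = pos
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson over the ranks."""
    return pearson(rank(x), rank(y))

predicted = [0.91, 0.42, 0.13, 0.77]
gold = [0.93, 0.50, 0.14, 0.70]
print(round(pearson(predicted, gold), 4), round(spearman(predicted, gold), 4))
```

A score of 1.0 means the predicted similarities order the pairs exactly as the human annotations do.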
+ 
+ #### Semantic Similarity
+ 
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+ 
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.048      |
+ | **spearman_cosine** | **0.0916** |
+ 
+ #### Semantic Similarity
+ 
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+ 
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.1342     |
+ | **spearman_cosine** | **0.1402** |
+ 
+ #### Semantic Similarity
+ 
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+ 
+ | Metric              | Value     |
+ |:--------------------|:----------|
+ | pearson_cosine      | 0.3686    |
+ | **spearman_cosine** | **0.374** |
+ 
+ #### Semantic Similarity
+ 
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+ 
+ | Metric              | Value     |
+ |:--------------------|:----------|
+ | pearson_cosine      | 0.4037    |
+ | **spearman_cosine** | **0.402** |
+ 
+ #### Semantic Similarity
+ 
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+ 
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.4766     |
+ | **spearman_cosine** | **0.4593** |
+ 
+ #### Semantic Similarity
+ 
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+ 
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.3059     |
+ | **spearman_cosine** | **0.2688** |
+ 
+ <!--
+ ## Bias, Risks and Limitations
+ 
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+ 
+ <!--
+ ### Recommendations
+ 
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+ 
+ ## Training Details
+ 
+ ### Training Datasets
+ 
+ #### wmt_da
+ 
+ * Dataset: [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation) at [301de38](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation/tree/301de385bf05b0c00a8f4be74965e186164dd425)
+ * Size: 1,285,190 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence1 | sentence2 | score |
+   |:--------|:----------|:----------|:------|
+   | type    | string | string | float |
+   | details | <ul><li>min: 4 tokens</li><li>mean: 37.89 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 37.91 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.72</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence1 | sentence2 | score |
+   |:----------|:----------|:------|
+   | <code>Im Kanzleramt hatte der in Hamburg lebende türkische Journalist ein T-Shirt mit der türkischen und deutschen Aufschrift "Freiheit für Journalisten" übergezogen und war in die erste Reihe gegangen.</code> | <code>In the Chancellery, the Turkish journalist, who lives in Hamburg, had covered a T-shirt with the Turkish and German inscription "Freedom for Journalists" and had gone into the front row.</code> | <code>0.93</code> |
+   | <code>Das Außenministerium in London bezeichnete die Festsetzung des Schiffes als illegal. "Das ist Teil eines Musters von Versuchen, die Freiheit der Meere zu beeinträchtigen. Wir arbeiten mit unseren internationalen Partnern daran, die Schifffahrt und das Internationale Recht aufrechtzuerhalten", hieß es in einer Mitteilung am Freitag.</code> | <code>The State Department in London called the ship's fixing was illegal. ′′ This is part of a pattern of attempts to interfere with sea freedom. We are working with our international partners to maintain shipping and international law ", said a message on Friday.</code> | <code>0.9</code> |
+   | <code>Unfortunately, the list it belongs to is that of unique buildings that are in danger of collapse.</code> | <code>Bohužel, seznam patří k jedinečné budovy, které jsou v nebezpečí kolapsu.</code> | <code>0.14</code> |
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "pairwise_cos_sim"
+   }
+   ```
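CoSENTLoss with these parameters trains on pair ordering rather than regressing the scores directly: for every pair of training examples where one gold score exceeds the other, it penalizes the model when the predicted cosine similarities are not ordered the same way, with the margin sharpened by the scale factor of 20. A minimal scalar sketch of that objective, simplified from the library's batched implementation:

```python
import math

def cosent_loss(pred_cos, gold, scale=20.0):
    """CoSENT objective: log(1 + sum over pairs (i, j) with gold[i] > gold[j]
    of exp(scale * (pred_cos[j] - pred_cos[i]))). Each term is large exactly
    when a pair that should rank lower gets a higher predicted similarity."""
    terms = [
        math.exp(scale * (pred_cos[j] - pred_cos[i]))
        for i in range(len(gold))
        for j in range(len(gold))
        if gold[i] > gold[j]
    ]
    return math.log1p(sum(terms))

# A correctly ordered batch yields a small loss; an inverted one a large loss.
good = cosent_loss([0.9, 0.5, 0.1], [1.0, 0.6, 0.1])
bad = cosent_loss([0.1, 0.5, 0.9], [1.0, 0.6, 0.1])
print(good < bad)  # True
```

Because only relative order matters, the loss is well suited to the DA and MLQE quality scores used here, which are comparable within a batch but noisy in absolute terms.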
+ 
+ #### mlqe_en_de
+ 
+ * Dataset: [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
+ * Size: 7,000 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence1 | sentence2 | score |
+   |:--------|:----------|:----------|:------|
+   | type    | string | string | float |
+   | details | <ul><li>min: 11 tokens</li><li>mean: 23.78 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 26.51 tokens</li><li>max: 54 tokens</li></ul> | <ul><li>min: 0.06</li><li>mean: 0.86</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence1 | sentence2 | score |
+   |:----------|:----------|:------|
+   | <code>Early Muslim traders and merchants visited Bengal while traversing the Silk Road in the first millennium.</code> | <code>Frühe muslimische Händler und Kaufleute besuchten Bengalen, während sie im ersten Jahrtausend die Seidenstraße durchquerten.</code> | <code>0.9233333468437195</code> |
+   | <code>While Fran dissipated shortly after that, the tropical wave progressed into the northeastern Pacific Ocean.</code> | <code>Während Fran kurz danach zerstreute, entwickelte sich die tropische Welle in den nordöstlichen Pazifischen Ozean.</code> | <code>0.8899999856948853</code> |
+   | <code>Distressed securities include such events as restructurings, recapitalizations, and bankruptcies.</code> | <code>Zu den belasteten Wertpapieren gehören Restrukturierungen, Rekapitalisierungen und Insolvenzen.</code> | <code>0.9300000071525574</code> |
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "pairwise_cos_sim"
+   }
+   ```
+ 
+ #### mlqe_en_zh
+ 
+ * Dataset: [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
+ * Size: 7,000 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence1 | sentence2 | score |
+   |:--------|:----------|:----------|:------|
+   | type    | string | string | float |
+   | details | <ul><li>min: 9 tokens</li><li>mean: 24.09 tokens</li><li>max: 47 tokens</li></ul> | <ul><li>min: 12 tokens</li><li>mean: 29.93 tokens</li><li>max: 74 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 0.98</li></ul> |
+ * Samples:
+   | sentence1 | sentence2 | score |
+   |:----------|:----------|:------|
+   | <code>In the late 1980s, the hotel's reputation declined, and it functioned partly as a "backpackers hangout."</code> | <code>在 20 世纪 80 年代末 , 这家旅馆的声誉下降了 , 部分地起到了 "背包吊销" 的作用。</code> | <code>0.40666666626930237</code> |
+   | <code>From 1870 to 1915, 36 million Europeans migrated away from Europe.</code> | <code>从 1870 年到 1915 年 , 3, 600 万欧洲人从欧洲移民。</code> | <code>0.8333333730697632</code> |
+   | <code>In some photos, the footpads did press into the regolith, especially when they moved sideways at touchdown.</code> | <code>在一些照片中 , 脚垫确实挤进了后台 , 尤其是当他们在触地时侧面移动时。</code> | <code>0.33000001311302185</code> |
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "pairwise_cos_sim"
+   }
+   ```
+ 
+ #### mlqe_et_en
+ 
+ * Dataset: [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
+ * Size: 7,000 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence1 | sentence2 | score |
+   |:--------|:----------|:----------|:------|
+   | type    | string | string | float |
+   | details | <ul><li>min: 14 tokens</li><li>mean: 31.88 tokens</li><li>max: 63 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 24.57 tokens</li><li>max: 56 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.67</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence1 | sentence2 | score |
+   |:----------|:----------|:------|
+   | <code>Gruusias vahistati president Mihhail Saakašvili pressibüroo nõunik Simon Kiladze, keda süüdistati spioneerimises.</code> | <code>In Georgia, an adviser to the press office of President Mikhail Saakashvili, Simon Kiladze, was arrested and accused of spying.</code> | <code>0.9466666579246521</code> |
+   | <code>Nii teadmissotsioloogia pooldajad tavaliselt Kuhni tõlgendavadki, arendades tema vaated sõnaselgeks relativismiks.</code> | <code>This is how supporters of knowledge sociology usually interpret Kuhn by developing his views into an explicit relativism.</code> | <code>0.9366666674613953</code> |
+   | <code>18. jaanuaril 2003 haarasid mitmeid Canberra eeslinnu võsapõlengud, milles hukkus neli ja sai vigastada 435 inimest.</code> | <code>On 18 January 2003, several of the suburbs of Canberra were seized by debt fires which killed four people and injured 435 people.</code> | <code>0.8666666150093079</code> |
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "pairwise_cos_sim"
+   }
+   ```
+ 
+ #### mlqe_ne_en
+ 
+ * Dataset: [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
+ * Size: 7,000 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence1 | sentence2 | score |
+   |:--------|:----------|:----------|:------|
+   | type    | string | string | float |
+   | details | <ul><li>min: 17 tokens</li><li>mean: 40.67 tokens</li><li>max: 77 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 24.66 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.39</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence1 | sentence2 | score |
+   |:----------|:----------|:------|
+   | <code>सामान्‍य बजट प्रायः फेब्रुअरीका अंतिम कार्य दिवसमा लाईन्छ।</code> | <code>A normal budget is usually awarded to the digital working day of February.</code> | <code>0.5600000023841858</code> |
+   | <code>कविताका यस्ता स्वरूपमा दुई, तिन वा चार पाउसम्मका मुक्तक, हाइकु, सायरी र लोकसूक्तिहरू पर्दछन् ।</code> | <code>The book consists of two, free of her or four paulets, haiku, Sairi, and locus in such forms.</code> | <code>0.23666666448116302</code> |
+   | <code>ब्रिट्नीले यस बारेमा प्रतिक्रिया ब्यक्ता गरदै भनिन,"कुन ठूलो कुरा हो र?</code> | <code>Britney did not respond to this, saying "which is a big thing and a big thing?</code> | <code>0.21666665375232697</code> |
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "pairwise_cos_sim"
+   }
+   ```
+ 
+ #### mlqe_ro_en
+ 
+ * Dataset: [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
+ * Size: 7,000 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence1 | sentence2 | score |
+   |:--------|:----------|:----------|:------|
+   | type    | string | string | float |
+   | details | <ul><li>min: 12 tokens</li><li>mean: 29.44 tokens</li><li>max: 60 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 22.38 tokens</li><li>max: 65 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence1 | sentence2 | score |
+   |:----------|:----------|:------|
+   | <code>Orașul va fi împărțit în patru districte, iar suburbiile în 10 mahalale.</code> | <code>The city will be divided into four districts and suburbs into 10 mahalals.</code> | <code>0.4699999988079071</code> |
+   | <code>La scurt timp după aceasta, au devenit cunoscute debarcările germane de la Trondheim, Bergen și Stavanger, precum și luptele din Oslofjord.</code> | <code>In the light of the above, the Authority concludes that the aid granted to ADIF is compatible with the internal market pursuant to Article 61 (3) (c) of the EEA Agreement.</code> | <code>0.02666666731238365</code> |
+   | <code>Până în vara 1791, în Clubul iacobinilor au dominat reprezentanții monarhismului liberal constituțional.</code> | <code>Until the summer of 1791, representatives of liberal constitutional monarchism dominated in the Jacobins Club.</code> | <code>0.8733333349227905</code> |
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "pairwise_cos_sim"
+   }
+   ```
+ 
+ #### mlqe_si_en
+ 
+ * Dataset: [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
+ * Size: 7,000 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence1 | sentence2 | score |
+   |:--------|:----------|:----------|:------|
+   | type    | string | string | float |
+   | details | <ul><li>min: 8 tokens</li><li>mean: 18.19 tokens</li><li>max: 38 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 22.31 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.51</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence1 | sentence2 | score |
+   |:----------|:----------|:------|
+   | <code>ඇපලෝ 4 සැටර්න් V බූස්ටරයේ ප්‍රථම පර්යේෂණ පියාසැරිය විය.</code> | <code>The first research flight of the Apollo 4 Saturn V Booster.</code> | <code>0.7966666221618652</code> |
538
+ | <code>මෙහි අවපාතය සැලකීමේ දී, මෙහි 48%ක අවරෝහණය $ මිලියන 125කට අධික චිත්‍රපටයක් ලද තෙවන කුඩාම අවපාතය වේ.</code> | <code>In conjunction with the depression here, 48 % of obesity here is the third smallest depression in over $ 125 million film.</code> | <code>0.17666666209697723</code> |
539
+ | <code>එසේම "බකමූණන් මගින් මෙම රාක්ෂසියගේ රාත්‍රී හැසිරීම සංකේතවත් වන බව" පවසයි.</code> | <code>Also "the owl says that this monster's night behavior is symbolic".</code> | <code>0.8799999952316284</code> |
540
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
541
+ ```json
542
+ {
543
+ "scale": 20.0,
544
+ "similarity_fct": "pairwise_cos_sim"
545
+ }
546
+ ```
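All of the CoSENTLoss blocks in this card use the same parameters (`scale: 20.0`, `similarity_fct: pairwise_cos_sim`). As a rough, dependency-free sketch (not the sentence-transformers implementation), CoSENT penalizes every pair of examples whose cosine similarities are ordered inconsistently with their gold scores:

```python
import math

def cosent_loss(cos_sims, scores, scale=20.0):
    """Pure-Python sketch of the CoSENT ranking loss.

    For every pair (i, j) with gold score_i > score_j, adds
    exp(scale * (cos_j - cos_i)); the loss is log(1 + that sum),
    so mis-ordered similarity pairs are penalized exponentially.
    """
    total = 0.0
    for ci, si in zip(cos_sims, scores):
        for cj, sj in zip(cos_sims, scores):
            if si > sj:
                total += math.exp(scale * (cj - ci))
    return math.log(1.0 + total)
```

With similarities ordered like the gold scores the loss is near zero; reversing them drives it up sharply (e.g. `cosent_loss([0.9, 0.1], [1.0, 0.0])` is tiny, while `cosent_loss([0.1, 0.9], [1.0, 0.0])` is about 16).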
547
+
548
+ ### Evaluation Datasets
549
+
550
+ #### wmt_da
551
+
552
+ * Dataset: [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation) at [301de38](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation/tree/301de385bf05b0c00a8f4be74965e186164dd425)
553
+ * Size: 1,285,190 evaluation samples
554
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
555
+ * Approximate statistics based on the first 1000 samples:
556
+ | | sentence1 | sentence2 | score |
557
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:---------------------------------------------------------------|
558
+ | type | string | string | float |
559
+ | details | <ul><li>min: 4 tokens</li><li>mean: 36.94 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 37.23 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.69</li><li>max: 1.0</li></ul> |
560
+ * Samples:
561
+ | sentence1 | sentence2 | score |
562
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------|
563
+ | <code>After playing classic 1982 track Eminence Front, Daltrey called it quits. he has struggled with vocal issues and apparently is under strict instructions from his surgeon.</code> | <code>Nachdem er 1982 den klassischen Track Eminence Front gespielt hatte, nannte Daltrey es beendet. Er hat mit Stimmproblemen zu kämpfen und steht offenbar unter strengen Anweisungen seines Chirurgen.</code> | <code>0.715</code> |
564
+ | <code>જ્યારે કોંગ્રેસે આ બાબતનો વિરોધ કર્યો છે.</code> | <code>While Congress has resisted the matter.</code> | <code>0.77</code> |
565
+ | <code>Police are currently investigating a series of antisemitic comments posted on the Grime artist's social media accounts.</code> | <code>警方目前正在调查在这位污垢艺术家的社交媒体账户上发布的一系列反犹评论。</code> | <code>0.66</code> |
566
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
567
+ ```json
568
+ {
569
+ "scale": 20.0,
570
+ "similarity_fct": "pairwise_cos_sim"
571
+ }
572
+ ```
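The "approximate statistics" rows above (min/mean/max tokens) are computed over the first 1,000 samples with the model's subword tokenizer. A hypothetical helper showing the shape of that computation, using a whitespace split as a stand-in tokenizer (the exact values in the tables require the DistilBERT tokenizer):

```python
def token_stats(sentences, tokenize=str.split):
    # Stand-in for the card's per-column statistics: tokenize each
    # sentence and report min / mean / max token counts.
    counts = [len(tokenize(s)) for s in sentences]
    return {"min": min(counts), "mean": sum(counts) / len(counts), "max": max(counts)}
```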
573
+
574
+ #### mlqe_en_de
575
+
576
+ * Dataset: [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
577
+ * Size: 1,000 evaluation samples
578
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
579
+ * Approximate statistics based on the first 1000 samples:
580
+ | | sentence1 | sentence2 | score |
581
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
582
+ | type | string | string | float |
583
+ | details | <ul><li>min: 11 tokens</li><li>mean: 24.11 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 26.66 tokens</li><li>max: 55 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.81</li><li>max: 1.0</li></ul> |
584
+ * Samples:
585
+ | sentence1 | sentence2 | score |
586
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
587
+ | <code>Resuming her patrols, Constitution managed to recapture the American sloop Neutrality on 27 March and, a few days later, the French ship Carteret.</code> | <code>Mit der Wiederaufnahme ihrer Patrouillen gelang es der Verfassung, am 27. März die amerikanische Schleuderneutralität und wenige Tage später das französische Schiff Carteret zurückzuerobern.</code> | <code>0.9033333659172058</code> |
588
+ | <code>Blaine's nomination alienated many Republicans who viewed Blaine as ambitious and immoral.</code> | <code>Blaines Nominierung entfremdete viele Republikaner, die Blaine als ehrgeizig und unmoralisch betrachteten.</code> | <code>0.9216666221618652</code> |
589
+ | <code>This initiated a brief correspondence between the two which quickly descended into political rancor.</code> | <code>Dies leitete eine kurze Korrespondenz zwischen den beiden ein, die schnell zu politischem Groll abstieg.</code> | <code>0.878333330154419</code> |
590
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
591
+ ```json
592
+ {
593
+ "scale": 20.0,
594
+ "similarity_fct": "pairwise_cos_sim"
595
+ }
596
+ ```
597
+
598
+ #### mlqe_en_zh
599
+
600
+ * Dataset: [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
601
+ * Size: 1,000 evaluation samples
602
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
603
+ * Approximate statistics based on the first 1000 samples:
604
+ | | sentence1 | sentence2 | score |
605
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
606
+ | type | string | string | float |
607
+ | details | <ul><li>min: 9 tokens</li><li>mean: 23.75 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 29.56 tokens</li><li>max: 67 tokens</li></ul> | <ul><li>min: 0.26</li><li>mean: 0.65</li><li>max: 0.9</li></ul> |
608
+ * Samples:
609
+ | sentence1 | sentence2 | score |
610
+ |:---------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------|:--------------------------------|
611
+ | <code>Freeman briefly stayed with the king before returning to Accra via Whydah, Ahgwey and Little Popo.</code> | <code>弗里曼在经过惠达、阿格威和小波波回到阿克拉之前与国王一起住了一会儿。</code> | <code>0.6683333516120911</code> |
612
+ | <code>Fantastic Fiction "Scratches in the Sky, Ben Peek, Agog!</code> | <code>奇特的虚构 "天空中的碎片 , 本佩克 , 阿戈 !</code> | <code>0.71833336353302</code> |
613
+ | <code>For Hermann Keller, the running quavers and semiquavers "suffuse the setting with health and strength."</code> | <code>对赫尔曼 · 凯勒来说 , 跑步的跳跃者和半跳跃者 "让环境充满健康和力量" 。</code> | <code>0.7066666483879089</code> |
614
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
615
+ ```json
616
+ {
617
+ "scale": 20.0,
618
+ "similarity_fct": "pairwise_cos_sim"
619
+ }
620
+ ```
621
+
622
+ #### mlqe_et_en
623
+
624
+ * Dataset: [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
625
+ * Size: 1,000 evaluation samples
626
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
627
+ * Approximate statistics based on the first 1000 samples:
628
+ | | sentence1 | sentence2 | score |
629
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
630
+ | type | string | string | float |
631
+ | details | <ul><li>min: 12 tokens</li><li>mean: 32.4 tokens</li><li>max: 58 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 24.87 tokens</li><li>max: 47 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.6</li><li>max: 0.99</li></ul> |
632
+ * Samples:
633
+ | sentence1 | sentence2 | score |
634
+ |:----------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------|:---------------------------------|
635
+ | <code>Jackson pidas seal kõne, öeldes, et James Brown on tema suurim inspiratsioon.</code> | <code>Jackson gave a speech there saying that James Brown is his greatest inspiration.</code> | <code>0.9833333492279053</code> |
636
+ | <code>Kaanelugu rääkis loo kolme ungarlase üleelamistest Ungari revolutsiooni päevil.</code> | <code>The life of the Man spoke of a story of three Hungarians living in the days of the Hungarian Revolution.</code> | <code>0.28999999165534973</code> |
637
+ | <code>Teise maailmasõja ajal oli ta mitme Saksa juhatusele alluvate eesti väeosa ülem.</code> | <code>During World War II, he was the commander of several of the German leadership.</code> | <code>0.4516666829586029</code> |
638
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
639
+ ```json
640
+ {
641
+ "scale": 20.0,
642
+ "similarity_fct": "pairwise_cos_sim"
643
+ }
644
+ ```
645
+
646
+ #### mlqe_ne_en
647
+
648
+ * Dataset: [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
649
+ * Size: 1,000 evaluation samples
650
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
651
+ * Approximate statistics based on the first 1000 samples:
652
+ | | sentence1 | sentence2 | score |
653
+ |:--------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------|
654
+ | type | string | string | float |
655
+ | details | <ul><li>min: 17 tokens</li><li>mean: 41.03 tokens</li><li>max: 85 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 24.77 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.05</li><li>mean: 0.36</li><li>max: 0.92</li></ul> |
656
+ * Samples:
657
+ | sentence1 | sentence2 | score |
658
+ |:------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------|:---------------------------------|
659
+ | <code>१८९२ तिर भवानीदत्त पाण्डेले 'मुद्रा राक्षस'को अनुवाद गरे।</code> | <code>Around 1892, Bhavani Pandit translated the 'money monster'.</code> | <code>0.8416666388511658</code> |
660
+ | <code>यस बच्चाको मुखले आमाको स्तन यस बच्चाको मुखले आमाको स्तन राम्ररी च्यापेको छ ।</code> | <code>The breasts of this child's mouth are taped well with the mother's mouth.</code> | <code>0.2150000035762787</code> |
661
+ | <code>बुवाको बन्दुक चोरेर हिँडेका बराललाई केआई सिंहले अब गोली ल्याउन लगाए ।...</code> | <code>Kei Singh, who stole the boy's closet, took the bullet to bring it now..</code> | <code>0.27000001072883606</code> |
662
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
663
+ ```json
664
+ {
665
+ "scale": 20.0,
666
+ "similarity_fct": "pairwise_cos_sim"
667
+ }
668
+ ```
669
+
670
+ #### mlqe_ro_en
671
+
672
+ * Dataset: [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
673
+ * Size: 1,000 evaluation samples
674
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
675
+ * Approximate statistics based on the first 1000 samples:
676
+ | | sentence1 | sentence2 | score |
677
+ |:--------|:-----------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:----------------------------------------------------------------|
678
+ | type | string | string | float |
679
+ | details | <ul><li>min: 14 tokens</li><li>mean: 30.25 tokens</li><li>max: 59 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 22.7 tokens</li><li>max: 55 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 1.0</li></ul> |
680
+ * Samples:
681
+ | sentence1 | sentence2 | score |
682
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------|
683
+ | <code>Cornwallis se afla înconjurat pe uscat de forțe armate net superioare și retragerea pe mare era îndoielnică din cauza flotei franceze.</code> | <code>Cornwallis was surrounded by shore by higher armed forces and the sea withdrawal was doubtful due to the French fleet.</code> | <code>0.8199999928474426</code> |
684
+ | <code>thumbrightuprightDansatori [[cretani de muzică tradițională.</code> | <code>Number of employees employed in the production of the like product in the Union.</code> | <code>0.009999999776482582</code> |
685
+ | <code>Potrivit documentelor vremii și tradiției orale, aceasta a fost cea mai grea perioadă din istoria orașului.</code> | <code>According to the documents of the oral weather and tradition, this was the hardest period in the city's history.</code> | <code>0.5383332967758179</code> |
686
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
687
+ ```json
688
+ {
689
+ "scale": 20.0,
690
+ "similarity_fct": "pairwise_cos_sim"
691
+ }
692
+ ```
693
+
694
+ #### mlqe_si_en
695
+
696
+ * Dataset: [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
697
+ * Size: 1,000 evaluation samples
698
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
699
+ * Approximate statistics based on the first 1000 samples:
700
+ | | sentence1 | sentence2 | score |
701
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------|
702
+ | type | string | string | float |
703
+ | details | <ul><li>min: 8 tokens</li><li>mean: 18.12 tokens</li><li>max: 36 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 22.18 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.51</li><li>max: 0.99</li></ul> |
704
+ * Samples:
705
+ | sentence1 | sentence2 | score |
706
+ |:----------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------|:--------------------------------|
707
+ | <code>එයට ශි්‍ර ලංකාවේ සාමය ඇති කිරිමටත් නැති කිරිමටත් පුළුවන්.</code> | <code>It can also cause peace in Sri Lanka.</code> | <code>0.3199999928474426</code> |
708
+ | <code>ඔහු මනෝ විද්‍යාව, සමාජ විද්‍යාව, ඉතිහාසය හා සන්නිවේදනය යන විෂය ක්ෂේත්‍රයන් පිලිබදවද අධ්‍යයනයන් සිදු කිරීමට උත්සාහ කරන ලදි.</code> | <code>He attempted to do subjects in psychology, sociology, history and communication.</code> | <code>0.5366666913032532</code> |
709
+ | <code>එහෙත් කිසිදු මිනිසෙක්‌ හෝ ගැහැනියෙක්‌ එලිමහනක නොවූහ.</code> | <code>But no man or woman was eliminated.</code> | <code>0.2783333361148834</code> |
710
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
711
+ ```json
712
+ {
713
+ "scale": 20.0,
714
+ "similarity_fct": "pairwise_cos_sim"
715
+ }
716
+ ```
717
+
718
+ ### Training Hyperparameters
719
+ #### Non-Default Hyperparameters
720
+
721
+ - `eval_strategy`: steps
722
+ - `per_device_train_batch_size`: 64
723
+ - `per_device_eval_batch_size`: 64
724
+ - `num_train_epochs`: 2
725
+ - `warmup_ratio`: 0.1
726
+
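`warmup_ratio: 0.1` with the default `lr_scheduler_type: linear` means the first 10% of optimizer steps ramp the learning rate up to `5e-05`, after which it decays linearly to zero. A sketch of that schedule (the actual implementation lives in `transformers`):

```python
def linear_schedule_with_warmup(step, total_steps, warmup_ratio=0.1, base_lr=5e-5):
    # Linear warmup over the first warmup_ratio of steps,
    # then linear decay to zero over the remainder.
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# With the 33,450 total steps recorded in the training logs,
# warmup covers the first 3,345 steps.
```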
727
+ #### All Hyperparameters
728
+ <details><summary>Click to expand</summary>
729
+
730
+ - `overwrite_output_dir`: False
731
+ - `do_predict`: False
732
+ - `eval_strategy`: steps
733
+ - `prediction_loss_only`: True
734
+ - `per_device_train_batch_size`: 64
735
+ - `per_device_eval_batch_size`: 64
736
+ - `per_gpu_train_batch_size`: None
737
+ - `per_gpu_eval_batch_size`: None
738
+ - `gradient_accumulation_steps`: 1
739
+ - `eval_accumulation_steps`: None
740
+ - `torch_empty_cache_steps`: None
741
+ - `learning_rate`: 5e-05
742
+ - `weight_decay`: 0.0
743
+ - `adam_beta1`: 0.9
744
+ - `adam_beta2`: 0.999
745
+ - `adam_epsilon`: 1e-08
746
+ - `max_grad_norm`: 1.0
747
+ - `num_train_epochs`: 2
748
+ - `max_steps`: -1
749
+ - `lr_scheduler_type`: linear
750
+ - `lr_scheduler_kwargs`: {}
751
+ - `warmup_ratio`: 0.1
752
+ - `warmup_steps`: 0
753
+ - `log_level`: passive
754
+ - `log_level_replica`: warning
755
+ - `log_on_each_node`: True
756
+ - `logging_nan_inf_filter`: True
757
+ - `save_safetensors`: True
758
+ - `save_on_each_node`: False
759
+ - `save_only_model`: False
760
+ - `restore_callback_states_from_checkpoint`: False
761
+ - `no_cuda`: False
762
+ - `use_cpu`: False
763
+ - `use_mps_device`: False
764
+ - `seed`: 42
765
+ - `data_seed`: None
766
+ - `jit_mode_eval`: False
767
+ - `use_ipex`: False
768
+ - `bf16`: False
769
+ - `fp16`: False
770
+ - `fp16_opt_level`: O1
771
+ - `half_precision_backend`: auto
772
+ - `bf16_full_eval`: False
773
+ - `fp16_full_eval`: False
774
+ - `tf32`: None
775
+ - `local_rank`: 0
776
+ - `ddp_backend`: None
777
+ - `tpu_num_cores`: None
778
+ - `tpu_metrics_debug`: False
779
+ - `debug`: []
780
+ - `dataloader_drop_last`: False
781
+ - `dataloader_num_workers`: 0
782
+ - `dataloader_prefetch_factor`: None
783
+ - `past_index`: -1
784
+ - `disable_tqdm`: False
785
+ - `remove_unused_columns`: True
786
+ - `label_names`: None
787
+ - `load_best_model_at_end`: False
788
+ - `ignore_data_skip`: False
789
+ - `fsdp`: []
790
+ - `fsdp_min_num_params`: 0
791
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
792
+ - `fsdp_transformer_layer_cls_to_wrap`: None
793
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
794
+ - `deepspeed`: None
795
+ - `label_smoothing_factor`: 0.0
796
+ - `optim`: adamw_torch
797
+ - `optim_args`: None
798
+ - `adafactor`: False
799
+ - `group_by_length`: False
800
+ - `length_column_name`: length
801
+ - `ddp_find_unused_parameters`: None
802
+ - `ddp_bucket_cap_mb`: None
803
+ - `ddp_broadcast_buffers`: False
804
+ - `dataloader_pin_memory`: True
805
+ - `dataloader_persistent_workers`: False
806
+ - `skip_memory_metrics`: True
807
+ - `use_legacy_prediction_loop`: False
808
+ - `push_to_hub`: False
809
+ - `resume_from_checkpoint`: None
810
+ - `hub_model_id`: None
811
+ - `hub_strategy`: every_save
812
+ - `hub_private_repo`: None
813
+ - `hub_always_push`: False
814
+ - `gradient_checkpointing`: False
815
+ - `gradient_checkpointing_kwargs`: None
816
+ - `include_inputs_for_metrics`: False
817
+ - `include_for_metrics`: []
818
+ - `eval_do_concat_batches`: True
819
+ - `fp16_backend`: auto
820
+ - `push_to_hub_model_id`: None
821
+ - `push_to_hub_organization`: None
822
+ - `mp_parameters`:
823
+ - `auto_find_batch_size`: False
824
+ - `full_determinism`: False
825
+ - `torchdynamo`: None
826
+ - `ray_scope`: last
827
+ - `ddp_timeout`: 1800
828
+ - `torch_compile`: False
829
+ - `torch_compile_backend`: None
830
+ - `torch_compile_mode`: None
831
+ - `dispatch_batches`: None
832
+ - `split_batches`: None
833
+ - `include_tokens_per_second`: False
834
+ - `include_num_input_tokens_seen`: False
835
+ - `neftune_noise_alpha`: None
836
+ - `optim_target_modules`: None
837
+ - `batch_eval_metrics`: False
838
+ - `eval_on_start`: False
839
+ - `use_liger_kernel`: False
840
+ - `eval_use_gather_object`: False
841
+ - `average_tokens_across_devices`: False
842
+ - `prompts`: None
843
+ - `batch_sampler`: batch_sampler
844
+ - `multi_dataset_batch_sampler`: proportional
845
+
846
+ </details>
847
+
848
+ ### Training Logs
849
+ | Epoch | Step | Training Loss | wmt da loss | mlqe en de loss | mlqe en zh loss | mlqe et en loss | mlqe ne en loss | mlqe ro en loss | mlqe si en loss | sts-eval_spearman_cosine | sts-test_spearman_cosine |
850
+ |:-----:|:-----:|:-------------:|:-----------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|:------------------------:|:------------------------:|
851
+ | 0.4 | 6690 | 7.7892 | 7.5592 | 7.5700 | 7.5692 | 7.5217 | 7.5369 | 7.4978 | 7.5494 | 0.2536 | - |
852
+ | 0.8 | 13380 | 7.5513 | 7.5470 | 7.5928 | 7.5812 | 7.5179 | 7.5207 | 7.4936 | 7.5463 | 0.2642 | - |
853
+ | 1.2 | 20070 | 7.5222 | 7.5460 | 7.6197 | 7.5972 | 7.5218 | 7.5496 | 7.5025 | 7.5633 | 0.2449 | - |
854
+ | 1.6 | 26760 | 7.5019 | 7.5361 | 7.6332 | 7.5854 | 7.5226 | 7.5264 | 7.4937 | 7.5654 | 0.2559 | - |
855
+ | 2.0 | 33450 | 7.4944 | 7.5285 | 7.6266 | 7.5859 | 7.5202 | 7.5183 | 7.4898 | 7.5460 | 0.2688 | 0.2676 |
856
+
857
+
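The Step column above is consistent with 16,725 optimizer steps per epoch (33,450 steps over 2 epochs) and an evaluation every 0.4 epoch:

```python
# The Step column advances by a fixed 6,690 steps per logged
# evaluation, i.e. every 0.4 epoch at 16,725 steps per epoch.
total_steps, epochs = 33450, 2
steps_per_epoch = total_steps // epochs
assert steps_per_epoch == 16725
logged = [6690, 13380, 20070, 26760, 33450]
assert [round(s / steps_per_epoch, 1) for s in logged] == [0.4, 0.8, 1.2, 1.6, 2.0]
```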
858
+ ### Framework Versions
859
+ - Python: 3.11.10
860
+ - Sentence Transformers: 3.3.1
861
+ - Transformers: 4.47.1
862
+ - PyTorch: 2.3.1+cu121
863
+ - Accelerate: 1.2.1
864
+ - Datasets: 3.2.0
865
+ - Tokenizers: 0.21.0
866
+
867
+ ## Citation
868
+
869
+ ### BibTeX
870
+
871
+ #### Sentence Transformers
872
+ ```bibtex
873
+ @inproceedings{reimers-2019-sentence-bert,
874
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
875
+ author = "Reimers, Nils and Gurevych, Iryna",
876
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
877
+ month = "11",
878
+ year = "2019",
879
+ publisher = "Association for Computational Linguistics",
880
+ url = "https://arxiv.org/abs/1908.10084",
881
+ }
882
+ ```
883
+
884
+ #### CoSENTLoss
885
+ ```bibtex
886
+ @online{kexuefm-8847,
887
+ title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
888
+ author={Su Jianlin},
889
+ year={2022},
890
+ month={Jan},
891
+ url={https://kexue.fm/archives/8847},
892
+ }
893
+ ```
894
+
895
+ <!--
896
+ ## Glossary
897
+
898
+ *Clearly define terms in order to be accessible across audiences.*
899
+ -->
900
+
901
+ <!--
902
+ ## Model Card Authors
903
+
904
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
905
+ -->
906
+
907
+ <!--
908
+ ## Model Card Contact
909
+
910
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
911
+ -->
config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_name_or_path": "sentence-transformers/distiluse-base-multilingual-cased-v2",
3
+ "activation": "gelu",
4
+ "architectures": [
5
+ "DistilBertModel"
6
+ ],
7
+ "attention_dropout": 0.1,
8
+ "dim": 768,
9
+ "dropout": 0.1,
10
+ "hidden_dim": 3072,
11
+ "initializer_range": 0.02,
12
+ "max_position_embeddings": 512,
13
+ "model_type": "distilbert",
14
+ "n_heads": 12,
15
+ "n_layers": 6,
16
+ "output_hidden_states": true,
17
+ "output_past": true,
18
+ "pad_token_id": 0,
19
+ "qa_dropout": 0.1,
20
+ "seq_classif_dropout": 0.2,
21
+ "sinusoidal_pos_embds": false,
22
+ "tie_weights_": true,
23
+ "torch_dtype": "float32",
24
+ "transformers_version": "4.47.1",
25
+ "vocab_size": 119547
26
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "3.3.1",
4
+ "transformers": "4.47.1",
5
+ "pytorch": "2.3.1+cu121"
6
+ },
7
+ "prompts": {},
8
+ "default_prompt_name": null,
9
+ "similarity_fn_name": "cosine"
10
+ }
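`similarity_fn_name: "cosine"` means embedding pairs are compared with cosine similarity. For reference, a dependency-free version (the library itself uses a vectorized implementation):

```python
import math

def cosine_similarity(u, v):
    # Dot product normalized by both vector norms; result lies in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```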
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3efd2ce2b8068b2cf385b15f8bb4485b7d2e218c592613a820a90e8ff26d22af
3
+ size 538947416
modules.json ADDED
@@ -0,0 +1,20 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_MultiHeadGeneralizedPooling",
12
+ "type": "sentence_pooling.multihead_generalized_pooling.MultiHeadGeneralizedPooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Dense",
18
+ "type": "sentence_transformers.models.Dense"
19
+ }
20
+ ]
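The `MultiHeadGeneralizedPooling` module (idx 1) is custom code shipped with this repository, configured with `num_heads: 8` and `pooling_type: 1`; its exact implementation is not reproduced here. Purely as an illustration of the general idea, attention-based multi-head pooling can be sketched as follows (all names hypothetical, not this repo's API):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(token_embeddings, score_head):
    # One head: score every token, softmax over the sequence,
    # return the attention-weighted sum of token embeddings.
    weights = softmax([score_head(tok) for tok in token_embeddings])
    dim = len(token_embeddings[0])
    return [sum(w * tok[d] for w, tok in zip(weights, token_embeddings))
            for d in range(dim)]

def multihead_pool(token_embeddings, score_heads):
    # num_heads independent heads; their pooled vectors are combined
    # (averaged here) into a single sentence vector.
    pooled = [attention_pool(token_embeddings, head) for head in score_heads]
    dim = len(pooled[0])
    return [sum(p[d] for p in pooled) / len(pooled) for d in range(dim)]
```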
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 128,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,60 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": false,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": false,
48
+ "extra_special_tokens": {},
49
+ "full_tokenizer_file": null,
50
+ "mask_token": "[MASK]",
51
+ "max_len": 512,
52
+ "model_max_length": 128,
53
+ "never_split": null,
54
+ "pad_token": "[PAD]",
55
+ "sep_token": "[SEP]",
56
+ "strip_accents": null,
57
+ "tokenize_chinese_chars": true,
58
+ "tokenizer_class": "DistilBertTokenizer",
59
+ "unk_token": "[UNK]"
60
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff