nreimers committed on
Commit
5f05a34
·
1 Parent(s): 9fe8420
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,1103 @@
1
+ ---
2
+ tags:
3
+ - mteb
4
+ model-index:
5
+ - name: embed-multilingual-light-v3.0
6
+ results:
7
+ - task:
8
+ type: Classification
9
+ dataset:
10
+ type: mteb/amazon_counterfactual
11
+ name: MTEB AmazonCounterfactualClassification (en)
12
+ config: en
13
+ split: test
14
+ revision: e8379541af4e31359cca9fbcf4b00f2671dba205
15
+ metrics:
16
+ - type: accuracy
17
+ value: 70.02985074626865
18
+ - type: ap
19
+ value: 33.228065779544146
20
+ - type: f1
21
+ value: 64.27173953207297
22
+ - task:
23
+ type: Classification
24
+ dataset:
25
+ type: mteb/amazon_polarity
26
+ name: MTEB AmazonPolarityClassification
27
+ config: default
28
+ split: test
29
+ revision: e2d317d38cd51312af73b3d32a06d1a08b442046
30
+ metrics:
31
+ - type: accuracy
32
+ value: 90.701225
33
+ - type: ap
34
+ value: 87.07178174251762
35
+ - type: f1
36
+ value: 90.69168484877625
37
+ - task:
38
+ type: Classification
39
+ dataset:
40
+ type: mteb/amazon_reviews_multi
41
+ name: MTEB AmazonReviewsClassification (en)
42
+ config: en
43
+ split: test
44
+ revision: 1399c76144fd37290681b995c656ef9b2e06e26d
45
+ metrics:
46
+ - type: accuracy
47
+ value: 46.550000000000004
48
+ - type: f1
49
+ value: 44.7233215588199
50
+ - task:
51
+ type: Retrieval
52
+ dataset:
53
+ type: arguana
54
+ name: MTEB ArguAna
55
+ config: default
56
+ split: test
57
+ revision: None
58
+ metrics:
59
+ - type: ndcg_at_10
60
+ value: 53.369
61
+ - task:
62
+ type: Clustering
63
+ dataset:
64
+ type: mteb/arxiv-clustering-p2p
65
+ name: MTEB ArxivClusteringP2P
66
+ config: default
67
+ split: test
68
+ revision: a122ad7f3f0291bf49cc6f4d32aa80929df69d5d
69
+ metrics:
70
+ - type: v_measure
71
+ value: 44.206988765030744
72
+ - task:
73
+ type: Clustering
74
+ dataset:
75
+ type: mteb/arxiv-clustering-s2s
76
+ name: MTEB ArxivClusteringS2S
77
+ config: default
78
+ split: test
79
+ revision: f910caf1a6075f7329cdf8c1a6135696f37dbd53
80
+ metrics:
81
+ - type: v_measure
82
+ value: 33.913737041277
83
+ - task:
84
+ type: Reranking
85
+ dataset:
86
+ type: mteb/askubuntudupquestions-reranking
87
+ name: MTEB AskUbuntuDupQuestions
88
+ config: default
89
+ split: test
90
+ revision: 2000358ca161889fa9c082cb41daa8dcfb161a54
91
+ metrics:
92
+ - type: map
93
+ value: 58.544257541214925
94
+ - type: mrr
95
+ value: 72.07151651057468
96
+ - task:
97
+ type: STS
98
+ dataset:
99
+ type: mteb/biosses-sts
100
+ name: MTEB BIOSSES
101
+ config: default
102
+ split: test
103
+ revision: d3fb88f8f02e40887cd149695127462bbcf29b4a
104
+ metrics:
105
+ - type: cos_sim_pearson
106
+ value: 84.79582115243736
107
+ - type: cos_sim_spearman
108
+ value: 84.01396250789998
109
+ - type: euclidean_pearson
110
+ value: 83.90766476102458
111
+ - type: euclidean_spearman
112
+ value: 84.01396250789998
113
+ - type: manhattan_pearson
114
+ value: 84.75071274784274
115
+ - type: manhattan_spearman
116
+ value: 85.02482891467078
117
+ - task:
118
+ type: Classification
119
+ dataset:
120
+ type: mteb/banking77
121
+ name: MTEB Banking77Classification
122
+ config: default
123
+ split: test
124
+ revision: 0fd18e25b25c072e09e0d92ab615fda904d66300
125
+ metrics:
126
+ - type: accuracy
127
+ value: 78.12337662337663
128
+ - type: f1
129
+ value: 77.48610340227478
130
+ - task:
131
+ type: Clustering
132
+ dataset:
133
+ type: mteb/biorxiv-clustering-p2p
134
+ name: MTEB BiorxivClusteringP2P
135
+ config: default
136
+ split: test
137
+ revision: 65b79d1d13f80053f67aca9498d9402c2d9f1f40
138
+ metrics:
139
+ - type: v_measure
140
+ value: 38.68268504601174
141
+ - task:
142
+ type: Clustering
143
+ dataset:
144
+ type: mteb/biorxiv-clustering-s2s
145
+ name: MTEB BiorxivClusteringS2S
146
+ config: default
147
+ split: test
148
+ revision: 258694dd0231531bc1fd9de6ceb52a0853c6d908
149
+ metrics:
150
+ - type: v_measure
151
+ value: 32.20870648143671
152
+ - task:
153
+ type: Retrieval
154
+ dataset:
155
+ type: BeIR/cqadupstack
156
+ name: MTEB CQADupstackAndroidRetrieval
157
+ config: default
158
+ split: test
159
+ revision: None
160
+ metrics:
161
+ - type: ndcg_at_10
162
+ value: 46.259
163
+ - task:
164
+ type: Retrieval
165
+ dataset:
166
+ type: BeIR/cqadupstack
167
+ name: MTEB CQADupstackEnglishRetrieval
168
+ config: default
169
+ split: test
170
+ revision: None
171
+ metrics:
172
+ - type: ndcg_at_10
173
+ value: 44.555
174
+ - task:
175
+ type: Retrieval
176
+ dataset:
177
+ type: BeIR/cqadupstack
178
+ name: MTEB CQADupstackGamingRetrieval
179
+ config: default
180
+ split: test
181
+ revision: None
182
+ metrics:
183
+ - type: ndcg_at_10
184
+ value: 56.564
185
+ - task:
186
+ type: Retrieval
187
+ dataset:
188
+ type: BeIR/cqadupstack
189
+ name: MTEB CQADupstackGisRetrieval
190
+ config: default
191
+ split: test
192
+ revision: None
193
+ metrics:
194
+ - type: ndcg_at_10
195
+ value: 36.162
196
+ - task:
197
+ type: Retrieval
198
+ dataset:
199
+ type: BeIR/cqadupstack
200
+ name: MTEB CQADupstackMathematicaRetrieval
201
+ config: default
202
+ split: test
203
+ revision: None
204
+ metrics:
205
+ - type: ndcg_at_10
206
+ value: 26.185000000000002
207
+ - task:
208
+ type: Retrieval
209
+ dataset:
210
+ type: BeIR/cqadupstack
211
+ name: MTEB CQADupstackPhysicsRetrieval
212
+ config: default
213
+ split: test
214
+ revision: None
215
+ metrics:
216
+ - type: ndcg_at_10
217
+ value: 41.547
218
+ - task:
219
+ type: Retrieval
220
+ dataset:
221
+ type: BeIR/cqadupstack
222
+ name: MTEB CQADupstackProgrammersRetrieval
223
+ config: default
224
+ split: test
225
+ revision: None
226
+ metrics:
227
+ - type: ndcg_at_10
228
+ value: 39.042
229
+ - task:
230
+ type: Retrieval
231
+ dataset:
232
+ type: BeIR/cqadupstack
233
+ name: MTEB CQADupstackRetrieval
234
+ config: default
235
+ split: test
236
+ revision: None
237
+ metrics:
238
+ - type: ndcg_at_10
239
+ value: 38.086999999999996
240
+ - task:
241
+ type: Retrieval
242
+ dataset:
243
+ type: BeIR/cqadupstack
244
+ name: MTEB CQADupstackStatsRetrieval
245
+ config: default
246
+ split: test
247
+ revision: None
248
+ metrics:
249
+ - type: ndcg_at_10
250
+ value: 32.088
251
+ - task:
252
+ type: Retrieval
253
+ dataset:
254
+ type: BeIR/cqadupstack
255
+ name: MTEB CQADupstackTexRetrieval
256
+ config: default
257
+ split: test
258
+ revision: None
259
+ metrics:
260
+ - type: ndcg_at_10
261
+ value: 27.006999999999998
262
+ - task:
263
+ type: Retrieval
264
+ dataset:
265
+ type: BeIR/cqadupstack
266
+ name: MTEB CQADupstackUnixRetrieval
267
+ config: default
268
+ split: test
269
+ revision: None
270
+ metrics:
271
+ - type: ndcg_at_10
272
+ value: 37.336999999999996
273
+ - task:
274
+ type: Retrieval
275
+ dataset:
276
+ type: BeIR/cqadupstack
277
+ name: MTEB CQADupstackWebmastersRetrieval
278
+ config: default
279
+ split: test
280
+ revision: None
281
+ metrics:
282
+ - type: ndcg_at_10
283
+ value: 38.011
284
+ - task:
285
+ type: Retrieval
286
+ dataset:
287
+ type: BeIR/cqadupstack
288
+ name: MTEB CQADupstackWordpressRetrieval
289
+ config: default
290
+ split: test
291
+ revision: None
292
+ metrics:
293
+ - type: ndcg_at_10
294
+ value: 32.287
295
+ - task:
296
+ type: Retrieval
297
+ dataset:
298
+ type: climate-fever
299
+ name: MTEB ClimateFEVER
300
+ config: default
301
+ split: test
302
+ revision: None
303
+ metrics:
304
+ - type: ndcg_at_10
305
+ value: 24.804000000000002
306
+ - task:
307
+ type: Retrieval
308
+ dataset:
309
+ type: dbpedia-entity
310
+ name: MTEB DBPedia
311
+ config: default
312
+ split: test
313
+ revision: None
314
+ metrics:
315
+ - type: ndcg_at_10
316
+ value: 38.055
317
+ - task:
318
+ type: Classification
319
+ dataset:
320
+ type: mteb/emotion
321
+ name: MTEB EmotionClassification
322
+ config: default
323
+ split: test
324
+ revision: 4f58c6b202a23cf9a4da393831edf4f9183cad37
325
+ metrics:
326
+ - type: accuracy
327
+ value: 46.665
328
+ - type: f1
329
+ value: 40.77568559660878
330
+ - task:
331
+ type: Retrieval
332
+ dataset:
333
+ type: fever
334
+ name: MTEB FEVER
335
+ config: default
336
+ split: test
337
+ revision: None
338
+ metrics:
339
+ - type: ndcg_at_10
340
+ value: 85.52499999999999
341
+ - task:
342
+ type: Retrieval
343
+ dataset:
344
+ type: fiqa
345
+ name: MTEB FiQA2018
346
+ config: default
347
+ split: test
348
+ revision: None
349
+ metrics:
350
+ - type: ndcg_at_10
351
+ value: 36.161
352
+ - task:
353
+ type: Retrieval
354
+ dataset:
355
+ type: hotpotqa
356
+ name: MTEB HotpotQA
357
+ config: default
358
+ split: test
359
+ revision: None
360
+ metrics:
361
+ - type: ndcg_at_10
362
+ value: 66.878
363
+ - task:
364
+ type: Classification
365
+ dataset:
366
+ type: mteb/imdb
367
+ name: MTEB ImdbClassification
368
+ config: default
369
+ split: test
370
+ revision: 3d86128a09e091d6018b6d26cad27f2739fc2db7
371
+ metrics:
372
+ - type: accuracy
373
+ value: 85.6372
374
+ - type: ap
375
+ value: 80.54846874011302
376
+ - type: f1
377
+ value: 85.61438421821343
378
+ - task:
379
+ type: Retrieval
380
+ dataset:
381
+ type: msmarco
382
+ name: MTEB MSMARCO
383
+ config: default
384
+ split: test
385
+ revision: None
386
+ metrics:
387
+ - type: ndcg_at_10
388
+ value: 40.487
389
+ - task:
390
+ type: Classification
391
+ dataset:
392
+ type: mteb/mtop_domain
393
+ name: MTEB MTOPDomainClassification (en)
394
+ config: en
395
+ split: test
396
+ revision: d80d48c1eb48d3562165c59d59d0034df9fff0bf
397
+ metrics:
398
+ - type: accuracy
399
+ value: 91.8559051527588
400
+ - type: f1
401
+ value: 91.6271749996447
402
+ - task:
403
+ type: Classification
404
+ dataset:
405
+ type: mteb/mtop_intent
406
+ name: MTEB MTOPIntentClassification (en)
407
+ config: en
408
+ split: test
409
+ revision: ae001d0e6b1228650b7bd1c2c65fb50ad11a8aba
410
+ metrics:
411
+ - type: accuracy
412
+ value: 62.17738258093936
413
+ - type: f1
414
+ value: 45.80307070449218
415
+ - task:
416
+ type: Classification
417
+ dataset:
418
+ type: mteb/amazon_massive_intent
419
+ name: MTEB MassiveIntentClassification (en)
420
+ config: en
421
+ split: test
422
+ revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
423
+ metrics:
424
+ - type: accuracy
425
+ value: 67.42434431741762
426
+ - type: f1
427
+ value: 65.39580264698957
428
+ - task:
429
+ type: Classification
430
+ dataset:
431
+ type: mteb/amazon_massive_scenario
432
+ name: MTEB MassiveScenarioClassification (en)
433
+ config: en
434
+ split: test
435
+ revision: 7d571f92784cd94a019292a1f45445077d0ef634
436
+ metrics:
437
+ - type: accuracy
438
+ value: 72.60928043039677
439
+ - type: f1
440
+ value: 72.30912915707411
441
+ - task:
442
+ type: Clustering
443
+ dataset:
444
+ type: mteb/medrxiv-clustering-p2p
445
+ name: MTEB MedrxivClusteringP2P
446
+ config: default
447
+ split: test
448
+ revision: e7a26af6f3ae46b30dde8737f02c07b1505bcc73
449
+ metrics:
450
+ - type: v_measure
451
+ value: 35.17967476592229
452
+ - task:
453
+ type: Clustering
454
+ dataset:
455
+ type: mteb/medrxiv-clustering-s2s
456
+ name: MTEB MedrxivClusteringS2S
457
+ config: default
458
+ split: test
459
+ revision: 35191c8c0dca72d8ff3efcd72aa802307d469663
460
+ metrics:
461
+ - type: v_measure
462
+ value: 30.993641089208683
463
+ - task:
464
+ type: Reranking
465
+ dataset:
466
+ type: mteb/mind_small
467
+ name: MTEB MindSmallReranking
468
+ config: default
469
+ split: test
470
+ revision: 3bdac13927fdc888b903db93b2ffdbd90b295a69
471
+ metrics:
472
+ - type: map
473
+ value: 31.362481813275295
474
+ - type: mrr
475
+ value: 32.43717742343303
476
+ - task:
477
+ type: Retrieval
478
+ dataset:
479
+ type: nfcorpus
480
+ name: MTEB NFCorpus
481
+ config: default
482
+ split: test
483
+ revision: None
484
+ metrics:
485
+ - type: ndcg_at_10
486
+ value: 32.123000000000005
487
+ - task:
488
+ type: Retrieval
489
+ dataset:
490
+ type: nq
491
+ name: MTEB NQ
492
+ config: default
493
+ split: test
494
+ revision: None
495
+ metrics:
496
+ - type: ndcg_at_10
497
+ value: 55.51199999999999
498
+ - task:
499
+ type: Retrieval
500
+ dataset:
501
+ type: quora
502
+ name: MTEB QuoraRetrieval
503
+ config: default
504
+ split: test
505
+ revision: None
506
+ metrics:
507
+ - type: ndcg_at_10
508
+ value: 87.847
509
+ - task:
510
+ type: Clustering
511
+ dataset:
512
+ type: mteb/reddit-clustering
513
+ name: MTEB RedditClustering
514
+ config: default
515
+ split: test
516
+ revision: 24640382cdbf8abc73003fb0fa6d111a705499eb
517
+ metrics:
518
+ - type: v_measure
519
+ value: 49.4973643968247
520
+ - task:
521
+ type: Clustering
522
+ dataset:
523
+ type: mteb/reddit-clustering-p2p
524
+ name: MTEB RedditClusteringP2P
525
+ config: default
526
+ split: test
527
+ revision: 282350215ef01743dc01b456c7f5241fa8937f16
528
+ metrics:
529
+ - type: v_measure
530
+ value: 60.2135284243427
531
+ - task:
532
+ type: Retrieval
533
+ dataset:
534
+ type: scidocs
535
+ name: MTEB SCIDOCS
536
+ config: default
537
+ split: test
538
+ revision: None
539
+ metrics:
540
+ - type: ndcg_at_10
541
+ value: 17.1
542
+ - task:
543
+ type: STS
544
+ dataset:
545
+ type: mteb/sickr-sts
546
+ name: MTEB SICK-R
547
+ config: default
548
+ split: test
549
+ revision: a6ea5a8cab320b040a23452cc28066d9beae2cee
550
+ metrics:
551
+ - type: cos_sim_pearson
552
+ value: 83.7330191296952
553
+ - type: cos_sim_spearman
554
+ value: 77.03523134004043
555
+ - type: euclidean_pearson
556
+ value: 80.86067787185137
557
+ - type: euclidean_spearman
558
+ value: 77.03522959536473
559
+ - type: manhattan_pearson
560
+ value: 80.76089708603587
561
+ - type: manhattan_spearman
562
+ value: 76.86245377437302
563
+ - task:
564
+ type: STS
565
+ dataset:
566
+ type: mteb/sts12-sts
567
+ name: MTEB STS12
568
+ config: default
569
+ split: test
570
+ revision: a0d554a64d88156834ff5ae9920b964011b16384
571
+ metrics:
572
+ - type: cos_sim_pearson
573
+ value: 80.46387812633851
574
+ - type: cos_sim_spearman
575
+ value: 73.21878234127571
576
+ - type: euclidean_pearson
577
+ value: 76.82160699895033
578
+ - type: euclidean_spearman
579
+ value: 73.21878234127571
580
+ - type: manhattan_pearson
581
+ value: 76.75657006349886
582
+ - type: manhattan_spearman
583
+ value: 73.19160258034827
584
+ - task:
585
+ type: STS
586
+ dataset:
587
+ type: mteb/sts13-sts
588
+ name: MTEB STS13
589
+ config: default
590
+ split: test
591
+ revision: 7e90230a92c190f1bf69ae9002b8cea547a64cca
592
+ metrics:
593
+ - type: cos_sim_pearson
594
+ value: 79.06411399119807
595
+ - type: cos_sim_spearman
596
+ value: 79.49916779764082
597
+ - type: euclidean_pearson
598
+ value: 79.3356521660954
599
+ - type: euclidean_spearman
600
+ value: 79.49916779764082
601
+ - type: manhattan_pearson
602
+ value: 79.04971532119936
603
+ - type: manhattan_spearman
604
+ value: 79.16859911220654
605
+ - task:
606
+ type: STS
607
+ dataset:
608
+ type: mteb/sts14-sts
609
+ name: MTEB STS14
610
+ config: default
611
+ split: test
612
+ revision: 6031580fec1f6af667f0bd2da0a551cf4f0b2375
613
+ metrics:
614
+ - type: cos_sim_pearson
615
+ value: 80.6940934994372
616
+ - type: cos_sim_spearman
617
+ value: 76.9552055757283
618
+ - type: euclidean_pearson
619
+ value: 79.52818133592284
620
+ - type: euclidean_spearman
621
+ value: 76.9552055757283
622
+ - type: manhattan_pearson
623
+ value: 79.35220459438406
624
+ - type: manhattan_spearman
625
+ value: 76.85314462036561
626
+ - task:
627
+ type: STS
628
+ dataset:
629
+ type: mteb/sts15-sts
630
+ name: MTEB STS15
631
+ config: default
632
+ split: test
633
+ revision: ae752c7c21bf194d8b67fd573edf7ae58183cbe3
634
+ metrics:
635
+ - type: cos_sim_pearson
636
+ value: 85.58608774451231
637
+ - type: cos_sim_spearman
638
+ value: 86.42805701554927
639
+ - type: euclidean_pearson
640
+ value: 86.01117122595934
641
+ - type: euclidean_spearman
642
+ value: 86.42805701554927
643
+ - type: manhattan_pearson
644
+ value: 86.01345208923057
645
+ - type: manhattan_spearman
646
+ value: 86.43179450307953
647
+ - task:
648
+ type: STS
649
+ dataset:
650
+ type: mteb/sts16-sts
651
+ name: MTEB STS16
652
+ config: default
653
+ split: test
654
+ revision: 4d8694f8f0e0100860b497b999b3dbed754a0513
655
+ metrics:
656
+ - type: cos_sim_pearson
657
+ value: 83.18733039014667
658
+ - type: cos_sim_spearman
659
+ value: 84.3339529564109
660
+ - type: euclidean_pearson
661
+ value: 83.54530885349595
662
+ - type: euclidean_spearman
663
+ value: 84.3339529564109
664
+ - type: manhattan_pearson
665
+ value: 83.47015931913937
666
+ - type: manhattan_spearman
667
+ value: 84.22564786654777
668
+ - task:
669
+ type: STS
670
+ dataset:
671
+ type: mteb/sts17-crosslingual-sts
672
+ name: MTEB STS17 (en-en)
673
+ config: en-en
674
+ split: test
675
+ revision: af5e6fb845001ecf41f4c1e033ce921939a2a68d
676
+ metrics:
677
+ - type: cos_sim_pearson
678
+ value: 87.88402211340522
679
+ - type: cos_sim_spearman
680
+ value: 88.6693290310468
681
+ - type: euclidean_pearson
682
+ value: 88.24947476618257
683
+ - type: euclidean_spearman
684
+ value: 88.6693290310468
685
+ - type: manhattan_pearson
686
+ value: 88.24496656367964
687
+ - type: manhattan_spearman
688
+ value: 88.52029848819545
689
+ - task:
690
+ type: STS
691
+ dataset:
692
+ type: mteb/sts22-crosslingual-sts
693
+ name: MTEB STS22 (en)
694
+ config: en
695
+ split: test
696
+ revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
697
+ metrics:
698
+ - type: cos_sim_pearson
699
+ value: 64.96467575926597
700
+ - type: cos_sim_spearman
701
+ value: 65.30666900046252
702
+ - type: euclidean_pearson
703
+ value: 66.58031971340725
704
+ - type: euclidean_spearman
705
+ value: 65.30666900046252
706
+ - type: manhattan_pearson
707
+ value: 66.56530433327998
708
+ - type: manhattan_spearman
709
+ value: 65.42121899024113
710
+ - task:
711
+ type: STS
712
+ dataset:
713
+ type: mteb/stsbenchmark-sts
714
+ name: MTEB STSBenchmark
715
+ config: default
716
+ split: test
717
+ revision: b0fddb56ed78048fa8b90373c8a3cfc37b684831
718
+ metrics:
719
+ - type: cos_sim_pearson
720
+ value: 85.31047656296519
721
+ - type: cos_sim_spearman
722
+ value: 85.46101092708824
723
+ - type: euclidean_pearson
724
+ value: 85.75896623084044
725
+ - type: euclidean_spearman
726
+ value: 85.46101092708824
727
+ - type: manhattan_pearson
728
+ value: 85.57323880630182
729
+ - type: manhattan_spearman
730
+ value: 85.23375523080594
731
+ - task:
732
+ type: Reranking
733
+ dataset:
734
+ type: mteb/scidocs-reranking
735
+ name: MTEB SciDocsRR
736
+ config: default
737
+ split: test
738
+ revision: d3c5e1fc0b855ab6097bf1cda04dd73947d7caab
739
+ metrics:
740
+ - type: map
741
+ value: 79.89731978284804
742
+ - type: mrr
743
+ value: 94.28980424078465
744
+ - task:
745
+ type: Retrieval
746
+ dataset:
747
+ type: scifact
748
+ name: MTEB SciFact
749
+ config: default
750
+ split: test
751
+ revision: None
752
+ metrics:
753
+ - type: ndcg_at_10
754
+ value: 67.95
755
+ - task:
756
+ type: PairClassification
757
+ dataset:
758
+ type: mteb/sprintduplicatequestions-pairclassification
759
+ name: MTEB SprintDuplicateQuestions
760
+ config: default
761
+ split: test
762
+ revision: d66bd1f72af766a5cc4b0ca5e00c162f89e8cc46
763
+ metrics:
764
+ - type: cos_sim_accuracy
765
+ value: 99.85643564356435
766
+ - type: cos_sim_ap
767
+ value: 96.59618618212247
768
+ - type: cos_sim_f1
769
+ value: 92.6221335992024
770
+ - type: cos_sim_precision
771
+ value: 92.34592445328032
772
+ - type: cos_sim_recall
773
+ value: 92.9
774
+ - type: dot_accuracy
775
+ value: 99.85643564356435
776
+ - type: dot_ap
777
+ value: 96.5961861821225
778
+ - type: dot_f1
779
+ value: 92.6221335992024
780
+ - type: dot_precision
781
+ value: 92.34592445328032
782
+ - type: dot_recall
783
+ value: 92.9
784
+ - type: euclidean_accuracy
785
+ value: 99.85643564356435
786
+ - type: euclidean_ap
787
+ value: 96.5961861821225
788
+ - type: euclidean_f1
789
+ value: 92.6221335992024
790
+ - type: euclidean_precision
791
+ value: 92.34592445328032
792
+ - type: euclidean_recall
793
+ value: 92.9
794
+ - type: manhattan_accuracy
795
+ value: 99.85841584158416
796
+ - type: manhattan_ap
797
+ value: 96.5578240948512
798
+ - type: manhattan_f1
799
+ value: 92.71523178807946
800
+ - type: manhattan_precision
801
+ value: 94.4963655244029
802
+ - type: manhattan_recall
803
+ value: 91.0
804
+ - type: max_accuracy
805
+ value: 99.85841584158416
806
+ - type: max_ap
807
+ value: 96.5961861821225
808
+ - type: max_f1
809
+ value: 92.71523178807946
810
+ - task:
811
+ type: Clustering
812
+ dataset:
813
+ type: mteb/stackexchange-clustering
814
+ name: MTEB StackExchangeClustering
815
+ config: default
816
+ split: test
817
+ revision: 6cbc1f7b2bc0622f2e39d2c77fa502909748c259
818
+ metrics:
819
+ - type: v_measure
820
+ value: 60.84750068050385
821
+ - task:
822
+ type: Clustering
823
+ dataset:
824
+ type: mteb/stackexchange-clustering-p2p
825
+ name: MTEB StackExchangeClusteringP2P
826
+ config: default
827
+ split: test
828
+ revision: 815ca46b2622cec33ccafc3735d572c266efdb44
829
+ metrics:
830
+ - type: v_measure
831
+ value: 33.96844721192451
832
+ - task:
833
+ type: Reranking
834
+ dataset:
835
+ type: mteb/stackoverflowdupquestions-reranking
836
+ name: MTEB StackOverflowDupQuestions
837
+ config: default
838
+ split: test
839
+ revision: e185fbe320c72810689fc5848eb6114e1ef5ec69
840
+ metrics:
841
+ - type: map
842
+ value: 50.454280909595205
843
+ - type: mrr
844
+ value: 51.24249320940497
845
+ - task:
846
+ type: Summarization
847
+ dataset:
848
+ type: mteb/summeval
849
+ name: MTEB SummEval
850
+ config: default
851
+ split: test
852
+ revision: cda12ad7615edc362dbf25a00fdd61d3b1eaf93c
853
+ metrics:
854
+ - type: cos_sim_pearson
855
+ value: 29.998438678552517
856
+ - type: cos_sim_spearman
857
+ value: 30.409482543506876
858
+ - type: dot_pearson
859
+ value: 29.998443850173224
860
+ - type: dot_spearman
861
+ value: 30.409482543506876
862
+ - task:
863
+ type: Retrieval
864
+ dataset:
865
+ type: trec-covid
866
+ name: MTEB TRECCOVID
867
+ config: default
868
+ split: test
869
+ revision: None
870
+ metrics:
871
+ - type: ndcg_at_10
872
+ value: 78.93
873
+ - task:
874
+ type: Retrieval
875
+ dataset:
876
+ type: webis-touche2020
877
+ name: MTEB Touche2020
878
+ config: default
879
+ split: test
880
+ revision: None
881
+ metrics:
882
+ - type: ndcg_at_10
883
+ value: 29.482999999999997
884
+ - task:
885
+ type: Classification
886
+ dataset:
887
+ type: mteb/toxic_conversations_50k
888
+ name: MTEB ToxicConversationsClassification
889
+ config: default
890
+ split: test
891
+ revision: d7c0de2777da35d6aae2200a62c6e0e5af397c4c
892
+ metrics:
893
+ - type: accuracy
894
+ value: 70.65859999999999
895
+ - type: ap
896
+ value: 15.03693738050973
897
+ - type: f1
898
+ value: 54.94379403846167
899
+ - task:
900
+ type: Classification
901
+ dataset:
902
+ type: mteb/tweet_sentiment_extraction
903
+ name: MTEB TweetSentimentExtractionClassification
904
+ config: default
905
+ split: test
906
+ revision: d604517c81ca91fe16a244d1248fc021f9ecee7a
907
+ metrics:
908
+ - type: accuracy
909
+ value: 64.4567062818336
910
+ - type: f1
911
+ value: 64.48980729427107
912
+ - task:
913
+ type: Clustering
914
+ dataset:
915
+ type: mteb/twentynewsgroups-clustering
916
+ name: MTEB TwentyNewsgroupsClustering
917
+ config: default
918
+ split: test
919
+ revision: 6125ec4e24fa026cec8a478383ee943acfbd5449
920
+ metrics:
921
+ - type: v_measure
922
+ value: 42.08554991843959
923
+ - task:
924
+ type: PairClassification
925
+ dataset:
926
+ type: mteb/twittersemeval2015-pairclassification
927
+ name: MTEB TwitterSemEval2015
928
+ config: default
929
+ split: test
930
+ revision: 70970daeab8776df92f5ea462b6173c0b46fd2d1
931
+ metrics:
932
+ - type: cos_sim_accuracy
933
+ value: 84.75293556654945
934
+ - type: cos_sim_ap
935
+ value: 69.40551043272129
936
+ - type: cos_sim_f1
937
+ value: 65.56335231034026
938
+ - type: cos_sim_precision
939
+ value: 65.79856497475419
940
+ - type: cos_sim_recall
941
+ value: 65.32981530343008
942
+ - type: dot_accuracy
943
+ value: 84.75293556654945
944
+ - type: dot_ap
945
+ value: 69.40550704470631
946
+ - type: dot_f1
947
+ value: 65.56335231034026
948
+ - type: dot_precision
949
+ value: 65.79856497475419
950
+ - type: dot_recall
951
+ value: 65.32981530343008
952
+ - type: euclidean_accuracy
953
+ value: 84.75293556654945
954
+ - type: euclidean_ap
955
+ value: 69.4055136381454
956
+ - type: euclidean_f1
957
+ value: 65.56335231034026
958
+ - type: euclidean_precision
959
+ value: 65.79856497475419
960
+ - type: euclidean_recall
961
+ value: 65.32981530343008
962
+ - type: manhattan_accuracy
963
+ value: 84.6337247422066
964
+ - type: manhattan_ap
965
+ value: 69.13628354134198
966
+ - type: manhattan_f1
967
+ value: 65.46998180715585
968
+ - type: manhattan_precision
969
+ value: 60.58361391694726
970
+ - type: manhattan_recall
971
+ value: 71.21372031662268
972
+ - type: max_accuracy
973
+ value: 84.75293556654945
974
+ - type: max_ap
975
+ value: 69.4055136381454
976
+ - type: max_f1
977
+ value: 65.56335231034026
978
+ - task:
979
+ type: PairClassification
980
+ dataset:
981
+ type: mteb/twitterurlcorpus-pairclassification
982
+ name: MTEB TwitterURLCorpus
983
+ config: default
984
+ split: test
985
+ revision: 8b6510b0b1fa4e4c4f879467980e9be563ec1cdf
986
+ metrics:
987
+ - type: cos_sim_accuracy
988
+ value: 89.04800714091667
989
+ - type: cos_sim_ap
990
+ value: 85.84596325009252
991
+ - type: cos_sim_f1
992
+ value: 78.39228527221042
993
+ - type: cos_sim_precision
994
+ value: 73.58643518205768
995
+ - type: cos_sim_recall
996
+ value: 83.86972590083154
997
+ - type: dot_accuracy
998
+ value: 89.04800714091667
999
+ - type: dot_ap
1000
+ value: 85.8459646697087
1001
+ - type: dot_f1
1002
+ value: 78.39228527221042
1003
+ - type: dot_precision
1004
+ value: 73.58643518205768
1005
+ - type: dot_recall
1006
+ value: 83.86972590083154
1007
+ - type: euclidean_accuracy
1008
+ value: 89.04800714091667
1009
+ - type: euclidean_ap
1010
+ value: 85.84596376376919
1011
+ - type: euclidean_f1
1012
+ value: 78.39228527221042
1013
+ - type: euclidean_precision
1014
+ value: 73.58643518205768
1015
+ - type: euclidean_recall
1016
+ value: 83.86972590083154
1017
+ - type: manhattan_accuracy
1018
+ value: 89.0266620095471
1019
+ - type: manhattan_ap
1020
+ value: 85.80124417850608
1021
+ - type: manhattan_f1
1022
+ value: 78.37817859254879
1023
+ - type: manhattan_precision
1024
+ value: 75.36963321012226
1025
+ - type: manhattan_recall
1026
+ value: 81.63689559593472
1027
+ - type: max_accuracy
1028
+ value: 89.04800714091667
1029
+ - type: max_ap
1030
+ value: 85.8459646697087
1031
+ - type: max_f1
1032
+ value: 78.39228527221042
1033
+ ---
1034
+
1035
+
1036
+ # Cohere embed-multilingual-light-v3.0
1037
+
1038
+ This repository contains the tokenizer for the Cohere `embed-multilingual-light-v3.0` model.
1039
+
1040
+ You can use the embedding model via the Cohere API, via AWS SageMaker, or in your own private deployment.
1041
+
1042
+ ## Usage Cohere API
1043
+
1044
+ The following code snippet shows how to use the Cohere API. Install the Cohere SDK via:
1045
+ ```
1046
+ pip install -U cohere
1047
+ ```
1048
+
1049
+ Get your free API key at www.cohere.com
1050
+
1051
+
1052
+ ```python
1053
+ # This snippet shows an example of how to use the Cohere Embed V3 models for semantic search.
1054
+ # Make sure you have the Cohere SDK installed in at least v4.30: pip install -U cohere
1055
+ # Get your API key from: www.cohere.com
1056
+ import cohere
1057
+ import numpy as np
1058
+
1059
+ cohere_key = "{YOUR_COHERE_API_KEY}" #Get your API key from www.cohere.com
1060
+ co = cohere.Client(cohere_key)
1061
+
1062
+ docs = ["The capital of France is Paris",
1063
+         "PyTorch is a machine learning framework based on the Torch library.",
1064
+         "The average cat lifespan is between 13-17 years"]
1065
+
1066
+
1067
+ #Encode your documents with input type 'search_document'
1068
+ doc_emb = co.embed(docs, input_type="search_document", model="embed-multilingual-light-v3.0").embeddings
1069
+ doc_emb = np.asarray(doc_emb)
1070
+
1071
+
1072
+ #Encode your query with input type 'search_query'
1073
+ query = "What is Pytorch"
1074
+ query_emb = co.embed([query], input_type="search_query", model="embed-multilingual-light-v3.0").embeddings
1075
+ query_emb = np.asarray(query_emb)
1076
+ query_emb.shape
1077
+
1078
+ #Compute the dot product between query embedding and document embedding
1079
+ scores = np.dot(query_emb, doc_emb.T)[0]
1080
+
1081
+ #Find the highest scores
1082
+ max_idx = np.argsort(-scores)
1083
+
1084
+ print(f"Query: {query}")
1085
+ for idx in max_idx:
1086
+     print(f"Score: {scores[idx]:.2f}")
1087
+     print(docs[idx])
1088
+     print("--------")
1089
+ ```
1090
+
1091
+ ## Usage AWS SageMaker
1092
+ The embedding model can be privately deployed in your AWS Cloud using our [AWS SageMaker marketplace offering](https://aws.amazon.com/marketplace/pp/prodview-z6huxszcqc25i). It runs privately in your VPC, with latencies as low as 5ms for query encoding.
1093
+
1094
+ ## Usage AWS Bedrock
1095
+ The model will soon also be available via AWS Bedrock. Stay tuned.
1096
+
1097
+ ## Private Deployment
1098
+ Want to run the model on your own hardware? [Contact Sales](https://cohere.com/contact-sales) to learn more.
1099
+
1100
+ ## Supported Languages
1101
+ This model was trained on nearly 1B English training pairs and nearly 0.5B non-English training pairs from 100+ languages.
1102
+
1103
+ Evaluation results can be found in the [Embed V3.0 Benchmark Results spreadsheet](https://docs.google.com/spreadsheets/d/1w7gnHWMDBdEUrmHgSfDnGHJgVQE5aOiXCCwO3uNH_mI/edit?usp=sharing).
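Reviewer note: the ranking step in the README snippet (dot product between the query embedding and each document embedding, then sort by descending score) can be tried without an API key. The sketch below uses only the standard library; the three-dimensional vectors are hypothetical stand-ins invented for illustration, not real model output, and the names `dot` and `ranking` are not part of the README.

```python
# Hypothetical 3-dimensional vectors standing in for real embeddings.
# Real embeddings returned by the API are much longer; these values are made up.
docs = ["The capital of France is Paris",
        "PyTorch is a machine learning framework based on the Torch library.",
        "The average cat lifespan is between 13-17 years"]
doc_emb = [[0.9, 0.1, 0.0],
           [0.1, 0.9, 0.1],
           [0.0, 0.2, 0.9]]
query_emb = [0.2, 0.8, 0.1]  # stand-in for the embedding of "What is Pytorch"


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# One score per document, then document indices sorted by descending score.
scores = [dot(query_emb, d) for d in doc_emb]
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])

for idx in ranking:
    print(f"Score: {scores[idx]:.2f}  {docs[idx]}")
```

With these made-up vectors the PyTorch sentence ranks first, which mirrors what the README snippet is meant to demonstrate with real embeddings.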
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
3
+ size 5069051
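Reviewer note: the three added lines above are a Git LFS pointer, not the SentencePiece model itself. As a minimal sketch of how such a pointer can be read, the hypothetical helper `parse_lfs_pointer` below splits each line into a key and a value; it is an illustration only, not part of Git LFS or any library.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer content copied from the sentencepiece.bpe.model diff above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
size 5069051
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size"])
```

The same helper applies to the `tokenizer.json` pointer, which this commit routes through LFS via the new `.gitattributes` rule.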
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
1
+ {
2
+ "bos_token": "<s>",
3
+ "cls_token": "<s>",
4
+ "eos_token": "</s>",
5
+ "mask_token": {
6
+ "content": "<mask>",
7
+ "lstrip": true,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false
11
+ },
12
+ "pad_token": "<pad>",
13
+ "sep_token": "</s>",
14
+ "unk_token": "<unk>"
15
+ }
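Reviewer note: in this file `mask_token` is an object with a `content` field, while every other entry is a plain string. A consumer that reads the JSON directly therefore has to handle both shapes. The sketch below shows one way to do that with the standard library; `token_strings` is a hypothetical helper, and the JSON is inlined from the diff above so the example runs without the repository.

```python
import json

# Content of special_tokens_map.json as added in this commit.
special_tokens_json = """
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
"""


def token_strings(mapping):
    """Return token name -> token text, unwrapping object-style entries."""
    return {
        name: (value["content"] if isinstance(value, dict) else value)
        for name, value in mapping.items()
    }


tokens = token_strings(json.loads(special_tokens_json))
for name, text in tokens.items():
    print(f"{name}: {text}")
```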
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:62c24cdc13d4c9952d63718d6c9fa4c287974249e16b7ade6d5a85e7bbb75626
3
+ size 17082660
tokenizer_config.json ADDED
@@ -0,0 +1,21 @@
1
+ {
2
+ "bos_token": "<s>",
3
+ "clean_up_tokenization_spaces": true,
4
+ "cls_token": "<s>",
5
+ "eos_token": "</s>",
6
+ "mask_token": {
7
+ "__type": "AddedToken",
8
+ "content": "<mask>",
9
+ "lstrip": true,
10
+ "normalized": true,
11
+ "rstrip": false,
12
+ "single_word": false
13
+ },
14
+ "model_max_length": 512,
15
+ "name_or_path": "../sbert_models/cohere-embed-multilingual-v3.0-not-rotated/",
16
+ "pad_token": "<pad>",
17
+ "sep_token": "</s>",
18
+ "special_tokens_map_file": "../sbert_models/cohere-embed-multilingual-v3.0-not-rotated/special_tokens_map.json",
19
+ "tokenizer_class": "XLMRobertaTokenizer",
20
+ "unk_token": "<unk>"
21
+ }