Commit ec55ad6 (verified) by vrashad · 1 Parent(s): 452d688

Update README.md

Files changed (1): README.md (+66 -3)
@@ -1,3 +1,66 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - az
+ base_model:
+ - sentence-transformers/LaBSE
+ pipeline_tag: sentence-similarity
+ ---
+
+ ---
+ language:
+ - en
+ - az
+ tags:
+ - LaBSE
+ - sentence-transformers
+ - sentence-similarity
+ - dimensionality-reduction
+ - bert
+ license: apache-2.0
+ ---
+
+ # Small LaBSE for English-Azerbaijani
+
+ This is an optimized version of [LaBSE (Language-agnostic BERT Sentence Embeddings)](https://huggingface.co/sentence-transformers/LaBSE) tailored specifically to English and Azerbaijani.
+
+
+ # Benchmark
+
+ | STSBenchmark | biosses-sts | sickr-sts | sts12-sts | sts13-sts | sts15-sts | sts16-sts | Average Pearson | Model |
+ |--------------|-------------|-----------|-----------|-----------|-----------|-----------|-----------------|--------------------------------------|
+ | 0.7363 | 0.8148 | 0.7067 | 0.7050 | 0.6535 | 0.7514 | 0.7070 | 0.7250 | sentence-transformers/LaBSE |
+ | 0.7400 | 0.8216 | 0.6946 | 0.7098 | 0.6781 | 0.7637 | 0.7222 | 0.7329 | LocalDoc/LaBSE-small-AZ |
+ | 0.5830 | 0.2486 | 0.5921 | 0.5593 | 0.5559 | 0.5404 | 0.5289 | 0.5155 | antoinelouis/colbert-xm |
+ | 0.7572 | 0.8139 | 0.7328 | 0.7646 | 0.6318 | 0.7542 | 0.7092 | 0.7377 | intfloat/multilingual-e5-large-instruct |
+ | 0.7485 | 0.7714 | 0.7271 | 0.7170 | 0.6496 | 0.7570 | 0.7255 | 0.7280 | intfloat/multilingual-e5-large |
+ | 0.6960 | 0.8185 | 0.6950 | 0.6752 | 0.5899 | 0.7186 | 0.6790 | 0.6960 | intfloat/multilingual-e5-base |
+ | 0.7376 | 0.7917 | 0.7190 | 0.7441 | 0.6286 | 0.7461 | 0.7026 | 0.7242 | intfloat/multilingual-e5-small |
+ | 0.7927 | 0.6672 | 0.7758 | 0.8122 | 0.7312 | 0.7831 | 0.7416 | 0.7577 | BAAI/bge-m3 |
+
+
+
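Reviewer note (not part of the commit): the Average Pearson column appears to be the unweighted mean of the seven per-dataset scores in each row. The sketch below checks that assumption against the values copied from the table. The names `scores`, `reported`, and `average_pearson` are illustrative only and do not come from the README.

```python
# Per-dataset Pearson scores copied from the benchmark table, in column order:
# STSBenchmark, biosses-sts, sickr-sts, sts12-sts, sts13-sts, sts15-sts, sts16-sts
scores = {
    "sentence-transformers/LaBSE": [0.7363, 0.8148, 0.7067, 0.7050, 0.6535, 0.7514, 0.7070],
    "LocalDoc/LaBSE-small-AZ": [0.7400, 0.8216, 0.6946, 0.7098, 0.6781, 0.7637, 0.7222],
    "antoinelouis/colbert-xm": [0.5830, 0.2486, 0.5921, 0.5593, 0.5559, 0.5404, 0.5289],
    "intfloat/multilingual-e5-large-instruct": [0.7572, 0.8139, 0.7328, 0.7646, 0.6318, 0.7542, 0.7092],
    "intfloat/multilingual-e5-large": [0.7485, 0.7714, 0.7271, 0.7170, 0.6496, 0.7570, 0.7255],
    "intfloat/multilingual-e5-base": [0.6960, 0.8185, 0.6950, 0.6752, 0.5899, 0.7186, 0.6790],
    "intfloat/multilingual-e5-small": [0.7376, 0.7917, 0.7190, 0.7441, 0.6286, 0.7461, 0.7026],
    "BAAI/bge-m3": [0.7927, 0.6672, 0.7758, 0.8122, 0.7312, 0.7831, 0.7416],
}

# "Average Pearson" values as reported in the same table.
reported = {
    "sentence-transformers/LaBSE": 0.7250,
    "LocalDoc/LaBSE-small-AZ": 0.7329,
    "antoinelouis/colbert-xm": 0.5155,
    "intfloat/multilingual-e5-large-instruct": 0.7377,
    "intfloat/multilingual-e5-large": 0.7280,
    "intfloat/multilingual-e5-base": 0.6960,
    "intfloat/multilingual-e5-small": 0.7242,
    "BAAI/bge-m3": 0.7577,
}


def average_pearson(values):
    """Unweighted mean of the per-dataset scores (assumed definition)."""
    return sum(values) / len(values)


for model, values in scores.items():
    mean = average_pearson(values)
    # Reported values are rounded to four decimals, so allow a small tolerance.
    matches = abs(mean - reported[model]) < 1e-4
    print(f"{model}: computed {mean:.4f}, reported {reported[model]:.4f}, match={matches}")
```

Under this assumption every row matches the reported average to four decimals. By that measure LocalDoc/LaBSE-small-AZ sits slightly above the base LaBSE model and below BAAI/bge-m3.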
+ ## How to Use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("LocalDoc/LaBSE-small-AZ")
+ model = AutoModel.from_pretrained("LocalDoc/LaBSE-small-AZ")
+
+ # Prepare texts
+ texts = [
+     "Hello world",  # English
+     "Salam dünya"   # Azerbaijani
+ ]
+
+ # Tokenize and generate embeddings
+ encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
+ with torch.no_grad():
+     embeddings = model(**encoded).pooler_output
+
+ # Compute similarity
+ similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)