upskyy committed on
Commit
13713e6
·
verified ·
1 Parent(s): 3518efa

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "word_embedding_dimension": 1024,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false,
7
+ "pooling_mode_weightedmean_tokens": false,
8
+ "pooling_mode_lasttoken": false,
9
+ "include_prompt": true
10
+ }
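The pooling configuration above enables only `pooling_mode_mean_tokens`, so each sentence vector is the average of the token vectors at non-padding positions. The following is a minimal pure-Python sketch of that operation, assuming toy 2-dimensional vectors rather than the real 1024 dimensions; `masked_mean_pool` is a hypothetical helper name, not part of sentence-transformers.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors where attention_mask is 1; padding (0) is ignored."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                sums[i] += value
    # Guard against an all-padding row, similar to the clamp in the README code.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in sums]


# Toy example: two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

The padding vector does not affect the result because its mask entry is 0.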
README.md CHANGED
@@ -1,3 +1,342 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ language:
3
+ - af
4
+ - ar
5
+ - az
6
+ - be
7
+ - bg
8
+ - bn
9
+ - ca
10
+ - ceb
11
+ - cs
12
+ - cy
13
+ - da
14
+ - de
15
+ - el
16
+ - en
17
+ - es
18
+ - et
19
+ - eu
20
+ - fa
21
+ - fi
22
+ - fr
23
+ - gl
24
+ - gu
25
+ - he
26
+ - hi
27
+ - hr
28
+ - ht
29
+ - hu
30
+ - hy
31
+ - id
32
+ - is
33
+ - it
34
+ - ja
35
+ - jv
36
+ - ka
37
+ - kk
38
+ - km
39
+ - kn
40
+ - ko
41
+ - ky
42
+ - lo
43
+ - lt
44
+ - lv
45
+ - mk
46
+ - ml
47
+ - mn
48
+ - mr
49
+ - ms
50
+ - my
51
+ - ne
52
+ - nl
53
+ - 'no'
54
+ - pa
55
+ - pl
56
+ - pt
57
+ - qu
58
+ - ro
59
+ - ru
60
+ - si
61
+ - sk
62
+ - sl
63
+ - so
64
+ - sq
65
+ - sr
66
+ - sv
67
+ - sw
68
+ - ta
69
+ - te
70
+ - th
71
+ - tl
72
+ - tr
73
+ - uk
74
+ - ur
75
+ - vi
76
+ - yo
77
+ - zh
78
+ library_name: sentence-transformers
79
+ tags:
80
+ - korean
81
+ - sentence-transformers
82
+ - transformers
83
+ - multilingual
84
+ - sentence-transformers
85
+ - sentence-similarity
86
+ - feature-extraction
87
+ base_model: BAAI/bge-m3
88
+ datasets: []
89
+ metrics:
90
+ - pearson_cosine
91
+ - spearman_cosine
92
+ - pearson_manhattan
93
+ - spearman_manhattan
94
+ - pearson_euclidean
95
+ - spearman_euclidean
96
+ - pearson_dot
97
+ - spearman_dot
98
+ - pearson_max
99
+ - spearman_max
100
+ widget:
101
+ - source_sentence: 이집트 군대가 형제애를 단속하다
102
+ sentences:
103
+ - 이집트의 군대가 무슬림 형제애를 단속하다
104
+ - 아르헨티나의 기예르모 코리아와 네덜란드의 마틴 버커크의 또 다른 준결승전도 매력적이다.
105
+ - 그것이 사실일 수도 있다고 생각하는 것은 재미있다.
106
+ - source_sentence: 오, 그리고 다시 결혼은 근본적인 인권이라고 주장한다.
107
+ sentences:
108
+ - 특히 결혼은 근본적인 인권이라고 말한 후에.
109
+ - 해변에 있는 흑인과 그의 개...
110
+ - 이란은 핵 프로그램이 평화적인 목적을 위한 것이라고 주장한다
111
+ - source_sentence: 두 사람이 계단을 올라가 건물 안으로 들어간다
112
+ sentences:
113
+ - 글쎄, 나는 우리가 꽤 나빠진 사이트 목록을 만들었고 일부를 정리해야한다는 일부 사이트에서 알았고 지금 법은 슈퍼 펀드이며 당신이 아무리간에
114
+ 독성 폐기물을 일으킨 사람이라면 누구나 알고 있습니다. 결국 당신이 아는 사람은 누구나 땅에 손상을 입혔거나 모두가 기여해야한다는 것을 알고
115
+ 있습니다. 그리고 우리가이 돈을 정리하기 위해 수퍼 펀드 거래를 가져 왔을 때 많은 돈을 벌었습니다. 모든 것을 꺼내서 다시 실행하면 다른
116
+ 지역을 채울 수 있습니다. 음. 확실히 셔먼 시설과 같은 더 나은 솔루션을 가지고있는 것 같습니다. 기름 통에 넣은 다음 시멘트가 깔려있는
117
+ 곳에서 밀봉하십시오.
118
+ - 한 사람이 계단을 올라간다.
119
+ - 두 사람이 함께 계단을 올라간다.
120
+ - source_sentence: 그래, 내가 알아차린 적이 있어
121
+ sentences:
122
+ - 나는 알아차리지 못했다.
123
+ - 이것은 내가 영국의 아서 안데르센 사업부의 파트너인 짐 와디아를 아서 안데르센 경영진이 선택한 것보다 래리 웨인바흐를 안데르센 월드와이드의
124
+ 경영 파트너로 승계하기 위해 안데르센 컨설팅 사업부(현재의 엑센츄어라고 알려져 있음)의 전 관리 파트너인 조지 샤힌에 대한 지지를 표명했을
125
+ 때 가장 명백했다.
126
+ - 나는 메모했다.
127
+ - source_sentence: 여자가 전화를 하는 동안 두 남자가 돈을 위해 악기를 연주한다.
128
+ sentences:
129
+ - 마이크에 대고 노래를 부르고 베이스를 연주하는 남자.
130
+ - 빨대를 사용하는 아이
131
+ - 돈을 위해 악기를 연주하는 사람들
132
+ pipeline_tag: sentence-similarity
133
+ model-index:
134
+ - name: upskyy/bge-m3-korean
135
+ results:
136
+ - task:
137
+ type: semantic-similarity
138
+ name: Semantic Similarity
139
+ dataset:
140
+ name: sts dev
141
+ type: sts-dev
142
+ metrics:
143
+ - type: pearson_cosine
144
+ value: 0.8740181295716805
145
+ name: Pearson Cosine
146
+ - type: spearman_cosine
147
+ value: 0.8723737976913686
148
+ name: Spearman Cosine
149
+ - type: pearson_manhattan
150
+ value: 0.8593266961329962
151
+ name: Pearson Manhattan
152
+ - type: spearman_manhattan
153
+ value: 0.8687629058449345
154
+ name: Spearman Manhattan
155
+ - type: pearson_euclidean
156
+ value: 0.8597907936339472
157
+ name: Pearson Euclidean
158
+ - type: spearman_euclidean
159
+ value: 0.8693987158996017
160
+ name: Spearman Euclidean
161
+ - type: pearson_dot
162
+ value: 0.8683777071455441
163
+ name: Pearson Dot
164
+ - type: spearman_dot
165
+ value: 0.8665500024614361
166
+ name: Spearman Dot
167
+ - type: pearson_max
168
+ value: 0.8740181295716805
169
+ name: Pearson Max
170
+ - type: spearman_max
171
+ value: 0.8723737976913686
172
+ name: Spearman Max
173
+ ---
174
+
175
+ # upskyy/bge-m3-korean
176
+
177
+ This model is fine-tuned from [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) on the KorSTS and KorNLI datasets. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
178
+
179
+ ## Model Details
180
+
181
+ ### Model Description
182
+ - **Model Type:** Sentence Transformer
183
+ - **Base model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) <!-- at revision 5617a9f61b028005a4858fdac845db406aefb181 -->
184
+ - **Maximum Sequence Length:** 512 tokens
185
+ - **Output Dimensionality:** 1024 dimensions
186
+ - **Similarity Function:** Cosine Similarity
187
+ <!-- - **Training Dataset:** Unknown -->
188
+ <!-- - **Language:** Unknown -->
189
+ <!-- - **License:** Unknown -->
190
+
191
+ ### Full Model Architecture
192
+
193
+ ```
194
+ SentenceTransformer(
195
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
196
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
197
+ )
198
+ ```
199
+
200
+
201
+ ## Usage
202
+
203
+ ### Usage (Sentence-Transformers)
204
+
205
+
206
+ First install the Sentence Transformers library:
207
+
208
+ ```bash
209
+ pip install -U sentence-transformers
210
+ ```
211
+
212
+ Then you can load this model and run inference.
213
+ ```python
214
+ from sentence_transformers import SentenceTransformer
215
+
216
+ # Download from the 🤗 Hub
217
+ model = SentenceTransformer("upskyy/bge-m3-korean")
218
+
219
+ # Run inference
220
+ sentences = [
221
+     '아이를 가진 엄마가 해변을 걷는다.',
222
+     '두 사람이 해변을 걷는다.',
223
+     '한 남자가 해변에서 개를 산책시킨다.',
224
+ ]
225
+ embeddings = model.encode(sentences)
226
+ print(embeddings.shape)
227
+ # (3, 1024)
228
+
229
+ # Get the similarity scores for the embeddings
230
+ similarities = model.similarity(embeddings, embeddings)
231
+ print(similarities.shape)
232
+ # torch.Size([3, 3])
233
+ print(similarities)
234
+ ```
235
+
236
+ ### Usage (HuggingFace Transformers)
237
+
238
+ Without sentence-transformers, you can use the model like this:
239
+ First, pass the input through the transformer model, then apply mean pooling over the contextualized token embeddings, using the attention mask to exclude padding tokens.
240
+
241
+ ```python
242
+ from transformers import AutoTokenizer, AutoModel
243
+ import torch
244
+
245
+
246
+ # Mean Pooling - Take attention mask into account for correct averaging
247
+ def mean_pooling(model_output, attention_mask):
248
+     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
249
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
250
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
251
+
252
+
253
+ # Sentences we want sentence embeddings for
254
+ sentences = ["안녕하세요?", "한국어 문장 임베딩을 위한 버트 모델입니다."]
255
+
256
+ # Load model from HuggingFace Hub
257
+ tokenizer = AutoTokenizer.from_pretrained("upskyy/bge-m3-korean")
258
+ model = AutoModel.from_pretrained("upskyy/bge-m3-korean")
259
+
260
+ # Tokenize sentences
261
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
262
+
263
+ # Compute token embeddings
264
+ with torch.no_grad():
265
+     model_output = model(**encoded_input)
266
+
267
+ # Perform pooling. In this case, mean pooling.
268
+ sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
269
+
270
+ print("Sentence embeddings:")
271
+ print(sentence_embeddings)
272
+ ```
273
+
274
+ ## Evaluation
275
+
276
+ ### Metrics
277
+
278
+ #### Semantic Similarity
279
+ * Dataset: `sts-dev`
280
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
281
+
282
+ | Metric | Value |
283
+ | :----------------- | :--------- |
284
+ | pearson_cosine | 0.874 |
285
+ | spearman_cosine | 0.8724 |
286
+ | pearson_manhattan | 0.8593 |
287
+ | spearman_manhattan | 0.8688 |
288
+ | pearson_euclidean | 0.8598 |
289
+ | spearman_euclidean | 0.8694 |
290
+ | pearson_dot | 0.8684 |
291
+ | spearman_dot | 0.8666 |
292
+ | pearson_max | 0.874 |
293
+ | **spearman_max** | **0.8724** |
294
+
295
+ <!--
296
+ ## Bias, Risks and Limitations
297
+
298
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
299
+ -->
300
+
301
+ <!--
302
+ ### Recommendations
303
+
304
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
305
+ -->
306
+
307
+
308
+ ### Framework Versions
309
+ - Python: 3.10.13
310
+ - Sentence Transformers: 3.0.1
311
+ - Transformers: 4.42.4
312
+ - PyTorch: 2.3.0+cu121
313
+ - Accelerate: 0.30.1
314
+ - Datasets: 2.16.1
315
+ - Tokenizers: 0.19.1
316
+
317
+ ## Citation
318
+
319
+ ### BibTeX
320
+
321
+ ```bibtex
322
+ @misc{bge-m3,
323
+ title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
324
+ author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
325
+ year={2024},
326
+ eprint={2402.03216},
327
+ archivePrefix={arXiv},
328
+ primaryClass={cs.CL}
329
+ }
330
+ ```
331
+
332
+ ```bibtex
333
+ @inproceedings{reimers-2019-sentence-bert,
334
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
335
+ author = "Reimers, Nils and Gurevych, Iryna",
336
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
337
+ month = "11",
338
+ year = "2019",
339
+ publisher = "Association for Computational Linguistics",
340
+ url = "https://arxiv.org/abs/1908.10084",
341
+ }
342
+ ```
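In the README's evaluation table, the `pearson_max` and `spearman_max` rows match the cosine rows. Assuming, as the reported numbers suggest, that each `*_max` metric is the maximum of that correlation over the four similarity functions, this can be checked from the values in the `model-index` metadata. This is a sketch of that check only, not the evaluator's actual code.

```python
# Values copied from the model-index metadata in the README.
pearson = {
    "cosine": 0.8740181295716805,
    "manhattan": 0.8593266961329962,
    "euclidean": 0.8597907936339472,
    "dot": 0.8683777071455441,
}
spearman = {
    "cosine": 0.8723737976913686,
    "manhattan": 0.8687629058449345,
    "euclidean": 0.8693987158996017,
    "dot": 0.8665500024614361,
}

# Assumption: "*_max" is the best score across the four similarity functions.
pearson_max = max(pearson.values())
spearman_max = max(spearman.values())
best_pearson = max(pearson, key=pearson.get)
best_spearman = max(spearman, key=spearman.get)
print(pearson_max, best_pearson)
print(spearman_max, best_spearman)
```

Under that assumption, cosine similarity gives the best score for both correlations, which is consistent with the `Similarity Function: Cosine Similarity` entry in the model description.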
config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "architectures": [
3
+ "XLMRobertaModel"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "bos_token_id": 0,
7
+ "classifier_dropout": null,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 1024,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 4096,
14
+ "layer_norm_eps": 1e-05,
15
+ "max_position_embeddings": 8194,
16
+ "model_type": "xlm-roberta",
17
+ "num_attention_heads": 16,
18
+ "num_hidden_layers": 24,
19
+ "output_past": true,
20
+ "pad_token_id": 1,
21
+ "position_embedding_type": "absolute",
22
+ "torch_dtype": "float32",
23
+ "transformers_version": "4.42.4",
24
+ "type_vocab_size": 1,
25
+ "use_cache": true,
26
+ "vocab_size": 250002
27
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8f36f5b113e21e68f8440a44fb9355569d81ab1d213d156d02bedc2ef0b508b3
3
+ size 2271064456
modules.json ADDED
@@ -0,0 +1,14 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ }
14
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 512,
3
+ "do_lower_case": false
4
+ }
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
3
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": true,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d9a6af42442a3e3e9f05f618eae0bb2d98ca4f6a6406cb80ef7a4fa865204d61
3
+ size 17083052
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<pad>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "</s>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "<unk>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "250001": {
36
+ "content": "<mask>",
37
+ "lstrip": true,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "bos_token": "<s>",
45
+ "clean_up_tokenization_spaces": true,
46
+ "cls_token": "<s>",
47
+ "eos_token": "</s>",
48
+ "mask_token": "<mask>",
49
+ "model_max_length": 512,
50
+ "pad_token": "<pad>",
51
+ "sep_token": "</s>",
52
+ "sp_model_kwargs": {},
53
+ "tokenizer_class": "XLMRobertaTokenizer",
54
+ "unk_token": "<unk>"
55
+ }
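The tokenizer configuration above maps `<s>`, `<pad>`, and `</s>` to ids 0, 1, and 2 and sets `model_max_length` to 512. The sketch below shows how a single sequence would be laid out under those settings, assuming the usual XLM-RoBERTa format `<s> tokens </s>` and right-side padding. `build_inputs` is a hypothetical helper and the token ids are made-up placeholders, not real vocabulary entries.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # <s>, <pad>, </s> per added_tokens_decoder
MODEL_MAX_LENGTH = 512            # model_max_length in tokenizer_config.json


def build_inputs(token_ids, pad_to=None, max_length=MODEL_MAX_LENGTH):
    """Wrap ids as <s> ... </s>, truncate to max_length, optionally pad on the right."""
    body = token_ids[: max_length - 2]  # reserve two slots for <s> and </s>
    input_ids = [BOS_ID] + body + [EOS_ID]
    attention_mask = [1] * len(input_ids)
    if pad_to is not None and pad_to > len(input_ids):
        padding = pad_to - len(input_ids)
        input_ids += [PAD_ID] * padding
        attention_mask += [0] * padding
    return input_ids, attention_mask


# Placeholder ids 1000, 1001, 1002 stand in for real subword ids.
ids, mask = build_inputs([1000, 1001, 1002], pad_to=8)
print(ids)
print(mask)
```

The zeros in the attention mask mark the `<pad>` positions that mean pooling excludes from the sentence embedding.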