# NB-SBERT-BASE

NB-SBERT-BASE is a [SentenceTransformers](https://www.SBERT.net) model trained on a [machine-translated version of the MNLI dataset](https://huggingface.co/datasets/NbAiLab/mnli-norwegian), starting from [nb-bert-base](https://huggingface.co/NbAiLab/nb-bert-base).

The model maps sentences & paragraphs to a 768-dimensional dense vector space. This vector can be used for tasks like clustering and semantic search. Below we give some examples of how to use the model. The easiest way is simply to measure the cosine distance between two sentences: sentences that are close in meaning will have a small cosine distance and a similarity close to 1. The model is trained in such a way that similar sentences in different languages should also be close to each other. Ideally, an English-Norwegian sentence pair should have high similarity.

To use the model, install the [sentence-transformers](https://www.SBERT.net) library (`pip install -U sentence-transformers`). Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer, util
sentences = ["This is a Norwegian boy", "Dette er en norsk gutt"]
model = SentenceTransformer('NbAiLab/nb-sbert-base')
embeddings = model.encode(sentences)
print(embeddings)
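# Cosine similarity between the two embeddings (illustrative addition, not
# part of the original snippet; util.cos_sim is standard
# sentence-transformers API, and util is already imported above)
cosine_scores = util.cos_sim(embeddings[0], embeddings[1])
print(cosine_scores)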
```

You can also use the model without the sentence-transformers library, by passing your input through the transformer model and then applying a mean-pooling operation on top of the contextualized word embeddings:

```python
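import torch
from transformers import AutoTokenizer, AutoModel

# Mean-pooling helper, reconstructed from the standard sentence-transformers
# template (only the signature mean_pooling(model_output, attention_mask) is
# given in the original): average the token embeddings, weighted by the
# attention mask so that padding tokens are ignored.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want embeddings for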
sentences = ["This is a Norwegian boy", "Dette er en norsk gutt"]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('NbAiLab/nb-sbert-base')
model = AutoModel.from_pretrained('NbAiLab/nb-sbert-base')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
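
# Compute token embeddings and pool them (reconstructed lines, following the
# standard sentence-transformers example)
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Compare the two sentences; the variable name scipy_cosine_scores follows the
# original, but the exact computation here is an assumption
from scipy.spatial import distance
scipy_cosine_scores = 1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[1])
print(scipy_cosine_scores)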
```
## SetFit - Few Shot Classification
[SetFit](https://github.com/huggingface/setfit) is a method for using sentence-transformers to solve one of the major problems that all NLP researchers face: too few labeled training examples. The nb-sbert-base model can be plugged directly into the SetFit library. Please see [this tutorial](https://huggingface.co/blog/setfit) for how to use this technique.
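
As a minimal sketch of the idea (assuming the classic `SetFitModel`/`SetFitTrainer` API from the setfit library; the toy dataset is hypothetical):

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# A handful of labeled examples (hypothetical toy data)
train_ds = Dataset.from_dict({
    "text": ["Dette er en norsk gutt", "A red house",
             "Hun er fra Bergen", "A blue car"],
    "label": [0, 1, 0, 1],  # 0 = about a person, 1 = about an object
})

# Use nb-sbert-base as the sentence-embedding backbone of the classifier
model = SetFitModel.from_pretrained("NbAiLab/nb-sbert-base")
trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

print(model.predict(["Dette er en norsk jente"]))
```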
## Keyword Extraction

Install [KeyBERT](https://github.com/MaartenGr/KeyBERT) with `pip install keybert`, then:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("NbAiLab/nb-sbert-base")
kw_model = KeyBERT(model=sentence_model)
doc = """
```

The [KeyBERT homepage](https://github.com/MaartenGr/KeyBERT) provides several other interesting examples.

## Topic Modeling
Analysing a group of documents to determine their topics has many use cases. [BERTopic](https://github.com/MaartenGr/BERTopic) combines the power of sentence transformers with c-TF-IDF to create clusters of easily interpretable topics.
Explaining topic modeling in detail is out of scope here. Instead, we recommend that you take a look at the link above, as well as the [documentation](https://maartengr.github.io/BERTopic/index.html). The main adaptation you need to make to use the Norwegian nb-sbert-base is to add the following:
```python
topic_model = BERTopic(embedding_model='NbAiLab/nb-sbert-base').fit(docs)
```
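
BERTopic accepts the model name directly as `embedding_model` and loads it through sentence-transformers. Afterwards, the fitted model can be inspected with standard BERTopic calls, for example:

```python
# Assumes `topic_model` was fitted as above on your own list of documents
print(topic_model.get_topic_info())  # one row per discovered topic
print(topic_model.get_topic(0))      # top words for topic 0
```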
## Similarity Search

The embeddings also work well for semantic similarity search over a collection of sentences, for instance with the [autofaiss](https://github.com/criteo/autofaiss) library (`pip install autofaiss`), which provides the `build_index` function used below:

```python
import numpy as np
from autofaiss import build_index
from sentence_transformers import SentenceTransformer, util
sentences = ["This is a Norwegian boy", "Dette er en norsk gutt", "A red house"]
model = SentenceTransformer('NbAiLab/nb-sbert-base')
embeddings = model.encode(sentences)
index, index_infos = build_index(embeddings, save_on_disk=False)
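
# Query the index for the nearest neighbours of a new sentence (sketch:
# index.search is the standard faiss API, returning distances and indices)
query_embedding = model.encode(["En norsk gutt"])
distances, indices = index.search(np.asarray(query_embedding, dtype=np.float32), 2)
print([sentences[i] for i in indices[0]])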
```