pascalhuerten committed on
Commit 3d8304a
1 Parent(s): 07eba36

Update Readme

Files changed (1)
  1. README.md +131 -79
README.md CHANGED
@@ -1,119 +1,166 @@
  ---
  license: mit
- datasets: []
- language: []
  library_name: sentence-transformers
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - sentence-similarity
- - feature-extraction
  widget: []
  ---

- # SentenceTransformer

- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

- ## Model Details

- ### Model Description
- - **Model Type:** Sentence Transformer
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
- - **Maximum Sequence Length:** 512 tokens
- - **Output Dimensionality:** 768 tokens
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->

- ### Model Sources

- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

- ### Full Model Architecture

- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```

  ## Usage

- ### Direct Usage (Sentence Transformers)

- First install the Sentence Transformers library:
-
- ```bash
- pip install -U sentence-transformers
- ```
-
- Then you can load this model and run inference.
  ```python
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("sentence_transformers_model_id")
  # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 768]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
  ```

- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>

- </details>
- -->

- <!--
- ### Downstream Usage (Sentence Transformers)

- You can finetune this model on your own dataset.

- <details><summary>Click to expand</summary>

- </details>
- -->

- <!--
- ### Out-of-Scope Use

- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->

- <!--
- ## Bias, Risks and Limitations

- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->

- <!--
- ### Recommendations

- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->

- ## Training Details

  ### Framework Versions
  - Python: 3.10.12
  - Sentence Transformers: 3.0.1
  - Transformers: 4.41.2
@@ -122,9 +169,16 @@ You can finetune this model on your own dataset.
  - Datasets: 2.20.0
  - Tokenizers: 0.19.1

- ## Citation

- ### BibTeX

  <!--
  ## Glossary
@@ -132,11 +186,9 @@ You can finetune this model on your own dataset.
  *Clearly define terms in order to be accessible across audiences.*
  -->

- <!--
  ## Model Card Authors

- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->

  <!--
  ## Model Card Contact
 
  ---
  license: mit
+ language:
+ - de
  library_name: sentence-transformers
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - sentence-similarity
+ - information retrieval
+ - education
+ - competency
+ - course
  widget: []
  ---

+ # isy-thl/multilingual-e5-base-course-skill-tuned

+ ## Overview

+ **isy-thl/multilingual-e5-base-course-skill-tuned** is a finetuned version of the [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) model. The primary goal of the finetuning was to enhance the model's information-retrieval capabilities, specifically for identifying the skills most relevant to a given course description in German.

+ ## Capabilities

+ - **Enhanced Skill Retrieval:**
+   - The model excels at identifying and retrieving the most relevant skills for a given course description in German, which can be leveraged for various applications in educational technology.
+ - **Multilingual Capability:**
+   - While this version is optimized for German, the underlying base model [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) supports multiple languages, making it adaptable for future multilingual finetuning.
+ - **Scalability:**
+   - The model handles input sequences of up to 512 tokens, making it suitable for processing comprehensive course descriptions.

+ ## Limitations and Considerations

+ - **Language Limitation:**
+   - The finetuning specifically targeted German-language content. While the base model supports multiple languages, this finetuned version may not perform as well on non-German texts without additional training.
+ - **Data Bias:**
+   - The performance and reliability of the model depend on the quality of the annotations in the training dataset. Any biases present in the training data may affect the model's output.
+ - **Retrieval Scope:**
+   - The model is optimized for educational contexts and may not generalize as effectively to other domains without further finetuning.

+ ## Performance
+ - Coming soon

  ## Usage

+ ### Sentence Similarity

+ Like the base E5 model, this model expects the prefix `query: ` on queries (course descriptions) and `passage: ` on passages (skill labels):

  ```python
  from sentence_transformers import SentenceTransformer
+ import numpy as np
+ from sklearn.metrics.pairwise import cosine_similarity

  # Download from the 🤗 Hub
+ model = SentenceTransformer("isy-thl/multilingual-e5-base-course-skill-tuned")
  # Run inference
+ query = ['query: WordPress Grundlagen\n Dieser Kurs vermittelt grundlegende Fähigkeiten zur Erstellung eines Web-Blogs in Wordpress. Sie lernen WordPress zu installieren...']
+ corpus = ['passage: WordPress',
+           'passage: Website-Wireframe erstellen',
+           'passage: Software für Content-Management-Systeme nutzen']
+ query_embeddings = model.encode(query)
+ corpus_embeddings = model.encode(corpus)
+ similarities = cosine_similarity(query_embeddings, corpus_embeddings)
+ retrieved_doc_id = np.argmax(similarities)
+ print(retrieved_doc_id)
  ```
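+
+ Alternatively, since sentence-transformers 3.0 (see Framework Versions) the model exposes a built-in `similarity` method, so the scikit-learn import is optional. A minimal sketch reusing `query` and `corpus` from above:
+
+ ```python
+ # Built-in cosine similarity; the embeddings are already L2-normalized by the model.
+ query_embeddings = model.encode(query)
+ corpus_embeddings = model.encode(corpus)
+ similarities = model.similarity(query_embeddings, corpus_embeddings)  # tensor of shape [1, 3]
+ print(int(similarities.argmax()))
+ ```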

+ ### Information Retrieval

+ First install the LangChain and ChromaDB libraries:

+ ```bash
+ pip install -U langchain
+ pip install -U langchain-community
+ pip install -U chromadb
+ ```

+ Then you can load this model, create a vector database and run semantic searches.

+ ```python
+ from langchain_community.embeddings import HuggingFaceBgeEmbeddings
+ from langchain_community.vectorstores import Chroma
+
+ # Download model and set embed instructions.
+ embedding = HuggingFaceBgeEmbeddings(
+     model_name='isy-thl/multilingual-e5-base-course-skill-tuned',
+     query_instruction='query: ',
+     embed_instruction='passage: '
+ )
+
+ # Load your documents.
+ documents = ...
+
+ # Create vector database.
+ db = Chroma.from_documents(
+     documents=documents,
+     embedding=embedding,
+     collection_metadata={'hnsw:space': 'cosine'},
+ )
+
+ # Search database for closest semantic matches.
+ query = 'WordPress Grundlagen\n Dieser Kurs vermittelt grundlegende Fähigkeiten zur Erstellung eines Web-Blogs in Wordpress. Sie lernen WordPress zu installieren...'
+ db.similarity_search_with_relevance_scores(query, 20)
+ ```
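+
+ For illustration, `documents` is a list of LangChain `Document` objects, typically one per candidate skill label. A hypothetical way to build it (the labels below are placeholders; in practice they would come from ESCO or GRETA):
+
+ ```python
+ from langchain_core.documents import Document
+
+ # Hypothetical candidate skill labels to index.
+ skill_labels = [
+     'WordPress',
+     'Website-Wireframe erstellen',
+     'Software für Content-Management-Systeme nutzen',
+ ]
+ documents = [Document(page_content=label) for label in skill_labels]
+ ```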

+ ## Finetuning Details
+
+ ### Finetuning Dataset
+
+ - The model was finetuned on the [German Course Competency Alignment Dataset](https://huggingface.co/datasets/pascalhuerten/course_competency_alignment_de), which includes alignments of course descriptions to the skill taxonomies of ESCO (European Skills, Competences, Qualifications and Occupations) and GRETA (a competency model for professional teaching competencies in adult education); an illustrative record format is sketched after this list.
+ - This dataset was compiled as part of the **WISY@KI** project, with major contributions from the **Institut für Interaktive Systeme** at the **University of Applied Sciences Lübeck**, the **Kursportal Schleswig-Holstein**, and **Weiterbildung Hessen eV**. Special thanks to colleagues from **MyEduLife** and **Trainspot**.
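+
+ The FlagEmbedding finetuning script used below consumes JSON Lines data in which each record pairs a query (course description) with positive and negative passages (skill labels). A hypothetical record, written in that query/pos/neg format:
+
+ ```python
+ import json
+
+ # Hypothetical training record in the JSONL format expected by FlagEmbedding.
+ record = {
+     "query": "WordPress Grundlagen\n Dieser Kurs vermittelt grundlegende Fähigkeiten zur Erstellung eines Web-Blogs in Wordpress...",
+     "pos": ["WordPress", "Software für Content-Management-Systeme nutzen"],
+     "neg": ["Website-Wireframe erstellen"],
+ }
+ with open("course_competency_alignment_de.jsonl", "a", encoding="utf-8") as f:
+     f.write(json.dumps(record, ensure_ascii=False) + "\n")
+ ```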
+
+ ### Finetuning Process
+
+ - **Hardware Used:**
+   - Single NVIDIA T4 GPU with 15 GB VRAM.
+ - **Duration:**
+   - 2000 data points: ~15 minutes.
+ - **Training Parameters:**
+   ```bash
+   torchrun --nproc_per_node 1 \
+     -m FlagEmbedding.baai_general_embedding.finetune.run \
+     --output_dir multilingual_e5_base_finetuned \
+     --model_name_or_path intfloat/multilingual-e5-base \
+     --train_data ./course_competency_alignment_de.jsonl \
+     --learning_rate 1e-5 \
+     --fp16 \
+     --num_train_epochs 5 \
+     --per_device_train_batch_size 4 \
+     --dataloader_drop_last True \
+     --normlized True \
+     --temperature 0.02 \
+     --query_max_len 512 \
+     --passage_max_len 64 \
+     --train_group_size 4 \
+     --negatives_cross_device \
+     --logging_steps 10 \
+     --save_steps 1500 \
+     --query_instruction_for_retrieval ""
+   ```
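+
+ The sequence-length settings mirror the data: `--query_max_len 512` accommodates long course descriptions, while `--passage_max_len 64` is enough for short skill labels. The low `--temperature 0.02` sharpens the contrastive loss over each group of `--train_group_size 4` passages per query.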

+ ## Model Details

+ ### Model Description

+ - **Model Type:** Sentence Transformer
+ - **Base Model:** [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Language:** German
+ - **License:** MIT

+ ### Full Model Architecture

+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
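+
+ As the architecture shows, sentence embeddings are taken from the CLS token and then L2-normalized by the final `Normalize()` module, so cosine similarity and dot product yield identical rankings on the output vectors.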

  ### Framework Versions
+
  - Python: 3.10.12
  - Sentence Transformers: 3.0.1
  - Transformers: 4.41.2
  - Datasets: 2.20.0
  - Tokenizers: 0.19.1

+ ### Acknowledgments
+
+ Special thanks to the contributors from the **Institut für Interaktive Systeme**, **Kursportal Schleswig-Holstein**, **Weiterbildung Hessen eV**, **MyEduLife**, and **Trainspot** for their invaluable support and contributions to the dataset and the finetuning process.
+
+ **Funding:**
+ This project was funded by the **Federal Ministry of Education and Research**.

+ <!-- ## Citation -->
+
+ <!-- ### BibTeX -->

  <!--
  ## Glossary

  *Clearly define terms in order to be accessible across audiences.*
  -->

  ## Model Card Authors

+ Pascal Hürten, [email protected]

  <!--
  ## Model Card Contact