---
license: mit
language:
- de
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- information retrieval
- education
- competency
- course
widget: []
---

# isy-thl/multilingual-e5-base-course-skill-tuned

## Overview

**isy-thl/multilingual-e5-base-course-skill-tuned** is a fine-tuned version of the [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) model. The goal of fine-tuning was to improve the model's information retrieval capabilities, specifically for identifying the skills most relevant to a given course description in German.

## Capabilities

- **Enhanced Skill Retrieval:** 
  - The model excels at identifying and retrieving the most relevant skills for a given course description in German, which can be leveraged for various applications in educational technology.
- **Multilingual Capability:**
  - While optimized for German, the underlying base model [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) supports multiple languages, making it adaptable for future multilingual finetuning endeavors.
- **Scalability:**
  - The model can handle input sequences up to 512 tokens in length, making it suitable for processing comprehensive course descriptions.

## Performance

To evaluate the model, all ESCO (n=13,895) and GRETA (n=23) skills were embedded with the model under assessment and stored in a vector database. For each query in the evaluation dataset, the top 30 candidates were retrieved by cosine similarity, and metrics such as accuracy, precision, recall, NDCG, MRR, and MAP were calculated. For the reranker evaluation, the reranker was used to re-rank the top 30 candidates returned by the fine-tuned bi-encoder.
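For illustration, a minimal sketch of the retrieval step (the skill labels and evaluation queries below are hypothetical placeholders for the actual ESCO/GRETA data):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("isy-thl/multilingual-e5-base-course-skill-tuned")

# Hypothetical stand-ins for the evaluation data: the full skill taxonomy and
# course descriptions annotated with their gold-standard skills.
skill_labels = ["WordPress", "Website-Wireframe erstellen", "Software für Content-Management-Systeme nutzen"]
eval_queries = [{"course": "WordPress Grundlagen: Erstellung eines Web-Blogs in WordPress ...", "gold_skills": {"WordPress"}}]

# Embed the whole skill corpus once (E5 convention: 'passage: ' prefix).
skill_embeddings = model.encode(["passage: " + s for s in skill_labels], normalize_embeddings=True)

for item in eval_queries:
    query_embedding = model.encode("query: " + item["course"], normalize_embeddings=True)
    # Retrieve the top 30 candidates by cosine similarity (capped here by the tiny example corpus).
    hits = util.semantic_search(query_embedding, skill_embeddings, top_k=30)[0]
    retrieved = [skill_labels[hit["corpus_id"]] for hit in hits]
    recall_at_30 = len(item["gold_skills"] & set(retrieved)) / len(item["gold_skills"])
    print(retrieved, recall_at_30)
```

The evaluation results are reported separately for the ESCO and GRETA use cases: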

**ESCO Use Case**
![Evaluation results for ESCO use-case comparing intfloat/multilingual-e5-base, isy-thl/multilingual-e5-base-course-skill-tuned and also a version reranked with isy-thl/bge-reranker-base-course-skill-tuned](https://cdn-uploads.huggingface.co/production/uploads/64481ef1e6161a1f32e60d96/x5xqyU-_raRyVOGqGVpq-.png)


**GRETA Use Case**
![Evaluation results for GRETA use-case comparing intfloat/multilingual-e5-base, isy-thl/multilingual-e5-base-course-skill-tuned and also a version reranked with isy-thl/bge-reranker-base-course-skill-tuned](https://cdn-uploads.huggingface.co/production/uploads/64481ef1e6161a1f32e60d96/DU2d1WSThMLuyvb3tNNpz.png)


The results demonstrate that fine-tuning substantially improved performance, often more than doubling that of the non-fine-tuned base model. Notably, fine-tuning with training data from both use cases outperformed fine-tuning with data from only the target skill taxonomy, which suggests that the models learn more than specific skills and are able to generalize. Further research could evaluate the model on a skill taxonomy unseen during training, where we expect it to outperform the base model as well.

The fine-tuned bi-encoder (isy-thl/multilingual-e5-base-course-skill-tuned) performs very well on the target task and provides substantial improvements over the base model. To maximize retrieval quality, it is recommended to combine it with the reranker (isy-thl/bge-reranker-base-course-skill-tuned), especially in scenarios where the additional computational cost is justified by the need for higher accuracy and precision.
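A minimal sketch of this two-stage setup, assuming the reranker can be loaded as a sentence-transformers `CrossEncoder` (the candidate list below is a placeholder for the top 30 skills returned by the bi-encoder):

```python
from sentence_transformers import CrossEncoder

# Placeholder inputs: in practice `candidates` would be the top 30 skills
# retrieved by isy-thl/multilingual-e5-base-course-skill-tuned.
course = "WordPress Grundlagen: Dieser Kurs vermittelt grundlegende Fähigkeiten zur Erstellung eines Web-Blogs in WordPress."
candidates = ["WordPress", "Website-Wireframe erstellen", "Software für Content-Management-Systeme nutzen"]

# Score each (course, skill) pair and sort candidates by relevance.
reranker = CrossEncoder("isy-thl/bge-reranker-base-course-skill-tuned")
scores = reranker.predict([(course, skill) for skill in candidates])
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```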

## Usage

### Sentence Similarity

```python
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Download from the 🤗 Hub
model = SentenceTransformer("isy-thl/multilingual-e5-base-course-skill-tuned")
# Run inference
# E5-style models expect 'query: ' and 'passage: ' prefixes on the input texts.
query  = ['query: WordPress Grundlagen\n Dieser Kurs vermittelt grundlegende Fähigkeiten zur Erstellung eines Web-Blogs in Wordpress. Sie lernen WordPress zu installieren...']
corpus = ['passage: WordPress',
          'passage: Website-Wireframe erstellen',
          'passage: Software für Content-Management-Systeme nutzen']
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings,corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
```

### Information Retrieval

First install the langchain, langchain-community, and chromadb libraries:

```bash
pip install -U langchain
pip install -U langchain-community
pip install -U chromadb
```

Then you can load the model, create a vector database, and run semantic searches.

```python
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import Chroma

# Download model and set embed instructions.
embedding = HuggingFaceBgeEmbeddings(
    model_name='isy-thl/multilingual-e5-base-course-skill-tuned',
    query_instruction='query: ',
    embed_instruction='passage: '
)

# Load your documents.
documents = ...

# Create vector database.
db = Chroma.from_documents(
    documents=documents,
    embedding=embedding,
    collection_metadata={'hnsw:space': 'cosine'},
)

# Search database for closest semantic matches.
query = 'WordPress Grundlagen\n Dieser Kurs vermittelt grundlegende Fähigkeiten zur Erstellung eines Web-Blogs in Wordpress. Sie lernen WordPress zu installieren...'
db.similarity_search_with_relevance_scores(query, 20)
```

## Finetuning Details

### Finetuning Dataset

  - The model was fine-tuned on the [German Course Competency Alignment Dataset](https://huggingface.co/datasets/pascalhuerten/course_competency_alignment_de), which includes alignments of course descriptions to the skill taxonomies of ESCO (European Skills, Competences, Qualifications and Occupations) and GRETA (a competency model for professional teaching competencies in adult education).
  - This dataset was compiled as part of the **WISY@KI** project, with major contributions from the **Institut für Interaktive Systeme** at the **University of Applied Sciences Lübeck**, the **Kursportal Schleswig-Holstein**, and **Weiterbildung Hessen eV**. Special thanks to colleagues from **MyEduLife** and **Trainspot**.

### Finetuning Process

- **Hardware Used:**
  - Single NVIDIA T4 GPU with 15 GB VRAM.
- **Duration:**
  - 2000 data points: ~15 minutes.
- **Training Parameters:**
  ```bash
  torchrun --nproc_per_node 1 \
  -m FlagEmbedding.baai_general_embedding.finetune.run \
  --output_dir multilingual_e5_base_finetuned \
  --model_name_or_path intfloat/multilingual-e5-base \
  --train_data ./course_competency_alignment_de.jsonl \
  --learning_rate 1e-5 \
  --fp16 \
  --num_train_epochs 5 \
  --per_device_train_batch_size 4 \
  --dataloader_drop_last True \
  --normlized True \
  --temperature 0.02 \
  --query_max_len 512 \
  --passage_max_len 64 \
  --train_group_size 4 \
  --negatives_cross_device \
  --logging_steps 10 \
  --save_steps 1500 \
  --query_instruction_for_retrieval ""
  ```
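  Each line of the `--train_data` file follows FlagEmbedding's general JSONL fine-tuning format, pairing a query with positive and negative passages. A hypothetical example line (illustrative skills only):

  ```json
  {"query": "WordPress Grundlagen: Dieser Kurs vermittelt grundlegende Fähigkeiten zur Erstellung eines Web-Blogs in WordPress.", "pos": ["WordPress"], "neg": ["Website-Wireframe erstellen", "Software für Content-Management-Systeme nutzen"]}
  ```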

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base Model:** [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** German
- **License:** MIT
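
These properties can be verified directly, for example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("isy-thl/multilingual-e5-base-course-skill-tuned")
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 768
```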

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
### Framework Versions

- Python: 3.10.12
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.3.0+cu121
- Accelerate: 0.32.1
- Datasets: 2.20.0
- Tokenizers: 0.19.1

### Acknowledgments

Special thanks to the contributors from the **Institut für Interaktive Systeme**, **Kursportal Schleswig-Holstein**, **Weiterbildung Hessen eV**, **MyEduLife**, and **Trainspot** for their invaluable support and contributions to the dataset and finetuning process.

**Funding:**
This project was funded by the **Federal Ministry of Education and Research**.

<div style="display: flex; align-items: center; gap: 35px;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64481ef1e6161a1f32e60d96/Yy8QPa3w_lBp9XbGYEDs0.jpeg" alt="BMBF Logo" style="width: 150px; height: auto;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64481ef1e6161a1f32e60d96/qznw4LESrVmdm-iZdwmNq.jpeg" alt="THL Logo" style="width: 150px; height: auto;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64481ef1e6161a1f32e60d96/Sybom1Tr45NkHbHq75EED.png" alt="WISY@KI Logo" style="width: 150px; height: auto;">
</div>

<!-- ## Citation -->

<!-- ### BibTeX -->

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

## Model Card Authors

Pascal Hürten, [email protected]

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->