pascalhuerten committed on
Commit 3d8304a
1 Parent(s): 07eba36

Update Readme

Files changed (1)
  1. README.md +131 -79
README.md CHANGED
@@ -1,119 +1,166 @@
  ---
  license: mit
- datasets: []
- language: []
  library_name: sentence-transformers
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - sentence-similarity
- - feature-extraction
  widget: []
  ---

- # SentenceTransformer

- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

- ## Model Details

- ### Model Description
- - **Model Type:** Sentence Transformer
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
- - **Maximum Sequence Length:** 512 tokens
- - **Output Dimensionality:** 768 tokens
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->

- ### Model Sources

- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

- ### Full Model Architecture

- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```

  ## Usage

- ### Direct Usage (Sentence Transformers)

- First install the Sentence Transformers library:
-
- ```bash
- pip install -U sentence-transformers
- ```
-
- Then you can load this model and run inference.
  ```python
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("sentence_transformers_model_id")
  # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 768]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
  ```

- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>

- </details>
- -->

- <!--
- ### Downstream Usage (Sentence Transformers)

- You can finetune this model on your own dataset.

- <details><summary>Click to expand</summary>

- </details>
- -->

- <!--
- ### Out-of-Scope Use

- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->

- <!--
- ## Bias, Risks and Limitations

- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->

- <!--
- ### Recommendations

- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->

- ## Training Details

  ### Framework Versions
  - Python: 3.10.12
  - Sentence Transformers: 3.0.1
  - Transformers: 4.41.2
@@ -122,9 +169,16 @@ You can finetune this model on your own dataset.
  - Datasets: 2.20.0
  - Tokenizers: 0.19.1

- ## Citation

- ### BibTeX

  <!--
  ## Glossary
@@ -132,11 +186,9 @@ You can finetune this model on your own dataset.
  *Clearly define terms in order to be accessible across audiences.*
  -->

- <!--
  ## Model Card Authors

- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->

  <!--
  ## Model Card Contact
 
  ---
  license: mit
+ language:
+ - de
  library_name: sentence-transformers
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - sentence-similarity
+ - information retrieval
+ - education
+ - competency
+ - course
  widget: []
  ---

+ # isy-thl/multilingual-e5-base-course-skill-tuned

+ ## Overview

+ **isy-thl/multilingual-e5-base-course-skill-tuned** is a finetuned version of the [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) model. The primary goal of the finetuning was to enhance the model's information-retrieval capabilities, specifically for identifying the skills most relevant to a given course description in German.

+ ## Capabilities

+ - **Enhanced Skill Retrieval:**
+   - The model excels at identifying and retrieving the most relevant skills for a given course description in German, which can be leveraged for various applications in educational technology.
+ - **Multilingual Capability:**
+   - While this version is optimized for German, the underlying base model [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) supports multiple languages, making it adaptable for future multilingual finetuning.
+ - **Scalability:**
+   - The model handles input sequences of up to 512 tokens, making it suitable for processing comprehensive course descriptions.

+ ## Limitations and Considerations

+ - **Language Limitation:**
+   - The finetuning specifically targeted German-language content. While the base model supports multiple languages, this finetuned version may not perform as well on non-German texts without additional training.
+ - **Data Bias:**
+   - The performance and reliability of the model depend on the quality of the annotations in the training dataset. Any biases present in the training data may affect the model's output.
+ - **Retrieval Scope:**
+   - The model is optimized for educational contexts and may not generalize as effectively to other domains without further finetuning.

+ ## Performance
+ - Coming soon

  ## Usage

+ ### Sentence Similarity

+ Like the base E5 model, this model expects the prefix `query: ` on queries (course descriptions) and `passage: ` on passages (skill labels):

  ```python
  from sentence_transformers import SentenceTransformer
+ import numpy as np
+ from sklearn.metrics.pairwise import cosine_similarity

  # Download from the 🤗 Hub
+ model = SentenceTransformer("isy-thl/multilingual-e5-base-course-skill-tuned")
  # Run inference
+ query = ['query: WordPress Grundlagen\n Dieser Kurs vermittelt grundlegende Fähigkeiten zur Erstellung eines Web-Blogs in Wordpress. Sie lernen WordPress zu installieren...']
+ corpus = ['passage: WordPress',
+           'passage: Website-Wireframe erstellen',
+           'passage: Software für Content-Management-Systeme nutzen']
+ query_embeddings = model.encode(query)
+ corpus_embeddings = model.encode(corpus)
+ similarities = cosine_similarity(query_embeddings, corpus_embeddings)
+ retrieved_doc_id = np.argmax(similarities)
+ print(retrieved_doc_id)
  ```
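+
+ Alternatively, since sentence-transformers 3.0 (see Framework Versions) the model exposes a built-in `similarity` method, so the scikit-learn import is optional. A minimal sketch reusing `query` and `corpus` from above:
+
+ ```python
+ # Built-in cosine similarity; the embeddings are already L2-normalized by the model.
+ query_embeddings = model.encode(query)
+ corpus_embeddings = model.encode(corpus)
+ similarities = model.similarity(query_embeddings, corpus_embeddings)  # tensor of shape [1, 3]
+ print(int(similarities.argmax()))
+ ```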

+ ### Information Retrieval

+ First install the LangChain and ChromaDB libraries:

+ ```bash
+ pip install -U langchain
+ pip install -U langchain-community
+ pip install -U chromadb
+ ```

+ Then you can load this model, create a vector database and run semantic searches.

+ ```python
+ from langchain_community.embeddings import HuggingFaceBgeEmbeddings
+ from langchain_community.vectorstores import Chroma
+
+ # Download model and set embed instructions.
+ embedding = HuggingFaceBgeEmbeddings(
+     model_name='isy-thl/multilingual-e5-base-course-skill-tuned',
+     query_instruction='query: ',
+     embed_instruction='passage: '
+ )
+
+ # Load your documents.
+ documents = ...
+
+ # Create vector database.
+ db = Chroma.from_documents(
+     documents=documents,
+     embedding=embedding,
+     collection_metadata={'hnsw:space': 'cosine'},
+ )
+
+ # Search database for closest semantic matches.
+ query = 'WordPress Grundlagen\n Dieser Kurs vermittelt grundlegende Fähigkeiten zur Erstellung eines Web-Blogs in Wordpress. Sie lernen WordPress zu installieren...'
+ db.similarity_search_with_relevance_scores(query, 20)
+ ```
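+
+ For illustration, `documents` is a list of LangChain `Document` objects, typically one per candidate skill label. A hypothetical way to build it (the labels below are placeholders; in practice they would come from ESCO or GRETA):
+
+ ```python
+ from langchain_core.documents import Document
+
+ # Hypothetical candidate skill labels to index.
+ skill_labels = [
+     'WordPress',
+     'Website-Wireframe erstellen',
+     'Software für Content-Management-Systeme nutzen',
+ ]
+ documents = [Document(page_content=label) for label in skill_labels]
+ ```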

+ ## Finetuning Details
+
+ ### Finetuning Dataset
+
+ - The model was finetuned on the [German Course Competency Alignment Dataset](https://huggingface.co/datasets/pascalhuerten/course_competency_alignment_de), which includes alignments of course descriptions to the skill taxonomies of ESCO (European Skills, Competences, Qualifications and Occupations) and GRETA (a competency model for professional teaching competencies in adult education); an illustrative record format is sketched after this list.
+ - This dataset was compiled as part of the **WISY@KI** project, with major contributions from the **Institut für Interaktive Systeme** at the **University of Applied Sciences Lübeck**, the **Kursportal Schleswig-Holstein**, and **Weiterbildung Hessen eV**. Special thanks to colleagues from **MyEduLife** and **Trainspot**.
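+
+ The FlagEmbedding finetuning script used below consumes JSON Lines data in which each record pairs a query (course description) with positive and negative passages (skill labels). A hypothetical record, written in that query/pos/neg format:
+
+ ```python
+ import json
+
+ # Hypothetical training record in the JSONL format expected by FlagEmbedding.
+ record = {
+     "query": "WordPress Grundlagen\n Dieser Kurs vermittelt grundlegende Fähigkeiten zur Erstellung eines Web-Blogs in Wordpress...",
+     "pos": ["WordPress", "Software für Content-Management-Systeme nutzen"],
+     "neg": ["Website-Wireframe erstellen"],
+ }
+ with open("course_competency_alignment_de.jsonl", "a", encoding="utf-8") as f:
+     f.write(json.dumps(record, ensure_ascii=False) + "\n")
+ ```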
+
+ ### Finetuning Process
+
+ - **Hardware Used:**
+   - Single NVIDIA T4 GPU with 15 GB VRAM.
+ - **Duration:**
+   - 2000 data points: ~15 minutes.
+ - **Training Parameters:**
+   ```bash
+   torchrun --nproc_per_node 1 \
+     -m FlagEmbedding.baai_general_embedding.finetune.run \
+     --output_dir multilingual_e5_base_finetuned \
+     --model_name_or_path intfloat/multilingual-e5-base \
+     --train_data ./course_competency_alignment_de.jsonl \
+     --learning_rate 1e-5 \
+     --fp16 \
+     --num_train_epochs 5 \
+     --per_device_train_batch_size 4 \
+     --dataloader_drop_last True \
+     --normlized True \
+     --temperature 0.02 \
+     --query_max_len 512 \
+     --passage_max_len 64 \
+     --train_group_size 4 \
+     --negatives_cross_device \
+     --logging_steps 10 \
+     --save_steps 1500 \
+     --query_instruction_for_retrieval ""
+   ```
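+
+ The sequence-length settings mirror the data: `--query_max_len 512` accommodates long course descriptions, while `--passage_max_len 64` is enough for short skill labels. The low `--temperature 0.02` sharpens the contrastive loss over each group of `--train_group_size 4` passages per query.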

+ ## Model Details

+ ### Model Description

+ - **Model Type:** Sentence Transformer
+ - **Base Model:** [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Language:** German
+ - **License:** MIT

+ ### Full Model Architecture

+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
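+
+ As the architecture shows, sentence embeddings are taken from the CLS token and then L2-normalized by the final `Normalize()` module, so cosine similarity and dot product yield identical rankings on the output vectors.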

  ### Framework Versions
+
  - Python: 3.10.12
  - Sentence Transformers: 3.0.1
  - Transformers: 4.41.2
  - Datasets: 2.20.0
  - Tokenizers: 0.19.1

+ ### Acknowledgments
+
+ Special thanks to the contributors from the **Institut für Interaktive Systeme**, **Kursportal Schleswig-Holstein**, **Weiterbildung Hessen eV**, **MyEduLife**, and **Trainspot** for their invaluable support and contributions to the dataset and the finetuning process.
+
+ **Funding:**
+ This project was funded by the **Federal Ministry of Education and Research**.

+ <!-- ## Citation -->
+
+ <!-- ### BibTeX -->

  <!--
  ## Glossary

  *Clearly define terms in order to be accessible across audiences.*
  -->

  ## Model Card Authors

+ Pascal Hürten, [email protected]

  <!--
  ## Model Card Contact