---
license: apache-2.0
datasets:
- Hailay/TigQA
pipeline_tag: sentence-similarity
---

# Geez Word2Vec Skipgram Model

This repository contains a Word2Vec model trained on the TIGQA dataset using a custom tokenizer built with SpaCy.
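
For reference, a skip-gram Word2Vec model of this kind is typically trained with `gensim` along the following lines. This is only a minimal sketch, not the actual training script: the two placeholder sentences and the generic `spacy.blank("xx")` pipeline stand in for the TIGQA corpus and the custom SpaCy tokenizer, whose details are not included in this README.

```python
import spacy
from gensim.models import Word2Vec

# Stand-in tokenizer: spaCy's generic multi-language pipeline ("xx");
# the real model used a custom tokenizer for Geez-script Tigrinya
nlp = spacy.blank("xx")

# Illustrative placeholder sentences; the real corpus is the TIGQA dataset
raw_sentences = ["α°α₯ ααα α°αα", "ααα αα α£α
αͺ"]
tokenized = [[token.text for token in nlp(sentence)] for sentence in raw_sentences]

# sg=1 selects the skip-gram architecture; the hyperparameters are assumptions
model = Word2Vec(sentences=tokenized, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=10)
model.save("Geez_word2vec_skipgram.model")
```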

## Usage

Install the required dependencies (`gensim`; the visualization example below also uses `matplotlib`, `scikit-learn`, and `numpy`). You can then download and use the model in your Python code as follows. Note that `Word2Vec.load` expects a local file path, so the model file is downloaded first:

```python
import urllib.request

from gensim.models import Word2Vec

# URL of the model file on Hugging Face
model_url = "https://huggingface.co/Hailay/Geez_word2vec_skipgram.model/resolve/main/Geez_word2vec_skipgram.model"

# Download the model file to a local path, then load it
# (Word2Vec.load reads local files, not URLs)
local_path, _ = urllib.request.urlretrieve(model_url, "Geez_word2vec_skipgram.model")
model = Word2Vec.load(local_path)

# Get a vector for a word
word_vector = model.wv['α°α₯']
print(f"Vector for 'α°α₯': {word_vector}")

# Find the most similar words
similar_words = model.wv.most_similar('α°α₯')
print(f"Words similar to 'α°α₯': {similar_words}")
```
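
Alternatively, the file can be fetched with the Hugging Face Hub client instead of a hard-coded URL. This is a minimal sketch, assuming the repo id `Hailay/Geez_word2vec_skipgram.model` and the filename taken from the URL above; if the checkpoint was saved with auxiliary `.npy` array files, those must be downloaded alongside it:

```python
from gensim.models import Word2Vec
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download the model file into the local Hugging Face cache and get its path
local_path = hf_hub_download(
    repo_id="Hailay/Geez_word2vec_skipgram.model",
    filename="Geez_word2vec_skipgram.model",
)
model = Word2Vec.load(local_path)
```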

## Visualizing Word Vectors

You can visualize the word vectors using t-SNE:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import numpy as np

# Words to visualize; you can swap in other words from the trained vocabulary
words = ['α°α₯', 'ααα', 'α°αα', 'ααα', 'αα', 'α£α
αͺ']

# Get the vectors for the words
word_vectors = np.array([model.wv[word] for word in words])

# Reduce dimensionality using t-SNE; perplexity must stay below the number
# of samples, hence the cap for this small word list
perplexity_value = min(5, len(words) - 1)
tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=0)
word_vectors_2d = tsne.fit_transform(word_vectors)

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], edgecolors='k', c='r')

# Add annotations to the points
for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')

plt.title('2D Visualization of Word2Vec Embeddings')
plt.xlabel('TSNE Component 1')
plt.ylabel('TSNE Component 2')
plt.grid(True)
plt.show()
```
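
One practical caveat: matplotlib's default fonts have no Ethiopic coverage, so the Geez labels may render as empty boxes. A minimal workaround, assuming a font with Ethiopic glyphs such as "Abyssinica SIL" is installed on your system:

```python
import matplotlib

# Point matplotlib at an installed font that covers the Ethiopic block
matplotlib.rcParams["font.family"] = "Abyssinica SIL"
```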

## Dataset Source

The dataset used to train this model contains text in the Geez script of the Tigrinya language.
It is publicly available as part of an NLP resource for low-resource languages, for research and development.