---
license: apache-2.0
datasets:
- Hailay/TigQA
pipeline_tag: sentence-similarity
---

# Geez Word2Vec Skipgram Model

This repository contains a Word2Vec skip-gram model trained on the TIGQA dataset, with the text tokenized by a custom spaCy tokenizer.
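
For readers unfamiliar with the skip-gram variant named in the title: it learns vectors by predicting the words that surround each target word within a fixed window. The sketch below only illustrates how (target, context) training pairs are formed. The function name `skipgram_pairs`, the window size, and the placeholder tokens are assumptions made for illustration; this is not the training code used for this model.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clamp the context window to the sentence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is never its own context
                pairs.append((target, tokens[j]))
    return pairs


# Placeholder tokens (hypothetical); a real pipeline would pass tokenizer output.
tokens = ["w1", "w2", "w3", "w4"]
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)
```

Real implementations such as gensim add refinements on top of this basic pairing, so treat the sketch as a conceptual aid only.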

## Usage

You can download and use the model in your Python code as follows:

```python
from gensim.models import Word2Vec

# URL of the model file on Hugging Face
model_url = "https://huggingface.co/Hailay/Geez_word2vec_skipgram.model/resolve/main/Geez_word2vec_skipgram.model"

# Load the trained Word2Vec model directly from the URL
model = Word2Vec.load(model_url)

# Get a vector for a word
word_vector = model.wv['ሰብ']
print(f"Vector for 'ሰብ': {word_vector}")

# Find the most similar words
similar_words = model.wv.most_similar('ሰብ')
print(f"Words similar to 'ሰብ': {similar_words}")

```

## Visualizing Word Vectors

You can visualize the word vectors using t-SNE:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import numpy as np

# Words to visualize; replace them with any words from the trained vocabulary
words = ['ሰብ', 'ዓለም', 'ሰላም', 'ሓይሊ', 'ጊዜ', 'ባህሪ']

# Get the vectors for the words
word_vectors = np.array([model.wv[word] for word in words])

# Reduce dimensionality using t-SNE; perplexity must stay below the number of
# samples, so cap it for this small word list
perplexity_value = min(5, len(words) - 1)
tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=0)
word_vectors_2d = tsne.fit_transform(word_vectors)

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], edgecolors='k', c='r')

# Add annotations to the points
for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')

plt.title('2D Visualization of Word2Vec Embeddings')
plt.xlabel('TSNE Component 1')
plt.ylabel('TSNE Component 2')
plt.grid(True)
plt.show()
```


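A t-SNE plot preserves local neighbourhoods rather than exact distances, so it is only a rough guide to how close two words are. For a numeric comparison, cosine similarity on the original vectors is more reliable, and it is the measure gensim's `most_similar` ranks by. The following is a minimal sketch of that computation in plain Python; the helper names and the toy two-dimensional vectors are hypothetical and are not taken from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors):
    """Sort every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy vectors (hypothetical values, NOT real embeddings from this model).
toy_vectors = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.0, 1.0],
}

for word, score in rank_by_similarity("w1", toy_vectors):
    print(f"{word}: {score:.3f}")
```

With the loaded model, the same comparison for a single pair is available as `model.wv.similarity(word_a, word_b)`, where both words must be in the trained vocabulary.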
## Dataset Source

The training data consists of Tigrinya-language text written in the Geez script.
It is publicly available as part of an NLP resource for low-resource languages, intended for research and development.

The data comes from the TIGQA dataset (see https://zenodo.org/records/11423987) and from HornMT.

## License

This Word2Vec model and its associated files are released under the MIT License.