---
license: apache-2.0
datasets:
- Hailay/TigQA
pipeline_tag: sentence-similarity
---

# Geez Word2Vec Skipgram Model

This repository contains a Word2Vec model trained on the TIGQA dataset using a custom tokenizer built with SpaCy.
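
For reference, a skip-gram Word2Vec model of this kind is typically trained with `gensim` along the following lines. This is only a minimal sketch, not the actual training script: the two placeholder sentences and the generic `spacy.blank("xx")` pipeline stand in for the TIGQA corpus and the custom SpaCy tokenizer, whose details are not included in this README.

```python
import spacy
from gensim.models import Word2Vec

# Stand-in tokenizer: spaCy's generic multi-language pipeline ("xx");
# the real model used a custom tokenizer for Geez-script Tigrinya
nlp = spacy.blank("xx")

# Illustrative placeholder sentences; the real corpus is the TIGQA dataset
raw_sentences = ["α°α₯ ααα α°αα", "ααα αα α£α
αͺ"]
tokenized = [[token.text for token in nlp(sentence)] for sentence in raw_sentences]

# sg=1 selects the skip-gram architecture; the hyperparameters are assumptions
model = Word2Vec(sentences=tokenized, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=10)
model.save("Geez_word2vec_skipgram.model")
```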

## Usage

Install the required dependencies (`gensim`; the visualization example below also uses `matplotlib`, `scikit-learn`, and `numpy`). You can then download and use the model in your Python code as follows. Note that `Word2Vec.load` expects a local file path, so the model file is downloaded first:

```python
import urllib.request

from gensim.models import Word2Vec

# URL of the model file on Hugging Face
model_url = "https://huggingface.co/Hailay/Geez_word2vec_skipgram.model/resolve/main/Geez_word2vec_skipgram.model"

# Download the model file to a local path, then load it
# (Word2Vec.load reads local files, not URLs)
local_path, _ = urllib.request.urlretrieve(model_url, "Geez_word2vec_skipgram.model")
model = Word2Vec.load(local_path)

# Get a vector for a word
word_vector = model.wv['α°α₯']
print(f"Vector for 'α°α₯': {word_vector}")

# Find the most similar words
similar_words = model.wv.most_similar('α°α₯')
print(f"Words similar to 'α°α₯': {similar_words}")
```
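
Alternatively, the file can be fetched with the Hugging Face Hub client instead of a hard-coded URL. This is a minimal sketch, assuming the repo id `Hailay/Geez_word2vec_skipgram.model` and the filename taken from the URL above; if the checkpoint was saved with auxiliary `.npy` array files, those must be downloaded alongside it:

```python
from gensim.models import Word2Vec
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download the model file into the local Hugging Face cache and get its path
local_path = hf_hub_download(
    repo_id="Hailay/Geez_word2vec_skipgram.model",
    filename="Geez_word2vec_skipgram.model",
)
model = Word2Vec.load(local_path)
```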

## Visualizing Word Vectors

You can visualize the word vectors using t-SNE:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import numpy as np

# Words to visualize; you can swap in other words from the trained vocabulary
words = ['α°α₯', 'ααα', 'α°αα', 'ααα', 'αα', 'α£α
αͺ']

# Get the vectors for the words
word_vectors = np.array([model.wv[word] for word in words])

# Reduce dimensionality using t-SNE; perplexity must stay below the number
# of samples, hence the cap for this small word list
perplexity_value = min(5, len(words) - 1)
tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=0)
word_vectors_2d = tsne.fit_transform(word_vectors)

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], edgecolors='k', c='r')

# Add annotations to the points
for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')

plt.title('2D Visualization of Word2Vec Embeddings')
plt.xlabel('TSNE Component 1')
plt.ylabel('TSNE Component 2')
plt.grid(True)
plt.show()
```
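
One practical caveat: matplotlib's default fonts have no Ethiopic coverage, so the Geez labels may render as empty boxes. A minimal workaround, assuming a font with Ethiopic glyphs such as "Abyssinica SIL" is installed on your system:

```python
import matplotlib

# Point matplotlib at an installed font that covers the Ethiopic block
matplotlib.rcParams["font.family"] = "Abyssinica SIL"
```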

## Dataset Source

The dataset used to train this model contains text in the Geez script of the Tigrinya language.
It is publicly available as part of an NLP resource for low-resource languages, for research and development.