---
datasets:
- Hailay/TigQA
language:
- ti
---
# Geez Word2Vec Model
This repository contains a Word2Vec model trained on the TIGQA dataset, using a custom spaCy tokenizer for preprocessing.
## Model Description
The Word2Vec model in this repository generates word embeddings for Tigrinya text written in the Geez script. It captures semantic relationships between words based on their context in the TIGQA dataset.
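The training script itself is not included in this repository; the following is a minimal sketch of how a comparable skip-gram model could be trained with gensim, assuming the TIGQA text has been tokenized with a blank spaCy pipeline. The file name `tigqa_sentences.txt` and all hyperparameter values are illustrative, not the exact settings used for the released model.

```python
import spacy
from gensim.models import Word2Vec

# Blank pipeline: only the tokenizer is used. If "ti" is not available in the
# installed spaCy version, the multilingual placeholder "xx" can be used instead.
nlp = spacy.blank("ti")

# Hypothetical input file: one Geez-script Tigrinya sentence per line.
with open("tigqa_sentences.txt", encoding="utf-8") as f:
    sentences = [
        [token.text for token in nlp(line.strip())]
        for line in f
        if line.strip()
    ]

# sg=1 selects the skip-gram architecture; the other parameters are illustrative.
model = Word2Vec(
    sentences,
    vector_size=100,
    window=5,
    min_count=2,
    sg=1,
    workers=4,
)
model.save("Geez_word2vec_skipgram.model")
```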
## Usage
To use the trained Word2Vec model, follow these steps:
- Clone this repository to your local machine.
- Install the required dependencies (`spacy`, `gensim`).
- Load the model using the provided Python code.
- Use the model to generate word embeddings for Geez-script Tigrinya text.
Example usage:
```python
from gensim.models import Word2Vec

# Load the trained Word2Vec model
model = Word2Vec.load("Geez_word2vec_skipgram.model")

# Get the vector for a word
word_vector = model.wv['ሰብ']
print(f"Vector for 'ሰብ': {word_vector}")

# Find the most similar words
similar_words = model.wv.most_similar('ሰብ')
print(f"Words similar to 'ሰብ': {similar_words}")
```
## Dataset Source
The TIGQA dataset used to train this model contains Tigrinya text written in the Geez script. It is publicly available as an NLP resource for low-resource language research and development. For more information about the TIGQA dataset, see the Zenodo record: https://zenodo.org/records/11423987
## License
This Word2Vec model and its associated files are released under the MIT License.