HelpMum-Personal committed
Update README.md

README.md CHANGED
---
library_name: transformers
license: mit
base_model: facebook/m2m100_418M
tags:
- translation
- generated_from_trainer
model-index:
- name: m2m100_418M-nig-en
  results: []
language:
- yo
- ig
- ha
- en
pipeline_tag: translation
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# ai-translator-eng-to-9ja

This model is a 418-million-parameter translation model built to translate English into Yoruba, Igbo, and Hausa. It was trained on a dataset of 1,500,000 sentences (500,000 per language) and provides high-quality translations for these languages.
It was built to make it easier to communicate with LLMs in Igbo, Hausa, and Yoruba.

## Model Details

- **Languages Supported**:
  - Source Language: English
  - Target Languages: Yoruba, Igbo, Hausa

### Model Usage

To use this model for translation tasks, you can load it from Hugging Face’s `transformers` library:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load the fine-tuned model and its tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("HelpMum-Personal/ai-translator-eng-to-9ja")
tokenizer = M2M100Tokenizer.from_pretrained("HelpMum-Personal/ai-translator-eng-to-9ja")

# Translate English to Igbo
eng_text = "Healthcare is an important field in virtually every society because it directly affects the well-being and quality of life of individuals. It encompasses a wide range of services and professions, including preventive care, diagnosis, treatment, and management of diseases and conditions."
tokenizer.src_lang = "en"
encoded_eng = tokenizer(eng_text, return_tensors="pt")
# forced_bos_token_id tells the decoder which target language to generate
generated_tokens = model.generate(**encoded_eng, forced_bos_token_id=tokenizer.get_lang_id("ig"))
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))

# Translate English to Yoruba
eng_text = "Healthcare is an important field in virtually every society because it directly affects the well-being and quality of life of individuals. It encompasses a wide range of services and professions, including preventive care, diagnosis, treatment, and management of diseases and conditions. Effective healthcare systems aim to improve health outcomes, reduce the incidence of illness, and ensure that individuals have access to necessary medical services."
encoded_eng = tokenizer(eng_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_eng, forced_bos_token_id=tokenizer.get_lang_id("yo"))
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))

# Translate English to Hausa
eng_text = "Healthcare is an important field in virtually every society because it directly affects the well-being and quality of life of individuals. It encompasses a wide range of services and professions, including preventive care, diagnosis, treatment, and management of diseases and conditions. Effective healthcare systems aim to improve health outcomes, reduce the incidence of illness, and ensure that individuals have access to necessary medical services."
encoded_eng = tokenizer(eng_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_eng, forced_bos_token_id=tokenizer.get_lang_id("ha"))
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```
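The same checkpoint can also be driven through the higher-level `pipeline` API. This is a minimal sketch, not part of the original card, assuming the standard `translation` pipeline with its `src_lang`/`tgt_lang` arguments:

```python
from transformers import pipeline

# Sketch: wrap the checkpoint in a translation pipeline; src_lang/tgt_lang
# select the language pair, like the tokenizer-level API above.
translator = pipeline(
    "translation",
    model="HelpMum-Personal/ai-translator-eng-to-9ja",
    src_lang="en",
    tgt_lang="yo",
)
print(translator("Healthcare is an important field in virtually every society.")[0]["translation_text"])
```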
### Supported Language Codes

- **English**: `en`
- **Yoruba**: `yo`
- **Igbo**: `ig`
- **Hausa**: `ha`

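These codes plug directly into `tokenizer.get_lang_id`, so a small helper keeps the per-language boilerplate out of calling code. A minimal sketch (the `translate` name is ours, not part of the original card):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("HelpMum-Personal/ai-translator-eng-to-9ja")
tokenizer = M2M100Tokenizer.from_pretrained("HelpMum-Personal/ai-translator-eng-to-9ja")

def translate(text: str, tgt_lang: str) -> str:
    """Translate English text into "yo", "ig", or "ha"."""
    tokenizer.src_lang = "en"
    encoded = tokenizer(text, return_tensors="pt")
    tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.batch_decode(tokens, skip_special_tokens=True)[0]

print(translate("Good morning, how are you today?", "ha"))
```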
### Training Dataset

The training dataset consists of 1,500,000 translation pairs, sourced from a combination of open-source parallel corpora and curated datasets specific to Yoruba, Igbo, and Hausa.

## Limitations

- While the model performs well on English-to-Yoruba, Igbo, and Hausa translation, performance may vary with the complexity and domain of the text.
- Translation quality may decrease for extremely long sentences or ambiguous contexts; splitting long inputs can help (see the sketch below).

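One workaround for the long-input limitation, again not part of the original card, is to split the text into sentences and translate them one at a time, reusing the `translate` helper sketched above:

```python
import re

def translate_long(text: str, tgt_lang: str) -> str:
    # Naive split on sentence-ending punctuation; a proper sentence
    # segmenter would be more robust in production.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(translate(s, tgt_lang) for s in sentences if s)
```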
### Training hyperparameters

The following hyperparameters were used during training:
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

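For orientation, these settings map onto `Seq2SeqTrainingArguments` roughly as in the sketch below; values the card does not list (learning rate, batch sizes) are deliberately left out, and `output_dir` is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: mirrors the hyperparameters listed above.
args = Seq2SeqTrainingArguments(
    output_dir="m2m100_418M-nig-en",  # placeholder path
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3,
)
```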
### Framework versions

- Transformers 4.44.2
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1