Update README_English.md
README_English.md (+13 -10)
License: MIT
---

**Model description**

Model developed with OpenNMT for the English-Galician pair using the transformer architecture.
**How to translate**

+ Open a bash terminal
+ Install [Python 3.9](https://www.python.org/downloads/release/python-390/)
+ Install [Open NMT toolkit v.2.2](https://github.com/OpenNMT/OpenNMT-py)
+ Translate an input_text using the NOS-MT-en-gl model with the following command:

```bash
onmt_translate -src input_text -model NOS-MT-en-gl -output ./output_file.txt -replace_unk -gpu 0
```

+ The resulting translation will be in the path indicated by the -output flag.
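Taken together, a minimal end-to-end sketch of the steps above might look as follows. The PyPI pin OpenNMT-py==2.2.0, the virtual-environment name, and the sample sentence are assumptions for illustration; the onmt_translate line repeats the command from this card:

```bash
# Set up an isolated environment and install the toolkit (assumed PyPI pin for v2.2).
python3.9 -m venv nos-mt && source nos-mt/bin/activate
pip install OpenNMT-py==2.2.0

# input_text holds one source sentence per line.
printf 'Hello, world.\n' > input_text

# Translate with the downloaded checkpoint; omit -gpu 0 to run on CPU.
onmt_translate -src input_text -model NOS-MT-en-gl -output ./output_file.txt -replace_unk -gpu 0

cat output_file.txt
```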
**Training**

To train this model, we have used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora).

Authentic corpora are corpora produced by human translators. Synthetic corpora are Spanish-Portuguese translations, which have been converted to Spanish-Galician by means of Portuguese-Galician translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
**Training process**

+ Tokenisation was performed with a modified version of the [linguakit](https://github.com/citiususc/Linguakit) tokeniser (tokenizer.pl) that does not append a newline after each token.
+ All BPE models were generated with the script [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py) (see the sketch after the training commands below).
+ Using the .yaml file in this repository, it is possible to replicate the original training process. Before training the model, please verify that the path to each target (tgt) and source (src) file is correct. Once this is done, proceed as follows:

```bash
onmt_build_vocab -config bpe-en-gl_emb.yaml -n_sample 100000
onmt_train -config bpe-en-gl_emb.yaml
```
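As a reference for the BPE bullet above, learn_bpe.py follows the subword-nmt command-line conventions (-i/-o/-s), and apply_bpe.py from the same tools directory applies the learned merges. The file names and the 32,000-merge count below are illustrative assumptions, not the values used for this model:

```bash
# Learn a BPE merge table from a tokenised training file (hypothetical names).
python learn_bpe.py -i train.tok.en -o bpe.codes.en -s 32000

# Apply the learned merges to produce the BPE-segmented text used for training.
python apply_bpe.py -c bpe.codes.en -i train.tok.en -o train.bpe.en
```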
**Hyperparameters**

You may find the parameters used for this model inside the file bpe-en-gl_emb.yaml.
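For orientation, the data section of an OpenNMT-py 2.x config of this kind is laid out roughly as in the sketch below. Every path and corpus name here is an illustrative assumption; bpe-en-gl_emb.yaml remains the authoritative source of the actual values:

```bash
# Write an illustrative excerpt of an OpenNMT-py 2.x config for inspection;
# compare the path_src/path_tgt entries against the real bpe-en-gl_emb.yaml.
cat <<'EOF' > config-excerpt-example.yaml
save_data: run/nos-mt-en-gl
src_vocab: run/vocab.src
tgt_vocab: run/vocab.tgt
data:
    corpus_1:
        path_src: data/train.bpe.en   # source (src) file
        path_tgt: data/train.bpe.gl   # target (tgt) file
    valid:
        path_src: data/dev.bpe.en
        path_tgt: data/dev.bpe.gl
EOF
```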
**Evaluation**