radinplaid committed · Commit e693b59 · verified · 1 Parent(s): de500bf

Upload folder using huggingface_hub

Files changed (8)
  1. README.md +100 -3
  2. config.json +10 -0
  3. eole-config.yaml +100 -0
  4. model.bin +3 -0
  5. source_vocabulary.json +0 -0
  6. src.spm.model +3 -0
  7. target_vocabulary.json +0 -0
  8. tgt.spm.model +3 -0
README.md CHANGED
@@ -1,3 +1,100 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ language:
+ - zh
+ - en
+ tags:
+ - translation
+ license: cc-by-4.0
+ datasets:
+ - quickmt/quickmt-train.zh-en
+ model-index:
+ - name: quickmt-en-zh
+   results:
+   - task:
+       name: Translation eng-zho
+       type: translation
+       args: eng-zho
+     dataset:
+       name: flores101-devtest
+       type: flores_101
+       args: eng_Latn zho_Hans devtest
+     metrics:
+     - name: CHRF
+       type: chrf
+       value: 34.53
+ ---
+
+
+ # `quickmt-en-zh` Neural Machine Translation Model
+
+ # Usage
+
+ ## Install `quickmt`
+
+ ```bash
+ git clone https://github.com/quickmt/quickmt.git
+ pip install ./quickmt/
+ ```
+
+ ## Download model
+
+ ```bash
+ quickmt-model-download quickmt/quickmt-en-zh ./quickmt-en-zh
+ ```
+
+ ## Use model
+
+ Inference with `quickmt`:
+
+ ```python
+ from quickmt import Translator
+
+ # Auto-detects GPU, set to "cpu" to force CPU inference
+ t = Translator("./quickmt-en-zh/", device="auto")
+
+ # Translate - set beam size to 5 for higher quality (but slower speed)
+ t(["The roe deer (Capreolus capreolus), also known as the roe, western roe deer,[3][4] or European roe,[3] is a species of deer."], beam_size=1)
+
+ # Get alternative translations by sampling
+ # You can pass any CTranslate2 `translate_batch` arguments
+ t(["The roe deer (Capreolus capreolus), also known as the roe, western roe deer,[3][4] or European roe,[3] is a species of deer."], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+ ```
+
+ The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use the model files directly if you want. It would be fairly easy to get them to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`.
+
+ # Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+   - Trained for 82k steps with an effective batch size of 49152, which took less than 1 day on a single RTX 4090 on [vast.ai](https://cloud.vast.ai)
+ * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
+ * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main
+ * Separate source and target SentencePiece tokenizers (size 32k)
+ * Transformer "Big"
+   - 241,870,080 parameters
+   - 8 encoder layers and 2 decoder layers
+   - Gated-silu activations
+   - Trained and saved in bfloat16
+
+ See `eole-config.yaml` for more detail.
+
+ ## Metrics
+
+ CHRF2 calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the Flores200 `devtest` test set ("eng_Latn"->"zho_Hans").
+
+ "GPU Time" is the time to translate the flores-devtest corpus using a batch size of 32 on a GTX 1080 GPU. "CPU Time" is the time to translate the following input with a single CPU core:
+
+ > James Joyce (2 February 1882 – 13 January 1941) was an Irish novelist, poet and literary critic who contributed to the modernist avant-garde movement and is regarded as one of the most influential and important writers of the 20th century.
+
+ | Model                            | chrf2 | comet22 | CPU Time (s) | GPU Time (s) |
+ | -------------------------------- | ----- | ------- | ------------ | ------------ |
+ | quickmt/quickmt-en-zh            | 34.53 | 0.8512  | 1.91         | 3.92         |
+ | Helsinki-NLP/opus-mt-en-zh       | 29.20 | 0.8236  | 1.50         | 10.10        |
+ | facebook/m2m100_418M             | 26.63 | 0.7376  | 10.2         | 49.02        |
+ | facebook/nllb-200-distilled-600M | 24.68 | 0.7840  | 13.2         | 55.92        |
+
+ `quickmt-en-zh` is the highest quality and is the fastest on GPU (and not far behind on CPU).
+
+ Helsinki-NLP/opus-mt-en-zh is one of the most downloaded machine translation models on HuggingFace, and this model is considerably more accurate *and* similar in speed.
+
+
+
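Note on the README's statement that the model files can be used directly: the following is a minimal sketch, assuming the `ctranslate2` and `sentencepiece` Python packages are installed and that the download folder contains `model.bin`, `src.spm.model`, and `tgt.spm.model` as in this commit. The helper names `load_components` and `translate` are hypothetical and not part of `quickmt`. It is also an assumption that the source pieces need no extra end-of-sentence token appended; compare against `quickmt`'s own `Translator` before relying on the output.

```python
from pathlib import Path


def load_components(model_dir, device="cpu"):
    """Load the CTranslate2 model and both SentencePiece tokenizers (hypothetical helper)."""
    # Imported lazily so the pure-Python part below works without these packages.
    import ctranslate2
    import sentencepiece as spm

    model_dir = Path(model_dir)
    translator = ctranslate2.Translator(str(model_dir), device=device)
    src_sp = spm.SentencePieceProcessor(model_file=str(model_dir / "src.spm.model"))
    tgt_sp = spm.SentencePieceProcessor(model_file=str(model_dir / "tgt.spm.model"))
    return translator, src_sp, tgt_sp


def translate(texts, translator, src_sp, tgt_sp, beam_size=5, **kwargs):
    """Tokenize, translate, and detokenize a list of sentences (hypothetical helper)."""
    # Assumption: plain SentencePiece pieces are passed, with no added </s> token.
    source_pieces = [src_sp.encode(text, out_type=str) for text in texts]
    results = translator.translate_batch(source_pieces, beam_size=beam_size, **kwargs)
    # Each result holds one or more hypotheses; take the best-scoring one.
    return [tgt_sp.decode_pieces(result.hypotheses[0]) for result in results]


# Usage, once the model has been downloaded to ./quickmt-en-zh:
#   translator, src_sp, tgt_sp = load_components("./quickmt-en-zh")
#   print(translate(["The roe deer is a species of deer."], translator, src_sp, tgt_sp))
```

Keeping loading and translation separate means the tokenizers and translator are created once and reused across batches.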
config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "add_source_bos": false,
+   "add_source_eos": false,
+   "bos_token": "<s>",
+   "decoder_start_token": "<s>",
+   "eos_token": "</s>",
+   "layer_norm_epsilon": 1e-06,
+   "multi_query_attention": false,
+   "unk_token": "<unk>"
+ }
eole-config.yaml ADDED
@@ -0,0 +1,100 @@
+ ## IO
+ save_data: en_zh/data_spm
+ overwrite: True
+ seed: 1234
+ report_every: 100
+ valid_metrics: ["BLEU"]
+ tensorboard: true
+ tensorboard_log_dir: tensorboard
+
+ ### Vocab
+ src_vocab: en-zh/src.eole.vocab
+ tgt_vocab: en-zh/tgt.eole.vocab
+ src_vocab_size: 32000
+ tgt_vocab_size: 32000
+ vocab_size_multiple: 8
+ share_vocab: False
+ n_sample: 0
+
+ data:
+   corpus_1:
+     path_tgt: hf://quickmt/quickmt-train-zh-en/zh
+     path_src: hf://quickmt/quickmt-train-zh-en/en
+     path_sco: hf://quickmt/quickmt-train-zh-en/sco
+   valid:
+     path_src: en-zh/dev.eng
+     path_tgt: en-zh/dev.zho
+
+ transforms: [sentencepiece, filtertoolong]
+ transforms_configs:
+   sentencepiece:
+     src_subword_model: "en-zh/src.spm.model"
+     tgt_subword_model: "en-zh/tgt.spm.model"
+   filtertoolong:
+     src_seq_length: 512
+     tgt_seq_length: 512
+
+ training:
+   # Run configuration
+   model_path: en-zh/model
+   keep_checkpoint: 4
+   save_checkpoint_steps: 2000
+   train_steps: 200000
+   valid_steps: 2000
+
+   # Train on a single GPU
+   world_size: 1
+   gpu_ranks: [0]
+
+   # Batching
+   batch_type: "tokens"
+   batch_size: 8192
+   valid_batch_size: 8192
+   batch_size_multiple: 8
+   accum_count: [6]
+   accum_steps: [0]
+
+   # Optimizer & Compute
+   compute_dtype: "bfloat16"
+   optim: "pagedadamw8bit"
+   learning_rate: 1.0
+   warmup_steps: 10000
+   decay_method: "noam"
+   adam_beta2: 0.998
+
+   # Data loading
+   bucket_size: 262144
+   num_workers: 8
+   prefetch_factor: 100
+
+   # Hyperparams
+   dropout_steps: [0]
+   dropout: [0.1]
+   attention_dropout: [0.1]
+   max_grad_norm: 0
+   label_smoothing: 0.1
+   average_decay: 0.0001
+   param_init_method: xavier_uniform
+   normalization: "tokens"
+
+ model:
+   architecture: "transformer"
+   layer_norm: standard
+   share_embeddings: false
+   share_decoder_embeddings: true
+   add_ffnbias: true
+   mlp_activation_fn: gated-silu
+   add_estimator: false
+   add_qkvbias: false
+   norm_eps: 1e-6
+   hidden_size: 1024
+   encoder:
+     layers: 8
+   decoder:
+     layers: 2
+   heads: 16
+   transformer_ff: 4096
+   embeddings:
+     word_vec_size: 1024
+     position_encoding_type: "SinusoidalInterleaved"
+
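Note relating this config to two figures quoted in the README (effective batch size 49152 and 241,870,080 parameters): the sketch below recomputes both from the values in `eole-config.yaml`. The per-layer breakdown is a reconstruction and an assumption, not taken from `eole` itself: attention projections without biases (`add_qkvbias: false`), a three-matrix gated feed-forward block with biases (`add_ffnbias: true`), layer norms with scale and bias, no learned position embeddings, and an output layer whose weights are tied to the target embeddings but which keeps its own bias. The function names are hypothetical. Under those assumptions the numbers line up with the README.

```python
def effective_batch_size(batch_size, accum_count, world_size):
    """Tokens contributing to one optimizer step."""
    return batch_size * accum_count * world_size


def transformer_param_count(src_vocab, tgt_vocab, hidden, ff, enc_layers, dec_layers):
    """Rough parameter count under the assumptions stated in the note above."""
    layer_norm = 2 * hidden                    # scale + bias
    attention = 4 * hidden * hidden            # Q, K, V, output projections, no biases
    ffn = 3 * hidden * ff + (2 * ff + hidden)  # gate, up, down matrices + their biases
    encoder_layer = attention + ffn + 2 * layer_norm
    decoder_layer = 2 * attention + ffn + 3 * layer_norm  # self- and cross-attention
    embeddings = (src_vocab + tgt_vocab) * hidden          # separate source/target tables
    final_norms = 2 * layer_norm                           # one after each stack
    generator_bias = tgt_vocab                             # weights tied to target embeddings
    return (
        embeddings
        + enc_layers * encoder_layer
        + dec_layers * decoder_layer
        + final_norms
        + generator_bias
    )


# Values copied from eole-config.yaml in this commit.
print(effective_batch_size(batch_size=8192, accum_count=6, world_size=1))
print(transformer_param_count(32000, 32000, hidden=1024, ff=4096, enc_layers=8, dec_layers=2))
```

Because the decoder has only 2 layers against the encoder's 8, most of the non-embedding parameters sit in the encoder, which fits the README's emphasis on fast inference.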
model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:346e81879b33a777f74eeac9ed1e1c17fcb7b5baa943cea1a1114adb10fd5190
+ size 493941910
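Note on the pointer above: `model.bin` is stored through Git LFS, so the diff shows only a three-line pointer (spec version, SHA-256 object id, size in bytes). A minimal sketch of how a downloaded file could be checked against such a pointer, using only the standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the pointer text is passed without the diff's leading `+` markers.

```python
import hashlib


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and hash recorded in the pointer."""
    hasher = hashlib.new(pointer["algorithm"])
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large model files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


MODEL_BIN_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:346e81879b33a777f74eeac9ed1e1c17fcb7b5baa943cea1a1114adb10fd5190
size 493941910
"""

pointer = parse_lfs_pointer(MODEL_BIN_POINTER)
print(pointer["algorithm"], pointer["size"])
```

After downloading, `matches_pointer("./quickmt-en-zh/model.bin", pointer)` would confirm that the local file is the one this commit refers to.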
source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
src.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c373f1d78753313b0dbc337058bf8450e1fdd6fe662a49e0941affce44ec14c5
+ size 800955
target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23d03d562fc3f8fe57e497dac0ece4827c254675a80c103fc4bb4040638ceb67
+ size 733978