Upload folder using huggingface_hub

Files changed (8)

README.md +100 -3
config.json +10 -0
eole-config.yaml +100 -0
model.bin +3 -0
source_vocabulary.json +0 -0
src.spm.model +3 -0
target_vocabulary.json +0 -0
tgt.spm.model +3 -0

README.md CHANGED

@@ -1,3 +1,100 @@
----
-license: cc-by-4.0
----

+---
+language:
+- zh
+- en
+tags:
+- translation
+license: cc-by-4.0
+datasets:
+- quickmt/quickmt-train.zh-en
+model-index:
+- name: quickmt-en-zh
+  results:
+  - task:
+      name: Translation eng-zho
+      type: translation
+      args: eng-zho
+    dataset:
+      name: flores101-devtest
+      type: flores_101
+      args: eng_Latn zho_Hans devtest
+    metrics:
+    - name: CHRF
+      type: chrf
+      value: 34.53
+---
+# `quickmt-en-zh` Neural Machine Translation Model
+# Usage
+## Install `quickmt`
+```bash
+git clone https://github.com/quickmt/quickmt.git
+pip install ./quickmt/
+```
+## Download model
+```bash
+quickmt-model-download quickmt/quickmt-en-zh ./quickmt-en-zh
+```
+## Use model
+Inference with `quickmt`:
+```python
+from quickmt import Translator
+# Auto-detects GPU, set to "cpu" to force CPU inference
+t = Translator("./quickmt-en-zh/", device="auto")
+# Translate - set beam size to 5 for higher quality (but slower speed)
+t(["The roe deer (Capreolus capreolus), also known as the roe, western roe deer,[3][4] or European roe,[3] is a species of deer."], beam_size=1)
+# Get alternative translations by sampling
+# You can pass any CTranslate2 `translate_batch` arguments
+t(["The roe deer (Capreolus capreolus), also known as the roe, western roe deer,[3][4] or European roe,[3] is a species of deer."], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+```
+The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use the model files directly if you want. It would be fairly easy to get them to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`.
+# Model Information
+* Trained using [`eole`](https://github.com/eole-nlp/eole)
+    - Trained for 82k steps with an effective batch size of 49,152 tokens (8,192 tokens × 6 gradient accumulation steps), which took less than 1 day on a single RTX 4090 on [vast.ai](https://cloud.vast.ai)
+* Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
+* Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main
+* Separate source and target SentencePiece tokenizers (size 32k)
+* Transformer "Big"
+    - 241,870,080 parameters
+    - 8 encoder layers and 2 decoder layers
+    - Gated-silu activations
+    - Trained and saved in bfloat16
+See `eole-config.yaml` for more detail.
+## Metrics
+CHRF2 calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the Flores200 `devtest` test set ("eng_Latn"->"zho_Hans").
+"GPU Time" is the time to translate the flores-devtest corpus using a batch size of 32 on a GTX 1080 GPU. "CPU Time" is the time to translate the following input with a single CPU core:
+> James Joyce (2 February 1882 – 13 January 1941) was an Irish novelist, poet and literary critic who contributed to the modernist avant-garde movement and is regarded as one of the most influential and important writers of the 20th century.
+| Model                            | chrf2 | comet22 | CPU Time (s) | GPU Time (s) |
+| -------------------------------- | ----- | ------- | ------------ | ------------ |
+| quickmt/quickmt-en-zh            | 34.53 | 0.8512  | 1.91         | 3.92         |
+| Helsinki-NLP/opus-mt-en-zh       | 29.20 | 0.8236  | 1.50         | 10.10        |
+| facebook/m2m100_418M             | 26.63 | 0.7376  | 10.2         | 49.02        |
+| facebook/nllb-200-distilled-600M | 24.68 | 0.7840  | 13.2         | 55.92        |
+`quickmt-en-zh` has the highest chrF2 and COMET-22 scores of the four models and is the fastest on GPU; on CPU it is the second fastest.
+Helsinki-NLP/opus-mt-en-zh is one of the most downloaded machine translation models on Hugging Face; this model scores 5.33 chrF2 points higher (34.53 vs. 29.20), is about 2.6 times faster on GPU, and is slightly slower on CPU (1.91 s vs. 1.50 s).
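
The README's remark that the model files can be used directly with `ctranslate2` and `sentencepiece` could look roughly like the sketch below. It is not `quickmt`'s own implementation: the `translate` helper is hypothetical, the file names are the ones added in this commit, and the folder is the one used in the download command. The commit does not say whether the source needs an explicit `</s>` token (`config.json` sets `add_source_eos` to false), so this sketch appends none; that is an assumption to check against `quickmt` itself.

```python
from pathlib import Path


def translate(texts, translator, sp_src, sp_tgt, **translate_kwargs):
    """Tokenize with the source SentencePiece model, translate with
    CTranslate2, and detokenize with the target SentencePiece model."""
    source_tokens = [sp_src.encode(text, out_type=str) for text in texts]
    # Extra keyword arguments (beam_size, sampling_topk, ...) are passed
    # straight through to ctranslate2.Translator.translate_batch.
    results = translator.translate_batch(source_tokens, **translate_kwargs)
    # Keep only the best hypothesis for each input sentence.
    return [sp_tgt.decode(result.hypotheses[0]) for result in results]


if __name__ == "__main__":
    # Folder created by `quickmt-model-download`; skipped if it is absent.
    model_dir = Path("./quickmt-en-zh")
    if model_dir.is_dir():
        import ctranslate2
        import sentencepiece as spm

        translator = ctranslate2.Translator(str(model_dir), device="cpu")
        sp_src = spm.SentencePieceProcessor(model_file=str(model_dir / "src.spm.model"))
        sp_tgt = spm.SentencePieceProcessor(model_file=str(model_dir / "tgt.spm.model"))
        print(translate(["The roe deer is a species of deer."],
                        translator, sp_src, sp_tgt, beam_size=5))
```

Passing the translator and tokenizers in as arguments keeps the helper independent of where the files live, so the same function works with any CTranslate2 model that ships separate source and target SentencePiece models.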

config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "add_source_bos": false,
+  "add_source_eos": false,
+  "bos_token": "<s>",
+  "decoder_start_token": "<s>",
+  "eos_token": "</s>",
+  "layer_norm_epsilon": 1e-06,
+  "multi_query_attention": false,
+  "unk_token": "<unk>"
+}

eole-config.yaml ADDED

	@@ -0,0 +1,100 @@

+## IO
+save_data: en_zh/data_spm
+overwrite: True
+seed: 1234
+report_every: 100
+valid_metrics: ["BLEU"]
+tensorboard: true
+tensorboard_log_dir: tensorboard
+### Vocab
+src_vocab: en-zh/src.eole.vocab
+tgt_vocab: en-zh/tgt.eole.vocab
+src_vocab_size: 32000
+tgt_vocab_size: 32000
+vocab_size_multiple: 8
+share_vocab: False
+n_sample: 0
+data:
+    corpus_1:
+        path_tgt: hf://quickmt/quickmt-train-zh-en/zh
+        path_src: hf://quickmt/quickmt-train-zh-en/en
+        path_sco: hf://quickmt/quickmt-train-zh-en/sco
+    valid:
+        path_src: en-zh/dev.eng
+        path_tgt: en-zh/dev.zho
+transforms: [sentencepiece, filtertoolong]
+transforms_configs:
+  sentencepiece:
+    src_subword_model: "en-zh/src.spm.model"
+    tgt_subword_model: "en-zh/tgt.spm.model"
+  filtertoolong:
+    src_seq_length: 512
+    tgt_seq_length: 512
+training:
+    # Run configuration
+    model_path: en-zh/model
+    keep_checkpoint: 4
+    save_checkpoint_steps: 2000
+    train_steps: 200000
+    valid_steps: 2000
+    # Train on a single GPU
+    world_size: 1
+    gpu_ranks: [0]
+    # Batching
+    batch_type: "tokens"
+    batch_size: 8192
+    valid_batch_size: 8192
+    batch_size_multiple: 8
+    accum_count: [6]
+    accum_steps: [0]
+    # Optimizer & Compute
+    compute_dtype: "bfloat16"
+    optim: "pagedadamw8bit"
+    learning_rate: 1.0
+    warmup_steps: 10000
+    decay_method: "noam"
+    adam_beta2: 0.998
+    # Data loading
+    bucket_size: 262144
+    num_workers: 8
+    prefetch_factor: 100
+    # Hyperparams
+    dropout_steps: [0]
+    dropout: [0.1]
+    attention_dropout: [0.1]
+    max_grad_norm: 0
+    label_smoothing: 0.1
+    average_decay: 0.0001
+    param_init_method: xavier_uniform
+    normalization: "tokens"
+model:
+    architecture: "transformer"
+    layer_norm: standard
+    share_embeddings: false
+    share_decoder_embeddings: true
+    add_ffnbias: true
+    mlp_activation_fn: gated-silu
+    add_estimator: false
+    add_qkvbias: false
+    norm_eps: 1e-6
+    hidden_size: 1024
+    encoder:
+        layers: 8
+    decoder:
+        layers: 2
+    heads: 16
+    transformer_ff: 4096
+    embeddings:
+        word_vec_size: 1024
+        position_encoding_type: "SinusoidalInterleaved"
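
Two numbers implied by the `training` block above can be worked out directly. The sketch below assumes that the effective batch size is `batch_size × accum_count × world_size`, and that eole's `decay_method: "noam"` follows the standard Noam formula (learning rate scaled by `hidden_size^-0.5`, linear warmup, then inverse square-root decay). Neither formula appears in this commit, and both function names are hypothetical.

```python
def effective_batch_tokens(batch_size, accum_count, world_size):
    """Tokens that contribute to a single optimizer step."""
    return batch_size * accum_count * world_size


def noam_lr(step, learning_rate, hidden_size, warmup_steps):
    """Standard Noam schedule (assumed, not confirmed for eole).

    `step` must be >= 1; the rate rises linearly until `warmup_steps`
    and then decays with the inverse square root of the step.
    """
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * hidden_size ** -0.5 * scale


# Values taken from eole-config.yaml in this commit.
print(effective_batch_tokens(batch_size=8192, accum_count=6, world_size=1))  # 49152
# Peak learning rate, reached at step == warmup_steps (roughly 3.1e-4).
print(noam_lr(step=10000, learning_rate=1.0, hidden_size=1024, warmup_steps=10000))
```

The first result agrees with the effective batch size of 49152 quoted in the README. Under the assumed schedule, `learning_rate: 1.0` is a multiplier rather than the actual step size.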

model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:346e81879b33a777f74eeac9ed1e1c17fcb7b5baa943cea1a1114adb10fd5190
+size 493941910
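
As a cross-check, the parameter count quoted in the README (241,870,080) can be reproduced from the `model` section of `eole-config.yaml`, and compared with the size of `model.bin`. The breakdown below is an assumption rather than something eole reports in this commit: attention projections without biases, three feed-forward matrices with biases for the gated activation, layer norms with weight and bias, one final layer norm per stack, an output projection tied to the target embeddings with its own bias, and parameter-free sinusoidal position encodings. `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_src=32000, vocab_tgt=32000, hidden=1024, ff=4096,
                     enc_layers=8, dec_layers=2):
    """Parameter count under the assumptions stated in the text above."""
    layer_norm = 2 * hidden                   # weight + bias
    attention = 4 * hidden * hidden           # Q, K, V, output; no biases
    ffn = 3 * hidden * ff + 2 * ff + hidden   # gate, up, down matrices + biases
    enc_layer = attention + ffn + 2 * layer_norm
    dec_layer = 2 * attention + ffn + 3 * layer_norm  # self- and cross-attention
    embeddings = (vocab_src + vocab_tgt) * hidden     # separate src/tgt tables
    final_norms = 2 * layer_norm              # one after each stack
    generator_bias = vocab_tgt                # weight is tied to tgt embeddings
    return (embeddings + enc_layers * enc_layer + dec_layers * dec_layer
            + final_norms + generator_bias)


params = count_parameters()
print(params)                  # 241870080, the figure in the README
print(params * 2)              # 483740160 bytes at 2 bytes per bfloat16 value
print(493941910 - params * 2)  # 10201750 bytes of model.bin not explained by weights
```

The 2-bytes-per-parameter estimate accounts for all but about 10 MB of the 493,941,910-byte `model.bin`, which is consistent with the README's statement that the model was saved in bfloat16. The commit does not say what the remaining bytes contain.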

source_vocabulary.json ADDED

The diff for this file is too large to render. See raw diff

src.spm.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c373f1d78753313b0dbc337058bf8450e1fdd6fe662a49e0941affce44ec14c5
+size 800955

target_vocabulary.json ADDED

The diff for this file is too large to render. See raw diff

tgt.spm.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:23d03d562fc3f8fe57e497dac0ece4827c254675a80c103fc4bb4040638ceb67
+size 733978
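
The three Git LFS pointers in this commit (`model.bin`, `src.spm.model`, `tgt.spm.model`) record the SHA-256 digest and byte size of the real files, so a downloaded copy can be checked against them. The following is a minimal sketch with hypothetical helper names; it expects the pointer text without the leading `+` that the diff adds to each line.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text):
    """Return (sha256_hex, size_in_bytes) from the text of a Git LFS pointer."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    if algo != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return digest, int(fields["size"])


def matches_pointer(path, pointer_text, chunk_size=1 << 20):
    """Check a local file against the size and digest in an LFS pointer."""
    digest, size = parse_lfs_pointer(pointer_text)
    path = Path(path)
    if path.stat().st_size != size:
        return False  # cheap check first, avoids hashing a truncated file
    sha = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest() == digest


# Pointer for tgt.spm.model, copied from this commit.
TGT_SPM_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:23d03d562fc3f8fe57e497dac0ece4827c254675a80c103fc4bb4040638ceb67
size 733978
"""

local_copy = Path("./quickmt-en-zh/tgt.spm.model")
if local_copy.is_file():
    print(matches_pointer(local_copy, TGT_SPM_POINTER))
```

Comparing the size before hashing lets an incomplete download fail fast, which matters most for the roughly 494 MB `model.bin`.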