# Support for NLLB and Sampling
Files changed:

- README.md (+81 / -9)
- supported_languages.md (+214 / -2)
- translate.py (+54 / -7)
**README.md** (changed)
Removed from the previous README:

- The old introduction, which mentioned only M2M100:

  > Easy-Translate is a script for translating large text files on your machine using the [M2M100 models](https://arxiv.org/pdf/2010.11125.pdf) from Facebook/Meta AI. We also provide a [script](#evaluate-translations) for Easy-Evaluation of your translations 🥳

- The M2M100 description ("**M2M100** is a multilingual encoder-decoder (seq2seq) model trained for many-to-many multilingual translation…") and the note ">M2M100 can directly translate between 9,900 directions of 100 languages.", which now live under the new "Supported Models" section.

- The inline list of supported languages:

  > Afrikaans, Amharic, Arabic, Asturian, Azerbaijani, Bashkir, Belarusian, Bulgarian, Bengali, Breton, Bosnian, Catalan, Cebuano, Czech, Welsh, Danish, German, Greek, English, Spanish, Estonian, Persian, Fulah, Finnish, French, Western Frisian, Irish, Gaelic, Galician, Gujarati, Hausa, Hebrew, Hindi, Croatian, Haitian, Hungarian, Armenian, Indonesian, Igbo, Iloko, Icelandic, Italian, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Luxembourgish, Ganda, Lingala, Lao, Lithuanian, Latvian, Malagasy, Macedonian, Malayalam, Mongolian, Marathi, Malay, Burmese, Nepali, Dutch, Norwegian, Northern Sotho, Occitan, Oriya, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Albanian, Serbian, Swati, Sundanese, Swedish, Swahili, Tamil, Thai, Tagalog, Tswana, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Wolof, Xhosa, Yiddish, Yoruba, Chinese, Zulu

  The README keeps only the pointer to [supported_languages.md](supported_languages.md).
13 |
<br>
|
14 |
</p>
|
15 |
|
16 |
+
Easy-Translate is a script for translating large text files in your machine using the [M2M100 models](https://arxiv.org/pdf/2010.11125.pdf) and [NLLB200 models](https://research.facebook.com/publications/no-language-left-behind/) from Facebook/Meta AI. We also privide a [script](#evaluate-translations) for Easy-Evaluation of your translations 🥳
|
|
|
|
|
|
|
|
|
17 |
|
18 |
Easy-Translate is built on top of 🤗HuggingFace's [Transformers](https://huggingface.co/docs/transformers/index) and 🤗HuggingFace's [Accelerate](https://huggingface.co/docs/accelerate/index) library.
|
19 |
|
|
|
23 |
- BF16 / FP16 / FP32 precision.
- Automatic batch size finder: forget CUDA OOM errors. Set an initial batch size and, if it doesn't fit, we will automatically adjust it (see the sketch after this list).
- Sharded Data Parallel to load huge models sharded across multiple GPUs (see <https://huggingface.co/docs/accelerate/fsdp>).
- Greedy decoding / Beam Search decoding / Multinomial Sampling / Beam-Search Multinomial Sampling.
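The automatic batch size finder is built on 🤗 Accelerate's `find_executable_batch_size` decorator (the same utility `translate.py` imports). A minimal sketch of the pattern, assuming the import path `accelerate.utils`; the `sentences` list and `process` function are illustrative only and not part of the repository:

```python
from accelerate.utils import find_executable_batch_size

sentences = ["Hello world!"] * 1_000  # toy data

@find_executable_batch_size(starting_batch_size=256)
def process(batch_size):
    # Called first with starting_batch_size; if the body raises a CUDA
    # out-of-memory error, the decorator halves batch_size and retries.
    translated = 0
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i : i + batch_size]
        translated += len(batch)  # translate.py tokenizes and generates here
    return translated

print(process())  # called without arguments; batch_size is injected
```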
>Test the 🔌 Online Demo here: <https://huggingface.co/spaces/Iker/Translate-100-languages>

## Supported languages

See the [Supported languages table](supported_languages.md) for a table of the supported languages and their ids.

## Supported Models
### M2M100

**M2M100** is a multilingual encoder-decoder (seq2seq) model trained for many-to-many multilingual translation, introduced in this [paper](https://arxiv.org/abs/2010.11125) and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100).

>M2M100 can directly translate between 9,900 directions of 100 languages.

- **facebook/m2m100_418M**: <https://huggingface.co/facebook/m2m100_418M>
- **facebook/m2m100_1.2B**: <https://huggingface.co/facebook/m2m100_1.2B>
- **facebook/m2m100_12B**: <https://huggingface.co/facebook/m2m100-12B-avg-5-ckpt>
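For reference, a minimal sketch of loading one of these checkpoints directly with 🤗 Transformers (illustrative only; `translate.py` adds batching, Accelerate and precision handling on top of this). `get_lang_id` is specific to the M2M100 tokenizer:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/m2m100_418M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "en"  # source language id (see supported_languages.md)
inputs = tokenizer("Hello world!", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("es"),  # force Spanish output
    num_beams=5,
    max_length=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```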
### NLLB200

**No Language Left Behind (NLLB)** open-sources models capable of delivering high-quality translations directly between any pair of 200+ languages, including low-resource languages like Asturian, Luganda, Urdu and more. It aims to help people communicate with anyone, anywhere, regardless of their language preferences. It was introduced in this [paper](https://research.facebook.com/publications/no-language-left-behind/) and first released in [this repository](https://github.com/facebookresearch/fairseq/tree/nllb).

>NLLB can directly translate between more than 40,000 directions of 200+ languages.

- **facebook/nllb-200-3.3B**: <https://huggingface.co/facebook/nllb-200-3.3B>
- **facebook/nllb-200-1.3B**: <https://huggingface.co/facebook/nllb-200-1.3B>
- **facebook/nllb-200-distilled-1.3B**: <https://huggingface.co/facebook/nllb-200-distilled-1.3B>
- **facebook/nllb-200-distilled-600M**: <https://huggingface.co/facebook/nllb-200-distilled-600M>
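And a hedged equivalent for NLLB200 (again illustrative, not repository code). NLLB uses FLORES-200 codes such as `eng_Latn`; on very recent transformers releases the target-id lookup may be `tokenizer.convert_tokens_to_ids("spa_Latn")` instead of `lang_code_to_id`:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer("Hello world!", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force Spanish as the target; translate.py resolves the id the same way.
    forced_bos_token_id=tokenizer.lang_code_to_id["spa_Latn"],
    max_length=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```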
Any other ModelForSeq2SeqLM from HuggingFace's Hub should work with this library: <https://huggingface.co/models?pipeline_tag=text2text-generation>
## Requirements

The install snippet now ends with a note for NLLB200:

```
HuggingFace Transformers
pip install --upgrade transformers

If you find errors using NLLB200, try installing transformers from source:
pip install git+https://github.com/huggingface/transformers.git
```
## Translate a file

The existing example command (ending in `--precision fp16`) is unchanged; a new subsection follows it:
### Decoding/Sampling strategies

You can choose the decoding/sampling strategy and the number of candidate translations to output for each input sentence. By default we use beam search with `num_beams` set to 5 and output the single most likely candidate translation, but you can change this behaviour:
##### Greedy decoding

```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 1
```

##### Multinomial Sampling

```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 1 \
--do_sample \
--temperature 0.5 \
--top_k 100 \
--top_p 0.8 \
--num_return_sequences 1
```

##### Beam-Search decoding **(DEFAULT)**

```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 5 \
--num_return_sequences 1
```

##### Beam-Search Multinomial Sampling

```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 5 \
--num_return_sequences 1 \
--do_sample \
--temperature 0.5 \
--top_k 100 \
--top_p 0.8
```
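These flags map almost one-to-one onto 🤗 Transformers generation arguments; `translate.py` collects them into a `gen_kwargs` dict and forwards them to `model.generate`. A hedged, illustrative sketch of that mapping (not repository code):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/m2m100_418M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang = "en"

gen_kwargs = {
    "max_length": 128,
    "num_beams": 1,             # --num_beams 1 -> no beam search
    "num_return_sequences": 1,  # --num_return_sequences
    "do_sample": True,          # --do_sample
    "temperature": 0.5,         # --temperature
    "top_k": 100,               # --top_k
    "top_p": 0.8,               # --top_p
}

inputs = tokenizer("Hello world!", return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.get_lang_id("es"),
        **gen_kwargs,
    )
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```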
## Evaluate translations

To run the evaluation script you need to install [bert_score](https://github.com/Tiiiger/bert_score): `pip install bert_score` and 🤗HuggingFace's [Datasets](https://huggingface.co/docs/datasets/index) library: `pip install datasets`.
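For orientation, a minimal, hedged example of scoring a translation file with bert_score outside the provided eval script; the file paths are placeholders:

```python
from bert_score import score

# Hypothetical files: one candidate translation and one reference per line.
with open("sample_text/en2es.translation.m2m100_1.2B.txt") as f:
    candidates = [line.strip() for line in f]
with open("sample_text/references.es.txt") as f:
    references = [line.strip() for line in f]

# P, R and F1 are tensors with one value per sentence pair.
P, R, F1 = score(candidates, references, lang="es")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```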
**supported_languages.md** (changed)
The existing M2M100 table is kept; the file gains a title, an index, and a new NLLB200 section. The updated file:
# List of supported languages

## Index

* [M2M100 supported languages](#supported-languages-m2m100)
* [NLLB200 supported languages](#supported-languages-nllb200)

## Supported languages M2M100

| Language | Id |
|---|---|
| … | … |
| Yiddish | yi |
| Yoruba | yo |
| Chinese | zh |
| Zulu | zu |

(the middle of the M2M100 table is unchanged and not shown in the diff)

## Supported languages NLLB200

| Language id |
|-------------|
| ace_Arab |
| ace_Latn |
| acm_Arab |
| acq_Arab |
| aeb_Arab |
| afr_Latn |
| ajp_Arab |
| aka_Latn |
| amh_Ethi |
| apc_Arab |
| arb_Arab |
| ars_Arab |
| ary_Arab |
| arz_Arab |
| asm_Beng |
| ast_Latn |
| awa_Deva |
| ayr_Latn |
| azb_Arab |
| azj_Latn |
| bak_Cyrl |
| bam_Latn |
| ban_Latn |
| bel_Cyrl |
| bem_Latn |
| ben_Beng |
| bho_Deva |
| bjn_Arab |
| bjn_Latn |
| bod_Tibt |
| bos_Latn |
| bug_Latn |
| bul_Cyrl |
| cat_Latn |
| ceb_Latn |
| ces_Latn |
| cjk_Latn |
| ckb_Arab |
| crh_Latn |
| cym_Latn |
| dan_Latn |
| deu_Latn |
| dik_Latn |
| dyu_Latn |
| dzo_Tibt |
| ell_Grek |
| eng_Latn |
| epo_Latn |
| est_Latn |
| eus_Latn |
| ewe_Latn |
| fao_Latn |
| pes_Arab |
| fij_Latn |
| fin_Latn |
| fon_Latn |
| fra_Latn |
| fur_Latn |
| fuv_Latn |
| gla_Latn |
| gle_Latn |
| glg_Latn |
| grn_Latn |
| guj_Gujr |
| hat_Latn |
| hau_Latn |
| heb_Hebr |
| hin_Deva |
| hne_Deva |
| hrv_Latn |
| hun_Latn |
| hye_Armn |
| ibo_Latn |
| ilo_Latn |
| ind_Latn |
| isl_Latn |
| ita_Latn |
| jav_Latn |
| jpn_Jpan |
| kab_Latn |
| kac_Latn |
| kam_Latn |
| kan_Knda |
| kas_Arab |
| kas_Deva |
| kat_Geor |
| knc_Arab |
| knc_Latn |
| kaz_Cyrl |
| kbp_Latn |
| kea_Latn |
| khm_Khmr |
| kik_Latn |
| kin_Latn |
| kir_Cyrl |
| kmb_Latn |
| kon_Latn |
| kor_Hang |
| kmr_Latn |
| lao_Laoo |
| lvs_Latn |
| lij_Latn |
| lim_Latn |
| lin_Latn |
| lit_Latn |
| lmo_Latn |
| ltg_Latn |
| ltz_Latn |
| lua_Latn |
| lug_Latn |
| luo_Latn |
| lus_Latn |
| mag_Deva |
| mai_Deva |
| mal_Mlym |
| mar_Deva |
| min_Latn |
| mkd_Cyrl |
| plt_Latn |
| mlt_Latn |
| mni_Beng |
| khk_Cyrl |
| mos_Latn |
| mri_Latn |
| zsm_Latn |
| mya_Mymr |
| nld_Latn |
| nno_Latn |
| nob_Latn |
| npi_Deva |
| nso_Latn |
| nus_Latn |
| nya_Latn |
| oci_Latn |
| gaz_Latn |
| ory_Orya |
| pag_Latn |
| pan_Guru |
| pap_Latn |
| pol_Latn |
| por_Latn |
| prs_Arab |
| pbt_Arab |
| quy_Latn |
| ron_Latn |
| run_Latn |
| rus_Cyrl |
| sag_Latn |
| san_Deva |
| sat_Beng |
| scn_Latn |
| shn_Mymr |
| sin_Sinh |
| slk_Latn |
| slv_Latn |
| smo_Latn |
| sna_Latn |
| snd_Arab |
| som_Latn |
| sot_Latn |
| spa_Latn |
| als_Latn |
| srd_Latn |
| srp_Cyrl |
| ssw_Latn |
| sun_Latn |
| swe_Latn |
| swh_Latn |
| szl_Latn |
| tam_Taml |
| tat_Cyrl |
| tel_Telu |
| tgk_Cyrl |
| tgl_Latn |
| tha_Thai |
| tir_Ethi |
| taq_Latn |
| taq_Tfng |
| tpi_Latn |
| tsn_Latn |
| tso_Latn |
| tuk_Latn |
| tum_Latn |
| tur_Latn |
| twi_Latn |
| tzm_Tfng |
| uig_Arab |
| ukr_Cyrl |
| umb_Latn |
| urd_Arab |
| uzn_Latn |
| vec_Latn |
| vie_Latn |
| war_Latn |
| wol_Latn |
| xho_Latn |
| ydd_Hebr |
| yor_Latn |
| yue_Hant |
| zho_Hans |
| zho_Hant |
| zul_Latn |
**translate.py** (changed)
Removed or superseded lines:

- The previous `from transformers import (...)` entries and the `tokenizer = ...` / `model = ...` loader calls (their exact text is truncated in this view); they are replaced by the Auto classes shown below.
- The bare target-language lookup before `gen_kwargs` (also truncated here), replaced by a guarded lookup with a clear error message.
- The `f"Num beams: {num_beams}\n"` line in the start-up summary, superseded by printing all generation parameters.
- The last-batch truncation `tgt_text[: len(data_loader.dataset)]`, which ignored `num_return_sequences`.
The updated code, hunk by hunk. The import block now uses the Auto classes so that both M2M100 and NLLB200 checkpoints can be loaded:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    PreTrainedTokenizerBase,
    DataCollatorForSeq2Seq,
)
```
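The point of the Auto classes is that the same two calls resolve to the right concrete classes for either model family. A small illustrative check (not part of the script; the exact class names depend on the installed transformers version):

```python
from transformers import AutoTokenizer

m2m = AutoTokenizer.from_pretrained("facebook/m2m100_418M")
nllb = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
# Typically M2M100Tokenizer and NllbTokenizer(Fast), respectively.
print(type(m2m).__name__, type(nllb).__name__)
```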
`main()` gains the four sampling parameters:

```python
def main(
    ...
    max_length: int = 128,
    num_beams: int = 4,
    num_return_sequences: int = 1,
    do_sample: bool = False,
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 1.0,
):
    if not os.path.exists(os.path.abspath(os.path.dirname(output_path))):  # (unchanged)
        ...
```
The tokenizer and model are now loaded through the Auto classes:

```python
print(f"Loading tokenizer {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_name, cache_dir=cache_dir
)
print(f"Loading model {model_name}...")
model = AutoModelForSeq2SeqLM.from_pretrained(
    pretrained_model_name_or_path=model_name, cache_dir=cache_dir
)
```
The target-language id is looked up defensively, so an unsupported code fails with a clear message:

```python
tokenizer.src_lang = source_lang
try:
    lang_code_to_idx = tokenizer.lang_code_to_id[target_lang]
except KeyError:
    raise KeyError(
        f"Language {target_lang} not found in tokenizer. Available languages: {tokenizer.lang_code_to_id.keys()}"
    )
```
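For reference, a hedged snippet for checking which codes a tokenizer accepts before launching a long job (not part of translate.py; `lang_code_to_id` exists on the M2M100 and NLLB tokenizers, though very recent transformers releases may expose the codes differently):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
codes = sorted(tok.lang_code_to_id)
print(len(codes), codes[:5])  # e.g. 202 ['ace_Arab', 'ace_Latn', 'acm_Arab', ...]
```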
The sampling options are forwarded to generation through `gen_kwargs`:

```python
gen_kwargs = {
    "max_length": max_length,
    "num_beams": num_beams,
    "num_return_sequences": num_return_sequences,
    "do_sample": do_sample,
    "temperature": temperature,
    "top_k": top_k,
    "top_p": top_p,
}
```
The start-up summary no longer prints only `num_beams`; instead it dumps every generation parameter:

```python
    # ... tail of the existing start-up summary (inside main) ...
        f"Num. Devices: {accelerator.num_processes}\n"
        f"Distributed_type: {accelerator.distributed_type}\n"
        f"Max length: {max_length}\n"
        f"Precision: {model.dtype}\n"
        f"Model: {model_name}\n"
    )
    print("** Generation parameters **")
    print("\n".join(f"{k}: {v}" for k, v in gen_kwargs.items()))
    print("\n")

    @find_executable_batch_size(starting_batch_size=starting_batch_size)
    def inference(batch_size):
        ...
```
The last-batch truncation now accounts for multiple returned sequences per sentence (previously it truncated to `len(data_loader.dataset)` only):

```python
if accelerator.is_main_process:
    if step == len(data_loader) - 1:
        tgt_text = tgt_text[
            : len(data_loader.dataset) * num_return_sequences
            - samples_seen
        ]
    else:
        samples_seen += len(tgt_text)
```
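The factor matters because `generate` returns `batch_size * num_return_sequences` sequences. A toy, hedged illustration (not repository code):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/m2m100_418M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M")
tok.src_lang = "en"

batch = tok(["Hello world!", "How are you?"], return_tensors="pt", padding=True)
out = model.generate(
    **batch,
    forced_bos_token_id=tok.get_lang_id("es"),
    num_beams=5,
    num_return_sequences=3,
)
print(out.shape[0])  # 6 == 2 input sentences * 3 returned sequences
```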
Four new command-line flags expose the sampling options (the top_p help text previously duplicated the top_k description):

```python
parser.add_argument(
    "--do_sample",
    action="store_true",
    help="Use sampling instead of beam search.",
)

parser.add_argument(
    "--temperature",
    type=float,
    default=1.0,
    help="Temperature for sampling; only used if do_sample is True.",
)

parser.add_argument(
    "--top_k",
    type=int,
    default=50,
    help="If do_sample is True, sample from the top k most likely tokens.",
)

parser.add_argument(
    "--top_p",
    type=float,
    default=1.0,
    help="If do_sample is True, sample from the smallest set of tokens whose cumulative probability exceeds top_p (nucleus sampling).",
)
```
|
309 |
|
310 |
main(
|
|
|
319 |
num_beams=args.num_beams,
|
320 |
num_return_sequences=args.num_return_sequences,
|
321 |
precision=args.precision,
|
322 |
+
do_sample=args.do_sample,
|
323 |
+
temperature=args.temperature,
|
324 |
+
top_k=args.top_k,
|
325 |
+
top_p=args.top_p,
|
326 |
)
|