mGPT
mGPT is pre-trained on the mC4 dataset using a causal language modeling objective. It was introduced in this paper (arXiv:2110.06609, see the BibTeX entry below) and first released on this page.
Model description
mGPT is a Transformer-based model pre-trained on massive multilingual data covering over 101 languages. Similar to GPT-2, it was pre-trained on raw text only, with no human labeling. We use the same tokenization and vocabulary as the mT5 model.
Intended uses
You can use the raw model for text generation, or use prompts to adapt it to a downstream task.
How to use
You can use this model directly with a pipeline for text generation. Here is how to generate a continuation of a given text in PyTorch:
from transformers import MT5Tokenizer, GPT2LMHeadModel, TextGenerationPipeline
# mGPT pairs the mT5 tokenizer with a GPT-2 style causal language model
tokenizer = MT5Tokenizer.from_pretrained("THUMT/mGPT")
model = GPT2LMHeadModel.from_pretrained("THUMT/mGPT")
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
text = "Replace me by any text you'd like."
# Sample a continuation; max_length counts the prompt tokens plus the generated tokens
text = pipeline(text, do_sample=True, max_length=1024)[0]["generated_text"]
Preprocessing
The texts are tokenized using SentencePiece with a vocabulary size of 250,100. The inputs are sequences of 1,024 consecutive tokens. We use <extra_id_0> to separate lines in a document.
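The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the helper names `document_to_tokens` and `pack_into_blocks` are hypothetical, the whitespace tokenizer stands in for the real SentencePiece tokenizer, and several details are assumptions (the separator is placed between lines only, empty lines are skipped, and a trailing partial block is dropped).

```python
from typing import Callable, List

SEPARATOR = "<extra_id_0>"  # line separator token named in the text
BLOCK_SIZE = 1024           # tokens per input sequence, as stated in the text


def document_to_tokens(document: str, tokenize: Callable[[str], List[str]]) -> List[str]:
    """Tokenize each non-empty line and join the lines with the separator token.

    Assumption: the separator goes between lines only, and empty lines are skipped.
    """
    tokens: List[str] = []
    lines = [line for line in document.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        if index > 0:
            tokens.append(SEPARATOR)
        tokens.extend(tokenize(line))
    return tokens


def pack_into_blocks(tokens: List[str], block_size: int = BLOCK_SIZE) -> List[List[str]]:
    """Cut a token stream into consecutive blocks of exactly block_size tokens.

    Assumption: a trailing block shorter than block_size is dropped.
    """
    n_full = len(tokens) // block_size
    return [tokens[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Toy usage with a whitespace tokenizer standing in for SentencePiece.
toy_tokenize = str.split
doc = "first line here\nsecond line"
stream = document_to_tokens(doc, toy_tokenize)
blocks = pack_into_blocks(stream, block_size=4)
```

With the real model, `tokenize` would be replaced by the mT5 tokenizer's own tokenization method and `block_size` left at 1,024. How the original training code handled document boundaries and leftover tokens is not stated here, so treat those choices as placeholders.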
BibTeX entry and citation info
@misc{tan2021msp,
title={MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators},
author={Zhixing Tan and Xiangwen Zhang and Shuo Wang and Yang Liu},
year={2021},
eprint={2110.06609},
archivePrefix={arXiv},
primaryClass={cs.CL}
}