metadata
license: llama2
datasets:
- pkupie/mc2_corpus
language:
- bo
- ug
- mn
- kk
[MC^2Llama-13B]
We continually pretrain llama_chinese_13b on MC^2. The resulting model supports Tibetan, Uyghur, Kazakh (in the Kazakh Arabic script), and Mongolian (in the traditional Mongolian script).
See the paper (arXiv:2311.08348) for details.
Usage
The model and tokenizer can be loaded via:
from transformers import LlamaForCausalLM, LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("pkupie/mc2-llama-13b")
model = LlamaForCausalLM.from_pretrained("pkupie/mc2-llama-13b")
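Once loaded, the model and tokenizer can be used for text generation through the standard transformers `generate` API. The following is a minimal sketch, not an official recipe: the helper names `generate_text` and `run_demo` and the `max_new_tokens` default are illustrative assumptions, and the model is loaded with the same default arguments as above.

```python
MODEL_ID = "pkupie/mc2-llama-13b"  # checkpoint name from this model card


def generate_text(model, tokenizer, prompt, max_new_tokens=64):
    """Continue `prompt` and return the decoded text (hypothetical helper)."""
    # Tokenize the prompt into model inputs (input_ids, attention_mask).
    inputs = tokenizer(prompt, return_tensors="pt")
    # generate() returns a batch of sequences: the prompt plus its continuation.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first (and only) sequence, dropping special tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def run_demo(prompt):
    """Load the checkpoint and generate from `prompt` (hypothetical helper).

    The first call downloads the 13B weights, so it needs substantial
    disk space and memory.
    """
    from transformers import LlamaForCausalLM, LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained(MODEL_ID)
    model = LlamaForCausalLM.from_pretrained(MODEL_ID)
    return generate_text(model, tokenizer, prompt)
```

Calling `run_demo` with a prompt written in one of the supported languages returns the prompt followed by the model's continuation. Since the card describes continual pretraining only, plain text continuation is likely the most natural way to prompt the model.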
Citation
@misc{zhang2023mc2,
title={MC^2: A Multilingual Corpus of Minority Languages in China},
author={Chen Zhang and Mingxu Tao and Quzhe Huang and Jiuheng Lin and Zhibin Chen and Yansong Feng},
year={2023},
eprint={2311.08348},
archivePrefix={arXiv},
primaryClass={cs.CL}
}