---
license: llama2
datasets:
- pkupie/mc2_corpus
language:
- bo
- ug
- mn
- kk
---

# MC^2Llama-13B

[Github Repo](https://github.com/luciusssss/mc2_corpus)

We continually pretrain [llama_chinese_13b](https://huggingface.co/quzhe/llama_chinese_13B) on [MC^2](https://huggingface.co/datasets/pkupie/mc2_corpus), a corpus covering Tibetan, Uyghur, Kazakh in the Kazakh Arabic script, and Mongolian in the traditional Mongolian script.

See details in the [paper](https://arxiv.org/abs/2311.08348).

*We have also released another model trained on MC^2: [MC^2XLMR-large](https://huggingface.co/pkupie/mc2-xlmr-large).*

## Usage

The model and tokenizer can be loaded via:
|
```python
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("pkupie/mc2-llama-13b")
model = LlamaForCausalLM.from_pretrained("pkupie/mc2-llama-13b")
```
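Once loaded, the checkpoint works with the standard causal-LM generation API. Below is a minimal sketch; the `float16` dtype, `device_map="auto"` placement (which needs the `accelerate` package), and the generation parameters are our assumptions for illustration, not settings prescribed by this card, and a 13B model requires substantial GPU memory:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_id = "pkupie/mc2-llama-13b"
tokenizer = LlamaTokenizer.from_pretrained(model_id)
# Assumptions for illustration: half precision to reduce memory,
# and automatic layer placement across available devices.
model = LlamaForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Replace with a prompt in Tibetan, Uyghur, Kazakh, or Mongolian.
prompt = "..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding of up to 64 new tokens (do_sample=False).
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Since this is a base model continually pretrained on raw text rather than an instruction-tuned chat model, plain text completion prompts like the above are the expected input format.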

## Citation

```
@article{zhang2024mc,
  title={MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China},
  author={Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong},
  journal={arXiv preprint arXiv:2311.08348},
  year={2024}
}
```