
MC^2Llama-13B

GitHub Repo

We continually pretrain llama_chinese_13b on MC^2, a corpus covering Tibetan, Uyghur, Kazakh in the Kazakh Arabic script, and Mongolian in the traditional Mongolian script.

See the paper for details.

We have also released another model trained on MC^2: MC^2XLMR-large.

Usage

The model and tokenizer can be loaded via:

from transformers import LlamaForCausalLM, LlamaTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = LlamaTokenizer.from_pretrained("pkupie/mc2-llama-13b")
model = LlamaForCausalLM.from_pretrained("pkupie/mc2-llama-13b")
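
Once loaded, the model can be used with the standard generate API. A minimal sketch, continuing from the snippet above; the placeholder prompt and greedy decoding settings are illustrative choices, not official usage from the model authors:

import torch

# Placeholder: substitute real text in one of the supported languages
# (Tibetan, Uyghur, Kazakh, or Mongolian).
prompt = "Your prompt in a supported language"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Note that this is a base model continually pretrained on raw text, so sampling parameters and prompts may need tuning for your task.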

Citation

@article{zhang2024mc,
  title={MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China},
  author={Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong},
  journal={arXiv preprint arXiv:2311.08348},
  year={2024}
}