---
license: llama2
datasets:
- pkupie/mc2_corpus
language:
- bo
- ug
- mn
- kk
---
# MC^2Llama-13B
[Github Repo](https://github.com/luciusssss/mc2_corpus)


We continually pretrain [llama_chinese_13b](https://huggingface.co/quzhe/llama_chinese_13B) on [MC^2](https://huggingface.co/datasets/pkupie/mc2_corpus), a corpus covering Tibetan, Uyghur, Kazakh in the Kazakh Arabic script, and Mongolian in the traditional Mongolian script.


See details in the [paper](https://arxiv.org/abs/2311.08348).

*We have also released another model trained on MC^2: [MC^2XLMR-large](https://huggingface.co/pkupie/mc2-xlmr-large).*

## Usage
The model and tokenizer can be loaded via:
```python
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("pkupie/mc2-llama-13b")
model = LlamaForCausalLM.from_pretrained("pkupie/mc2-llama-13b")
```
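Once loaded, text can be generated with the standard `generate` API. The snippet below is a minimal sketch: the dtype, device placement, and sampling settings are illustrative choices, not recommendations from the model authors, and the prompt is a placeholder you should replace with text in one of the supported languages.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("pkupie/mc2-llama-13b")
model = LlamaForCausalLM.from_pretrained(
    "pkupie/mc2-llama-13b",
    torch_dtype=torch.float16,  # half precision so the 13B model fits on a single GPU (assumption)
    device_map="auto",          # requires `accelerate`; spreads layers across available devices
)

# This is a continually pretrained base model (not a chat model), so use plain text completion.
prompt = "<your text in Tibetan, Uyghur, Kazakh, or Mongolian>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,   # illustrative generation settings
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```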

## Citation
```
@article{zhang2024mc,
  title={MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China},
  author={Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong},
  journal={arXiv preprint arXiv:2311.08348},
  year={2024}
}
```