---
license: llama2
datasets:
- pkupie/mc2_corpus
language:
- bo
- ug
- mn
- kk
---

# MC^2Llama-13B

[Github Repo](https://github.com/luciusssss/mc2_corpus)

We continually pretrain [llama_chinese_13b](https://huggingface.co/quzhe/llama_chinese_13B) on [MC^2](https://huggingface.co/datasets/pkupie/mc2_corpus), a corpus covering Tibetan, Uyghur, Kazakh in the Kazakh Arabic script, and Mongolian in the traditional Mongolian script.

See details in the [paper](https://arxiv.org/abs/2311.08348).

*We have also released another model trained on MC^2: [MC^2XLMR-large](https://huggingface.co/pkupie/mc2-xlmr-large).*

## Usage

The model and tokenizer can be loaded via:
|
```python
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("pkupie/mc2-llama-13b")
model = LlamaForCausalLM.from_pretrained("pkupie/mc2-llama-13b")
```
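Once loaded, the checkpoint works with the standard causal-LM generation API. Below is a minimal sketch; the `float16` dtype, `device_map="auto"` placement (which needs the `accelerate` package), and the generation parameters are our assumptions for illustration, not settings prescribed by this card, and a 13B model requires substantial GPU memory:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_id = "pkupie/mc2-llama-13b"
tokenizer = LlamaTokenizer.from_pretrained(model_id)
# Assumptions for illustration: half precision to reduce memory,
# and automatic layer placement across available devices.
model = LlamaForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Replace with a prompt in Tibetan, Uyghur, Kazakh, or Mongolian.
prompt = "..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding of up to 64 new tokens (do_sample=False).
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Since this is a base model continually pretrained on raw text rather than an instruction-tuned chat model, plain text completion prompts like the above are the expected input format.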

## Citation

```
@article{zhang2024mc,
  title={MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China},
  author={Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong},
  journal={arXiv preprint arXiv:2311.08348},
  year={2024}
}
```