---
license: llama2
datasets:
- pkupie/mc2_corpus
language:
- bo
- ug
- mn
- kk
---

# MC^2Llama-13B

[Github Repo](https://github.com/luciusssss/mc2_corpus)

We continually pretrain [llama_chinese_13b](https://huggingface.co/quzhe/llama_chinese_13B) with [MC^2](https://huggingface.co/datasets/pkupie/mc2_corpus), which covers Tibetan, Uyghur, Kazakh written in the Kazakh Arabic script, and Mongolian written in the traditional Mongolian script.

See details in the [paper](https://arxiv.org/abs/2311.08348).
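For readers who want to inspect the pretraining data itself, MC^2 is hosted on the Hub as `pkupie/mc2_corpus`. The snippet below is only a sketch: the configuration name `"bo"` (Tibetan) and the `train` split are assumptions, so check the dataset card for the actual configurations and splits.

```python
from datasets import load_dataset

# Sketch: load one language subset of MC^2.
# NOTE: the configuration name "bo" and the "train" split are assumptions;
# see https://huggingface.co/datasets/pkupie/mc2_corpus for the real layout.
mc2_bo = load_dataset("pkupie/mc2_corpus", "bo", split="train")
print(mc2_bo[0])
```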
## Usage

The model and tokenizer can be loaded via:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = LlamaTokenizer.from_pretrained("pkupie/mc2-llama-13b")
model = LlamaForCausalLM.from_pretrained("pkupie/mc2-llama-13b")
```
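Once loaded, text can be generated through the standard `transformers` `generate` API. The sketch below is illustrative rather than prescribed by the model authors: the Tibetan prompt and the decoding settings are arbitrary placeholders.

```python
# Minimal generation sketch (assumes `model` and `tokenizer` from above).
# The prompt and decoding settings are placeholders, not recommended values.
prompt = "བོད་ཀྱི་"  # illustrative Tibetan prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```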
## Citation

```
@misc{zhang2023mc2,
      title={MC^2: A Multilingual Corpus of Minority Languages in China},
      author={Chen Zhang and Mingxu Tao and Quzhe Huang and Jiuheng Lin and Zhibin Chen and Yansong Feng},
      year={2023},
      eprint={2311.08348},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```