	---
	datasets:
	- emozilla/yarn-train-tokenized-32k-mistral
	metrics:
	- perplexity
	library_name: transformers
	license: apache-2.0
	language:
	- en
	---

	# Model Card: Yarn-Solar-10b-64k

	[Preprint (arXiv)](https://arxiv.org/abs/2309.00071)
	[GitHub](https://github.com/jquesnelle/yarn)
	![yarn](https://raw.githubusercontent.com/jquesnelle/yarn/solar/data/proofpile-long-small-solar.csv.png)

	## Model Description

	Yarn-Solar-10b-64k is a state-of-the-art language model for long context, further pretrained on two billion long context tokens using the YaRN extension method.
	It is an extension of [SOLAR-10.7B-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-v1.0) and supports a 64k token context window.

	To use, pass `trust_remote_code=True` when loading the model, for example:

	```python
	import torch
	from transformers import AutoModelForCausalLM

	model = AutoModelForCausalLM.from_pretrained(
	    "NousResearch/Yarn-Solar-10b-64k",
	    attn_implementation="flash_attention_2",  # requires the flash-attn package
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    trust_remote_code=True,
	)
	```

	In addition, you will need to use the latest version of `transformers`:
	```sh
	pip install git+https://github.com/huggingface/transformers
	```
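
Because the model supports a 64k token context window, the prompt and the generated tokens have to fit in that window together. The helper below is a minimal sketch of that bookkeeping. `clamp_new_tokens` is a hypothetical name, not part of `transformers`, and the sketch assumes that "64k" means 64 × 1024 = 65,536 positions; check `max_position_embeddings` in the model's config for the actual limit.

```python
# Assumption: "64k" is read as 64 * 1024 positions; confirm against the
# model config (max_position_embeddings) before relying on this number.
CONTEXT_WINDOW = 64 * 1024


def clamp_new_tokens(prompt_tokens: int, requested_new_tokens: int,
                     context_window: int = CONTEXT_WINDOW) -> int:
    """Return how many new tokens can be generated without leaving the window."""
    if prompt_tokens < 0 or requested_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if prompt_tokens >= context_window:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens; the window holds {context_window}"
        )
    return min(requested_new_tokens, context_window - prompt_tokens)
```

The result can be passed as `max_new_tokens` to `model.generate`, with `prompt_tokens` taken from the length of the tokenized prompt.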

	## Benchmarks

	Long context benchmarks:

	| Model | Context Window | 4k PPL | 8k PPL | 16k PPL | 32k PPL | 64k PPL |
	|-------|---------------:|-------:|-------:|--------:|--------:|--------:|
	| [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 8k | 3.09 | 2.96 | - | - | - |
	| [Yarn-Mistral-7b-64k](https://huggingface.co/NousResearch/Yarn-Mistral-7b-64k) | 64k | 3.18 | 3.04 | 2.65 | 2.44 | 2.20 |
	| [Yarn-Mistral-7b-128k](https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k) | 128k | 3.21 | 3.08 | 2.68 | 2.47 | 2.24 |
	| [SOLAR-10.7B-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-v1.0) | 4k | 3.07 | - | - | - | - |
	| [Yarn-Solar-10b-32k](https://huggingface.co/NousResearch/Yarn-Solar-10b-32k) | 32k | 3.09 | 2.95 | 2.57 | 2.31 | - |
	| [Yarn-Solar-10b-64k](https://huggingface.co/NousResearch/Yarn-Solar-10b-64k) | 64k | 3.13 | 2.99 | 2.61 | 2.34 | 2.15 |
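
The PPL columns report perplexity at each evaluation length; lower is better. As a reminder of what the metric measures, perplexity is the exponential of the mean negative log-likelihood per token. The following is a minimal sketch of that definition only, not the evaluation script behind the table; the function name and the toy input are illustrative assumptions.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative natural-log likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input: a model that assigns probability 0.5 to every token
# has a perplexity of 2 (up to floating-point error).
example = perplexity([math.log(0.5)] * 4)
```

In practice the log-probabilities would come from the model's output over a held-out document truncated to the evaluation length.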

	Short context benchmarks, showing that quality degradation from context extension is small (the Yarn-Solar models stay within 3 points of SOLAR-10.7B-v1.0 on every benchmark):

	| Model | Context Window | ARC-c | Hellaswag | MMLU | Truthful QA |
	|-------|---------------:|------:|----------:|-----:|------------:|
	| [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 8k | 59.98 | 83.31 | 64.16 | 42.15 |
	| [Yarn-Mistral-7b-64k](https://huggingface.co/NousResearch/Yarn-Mistral-7b-64k) | 64k | 59.38 | 81.21 | 61.32 | 42.50 |
	| [Yarn-Mistral-7b-128k](https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k) | 128k | 58.87 | 80.58 | 60.64 | 42.46 |
	| [SOLAR-10.7B-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-v1.0) | 4k | 61.95 | 84.60 | 65.48 | 45.04 |
	| [Yarn-Solar-10b-32k](https://huggingface.co/NousResearch/Yarn-Solar-10b-32k) | 32k | 59.64 | 83.65 | 64.36 | 44.82 |
	| [Yarn-Solar-10b-64k](https://huggingface.co/NousResearch/Yarn-Solar-10b-64k) | 64k | 59.21 | 83.08 | 63.57 | 45.70 |
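
To put a number on the short context degradation, the scores of Yarn-Solar-10b-64k can be compared with those of its base model, SOLAR-10.7B-v1.0. The sketch below does that arithmetic with the values copied from the table; the dictionary and function names are illustrative.

```python
# Scores copied from the short context table above.
BASE = {"ARC-c": 61.95, "Hellaswag": 84.60, "MMLU": 65.48, "Truthful QA": 45.04}
EXTENDED = {"ARC-c": 59.21, "Hellaswag": 83.08, "MMLU": 63.57, "Truthful QA": 45.70}


def score_deltas(base, extended):
    """Per-benchmark change (extended minus base), rounded to two decimals."""
    return {name: round(extended[name] - base[name], 2) for name in base}


deltas = score_deltas(BASE, EXTENDED)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

With these numbers, the largest drop for Yarn-Solar-10b-64k is 2.74 points on ARC-c, while Truthful QA rises by 0.66 points.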

	## Collaborators

	- [bloc97](https://github.com/bloc97): Methods, paper, and evals
	- [@theemozilla](https://twitter.com/theemozilla): Methods, paper, model training, and evals
	- [@EnricoShippole](https://twitter.com/EnricoShippole): Model training
	- [honglu2875](https://github.com/honglu2875): Paper and evals

	The authors would like to thank LAION AI for providing the compute used to train this model.
	It was trained on the [JUWELS](https://www.fz-juelich.de/en/ias/jsc/systems/supercomputers/juwels) supercomputer.