|
--- |
|
base_model: 01-ai/Yi-6B |
|
inference: false |
|
model_creator: 01-ai |
|
model_name: Yi 6B |
|
model_type: yi |
|
prompt_template: 'Human: {prompt} Assistant: |
|
|
|
' |
|
quantized_by: jezzarax |
|
license: apache-2.0 |
|
--- |
|
<!-- markdownlint-disable MD041 --> |
|
|
|
# Yi 6B - GGUF |
|
- Model creator: [01-ai](https://huggingface.co/01-ai) |
|
- Original model: [Yi 6B](https://huggingface.co/01-ai/Yi-6B)
|
- README and repo format based on [TheBloke](https://huggingface.co/TheBloke/)'s [Yi-34B-GGUF repo](https://huggingface.co/TheBloke/Yi-34B-GGUF)
|
|
|
<!-- description start --> |
|
## Description |
|
|
|
This repo contains GGUF format model files for [01-ai's Yi 6B](https://huggingface.co/01-ai/Yi-6B). |
|
|
|
These files were quantised using hardware kindly provided by [Massed Compute](https://massedcompute.com/). |
|
|
|
<!-- description end --> |
|
<!-- README_GGUF.md-about-gguf start --> |
|
### About GGUF |
|
|
|
GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. |
|
|
|
Here is an incomplete list of clients and libraries that are known to support GGUF: |
|
|
|
* [llama.cpp](https://github.com/ggerganov/llama.cpp). The source project for GGUF. Offers a CLI and a server option. |
|
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration. |
|
* [KoboldCpp](https://github.com/LostRuins/koboldcpp), a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
|
* [LM Studio](https://lmstudio.ai/), an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. |
|
* [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), a great web UI with many interesting and unique features, including a full model library for easy model selection. |
|
* [Faraday.dev](https://faraday.dev/), an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
|
* [ctransformers](https://github.com/marella/ctransformers), a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server. |
|
* [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. |
|
* [candle](https://github.com/huggingface/candle), a Rust ML framework with a focus on performance, including GPU support, and ease of use. |
|
|
|
<!-- README_GGUF.md-about-gguf end --> |
|
<!-- repositories-available start --> |
|
## Repositories available |
|
|
|
* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/jezzarax/yi-6b-GGUF) |
|
* [01-ai's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/01-ai/Yi-6B) |
|
<!-- repositories-available end --> |
|
|
|
<!-- prompt-template start --> |
|
## Prompt template: Yi |
|
|
|
``` |
|
Human: {prompt} Assistant: |
|
|
|
``` |
|
|
|
<!-- prompt-template end --> |
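For programmatic use, the template above can be applied with a small helper. This is a sketch: the function name and docstring are our own, not part of the upstream repo; only the `Human: ... Assistant:` wording comes from the template.

```python
# Hypothetical helper for the Yi prompt template shown above.
def format_yi_prompt(prompt: str) -> str:
    """Wrap a user message in the 'Human: ... Assistant:' template."""
    return f"Human: {prompt} Assistant:"

print(format_yi_prompt("What is GGUF?"))
```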
|
|
|
|
|
<!-- compatibility_gguf start --> |
|
## Compatibility |
|
|
|
These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221).
|
|
|
They are also compatible with many third party UIs and libraries - please see the list at the top of this README. |
|
|
|
## Explanation of quantisation methods |
|
|
|
<details> |
|
<summary>Click to see details</summary> |
|
|
|
The new methods available are: |
|
|
|
* GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
|
* GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
|
* GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw. |
|
* GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw |
|
* GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw |
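As a sanity check, the stated bpw figures follow from the super-block layouts above. The sketch below assumes 256 weights per super-block and an fp16 super-block scale (plus an fp16 super-block min for the "type-1" variants); these constants are inferred from the descriptions in this list, not read from the llama.cpp source.

```python
# Back-of-the-envelope bits-per-weight check for the k-quant types above.
# Layout constants are assumptions drawn from the descriptions in this
# section; this is a sketch, not the llama.cpp reference implementation.

SUPER_BLOCK = 256  # assumed weights per super-block
FP16 = 16          # bits for a half-precision super-block scale or min

def bpw(weight_bits, n_blocks, scale_bits, has_min):
    """Total bits in one super-block divided by its weights."""
    quant = weight_bits * SUPER_BLOCK                     # quantised weights
    meta = n_blocks * scale_bits * (2 if has_min else 1)  # block scales (+mins)
    header = FP16 * (2 if has_min else 1)                 # super-block scale (+min)
    return (quant + meta + header) / SUPER_BLOCK

print(bpw(3, 16, 6, False))  # Q3_K  -> 3.4375
print(bpw(4, 8, 6, True))    # Q4_K  -> 4.5
print(bpw(5, 8, 6, True))    # Q5_K  -> 5.5
print(bpw(6, 16, 8, False))  # Q6_K  -> 6.5625
```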
|
|
|
Refer to the Provided Files table below to see what files use which methods, and how. |
|
</details> |
|
<!-- compatibility_gguf end --> |
|
|
|
<!-- README_GGUF.md-provided-files start --> |
|
## Provided files |
|
|
|
| Name | Quant method | Bits | Size | Use case | |
|
| ---- | ---- | ---- | ---- | ----- | |
|
| [yi-6b.Q2_K.gguf](https://huggingface.co/jezzarax/yi-6b-GGUF/blob/main/yi-6b.Q2_K.gguf) | Q2_K | 2 | 2.5 GB | smallest, significant quality loss - not recommended for most purposes |

| [yi-6b.Q3_K_S.gguf](https://huggingface.co/jezzarax/yi-6b-GGUF/blob/main/yi-6b.Q3_K_S.gguf) | Q3_K_S | 3 | 2.6 GB | very small, high quality loss |

| [yi-6b.Q3_K.gguf](https://huggingface.co/jezzarax/yi-6b-GGUF/blob/main/yi-6b.Q3_K.gguf) | Q3_K | 3 | 2.8 GB | very small, high quality loss |

| [yi-6b.Q3_K_L.gguf](https://huggingface.co/jezzarax/yi-6b-GGUF/blob/main/yi-6b.Q3_K_L.gguf) | Q3_K_L | 3 | 3.1 GB | small, substantial quality loss |

| [yi-6b.Q4_K_S.gguf](https://huggingface.co/jezzarax/yi-6b-GGUF/blob/main/yi-6b.Q4_K_S.gguf) | Q4_K_S | 4 | 3.3 GB | small, greater quality loss |

| [yi-6b.Q4_K.gguf](https://huggingface.co/jezzarax/yi-6b-GGUF/blob/main/yi-6b.Q4_K.gguf) | Q4_K | 4 | 3.5 GB | medium, balanced quality - recommended |

| [yi-6b.Q5_K_S.gguf](https://huggingface.co/jezzarax/yi-6b-GGUF/blob/main/yi-6b.Q5_K_S.gguf) | Q5_K_S | 5 | 4.0 GB | large, low quality loss - recommended |

| [yi-6b.Q5_K.gguf](https://huggingface.co/jezzarax/yi-6b-GGUF/blob/main/yi-6b.Q5_K.gguf) | Q5_K | 5 | 4.1 GB | large, very low quality loss - recommended |

| [yi-6b.Q6_K.gguf](https://huggingface.co/jezzarax/yi-6b-GGUF/blob/main/yi-6b.Q6_K.gguf) | Q6_K | 6 | 4.7 GB | very large, extremely low quality loss |

| [yi-6b.f16.gguf](https://huggingface.co/jezzarax/yi-6b-GGUF/blob/main/yi-6b.f16.gguf) | f16 | 16 | 12 GB | very large, no quality loss - not recommended |
|
|
|
**Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. |
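To illustrate the note above, total memory at a given context length is roughly the file size in the table plus the KV cache. The sketch below is a rough estimate only; the layer count and KV width in the example are illustrative values for a 6B-class model, not figures read from the GGUF header.

```python
# Rough KV-cache size estimate (a sketch; all architecture numbers in
# the example call are illustrative assumptions, not GGUF metadata).

def kv_cache_gb(n_layers, n_ctx, kv_dim, bytes_per_elt=2):
    """fp16 K and V tensors for every layer at the given context length."""
    return 2 * n_layers * n_ctx * kv_dim * bytes_per_elt / 1024**3

# e.g. 32 layers, 2048-token context, 512-wide KV projection:
print(round(kv_cache_gb(32, 2048, 512), 3))  # -> 0.125
```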
|
|
|
|
|
|
|
<!-- README_GGUF.md-provided-files end --> |
|
|
|
<!-- README_GGUF.md-how-to-download start --> |
|
## How to download GGUF files |
|
|
|
**Note for manual downloaders:** You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. |
|
|
|
The following clients/libraries will automatically download models for you, providing a list of available models to choose from: |
|
|
|
* LM Studio |
|
* LoLLMS Web UI |
|
* Faraday.dev |
|
|
|
### In `text-generation-webui` |
|
|
|
Under Download Model, you can enter the model repo: jezzarax/yi-6b-GGUF and below it, a specific filename to download, such as: yi-6b.Q4_K.gguf.
|
|
|
Then click Download. |
|
|
|
### On the command line, including multiple files at once |
|
|
|
I recommend using the `huggingface-hub` Python library: |
|
|
|
```shell |
|
pip3 install huggingface-hub |
|
``` |
|
|
|
Then you can download any individual model file to the current directory, at high speed, with a command like this: |
|
|
|
```shell |
|
huggingface-cli download jezzarax/yi-6b-GGUF yi-6b.Q4_K.gguf --local-dir . --local-dir-use-symlinks False
|
``` |
|
|
|
<details> |
|
<summary>More advanced huggingface-cli download usage</summary> |
|
|
|
You can also download multiple files at once with a pattern: |
|
|
|
```shell |
|
huggingface-cli download jezzarax/yi-6b-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K*gguf' |
|
``` |
|
|
|
For more documentation on downloading with `huggingface-cli`, please see: [HF -> Hub Python Library -> Download files -> Download from the CLI](https://huggingface.co/docs/huggingface_hub/guides/download#download-from-the-cli). |
|
|
|
To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: |
|
|
|
```shell |
|
pip3 install hf_transfer |
|
``` |
|
|
|
And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: |
|
|
|
```shell |
|
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download jezzarax/yi-6b-GGUF yi-6b.Q4_K.gguf --local-dir . --local-dir-use-symlinks False
|
``` |
|
|
|
Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. |
|
</details> |
|
<!-- README_GGUF.md-how-to-download end --> |
|
|
|
<!-- README_GGUF.md-how-to-run start --> |
|
## Example `llama.cpp` command |
|
|
|
Make sure you are using `llama.cpp` from commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later. |
|
|
|
```shell |
|
./main -ngl 32 -m yi-6b.Q4_K.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Human: {prompt} Assistant:"
|
``` |
|
|
|
Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. |
|
|
|
Change `-c 2048` to the desired sequence length. For extended sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically.
|
|
|
If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
|
|
|
For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md) |
|
|
|
## How to run in `text-generation-webui` |
|
|
|
Further instructions here: [text-generation-webui/docs/llama.cpp.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp.md). |
|
|
|
## How to run from Python code |
|
|
|
You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) or [ctransformers](https://github.com/marella/ctransformers) libraries. |
|
|
|
### How to load this model in Python code, using ctransformers |
|
|
|
#### First install the package |
|
|
|
Run one of the following commands, according to your system: |
|
|
|
```shell |
|
# Base ctransformers with no GPU acceleration |
|
pip install ctransformers |
|
# Or with CUDA GPU acceleration |
|
pip install ctransformers[cuda] |
|
# Or with AMD ROCm GPU acceleration (Linux only) |
|
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers |
|
# Or with Metal GPU acceleration for macOS systems only |
|
CT_METAL=1 pip install ctransformers --no-binary ctransformers |
|
``` |
|
|
|
#### Simple ctransformers example code |
|
|
|
```python |
|
from ctransformers import AutoModelForCausalLM |
|
|
|
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system. |
|
llm = AutoModelForCausalLM.from_pretrained("jezzarax/yi-6b-GGUF", model_file="yi-6b.Q4_K.gguf", model_type="yi", gpu_layers=50)
|
|
|
print(llm("AI is going to")) |
|
``` |
|
|
|
## How to use with LangChain |
|
|
|
Here are guides on using llama-cpp-python and ctransformers with LangChain: |
|
|
|
* [LangChain + llama-cpp-python](https://python.langchain.com/docs/integrations/llms/llamacpp) |
|
* [LangChain + ctransformers](https://python.langchain.com/docs/integrations/providers/ctransformers) |
|
|
|
<!-- README_GGUF.md-how-to-run end --> |
|
|
|
<!-- original-model-card start --> |
|
# Original model card: 01-ai's Yi 6B |
|
|
|
<div align="center"> |
|
|
|
<img src="./Yi.svg" width="200px"> |
|
|
|
</div> |
|
|
|
## Introduction |
|
|
|
The **Yi** series models are large language models trained from scratch by |
|
developers at [01.AI](https://01.ai/). The first public release contains two |
|
bilingual (English/Chinese) base models with parameter sizes of 6B ([`Yi-6B`](https://huggingface.co/01-ai/Yi-6B))

and 34B ([`Yi-34B`](https://huggingface.co/01-ai/Yi-34B)). Both are trained

with a 4K sequence length, which can be extended to 32K at inference time.

[`Yi-6B-200K`](https://huggingface.co/01-ai/Yi-6B-200K)

and [`Yi-34B-200K`](https://huggingface.co/01-ai/Yi-34B-200K) are base models with a

200K context length.
|
|
|
## News |
|
|
|
- 🎯 **2023/11/06**: Released the base models [`Yi-6B-200K`](https://huggingface.co/01-ai/Yi-6B-200K)

and [`Yi-34B-200K`](https://huggingface.co/01-ai/Yi-34B-200K) with 200K context length.

- 🎯 **2023/11/02**: Released the base models [`Yi-6B`](https://huggingface.co/01-ai/Yi-6B) and

[`Yi-34B`](https://huggingface.co/01-ai/Yi-34B).
|
|
|
|
|
## Model Performance |
|
|
|
| Model | MMLU | CMMLU | C-Eval | GAOKAO | BBH | Common-sense Reasoning | Reading Comprehension | Math & Code | |
|
| :------------ | :------: | :------: | :------: | :------: | :------: | :--------------------: | :-------------------: | :---------: | |
|
| | 5-shot | 5-shot | 5-shot | 0-shot | 3-shot@1 | - | - | - | |
|
| LLaMA2-34B | 62.6 | - | - | - | 44.1 | 69.9 | 68.0 | 26.0 | |
|
| LLaMA2-70B | 68.9 | 53.3 | - | 49.8 | 51.2 | 71.9 | 69.4 | 36.8 | |
|
| Baichuan2-13B | 59.2 | 62.0 | 58.1 | 54.3 | 48.8 | 64.3 | 62.4 | 23.0 | |
|
| Qwen-14B | 66.3 | 71.0 | 72.1 | 62.5 | 53.4 | 73.3 | 72.5 | **39.8** | |
|
| Skywork-13B | 62.1 | 61.8 | 60.6 | 68.1 | 41.7 | 72.4 | 61.4 | 24.9 | |
|
| InternLM-20B | 62.1 | 59.0 | 58.8 | 45.5 | 52.5 | 78.3 | - | 30.4 | |
|
| Aquila-34B | 67.8 | 71.4 | 63.1 | - | - | - | - | - | |
|
| Falcon-180B | 70.4 | 58.0 | 57.8 | 59.0 | 54.0 | 77.3 | 68.8 | 34.0 | |
|
| Yi-6B | 63.2 | 75.5 | 72.0 | 72.2 | 42.8 | 72.3 | 68.7 | 19.8 | |
|
| Yi-6B-200K | 64.0 | 75.3 | 73.5 | 73.9 | 42.0 | 72.0 | 69.1 | 19.0 | |
|
| **Yi-34B** | **76.3** | **83.7** | 81.4 | 82.8 | **54.3** | **80.1** | 76.4 | 37.1 | |
|
| Yi-34B-200K | 76.1 | 83.6 | **81.9** | **83.4** | 52.7 | 79.7 | **76.6** | 36.3 | |
|
|
|
While benchmarking open-source models, we have observed a disparity between the |
|
results generated by our pipeline and those reported in public sources (e.g. |
|
OpenCompass). Upon conducting a more in-depth investigation of this difference, |
|
we have discovered that various models may employ different prompts, |
|
post-processing strategies, and sampling techniques, potentially resulting in |
|
significant variations in the outcomes. Our prompt and post-processing strategy |
|
remains consistent with the original benchmark, and greedy decoding is employed |
|
during evaluation without any post-processing for the generated content. For |
|
scores that were not reported by the original authors (including scores reported |
|
with different settings), we try to get results with our pipeline. |
|
|
|
To evaluate the model's capability extensively, we adopted the methodology |
|
outlined in Llama2. Specifically, we included PIQA, SIQA, HellaSwag, WinoGrande, |
|
ARC, OBQA, and CSQA to assess common sense reasoning. SquAD, QuAC, and BoolQ |
|
were incorporated to evaluate reading comprehension. CSQA was exclusively tested |
|
using a 7-shot setup, while all other tests were conducted with a 0-shot |
|
configuration. Additionally, we introduced GSM8K (8-shot@1), MATH (4-shot@1), |
|
HumanEval (0-shot@1), and MBPP (3-shot@1) under the category "Math & Code". Due |
|
to technical constraints, we did not test Falcon-180B on QuAC and OBQA; its score
|
is derived by averaging the scores on the remaining tasks. Since the scores for |
|
these two tasks are generally lower than the average, we believe that |
|
Falcon-180B's performance was not underestimated. |
|
|
|
## Usage |
|
|
|
Please visit our [github repository](https://github.com/01-ai/Yi) for general |
|
guidance on how to use this model. |
|
|
|
## Disclaimer |
|
|
|
Although we use data compliance checking algorithms during the training process |
|
to ensure the compliance of the trained model to the best of our ability, due to |
|
the complexity of the data and the diversity of language model usage scenarios, |
|
we cannot guarantee that the model will generate correct and reasonable output |
|
in all scenarios. Please be aware that there is still a risk of the model |
|
producing problematic outputs. We will not be responsible for any risks and |
|
issues resulting from misuse, misguidance, illegal usage, and related |
|
misinformation, as well as any associated data security concerns. |
|
|
|
## License |
|
|
|
The Yi series models are fully open for academic research and free commercial |
|
use, with permission granted upon application. All usage must adhere to the [Model
|
License Agreement 2.0](https://huggingface.co/01-ai/Yi-34B/blob/main/LICENSE). To |
|
apply for the official commercial license, please contact us |
|
([[email protected]](mailto:[email protected])). |
|
|
|
<!-- original-model-card end --> |
|
|
|
|