WizardLM 70B V1.0 – EXL2

Models available:

| Link | Bits (-b) | Head bits (-hb) | Measurement length (-ml) | Length (-l) | Calibration dataset (-c) | Size | Version | Max Context Length | Base Model | Layers | VRAM Min*** | VRAM Max*** | PPL** | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| here | 4.0 | 6 | 2048 | 2048 | 0000.parquet* | 33GB | 0.0.2 | 4096 | FP32 | 80 | 39GB | 44GB | 4.15234375 | Good results |
| here | 4.0 | 6 | 2048 | 2048 | 0000.parquet* | 33GB | 0.0.2 | 4096 | BF16 | 80 | 39GB | 44GB | 4.2421875 | Model suffers from poor prompt understanding and logic is affected |
| here | 4.0 | 8 | 2048 | 2048 | 0000.parquet* | 35GB | 0.0.2 | 4096 | FP16 | 80 | 39GB | 44GB | 4.24609375 | Model suffers from poor prompt understanding and logic is affected |
| here | 5.0 | 6 | 2048 | 2048 | 0000.parquet* | 41GB | 0.0.2 | 4096 | FP32 | 80 | 47GB | 52GB | 4.06640625 | Best so far. Good results |
| here | 5.0 | 8 | 2048 | 2048 | 0000.parquet* | 44GB | 0.0.2 | 4096 | FP16 | 80 | 48GB | 52GB | 4.09765625 | Model suffers from poor prompt understanding and logic is affected |
| here | 5.0 | 6 | 2048 | 2048 | 0000.parquet* | 44GB | 0.0.1 | 4096 | FP16 | 80 | 48GB | 52GB | 4.0625 | Model suffers from poor prompt understanding and logic is affected |
| here | 5.0 | 6 | 2048 | 2048 | 0000.parquet* | 41GB | 0.0.2 | 4096 | BF16 | 80 | 47GB | 52GB | 4.09765625 | Model suffers from poor prompt understanding and logic is affected |
| here | 6.0 | 6 | 2048 | 2048 | 0000.parquet* | 49GB | 0.0.2 | 4096 | FP16 | 80 | 56GB | 60GB | 4.0703125 | Model suffers from poor prompt understanding and logic is affected |

* wikitext-2-raw-v1

** Evaluated with text-generation-webui ExLlama v0.0.2 on wikitext-2-raw-v1 (stride 512, max_length 0). For reference, TheBloke_WizardLM-70B-V1.0-GPTQ_gptq-4bit-32g-actorder_True scores a perplexity of 4.1015625.

*** Measured without Flash Attention. To reduce VRAM usage, install Flash Attention: https://github.com/Dao-AILab/flash-attention#installation-and-features
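
As a reminder of what the PPL column measures, here is a minimal sketch of the perplexity formula: the exponential of the mean negative log-likelihood per token. It is not the evaluation code used by text-generation-webui; the function name and the toy input are illustrative assumptions.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood), using natural logs."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy input (assumption): a model that assigns probability 1/4 to every token
# has a per-token NLL of log(4), so its perplexity is approximately 4.
print(perplexity([math.log(4)] * 10))
```

Lower is better: a perplexity near 4 means the model is, on average, about as uncertain as a uniform choice among four tokens.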

Description:

This repository contains EXL2 model files for WizardLM's WizardLM 70B V1.0.

EXL2 is a new format used by ExLlamaV2 – https://github.com/turboderp/exllamav2. EXL2 is based on the same optimization method as GPTQ. The format allows for mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight.
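
To get a feel for what an average bitrate means for file size, the following back-of-the-envelope sketch multiplies the parameter count by the bits per weight. It assumes a nominal 70 billion parameters and ignores the head layer bitrate, the embeddings, and any format overhead, so treat the numbers as rough estimates only.

```python
def estimate_weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 70e9  # assumption: nominal parameter count of a "70B" model

for bpw in (4.0, 5.0, 6.0):
    print(f"{bpw} bpw -> {estimate_weights_gib(N_PARAMS, bpw):.1f} GiB")
# 4.0 bpw -> 32.6 GiB
# 5.0 bpw -> 40.7 GiB
# 6.0 bpw -> 48.9 GiB
```

These estimates land close to the 33GB, 41GB and 49GB sizes listed in the table above.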

Prompt template (official):

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {prompt} ASSISTANT: 

Prompt template (suggested):

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER:
{prompt}
ASSISTANT:
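
The two templates can be filled in programmatically. The helper below is an illustrative sketch; `build_prompt` and `SYSTEM` are hypothetical names, and whether the suggested template should end with a trailing newline is not specified above, so none is added.

```python
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(user_message: str, suggested: bool = True) -> str:
    """Fill either the suggested (multi-line) or official (single-line) template."""
    if suggested:
        return f"{SYSTEM}\nUSER:\n{user_message}\nASSISTANT:"
    return f"{SYSTEM} USER: {user_message} ASSISTANT: "


print(build_prompt("Explain what EXL2 is in one sentence."))
```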

Quantization process:

Original Model → (optional) float16 or bfloat16 Model* → Safetensors Model** → EXL2 Model
WizardLM 70B V1.0 → WizardLM 70B V1.0-HF* → Safetensors** → EXL2

Example to convert WizardLM-70B-V1.0-HF to EXL2 4.0 bpw with 6-bit head:

mkdir -p ~/EXL2/WizardLM-70B-V1.0-HF_4bit # Create the output directory
python convert.py -i ~/float16_safetensored/WizardLM-70B-V1.0-HF -o ~/EXL2/WizardLM-70B-V1.0-HF_4bit -c ~/EXL2/0000.parquet -b 4.0 -hb 6
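
When producing several variants (for example, each bits and head bits combination in the table above), it can help to build the command from parameters. This sketch only assembles the argument list using the flags shown in this card (-i, -o, -c, -b, -hb, -ml, -l); `build_convert_command` is a hypothetical helper, not part of ExLlamaV2.

```python
import os
import shlex


def build_convert_command(model_dir, out_dir, cal_dataset, bits, head_bits,
                          measurement_length=2048, length=2048):
    """Return the convert.py invocation as an argument list."""
    return [
        "python", "convert.py",
        "-i", os.path.expanduser(model_dir),    # input: safetensors model directory
        "-o", os.path.expanduser(out_dir),      # output directory
        "-c", os.path.expanduser(cal_dataset),  # calibration dataset (parquet)
        "-b", str(bits),                        # average bits per weight
        "-hb", str(head_bits),                  # head layer bits
        "-ml", str(measurement_length),         # measurement length
        "-l", str(length),                      # length
    ]


cmd = build_convert_command(
    "~/float16_safetensored/WizardLM-70B-V1.0-HF",
    "~/EXL2/WizardLM-70B-V1.0-HF_4bit",
    "~/EXL2/0000.parquet",
    bits=4.0,
    head_bits=6,
)
print(shlex.join(cmd))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` from the directory that contains convert.py.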

* Use the following script to convert your local pytorch_model bin files to float16 (you can also choose bfloat16) + safetensors all in one go:

Example to convert WizardLM 70B V1.0 directly to float16 safetensors in 10GB shards:

python convert-to-safetensors.py ~/original/WizardLM-70B-V1.0 --output ~/float16_safetensored/WizardLM-70B-V1.0 --max-shard-size 10GB

Use --bf16 if you'd like to try bfloat16 instead, but note that there are concerns about quantization quality – https://github.com/turboderp/exllamav2/issues/30#issuecomment-1719009289
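
For readers without the script at hand, the same conversion can be approximated with the Hugging Face transformers API. This is a sketch under assumptions, not the contents of convert-to-safetensors.py: it assumes `torch` and `transformers` are installed and that the machine has enough RAM to hold the full model (on the order of 140GB for a 70B model in 16-bit). The function names are hypothetical.

```python
def resolve_dtype_name(bf16: bool = False) -> str:
    """Map the --bf16 style switch to a torch dtype name."""
    return "bfloat16" if bf16 else "float16"


def convert_to_safetensors(src: str, dst: str, bf16: bool = False,
                           max_shard_size: str = "10GB") -> None:
    """Load a local checkpoint in 16-bit and re-save it as sharded safetensors."""
    # Heavy imports are kept inside the function so this file can be
    # imported on machines without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    dtype = getattr(torch, resolve_dtype_name(bf16))
    model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=dtype)
    model.save_pretrained(dst, safe_serialization=True,
                          max_shard_size=max_shard_size)
    # Save the tokenizer alongside the weights so the output is self-contained.
    AutoTokenizer.from_pretrained(src).save_pretrained(dst)
```

Calling `convert_to_safetensors("/path/to/original/WizardLM-70B-V1.0", "/path/to/float16_safetensored/WizardLM-70B-V1.0")` would mirror the command above; both paths are placeholders.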

** Use any one of the following scripts to convert your local pytorch_model bin files to safetensors:

Further reading:
