|
--- |
|
inference: false |
|
--- |
|
A 4-bit GPTQ quantization of the vicuna-13b-v1.1 model.
|
|
|
The delta was added to the original LLaMA weights using FastChat. \
|
Quantization and inference were done with GPTQ-for-LLaMa (commit 58c8ab4).
|
|
|
Quantization args: $MODEL_DIRECTORY, c4, wbits 4, true-sequential, act-order, groupsize 128. \ |
|
Inference args: $MODEL_DIRECTORY, wbits 4, groupsize 128, load $CHECKPOINT_FILE \ |
|
Add the arg device=0 when running inference on a GPU. You may need to adjust min_length and max_length to get better outputs; command-line sketches are given below.
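
For orientation, here is a minimal sketch of how the quantization args map onto the GPTQ-for-LLaMa scripts. The script name and flag spellings follow the repository around that commit and may differ in other revisions; $MODEL_DIRECTORY and $CHECKPOINT_FILE are placeholders for the merged FP16 weights and the output file.

```bash
# Quantize the merged FP16 weights: calibrate on c4, 4-bit, true-sequential,
# act-order, groupsize 128, then save the packed checkpoint.
python llama.py $MODEL_DIRECTORY c4 \
  --wbits 4 --true-sequential --act-order --groupsize 128 \
  --save $CHECKPOINT_FILE
```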
|
|
|
The separator has been changed to \</s\>. A simple prompt looks like "Human: $REQUEST\</s\>Assistant:".
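
Putting the inference args and the prompt format together, a single-turn generation might look like the following sketch (again based on the GPTQ-for-LLaMa scripts at that commit; the request text and the min/max length values are only examples, and --device 0 is the GPU variant mentioned above).

```bash
# Generate from the 4-bit checkpoint on GPU 0; tune --min_length/--max_length as needed.
python llama_inference.py $MODEL_DIRECTORY \
  --wbits 4 --groupsize 128 --load $CHECKPOINT_FILE \
  --device 0 --min_length 32 --max_length 512 \
  --text "Human: Write a short poem about quantization.</s>Assistant:"
```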
|
|
|
Delta: https://huggingface.co/lmsys/vicuna-13b-delta-v1.1 \ |
|
FastChat: https://github.com/lm-sys/FastChat \ |
|
GPTQ-for-LLaMa: https://github.com/qwopqwop200/GPTQ-for-LLaMa
|
|