mlabonne 
posted an update Apr 2, 2024
⚡ AutoQuant

AutoQuant is the evolution of my previous AutoGGUF notebook (https://colab.research.google.com/drive/1P646NEg33BZy4BfLDNpTz0V0lwIU3CHu). It allows you to quantize your models in five different formats:

- GGUF: perfect for inference on CPUs (and LM Studio)
- GPTQ/EXL2: fast inference on GPUs
- AWQ: super fast inference on GPUs with vLLM (https://github.com/vllm-project/vllm)
- HQQ: extreme quantization with decent 2-bit and 3-bit models

Once the model is converted, it is automatically uploaded to the Hugging Face Hub. To quantize a 7B model, GGUF only needs a T4 GPU, while the other methods require an A100 GPU.
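
For a rough sense of scale, here is a back-of-the-envelope estimate of the weight footprint of a 7B model at different bit widths. It is only a sketch: it counts raw weights and ignores quantization metadata (scales, zero points), activations, and whatever working memory each method needs during conversion, so it does not by itself explain the GPU requirements above. The function name `weight_size_gb` is made up for this example.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# 7e9 parameters is a round number for a "7B" model, not an exact count.
for bits in (16, 8, 4, 3, 2):
    print(f"{bits:>2}-bit: {weight_size_gb(7e9, bits):.2f} GB")
```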

Here's an example of a model I quantized using HQQ and AutoQuant: mlabonne/AlphaMonarch-7B-2bit-HQQ

I hope you'll enjoy it and quantize lots of models! :)

💻 AutoQuant: https://colab.research.google.com/drive/1b6nqC7UZVt8bx4MksX7s656GXPM-eWw4

This could work quite well inside a Space too!

The UX could be really nice imo

Do you have an A100 at your disposal, Maxim? Or do you use a cloud version sometimes?

I simply use Colab if I want to quantize a 7B model. Otherwise, cloud GPUs when needed.

Hello, first of all, thanks for this great notebook. I am trying to use it to quantize my fine-tuned model, but I am getting the following error:
Wrote Hermes-7B-TR/hermes-7b-tr.fp16.bin
./llama.cpp/quantize: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
./llama.cpp/quantize: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

I also had an error with numpy that said to upgrade, so I just added pip install -U numpy to the code.

The model I am trying to quantize is https://huggingface.co/umarigan/Hermes-7B-TR. It was fine-tuned with TRL and is based on Mistral.

Thanks @umarigan !

> I also had an error with numpy that said to upgrade, so I just added pip install -U numpy to the code.

I tried it and it doesn't fix this issue, unfortunately. You should be able to ignore it, it doesn't impact the quantization.

I was able to quantize your model using AutoQuant, so not sure where your error comes from: https://huggingface.co/mlabonne/Hermes-7B-TR-GGUF
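
For anyone hitting the same libcuda.so.1 message, a quick check of whether the CUDA driver library is loadable in the current runtime can help narrow it down. This is a minimal sketch using only the Python standard library. The likely interpretation (a runtime without a GPU attached, while the binary was built with CUDA) is an assumption and was not confirmed in this thread; `can_load` is a hypothetical helper name.

```python
import ctypes


def can_load(lib_name):
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Same failure mode as the "cannot open shared object file" error.
        return False


if can_load("libcuda.so.1"):
    print("libcuda.so.1 is loadable: the CUDA driver is visible.")
else:
    print("libcuda.so.1 is not loadable: check that a GPU runtime is selected.")
```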

If you want to go really deep, you can technically make HQQ quants on a 16GB GPU, no A100 needed; there is just no convenient API for that. (I used my RTX 3080 Ti laptop for Mixtral, and I think HQQ has enough backends to work with a T4.)

HQQ doesn't quantize the "model" in a way where one quant depends on another; it quantizes each nn.Linear separately.

So you can load them one by one manually from the pytorch-NNNN.bin/.safetensors shards (especially safetensors, as they support lazy loading), quantize, save the quantized data to disk, discard both the quantized and the raw nn.Linear data from VRAM, then rinse and repeat. (Lots of models do not use nn.Linear.bias, so each nn.Linear is a single weight tensor that always sits in a single shard, which helps a lot.)

Also, HQQ Mixtral is still slow on my laptop, so while I got the quantized model running, I never checked how hard it is to save it after such manual quantization in a way that can be easily loaded using the standard API.
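
The shard-by-shard loop described above can be sketched roughly like this. Everything here is a toy under stated assumptions: shards are plain dicts of Python float lists rather than real checkpoint files, `quantize_rtn` is a simple round-to-nearest quantizer standing in for HQQ (it is not the HQQ algorithm), and the names `quantize_shards` and `linear_suffixes` are made up for illustration. With real checkpoints, the loader would be something like lazy tensor access via safetensors.

```python
import json
import os
import tempfile


def quantize_rtn(weights, bits=2):
    """Round-to-nearest quantization of a flat list of floats (stand-in, NOT HQQ)."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return {"codes": codes, "scale": scale, "zero": lo, "bits": bits}


def dequantize(q):
    """Map integer codes back to approximate float weights."""
    return [c * q["scale"] + q["zero"] for c in q["codes"]]


def quantize_shards(shard_loaders, out_dir, bits=2, linear_suffixes=("proj.weight",)):
    """Quantize one linear weight at a time, keeping only one shard in memory.

    shard_loaders: iterable of zero-argument callables, each returning a
    dict {tensor_name: list_of_floats} (hypothetical toy shard format).
    """
    written = []
    for load_shard in shard_loaders:
        shard = load_shard()  # load a single shard
        for name in list(shard):
            if not name.endswith(linear_suffixes):
                continue  # skip non-linear tensors such as norms
            q = quantize_rtn(shard.pop(name), bits)  # pop drops the raw copy
            path = os.path.join(out_dir, name + ".json")
            with open(path, "w") as f:
                json.dump(q, f)  # persist the quantized data
            written.append(path)
            del q  # discard the quantized data before the next layer
        del shard  # discard the shard before loading the next one
    return written


out_dir = tempfile.mkdtemp()
toy_shards = [
    lambda: {"layers.0.q_proj.weight": [0.0, 1.0, 2.0, 3.0], "layers.0.norm.weight": [1.0]},
    lambda: {"layers.1.q_proj.weight": [-1.0, 1.0]},
]
print(quantize_shards(toy_shards, out_dir, bits=2))
```

Because each layer is quantized independently, peak memory in this loop is bounded by one shard plus one quantized layer, which is the point of the manual approach.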

Fantastic, thanks so much for sharing. I only have a couple thousand models I want to quantize! I'm using the GGUF-my-repo Space at the moment:

https://huggingface.co/spaces/ggml-org/gguf-my-repo

Do you know of any way to use the same Colab-style method (or a Space, or something else) to make GGUFs with imatrix?

Bro, I was searching for days for something like this. God damn, you are a lifesaver.