zamal committed · Commit 93a9d1f (verified) · 1 Parent(s): 9b6eaa8

Update README.md

Files changed (1)
  1. README.md +48 -3
README.md CHANGED
@@ -1,3 +1,48 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ ---
4
+
5
+
6
+ # Molmo-7B-GPTQ-4bit 🚀
7
+
8
+ ## Overview
9
+
10
+ **Molmo-7B-GPTQ-4bit** is a 4-bit quantized version of Molmo-7B-D, a transformer-based vision-language model, packaged for memory-efficient deployment. Despite the "GPTQ" in its name, the model was quantized with **bitsandbytes** rather than **AutoGPTQ**, which does not natively support this model format as of now. The quantization is configured through `BitsAndBytesConfig` from the `transformers` library, which reduces GPU memory usage during inference.
11
+
12
+ ## Model Card
13
+
14
+ ### Model Information
15
+
16
+ - **Model Name**: Molmo-7B-GPTQ-4bit
17
+ - **Base Model**: [allenai/Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924)
18
+ - **Quantization**: 4-bit quantization using `bitsandbytes` instead of `AutoGPTQ`
19
+ - **Repository URL**: [zamal/Molmo-7B-GPTQ-4bit](https://huggingface.co/zamal/Molmo-7B-GPTQ-4bit)
20
+
21
+ ### Technical Details
22
+
23
+ This model is quantized with **bitsandbytes**, not **AutoGPTQ**. NF4 is a 4-bit data type provided by bitsandbytes, and `AutoGPTQ` currently offers no native way to produce it. This approach allows 4-bit inference with a reduced memory footprint and, typically, only a small loss in output quality.
24
+
25
+ #### Key Quantization Configurations:
26
+
27
+ - **bnb_4bit_use_double_quant**: Enabled. The per-block quantization constants are themselves quantized, which saves a little extra memory.
28
+ - **bnb_4bit_quant_type**: NF4 (4-bit NormalFloat), a data type designed for weights that are approximately normally distributed.
29
+ - **bnb_4bit_compute_dtype**: FP16 (float16). Weights are stored in 4 bits and dequantized to float16 for computation on the GPU.
30
+
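The three settings listed above could be expressed roughly as follows. This is a minimal sketch, not the author's actual script: `load_in_4bit=True` is an assumption (the README only says the model is 4-bit), and `QUANT_SETTINGS` and `make_bnb_config` are hypothetical names chosen for illustration.

```python
# Settings taken from the list above. load_in_4bit is an assumption:
# the README states 4-bit quantization but does not show the flag.
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "float16",
}


def make_bnb_config(settings=QUANT_SETTINGS):
    """Build a BitsAndBytesConfig from a plain dict of settings.

    torch and transformers are imported lazily so the settings can be
    inspected on machines where they are not installed.
    """
    import torch
    from transformers import BitsAndBytesConfig

    kwargs = dict(settings)
    # Translate the dtype name ("float16") into the torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**kwargs)
```

The object returned by `make_bnb_config()` would typically be passed as `quantization_config=` to a `from_pretrained` call.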
31
+ #### Device Compatibility:
32
+
33
+ - **Device mapping**: passing `device_map="auto"` when loading lets `transformers` (through Accelerate) place the model on the available GPUs automatically.
34
+ - **4-bit models** suit GPUs with limited VRAM, since they make it possible to run models that would not fit in memory at full precision.
35
+
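To make the VRAM argument concrete, here is a back-of-the-envelope estimate of weight storage alone. It is only a sketch under stated assumptions: the nominal 7e9 parameter count is taken from the model name rather than measured, and the estimate ignores quantization constants, activations, and any layers left unquantized. `weight_footprint_gb` is a hypothetical helper.

```python
def weight_footprint_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: nominal parameter count implied by "7B" in the model name.
N_PARAMS = 7e9

fp16_gb = weight_footprint_gb(N_PARAMS, 16)  # half-precision weights
nf4_gb = weight_footprint_gb(N_PARAMS, 4)    # 4-bit weights

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink by a factor of four, which is what lets the model fit on smaller GPUs. Actual usage will be higher once runtime overhead is included.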
36
+ ### Limitations
37
+
38
+ - **Precision Loss**: 4-bit quantization trades some numerical precision for efficiency, so outputs may be slightly less accurate than those of the original full-precision model.
39
+ - **AutoGPTQ Limitation**: As mentioned, **AutoGPTQ** does not natively support this kind of quantization, so it was done with `bitsandbytes` and the `transformers` library instead.
40
+
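The precision trade-off can be illustrated with a toy round trip. Note that this sketch uses plain uniform absmax rounding to 4-bit integer codes, which is a simpler stand-in and not the NF4 scheme that bitsandbytes applies; the function names and sample weights are hypothetical.

```python
def quantize_absmax_4bit(values):
    """Map floats to signed integer codes in [-7, 7] with one shared scale."""
    absmax = max((abs(v) for v in values), default=0.0)
    # Avoid dividing by zero when every value is 0.
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -0.07, 0.13, -0.35, 0.02]  # made-up example values
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("max round-trip error:", max_error)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the kind of small, bounded error the limitation above refers to.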
41
+ ## Usage
42
+
43
+ ### Installation
44
+
45
+ Make sure you have the necessary dependencies installed:
46
+
47
+ ```bash
48
+ pip install torch transformers accelerate bitsandbytes huggingface_hub