license: apache-2.0

Molmo-7B-GPTQ-4bit 🚀

Overview

Molmo-7B-GPTQ-4bit is a transformer-based model fine-tuned for NLP tasks and quantized to 4-bit precision for efficient deployment. Despite the name, it was quantized with bitsandbytes rather than AutoGPTQ, which does not natively support this model format at the time of writing. The quantization uses BitsAndBytesConfig from the transformers library, which reduces memory usage for GPU inference.

Model Architecture

Model Information

Technical Details

This model is quantized with bitsandbytes (not AutoGPTQ), because AutoGPTQ does not currently support NF4 4-bit quantization. This approach gives efficient 4-bit inference with a small loss in accuracy and a much lower memory footprint.

Key Quantization Configurations:

  • bnb_4bit_use_double_quant: Enabled. The quantization constants are themselves quantized, which saves additional memory.
  • bnb_4bit_quant_type: NF4 (4-bit NormalFloat), a 4-bit data type designed for normally distributed weights.
  • bnb_4bit_compute_dtype: FP16 (float16). Weights are dequantized to float16 for computation, which accelerates GPU-based inference.
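
The settings listed above could be expressed with BitsAndBytesConfig roughly as follows. This is a minimal sketch, not necessarily the exact configuration used to produce this checkpoint: load_in_4bit=True is an assumption (the list above does not name it), and QUANT_SETTINGS and build_quant_config are illustrative names.

```python
# Settings taken from the list above. load_in_4bit is an assumption:
# 4-bit loading has to be switched on for the bnb_4bit_* options to apply.
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "float16",
}


def build_quant_config(settings=QUANT_SETTINGS):
    """Build a transformers BitsAndBytesConfig from the settings dict."""
    # Imported lazily so the settings can be inspected without torch installed.
    import torch
    from transformers import BitsAndBytesConfig

    kwargs = dict(settings)
    # Turn the dtype name into the torch dtype object (torch.float16).
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**kwargs)
```

The resulting object is what from_pretrained accepts through its quantization_config argument.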

Device Compatibility:

  • Passing device_map="auto" when loading the model places its layers on the available GPUs automatically. This placement is handled by the accelerate package rather than by bitsandbytes itself.
  • 4-bit models are ideal for GPUs with limited VRAM, allowing inference on larger models without exceeding hardware memory limits.
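
To make the VRAM saving concrete, here is a rough back-of-the-envelope estimate of weight memory. It assumes a nominal 7 billion parameters (read off the model name, not measured), counts weights only, and ignores quantization constants, activations, and the KV cache. weight_memory_gb is an illustrative helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: nominal size suggested by "7B" in the model name

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision baseline
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit weights

print(f"float16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB to about 3.5 GB. Actual usage at inference time will be higher once activations and runtime overhead are included.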

Limitations

  • Precision Loss: 4-bit quantization trades some precision for efficiency, so output quality may be slightly lower than with the original full-precision model.
  • AutoGPTQ Limitation: As mentioned, AutoGPTQ does not natively support this kind of quantization, so the model was quantized with bitsandbytes and the transformers library instead.

Usage

Installation

Make sure you have the necessary dependencies installed:

pip install transformers torch Pillow torchvision einops accelerate tensorflow bitsandbytes
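
After installing, a quick check like the one below can confirm that each dependency is visible to the current Python environment. It is only a sketch: missing_modules is an illustrative helper, and the list uses import names, which differ from the pip names in one case (Pillow is imported as PIL).

```python
import importlib.util

# Import names for the packages in the pip command above.
# Note that Pillow is imported as "PIL".
REQUIRED_MODULES = [
    "transformers", "torch", "PIL", "torchvision",
    "einops", "accelerate", "tensorflow", "bitsandbytes",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

The check only looks the modules up without importing them, so it runs quickly even with heavy packages such as tensorflow installed.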