---
license: mit
---

# Quantized BitNet-B1-58-3B

This repository contains a quantized version of the [1bitLLM/bitnet_b1_58-3B](https://huggingface.co/1bitLLM/bitnet_b1_58-3B) model. While the original repository showcases impressive validation results, it emulates BitNet's Linear layers, resulting in memory usage similar to that of fp16 models. By leveraging the QuantLinear module from [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ), this repository makes it possible to export and run a true 2-bit quantized model.

The quantized model offers significant advantages in model size and memory consumption. With a model size of just 1GB, the quantized 3B model can run inference with a context size of 2048 while consuming only 4.5GB of VRAM. Furthermore, since the weights used during execution are identical to those of the original repository, the perplexity (PPL) results remain unchanged.

## Install

```
pip install -r requirements.txt
```

## Quantization

The quantized model is already provided in this repository. However, if you wish to quantize the model yourself, you can load it from 1bitLLM/bitnet_b1_58-3B and save the 2-bit quantized version to ./bitnet_b1_58-3B_quantized by running the following command:

```
python quantization.py
```

## Evaluation

```
python eval_ppl.py --hf_path ./ --seqlen 2048 --max_dataset_size 1000
```

```
python eval_task.py --hf_path ./ \
    --batch_size 1 \
    --tasks \
    --output_path result.json \
    --num_fewshot 0 \
    --ctx_size 2048
```
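
## Inference example

A minimal sketch of running the quantized checkpoint for text generation. It assumes the repository can be loaded through `transformers`' `AutoModelForCausalLM` with `trust_remote_code=True` (so that any custom BitNet/QuantLinear modeling code bundled with the checkpoint is used); the local path, dtype, and generation settings are illustrative and may need adjusting to your setup.

```python
# Sketch only: assumes the quantized checkpoint in this repo loads via
# transformers with its bundled custom modeling code (trust_remote_code=True).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./"  # root of this repository (adjust to your local path)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,   # load the repo's custom quantized layers, if any
    torch_dtype=torch.float16,
).cuda()
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```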