---
library_name: transformers
license: other
base_model:
- mistralai/Mistral-Large-Instruct-2407
---
# This model has been xMADified!
This repository contains [`mistralai/Mistral-Large-Instruct-2407`](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407) quantized from 16-bit floats to 4-bit integers using xMAD.ai's proprietary technology.
# Why should I use this model?
1. **Memory-efficiency:** The full-precision model is around 250 GB, while this xMADified model is only 65 GB, making it feasible to run on a single 80 GB GPU or 2x 40 GB GPUs.
2. **Accuracy:** This xMADified model preserves the quality of the full-precision model. The table below compares the zero-shot accuracy of this xMADified model against the [GPTQ](https://github.com/AutoGPTQ/AutoGPTQ)-quantized model on popular benchmarks; the xMADified model matches or exceeds the GPTQ model on every task (a reproduction sketch follows this list).
| Model | MMLU STEM | MMLU Humanities | MMLU Social Sciences | MMLU Other | LAMBADA Standard | LAMBADA OpenAI |
|---|---|---|---|---|---|---|
| GPTQ Mistral-Large-Instruct-2407 | 77.26 | 77.83 | 89.57 | 86.03 | 74.95 | 81.04 |
| xMADai Mistral-Large-Instruct-2407 (this model) | **77.26** | **77.98** | **89.57** | **86.26** | **75.20** | **81.29** |
3. **Fine-tuning:** These models can be fine-tuned on reduced hardware in just three clicks. Watch our product demo [here](https://www.youtube.com/watch?v=S0wX32kT90s&list=TLGGL9fvmJ-d4xsxODEwMjAyNA).
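To reproduce the accuracy numbers above, you can run the same zero-shot tasks with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). The sketch below is illustrative only: `lm_eval` is not part of this repo's prerequisites, and whether the `hf` backend can load this GPTQ checkpoint directly through `transformers` + `optimum` + `auto-gptq` (rather than via `auto_gptq`'s own loader) is an assumption.
```python
# Hypothetical reproduction sketch -- requires `pip install lm_eval` in addition
# to the prerequisites below, and assumes the `hf` backend can load this GPTQ
# checkpoint through transformers + optimum + auto-gptq.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=xmadai/Mistral-Large-Instruct-2407-xMADai-INT4,"
        "trust_remote_code=True,parallelize=True"
    ),
    tasks=[
        "mmlu_stem", "mmlu_humanities", "mmlu_social_sciences", "mmlu_other",
        "lambada_standard", "lambada_openai",
    ],
    num_fewshot=0,  # zero-shot, matching the table above
    batch_size=1,
)
print(results["results"])
```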
# How to Run the Model
Loading the checkpoint of this xMADified model requires 65 GB of VRAM, so it can run efficiently on a single 80 GB GPU or on 2x 40 GB GPUs.
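To sanity-check the download footprint before committing GPU time, you can sum the checkpoint shard sizes straight from the Hub. This is a minimal sketch using `huggingface_hub` (installed alongside `transformers` in the prerequisites below); the shard extensions filtered here are an assumption about how the weights are stored.
```python
from huggingface_hub import HfApi

# Sum the hosted weight-shard sizes to estimate the disk/VRAM footprint
# without downloading anything.
info = HfApi().model_info(
    "xmadai/Mistral-Large-Instruct-2407-xMADai-INT4", files_metadata=True
)
total_bytes = sum(
    (f.size or 0)
    for f in info.siblings
    if f.rfilename.endswith((".safetensors", ".bin"))
)
print(f"Checkpoint weights: {total_bytes / 1e9:.1f} GB")
```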
**Package prerequisites**: Run the following commands to install the required packages.
```bash
pip install torch==2.4.0  # for CUDA 11.8, use instead: pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate optimum
pip install -vvv --no-build-isolation "git+https://github.com/PanQiWei/[email protected]"
```
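Before pulling the 65 GB checkpoint, it can be worth confirming that the AutoGPTQ build succeeded and that your GPUs are visible to PyTorch. This quick check is optional and not part of the original instructions:
```python
import torch
import auto_gptq  # noqa: F401 -- fails fast if the AutoGPTQ build did not install cleanly

# Confirm CUDA is available and report per-GPU memory before loading the model.
assert torch.cuda.is_available(), "CUDA is required to run this 4-bit model"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")
```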
**Sample Inference Code**
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Mistral-Large-Instruct-2407-xMADai-INT4"

# Chat-formatted prompt (system + user turns).
prompt = [
    {"role": "system", "content": "You are a helpful assistant that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

# Load the tokenizer and apply the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit GPTQ checkpoint and dispatch it across the available GPUs.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

# Generate a response and decode it back to text.
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```
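If you are splitting the model across 2x 40 GB GPUs, you can cap how much each device receives so there is headroom left for activations and the KV cache during long generations. This is a sketch, assuming `from_quantized` forwards `max_memory` to Accelerate's dispatcher; the per-device limits below are illustrative, not tuned values.
```python
# Illustrative memory caps for a 2x 40 GB setup -- adjust to your hardware.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    max_memory={0: "38GiB", 1: "38GiB", "cpu": "64GiB"},
    trust_remote_code=True,
)
```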
# Citation
If you found this model useful, please cite our research paper.
```
@article{zhang2024leanquant,
title={LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid},
author={Zhang, Tianyi and Shrivastava, Anshumali},
journal={arXiv preprint arXiv:2407.10032},
year={2024},
url={https://arxiv.org/abs/2407.10032},
}
```
# Contact Us
For additional xMADified models, access to fine-tuning, and general questions, please contact us at [email protected] and join our waiting list.