---
library_name: transformers
tags:
- llama
- facebook
- meta
- llama-3.1
- conversational
- text-generation-inference
---

**Model Architecture**
Llama 3.1 8B is a state-of-the-art language model designed for a wide range of conversational and text-generation tasks. This repository hosts a version compressed with AQLM (Additive Quantization of Language Models), a post-training quantization method from Yandex Research that dramatically reduces the model's size while preserving most of the full-precision model's capabilities.

**License**
- The model is distributed under Meta's Llama 3 license. For details, see: [Llama-3 License](https://llama.meta.com/llama3/license).
- The quantization technique is credited to Yandex Research and is described in their paper [Extreme Compression of Large Language Models via Additive Quantization](https://arxiv.org/abs/2401.06118).

**Quantization Method**
AQLM (Additive Quantization of Language Models) compresses a trained model by representing each group of weights as a sum of vectors drawn from learned codebooks, then fine-tuning the codebooks to recover accuracy. The `2Bit-1x16` in this model's name denotes roughly 2 bits per weight, using a single codebook with 16-bit codes per weight group. The method is detailed in the Yandex Research paper: [Extreme Compression of Large Language Models via Additive Quantization](https://arxiv.org/abs/2401.06118).

The model was compressed on Vast AI using 8x A100 GPUs; the process took approximately 5-6 hours.

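As a rough illustration of the idea (a toy NumPy sketch with made-up shapes, not the actual AQLM kernels), additive quantization stores, for each group of weights, one index per codebook and reconstructs the group as the sum of the indexed codebook vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

group_size = 8           # weights reconstructed per group
num_codebooks = 2        # "additive": each group is a sum of codebook vectors
codebook_entries = 256   # toy size; AQLM "1x16" uses 1 codebook with 2**16 entries
num_groups = 4

# Learned codebooks: each entry decodes to a full group of weights.
codebooks = rng.standard_normal(
    (num_codebooks, codebook_entries, group_size)).astype(np.float32)

# The compressed representation stores one index per (group, codebook) pair.
indices = rng.integers(0, codebook_entries, size=(num_groups, num_codebooks))

# Decoding: each group is the sum of its selected codebook vectors.
groups = np.stack([
    sum(codebooks[c, indices[g, c]] for c in range(num_codebooks))
    for g in range(num_groups)
])
weights = groups.reshape(-1)  # flattened back into a weight vector
```

Storage then scales with the index bits per group rather than with full-precision weights, which is where the ~2-bit-per-weight footprint comes from.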
**Evaluations**
The quantized Llama 3.1 8B model was evaluated on the Massive Multitask Language Understanding (MMLU) benchmark in a 5-shot setting. MMLU is a comprehensive benchmark that tests language models with multiple-choice questions across 57 subjects, including math, history, law, and ethics. For more information, visit the [MMLU GitHub page](https://github.com/hendrycks/test).

The evaluation used the `deepeval` library's MMLU benchmark:
```python
from deepeval.benchmarks import MMLU

benchmark = MMLU(n_shots=5)
```
[Colab Notebook with Model Evaluation](https://colab.research.google.com/drive/16hXI7pd9KSTeUMNfGCB0wGMAzcBqzdZM?usp=sharing)

The results show that the quantized model remains competitive with the original, non-quantized model while requiring far less compute and storage.

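To make the storage savings concrete, a back-of-the-envelope calculation (parameter count approximate; it ignores layers kept at higher precision and codebook/metadata overhead):

```python
params = 8_000_000_000            # ~8B parameters

fp16_gb = params * 16 / 8 / 1e9   # 16 bits per weight in half precision
aqlm_gb = params * 2 / 8 / 1e9    # ~2 bits per weight after AQLM

print(f"fp16: ~{fp16_gb:.0f} GB, AQLM 2-bit: ~{aqlm_gb:.0f} GB")
# fp16: ~16 GB, AQLM 2-bit: ~2 GB
```

An ~8x smaller weight footprint is what lets the model fit on a single consumer GPU.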
**How to use**
Running AQLM models in `transformers` requires the `aqlm` package (for example, `pip install aqlm[gpu,cpu]`). You can then load and run the model with the following code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "azhiboedova/Meta-Llama-3.1-8B-Instruct-AQLM-2Bit-1x16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Example usage
inputs = tokenizer("Hello, how can I assist you today?", return_tensors="pt").to(model.device)
outputs = model.generate(inputs["input_ids"], max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

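Since this is an Instruct model, multi-turn prompts should follow the Llama 3.1 chat format; in practice `tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")` builds it for you. As a sketch of what that template produces (format per Meta's published Llama 3 prompt spec; the helper name is illustrative):

```python
def build_llama3_prompt(system: str, user: str) -> str:
    # Llama 3.1 chat format: header tokens delimit each role's turn,
    # and the trailing assistant header cues the model to respond.
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt("You are a helpful assistant.",
                             "Hello, how can I assist you today?")
print(prompt)
```

Prefer the tokenizer's built-in chat template where possible, as it stays in sync with the model's special tokens.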
**Model Developers**
This quantized model was produced by [Laurentiu Petrea](https://www.linkedin.com/in/laurentiupetrea/), building on Meta's Llama 3.1 architecture and applying AQLM quantization to cut the model's size and inference cost while preserving its performance.
|