Llama-3.2 1B 4-bit Quantized Model

Model Overview

  • Base Model: meta-llama/Llama-3.2-1B
  • Model Name: rautaditya/llama-3.2-1b-4bit-gptq
  • Quantization: 4-bit GPTQ (post-training, weight-only quantization)

Model Description

This is a 4-bit quantized version of the Llama-3.2 1B model, designed to reduce model size and inference latency while maintaining reasonable output quality. The quantization makes the model easier to deploy in resource-constrained environments.

Key Features

  • Reduced model size
  • Faster inference times
  • Compatible with Hugging Face Transformers
  • GPTQ quantization for optimal compression

Quantization Details

  • Quantization Method: GPTQ
  • Bit Depth: 4-bit
  • Base Model: Llama-3.2 1B
  • Quantization Library: AutoGPTQ (see the sketch below)
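
For reference, a checkpoint with the settings listed above can be produced with AutoGPTQ roughly as follows. This is a minimal sketch: the calibration texts, group size, and output directory are assumptions, and the exact recipe used for this particular checkpoint is not documented.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model_id = "meta-llama/Llama-3.2-1B"  # gated repo; requires an accepted license

# Assumed settings: 4-bit weights, group size 128 (not confirmed for this checkpoint)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoGPTQForCausalLM.from_pretrained(base_model_id, quantize_config)

# Tiny illustrative calibration set; a real run uses a few hundred representative samples
calibration_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Quantization stores model weights in fewer bits to save memory.",
]
examples = [tokenizer(text, return_tensors="pt") for text in calibration_texts]

model.quantize(examples)                        # run GPTQ calibration
model.save_quantized("llama-3.2-1b-4bit-gptq")  # write the quantized weights
tokenizer.save_pretrained("llama-3.2-1b-4bit-gptq")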

Installation Requirements

pip install transformers accelerate optimum auto-gptq torch

Usage

Transformers Pipeline

from transformers import AutoTokenizer, pipeline

model_id = "rautaditya/llama-3.2-1b-4bit-gptq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto",
)

prompt = "What is the meaning of life?"
outputs = pipe(prompt, max_new_tokens=100)
print(outputs[0]["generated_text"])
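
The pipeline returns a list with one dictionary per generated sequence, with the text under the "generated_text" key. Standard generation arguments such as do_sample, temperature, and top_p can be passed to pipe() in the same way as max_new_tokens.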

Direct Model Loading

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "rautaditya/llama-3.2-1b-4bit-gptq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Already-quantized checkpoints are loaded with from_quantized rather than from_pretrained
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto"
)
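
Once loaded, the model generates text like any other causal language model. The snippet below is a minimal sketch; the prompt and generation settings are illustrative rather than recommended values. With recent versions of transformers, the checkpoint can alternatively be loaded directly via AutoModelForCausalLM.from_pretrained when optimum and auto-gptq are installed.

# Encode a prompt, generate, and decode the result
inputs = tokenizer("What is the meaning of life?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))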

Performance Considerations

  • Memory Efficiency: Significantly reduced memory footprint compared to the full-precision model (see the sketch after this list)
  • Inference Speed: Faster inference due to reduced computational requirements
  • Potential Accuracy Trade-off: Minor performance degradation compared to the full-precision model
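
To verify the memory savings on a given machine, the quantized checkpoint can be compared against the full-precision base model. This is a minimal sketch that assumes optimum and auto-gptq are installed (for GPTQ loading through transformers) and that access to the gated meta-llama/Llama-3.2-1B repository has been granted; exact numbers depend on hardware and library versions.

import torch
from transformers import AutoModelForCausalLM

# GPTQ checkpoint loaded through transformers (requires optimum + auto-gptq)
quantized = AutoModelForCausalLM.from_pretrained(
    "rautaditya/llama-3.2-1b-4bit-gptq",
    device_map="auto",
)
print(f"4-bit GPTQ footprint: {quantized.get_memory_footprint() / 1e9:.2f} GB")

# Full-precision (fp16) base model for comparison
full = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
    device_map="auto",
)
print(f"fp16 footprint: {full.get_memory_footprint() / 1e9:.2f} GB")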

Limitations

  • May show slight differences in output quality compared to the original model
  • Performance can vary based on specific use case and inference environment

Recommended Use Cases

  • Low-resource environments
  • Edge computing
  • Mobile applications
  • Embedded systems
  • Rapid prototyping

License

This model inherits the license of the base model; please refer to the Llama 3.2 Community License for usage restrictions and permissions.

Citation

If you use this model, please cite:

@misc{llama3.2_4bit_quantized,
  title={Llama-3.2 1B 4-bit Quantized Model},
  author={Raut, Aditya},
  year={2024},
  publisher={Hugging Face}
}

Contributions and Feedback

  • Open to suggestions and improvements
  • Please file issues on the GitHub repository for any bugs or performance concerns

Acknowledgments

  • Meta AI for the base Llama-3.2 model
  • Hugging Face Transformers team
  • AutoGPTQ library contributors