---
license: llama3.2
language:
- en
base_model:
- meta-llama/Llama-3.2-1B
pipeline_tag: text-generation
---

# Llama-3.2 1B 4-bit Quantized Model

## Model Overview

- **Base Model**: meta-llama/Llama-3.2-1B
- **Model Name**: rautaditya/llama-3.2-1b-4bit-gptq
- **Quantization**: 4-bit GPTQ (post-training quantization)

## Model Description

This is a 4-bit GPTQ-quantized version of the Llama-3.2 1B model. Quantization reduces the model's size and inference latency while preserving most of its accuracy, which makes it easier to deploy in resource-constrained environments.

### Key Features

- Reduced model size
- Faster inference
- Compatible with Hugging Face Transformers
- GPTQ post-training quantization for a strong compression/accuracy trade-off

## Quantization Details

- **Quantization Method**: GPTQ, a one-shot post-training weight quantization method
- **Bit Depth**: 4-bit
- **Base Model**: Llama-3.2 1B
- **Quantization Library**: AutoGPTQ

A sketch of how a comparable quantization can be reproduced is included at the end of this card.

## Installation Requirements

```bash
pip install transformers accelerate optimum auto-gptq torch
```

Note: `optimum` provides the GPTQ integration used by Transformers.

## Usage

### Transformers Pipeline

```python
from transformers import AutoTokenizer, pipeline

model_id = "rautaditya/llama-3.2-1b-4bit-gptq"
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto",
)

prompt = "What is the meaning of life?"
outputs = pipe(prompt, max_new_tokens=100)
print(outputs[0]["generated_text"])
```

### Direct Model Loading

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "rautaditya/llama-3.2-1b-4bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use from_quantized (not from_pretrained) to load an already-quantized checkpoint
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
)

inputs = tokenizer("What is the meaning of life?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Performance Considerations

- **Memory Efficiency**: 4-bit weights cut the weight memory footprint to roughly a quarter of the FP16 model's
- **Inference Speed**: Lower memory bandwidth requirements typically speed up inference, especially on memory-bound hardware
- **Accuracy Trade-off**: Expect minor quality degradation compared to the full-precision model

A quick memory-footprint check is sketched at the end of this card.

## Limitations

- Output quality may differ slightly from the original model
- Performance varies with the specific use case and inference environment

## Recommended Use Cases

- Low-resource environments
- Edge computing
- Mobile applications
- Embedded systems
- Rapid prototyping

## License

This model is a derivative of Meta Llama 3.2; please refer to the original Meta Llama 3.2 model license for usage restrictions and permissions.

## Citation

If you use this model, please cite:

```
@misc{llama3.2_4bit_quantized,
  title={Llama-3.2 1B 4-bit Quantized Model},
  author={Raut, Aditya},
  year={2024},
  publisher={Hugging Face}
}
```

## Contributions and Feedback

- Suggestions and improvements are welcome
- Please report bugs or performance concerns as issues on the model repository

## Acknowledgments

- Meta AI for the base Llama-3.2 model
- The Hugging Face Transformers team
- AutoGPTQ library contributors
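
## Quantization Procedure (Sketch)

The exact calibration data and GPTQ settings used for this checkpoint are not documented above, so the following is only a minimal sketch of how a comparable 4-bit GPTQ checkpoint could be produced with Transformers' `GPTQConfig` (which drives AutoGPTQ under the hood). The `c4` calibration dataset, `group_size=128`, and the output directory name are illustrative assumptions, not the settings actually used here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Assumed settings: 4-bit weights, group size 128, built-in "c4" calibration data
quant_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

# Passing a GPTQConfig makes from_pretrained run one-shot weight quantization
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Persist the quantized weights and tokenizer
model.save_pretrained("llama-3.2-1b-4bit-gptq")
tokenizer.save_pretrained("llama-3.2-1b-4bit-gptq")
```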
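
### Verifying the memory savings

As a quick sanity check on the memory figures above, the size of the loaded weights can be read off the model directly. This sketch assumes `optimum` and `auto-gptq` are installed so that Transformers can load the GPTQ checkpoint on its own.

```python
from transformers import AutoModelForCausalLM

# The GPTQ checkpoint is detected from its saved quantization config
model = AutoModelForCausalLM.from_pretrained(
    "rautaditya/llama-3.2-1b-4bit-gptq",
    device_map="auto",
)

# get_memory_footprint() reports the bytes used by parameters and buffers
print(f"Quantized weights: {model.get_memory_footprint() / 1e9:.2f} GB")
```

For reference, a 1B-parameter model stored in FP16 occupies roughly 2 GB of weights; the 4-bit checkpoint should come in at roughly a quarter of that, plus a small overhead for the per-group scales.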