---
base_model: ibm-granite/granite-3.1-2b-instruct
tags:
- text-generation
- transformers
- gguf
- english
- granite
- text-generation-inference
- inference-endpoints
- conversational
- 4-bit
- 5-bit
- 8-bit
- ruslanmv
license: apache-2.0
language:
- en
---

# Granite-3.1-2B-Reasoning-GGUF (Quantized for Efficiency)

## Model Overview

This is a **GGUF quantized version** of **ruslanmv/granite-3.1-2b-Reasoning**, fine-tuned from **ibm-granite/granite-3.1-2b-instruct**. The **GGUF format** enables efficient inference on **CPU and GPU**, and this repository provides **4-bit, 5-bit, and 8-bit** quantizations.

- **Developed by:** [ruslanmv](https://huggingface.co/ruslanmv)
- **License:** Apache 2.0
- **Base Model:** [ibm-granite/granite-3.1-2b-instruct](https://huggingface.co/ibm-granite/granite-3.1-2b-instruct)
- **Fine-tuned for:** Logical reasoning, structured problem-solving, long-context tasks
- **Quantized GGUF versions available:**
  - **4-bit:** `Q4_K_M`
  - **5-bit:** `Q5_K_M`
  - **8-bit:** `Q8_0`
- **Supported Languages:** English
- **Architecture:** **Granite**
- **Model Size:** **2.53B params**

---

## Why Use the GGUF Quantized Version?

The **GGUF format** is designed for optimized **CPU and GPU inference**, enabling:

✅ **Lower memory usage** for running on consumer hardware
✅ **Faster inference speeds** without compromising reasoning ability
✅ **Compatibility with popular inference engines** such as llama.cpp, ctransformers, and KoboldCpp

---

## Installation & Usage

To use this model with **llama.cpp** through its Python bindings, install the required dependency:

```bash
pip install llama-cpp-python
```

(A sketch showing how to download the GGUF file directly from the Hugging Face Hub appears at the end of this card.)

### Running the Model

To run the model with **llama-cpp-python**:

```python
from llama_cpp import Llama

model_path = "path/to/ruslanmv/granite-3.1-2b-Reasoning-GGUF.Q4_K_M.gguf"
llm = Llama(model_path=model_path)

input_text = "Can you explain the difference between inductive and deductive reasoning?"
output = llm(input_text, max_tokens=400)

print(output["choices"][0]["text"])
```

Alternatively, using **ctransformers**:

```bash
pip install ctransformers
```

```python
from ctransformers import AutoModelForCausalLM

model_path = "path/to/ruslanmv/granite-3.1-2b-Reasoning-GGUF.Q4_K_M.gguf"
model = AutoModelForCausalLM.from_pretrained(model_path, model_type="llama", gpu_layers=50)

input_text = "What are the key principles of logical reasoning?"
output = model(input_text, max_new_tokens=400)

print(output)
```

---

## Intended Use

Granite-3.1-2B-Reasoning-GGUF is optimized for **efficient inference** while maintaining strong **reasoning capabilities**, making it ideal for:

- **Logical and analytical problem-solving**
- **Text-based reasoning tasks**
- **Mathematical and symbolic reasoning**
- **Advanced instruction-following**

This model is particularly useful for **CPU-based deployments** and users who need **low-memory, high-performance** text generation.

---

## License & Acknowledgments

This model is released under the **Apache 2.0** license. It is fine-tuned from IBM’s **Granite 3.1-2B-Instruct** model and **quantized to GGUF** for efficiency. Special thanks to the **IBM Granite Team** for developing the base model. For more details, visit the [IBM Granite Documentation](https://huggingface.co/ibm-granite).

---

### Citation

If you use this model in your research or applications, please cite:

```
@misc{ruslanmv2025granite,
  title={Fine-Tuning and GGUF Quantization of Granite-3.1 for Advanced Reasoning},
  author={Ruslan M.V.},
  year={2025},
  url={https://huggingface.co/ruslanmv/granite-3.1-2b-Reasoning-GGUF}
}
```
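
---

## Downloading the GGUF File from the Hub

The usage examples above assume the GGUF file is already on disk (`path/to/...`). As a convenience, the sketch below shows one way to fetch the 4-bit file directly from the Hugging Face Hub with `huggingface_hub` and load it with `llama-cpp-python`. The exact filename inside the repository is an assumption here; check the repository's file listing and adjust it if it differs.

```bash
pip install huggingface_hub llama-cpp-python
```

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch the quantized model file from the Hub (cached locally after the first run).
# NOTE: the filename below is assumed; verify it against the repository's file list.
model_path = hf_hub_download(
    repo_id="ruslanmv/granite-3.1-2b-Reasoning-GGUF",
    filename="granite-3.1-2b-Reasoning.Q4_K_M.gguf",  # assumed filename
)

# Load the model; n_ctx (context window) and n_gpu_layers (GPU offload) are tunable.
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=0)

output = llm(
    "Explain the difference between inductive and deductive reasoning.",
    max_tokens=400,
)
print(output["choices"][0]["text"])
```

The same downloaded path can also be passed to the `ctransformers` example above in place of the local placeholder.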