Granite-3.1-8B-Reasoning-exl2

Original model: granite-3.1-8b-Reasoning by ruslanmv
Based on: granite-3.1-8b-instruct by Granite Team, IBM

Quants

4bpw h6 (main)
4.5bpw h6
5bpw h6
6bpw h6
8bpw h8

Quantization notes

Made with Exllamav2 0.2.8 using the default dataset. These quants require Exllamav2 0.2.7 or newer.
They are meant to be used with apps that support exl2 models, such as TabbyAPI, Text-Generation-WebUI and others.
On Windows they require an Nvidia RTX 2xxx or newer GPU; on Linux they can be used with Nvidia RTX or AMD ROCm cards.
The model must be fully loaded into GPU memory; native RAM offloading isn't supported.
If you need RAM offloading or have a different GPU, try GGUF quants instead.
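
Since the whole model has to fit in GPU memory, a rough size estimate can help when picking a quant. The sketch below assumes that weight size is approximately parameters × bits per weight / 8, using the 8.17B parameter count from the original model card. The helper name `approx_weight_gb` is made up for this illustration. The estimate ignores the cache, activations, and any layers stored at a different precision, so actual memory use will be higher.

```python
# Rough weight-size estimate for each quant listed above.
# Assumption: size ~= parameters * bits-per-weight / 8 (weights only, no cache).
PARAMS = 8.17e9  # parameter count taken from the original model card


def approx_weight_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for bpw in (4, 4.5, 5, 6, 8):
    print(f"{bpw}bpw: roughly {approx_weight_gb(bpw):.1f} GB of weights")
```

Leave headroom on top of these numbers for the context cache, which grows with the sequence length you configure in your loader.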

Original model card

Granite-3.1-8B-Reasoning (Fine-Tuned for Advanced Reasoning)

Model Overview

This model is a fine-tuned version of ibm-granite/granite-3.1-8b-instruct, optimized for logical reasoning and analytical tasks. Fine-tuning has been performed to enhance structured problem-solving, long-context comprehension, and instruction-following capabilities.

Developed by: ruslanmv
License: Apache 2.0
Base Model: ibm-granite/granite-3.1-8b-instruct
Fine-tuned for: Logical reasoning, structured problem-solving, and long-context tasks
Training Framework: Unsloth & Hugging Face TRL (2x faster training)
Supported Languages: English
Model Size: 8.17B params
Tensor Type: BF16

Why Use This Model?

This fine-tuned model improves upon the base Granite-3.1-8B model by enhancing its reasoning capabilities while retaining its general text-generation abilities.

✅ Optimized for complex reasoning tasks
✅ Enhanced long-context understanding
✅ Improved instruction-following abilities
✅ Fine-tuned for structured analytical thinking

Installation & Usage

Install the required dependencies:

pip install torch torchvision torchaudio
pip install accelerate
pip install transformers
pip install bitsandbytes  # required for 4-bit loading

Running the Model

Use the following Python snippet to load and generate text with Granite-3.1-8B-Reasoning:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GenerationConfig
import torch

# Model and tokenizer
model_name = "ruslanmv/granite-3.1-8b-Reasoning" # Or "ruslanmv/granite-3.1-2b-Reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto', # or 'cuda' if you have only one GPU
    torch_dtype=torch.float16, # dtype for the layers that are not quantized
    # 4-bit quantization for lower memory usage - requires bitsandbytes
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
)

# Build the chat prompt
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

inputs = tokenizer(text, return_tensors="pt").to(model.device) # Move input tensors to the model's device

# Sampling parameters
generation_config = GenerationConfig(
    do_sample = True, # Without this, temperature and top_p are ignored (greedy decoding)
    temperature = 0.8,
    top_p = 0.95,
    max_new_tokens = 1024, # Upper limit on the number of generated tokens
)

# Inference
with torch.inference_mode(): # Use inference mode for faster generation
    outputs = model.generate(**inputs, generation_config=generation_config)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
output = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

print(output)

You will get something like:

<reasoning>
Pi is an irrational number, which means it cannot be exactly calculated as it has an infinite number of decimal places. However, we can approximate pi using various mathematical formulas. One of the simplest methods is the Leibniz formula for pi, which is an infinite series:

pi = 4 * (1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 +...)

This series converges to pi as more terms are added.
</reasoning>

<answer>
The exact value of pi cannot be calculated due to its infinite decimal places. However, using the Leibniz formula, we can approximate pi to a certain number of decimal places. For example, after calculating the first 500 terms of the series, we get an approximation of pi as 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679.
</answer>
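
Treat the sample above as an illustration of the response format, not as a verified result. The Leibniz series converges slowly: the error after N terms is roughly 1/N, so 500 terms give only two or three correct digits, not the long digit string quoted in the sample answer. A minimal check, with `leibniz_pi` as an illustrative helper name:

```python
import math


def leibniz_pi(terms: int) -> float:
    """Approximate pi with the first `terms` terms of the Leibniz series."""
    return 4 * sum((-1) ** k / (2 * k + 1) for k in range(terms))


approx = leibniz_pi(500)
error = math.pi - approx
print(f"500 terms: {approx:.6f} (off by about {error:.6f})")
```

Checking numeric claims like this is a reasonable habit whenever the model's answer includes computed values.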

Intended Use

Granite-3.1-8B-Reasoning is designed for tasks requiring structured and logical reasoning, including:

Logical and analytical problem-solving
Text-based reasoning tasks
Mathematical and symbolic reasoning
Advanced instruction-following
Conversational AI with a focus on structured responses

This model is particularly useful for enterprise AI applications, research, and large-scale NLP tasks.
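
For applications that consume the structured responses, the `<reasoning>` and `<answer>` blocks requested by the system prompt in the usage example can be split apart programmatically. The following is a minimal sketch; `split_response` is a hypothetical helper, and it assumes the model actually emits both tags, which is not guaranteed.

```python
import re


def split_response(text: str) -> dict:
    """Extract the <reasoning> and <answer> blocks; a missing block maps to None."""
    parts = {}
    for tag in ("reasoning", "answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        parts[tag] = match.group(1).strip() if match else None
    return parts


# Hand-written sample in the expected format (not real model output)
sample = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>"
result = split_response(sample)
print(result["answer"])
```

Returning `None` for a missing block lets the caller decide whether to retry or fall back to the raw text.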

License & Acknowledgments

This model is released under the Apache 2.0 license. It is fine-tuned from IBM’s Granite 3.1-8B-Instruct model. Special thanks to the IBM Granite Team for developing the base model.

For more details, visit the IBM Granite Documentation.

Citation

If you use this model in your research or applications, please cite:

@misc{ruslanmv2025granite,
  title={Fine-Tuning Granite-3.1-8B for Advanced Reasoning},
  author={Ruslan M.V.},
  year={2025},
  url={https://huggingface.co/ruslanmv/granite-3.1-8b-Reasoning}
}
