🚀 SuperNova Medius Compressed Model (W4A16)

Model ID: arcee-ai/SuperNova-Medius-CM-w4a16

📋 Table of Contents

  • 🔍 Overview
  • ✨ Key Features
  • 🚀 Quick Start
  • 📊 Model Details
  • 💻 Usage Guide
  • ⚙️ Quantization Process
  • 🛠️ Technical Details
  • ⚠️ Limitations & Biases
  • 📚 Citations & Acknowledgements
  • 📝 Version History

πŸ” Overview

SuperNova Medius CM W4A16 is a quantized version of the arcee-ai/SuperNova-Medius model, optimized for efficient deployment. Using GPTQ, a one-shot post-training quantization method, we've achieved a significant size reduction while maintaining near-original performance.
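For scale, a back-of-the-envelope estimate of the weight-memory savings is sketched below. The parameter count is an assumption (this card does not state it; the base model is roughly a 14B-parameter model), and real checkpoints carry extra overhead for quantization scales, zero-points, and unquantized layers such as lm_head.

# Rough weight-memory estimate for BF16 vs. W4A16 (illustrative numbers only)
PARAMS = 14e9                          # assumed base-model parameter count
bf16_gib = PARAMS * 2 / 1024**3        # 16-bit weights: 2 bytes per parameter
w4_gib = PARAMS * 0.5 / 1024**3        # 4-bit weights: 0.5 bytes per parameter
print(f"BF16: ~{bf16_gib:.1f} GiB, W4A16: ~{w4_gib:.1f} GiB "
      f"(~{bf16_gib / w4_gib:.0f}x smaller, before quantization overhead)")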

✨ Key Features

  • 4-bit weight quantization (W4)
  • 16-bit activations (A16); activations are left unquantized
  • 4096-token context window
  • Optimized for deployment on consumer hardware

🚀 Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")

# Simple inference
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
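Because vllm appears in the dependency list below, the checkpoint can also be served with vLLM, which supports compressed-tensors W4A16 models. A minimal sketch, untested against this exact repo:

from vllm import LLM, SamplingParams

# Cap the context at the card's 4096-token limit
llm = LLM(model="arcee-ai/SuperNova-Medius-CM-w4a16", max_model_len=4096)
sampling = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["Hello, how are you?"], sampling)
print(outputs[0].outputs[0].text)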

📊 Model Details

Specifications

  • Base Model: arcee-ai/SuperNova-Medius
  • Quantization Method: GPTQ
  • Maximum Sequence Length: 4096
  • Calibration Samples: 1024
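These settings travel with the checkpoint, so they can be verified at load time. A quick check (the exact field layout varies across compressed-tensors versions, so treat this as a sketch):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
# The quantization scheme (bits, ignored modules, etc.) is stored in the model config
print(getattr(config, "quantization_config", "no quantization_config found"))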

Quantization Parameters

  • Weight Bits: 4
  • Activation Bits: 16
  • Ignored Layers: lm_head
  • Dampening Fraction: 0.1
  • Calibration Dataset: neuralmagic/LLM_compression_calibration

💻 Usage Guide

Basic Usage

See Quick Start section above.

Advanced Usage

# Advanced generation with sampling and beam-search parameters
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output = model.generate(
    **inputs,
    max_length=100,
    num_beams=4,
    temperature=0.7,
    no_repeat_ngram_size=2,
    do_sample=True
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Memory Optimization

import torch
from transformers import AutoModelForCausalLM

# Load the model with an automatic device map (spreads layers across available GPUs)
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
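To keep the model inside a fixed VRAM budget, from_pretrained also accepts accelerate's max_memory argument; anything that doesn't fit on the listed devices is offloaded. The budgets below are illustrative, not tuned for this model:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "32GiB"},  # illustrative per-device budgets
    torch_dtype=torch.bfloat16
)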

βš™οΈ Quantization Process

import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Configuration
MODEL_ID = "arcee-ai/SuperNova-Medius"
NUM_SAMPLES = 1024
MAX_LENGTH = 4096
SEED = 42

# Calculate a device map, reserving GPU headroom for GPTQ's Hessian computation
device_map = calculate_offload_device_map(
    MODEL_ID,
    num_gpus=torch.cuda.device_count(),
    reserve_for_hessians=True,
    torch_dtype=torch.bfloat16
)

# Load model and tokenizer
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Prepare calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration")
ds = ds["train"].shuffle(seed=SEED).select(range(NUM_SAMPLES))

def preprocess(example):
    # Render each chat transcript into a single string using the model's chat template
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_LENGTH,
        truncation=True,
        add_special_tokens=False
    )

ds = ds.map(tokenize)

# Configure quantization
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1
)

# Execute quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    oneshot_device=device_map,
    max_seq_length=MAX_LENGTH,
    num_calibration_samples=NUM_SAMPLES,
    accelerator_config={
        'split_batches': True,
        'dispatch_batches': None,
        'even_batches': True,
        'use_seedable_sampler': True,
        'non_blocking': False,
        'gradient_accumulation_kwargs': None,
        'use_configured_state': False
    }
)

# Save quantized model
model.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16", save_compressed=True)
tokenizer.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")
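After the oneshot run, it is worth reloading the compressed checkpoint once to confirm it round-trips through transformers. A minimal sanity check (the prompt is arbitrary):

from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./arcee-ai/SuperNova-Medius-CM-w4a16"
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(path)

inputs = tokenizer("Quantization sanity check:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))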

πŸ› οΈ Technical Details

Dependencies

  • Python: 3.9.x
  • torch: 2.5.1
  • transformers: 4.46.2
  • llmcompressor: 0.5.0
  • vllm: 0.6.4
  • datasets: 3.1.0
  • huggingface_hub: 0.24.7
  • compressed-tensors: 0.8.0

Hardware Requirements

  • Minimum: 8GB VRAM
  • Recommended: 16GB VRAM
  • Optimal: 24GB VRAM or multiple GPUs
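The figures above are dominated by the packed weights plus the KV cache for the active context. A parametric estimate of the cache at the full 4096-token window is sketched below; the architecture numbers are assumptions for illustration, not values taken from this card:

# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_value
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128   # assumed architecture values
SEQ_LEN, BYTES_PER_VALUE = 4096, 2        # full context window, 16-bit cache
kv_gib = 2 * LAYERS * KV_HEADS * HEAD_DIM * SEQ_LEN * BYTES_PER_VALUE / 1024**3
print(f"KV cache at {SEQ_LEN} tokens: ~{kv_gib:.2f} GiB per sequence")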

⚠️ Limitations & Biases

Known Limitations

  • Slight performance degradation compared to the full-precision model
  • Limited to a 4096-token context window
  • May require careful memory management on consumer GPUs

Inherited Biases

  • Inherits the biases of the base model
  • Users should implement appropriate content filtering
  • Regular evaluation is recommended for production deployments

📚 Citations & Acknowledgements

Citation

@misc{SuperNovaMediusCMW4A16,
  author = {Edward Kim and Jaro Uljanovs},
  title = {SuperNova Medius Compressed Model W4A16},
  year = {2024},
  howpublished = {\url{https://huggingface.co/ConfidentialMind/arcee-ai-SuperNova-Medius-CM-w4a16}},
}

πŸ‘ Acknowledgements

  • Original Model: arcee-ai/SuperNova-Medius
  • Quantization Tools: LLM Compressor
  • Contributors: Edward Kim and Jaro Uljanovs

πŸ“ Version History

  • v1.0.0 (2024-03): Initial release
  • v1.0.1 (2024-03): Documentation updates