Fine-tune Llama 3.2 3B Using Unsloth and BAAI/Infinity-Instruct Dataset

This model was fine-tuned on the "0625" version of the dataset; a model fine-tuned on the "7M" version is planned as well.

Uploaded Model

  • Developed by: MateoRov
  • License: apache-2.0
  • Fine-tuned from model: unsloth/llama-3.2-3b-instruct-bnb-4bit

Usage

See the full repository on GitHub for a better understanding: https://github.com/Mateorovere/FineTuning-LLM-Llama3.2-3b

With the proper dependencies installed, you can run the model with the following code:

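from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the fine-tuned model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,  # assumed value; adjust to your use case
    load_in_4bit=True,    # assumed; set to False to load in 16-bit
)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)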

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Generate the output
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=64,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

# Decode the outputs
result = tokenizer.batch_decode(outputs)
print(result)
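
To make the role of add_generation_prompt=True more concrete, here is a minimal pure-Python sketch of the Llama 3 style prompt layout. It is an approximation, not the template Unsloth actually applies: build_llama3_prompt is a hypothetical helper, and the real "llama-3.1" template may also insert a system header (for example with date information) that is omitted here.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 chat layout for a list of role/content dicts.

    Hypothetical helper for illustration only; the real template may add
    a system header that this sketch leaves out.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an empty assistant turn so the model writes a reply
        # instead of continuing the user's message.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
prompt = build_llama3_prompt(messages)
print(prompt)
```

Without the trailing assistant header, the prompt ends right after the user's turn, so the model has no cue that it should start answering.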

To stream the generation token by token:

from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel
from transformers import TextStreamer

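# Load the fine-tuned model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,  # assumed value; adjust to your use case
    load_in_4bit=True,    # assumed; set to False to load in 16-bit
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)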

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Initialize the text streamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate the output token by token
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
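
The temperature=1.5 and min_p=0.1 arguments used above control how the next token is sampled. The following is a rough pure-Python sketch of the idea, not the library's implementation: min_p_filter is a hypothetical helper, the logits are made-up toy values, and it assumes temperature scaling is applied before the min_p cutoff.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min_p filtering."""
    # Temperature > 1 flattens the distribution, making rare tokens more likely.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # min_p keeps only tokens whose probability is at least
    # min_p times the probability of the most likely token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


toy_logits = [4.0, 3.0, 1.0, -2.0]  # hypothetical scores for four candidate tokens
filtered = min_p_filter(toy_logits)
print(filtered)
```

In this toy case the least likely token falls below the cutoff and is removed, while the remaining three stay available for sampling. This is why a fairly high temperature can be paired with min_p: the cutoff prunes the implausible tail that the temperature would otherwise boost.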

Model size: 3.21B params (Safetensors, tensor type FP16)