Fine-Tuning 1B LLaMA 3.2: A Comprehensive Step-by-Step Guide with Code
Building a Mental Health Chatbot by Fine-Tuning Llama 3.2
Let's find some mental peace by fine-tuning Llama 3.2.
We need to install Unsloth, which gives roughly 2x faster training with lower memory use.
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
We are going to use Unsloth because it significantly improves the efficiency of fine-tuning large language models (LLMs), especially LLaMA and Mistral. With Unsloth, we can load models with 4-bit quantization or in 16-bit precision, which reduces memory use and speeds up both training and inference. This means we can fine-tune and deploy capable models even on hardware with limited resources, with little loss in quality.
Additionally, Unsloth's broad compatibility and customization options let you tailor the quantization process to the specific needs of a product. This flexibility, combined with its ability to cut VRAM usage by up to 60%, makes Unsloth an essential tool in the AI toolkit. It's not just about optimizing models; it's about making cutting-edge AI more accessible and efficient for real-world applications.
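To get a feel for why lower precision matters, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. The parameter count of about 1.2 billion is an assumption for a 1B-class model, and the function name is only illustrative. Real usage is higher because of activations, optimizer state, and quantization metadata.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: roughly 1.2 billion parameters for a 1B-class model.
NUM_PARAMS = 1.2e9

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under this assumption, going from 16-bit to 4-bit weights cuts the weight memory to a quarter, which is what makes smaller GPUs practical.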
For fine-tuning, I used the following setup:
- Torch 2.1.1 with CUDA 12.1 for efficient computation.
- Unsloth to achieve 2x faster training speeds for the large language model (LLM).
- An H100 NVL GPU to handle the intensive processing requirements, although a less powerful GPU, such as the ones Kaggle provides, also works.
Why LLaMA 3.2?
It is open source and accessible, and it offers the flexibility to customize and fine-tune it for specific needs. Because Meta publishes the model weights openly, it is easy to fine-tune on almost any problem. We are going to fine-tune it on a mental health dataset from Hugging Face.
Python Libraries
Data Handling and Visualization
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')
LLM model training
import torch
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel
from datasets import Dataset
from unsloth import is_bfloat16_supported
# Saving model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Warnings
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Loading the dataset
data = pd.read_json("hf://datasets/Amod/mental_health_counseling_conversations/combined_dataset.json", lines=True)
Exploratory data analysis
Let's check the length of each context, measured in characters.
data['Context_length'] = data['Context'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(data['Context_length'], bins=50, kde=True)
plt.title('Distribution of Context Lengths')
plt.xlabel('Length of Context')
plt.ylabel('Frequency')
plt.show()
filtered_data = data[data['Context_length'] <= 1500]
ln_Context = filtered_data['Context'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(ln_Context, bins=50, kde=True)
plt.title('Distribution of Context Lengths')
plt.xlabel('Length of Context')
plt.ylabel('Frequency')
plt.show()
Now let's check the length of each response, again in characters.
ln_Response = filtered_data['Response'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(ln_Response, bins=50, kde=True, color='teal')
plt.title('Distribution of Response Lengths')
plt.xlabel('Length of Response')
plt.ylabel('Frequency')
plt.show()
filtered_data = filtered_data[ln_Response <= 4000]
ln_Response = filtered_data['Response'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(ln_Response, bins=50, kde=True, color='teal')
plt.title('Distribution of Response Lengths')
plt.xlabel('Length of Response')
plt.ylabel('Frequency')
plt.show()
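Note that apply(len) on a string column counts characters, not words or tokens. The small sketch below contrasts the two measures on a made-up sentence; the sample text and helper names are illustrative and not taken from the dataset.

```python
def char_length(text: str) -> int:
    """Number of characters, which is what data['Context'].apply(len) measures."""
    return len(text)

def word_length(text: str) -> int:
    """Number of whitespace-separated words."""
    return len(text.split())

sample = "I barely sleep and I keep worrying about everything."
print("characters:", char_length(sample))
print("words:", word_length(sample))
```

With pandas, the equivalent word count would be data['Context'].str.split().str.len(). The 1500 and 4000 cutoffs used above are character limits.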
Model training 🧪
Let's dive into the Llama 3.2 model and train it on our data.
Loading the model
Key points, which you can adapt to your own requirements:
- Max Sequence Length: We set max_seq_length to 5020, the maximum number of tokens the model handles in a single input sequence. This matters for tasks that involve long texts, because it lets the model see more context in each pass. Adjust it to your data.
- Loading Llama 3.2 Model: The model and tokenizer are loaded using FastLanguageModel.from_pretrained with a specific pre-trained model, "unsloth/Llama-3.2-1B-bnb-4bit". This checkpoint is optimized for 4-bit precision, which reduces memory usage and increases training speed without significantly compromising performance. The load_in_4bit=True parameter enables this 4-bit quantization, making the model more suitable for fine-tuning on less powerful hardware.
- Applying PEFT (Parameter-Efficient Fine-Tuning): We then configure the model using get_peft_model, which applies LoRA (Low-Rank Adaptation). Instead of updating the entire network, LoRA trains small low-rank adapter matrices attached to selected layers, drastically reducing the computational resources needed. Parameters such as r=16 and lora_alpha=16 set the rank and the scaling of these adapters. target_modules specifies which layers are adapted: here the attention projections (q_proj, k_proj, v_proj, o_proj) and the MLP projections (gate_proj, up_proj, down_proj). use_rslora=True activates Rank-Stabilized LoRA, which improves the stability of the fine-tuning process. use_gradient_checkpointing="unsloth" reduces memory usage during training by storing only some activations and recomputing the rest in the backward pass.
- Verifying Trainable Parameters: Finally, we call model.print_trainable_parameters() to print the number of parameters that will be updated during fine-tuning, so we can verify that only the intended parts of the model are being trained.
This combination of techniques makes the fine-tuning process not only more efficient but also more accessible, allowing you to fine-tune and deploy this model even with limited computational resources.
A maximum sequence length of 5020 tokens is more than enough for this LoRA training run, since the filtered contexts and responses are at most 1,500 and 4,000 characters long, but you can set it according to your own data and requirements.
max_seq_length = 5020
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-1B-bnb-4bit",
max_seq_length=max_seq_length,
load_in_4bit=True,
dtype=None,
)
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
use_rslora=True,
use_gradient_checkpointing="unsloth",
random_state = 32,
loftq_config = None,
)
# print_trainable_parameters() prints the summary itself; wrapping it in print() would also print None
model.print_trainable_parameters()
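As a rough cross-check of the trainable parameter count, the arithmetic behind LoRA can be sketched by hand. The layer dimensions below are assumptions taken from the public Llama 3.2 1B configuration, not from this guide, and the helper name is illustrative. Each adapted weight matrix gets two small matrices, so it adds r * (in_features + out_features) trainable parameters.

```python
import math

# Assumed Llama 3.2 1B dimensions (public model config, not stated in this guide).
HIDDEN = 2048         # hidden size
INTERMEDIATE = 8192   # MLP intermediate size
KV_DIM = 512          # 8 key/value heads x head dimension 64
NUM_LAYERS = 16
R, LORA_ALPHA = 16, 16  # values used in get_peft_model above

# (input features, output features) of each adapted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted matrix."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"trainable LoRA parameters: {total:,}")

# Scaling applied to the adapter output: alpha / r for plain LoRA,
# alpha / sqrt(r) for rank-stabilized LoRA.
print("plain LoRA scale:", LORA_ALPHA / R)
print("rsLoRA scale:", LORA_ALPHA / math.sqrt(R))
```

Under these assumptions the total is about 11.3 million parameters, or roughly 45 MB in 32-bit floats, which is in line with the 45.1M adapter_model.safetensors file in the upload log near the end of this guide.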
Prepare data for the model
Main points to remember:
- Data Prompt Structure: The data_prompt is a string template that guides the model in analyzing the provided text. It has placeholders for the input text (the context) and the model's response. The template explicitly asks the model to identify mental health indicators, which makes it easier to fine-tune the model for mental health-related tasks.
- End-of-Sequence Token: The EOS_TOKEN is retrieved from the tokenizer and marks the end of each text sequence. The model needs this token to learn where a response ends, which keeps the data well structured during training and inference.
- Formatting Function: formatting_prompt takes a batch of examples and formats them according to data_prompt. It iterates over the input and output pairs, inserts them into the template, and appends the EOS token at the end.
- Function Output: The function returns a dictionary whose key is "text" and whose value is a list of formatted strings. Each string is a fully prepared training example that combines the prompt template, the context, and the response.
data_prompt = """Analyze the provided text from a mental health perspective. Identify any indicators of emotional distress, coping mechanisms, or psychological well-being. Highlight any potential concerns or positive aspects related to mental health, and provide a brief explanation for each observation.
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token
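def formatting_prompt(examples):
    # examples is a batch: each column name maps to a list of values
    inputs = examples["Context"]
    outputs = examples["Response"]
    texts = []
    for input_, output in zip(inputs, outputs):
        # Fill the template and append EOS so the model learns where a response ends
        text = data_prompt.format(input_, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}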
Format the data for training
training_data = Dataset.from_pandas(filtered_data)
training_data = training_data.map(formatting_prompt, batched=True)
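To see what a single training example looks like after formatting, the same logic can be tried on a toy batch without loading a tokenizer. This is a minimal sketch: the template is a shortened stand-in for data_prompt, the EOS string is assumed to be "<|end_of_text|>", and the example texts and the format_batch name are made up.

```python
# Shortened stand-in for the guide's data_prompt, so the output stays readable.
prompt_template = """Analyze the provided text from a mental health perspective.
### Input:
{}
### Response:
{}"""

# Assumption: the tokenizer's EOS token is "<|end_of_text|>".
EOS = "<|end_of_text|>"

def format_batch(examples: dict) -> dict:
    """Same logic as formatting_prompt, applied to a dict of column lists."""
    texts = [
        prompt_template.format(context, response) + EOS
        for context, response in zip(examples["Context"], examples["Response"])
    ]
    return {"text": texts}

# Made-up batch in the shape that Dataset.map(batched=True) passes to the function.
toy_batch = {
    "Context": ["I feel anxious before every exam."],
    "Response": ["That sounds stressful. Let's look at what triggers it."],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Printing one example like this before training is a cheap way to confirm that the context, the response, and the EOS token end up where you expect.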
Model training with custom parameters and data
Run sudo apt-get update to refresh the list of available packages and sudo apt-get install build-essential to install essential build tools. Run these in a shell only if you get an error.
#sudo apt-get update
#sudo apt-get install build-essential
Training setup to start fine-tuning!
- Trainer Initialization: We initialize SFTTrainer with the model, the tokenizer, and the training dataset. The dataset_text_field parameter names the dataset field that contains the training text, which we prepared above. The trainer manages the fine-tuning process, including data handling and model updates.
- Training Arguments: The TrainingArguments class defines the key hyperparameters for training. These include:
  - learning_rate=3e-4: Sets the learning rate for the optimizer.
  - per_device_train_batch_size=16: Defines the batch size per device. Together with gradient_accumulation_steps=8, gradients from several batches are accumulated before each weight update.
  - num_train_epochs=40: Specifies the number of training epochs.
  - fp16=not is_bfloat16_supported() and bf16=is_bfloat16_supported(): Enable mixed precision training to reduce memory usage, depending on hardware support.
  - optim="adamw_8bit": Uses the 8-bit AdamW optimizer for efficient memory usage.
  - weight_decay=0.01: Applies weight decay to help prevent overfitting.
  - output_dir="output": Specifies the directory where the trained model and logs will be saved.
- Training Process: Finally, we call the trainer.train() method to start training. It uses the parameters defined above to fine-tune the model, adjusting the weights as it learns from the provided dataset. The trainer also handles data packing and gradient accumulation.
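The interaction between batch size and gradient accumulation can be sketched with a little arithmetic. The batch size of 16 and the 8 accumulation steps come from the SFTTrainer call in this guide. The number of packed sequences is hypothetical, the helper name is illustrative, and the exact step count depends on how the trainer treats a final partial batch.

```python
import math

def steps_per_epoch(num_sequences: int, per_device_batch: int,
                    grad_accum_steps: int, num_devices: int = 1) -> int:
    """Approximate optimizer updates per epoch for a given number of training sequences."""
    effective_batch = per_device_batch * grad_accum_steps * num_devices
    return math.ceil(num_sequences / effective_batch)

# Values from the trainer configuration in this guide.
PER_DEVICE_BATCH, GRAD_ACCUM = 16, 8
print("effective batch size:", PER_DEVICE_BATCH * GRAD_ACCUM)

# Hypothetical number of packed sequences; the real value depends on the data.
print("updates per epoch:", steps_per_epoch(1000, PER_DEVICE_BATCH, GRAD_ACCUM))
```

If memory is tight, halving the per-device batch size and doubling the accumulation steps keeps the effective batch size the same.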
Sometimes PyTorch reserves GPU memory and does not release it. Setting this environment variable can help avoid memory fragmentation. Set it in your shell or at the top of your script, before the model is loaded:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
If some variables on the GPU are no longer needed, delete them with del and then call torch.cuda.empty_cache().
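In a notebook, a shell export typically does not reach the running Python process, so one option is to set the variable from Python instead. The sketch below does that and wraps the cleanup step in a small helper. free_gpu_memory is a hypothetical name, and torch is imported lazily so the snippet also runs where PyTorch is not installed.

```python
import gc
import os

# Set this before the first CUDA allocation (most simply, before importing torch);
# changing it after CUDA has been initialized is not expected to have an effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

def free_gpu_memory() -> bool:
    """Collect garbage, then release cached CUDA blocks. Returns True if CUDA was cleared."""
    gc.collect()
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False

# Typical use: drop references first (for example `del trainer`), then call the helper.
print("CUDA cache cleared:", free_gpu_memory())
```

empty_cache() only returns memory that no live tensor is using, which is why the del step has to come first.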
trainer=SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=training_data,
dataset_text_field="text",
max_seq_length=max_seq_length,
dataset_num_proc=2,
packing=True,
args=TrainingArguments(
learning_rate=3e-4,
lr_scheduler_type="linear",
per_device_train_batch_size=16,
gradient_accumulation_steps=8,
num_train_epochs=40,
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
optim="adamw_8bit",
weight_decay=0.01,
warmup_steps=10,
output_dir="output",
seed=0,
),
)
trainer.train()
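The combination of lr_scheduler_type="linear" and warmup_steps=10 means the learning rate ramps up linearly for the first 10 updates and then decays linearly to zero. Here is a minimal sketch of that shape, not the library's own implementation. The total number of steps is hypothetical and the function name is illustrative.

```python
def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

BASE_LR, WARMUP = 3e-4, 10   # from the TrainingArguments above
TOTAL_STEPS = 200            # hypothetical; the trainer derives the real value from the data

for step in (0, 5, 10, 105, 200):
    print(step, f"{linear_schedule_lr(step, BASE_LR, WARMUP, TOTAL_STEPS):.2e}")
```

With logging_steps=1, the learning rate reported in the training log should follow this ramp-then-decay pattern.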
Inference
text="I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here. I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it. How can I change my feeling of being worthless to everyone?"
Here are some key points to note:
FastLanguageModel.for_inference(model) configures the model specifically for inference, optimizing it for generating responses.
The input text is tokenized with the tokenizer, which converts the text into a format the model can process. We use data_prompt to format the input text and leave the response placeholder empty so that the model fills it in. The return_tensors = "pt" parameter requests PyTorch tensors, which are then moved to the GPU with .to("cuda") for faster processing.
The model.generate method generates a response from the tokenized inputs. max_new_tokens = 5020 sets an upper limit on the length of the response, and use_cache = True speeds up generation by reusing the attention keys and values already computed for earlier tokens instead of recomputing them at every step.
model = FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
data_prompt.format(
#instructions
text,
#answer
"",
)
], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 5020, use_cache = True)
answer=tokenizer.batch_decode(outputs)
answer = answer[0].split("### Response:")[-1]
print("Answer of the question is:", answer)
Answer of the question is:
I'm sorry to hear that you are feeling so overwhelmed. It sounds like you are trying to figure out what is going on with you. I would suggest that you see a therapist who specializes in working with people who are struggling with depression. Depression is a common issue that people struggle with. It is important to address the issue of depression in order to improve your quality of life. Depression can lead to other issues such as anxiety, hopelessness, and loss of pleasure in activities. Depression can also lead to thoughts of suicide. If you are thinking of suicide, please call 911 or go to the nearest hospital emergency department. If you are not thinking of suicide, but you are feeling overwhelmed, please call 800-273-8255. This number is free and confidential and you can talk to someone about anything. You can also go to www.suicidepreventionlifeline.org to find a local suicide prevention hotline.<|end_of_text|>
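The decoded answer above still ends with the <|end_of_text|> marker. A small post-processing helper can strip it. This is a sketch with an illustrative function name, run here on a made-up decoded string in the same layout as the prompt template.

```python
def extract_answer(decoded: str, marker: str = "### Response:",
                   eos_token: str = "<|end_of_text|>") -> str:
    """Return the text after the last response marker, without the EOS token."""
    answer = decoded.split(marker)[-1]
    return answer.replace(eos_token, "").strip()

# Made-up decoded string in the same layout as the prompt template.
decoded = "### Input:\nI feel low.\n### Response:\nThank you for sharing that.<|end_of_text|>"
print(extract_answer(decoded))
```

Alternatively, passing skip_special_tokens=True to tokenizer.batch_decode drops special tokens at decode time.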
os.environ["HF_TOKEN"] = "hugging face token key, you can create from your HF account."
model.push_to_hub("ImranzamanML/1B_finetuned_llama3.2", use_auth_token=os.getenv("HF_TOKEN"))
tokenizer.push_to_hub("ImranzamanML/1B_finetuned_llama3.2", use_auth_token=os.getenv("HF_TOKEN"))
README.md: 0%| | 0.00/583 [00:00<?, ?B/s]
adapter_model.safetensors: 0%| | 0.00/45.1M [00:00<?, ?B/s]
Saved model to https://huggingface.co/ImranzamanML/1B_finetuned_llama3.2
model.save_pretrained("model/1B_finetuned_llama3.2")
tokenizer.save_pretrained("model/1B_finetuned_llama3.2")
('model/1B_finetuned_llama3.2/tokenizer_config.json', 'model/1B_finetuned_llama3.2/special_tokens_map.json', 'model/1B_finetuned_llama3.2/tokenizer.json')
model, tokenizer = FastLanguageModel.from_pretrained( model_name = "model/1B_finetuned_llama3.2", max_seq_length = 5020, dtype = None, load_in_4bit = True)
No way, still searching for something? No worries! You can use the prompt format and code above to get a response for mental peace ✨