How to load and quantize a fine-tuned model in Google Colab or Kaggle?
Hey Abhishek, firstly thank you so much for your tutorial.
Noob Alert!!
I have fine-tuned the LLaMA and Mistral sharded models on Google Colab and saved them to the Hugging Face Hub. Now I am totally clueless about how to run my fine-tuned model in Google Colab, and also how to convert it to GGML/GGUF format and quantize it to 4 bits.
I don't expect a full tutorial or answer, but please do give me some resource or reference :)
First run this:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_token = "your_hf_token"
base_model_name = "abhishek/llama-2-7b-hf-small-shards"  # path to your base model or its name on the Hub
adapter_model_name = "your repo"  # the Hub repo where you saved your fine-tuned adapter

# Load the base model, then attach your fine-tuned adapter on top of it
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name, use_auth_token=hf_token)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
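If the full-precision base model does not fit in the Colab GPU, you can also load it in 4-bit with bitsandbytes. This is a minimal sketch, assuming the bitsandbytes package is installed and a CUDA GPU is available; it reuses the same placeholder names as above:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config (assumes bitsandbytes is installed and a CUDA GPU is available)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the base model in 4-bit, then attach the adapter as before
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_model_name, use_auth_token=hf_token)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

If you load the model this way, move the inputs to the same device before generating, e.g. input_ids = input_ids.to(model.device).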
Then run this:
# Example text input
input_text = "your message"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
# Generate text using keyword arguments
outputs = model.generate(
    input_ids=input_ids,
    max_length=200,  # you can increase this value
    no_repeat_ngram_size=2,
    early_stopping=True,  # only has an effect with beam search (num_beams > 1)
    num_return_sequences=1
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
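For the GGML/GGUF part of your question: the usual route is to merge the adapter back into the base model, save the merged weights as a regular Hugging Face checkpoint, and then use the conversion and quantization tools from the llama.cpp repo. Here is a minimal sketch of the merge step, using the model and tokenizer from the full-precision load above (the folder name "merged-model" is just a placeholder):

# Merge the adapter weights into the base model and save a plain HF checkpoint
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged-model")
tokenizer.save_pretrained("merged-model")

Do the merge from the full-precision load, not from a 4-bit-quantized base. After that, clone https://github.com/ggerganov/llama.cpp, run its HF-to-GGUF conversion script on the merged-model folder, and then run its quantize tool with a 4-bit type such as Q4_K_M. The exact script and binary names change between releases, so follow the quantization section of the llama.cpp README.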