parseny/TinyLlama1.1B-Nvidia-QA
This repository contains parseny/TinyLlama1.1B-Nvidia-QA, a fine-tuned version of the TinyLlama language model for answering questions about NVIDIA documentation. The model was fine-tuned on a dataset of question-answer pairs and evaluated with ROUGE and METEOR.
Model Details
- Model ID: parseny/TinyLlama1.1B-Nvidia-QA
- Model Type: Causal Language Model
- Base Model: TinyLlama-1.1B
- Quantization: 4-bit quantization using BitsAndBytes
- Fine-Tuning Framework: Hugging Face Transformers and PEFT
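As a rough illustration of the quantization entry above, the sketch below loads a causal language model in 4-bit precision with BitsAndBytes. The specific settings (NF4, double quantization, float16 compute) and the helper name load_4bit_model are assumptions: this card states only that 4-bit BitsAndBytes quantization was used. The checkpoint ID is left as a parameter because the exact Hub ID of the base model is not given here.

```python
# Assumed 4-bit settings; the model card does not list the actual values.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",        # assumption
    "bnb_4bit_use_double_quant": True,   # assumption
}


def load_4bit_model(model_id):
    """Load a causal LM with 4-bit BitsAndBytes quantization (hypothetical helper)."""
    # Imported lazily so the settings above can be inspected without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.float16,  # assumption
        **QUANT_KWARGS,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # requires the accelerate package
    )
```

Calling load_4bit_model requires a CUDA-capable GPU with the bitsandbytes and accelerate packages installed.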
Training Configuration
The model was fine-tuned with the following training arguments:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="./logs",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    fp16=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=5,
    load_best_model_at_end=True,
    learning_rate=5e-4
)
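The batch settings above combine into an effective batch size per optimizer step. The short sketch below works through that arithmetic. The dataset size of 6,400 examples is hypothetical (the card does not state the real size), and the step count is an approximation of how the Trainer schedules updates.

```python
import math

# Values taken from the training arguments in this card.
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_train_epochs = 5


def effective_batch_size(num_devices=1):
    """Examples consumed per optimizer step across all devices."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


def total_optimizer_steps(num_examples, num_devices=1):
    """Approximate optimizer steps over the full run."""
    steps_per_epoch = math.ceil(num_examples / effective_batch_size(num_devices))
    return steps_per_epoch * num_train_epochs


print(effective_batch_size())        # 64 on a single GPU
print(total_optimizer_steps(6400))   # 500 for a hypothetical 6,400-example dataset
```

On a single GPU, each weight update therefore averages gradients over 64 examples, which is worth keeping in mind when judging the relatively high learning rate of 5e-4.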
Evaluation Metrics
The performance of the fine-tuned model was evaluated using the following metrics:
ROUGE Scores:
- ROUGE-1: 0.3122
- ROUGE-2: 0.1228
- ROUGE-L: 0.2599
- ROUGE-Lsum: 0.2600
METEOR Score: 0.27
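These scores indicate moderate lexical overlap between the generated and reference answers. ROUGE measures n-gram and longest-common-subsequence overlap, while METEOR also credits stem and synonym matches, so neither metric directly measures factual correctness.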
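To make the ROUGE numbers easier to interpret, here is a simplified ROUGE-N F1 computation in plain Python. It is a sketch for intuition only: it uses lowercase whitespace tokenization and no stemming, so it will not reproduce the scores above exactly, and it is not the evaluation script used for this model. The example sentences are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction, reference, n=1):
    """Simplified ROUGE-N F1 using lowercase whitespace tokenization."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the driver enables gpu acceleration"
prediction = "the driver enables acceleration"
print(round(rouge_n_f1(prediction, reference, n=1), 3))  # 0.889
print(round(rouge_n_f1(prediction, reference, n=2), 3))  # 0.571
```

Dropping a single word costs far more at the bigram level than at the unigram level, which is why ROUGE-2 is typically much lower than ROUGE-1.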
Model Usage
You can use this model to generate responses for chat-based applications. Below is an example of how to load and use the model for generating responses:
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from peft import PeftModel, PeftConfig
import torch

# Load the model and tokenizer
model_id = "parseny/TinyLlama1.1B-Nvidia-QA"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.to(device)

# Generation settings
# Note: penalty_alpha is used by contrastive search, which applies only when
# do_sample=False; with do_sample=True, generation uses top-k sampling.
generation_config = GenerationConfig(
    penalty_alpha=0.6, do_sample=True,
    top_k=5, temperature=0.5, repetition_penalty=1.2,
    max_new_tokens=47, pad_token_id=tokenizer.eos_token_id
)

def generate_response(prompt):
    """Generate a response and strip chat markers; return "" on failure."""
    try:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        outputs = model.generate(**inputs, generation_config=generation_config)
        generated_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Keep only the text after the assistant marker, if it is present
        marker = '<|im_start|>assistant\n'
        start_idx = generated_response.find(marker)
        if start_idx != -1:
            generated_response = generated_response[start_idx + len(marker):]
        # Cut at the end-of-turn marker, if it is present
        end_idx = generated_response.find('<|im_end|>')
        if end_idx != -1:
            generated_response = generated_response[:end_idx]
        return generated_response
    except Exception as exc:
        print(f"Generation failed: {exc}")
        return ""

# Example usage
prompt = "What was the purpose of setting up the DGX RAID memory in version 2 of the pipeline?"
response = generate_response(prompt)
print(response)
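The response parsing above looks for the markers <|im_start|>assistant and <|im_end|>, which suggests the model expects ChatML-style prompts. The prompt template is not documented in this card, so the sketch below is an assumption: build_chatml_prompt and extract_assistant_reply are hypothetical helpers, and the decoded string is a stand-in for real model output.

```python
ASSISTANT_MARKER = "<|im_start|>assistant\n"
END_MARKER = "<|im_end|>"


def build_chatml_prompt(question):
    """Wrap a question in a ChatML-style user turn and open the assistant turn."""
    return f"<|im_start|>user\n{question}{END_MARKER}\n{ASSISTANT_MARKER}"


def extract_assistant_reply(text):
    """Return the text between the assistant marker and the next end marker."""
    _, found, tail = text.partition(ASSISTANT_MARKER)
    reply = tail if found else text  # fall back to the full text if no marker
    return reply.split(END_MARKER, 1)[0].strip()


prompt = build_chatml_prompt("What is CUDA?")
# Stand-in for decoded model output: the prompt followed by a generated answer.
decoded = prompt + "A parallel computing platform.<|im_end|>"
print(extract_assistant_reply(decoded))  # A parallel computing platform.
```

If the model was in fact trained on a different template, pass the question in that format instead; the extraction helper falls back to returning the full text when no marker is found.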
Training Procedure
The model was fine-tuned using a dataset of question-answer pairs. The fine-tuning process involved:
- Loading the pre-trained TinyLlama-1.1B model.
- Quantizing the model to 4-bit precision to reduce memory usage and increase inference speed.
- Fine-tuning the model using the SFTTrainer with the specified training arguments.
- Evaluating the model at the end of each epoch and saving the best-performing model.
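The last step above can be sketched as a simple selection over per-epoch evaluation results. With load_best_model_at_end=True and no metric_for_best_model set, the Hugging Face Trainer defaults to comparing checkpoints by evaluation loss, where lower is better. The helper name pick_best_epoch is hypothetical, and the loss values are made up for illustration; they are not results from this model.

```python
def pick_best_epoch(eval_losses):
    """Return (epoch, loss) with the lowest evaluation loss; epochs are 1-indexed."""
    best_index = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    return best_index + 1, eval_losses[best_index]


# Made-up per-epoch evaluation losses, for illustration only.
illustrative_losses = [1.90, 1.62, 1.55, 1.58, 1.61]
print(pick_best_epoch(illustrative_losses))  # (3, 1.55)
```

In this made-up run the loss rises again after epoch 3, so the epoch-3 checkpoint would be the one reloaded at the end of training.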
How to Cite
If you use this model in your research or applications, please cite it as follows:
@misc{parseny-tinyllama-nvidia-qa,
  author = {Your Name},
  title = {TinyLlama1.1B-Nvidia-QA: NVIDIA documentation helper},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/parseny/TinyLlama1.1B-Nvidia-QA},
}
Contact
For any questions or issues, please open an issue on the Hugging Face model repository.