Instruct-Llama-3-8B-TPO-y2 Model Card
TPO (Triple Preference Optimization) is a novel preference optimization algorithm aimed at enhancing the instruction-following and reasoning capabilities of large language models through a one-step optimization process. Additionally, we introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our preprint and GitHub repository.
Model Details
Model Description
We fine-tuned meta-llama/Meta-Llama-3-8B-Instruct on princeton-nlp/llama3-ultrafeedback-armorm with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the second-best response as the preferred response, and the lowest-scoring response as the rejected response.
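As an illustration of this selection rule, here is a minimal sketch, assuming each dataset row carries a list of candidate responses and their per-response reward scores (the field names `all_generated_responses` and `all_rm_scores` are assumptions about the dataset layout, not guaranteed):

def build_triple(example):
    # Assumed fields: candidate responses and their reward-model scores.
    responses = example["all_generated_responses"]
    scores = example["all_rm_scores"]
    # Rank candidates from highest to lowest reward.
    ranked = sorted(zip(responses, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "gold": ranked[0][0],       # highest-scoring response
        "preferred": ranked[1][0],  # second-best response
        "rejected": ranked[-1][0],  # lowest-scoring response
    }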
- Developed by: Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- Model type: Causal Language Model
- License: llama3 (Meta Llama 3 Community License)
- Finetuned from model: meta-llama/Meta-Llama-3-8B-Instruct
Model Sources
- Repository: https://github.com/sahsaeedi/TPO
- Paper: https://arxiv.org/abs/2405.16681
How to Get Started with the Model
import torch
from transformers import pipeline

model_id = "tpo-alignment/Instruct-Llama-3-8B-TPO-y2"

generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

# Passing a list of chat messages applies the model's chat template automatically.
messages = [{"role": "user", "content": "What's the difference between llamas and alpacas?"}]

outputs = generator(
    messages,
    do_sample=False,
    # Stop at Llama 3's end-of-turn token as well as the default EOS token.
    eos_token_id=[
        generator.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        generator.tokenizer.eos_token_id,
    ],
    max_new_tokens=200,
)

# generated_text holds the full conversation; the last message is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])
Training Details
Training Data
We use princeton-nlp/llama3-ultrafeedback-armorm as the preference optimization dataset.
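A minimal loading sketch with the `datasets` library; the `train` split name is an assumption, so check the dataset card for the exact schema:

from datasets import load_dataset

# Load the preference-optimization data used for fine-tuning.
dataset = load_dataset("princeton-nlp/llama3-ultrafeedback-armorm", split="train")
print(dataset.column_names)  # inspect the available fields before building triples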
Training Hyperparameters
The hyperparameters used can be found in the repository.
Technical Specifications
Model Architecture and Objective
The model architecture is based on meta-llama/Meta-Llama-3-8B-Instruct. We use the TPO training objective proposed in our preprint.
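The exact TPO objective is specified in the preprint. Purely as an illustration of a triple-based loss, the sketch below combines a DPO-style preference term on the preferred/rejected pair with a supervised term on the gold response; the hyperparameters `beta` and `alpha` and the use of frozen reference-model log-probabilities are assumptions for this sketch, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def triple_preference_loss(
    policy_gold_logps,   # log p_theta(y_gold | x), summed over response tokens
    policy_pref_logps,   # log p_theta(y_pref | x)
    policy_rej_logps,    # log p_theta(y_rej  | x)
    ref_pref_logps,      # log p_ref(y_pref | x) from a frozen reference model (assumed)
    ref_rej_logps,       # log p_ref(y_rej  | x)
    beta=0.1,            # assumed preference-temperature hyperparameter
    alpha=1.0,           # assumed weight on the gold-response term
):
    # DPO-style margin between preferred and rejected responses.
    margin = beta * ((policy_pref_logps - ref_pref_logps) - (policy_rej_logps - ref_rej_logps))
    preference_term = -F.logsigmoid(margin)
    # Supervised (behavior-cloning) term on the gold response.
    sft_term = -policy_gold_logps
    return (preference_term + alpha * sft_term).mean()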
Hardware
We used 8xA100 GPUs for model training.
Citation
TPO paper:
@misc{saeidi2025triplepreferenceoptimizationachieving,
  title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
  author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
  year={2025},
  eprint={2405.16681},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2405.16681},
}