Mistral-Instruct-7B-TPO-y2-v0.2 Model Card
TPO (Triple Preference Optimization) is a novel preference optimization algorithm aimed at enhancing the instruction-following and reasoning capabilities of large language models through a one-step optimization process. Additionally, we introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our preprint and GitHub repository.
Model Details
Model Description
We fine-tuned mistralai/Mistral-7B-Instruct-v0.2 on princeton-nlp/mistral-instruct-ultrafeedback with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the second-best response as the preferred response, and the lowest-scoring response as the rejected response. Versions 0.1 and 0.2 differ in their chat templates, as explained in the Appendix of the TPO paper. For more details, refer to our preprint.
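For reference, the triple construction described above can be sketched as follows. This is a minimal illustration rather than the project's actual preprocessing code, and the column names (all_generated_responses, all_rm_scores) are assumptions about the dataset schema.

# Minimal sketch: build (gold, preferred, rejected) triples from scored candidate responses.
# Column names are assumptions and may differ from the actual dataset schema.
from datasets import load_dataset

def to_triple(example):
    # Rank candidate responses by their reward-model scores, highest first.
    ranked = sorted(
        zip(example["all_rm_scores"], example["all_generated_responses"]),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return {
        "prompt": example["prompt"],
        "gold": ranked[0][1],       # highest-scoring response
        "preferred": ranked[1][1],  # second-best response
        "rejected": ranked[-1][1],  # lowest-scoring response
    }

dataset = load_dataset("princeton-nlp/mistral-instruct-ultrafeedback", split="train")
triples = dataset.map(to_triple)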
- Developed by: Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- Model type: Causal Language Model
- License: mistral
- Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2
Model Sources
- Repository: https://github.com/sahsaeedi/TPO
- Paper: https://arxiv.org/abs/2405.16681
How to Get Started with the Model
import torch
from transformers import pipeline

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y2-v0.2"

# Load the model as a chat-style text-generation pipeline in bfloat16 on GPU.
generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

outputs = generator(
    [{"role": "user", "content": "What's the difference between llamas and alpacas?"}],
    do_sample=False,
    max_new_tokens=200,
)

# The pipeline returns the full conversation; the last message is the model's reply.
print(outputs[0]["generated_text"][-1]["content"])
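With a recent version of transformers, passing chat-style messages to the text-generation pipeline applies the model's chat template automatically, so no manual prompt formatting is needed; generation stops at the tokenizer's default EOS token.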
Training Details
Training Data
We use princeton-nlp/mistral-instruct-ultrafeedback as the preference optimization dataset.
Training Hyperparameters
The hyperparameters used for fine-tuning are listed in the TPO GitHub repository.
Technical Specifications
Model Architecture and Objective
The model architecture is based on mistralai/Mistral-7B-Instruct-v0.2. We use the TPO training objective proposed in our preprint.
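As a rough illustration only (the exact TPO objective, and TPO-L's reward margin, are defined in the preprint), the training signal combines supervised learning on the gold response with a DPO-style preference term between the preferred and rejected responses. The sketch below implements that generic combination under stated assumptions; the beta and alpha weights and the log-probability inputs are placeholders, not the paper's notation.

import torch
import torch.nn.functional as F

def tpo_style_loss(policy_gold_logps, policy_pref_logps, policy_rej_logps,
                   ref_pref_logps, ref_rej_logps, beta=0.1, alpha=1.0):
    """Illustrative combination of an SFT term on the gold response with a
    DPO-style preference term on (preferred, rejected) pairs.
    A sketch of the general idea, not the exact TPO/TPO-L objective."""
    # Preference term: margin between preferred and rejected responses,
    # each measured relative to the reference model (as in DPO).
    pref_logits = beta * ((policy_pref_logps - ref_pref_logps)
                          - (policy_rej_logps - ref_rej_logps))
    preference_loss = -F.logsigmoid(pref_logits)
    # SFT term: maximize the likelihood of the gold (highest-scoring) response.
    sft_loss = -policy_gold_logps
    return (preference_loss + alpha * sft_loss).mean()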
Hardware
We used 8xA100 GPUs for model training.
Citation
TPO paper:
@misc{saeidi2025triplepreferenceoptimizationachieving,
  title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
  author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
  year={2025},
  eprint={2405.16681},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2405.16681},
}