Mistral-Instruct-7B-TPO-y2-v0.2 Model Card
TPO (Triple Preference Optimization) is a novel preference optimization algorithm aimed at enhancing the instruction-following and reasoning capabilities of large language models through a one-step optimization process. Additionally, we introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our preprint and GitHub repository.
Model Details
Model Description
We fine-tuned mistralai/Mistral-7B-Instruct-v0.2 on princeton-nlp/mistral-instruct-ultrafeedback with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the second-best response as the preferred response, and the lowest-scoring response as the rejected response. Versions 0.1 and 0.2 differ in their chat templates, as explained in the Appendix of the TPO paper. For more details, refer to our preprint.
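For reference, the triple construction described above can be sketched as follows. This is a minimal illustration rather than the project's actual preprocessing code, and the column names (all_generated_responses, all_rm_scores) are assumptions about the dataset schema.

# Minimal sketch: build (gold, preferred, rejected) triples from scored candidate responses.
# Column names are assumptions and may differ from the actual dataset schema.
from datasets import load_dataset

def to_triple(example):
    # Rank candidate responses by their reward-model scores, highest first.
    ranked = sorted(
        zip(example["all_rm_scores"], example["all_generated_responses"]),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return {
        "prompt": example["prompt"],
        "gold": ranked[0][1],       # highest-scoring response
        "preferred": ranked[1][1],  # second-best response
        "rejected": ranked[-1][1],  # lowest-scoring response
    }

dataset = load_dataset("princeton-nlp/mistral-instruct-ultrafeedback", split="train")
triples = dataset.map(to_triple)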
- Developed by: Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- Model type: Causal Language Model
- License: mistral
- Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2
Model Sources
- Repository: https://github.com/sahsaeedi/TPO
- Paper: https://arxiv.org/abs/2405.16681
How to Get Started with the Model
import torch
from transformers import pipeline

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y2-v0.2"

# Load the model as a chat-style text-generation pipeline in bfloat16 on GPU.
generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

outputs = generator(
    [{"role": "user", "content": "What's the difference between llamas and alpacas?"}],
    do_sample=False,
    max_new_tokens=200,
)

# The pipeline returns the full conversation; the last message is the model's reply.
print(outputs[0]["generated_text"][-1]["content"])
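With a recent version of transformers, passing chat-style messages to the text-generation pipeline applies the model's chat template automatically, so no manual prompt formatting is needed; generation stops at the tokenizer's default EOS token.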
Training Details
Training Data
We use princeton-nlp/mistral-instruct-ultrafeedback as the preference optimization dataset.
Training Hyperparameters
The hyperparameters used for fine-tuning are listed in the TPO GitHub repository.
Technical Specifications
Model Architecture and Objective
The model architecture is based on mistralai/Mistral-7B-Instruct-v0.2. We use the TPO training objective proposed in our preprint.
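As a rough illustration only (the exact TPO objective, and TPO-L's reward margin, are defined in the preprint), the training signal combines supervised learning on the gold response with a DPO-style preference term between the preferred and rejected responses. The sketch below implements that generic combination under stated assumptions; the beta and alpha weights and the log-probability inputs are placeholders, not the paper's notation.

import torch
import torch.nn.functional as F

def tpo_style_loss(policy_gold_logps, policy_pref_logps, policy_rej_logps,
                   ref_pref_logps, ref_rej_logps, beta=0.1, alpha=1.0):
    """Illustrative combination of an SFT term on the gold response with a
    DPO-style preference term on (preferred, rejected) pairs.
    A sketch of the general idea, not the exact TPO/TPO-L objective."""
    # Preference term: margin between preferred and rejected responses,
    # each measured relative to the reference model (as in DPO).
    pref_logits = beta * ((policy_pref_logps - ref_pref_logps)
                          - (policy_rej_logps - ref_rej_logps))
    preference_loss = -F.logsigmoid(pref_logits)
    # SFT term: maximize the likelihood of the gold (highest-scoring) response.
    sft_loss = -policy_gold_logps
    return (preference_loss + alpha * sft_loss).mean()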
Hardware
We used 8xA100 GPUs for model training.
Citation
TPO paper:
@misc{saeidi2025triplepreferenceoptimizationachieving,
  title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
  author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
  year={2025},
  eprint={2405.16681},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2405.16681},
}