Mistral-Instruct-7B-TPO-y2-v0.2 Model Card

TPO (Triple Preference Optimization) is a novel preference optimization algorithm aimed at enhancing the instruction-following and reasoning capabilities of large language models through a one-step optimization process. Additionally, we introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our preprint and GitHub repository.

Model Details

Model Description

We fine-tuned mistralai/Mistral-7B-Instruct-v0.2 on princeton-nlp/mistral-instruct-ultrafeedback with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the second-best response as the preferred response, and the lowest-scoring response as the rejected response. Versions 0.1 and 0.2 differ only in their chat templates; see the appendix of our preprint for details.

  • Developed by: Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
  • Model type: Causal Language Model
  • License: mistral
  • Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2

Model Sources

  • Paper: https://arxiv.org/abs/2405.16681

How to Get Started with the Model

import torch
from transformers import pipeline

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y2-v0.2"

# Load the model in bfloat16 on a CUDA device via the text-generation pipeline.
generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

# The pipeline applies the Mistral-Instruct chat template to the message list.
outputs = generator(
    [{"role": "user", "content": "What's the difference between llamas and alpacas?"}],
    do_sample=False,                                 # greedy decoding
    eos_token_id=generator.tokenizer.eos_token_id,   # stop at Mistral's </s> token
    max_new_tokens=200,
)

# With chat-format input, generated_text holds the conversation; the last message is the model's reply.
print(outputs[0]["generated_text"][-1]["content"])
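
If you prefer not to use the pipeline helper, the sketch below loads the tokenizer and model directly and applies the chat template by hand. It is only an illustration of the standard transformers workflow, mirroring the greedy-decoding settings above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y2-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

messages = [{"role": "user", "content": "What's the difference between llamas and alpacas?"}]

# Format the conversation with the model's chat template and append the assistant prefix.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding, as in the pipeline example.
output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=200)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))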

Training Details

Training Data

We use princeton-nlp/mistral-instruct-ultrafeedback as the preference optimization dataset.
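
As a rough illustration of the selection described in the Model Description, the snippet below loads the dataset and picks gold / preferred / rejected responses from a scored example. The split name and the column names (all_generated_responses, all_rm_scores) are assumptions about the dataset layout; inspect ds.column_names for the actual schema.

from datasets import load_dataset

# Load the preference dataset (assuming a standard "train" split).
ds = load_dataset("princeton-nlp/mistral-instruct-ultrafeedback", split="train")
print(ds.column_names)  # check the actual column names

def select_triple(example):
    # Hypothetical column names, shown only to illustrate the selection rule.
    responses = example["all_generated_responses"]  # candidate responses (assumed column)
    scores = example["all_rm_scores"]               # reward-model scores (assumed column)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {
        "gold": responses[order[0]],       # highest-scoring response
        "preferred": responses[order[1]],  # second-best response
        "rejected": responses[order[-1]],  # lowest-scoring response
    }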

Training Hyperparameters

The hyperparameters used for training are listed in our GitHub repository.

Technical Specifications

Model Architecture and Objective

The model architecture is based on mistralai/Mistral-7B-Instruct-v0.2. We use the TPO training objective proposed in our preprint.
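
For intuition only, here is a schematic sketch of a triple-preference loss in PyTorch: an SFT-style term anchored on the gold response combined with a DPO-style preference term between the preferred and rejected responses, following the description above. The exact formulation and weighting used by TPO are given in the preprint; the alpha and beta values here are illustrative placeholders.

import torch
import torch.nn.functional as F

def tpo_loss_sketch(policy_gold_logps, policy_pref_logps, policy_rej_logps,
                    ref_pref_logps, ref_rej_logps, alpha=0.5, beta=0.1):
    """Schematic triple-preference loss; see the preprint for the exact objective.

    All inputs are summed log-probabilities of whole responses under the
    trained policy (policy_*) or the frozen reference model (ref_*).
    """
    # DPO-style preference term between the preferred and rejected responses.
    pref_logratio = policy_pref_logps - ref_pref_logps
    rej_logratio = policy_rej_logps - ref_rej_logps
    preference_term = -F.logsigmoid(beta * (pref_logratio - rej_logratio))

    # SFT-style term anchoring the policy on the gold (highest-scoring) response.
    sft_term = -policy_gold_logps

    # Illustrative combination; the actual weighting is defined in the paper.
    return (sft_term + alpha * preference_term).mean()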

Hardware

We used 8xA100 GPUs for model training.

Citation

TPO paper:

@misc{saeidi2025triplepreferenceoptimizationachieving,
      title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization}, 
      author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
      year={2025},
      eprint={2405.16681},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.16681}, 
}