Instruct-Llama-3-8B-TPO-y2 Model Card
TPO (Triple Preference Optimization) is a novel preference optimization algorithm aimed at enhancing the instruction-following and reasoning capabilities of large language models through a one-step optimization process. Additionally, we introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our preprint and GitHub repository.
Model Details
Model Description
We fine-tuned meta-llama/Meta-Llama-3-8B-Instruct on princeton-nlp/llama3-ultrafeedback-armorm with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the second-best response as the preferred response, and the lowest-scoring response as the rejected response.
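As an illustration of this selection rule, here is a minimal sketch, assuming each dataset row carries a list of candidate responses and their per-response reward scores (the field names `all_generated_responses` and `all_rm_scores` are assumptions about the dataset layout, not guaranteed):

def build_triple(example):
    # Assumed fields: candidate responses and their reward-model scores.
    responses = example["all_generated_responses"]
    scores = example["all_rm_scores"]
    # Rank candidates from highest to lowest reward.
    ranked = sorted(zip(responses, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "gold": ranked[0][0],       # highest-scoring response
        "preferred": ranked[1][0],  # second-best response
        "rejected": ranked[-1][0],  # lowest-scoring response
    }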
- Developed by: Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- Model type: Causal Language Model
- License: llama3 (Meta Llama 3 Community License)
- Finetuned from model: meta-llama/Meta-Llama-3-8B-Instruct
Model Sources
- Repository: https://github.com/sahsaeedi/TPO
- Paper: https://arxiv.org/abs/2405.16681
How to Get Started with the Model
import torch
from transformers import pipeline

model_id = "tpo-alignment/Instruct-Llama-3-8B-TPO-y2"

generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

# Passing a list of chat messages applies the model's chat template automatically.
messages = [{"role": "user", "content": "What's the difference between llamas and alpacas?"}]

outputs = generator(
    messages,
    do_sample=False,
    # Stop at Llama 3's end-of-turn token as well as the default EOS token.
    eos_token_id=[
        generator.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        generator.tokenizer.eos_token_id,
    ],
    max_new_tokens=200,
)

# generated_text holds the full conversation; the last message is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])
Training Details
Training Data
We use princeton-nlp/llama3-ultrafeedback-armorm as the preference optimization dataset.
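A minimal loading sketch with the `datasets` library; the `train` split name is an assumption, so check the dataset card for the exact schema:

from datasets import load_dataset

# Load the preference-optimization data used for fine-tuning.
dataset = load_dataset("princeton-nlp/llama3-ultrafeedback-armorm", split="train")
print(dataset.column_names)  # inspect the available fields before building triples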
Training Hyperparameters
The hyperparameters used can be found in the repository.
Technical Specifications
Model Architecture and Objective
The model architecture is based on meta-llama/Meta-Llama-3-8B-Instruct. We use the TPO training objective proposed in our preprint.
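The exact TPO objective is specified in the preprint. Purely as an illustration of a triple-based loss, the sketch below combines a DPO-style preference term on the preferred/rejected pair with a supervised term on the gold response; the hyperparameters `beta` and `alpha` and the use of frozen reference-model log-probabilities are assumptions for this sketch, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def triple_preference_loss(
    policy_gold_logps,   # log p_theta(y_gold | x), summed over response tokens
    policy_pref_logps,   # log p_theta(y_pref | x)
    policy_rej_logps,    # log p_theta(y_rej  | x)
    ref_pref_logps,      # log p_ref(y_pref | x) from a frozen reference model (assumed)
    ref_rej_logps,       # log p_ref(y_rej  | x)
    beta=0.1,            # assumed preference-temperature hyperparameter
    alpha=1.0,           # assumed weight on the gold-response term
):
    # DPO-style margin between preferred and rejected responses.
    margin = beta * ((policy_pref_logps - ref_pref_logps) - (policy_rej_logps - ref_rej_logps))
    preference_term = -F.logsigmoid(margin)
    # Supervised (behavior-cloning) term on the gold response.
    sft_term = -policy_gold_logps
    return (preference_term + alpha * sft_term).mean()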
Hardware
We used 8xA100 GPUs for model training.
Citation
TPO paper:
@misc{saeidi2025triplepreferenceoptimizationachieving,
  title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
  author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
  year={2025},
  eprint={2405.16681},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2405.16681},
}