Model Details

  • SFT: based on meta-llama/Llama-2-7b-hf, trained on merged Alpaca datasets
  • DPO: trained on top of the SFT model as a LoRA adapter, on merged hh-rlhf data
  • PPO: trained on top of the DPO model and a reward model, with multiple adapters, using PKU-SafeRLHF data for further RLHF
  • Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + FlashAttention-2 (see the sketch after this list)
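
As a rough illustration of this training stack, below is a minimal, hedged sketch of what the DPO stage could look like with TRL (~0.7-era API) + QLoRA + FlashAttention-2 + DeepSpeed ZeRO-1. It is not the exact training script: the hyperparameters, the dataset preprocessing, and the config file name ds_zero1.json are assumptions.

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import DPOTrainer

base_model = "meta-llama/Llama-2-7b-hf"  # in practice, the SFT checkpoint would go here

# QLoRA: load the base weights in 4-bit and train only a LoRA adapter on top.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # FlashAttention-2
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# hh-rlhf must first be mapped into prompt/chosen/rejected columns for DPOTrainer.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

trainer = DPOTrainer(
    model,
    ref_model=None,  # with a peft_config, TRL uses the frozen base weights as the DPO reference
    beta=0.1,
    args=TrainingArguments(
        output_dir="llama2-dpo-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        deepspeed="ds_zero1.json",  # DeepSpeed ZeRO-1 config (assumed file name)
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()

The PPO stage then reuses this pattern with TRL's PPO trainer, attaching the reward model as a second adapter on the same base weights.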

Model and Training Details

Training Results

(Figure: training curves.)

Evaluation

Reward and toxicity scores were computed on the PKU-Alignment/PKU-SafeRLHF-30K data and compared across the SFT, DPO, and PPO models (a sketch of the scoring setup follows the table).

Model      Toxicity   Reward
SFT_v0.1   0.0698     -0.2828
DPO_v0.1   0.0356     -0.2633
PPO_v0.1   0.0321     0.38
(Figure: reward and toxicity score comparison.)
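
For reference, here is a minimal sketch of how such scores could be computed. The card does not state which toxicity or reward models were used, so the evaluate toxicity measurement below (which defaults to a RoBERTa hate-speech classifier) and the generate_responses() helper are illustrative assumptions.

import evaluate
from datasets import load_dataset

# Toxicity measurement from the `evaluate` library; the default classifier is an
# assumption, not necessarily the one used for the table above.
toxicity = evaluate.load("toxicity")

prompts = load_dataset("PKU-Alignment/PKU-SafeRLHF-30K", split="test")["prompt"]

# generate_responses() is a hypothetical helper that runs one policy (SFT/DPO/PPO)
# over every prompt, e.g. with model.generate as in the Inference section below.
responses = generate_responses(prompts)

scores = toxicity.compute(predictions=responses)["toxicity"]
print(f"mean toxicity: {sum(scores) / len(scores):.4f}")

The same loop, with the toxicity classifier swapped for a reward model, would produce the reward column.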

Compute Infrastructure

The model was trained on 8× NVIDIA RTX 3090 (24 GB) or A100 PCIe (40 GB) GPUs.

Inference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "renyiyu/llama-2-7b-ppo-lora-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Pad with the EOS token; replace DEFINE_EOS_TOKEN with your own end-of-sequence string.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token_id = tokenizer.eos_token_id

def format_prompt(question):
    # Prompt template used during training.
    return f"###Question: {question}\n###Answer: "

instruction = "Your text here"
prompt = format_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output = tokenizer.decode(output[0], skip_special_tokens=True)
print(output)
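
Note that this repository is published as a PEFT (LoRA) adapter, so if loading it directly with AutoModelForCausalLM fails, a hedged alternative is to let peft resolve the base model and apply the adapter on top:

import torch
from peft import AutoPeftModelForCausalLM

# Loads the base model recorded in the adapter config and applies this LoRA adapter.
model = AutoPeftModelForCausalLM.from_pretrained(
    "renyiyu/llama-2-7b-ppo-lora-v0.1",
    torch_dtype=torch.float16,
)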

Model Card Authors

Yiyu (Michael) Ren

Model Card Contact

Email: [email protected]

Framework versions

  • PEFT 0.8.2