	---
	language:
	- en
	license: apache-2.0
	model-index:
	- name: Mistral7B-PairRM-SPPO-ExPO
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: IFEval (0-Shot)
	      type: HuggingFaceH4/ifeval
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: inst_level_strict_acc and prompt_level_strict_acc
	      value: 36.73
	      name: strict accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=chujiezheng/Mistral7B-PairRM-SPPO-ExPO
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: BBH (3-Shot)
	      type: BBH
	      args:
	        num_few_shot: 3
	    metrics:
	    - type: acc_norm
	      value: 13.68
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=chujiezheng/Mistral7B-PairRM-SPPO-ExPO
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MATH Lvl 5 (4-Shot)
	      type: hendrycks/competition_math
	      args:
	        num_few_shot: 4
	    metrics:
	    - type: exact_match
	      value: 0.91
	      name: exact match
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=chujiezheng/Mistral7B-PairRM-SPPO-ExPO
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GPQA (0-shot)
	      type: Idavidrein/gpqa
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: acc_norm
	      value: 3.58
	      name: acc_norm
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=chujiezheng/Mistral7B-PairRM-SPPO-ExPO
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MuSR (0-shot)
	      type: TAUR-Lab/MuSR
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: acc_norm
	      value: 8.66
	      name: acc_norm
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=chujiezheng/Mistral7B-PairRM-SPPO-ExPO
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU-PRO (5-shot)
	      type: TIGER-Lab/MMLU-Pro
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 17.24
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=chujiezheng/Mistral7B-PairRM-SPPO-ExPO
	      name: Open LLM Leaderboard
	---

	# Mistral7B-PairRM-SPPO-ExPO

	This is the extrapolated (ExPO) model based on [`UCLA-AGI/Mistral7B-PairRM-SPPO`](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO) and [`mistralai/Mistral-7B-Instruct-v0.2`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), as described in the "[Weak-to-Strong Extrapolation Expedites Alignment](https://arxiv.org/abs/2404.16792)" paper.

	Specifically, we obtain this model by extrapolating (alpha = 0.3) from the weights of the SFT and DPO/RLHF checkpoints, which yields better alignment with human preference than the DPO/RLHF checkpoint alone.
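The extrapolation step can be illustrated with a minimal sketch. It assumes the update rule `theta_expo = theta_aligned + alpha * (theta_aligned - theta_sft)`, with `mistralai/Mistral-7B-Instruct-v0.2` playing the role of the SFT checkpoint and `UCLA-AGI/Mistral7B-PairRM-SPPO` the aligned one; this card states only the value of alpha, so treat the exact rule and that mapping as assumptions. The function name `expo_extrapolate` and the toy weights are hypothetical, and plain Python lists stand in for real tensors.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def expo_extrapolate(sft: Weights, aligned: Weights, alpha: float) -> Weights:
    """Extrapolate past the aligned checkpoint along the SFT -> aligned direction.

    Assumed rule: theta_expo = theta_aligned + alpha * (theta_aligned - theta_sft)
    """
    if sft.keys() != aligned.keys():
        raise ValueError("checkpoints must contain the same parameter names")

    extrapolated: Weights = {}
    for name, w_aligned in aligned.items():
        w_sft = sft[name]
        if len(w_sft) != len(w_aligned):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Move each weight a further alpha-sized step away from the SFT value.
        extrapolated[name] = [a + alpha * (a - s) for s, a in zip(w_sft, w_aligned)]
    return extrapolated


# Toy (hypothetical) checkpoints, not real model weights.
sft_weights = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
aligned_weights = {"layer.weight": [2.0, 2.0], "layer.bias": [0.5]}

expo_weights = expo_extrapolate(sft_weights, aligned_weights, alpha=0.3)
print(expo_weights)
```

With real checkpoints the same per-parameter arithmetic would be applied to the tensors of two state dicts; weights that did not change between the two checkpoints are left unchanged by the extrapolation.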

	This extrapolated model achieves a 35.4% win rate and a 31.8% length-controlled (LC) win rate on AlpacaEval 2.0, outperforming the original `Mistral7B-PairRM-SPPO`, which scores 32.2% and 30.5%, respectively.

	## Evaluation Results

	Evaluation results on the AlpacaEval 2.0 benchmark (you can find the evaluation outputs on the [official GitHub repo](https://github.com/chujiezheng/LLM-Extrapolation/tree/main/results_alpaca)):

	| Model                                | Win Rate (Ori) | LC Win Rate (Ori) | Win Rate (+ ExPO) | LC Win Rate (+ ExPO) |
	| ------------------------------------ | -------------- | ----------------- | ----------------- | -------------------- |
	| `HuggingFaceH4/zephyr-7b-alpha`      | 6.7%           | 10.0%             | 10.6%             | 13.6%                |
	| `HuggingFaceH4/zephyr-7b-beta`       | 10.2%          | 13.2%             | 11.1%             | 14.0%                |
	| `berkeley-nest/Starling-LM-7B-alpha` | 15.0%          | 18.3%             | 18.2%             | 19.5%                |
	| `Nexusflow/Starling-LM-7B-beta`      | 26.6%          | 25.8%             | 29.6%             | 26.4%                |
	| `snorkelai/Snorkel-Mistral-PairRM`   | 24.7%          | 24.0%             | 28.8%             | 26.4%                |
	| `RLHFlow/LLaMA3-iterative-DPO-final` | 29.2%          | 36.0%             | 32.7%             | 37.8%                |
	| `internlm/internlm2-chat-1.8b`       | 3.8%           | 4.0%              | 5.2%              | 4.3%                 |
	| `internlm/internlm2-chat-7b`         | 20.5%          | 18.3%             | 28.1%             | 22.7%                |
	| `internlm/internlm2-chat-20b`        | 36.1%          | 24.9%             | 46.2%             | 27.2%                |
	| `allenai/tulu-2-dpo-7b`              | 8.5%           | 10.2%             | 11.5%             | 11.7%                |
	| `allenai/tulu-2-dpo-13b`             | 11.2%          | 15.5%             | 15.6%             | 17.6%                |
	| `allenai/tulu-2-dpo-70b`             | 15.4%          | 21.2%             | 23.0%             | 25.7%                |

	Evaluation results on the MT-Bench benchmark (you can find the evaluation outputs on the [official GitHub repo](https://github.com/chujiezheng/LLM-Extrapolation/tree/main/results_mtbench)):

	| Model                                | Original | + ExPO   |
	| ------------------------------------ | -------- | -------- |
	| `HuggingFaceH4/zephyr-7b-alpha`      | 6.85     | 6.87     |
	| `HuggingFaceH4/zephyr-7b-beta`       | 7.02     | 7.06     |
	| `berkeley-nest/Starling-LM-7B-alpha` | 7.82     | 7.91     |
	| `Nexusflow/Starling-LM-7B-beta`      | 8.10     | 8.18     |
	| `snorkelai/Snorkel-Mistral-PairRM`   | 7.63     | 7.69     |
	| `RLHFlow/LLaMA3-iterative-DPO-final` | 8.08     | 8.45     |
	| `internlm/internlm2-chat-1.8b`       | 5.17     | 5.26     |
	| `internlm/internlm2-chat-7b`         | 7.72     | 7.80     |
	| `internlm/internlm2-chat-20b`        | 8.13     | 8.26     |
	| `allenai/tulu-2-dpo-7b`              | 6.35     | 6.38     |
	| `allenai/tulu-2-dpo-13b`             | 7.00     | 7.26     |
	| `allenai/tulu-2-dpo-70b`             | 7.79     | 8.03     |


	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
	Detailed results are available in the [leaderboard details dataset](https://huggingface.co/datasets/open-llm-leaderboard/details_chujiezheng__Mistral7B-PairRM-SPPO-ExPO).

	| Metric              | Value |
	|---------------------|------:|
	| Avg.                | 13.47 |
	| IFEval (0-Shot)     | 36.73 |
	| BBH (3-Shot)        | 13.68 |
	| MATH Lvl 5 (4-Shot) |  0.91 |
	| GPQA (0-shot)       |  3.58 |
	| MuSR (0-shot)       |  8.66 |
	| MMLU-PRO (5-shot)   | 17.24 |
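
The `Avg.` row appears to be the unweighted mean of the six benchmark scores. The short check below reproduces it from the values in the table; the variable names are illustrative, and the assumption that the leaderboard uses a plain mean is inferred from the numbers rather than stated in this card.

```python
# Scores copied from the Open LLM Leaderboard table above.
scores = {
    "IFEval (0-Shot)": 36.73,
    "BBH (3-Shot)": 13.68,
    "MATH Lvl 5 (4-Shot)": 0.91,
    "GPQA (0-shot)": 3.58,
    "MuSR (0-shot)": 8.66,
    "MMLU-PRO (5-shot)": 17.24,
}

# Unweighted mean over all benchmarks, rounded to two decimals like the table.
average = round(sum(scores.values()) / len(scores), 2)
print(f"Avg. = {average}")
```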