An experiment with DARE (Drop And REscale): most of the delta parameters can be set directly to zero without affecting the capabilities of SFT LMs, and larger models can tolerate a higher proportion of discarded parameters.

Merged from the DARE models listed below.

weight_mask_rate: 0.85 / use_weight_rescale: True / mask_strategy: random / scaling_coefficient: 1.0
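As a rough illustration of what these settings mean, the sketch below applies random drop-and-rescale to the delta between a base model and a fine-tuned model, using plain Python lists in place of real weight tensors. The function name `dare_drop_and_rescale`, its `seed` argument, and the toy weights are hypothetical and are not taken from the code used to build this model; only `weight_mask_rate` and `use_weight_rescale` mirror the settings above.

```python
import random


def dare_drop_and_rescale(base, finetuned, weight_mask_rate=0.85,
                          use_weight_rescale=True, seed=0):
    """Randomly drop delta parameters and rescale the survivors.

    `base` and `finetuned` are flat lists of floats standing in for weight
    tensors (an assumption made to keep the sketch dependency-free).
    """
    if not 0.0 <= weight_mask_rate < 1.0:
        raise ValueError("weight_mask_rate must be in [0, 1)")
    rng = random.Random(seed)
    # Rescaling kept deltas by 1 / (1 - p) preserves the expected delta.
    scale = 1.0 / (1.0 - weight_mask_rate) if use_weight_rescale else 1.0
    merged = []
    for w_base, w_ft in zip(base, finetuned):
        delta = w_ft - w_base
        if rng.random() < weight_mask_rate:
            delta = 0.0      # dropped: this weight falls back to the base value
        else:
            delta *= scale   # kept: rescaled to compensate for the drops
        merged.append(w_base + delta)
    return merged


if __name__ == "__main__":
    # Toy example: every delta equals 1.0, so the drop pattern is easy to see.
    toy_base = [0.0] * 20
    toy_finetuned = [1.0] * 20
    print(dare_drop_and_rescale(toy_base, toy_finetuned))
```

With a mask rate of 0.85, roughly 85% of the deltas become zero and the remaining ones are multiplied by 1 / 0.15, so the mean delta stays close to its original value.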

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K | DROP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intel/neural-chat-7b-v3-1 | 59.06 | 66.21 | 83.64 | 62.37 | 59.65 | 78.14 | 19.56 | 43.84 |
| migtissera/SynthIA-7B-v1.3 | 57.11 | 62.12 | 83.45 | 62.65 | 51.37 | 78.85 | 17.59 | 43.76 |
| bhenrym14/mistral-7b-platypus-fp16 | 56.89 | 63.05 | 84.15 | 64.11 | 45.07 | 78.53 | 17.36 | 45.92 |
| jondurbin/airoboros-m-7b-3.1.2 | 56.24 | 61.86 | 83.51 | 61.91 | 53.75 | 77.58 | 13.87 | 41.2 |
| teknium/CollectiveCognition-v1.1-Mistral-7B | 53.87 | 62.12 | 84.17 | 62.35 | 57.62 | 75.37 | 15.62 | 19.85 |
| uukuguy/speechless-mistral-dolphin-orca-platypus-samantha-7b | 53.34 | 64.33 | 84.4 | 63.72 | 52.52 | 78.37 | 21.38 | 8.66 |

2023.12.04

There appear to be problems with how the GSM8K and DROP metrics are calculated on the Open LLM Leaderboard. DROP has already been removed from the official website, and GSM8K scores remain inconsistent, varying widely between models. For now, I am therefore using only ARC, HellaSwag, MMLU, TruthfulQA, and Winogrande to evaluate the performance of DARE.

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande |
| --- | --- | --- | --- | --- | --- | --- |
| CollectiveCognition-v1.1-Mistral-7B | 68.326 | 62.12 | 84.17 | 62.35 | 57.62 | 75.37 |
| CollectiveCognition-v1.1-Mistral-7B-dare-0.85 | 66.676 | 61.01 | 84.31 | 64.34 | 44.87 | 78.85 |
| airoboros-m-7b-3.1.2 | 67.722 | 61.86 | 83.51 | 61.91 | 53.75 | 77.58 |
| airoboros-m-7b-3.1.2-dare-0.85 | 66.144 | 61.09 | 83.57 | 64.05 | 43.64 | 78.37 |
| SynthIA-7B-v1.3 | 67.688 | 62.12 | 83.45 | 62.65 | 51.37 | 78.85 |
| SynthIA-7B-v1.3-dare-0.85 | 66.340 | 61.01 | 83.50 | 64.49 | 43.77 | 78.93 |
| speechless-mistral-7b-dare-0.85 (merge of 6 DARE models) | 68.516 | 63.57 | 84.82 | 64.29 | 50.66 | 79.24 |

According to the official leaderboard results, after dropping 85% of the delta parameters, each model's five-benchmark average stays above 97.5% of the original model's average. ARC decreases slightly, TruthfulQA decreases significantly, MMLU increases noticeably, and HellaSwag and Winogrande increase slightly. The drop in TruthfulQA is the largest effect; the other metrics are relatively well maintained.
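The retention figure can be checked directly from the five-benchmark averages in the table above. The snippet below is a small sketch of that arithmetic; the `averages` dictionary and the `retention` helper are illustrative names, and the numbers are copied from the table rather than recomputed from raw evaluation output.

```python
# Five-benchmark averages copied from the table: (original, DARE 0.85).
averages = {
    "CollectiveCognition-v1.1-Mistral-7B": (68.326, 66.676),
    "airoboros-m-7b-3.1.2": (67.722, 66.144),
    "SynthIA-7B-v1.3": (67.688, 66.340),
}


def retention(original, pruned):
    """Return the pruned model's average as a percentage of the original's."""
    return 100.0 * pruned / original


retained = {name: retention(orig, dare) for name, (orig, dare) in averages.items()}

for name, pct in retained.items():
    print(f"{name}: {pct:.2f}% of the original average retained")
```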
