Update README.md

6c2bc3b verified 9 months ago

12.3 kB

	---
	license: cc-by-nc-4.0
	base_model: mlabonne/NeuralMonarch-7B
	tags:
	- generated_from_trainer
	- axolotl
	- mistral
	- instruct
	- finetune
	- chatml
	- gpt4
	- synthetic data
	- distillation
	model-index:
	- name: AlphaMonarch-laser
	results: []
	datasets:
	- argilla/OpenHermes2.5-dpo-binarized-alpha
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	---
	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	# AlphaMonarch-laser

	![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/64e380b2e12618b261fa6ba0/62S_ExHO6NKCM3NhPDrds.jpeg)

	AlphaMonarch-laser is a DPO fine-tuned of [mlabonne/NeuralMonarch-7B](https://huggingface.co/mlabonne/NeuralMonarch-7B/) using the [argilla/OpenHermes2.5-dpo-binarized-alpha](https://huggingface.co/datasets/argilla/OpenHermes2.5-dpo-binarized-alpha) preference dataset but achieves better performance then [mlabonne/AlphaMonarch-7B](https://huggingface.co/mlabonne/AlphaMonarch-7B/) using LaserQLoRA. We have fine-tuned this model only on half of the projections, but have achieved better results as compared to the version released by Maximme Labonne. We have trained this model for 1080 steps.

	AlphaMonarch-laser is ranking 1 on YALL - [Yet Another LLM Leaderboard](https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard).
	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64e380b2e12618b261fa6ba0/Jgxw1FZRx7nNAdSh7nYt1.png)

	## 🏆 Evaluation results

	# Nous Benchmark

	### AGIEVAL

	\| Task \| Version \| Metric \| Value \| StdErr \|
	\|---------------------------------\|---------\|--------------\|--------\|--------\|
	\| agieval_aqua_rat \| 0 \| acc \| 28.35% \| 2.83% \|
	\| agieval_aqua_rat \| 0 \| acc_norm \| 26.38% \| 2.77% \|
	\| agieval_logiqa_en \| 0 \| acc \| 38.25% \| 1.91% \|
	\| agieval_logiqa_en \| 0 \| acc_norm \| 38.10% \| 1.90% \|
	\| agieval_lsat_ar \| 0 \| acc \| 23.91% \| 2.82% \|
	\| agieval_lsat_ar \| 0 \| acc_norm \| 23.48% \| 2.80% \|
	\| agieval_lsat_lr \| 0 \| acc \| 52.75% \| 2.21% \|
	\| agieval_lsat_lr \| 0 \| acc_norm \| 53.92% \| 2.21% \|
	\| agieval_lsat_rc \| 0 \| acc \| 66.91% \| 2.87% \|
	\| agieval_lsat_rc \| 0 \| acc_norm \| 67.29% \| 2.87% \|
	\| agieval_sat_en \| 0 \| acc \| 78.64% \| 2.86% \|
	\| agieval_sat_en \| 0 \| acc_norm \| 78.64% \| 2.86% \|
	\| agieval_sat_en_without_passage \| 0 \| acc \| 45.15% \| 3.48% \|
	\| agieval_sat_en_without_passage \| 0 \| acc_norm \| 44.17% \| 3.47% \|
	\| agieval_sat_math \| 0 \| acc \| 33.18% \| 3.18% \|
	\| agieval_sat_math \| 0 \| acc_norm \| 31.36% \| 3.14% \|
	Average: 28.41%

	### GPT4ALL

	\| Task \| Version \| Metric \| Value \| StdErr \|
	\|--------------\|---------\|----------\|-------\|--------\|
	\| arc_challenge\| 0 \| acc \| 66.30%\| ± 1.38%\|
	\| \| \| acc_norm \| 68.26%\| ± 1.36%\|
	\| arc_easy \| 0 \| acc \| 86.57%\| ± 0.70%\|
	\| \| \| acc_norm \| 80.81%\| ± 0.81%\|
	\| boolq \| 1 \| acc \| 87.16%\| ± 0.59%\|
	\| hellaswag \| 0 \| acc \| 69.60%\| ± 0.46%\|
	\| \| \| acc_norm \| 87.45%\| ± 0.33%\|
	\| openbookqa \| 0 \| acc \| 39.20%\| ± 2.19%\|
	\| \| \| acc_norm \| 49.60%\| ± 2.24%\|
	\| piqa \| 0 \| acc \| 83.03%\| ± 0.88%\|
	\| \| \| acc_norm \| 84.87%\| ± 0.84%\|
	\| winogrande \| 0 \| acc \| 81.06%\| ± 1.10%\|
	Average: 76.98%

	### TRUTHFUL-QA

	\| Task \| Version \| Metric \| Value \| StdErr \|
	\|---------------\|---------\|--------\|-------\|--------\|
	\| truthfulqa_mc \| 1 \| mc1 \| 63.04%\| ± 1.69%\|
	\| truthfulqa_mc \| 1 \| mc2 \| 78.39%\| ± 1.37%\|
	Average: 70.71%

	### BIGBENCH

	\| Task \| Version \| Metric \| Value \| StdErr \|
	\|------------------------------------------------\|---------\|-----------------------\|-------\|--------------------\|
	\| bigbench_causal_judgement \| 0 \| multiple_choice_grade\| 60.00%\| ± 3.56% \|
	\| bigbench_date_understanding \| 0 \| multiple_choice_grade\| 62.06%\| ± 2.53% \|
	\| bigbench_disambiguation_qa \| 0 \| multiple_choice_grade\| 54.26%\| ± 3.11% \|
	\| bigbench_geometric_shapes \| 0 \| multiple_choice_grade\| 23.96%\| ± 2.26% \|
	\| \| \| exact_str_match \| 0.00% \| ± 0.00% \|
	\| bigbench_logical_deduction_five_objects \| 0 \| multiple_choice_grade\| 32.80%\| ± 2.10% \|
	\| bigbench_logical_deduction_seven_objects \| 0 \| multiple_choice_grade\| 23.86%\| ± 1.61% \|
	\| bigbench_logical_deduction_three_objects \| 0 \| multiple_choice_grade\| 59.33%\| ± 2.84% \|
	\| bigbench_movie_recommendation \| 0 \| multiple_choice_grade\| 58.00%\| ± 2.21% \|
	\| bigbench_navigate \| 0 \| multiple_choice_grade\| 56.00%\| ± 1.57% \|
	\| bigbench_reasoning_about_colored_objects \| 0 \| multiple_choice_grade\| 69.20%\| ± 1.03% \|
	\| bigbench_ruin_names \| 0 \| multiple_choice_grade\| 55.36%\| ± 2.35% \|
	\| bigbench_salient_translation_error_detection \| 0 \| multiple_choice_grade\| 41.48%\| ± 1.56% \|
	\| bigbench_snarks \| 0 \| multiple_choice_grade\| 73.48%\| ± 3.29% \|
	\| bigbench_sports_understanding \| 0 \| multiple_choice_grade\| 76.06%\| ± 1.36% \|
	\| bigbench_temporal_sequences \| 0 \| multiple_choice_grade\| 55.50%\| ± 1.57% \|
	\| bigbench_tracking_shuffled_objects_five_objects\| 0 \| multiple_choice_grade\| 23.28%\| ± 1.20% \|
	\| bigbench_tracking_shuffled_objects_seven_objects\| 0 \| multiple_choice_grade\| 19.37%\| ± 0.94% \|
	\| bigbench_tracking_shuffled_objects_three_objects\| 0 \| multiple_choice_grade\| 59.33%\| ± 2.84% \|
	Average: 55.37%

	# Openllm Benchmark

	\| Task \|Version\| Metric \|Value\| \|Stderr\|
	\|-------------\|------:\|--------\|----:\|---\|-----:\|
	\|arc_challenge\| 0\|acc \|70.12\|± \| 1.30\|
	\| \| \|acc_norm\|73.27\|± \| 1.29\|
	\|hellaswag \| 0\|acc \|71.80\|± \| 0.44\|
	\| \| \|acc_norm\|89.20\|± \| 0.30\|
	\|gsm8k \| 0\|acc \|66.77\|± \| 1.2 \|
	\|winogrande \| 0\|acc \|84.6 \|± \| 1.0 \|

	Average: 73.5%

	### TruthfulQA
	\| Task \|Version\|Metric\|Value\| \|Stderr\|
	\|-------------\|------:\|------\|----:\|---\|-----:\|
	\|truthfulqa_mc\| 1\|mc1 \|62.79\|± \| 1.69\|
	\| \| \|mc2 \|77.90\|± \| 1.37\|

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-07
	- train_batch_size: 1
	- eval_batch_size: 8
	- seed: 42
	- gradient_accumulation_steps: 8
	- total_train_batch_size: 8
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_steps: 100
	- training_steps: 1080



	### 📝 Axolotl Configuration

	```yaml
	base_model: mlabonne/NeuralMonarch-7B
	model_type: MistralForCausalLM
	tokenizer_type: LlamaTokenizer
	is_mistral_derived_model: true
	load_in_8bit: false
	load_in_4bit: true
	strict: false
	rl: dpo
	chat_template: chatml
	datasets:
	- path: mlabonne/chatml-OpenHermes2.5-dpo-binarized-alpha
	split: train
	type: chatml.intel
	dataset_prepared_path:
	val_set_size: 0.01
	output_dir: ./out
	adapter: qlora
	lora_model_dir:
	sequence_len: 1800
	sample_packing: false
	pad_to_sequence_len: false
	lora_r: 16
	lora_alpha: 16
	lora_dropout: 0.05
	lora_target_linear: true
	lora_fan_in_fan_out:
	lora_target_modules:
	- layers.1.self_attn.q_proj
	- layers.0.self_attn.q_proj
	- layers.15.self_attn.q_proj
	- layers.12.self_attn.q_proj
	- layers.11.self_attn.q_proj
	- layers.14.self_attn.q_proj
	- layers.9.self_attn.q_proj
	- layers.16.self_attn.q_proj
	- layers.30.self_attn.q_proj
	- layers.18.self_attn.q_proj
	- layers.13.self_attn.q_proj
	- layers.10.self_attn.q_proj
	- layers.7.self_attn.q_proj
	- layers.8.self_attn.q_proj
	- layers.4.self_attn.q_proj
	- layers.19.self_attn.q_proj
	- layers.27.self_attn.k_proj
	- layers.24.self_attn.k_proj
	- layers.25.self_attn.k_proj
	- layers.22.self_attn.k_proj
	- layers.26.self_attn.k_proj
	- layers.29.self_attn.k_proj
	- layers.23.self_attn.k_proj
	- layers.28.self_attn.k_proj
	- layers.21.self_attn.k_proj
	- layers.31.self_attn.k_proj
	- layers.30.self_attn.k_proj
	- layers.20.self_attn.k_proj
	- layers.5.self_attn.k_proj
	- layers.19.self_attn.k_proj
	- layers.17.self_attn.k_proj
	- layers.18.self_attn.k_proj
	- layers.19.self_attn.v_proj
	- layers.24.self_attn.v_proj
	- layers.18.self_attn.v_proj
	- layers.5.self_attn.v_proj
	- layers.3.self_attn.v_proj
	- layers.16.self_attn.v_proj
	- layers.23.self_attn.v_proj
	- layers.27.self_attn.v_proj
	- layers.25.self_attn.v_proj
	- layers.26.self_attn.v_proj
	- layers.20.self_attn.v_proj
	- layers.6.self_attn.v_proj
	- layers.15.self_attn.v_proj
	- layers.17.self_attn.v_proj
	- layers.29.self_attn.v_proj
	- layers.22.self_attn.v_proj
	- layers.12.self_attn.o_proj
	- layers.9.self_attn.o_proj
	- layers.14.self_attn.o_proj
	- layers.0.self_attn.o_proj
	- layers.6.self_attn.o_proj
	- layers.8.self_attn.o_proj
	- layers.10.self_attn.o_proj
	- layers.11.self_attn.o_proj
	- layers.13.self_attn.o_proj
	- layers.24.self_attn.o_proj
	- layers.7.self_attn.o_proj
	- layers.15.self_attn.o_proj
	- layers.5.self_attn.o_proj
	- layers.17.self_attn.o_proj
	- layers.25.self_attn.o_proj
	- layers.4.self_attn.o_proj
	- layers.31.mlp.gate_proj
	- layers.30.mlp.gate_proj
	- layers.4.mlp.gate_proj
	- layers.3.mlp.gate_proj
	- layers.29.mlp.gate_proj
	- layers.28.mlp.gate_proj
	- layers.6.mlp.gate_proj
	- layers.27.mlp.gate_proj
	- layers.5.mlp.gate_proj
	- layers.26.mlp.gate_proj
	- layers.25.mlp.gate_proj
	- layers.7.mlp.gate_proj
	- layers.2.mlp.gate_proj
	- layers.24.mlp.gate_proj
	- layers.23.mlp.gate_proj
	- layers.10.mlp.gate_proj
	- layers.6.mlp.up_proj
	- layers.4.mlp.up_proj
	- layers.5.mlp.up_proj
	- layers.27.mlp.up_proj
	- layers.25.mlp.up_proj
	- layers.26.mlp.up_proj
	- layers.17.mlp.up_proj
	- layers.24.mlp.up_proj
	- layers.7.mlp.up_proj
	- layers.10.mlp.up_proj
	- layers.3.mlp.up_proj
	- layers.11.mlp.up_proj
	- layers.23.mlp.up_proj
	- layers.9.mlp.up_proj
	- layers.14.mlp.up_proj
	- layers.18.mlp.up_proj
	- layers.19.mlp.down_proj
	- layers.20.mlp.down_proj
	- layers.18.mlp.down_proj
	- layers.21.mlp.down_proj
	- layers.29.mlp.down_proj
	- layers.1.mlp.down_proj
	- layers.22.mlp.down_proj
	- layers.28.mlp.down_proj
	- layers.23.mlp.down_proj
	- layers.30.mlp.down_proj
	- layers.17.mlp.down_proj
	- layers.4.mlp.down_proj
	- layers.2.mlp.down_proj
	- layers.15.mlp.down_proj
	- layers.5.mlp.down_proj
	wandb_project: axolotl
	wandb_entity:
	wandb_watch:
	wandb_name:
	wandb_log_model:
	gradient_accumulation_steps: 8
	micro_batch_size: 1
	num_epochs: 1
	optimizer: paged_adamw_32bit
	lr_scheduler: cosine
	learning_rate: 5e-7
	train_on_inputs: false
	group_by_length: false
	bf16: true
	fp16: false
	tf32: true
	gradient_checkpointing: true
	early_stopping_patience:
	resume_from_checkpoint:
	local_rank:
	logging_steps: 1
	xformers_attention:
	flash_attention: true
	warmup_steps: 100
	evals_per_epoch: 1
	eval_table_size:
	eval_table_max_new_tokens: 128
	save_steps: 1080
	max_steps: 1080
	debug:
	deepspeed:
	weight_decay: 0.0
	fsdp:
	fsdp_config:
	special_tokens:
	```


	### Framework versions

	- Transformers 4.38.0.dev0
	- Pytorch 2.1.2+cu118
	- Datasets 2.17.0
	- Tokenizers 0.15.0
	- axolotl: 0.4.0

	[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)