	---
	library_name: transformers
	tags:
	- orpo
	- llama3-8B
	- Supervised_Training
	model-index:
	- name: LLAMA_Harsha_8_B_ORDP_10k
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: IFEval (0-Shot)
	      type: HuggingFaceH4/ifeval
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: inst_level_strict_acc and prompt_level_strict_acc
	      value: 34.64
	      name: strict accuracy
	    source:
	      url: >-
	        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: BBH (3-Shot)
	      type: BBH
	      args:
	        num_few_shot: 3
	    metrics:
	    - type: acc_norm
	      value: 25.73
	      name: normalized accuracy
	    source:
	      url: >-
	        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MATH Lvl 5 (4-Shot)
	      type: hendrycks/competition_math
	      args:
	        num_few_shot: 4
	    metrics:
	    - type: exact_match
	      value: 5.21
	      name: exact match
	    source:
	      url: >-
	        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GPQA (0-shot)
	      type: Idavidrein/gpqa
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: acc_norm
	      value: 3.13
	      name: acc_norm
	    source:
	      url: >-
	        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MuSR (0-shot)
	      type: TAUR-Lab/MuSR
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: acc_norm
	      value: 7.07
	      name: acc_norm
	    source:
	      url: >-
	        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU-PRO (5-shot)
	      type: TIGER-Lab/MMLU-Pro
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 20.11
	      name: accuracy
	    source:
	      url: >-
	        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
	      name: Open LLM Leaderboard
	license: apache-2.0
	datasets:
	- mlabonne/orpo-dpo-mix-40k
	language:
	- en
	base_model:
	- meta-llama/Llama-3.1-8B
	---

	# asharsha30/LLAMA_Harsha_8_B_ORDP_10k

	This model is a fine-tune of NousResearch/Meta-Llama-3-8B, trained for 12,000 steps on [mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k).



	## 💻 Usage

	```python
	# Use a pipeline as a high-level helper
	from transformers import pipeline

	messages = [
	{"role": "user", "content": "Who are you?"},
	]
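	# Build the pipeline; the model weights are downloaded on first use
	pipe = pipeline("text-generation", model="asharsha30/LLAMA_Harsha_8_B_ORDP_10k")
	# Print the result so it is also visible when run as a script
	print(pipe(messages))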
	```

	## 📈 Training and Evaluation Report:

	The training and evaluation reports are logged on Weights & Biases (W&B):

	https://wandb.ai/asharshavardhana96-texas-a-m-university/huggingface/runs/gky6j4vn?nw=nwuserasharshavardhana96

	## Acknowledgment:

	Huge thanks to Maxime Labonne for his brilliant blog post on fine-tuning Llama models with SFT and ORPO.

	## Evaluated Using:

	The model was evaluated with https://github.com/mlabonne/llm-autoeval. The results below are summarized from the generated gist: https://gist.github.com/asharsha30-1996/4162fc98d9669aab3080645c54905bd0

	## Accuracy on Nous Benchmarks:

	| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
	|-------|------:|------:|---------:|-------:|------:|
	|[LLAMA_Harsha_8_B_ORDP_10k](https://huggingface.co/asharsha30/LLAMA_Harsha_8_B_ORDP_10k)| 35.54| 71.15| 55.39| 37.96| 50.01|

	### AGIEval
	|             Task             |Version| Metric |Value|   |Stderr|
	|------------------------------|------:|--------|----:|---|-----:|
	|agieval_aqua_rat              |      0|acc     |26.77|±  |  2.78|
	|                              |       |acc_norm|27.17|±  |  2.80|
	|agieval_logiqa_en             |      0|acc     |31.34|±  |  1.82|
	|                              |       |acc_norm|33.03|±  |  1.84|
	|agieval_lsat_ar               |      0|acc     |18.70|±  |  2.58|
	|                              |       |acc_norm|19.57|±  |  2.62|
	|agieval_lsat_lr               |      0|acc     |42.94|±  |  2.19|
	|                              |       |acc_norm|35.10|±  |  2.12|
	|agieval_lsat_rc               |      0|acc     |52.42|±  |  3.05|
	|                              |       |acc_norm|43.87|±  |  3.03|
	|agieval_sat_en                |      0|acc     |65.53|±  |  3.32|
	|                              |       |acc_norm|54.37|±  |  3.48|
	|agieval_sat_en_without_passage|      0|acc     |41.75|±  |  3.44|
	|                              |       |acc_norm|33.98|±  |  3.31|
	|agieval_sat_math              |      0|acc     |42.27|±  |  3.34|
	|                              |       |acc_norm|37.27|±  |  3.27|

	Average: 35.54%

	### GPT4All
	|    Task     |Version| Metric |Value|   |Stderr|
	|-------------|------:|--------|----:|---|-----:|
	|arc_challenge|      0|acc     |49.91|±  |  1.46|
	|             |       |acc_norm|54.10|±  |  1.46|
	|arc_easy     |      0|acc     |80.47|±  |  0.81|
	|             |       |acc_norm|80.05|±  |  0.82|
	|boolq        |      1|acc     |82.08|±  |  0.67|
	|hellaswag    |      0|acc     |61.08|±  |  0.49|
	|             |       |acc_norm|80.26|±  |  0.40|
	|openbookqa   |      0|acc     |34.00|±  |  2.12|
	|             |       |acc_norm|45.00|±  |  2.23|
	|piqa         |      0|acc     |79.71|±  |  0.94|
	|             |       |acc_norm|81.61|±  |  0.90|
	|winogrande   |      0|acc     |74.98|±  |  1.22|

	Average: 71.15%
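
The reported suite average appears consistent with taking `acc_norm` for each task and falling back to `acc` where `acc_norm` is not reported (boolq and winogrande). The sketch below reproduces that arithmetic from the table values. The fallback rule is inferred from the numbers, not confirmed against llm-autoeval, and `suite_average` is a hypothetical helper name.

```python
from statistics import mean

# Values copied from the GPT4All table above: (acc, acc_norm).
# acc_norm is None where the table reports only acc.
gpt4all = {
    "arc_challenge": (49.91, 54.10),
    "arc_easy": (80.47, 80.05),
    "boolq": (82.08, None),
    "hellaswag": (61.08, 80.26),
    "openbookqa": (34.00, 45.00),
    "piqa": (79.71, 81.61),
    "winogrande": (74.98, None),
}


def suite_average(scores):
    """Mean over tasks of acc_norm, using acc when acc_norm is missing.

    Assumption: this selection rule is inferred from the reported numbers.
    """
    return mean(
        acc if acc_norm is None else acc_norm
        for acc, acc_norm in scores.values()
    )


gpt4all_average = suite_average(gpt4all)
print(f"GPT4All average: {gpt4all_average:.2f}%")
```

The same rule applied to the AGIEval table, where every task reports `acc_norm`, gives the 35.54% shown for that suite.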

	### TruthfulQA
	|    Task     |Version|Metric|Value|   |Stderr|
	|-------------|------:|------|----:|---|-----:|
	|truthfulqa_mc|      1|mc1   |37.45|±  |  1.69|
	|             |       |mc2   |55.39|±  |  1.50|

	Average: 55.39%

	### Bigbench
	|                      Task                      |Version|       Metric        |Value|   |Stderr|
	|------------------------------------------------|------:|---------------------|----:|---|-----:|
	|bigbench_causal_judgement                       |      0|multiple_choice_grade|57.37|±  |  3.60|
	|bigbench_date_understanding                     |      0|multiple_choice_grade|68.02|±  |  2.43|
	|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|31.01|±  |  2.89|
	|bigbench_geometric_shapes                       |      0|multiple_choice_grade|20.89|±  |  2.15|
	|                                                |       |exact_str_match      | 0.00|±  |  0.00|
	|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|28.40|±  |  2.02|
	|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|20.71|±  |  1.53|
	|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|48.67|±  |  2.89|
	|bigbench_movie_recommendation                   |      0|multiple_choice_grade|31.60|±  |  2.08|
	|bigbench_navigate                               |      0|multiple_choice_grade|50.60|±  |  1.58|
	|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|63.25|±  |  1.08|
	|bigbench_ruin_names                             |      0|multiple_choice_grade|34.38|±  |  2.25|
	|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|21.84|±  |  1.31|
	|bigbench_snarks                                 |      0|multiple_choice_grade|44.20|±  |  3.70|
	|bigbench_sports_understanding                   |      0|multiple_choice_grade|50.30|±  |  1.59|
	|bigbench_temporal_sequences                     |      0|multiple_choice_grade|26.30|±  |  1.39|
	|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|21.36|±  |  1.16|
	|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|15.77|±  |  0.87|
	|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|48.67|±  |  2.89|

	Average: 37.96%

	Average score: 50.01%
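
The overall score matches the unweighted mean of the four suite averages reported above. A minimal check of that arithmetic follows; the dictionary name is illustrative and the values are copied from this card.

```python
# Suite averages as reported in the sections above.
suite_averages = {
    "AGIEval": 35.54,
    "GPT4All": 71.15,
    "TruthfulQA": 55.39,
    "Bigbench": 37.96,
}

# Unweighted mean: each suite counts equally, regardless of its task count.
overall = sum(suite_averages.values()) / len(suite_averages)
print(f"Average score: {overall:.2f}%")
```

Because the mean is unweighted, the single TruthfulQA task carries the same weight as the 18 Bigbench tasks combined.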

	Elapsed time: 02:36:38