---
library_name: transformers
tags:
- orpo
- llama3-8B
- Supervised_Training
model-index:
- name: LLAMA_Harsha_8_B_ORDP_10k
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: IFEval (0-Shot)
type: HuggingFaceH4/ifeval
args:
num_few_shot: 0
metrics:
- type: inst_level_strict_acc and prompt_level_strict_acc
value: 34.64
name: strict accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: BBH (3-Shot)
type: BBH
args:
num_few_shot: 3
metrics:
- type: acc_norm
value: 25.73
name: normalized accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MATH Lvl 5 (4-Shot)
type: hendrycks/competition_math
args:
num_few_shot: 4
metrics:
- type: exact_match
value: 5.21
name: exact match
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GPQA (0-shot)
type: Idavidrein/gpqa
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 3.13
name: acc_norm
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MuSR (0-shot)
type: TAUR-Lab/MuSR
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 7.07
name: acc_norm
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU-PRO (5-shot)
type: TIGER-Lab/MMLU-Pro
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 20.11
name: accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=asharsha30/LLAMA_Harsha_8_B_ORDP_10k
name: Open LLM Leaderboard
license: apache-2.0
datasets:
- mlabonne/orpo-dpo-mix-40k
language:
- en
base_model:
- meta-llama/Llama-3.1-8B
---
# asharsha30/LLAMA_Harsha_8_B_ORDP_10k
This model is a fine-tune of NousResearch/Meta-Llama-3-8B, trained with ORPO for 12,000 steps on [mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k).
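For reference, ORPO fine-tuning of this kind is typically run with TRL's `ORPOTrainer`. The sketch below is illustrative rather than the exact training script; the hyperparameters (`beta`, learning rate, batch size, sequence lengths) are assumptions, not the values used for this model.

```python
# Illustrative ORPO fine-tuning sketch with TRL (not the exact script used).
# Hyperparameter values below are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

base = "NousResearch/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# orpo-dpo-mix-40k provides prompt/chosen/rejected preference pairs.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

config = ORPOConfig(
    output_dir="llama3-8b-orpo",
    beta=0.1,                      # weight of the odds-ratio (preference) term
    learning_rate=8e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_length=1024,
    max_prompt_length=512,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,    # named `tokenizer=` in older TRL versions
)
trainer.train()
```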
## 💻 Usage
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="asharsha30/LLAMA_Harsha_8_B_ORDP_10k")

messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages))
```
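For more control over generation (dtype, device placement, sampling parameters), you can also load the model directly. A minimal sketch, assuming a GPU is available and the tokenizer ships a chat template:

```python
# Direct loading sketch: load weights and tokenizer, then generate.
# Assumes a CUDA device and a chat template in the tokenizer config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "asharsha30/LLAMA_Harsha_8_B_ORDP_10k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```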
## 📈 Training and Evaluation Report:
Training and evaluation reports are available on Weights & Biases:
https://wandb.ai/asharshavardhana96-texas-a-m-university/huggingface/runs/gky6j4vn?nw=nwuserasharshavardhana96
## Acknowledgment:
Huge thanks to Maxime Labonne for his brilliant blog post covering the techniques for fine-tuning Llama models with SFT and ORPO.
## Evaluated Using:
The model was evaluated with [llm-autoeval](https://github.com/mlabonne/llm-autoeval); the results below are summarized from the generated gist: https://gist.github.com/asharsha30-1996/4162fc98d9669aab3080645c54905bd0
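To reproduce individual numbers locally, the same tasks can be run with EleutherAI's lm-evaluation-harness, which llm-autoeval wraps. A minimal sketch (the task selection and batch size here are assumptions):

```python
# Sketch: re-run a subset of the benchmarks with lm-evaluation-harness
# (pip install lm-eval). Task names and batch size are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=asharsha30/LLAMA_Harsha_8_B_ORDP_10k,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```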
## Accuracy on Nous Benchmarks:
| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|----------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[LLAMA_Harsha_8_B_ORDP_10k](https://huggingface.co/asharsha30/LLAMA_Harsha_8_B_ORDP_10k)| 35.54| 71.15| 55.39| 37.96| 50.01|
### AGIEval
| Task |Version| Metric |Value| |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat | 0|acc |26.77|± | 2.78|
| | |acc_norm|27.17|± | 2.80|
|agieval_logiqa_en | 0|acc |31.34|± | 1.82|
| | |acc_norm|33.03|± | 1.84|
|agieval_lsat_ar | 0|acc |18.70|± | 2.58|
| | |acc_norm|19.57|± | 2.62|
|agieval_lsat_lr | 0|acc |42.94|± | 2.19|
| | |acc_norm|35.10|± | 2.12|
|agieval_lsat_rc | 0|acc |52.42|± | 3.05|
| | |acc_norm|43.87|± | 3.03|
|agieval_sat_en | 0|acc |65.53|± | 3.32|
| | |acc_norm|54.37|± | 3.48|
|agieval_sat_en_without_passage| 0|acc |41.75|± | 3.44|
| | |acc_norm|33.98|± | 3.31|
|agieval_sat_math | 0|acc |42.27|± | 3.34|
| | |acc_norm|37.27|± | 3.27|
Average: 35.54%
### GPT4All
| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |49.91|± | 1.46|
| | |acc_norm|54.10|± | 1.46|
|arc_easy | 0|acc |80.47|± | 0.81|
| | |acc_norm|80.05|± | 0.82|
|boolq | 1|acc |82.08|± | 0.67|
|hellaswag | 0|acc |61.08|± | 0.49|
| | |acc_norm|80.26|± | 0.40|
|openbookqa | 0|acc |34.00|± | 2.12|
| | |acc_norm|45.00|± | 2.23|
|piqa | 0|acc |79.71|± | 0.94|
| | |acc_norm|81.61|± | 0.90|
|winogrande | 0|acc |74.98|± | 1.22|
Average: 71.15%
### TruthfulQA
| Task |Version|Metric|Value| |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc| 1|mc1 |37.45|± | 1.69|
| | |mc2 |55.39|± | 1.50|
Average: 55.39%
### Bigbench
| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|57.37|± | 3.60|
|bigbench_date_understanding | 0|multiple_choice_grade|68.02|± | 2.43|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|31.01|± | 2.89|
|bigbench_geometric_shapes | 0|multiple_choice_grade|20.89|± | 2.15|
| | |exact_str_match | 0.00|± | 0.00|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|28.40|± | 2.02|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|20.71|± | 1.53|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|48.67|± | 2.89|
|bigbench_movie_recommendation | 0|multiple_choice_grade|31.60|± | 2.08|
|bigbench_navigate | 0|multiple_choice_grade|50.60|± | 1.58|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|63.25|± | 1.08|
|bigbench_ruin_names | 0|multiple_choice_grade|34.38|± | 2.25|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|21.84|± | 1.31|
|bigbench_snarks | 0|multiple_choice_grade|44.20|± | 3.70|
|bigbench_sports_understanding | 0|multiple_choice_grade|50.30|± | 1.59|
|bigbench_temporal_sequences | 0|multiple_choice_grade|26.30|± | 1.39|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|21.36|± | 1.16|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|15.77|± | 0.87|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|48.67|± | 2.89|
Average: 37.96%
Average score: 50.01%
Elapsed time: 02:36:38