---
license: llama3.1
datasets:
- nvidia/OpenMathInstruct-2
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
model-index:
- name: Control-LLM-Llama3.1-8B-Math16
results:
- task:
type: math-evaluation
dataset:
type: parquet
name: Math, Math Hard, GSM8K
dataset_kwargs:
data_files: >-
https://github.com/linkedin/ControlLLM/blob/main/src/controlllm/inference/llm_eval_harness/additional_tasks/math/joined_math.parquet
metrics:
- name: exact_match,none
type: exact_match
value: 0.6205678398534606
stderr: 0.005249520342473376
verified: false
- name: exact_match,none (gsm8k_0shot_instruct)
type: exact_match
value: 0.8968915845337376
stderr: 0.008376436987507811
verified: false
- name: exact_match,none (meta_math_0shot_instruct)
type: exact_match
value: 0.6166
stderr: 0.006876797660918556
verified: false
- name: exact_match,none (meta_math_hard_0shot_instruct)
type: exact_match
value: 0.36027190332326287
stderr: 0.013198755610252931
verified: false
- task:
type: original-capability
dataset:
type: meta/Llama-3.1-8B-Instruct-evals
name: Llama-3.1-8B-Instruct-evals Dataset
        dataset_path: meta-llama/Llama-3.1-8B-Instruct-evals
dataset_name: Llama-3.1-8B-Instruct-evals__arc_challenge__details
metrics:
- name: exact_match,strict-match
type: exact_match
value: 0.6001372485281902
stderr: 0.002821514831773572
verified: false
- name: exact_match,strict-match (meta_arc_0shot_instruct)
type: exact_match
value: 0.8248927038626609
stderr: 0.011139722235859526
verified: false
- name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
type: exact_match
value: 0.3080357142857143
stderr: 0.021836780796366417
verified: false
- name: exact_match,strict-match (meta_mmlu_0shot_instruct)
type: exact_match
value: 0.7159948725252813
stderr: 0.00380556397209409
verified: false
- name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
type: exact_match
value: 0.45403922872340424
stderr: 0.004539171007529716
verified: false
library_name: transformers
pipeline_tag: text-generation
---
# Control-LLM-Llama3.1-8B-Math16
This is a fine-tuned version of Llama-3.1-8B-Instruct for mathematical tasks, trained on the OpenMathInstruct-2 (OpenMath2) dataset.
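A minimal usage sketch with the `transformers` text-generation pipeline is below. The repo id is a hypothetical placeholder derived from this card's model name; substitute the actual Hugging Face path.

```python
# Minimal text-generation sketch (hypothetical repo id; replace with the
# actual Hugging Face path for this model).
import torch
from transformers import pipeline

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-Math16"  # assumed repo id

generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the sum of the first 20 positive even integers?"}
]
out = generator(messages, max_new_tokens=256)
# The pipeline returns the full chat; the last message is the model's reply.
print(out[0]["generated_text"][-1]["content"])
```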
## Linked Paper
This model is associated with the paper: Control-LLM.
## Linked Open Source Code: Training, Eval, and Benchmark
This model is associated with the GitHub repository [ControlLLM](https://github.com/linkedin/ControlLLM), which contains the training, evaluation, and benchmark code.
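The linked repository also ships the additional math task configs for lm-evaluation-harness that the metadata above references (under `src/controlllm/inference/llm_eval_harness/additional_tasks/`). A minimal evaluation sketch, assuming an lm-eval v0.4+ install, that the task names match the metric suffixes in the metadata, and the same hypothetical repo id as above:

```python
# Sketch: score the model with lm-evaluation-harness (v0.4+ API).
# The include_path and task names are assumptions based on the linked
# repo layout and this card's metadata; adjust to your local checkout.
import lm_eval
from lm_eval.tasks import TaskManager

# Register the repo's external task configs alongside the built-in tasks.
task_manager = TaskManager(
    include_path="ControlLLM/src/controlllm/inference/llm_eval_harness/additional_tasks"
)

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ControlLLM/Control-LLM-Llama3.1-8B-Math16",  # assumed repo id
    tasks=[
        "gsm8k_0shot_instruct",
        "meta_math_0shot_instruct",
        "meta_math_hard_0shot_instruct",
    ],
    task_manager=task_manager,
)
print(results["results"])
```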
## Evaluation Results
Here is an overview of the evaluation results and findings:
### Benchmark Results Table
The table below summarizes evaluation results across mathematical tasks and original capabilities.
| Model            | MH   | M    | G8K  | M-Avg | ARC  | GPQA | MLU  | MLUP | O-Avg | Overall |
|------------------|------|------|------|-------|------|------|------|------|-------|---------|
| Llama3.1-8B-Inst | 23.7 | 50.9 | 85.6 | 52.1  | 83.4 | 29.9 | 72.4 | 46.7 | 60.5  | 56.3    |
| Control LLM*     | 36.0 | 61.7 | 89.7 | 62.5  | 82.5 | 30.8 | 71.6 | 45.4 | 57.6  | 60.0    |
Explanation:
- MH: MathHard
- M: Math
- G8K: GSM8K
- M-Avg: Math - Average across MathHard, Math, and GSM8K
- ARC: ARC benchmark
- GPQA: GPQA (Graduate-Level Google-Proof Q&A) benchmark
- MLU: MMLU (Massive Multitask Language Understanding)
- MLUP: MMLU Pro
- O-Avg: Original Capability - Average across ARC, GPQA, MMLU, and MLUP
- Overall: Mean of M-Avg and O-Avg (recomputed in the sketch below)
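As a quick check of how the aggregate columns are derived, the sketch below recomputes M-Avg, O-Avg, and Overall for the Control LLM row from the per-task scores in the table (plain arithmetic, no external assumptions):

```python
# Recompute the aggregate columns for the Control LLM row.
math_scores = {"MH": 36.0, "M": 61.7, "G8K": 89.7}
orig_scores = {"ARC": 82.5, "GPQA": 30.8, "MLU": 71.6, "MLUP": 45.4}

m_avg = sum(math_scores.values()) / len(math_scores)  # (36.0 + 61.7 + 89.7) / 3
o_avg = sum(orig_scores.values()) / len(orig_scores)  # (82.5 + 30.8 + 71.6 + 45.4) / 4
overall = (m_avg + o_avg) / 2                         # mean of the two group averages

print(f"M-Avg:   {m_avg:.1f}")    # 62.5
print(f"O-Avg:   {o_avg:.1f}")    # 57.6
print(f"Overall: {overall:.1f}")  # 60.0
```

Note that Overall is the mean of the two group averages, not a flat average over all seven tasks (which would give 56.8 for the Control LLM row).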
### Catastrophic Forgetting on OpenMath
The plot below illustrates and compares how catastrophic forgetting is mitigated during training.
### Alignment Result
The plot below highlights the alignment result of the model trained with Control LLM.