---
license: llama3.1
datasets:
- nvidia/OpenMathInstruct-2
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
model-index:
- name: Control-LLM-Llama3.1-8B-Math16
results:
- task:
type: math-evaluation
dataset:
type: parquet
name: Math, Math Hard, GSM8K
dataset_kwargs:
data_files: >-
https://github.com/linkedin/ControlLLM/blob/main/src/controlllm/inference/llm_eval_harness/additional_tasks/math/joined_math.parquet
metrics:
- name: exact_match,none
type: exact_match
value: 0.6205678398534606
stderr: 0.005249520342473376
verified: false
- name: exact_match,none (gsm8k_0shot_instruct)
type: exact_match
value: 0.8968915845337376
stderr: 0.008376436987507811
verified: false
- name: exact_match,none (meta_math_0shot_instruct)
type: exact_match
value: 0.6166
stderr: 0.006876797660918556
verified: false
- name: exact_match,none (meta_math_hard_0shot_instruct)
type: exact_match
value: 0.36027190332326287
stderr: 0.013198755610252931
verified: false
- task:
type: original-capability
dataset:
type: meta/Llama-3.1-8B-Instruct-evals
name: Llama-3.1-8B-Instruct-evals Dataset
        dataset_path: meta-llama/Llama-3.1-8B-Instruct-evals
dataset_name: Llama-3.1-8B-Instruct-evals__arc_challenge__details
metrics:
- name: exact_match,strict-match
type: exact_match
value: 0.6001372485281902
stderr: 0.002821514831773572
verified: false
- name: exact_match,strict-match (meta_arc_0shot_instruct)
type: exact_match
value: 0.8248927038626609
stderr: 0.011139722235859526
verified: false
- name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
type: exact_match
value: 0.3080357142857143
stderr: 0.021836780796366417
verified: false
- name: exact_match,strict-match (meta_mmlu_0shot_instruct)
type: exact_match
value: 0.7159948725252813
stderr: 0.00380556397209409
verified: false
- name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
type: exact_match
value: 0.45403922872340424
stderr: 0.004539171007529716
verified: false
library_name: transformers
pipeline_tag: text-generation
---
# Control-LLM-Llama3.1-8B-Math16
This is a fine-tuned version of Llama-3.1-8B-Instruct for mathematical tasks, trained on the OpenMathInstruct-2 (OpenMath2) dataset.
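A minimal usage sketch with the `transformers` text-generation pipeline is below. The repo id is a hypothetical placeholder derived from this card's model name; substitute the actual Hugging Face path.

```python
# Minimal text-generation sketch (hypothetical repo id; replace with the
# actual Hugging Face path for this model).
import torch
from transformers import pipeline

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-Math16"  # assumed repo id

generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the sum of the first 20 positive even integers?"}
]
out = generator(messages, max_new_tokens=256)
# The pipeline returns the full chat; the last message is the model's reply.
print(out[0]["generated_text"][-1]["content"])
```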
## Linked Paper
This model is associated with the paper: Control-LLM.
## Linked Open Source Code: Training, Eval, and Benchmark
This model is associated with the GitHub repository [ControlLLM](https://github.com/linkedin/ControlLLM), which contains the training, evaluation, and benchmark code.
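The linked repository also ships the additional math task configs for lm-evaluation-harness that the metadata above references (under `src/controlllm/inference/llm_eval_harness/additional_tasks/`). A minimal evaluation sketch, assuming an lm-eval v0.4+ install, that the task names match the metric suffixes in the metadata, and the same hypothetical repo id as above:

```python
# Sketch: score the model with lm-evaluation-harness (v0.4+ API).
# The include_path and task names are assumptions based on the linked
# repo layout and this card's metadata; adjust to your local checkout.
import lm_eval
from lm_eval.tasks import TaskManager

# Register the repo's external task configs alongside the built-in tasks.
task_manager = TaskManager(
    include_path="ControlLLM/src/controlllm/inference/llm_eval_harness/additional_tasks"
)

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ControlLLM/Control-LLM-Llama3.1-8B-Math16",  # assumed repo id
    tasks=[
        "gsm8k_0shot_instruct",
        "meta_math_0shot_instruct",
        "meta_math_hard_0shot_instruct",
    ],
    task_manager=task_manager,
)
print(results["results"])
```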
## Evaluation Results
Here is an overview of the evaluation results and findings:
### Benchmark Results Table
The table below summarizes evaluation results across mathematical tasks and original capabilities.
| Model            | MH   | M    | G8K  | M-Avg | ARC  | GPQA | MLU  | MLUP | O-Avg | Overall |
|------------------|------|------|------|-------|------|------|------|------|-------|---------|
| Llama3.1-8B-Inst | 23.7 | 50.9 | 85.6 | 52.1  | 83.4 | 29.9 | 72.4 | 46.7 | 60.5  | 56.3    |
| Control LLM*     | 36.0 | 61.7 | 89.7 | 62.5  | 82.5 | 30.8 | 71.6 | 45.4 | 57.6  | 60.0    |
Explanation:
- MH: MathHard
- M: Math
- G8K: GSM8K
- M-Avg: Math - Average across MathHard, Math, and GSM8K
- ARC: ARC benchmark
- GPQA: GPQA (Graduate-Level Google-Proof Q&A) benchmark
- MLU: MMLU (Massive Multitask Language Understanding)
- MLUP: MMLU Pro
- O-Avg: Original Capability - Average across ARC, GPQA, MMLU, and MLUP
- Overall: Mean of M-Avg and O-Avg (recomputed in the sketch below)
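As a quick check of how the aggregate columns are derived, the sketch below recomputes M-Avg, O-Avg, and Overall for the Control LLM row from the per-task scores in the table (plain arithmetic, no external assumptions):

```python
# Recompute the aggregate columns for the Control LLM row.
math_scores = {"MH": 36.0, "M": 61.7, "G8K": 89.7}
orig_scores = {"ARC": 82.5, "GPQA": 30.8, "MLU": 71.6, "MLUP": 45.4}

m_avg = sum(math_scores.values()) / len(math_scores)  # (36.0 + 61.7 + 89.7) / 3
o_avg = sum(orig_scores.values()) / len(orig_scores)  # (82.5 + 30.8 + 71.6 + 45.4) / 4
overall = (m_avg + o_avg) / 2                         # mean of the two group averages

print(f"M-Avg:   {m_avg:.1f}")    # 62.5
print(f"O-Avg:   {o_avg:.1f}")    # 57.6
print(f"Overall: {overall:.1f}")  # 60.0
```

Note that Overall is the mean of the two group averages, not a flat average over all seven tasks (which would give 56.8 for the Control LLM row).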
### Catastrophic Forgetting on OpenMath
The plot below illustrates and compares how catastrophic forgetting is mitigated during training.
### Alignment Result
The plot below highlights the alignment result of the model trained with Control LLM.