metadata

license: llama3.1
datasets:
  - survivi/Llama-3-SynE-Dataset
  - hfl/stem_zh_instruction
  - llamafactory/alpaca_zh
  - llamafactory/alpaca_gpt4_zh
  - hfl/ruozhiba_gpt4
  - codingsteven/Llama-3-8B-chat
language:
  - zh
base_model:
  - meta-llama/Llama-3.1-8B
model-index:
  - name: Control-LLM-Llama3.1-8B-SynE-Hybrid
    results:
      - task:
          type: pretraining-evaluation
        dataset:
          type: mixed
          name: Pretraining Evaluation Dataset
        metrics:
          - name: exact_match,strict-match (meta_pretrain)
            type: exact_match
            value: 0.4677775980154236
            stderr: 0.0035271375539740195
            verified: false
          - name: exact_match,strict-match (meta_bbh_3shot_cot_pretrain)
            type: exact_match
            value: 0.6516664106896022
            stderr: 0.005904999312183116
            verified: false
          - name: acc,none (meta_mmlu_5shot_pretrain)
            type: accuracy
            value: 0.6574562028201111
            stderr: 0.004004907112115045
            verified: false
          - name: exact_match,strict-match (meta_mmlu_pro_5shot_pretrain)
            type: exact_match
            value: 0.36826795212765956
            stderr: 0.004397416024070344
            verified: false
      - task:
          type: chinese-evaluation
        dataset:
          type: mixed
          name: Chinese Evaluation Dataset
        metrics:
          - name: exact_match,strict-match (zh_pretrain_multishot)
            type: exact_match
            value: 0.4448483910891089
            stderr: 0.004279257037413458
            verified: false
          - name: acc,none (ceval-valid)
            type: accuracy
            value: 0.5891530460624071
            stderr: 0.012995719777231915
            verified: false
          - name: exact_match,strict-match (ceval-valid-pretrain-cot_zh)
            type: exact_match
            value: 0.44650817236255574
            stderr: 0.013132438471522461
            verified: false
          - name: acc,none (cmmlu)
            type: accuracy
            value: 0.578742876877914
            stderr: 0.004459355253649275
            verified: false
          - name: exact_match,strict-match (cmmlu_pretrain_cot_zh)
            type: exact_match
            value: 0.4446554999136591
            stderr: 0.004526020080338497
            verified: false

Control-LLM-Llama3.1-8B-SynE-Hybrid

This model is Llama-3.1-8B fine-tuned on the SynE dataset for multilingual (Chinese) tasks using the Control LLM-Hybrid approach.

Evaluation Results

This section summarizes the evaluation results and findings.

Benchmark Results Table

The table below summarizes evaluation results on Chinese tasks and on the original capabilities of the base model (BBH, MMLU, MMLU Pro). All scores are percentages.

Model	CEval	CEvalC	CMMLU	CMMLUC	C-Avg	BBH	MLU	MLUP	O-Avg	Overall
Llama3.1-8B	48.3	12.8	51.1	14.1	13.9	65.2	65.4	35.5	45.9	29.9
Llama-3-SynE	57.7	22.3	57.1	22.8	22.8	61.9	64.0	32.6	42.9	32.9
Full Param Tune	59.0	40.2	60.2	44.3	43.8	64.8	64.9	35.0	45.4	44.6
Stack Expansion	56.0	32.7	55.2	33.4	33.3	62.3	65.6	35.3	44.8	39.1
Concat-Lerp*	57.1	34.8	57.0	37.4	37.1	64.4	64.6	35.8	45.9	41.5
Hybrid Expansion	58.9	44.7	57.9	44.3	44.4	65.1	65.7	36.9	46.8	45.6
Control LLM*	57.0	44.7	56.0	44.9	44.8	68.2	65.6	37.9	48.5	46.7

Explanation:

CEval: Chinese Evaluation
CEvalC: Chinese Evaluation (CoT - Chain of Thought)
CMMLU: Chinese MMLU
CMMLUC: Chinese MMLU (CoT)
C-Avg: Chinese average, size-weighted across CEval, CEvalC, CMMLU, and CMMLUC
BBH: BIG-Bench Hard
MLU: MMLU (Massive Multitask Language Understanding)
MLUP: MMLU Pro
O-Avg: Original capability average, size-weighted across BBH, MLU, and MLUP
Overall: Combined average across all tasks
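
The size-weighted averages (C-Avg, O-Avg) can be read as each benchmark's score weighted by its number of evaluation examples, so larger benchmarks count for more. The sketch below illustrates that arithmetic only. The function name, the benchmark names, and all numbers are hypothetical; this card does not list the actual benchmark sizes, so the example does not reproduce any value in the table.

```python
def size_weighted_average(scores, sizes):
    """Average benchmark scores, weighting each by its number of examples."""
    total = sum(sizes[name] for name in scores)
    if total <= 0:
        raise ValueError("total benchmark size must be positive")
    return sum(scores[name] * sizes[name] for name in scores) / total


# Illustrative numbers only: neither the scores nor the sizes come from the table.
example_scores = {"small_benchmark": 50.0, "large_benchmark": 100.0}
example_sizes = {"small_benchmark": 1, "large_benchmark": 3}

weighted = size_weighted_average(example_scores, example_sizes)
plain = sum(example_scores.values()) / len(example_scores)
print(f"size-weighted: {weighted:.1f}, unweighted: {plain:.1f}")
```

With these made-up inputs the larger benchmark pulls the size-weighted result above the unweighted mean, which is why a size-weighted average can sit closer to one benchmark's score than to the midpoint of the scores.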