---
license: llama3.1
datasets:
  - survivi/Llama-3-SynE-Dataset
  - hfl/stem_zh_instruction
  - llamafactory/alpaca_zh
  - llamafactory/alpaca_gpt4_zh
  - hfl/ruozhiba_gpt4
  - codingsteven/Llama-3-8B-chat
language:
  - zh
base_model:
  - meta-llama/Llama-3.1-8B
model-index:
  - name: Control-LLM-Llama3.1-8B-SynE-Hybrid
    results:
      - task:
          type: pretraining-evaluation
        dataset:
          type: mixed
          name: Pretraining Evaluation Dataset
        metrics:
          - name: exact_match,strict-match (meta_pretrain)
            type: exact_match
            value: 0.4677775980154236
            stderr: 0.0035271375539740195
            verified: false
          - name: exact_match,strict-match (meta_bbh_3shot_cot_pretrain)
            type: exact_match
            value: 0.6516664106896022
            stderr: 0.005904999312183116
            verified: false
          - name: acc,none (meta_mmlu_5shot_pretrain)
            type: accuracy
            value: 0.6574562028201111
            stderr: 0.004004907112115045
            verified: false
          - name: exact_match,strict-match (meta_mmlu_pro_5shot_pretrain)
            type: exact_match
            value: 0.36826795212765956
            stderr: 0.004397416024070344
            verified: false
      - task:
          type: chinese-evaluation
        dataset:
          type: mixed
          name: Chinese Evaluation Dataset
        metrics:
          - name: exact_match,strict-match (zh_pretrain_multishot)
            type: exact_match
            value: 0.4448483910891089
            stderr: 0.004279257037413458
            verified: false
          - name: acc,none (ceval-valid)
            type: accuracy
            value: 0.5891530460624071
            stderr: 0.012995719777231915
            verified: false
          - name: exact_match,strict-match (ceval-valid-pretrain-cot_zh)
            type: exact_match
            value: 0.44650817236255574
            stderr: 0.013132438471522461
            verified: false
          - name: acc,none (cmmlu)
            type: accuracy
            value: 0.578742876877914
            stderr: 0.004459355253649275
            verified: false
          - name: exact_match,strict-match (cmmlu_pretrain_cot_zh)
            type: exact_match
            value: 0.4446554999136591
            stderr: 0.004526020080338497
            verified: false
---

# Control-LLM-Llama3.1-8B-SynE-Hybrid

This is a fine-tuned version of Llama-3.1-8B for multilingual (Chinese) tasks, trained on the SynE dataset using the Control LLM-Hybrid approach.
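
Below is a minimal loading and generation sketch using the Hugging Face `transformers` library. The repository id in it is an assumption and should be replaced with the actual model path.

```python
# Minimal usage sketch with Hugging Face transformers.
# The repository id below is an assumption; substitute the actual model path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-SynE-Hybrid"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 8B weights within a single high-memory GPU
    device_map="auto",
)

prompt = "请用一句话介绍大语言模型。"  # "Introduce large language models in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```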

## Evaluation Results

Here is an overview of the evaluation results and findings:

### Benchmark Results Table

The table below summarizes evaluation results across Chinese tasks and original capabilities.

| Model | CEval | CEvalC | CMMLU | CMMLUC | C-Avg | BBH | MLU | MLUP | O-Avg | Overall |
|-------|-------|--------|-------|--------|-------|-----|-----|------|-------|---------|
| Llama3.1-8B | 48.3 | 12.8 | 51.1 | 14.1 | 13.9 | 65.2 | 65.4 | 35.5 | 45.9 | 29.9 |
| Llama-3-SynE | 57.7 | 22.3 | 57.1 | 22.8 | 22.8 | 61.9 | 64.0 | 32.6 | 42.9 | 32.9 |
| Full Param Tune | 59.0 | 40.2 | 60.2 | 44.3 | 43.8 | 64.8 | 64.9 | 35.0 | 45.4 | 44.6 |
| Stack Expansion | 56.0 | 32.7 | 55.2 | 33.4 | 33.3 | 62.3 | 65.6 | 35.3 | 44.8 | 39.1 |
| Concat-Lerp* | 57.1 | 34.8 | 57.0 | 37.4 | 37.1 | 64.4 | 64.6 | 35.8 | 45.9 | 41.5 |
| Hybrid Expansion | 58.9 | 44.7 | 57.9 | 44.3 | 44.4 | 65.1 | 65.7 | 36.9 | 46.8 | 45.6 |
| Control LLM* | 57.0 | 44.7 | 56.0 | 44.9 | 44.8 | 68.2 | 65.6 | 37.9 | 48.5 | 46.7 |
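
The metric names in the metadata block (for example `acc,none (ceval-valid)`) follow the reporting format of EleutherAI's lm-evaluation-harness. The sketch below is an assumed reproduction path for the two public Chinese benchmarks only; the repository id and the 5-shot setting are assumptions, and the custom `meta_*` / `*_pretrain_cot_zh` task configurations are not part of the harness and are omitted here.

```python
# Assumed reproduction sketch using EleutherAI's lm-evaluation-harness (v0.4+).
# Only the standard ceval-valid and cmmlu tasks are covered; the repo id and
# shot count are assumptions, not confirmed by this model card.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=ControlLLM/Control-LLM-Llama3.1-8B-SynE-Hybrid,dtype=bfloat16",  # assumed repo id
    tasks=["ceval-valid", "cmmlu"],
    num_fewshot=5,  # assumed evaluation protocol
    batch_size=8,
)

# The harness reports metrics under keys such as "acc,none", matching the metadata above.
print(results["results"]["ceval-valid"]["acc,none"])
print(results["results"]["cmmlu"]["acc,none"])
```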

Explanation:

- CEval: Chinese Evaluation
- CEvalC: Chinese Evaluation (CoT - Chain of Thought)
- CMMLU: Chinese MMLU
- CMMLUC: Chinese MMLU (CoT)
- C-Avg: Chinese capability - size-weighted average across CEval, CEvalC, CMMLU, and CMMLUC (see the sketch after this list)
- BBH: BigBench Hard
- MLU: MMLU (Massive Multitask Language Understanding)
- MLUP: MMLU Pro
- O-Avg: Original capability - size-weighted average across BBH, MLU, and MLUP
- Overall: Combined average across all tasks
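
For the aggregate columns, the following is a minimal sketch of the size-weighted averaging described for C-Avg and O-Avg. The benchmark example counts in it are hypothetical placeholders, not the counts behind the reported numbers, so its output is illustrative only.

```python
# Minimal sketch of the size-weighted averaging used for C-Avg and O-Avg.
# The example counts below are hypothetical placeholders, not the authors' actual sizes.
def size_weighted_average(scores: dict[str, float], sizes: dict[str, int]) -> float:
    """Average benchmark scores, weighting each benchmark by its number of examples."""
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total

# Illustrative O-Avg over BBH, MMLU, and MMLU-Pro with placeholder counts.
scores = {"BBH": 65.0, "MMLU": 66.0, "MMLU-Pro": 37.0}
sizes = {"BBH": 6511, "MMLU": 14042, "MMLU-Pro": 12032}  # placeholder example counts
print(f"O-Avg ≈ {size_weighted_average(scores, sizes):.1f}")
```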