---
base_model:
  - unsloth/Meta-Llama-3.1-8B
model-index:
  - name: Llama-3.1-8B-Experimental-1206-Instruct
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 69.67
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=sethuiyer/Llama-3.1-8B-Experimental-1206-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 30.06
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=sethuiyer/Llama-3.1-8B-Experimental-1206-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 11.1
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=sethuiyer/Llama-3.1-8B-Experimental-1206-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 6.6
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=sethuiyer/Llama-3.1-8B-Experimental-1206-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 8.5
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=sethuiyer/Llama-3.1-8B-Experimental-1206-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 28.1
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=sethuiyer/Llama-3.1-8B-Experimental-1206-Instruct
          name: Open LLM Leaderboard
---

# Llama 3.1 8B Experimental 1206

## Overall Strengths

  1. Logical and Boolean Reasoning – Excels at tasks requiring clear, rule-based logic and manipulation of true/false statements.
  2. Focused Domain Knowledge – Strong on certain specialized tasks (sports rules, ruin names, hyperbaton) that blend world knowledge with language comprehension.
  3. Good Instruction Compliance – High prompt-level and instruction-level accuracy (both strict and loose) indicate that it follows user instructions effectively, even in complex or nuanced prompts.
  4. Reasonable Multi-step Reasoning – While not the strongest in every logic category, it still shows solid performance (60%+) on tasks such as disambiguation and causal reasoning.
  5. Extended Context Window (128k) – The 128k-token context window lets the model handle lengthy inputs and maintain coherence across long passages or multi-turn conversations. This is especially valuable for long-document question answering, summarization, and complex scenario analysis, where context retention is crucial.
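
A minimal chat-style usage sketch with the Hugging Face `transformers` text-generation API. The model id matches this repository; the dtype, device placement, and generation settings below are illustrative assumptions, not tested defaults.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sethuiyer/Llama-3.1-8B-Experimental-1206-Instruct"

# Load in bfloat16 so the 8B model fits comfortably on a single modern GPU.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A boolean-reasoning prompt, playing to the strengths listed above.
messages = [
    {"role": "user", "content": "Is 'not (A and not A)' true for every boolean A? Explain briefly."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```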

## Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=sethuiyer/Llama-3.1-8B-Experimental-1206-Instruct).

| Metric              | Value |
|---------------------|------:|
| Avg.                | 25.67 |
| IFEval (0-Shot)     | 69.67 |
| BBH (3-Shot)        | 30.06 |
| MATH Lvl 5 (4-Shot) | 11.10 |
| GPQA (0-shot)       |  6.60 |
| MuSR (0-shot)       |  8.50 |
| MMLU-PRO (5-shot)   | 28.10 |
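
The numbers above should be reproducible with lm-evaluation-harness, which backs the Open LLM Leaderboard. A sketch via its Python API follows; the `leaderboard_*` task names assume a recent harness release that ships the leaderboard task group, so verify them against your installed version.

```python
# pip install lm-eval  (lm-evaluation-harness)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sethuiyer/Llama-3.1-8B-Experimental-1206-Instruct,dtype=bfloat16",
    # Assumed task names for the leaderboard suite; check `lm_eval --tasks list`.
    tasks=["leaderboard_ifeval", "leaderboard_bbh", "leaderboard_mmlu_pro"],
    batch_size="auto",
)
print(results["results"])
```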