metadata

language:
  - en
license: cc-by-nc-4.0
tags:
  - distilabel
  - dpo
  - rlaif
  - rlhf
  - merge
  - mergekit
datasets:
  - argilla/distilabel-intel-orca-dpo-pairs
base_model: mlabonne/Marcoro14-7B-slerp
model-index:
  - name: distilabeled-Marcoro14-7B-slerp
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 70.73
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 87.47
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 65.22
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 65.1
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 82.08
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 71.19
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp
          name: Open LLM Leaderboard

⚗️ distilabeled Marcoro14 7B Slerp

Introduction

This model is a DPO fine-tune of mlabonne/Marcoro14-7B-slerp on our new open dataset argilla/distilabel-intel-orca-dpo-pairs. You can find more information about the "distilabeled" dataset in the argilla/distilabeled-Hermes-2.5-Mistral-7B repo, and about the tool used to build it at distilabel.

Training details

As we did with Notus, we wanted a reproducible recipe to test the impact of data quality.

We're lucky to have so many amazing folks in the open community contributing reproducible, easy-to-use training scripts and recipes. This time, Maxime Labonne had shared a Colab to fine-tune OpenHermes with DPO on Intel's original dataset, which was perfect for our purpose. We only changed the base model to mlabonne/Marcoro14-7B-slerp and applied the same dataset recipe we used for argilla/distilabeled-Hermes-2.5-Mistral-7B:

from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

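# we loaded the distilabeled version:
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Keep only pairs that are not ties, whose chosen response scored 8 or more,
# and that are not flagged as appearing in the GSM8K train split.
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)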
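
To make the effect of the three filter conditions concrete, here is a minimal offline sketch that applies the same predicate to a handful of hand-written rows. The rows, their `id` field, and the status values other than "tie" are invented for illustration; only the three column names read by the predicate come from the recipe above.

```python
# Hypothetical rows: values are made up, only the column names
# (status, chosen_score, in_gsm8k_train) mirror the recipe above.
rows = [
    {"id": "a", "status": "unchanged", "chosen_score": 9.0, "in_gsm8k_train": False},
    {"id": "b", "status": "tie", "chosen_score": 9.0, "in_gsm8k_train": False},
    {"id": "c", "status": "swapped", "chosen_score": 7.0, "in_gsm8k_train": False},
    {"id": "d", "status": "unchanged", "chosen_score": 10.0, "in_gsm8k_train": True},
]


def keep(r):
    """Same predicate as the lambda passed to dataset.filter."""
    return (
        r["status"] != "tie"
        and r["chosen_score"] >= 8
        and not r["in_gsm8k_train"]
    )


kept_ids = [r["id"] for r in rows if keep(r)]
print(kept_ids)  # row b is a tie, c scores below 8, d is flagged for GSM8K
```

Only row "a" passes all three conditions, so each condition can be seen removing exactly one of the other rows.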

Benchmark results

For benchmarking we used the well-known "Nous" (or "Teknium") benchmark suite. Below is an overview, including our first experiment with a less ambitious dataset filtering (removing ties and keeping only pairs with score > 5).

For running the benchmark we used another awesome contribution from Maxime: LLM AutoEval, check it out!

Model	AGIEval	GPT4ALL	TruthfulQA	Bigbench	Average
argilla/distilabeled-Marcoro14-7B-slerp	45.4	76.47	65.46	47.19	58.63
Marcoro14-7B-slerp	44.66	76.24	64.15	45.64	57.67
argilla/distilabeled-Hermes-2.5-Mistral-7B	44.64	73.35	55.96	42.21	54.04
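
As a quick sanity check on the table, the sketch below recomputes the Average column as the plain mean of the four benchmark scores and prints the per-benchmark gain of the distilabeled model over its Marcoro14-7B-slerp base. It assumes the Average is an unweighted mean; the numbers are copied from the table, and the variable names are illustrative.

```python
benchmarks = ["AGIEval", "GPT4ALL", "TruthfulQA", "Bigbench"]

# Scores copied from the table above, in the same column order.
scores = {
    "argilla/distilabeled-Marcoro14-7B-slerp": [45.4, 76.47, 65.46, 47.19],
    "Marcoro14-7B-slerp": [44.66, 76.24, 64.15, 45.64],
    "argilla/distilabeled-Hermes-2.5-Mistral-7B": [44.64, 73.35, 55.96, 42.21],
}

# Unweighted mean per model, rounded to two decimals like the table.
averages = {m: round(sum(v) / len(v), 2) for m, v in scores.items()}

# Per-benchmark difference between the fine-tune and its base model.
tuned = scores["argilla/distilabeled-Marcoro14-7B-slerp"]
base = scores["Marcoro14-7B-slerp"]
deltas = {b: round(t - s, 2) for b, t, s in zip(benchmarks, tuned, base)}

for model, avg in averages.items():
    print(f"{model}: {avg}")
print(deltas)
```

Under that assumption the recomputed averages match the table, and the fine-tune is ahead of its base on all four benchmarks.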

Training Hardware

We used 1 x A100 80GB on RunPod for less than 1 hour.

Acknowledgements

We'd like to thank the amazing open community and in particular:

The Intel team for publishing a great open dataset and showing how well it worked in the first place.
Teknium and NousResearch for their awesome work and models.
Maxime for sharing such great resources.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	73.63
AI2 Reasoning Challenge (25-Shot)	70.73
HellaSwag (10-Shot)	87.47
MMLU (5-Shot)	65.22
TruthfulQA (0-shot)	65.10
Winogrande (5-shot)	82.08
GSM8k (5-shot)	71.19
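
The Avg. row can be reproduced from the six benchmark values, assuming it is their unweighted mean. The dictionary below simply restates the table; the variable names are illustrative.

```python
# Values copied from the Open LLM Leaderboard table above.
leaderboard = {
    "AI2 Reasoning Challenge (25-Shot)": 70.73,
    "HellaSwag (10-Shot)": 87.47,
    "MMLU (5-Shot)": 65.22,
    "TruthfulQA (0-shot)": 65.10,
    "Winogrande (5-shot)": 82.08,
    "GSM8k (5-shot)": 71.19,
}

# Unweighted mean across the six benchmarks, rounded like the table.
leaderboard_avg = round(sum(leaderboard.values()) / len(leaderboard), 2)
print(leaderboard_avg)
```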