Muennighoff committed
Commit f6558ef • 1 Parent(s): 3746e8e

Update README.md

Files changed (1)
  1. README.md +23 -44

README.md CHANGED
@@ -22,50 +22,6 @@ base_model: allenai/OLMoE-1B-7B-0924-SFT
  - Paper:
  - Logs: https://github.com/allenai/OLMoE/blob/main/logs/olmoe-dpo-logs.txt
 
- ### Evaluation Summary
-
- | Task (→) | MMLU | GSM8k | BBH | Human-Eval | Alpaca-Eval 1.0 | XSTest | IFEval | Avg |
- |---------------|------|-------|------|------------|-----------------|--------|--------|------|
- | **Setup (→)** | 0-shot | 8-shot CoT | 3-shot | 0-shot | 0-shot | 0-shot | 0-shot | |
- | **Metric (→)** | EM | EM | EM | Pass@10 | %win | F1 | Loose Acc | |
- | | | | | | | | | |
- | OLMo-1B (0724) | 25.0 | 7.0 | 22.5 | 16.0 | - | 67.6 | 20.5 | - |
- | +SFT | 36.0 | 12.5 | 27.2 | 21.2 | 41.5 | 81.9 | 26.1 | 35.9 |
- | +DPO | 36.7 | 12.5 | 30.6 | 22.0 | 50.9 | 79.8 | 24.2 | 37.4 |
- | OLMo-7B (0724) | 50.8 | 32.5 | 36.9 | 32.3 | - | 80.8 | 19.6 | - |
- | +SFT | 54.2 | 25.0 | 35.7 | 38.5 | 70.9 | 86.1 | 39.7 | 49.3 |
- | +DPO | 52.8 | 9.0 | 16.6 | 35.0 | 83.5 | **87.5** | 37.9 | 49.1 |
- | JetMoE-2B-9B | 45.6 | 43.0 | 37.2 | 54.6 | - | 68.2 | 20.0 | - |
- | +SFT | 46.1 | 53.5 | 35.6 | 64.8 | 69.3 | 55.6 | 30.5 | 50.4 |
- | DeepSeek-3B-16B | 37.7 | 18.5 | 39.4 | 48.3 | - | 65.9 | 13.5 | - |
- | +Chat | 48.5 | 46.5 | **40.8** | **70.1** | 74.8 | 85.6 | 32.3 | 57.0 |
- | Qwen1.5-3B-14B | **60.4** | 13.5 | 27.2 | 60.2 | - | 73.4 | 20.9 | - |
- | +Chat | 58.9 | **55.5** | 21.3 | 59.7 | 83.9 | 85.6 | 36.2 | 57.3 |
- | **OLMoE (This Model)** | 49.8 | 3.0 | 33.6 | 22.4 | - | 59.7 | 16.6 | - |
- | **+SFT** | 51.4 | 40.5 | 38.0 | 51.6 | 69.2 | 84.1 | 43.3 | 54.0 |
- | **+DPO** | 51.9 | 45.5 | 37.0 | 54.8 | **84.0** | 82.6 | **48.1** | **57.7** |
-
- ### Artifacts
-
- - **Pretraining**
- - [Checkpoints](https://hf.co/allenai/OLMoE-1B-7B-0924)
- - [Code](https://github.com/allenai/OLMo/tree/Muennighoff/MoE): Built on top of OLMo models.
- - [Data](https://huggingface.co/datasets/allenai/OLMoE-mix-0924): Mix of DCLM Baseline with some components of Dolma.
- - Logs: *coming soon*
-
- - **SFT (Supervised Fine-Tuning)**
- - [Checkpoints](https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT): With and without load balancing.
- - [Code](https://github.com/allenai/open-instruct/tree/olmoe-sft)
- - [Data](https://hf.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE): Preview of Tulu 3 post-training recipe.
- - [Logs](https://github.com/allenai/OLMoE/blob/main/logs/olmoe-sft-logs.txt)
-
- - **DPO/KTO (Direct Preference Optimization/Kahneman-Tversky Optimization)**
- - [Checkpoints](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct)
- - [Preference Data](https://hf.co/datasets/allenai/ultrafeedback_binarized_cleaned)
- - [DPO code](https://github.com/allenai/open-instruct/tree/olmoe-sft), [KTO code](https://github.com/Muennighoff/kto/blob/master/kto.py)
- - [Logs](https://github.com/allenai/OLMoE/blob/main/logs/olmoe-dpo-logs.txt)
-
-
  # Use
 
  Install `transformers` **from source** until a release after [this PR](https://github.com/huggingface/transformers/pull/32406) & `torch` and run:
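
The snippet that follows "and run:" falls outside the diff context shown above. As a minimal sketch only, assuming the standard `transformers` causal-LM and chat-template API and the `allenai/OLMoE-1B-7B-0924-Instruct` repo id linked in the removed artifacts list, usage looks roughly like:

```python
# Minimal sketch, not the model card's exact snippet: load the instruct
# checkpoint and generate a reply with the standard transformers chat API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMoE-1B-7B-0924-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "Explain mixture-of-experts in one sentence."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```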
@@ -99,6 +55,29 @@ Branches:
  - `non-annealed`: Ablation starting from the `non-annealed` branch of https://hf.co/allenai/OLMoE-1B-7B-0924-SFT which is an SFT of the pretraining checkpoint prior to annealing (branch `step1200000-tokens5033B` of https://hf.co/allenai/OLMoE-1B-7B-0924)
  - `kto`: Ablation using KTO instead of DPO. This branch is the checkpoint after 5,000 steps with the RMS optimizer. The other `kto*` branches correspond to the other checkpoints mentioned in the paper.
 
+ # Evaluation Snapshot
+
+ | Task (→) | MMLU | GSM8k | BBH | Human-Eval | Alpaca-Eval 1.0 | XSTest | IFEval | Avg |
+ |---------------|------|-------|------|------------|-----------------|--------|--------|------|
+ | **Setup (→)** | 0-shot | 8-shot CoT | 3-shot | 0-shot | 0-shot | 0-shot | 0-shot | |
+ | **Metric (→)** | EM | EM | EM | Pass@10 | %win | F1 | Loose Acc | |
+ | | | | | | | | | |
+ | OLMo-1B (0724) | 25.0 | 7.0 | 22.5 | 16.0 | - | 67.6 | 20.5 | - |
+ | +SFT | 36.0 | 12.5 | 27.2 | 21.2 | 41.5 | 81.9 | 26.1 | 35.9 |
+ | +DPO | 36.7 | 12.5 | 30.6 | 22.0 | 50.9 | 79.8 | 24.2 | 37.4 |
+ | OLMo-7B (0724) | 50.8 | 32.5 | 36.9 | 32.3 | - | 80.8 | 19.6 | - |
+ | +SFT | 54.2 | 25.0 | 35.7 | 38.5 | 70.9 | 86.1 | 39.7 | 49.3 |
+ | +DPO | 52.8 | 9.0 | 16.6 | 35.0 | 83.5 | **87.5** | 37.9 | 49.1 |
+ | JetMoE-2B-9B | 45.6 | 43.0 | 37.2 | 54.6 | - | 68.2 | 20.0 | - |
+ | +SFT | 46.1 | 53.5 | 35.6 | 64.8 | 69.3 | 55.6 | 30.5 | 50.4 |
+ | DeepSeek-3B-16B | 37.7 | 18.5 | 39.4 | 48.3 | - | 65.9 | 13.5 | - |
+ | +Chat | 48.5 | 46.5 | **40.8** | **70.1** | 74.8 | 85.6 | 32.3 | 57.0 |
+ | Qwen1.5-3B-14B | **60.4** | 13.5 | 27.2 | 60.2 | - | 73.4 | 20.9 | - |
+ | +Chat | 58.9 | **55.5** | 21.3 | 59.7 | 83.9 | 85.6 | 36.2 | 57.3 |
+ | **OLMoE (This Model)** | 49.8 | 3.0 | 33.6 | 22.4 | - | 59.7 | 16.6 | - |
+ | **+SFT** | 51.4 | 40.5 | 38.0 | 51.6 | 69.2 | 84.1 | 43.3 | 54.0 |
+ | **+DPO** | 51.9 | 45.5 | 37.0 | 54.8 | **84.0** | 82.6 | **48.1** | **57.7** |
+
  # Citation
 
  ```bibtex
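
The `non-annealed` and `kto*` ablations described in the second hunk live on separate branches of the same repository. As a sketch, assuming only the standard `revision` argument of `from_pretrained` (nothing OLMoE-specific), loading one of those branches looks like:

```python
# Sketch: load an ablation branch of the instruct repo by passing the
# branch name via the standard `revision` argument.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMoE-1B-7B-0924-Instruct"
branch = "kto"  # or "non-annealed", or another kto* checkpoint branch

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=branch)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=branch)
```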