Benchmarks are here!
!!! Actually wild! Great work on the benchmarks!
@0-hero The ARC scores don't make sense to me. Doesn't the Llama 2 70B base only have an ARC of around 68? I read that the ARC setup used to evaluate GPT-4, Opus, and other proprietary models is very different. So for example, if Mixtral 8x22B has an ARC of 70.5, then GPT-4 has an equivalent ARC score of about 83, not 96.3.
Yep, that's also what I thought.
The improvement on GSM8K is hopefully great for reasoning (but still significantly behind the smaller Claude models...); MMLU shows a great improvement as well (I'm skeptical that Qwen 1.5's MMLU score is legit).
I wonder if this is an older base version of Mistral Large.
Any benchmarks on code writing?
Will do evalplus tomorrow on any finetune if someone hasn’t already done it
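In case it helps anyone who wants to run the same check themselves, here is a minimal EvalPlus sketch. The `generate_solution` helper and the file name are placeholders of mine, not anything from this thread; scoring then happens through the EvalPlus CLI.

```python
# EvalPlus generation sketch (pip install evalplus).
# generate_solution is a hypothetical helper: plug in your own model call
# (e.g. vLLM or transformers) that returns completed code for a prompt.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    raise NotImplementedError("call your model here and return the generated code")

samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring is then done with the EvalPlus CLI, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```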
We also ran some benchmarks on the NousResearch suite earlier today (by @bjoernp / DiscoResearch). Note that on some of these, fine-tuned versions usually do significantly better than base models (thanks @clem for the reminder to post here as well):
0-shot
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------------------------------|------:|------|-----:|--------|-----:|---|-----:|
|agieval_sat_math | 1|none | 0|acc |0.5409|± |0.0337|
| | |none | 0|acc_norm|0.4227|± |0.0334|
|agieval_sat_en_without_passage| 1|none | 0|acc |0.5825|± |0.0344|
| | |none | 0|acc_norm|0.4903|± |0.0349|
|agieval_sat_en | 1|none | 0|acc |0.8301|± |0.0262|
| | |none | 0|acc_norm|0.7476|± |0.0303|
|agieval_lsat_rc | 1|none | 0|acc |0.7472|± |0.0265|
| | |none | 0|acc_norm|0.5799|± |0.0301|
|agieval_lsat_lr | 1|none | 0|acc |0.5745|± |0.0219|
| | |none | 0|acc_norm|0.4471|± |0.0220|
|agieval_lsat_ar | 1|none | 0|acc |0.2435|± |0.0284|
| | |none | 0|acc_norm|0.2174|± |0.0273|
|agieval_logiqa_en | 1|none | 0|acc |0.3963|± |0.0192|
| | |none | 0|acc_norm|0.3840|± |0.0191|
|agieval_aqua_rat | 1|none | 0|acc |0.2677|± |0.0278|
| | |none | 0|acc_norm|0.2795|± |0.0282|
Mixtral-8x22B
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|-------------|------:|------|-----:|--------|-----:|---|-----:|
|piqa | 1|none | 0|acc |0.8313|± |0.0087|
| | |none | 0|acc_norm|0.8487|± |0.0084|
|boolq | 2|none | 0|acc |0.8780|± |0.0057|
|arc_challenge| 1|none | 0|acc |0.5922|± |0.0144|
| | |none | 0|acc_norm|0.6365|± |0.0141|
|arc_easy | 1|none | 0|acc |0.8577|± |0.0072|
| | |none | 0|acc_norm|0.8401|± |0.0075|
|winogrande | 1|none | 0|acc |0.7979|± |0.0113|
|openbookqa | 1|none | 0|acc |0.3640|± |0.0215|
| | |none | 0|acc_norm|0.4960|± |0.0224|
|hellaswag | 1|none | 0|acc |0.6719|± |0.0047|
| | |none | 0|acc_norm|0.8617|± |0.0034|
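For context, these tables are in lm-evaluation-harness output format. Below is a minimal sketch of how a comparable 0-shot run can be launched through the harness's Python API; the model ID, dtype, and batch size are my assumptions, not the exact settings used for the numbers above.

```python
# Minimal 0-shot sketch with lm-evaluation-harness (pip install lm-eval).
# Model ID, dtype, and batch size are placeholders, not the settings used above.
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="hf",
    # parallelize=True shards the (very large) model across all visible GPUs.
    model_args="pretrained=mistralai/Mixtral-8x22B-v0.1,dtype=bfloat16,parallelize=True",
    tasks=["piqa", "boolq", "arc_challenge", "arc_easy",
           "winogrande", "openbookqa", "hellaswag"],
    num_fewshot=0,
    batch_size=4,
)

# Prints a table with acc / acc_norm and stderr, like the ones above.
print(make_table(results))
```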
How much sense does it make to use base models for 0-shot benchmarks like TruthfulQA? I mean, this benchmark asks questions, which is perfect for instruction fine-tuned models but not for base models. Or am I missing something?
This was quick, thanks for all the work! I'd love to see the numbers for GPQA 0-shot CoT or DROP 3-shot as well, please, if it's possible!
@Acrobatix It's still useful to have baseline numbers pre-instruction tuning. Say you have two base models, A and B, and only have the budget to fine-tune one of them: if A outperforms B on the target task, it makes sense to pick A for tuning. Also, baseline numbers confirm that your instruction tuning was set up correctly and was actually useful (because the numbers post-tuning should be substantially improved).
Are you sure this is the BASE, not the INSTRUCT version? It looks better than GPT-3.5, I am confused.
The instruct version has not been released yet.
@ASHIDAKA Even though it's a foundational model, it responds surprisingly well to various prompts.
Testing...
https://labs.perplexity.ai/
Oh nice, this is fine-tuned? Chat is good.
Source - https://x.com/alpayariyak/status/1778329833514098832?s=46&t=ZC6wgu7iLucRMVlNeDgmYQ
That chart can be misleading.
"They're all base models", so there is still room for improvement.
What are the GPU requirements for fine-tuning this model?
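Not an authoritative answer, but a back-of-envelope way to think about it: every expert has to sit in memory even though only ~39B of the ~141B total parameters are active per token, so the weights alone dominate the budget. A rough sketch (ignoring optimizer states, gradients, activations, and KV cache, which add substantially on top):

```python
# Rough weight-memory estimate for Mixtral-8x22B (~141B total parameters).
# Real fine-tuning also needs optimizer states, gradients, activations, and KV cache,
# so treat these numbers as a lower bound on required VRAM.
TOTAL_PARAMS = 141e9

for name, bytes_per_param in [("fp16/bf16", 2.0), ("int8", 1.0), ("4-bit (QLoRA-style)", 0.5)]:
    gib = TOTAL_PARAMS * bytes_per_param / 1024**3
    print(f"{name:>20}: ~{gib:,.0f} GiB for the weights alone")
```

So even a 4-bit QLoRA-style setup likely needs several 80 GB cards (or heavy CPU offloading) in practice, and full fine-tuning in bf16 is well into multi-node territory.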
Hi everyone!
We're thrilled to announce that after finishing our evaluations, the new Mixtral model is the top-performing pretrained model on the Open LLM Leaderboard! Congratulations to Mistral! 🏆👏
For a detailed breakdown, you can view the results here: results
To see how it compares with others, visit the leaderboard: leaderboard page
For an in-depth look at the evaluation process, check the "📝 About" section on the leaderboard page
@jphme The improvement on GSM8K is hopefully great for reasoning (but still significantly behind the smaller Claude models...);
However, the Claude models are aligned LLMs rather than foundation LLMs. Perhaps the aligned version of Mixtral-8x22B-v0.1 will be greatly improved.🧐
Do you mean the instruct version? Given that it was finally released today, I hope someone can benchmark it.
Working on mt-bench for now
MT-Bench
| Model | MT-Bench |
|---|---|
| Claude 3 Opus | 9.43 |
| GPT-4-1106-Preview | 9.32 |
| Claude 3 Sonnet | 9.18 |
| WizardLM-2 8x22B | 9.12 |
| GPT-4-0314 | 8.96 |
| Mixtral-8x22B-Instruct-v0.1 | 8.66 |
| zephyr-orpo-141b-A35b-v0.1 | 8.17 |
| Matter-0.2-8x22B | 8.00 |
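For reference, MT-Bench scores like these are typically produced with FastChat's `llm_judge` pipeline; that setup is an assumption on my part, as is the model path/ID below. A rough sketch of the flow:

```python
# Sketch of the standard MT-Bench flow via FastChat's llm_judge scripts
# (git clone https://github.com/lm-sys/FastChat; OPENAI_API_KEY must be set
# for the GPT-4 judge). Model path and run name below are placeholders.
import subprocess

LLM_JUDGE_DIR = "FastChat/fastchat/llm_judge"
MODEL_PATH = "mistralai/Mixtral-8x22B-Instruct-v0.1"  # placeholder checkpoint
MODEL_ID = "mixtral-8x22b-instruct"                   # placeholder run name

# 1. Generate answers to the 80 MT-Bench questions.
subprocess.run(["python", "gen_model_answer.py",
                "--model-path", MODEL_PATH, "--model-id", MODEL_ID],
               cwd=LLM_JUDGE_DIR, check=True)

# 2. Score the answers with the GPT-4 judge.
subprocess.run(["python", "gen_judgment.py", "--model-list", MODEL_ID],
               cwd=LLM_JUDGE_DIR, check=True)

# 3. Print the aggregated MT-Bench score.
subprocess.run(["python", "show_result.py", "--model-list", MODEL_ID],
               cwd=LLM_JUDGE_DIR, check=True)
```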