Was doing some benchmarks for comparison data. Here are your GPT4All and AGIEval scores:
AGIEval:
| Task |Version| Metric |Value | |Stderr|
|------------------------------|------:|--------|-----:|---|-----:|
|agieval_aqua_rat | 0|acc |0.1969|± |0.0250|
| | |acc_norm|0.1969|± |0.0250|
|agieval_logiqa_en | 0|acc |0.3134|± |0.0182|
| | |acc_norm|0.3518|± |0.0187|
|agieval_lsat_ar | 0|acc |0.2043|± |0.0266|
| | |acc_norm|0.1870|± |0.0258|
|agieval_lsat_lr | 0|acc |0.3941|± |0.0217|
| | |acc_norm|0.3882|± |0.0216|
|agieval_lsat_rc | 0|acc |0.5093|± |0.0305|
| | |acc_norm|0.4833|± |0.0305|
|agieval_sat_en | 0|acc |0.6942|± |0.0322|
| | |acc_norm|0.6748|± |0.0327|
|agieval_sat_en_without_passage| 0|acc |0.3835|± |0.0340|
| | |acc_norm|0.3835|± |0.0340|
|agieval_sat_math | 0|acc |0.3955|± |0.0330|
| | |acc_norm|0.3545|± |0.0323|
and GPT4All:
| Task |Version| Metric |Value | |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge| 0|acc |0.5367|± |0.0146|
| | |acc_norm|0.5640|± |0.0145|
|arc_easy | 0|acc |0.8245|± |0.0078|
| | |acc_norm|0.8051|± |0.0081|
|boolq | 1|acc |0.8697|± |0.0059|
|hellaswag | 0|acc |0.6273|± |0.0048|
| | |acc_norm|0.8123|± |0.0039|
|openbookqa | 0|acc |0.3440|± |0.0213|
| | |acc_norm|0.4460|± |0.0223|
|piqa | 0|acc |0.8161|± |0.0090|
| | |acc_norm|0.8275|± |0.0088|
|winogrande | 0|acc |0.7569|± |0.0121|
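If anyone wants to rerun these, here's a rough sketch of how it could be done through lm-eval-harness's Python API (assuming a 2023-era v0.3-style build that has the agieval_* tasks registered; the model id and batch size below are just placeholders):

```python
# Rough sketch, not a drop-in script: assumes a v0.3-era lm-eval-harness build
# with the agieval_* tasks available. Model id and batch size are placeholders.
import json
from lm_eval import evaluator

agieval_tasks = [
    "agieval_aqua_rat", "agieval_logiqa_en", "agieval_lsat_ar",
    "agieval_lsat_lr", "agieval_lsat_rc", "agieval_sat_en",
    "agieval_sat_en_without_passage", "agieval_sat_math",
]
gpt4all_tasks = [
    "arc_challenge", "arc_easy", "boolq", "hellaswag",
    "openbookqa", "piqa", "winogrande",
]

results = evaluator.simple_evaluate(
    model="hf-causal",                            # HuggingFace causal-LM backend
    model_args="pretrained=your-org/your-model",  # placeholder model id
    tasks=agieval_tasks + gpt4all_tasks,
    num_fewshot=0,                                # everything above is 0-shot
    batch_size=4,
)

# results["results"] maps each task to its acc / acc_norm / stderr values,
# which is what the tables above were generated from.
print(json.dumps(results["results"], indent=2))
```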
Awesome! Thanks @teknium ! Is there a leaderboard with these evals anywhere?
You're welcome. And, as far as I know, no. All Nous models, my own models, and I think the OpenOrca models run these benchmarks, and in Nous' Discord I have a channel where I log all the benchmark results I do, but there's nothing official, structured, or centralized atm
edit: gpt4all.io has a leaderboard for GPT4All scores, but it's been kind of dead for months, and I've been the only one submitting models there for like the last 4 months xD
> Also, is acc or acc_norm used for calculating the average?
acc_norm
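To be concrete, the suite average is just the unweighted mean of the acc_norm column (assumption: tasks that only report acc, like boolq and winogrande, fall back to acc rather than being dropped). Quick sketch with the numbers from the tables above:

```python
# Quick sketch: unweighted mean per suite, using the values from the tables above.
# Assumption: boolq and winogrande (acc only) fall back to acc instead of being excluded.
agieval_acc_norm = {
    "agieval_aqua_rat": 0.1969,
    "agieval_logiqa_en": 0.3518,
    "agieval_lsat_ar": 0.1870,
    "agieval_lsat_lr": 0.3882,
    "agieval_lsat_rc": 0.4833,
    "agieval_sat_en": 0.6748,
    "agieval_sat_en_without_passage": 0.3835,
    "agieval_sat_math": 0.3545,
}
gpt4all_scores = {
    "arc_challenge": 0.5640,
    "arc_easy": 0.8051,
    "boolq": 0.8697,       # acc only
    "hellaswag": 0.8123,
    "openbookqa": 0.4460,
    "piqa": 0.8275,
    "winogrande": 0.7569,  # acc only
}

def average(scores: dict) -> float:
    return sum(scores.values()) / len(scores)

print(f"AGIEval average: {average(agieval_acc_norm):.4f}")  # ~0.3775
print(f"GPT4All average: {average(gpt4all_scores):.4f}")    # ~0.7259
```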
> Just did a Google search and came up with this: https://opencompass.org.cn/dataset-detail/AGIEval
> Is this not legit?
I wouldn't trust anything that could have been tested in a different eval framework than lm-eval-harness (which is what HF and GPT4All use - the Orca paper did their own implementation of the bench, and the Vicuna score they report is vastly different from what eval harness produces)