Was doing some benchmarks for comparison data. Here are your GPT4All and AGIEval scores:

by teknium

AGIEval:

|             Task             |Version| Metric |Value |   |Stderr|
|------------------------------|------:|--------|-----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |0.1969|±  |0.0250|
|                              |       |acc_norm|0.1969|±  |0.0250|
|agieval_logiqa_en             |      0|acc     |0.3134|±  |0.0182|
|                              |       |acc_norm|0.3518|±  |0.0187|
|agieval_lsat_ar               |      0|acc     |0.2043|±  |0.0266|
|                              |       |acc_norm|0.1870|±  |0.0258|
|agieval_lsat_lr               |      0|acc     |0.3941|±  |0.0217|
|                              |       |acc_norm|0.3882|±  |0.0216|
|agieval_lsat_rc               |      0|acc     |0.5093|±  |0.0305|
|                              |       |acc_norm|0.4833|±  |0.0305|
|agieval_sat_en                |      0|acc     |0.6942|±  |0.0322|
|                              |       |acc_norm|0.6748|±  |0.0327|
|agieval_sat_en_without_passage|      0|acc     |0.3835|±  |0.0340|
|                              |       |acc_norm|0.3835|±  |0.0340|
|agieval_sat_math              |      0|acc     |0.3955|±  |0.0330|
|                              |       |acc_norm|0.3545|±  |0.0323|

and GPT4All:

|    Task     |Version| Metric |Value |   |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge|      0|acc     |0.5367|±  |0.0146|
|             |       |acc_norm|0.5640|±  |0.0145|
|arc_easy     |      0|acc     |0.8245|±  |0.0078|
|             |       |acc_norm|0.8051|±  |0.0081|
|boolq        |      1|acc     |0.8697|±  |0.0059|
|hellaswag    |      0|acc     |0.6273|±  |0.0048|
|             |       |acc_norm|0.8123|±  |0.0039|
|openbookqa   |      0|acc     |0.3440|±  |0.0213|
|             |       |acc_norm|0.4460|±  |0.0223|
|piqa         |      0|acc     |0.8161|±  |0.0090|
|             |       |acc_norm|0.8275|±  |0.0088|
|winogrande   |      0|acc     |0.7569|±  |0.0121|
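
For anyone wanting to reproduce tables like these: the output format matches EleutherAI's lm-evaluation-harness, so a run along the following lines should produce comparable numbers. This is a sketch, not the exact command behind the results above — the model id is a placeholder, `num_fewshot=0` is my assumption from the usual convention for these suites, and the `agieval_*` task names may require a recent harness version or the fork the poster used.

```python
# Sketch of running both suites with EleutherAI's lm-evaluation-harness
# (whose output format the tables above match). Assumes a harness version
# where simple_evaluate and the agieval_* tasks are available; the model id
# is a placeholder, not the model actually benchmarked here.
import lm_eval

GPT4ALL_TASKS = [
    "arc_challenge", "arc_easy", "boolq", "hellaswag",
    "openbookqa", "piqa", "winogrande",
]
AGIEVAL_TASKS = [
    "agieval_aqua_rat", "agieval_logiqa_en", "agieval_lsat_ar",
    "agieval_lsat_lr", "agieval_lsat_rc", "agieval_sat_en",
    "agieval_sat_en_without_passage", "agieval_sat_math",
]

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model",  # placeholder model id
    tasks=GPT4ALL_TASKS + AGIEVAL_TASKS,
    num_fewshot=0,                                # assumed 0-shot, the usual convention
    batch_size=8,
)
print(results["results"])                         # per-task acc / acc_norm / stderr
```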

Awesome! Thanks @teknium ! Is there a leaderboard with these evals anywhere?

> Awesome! Thanks @teknium ! Is there a leaderboard with these evals anywhere?

You're welcome. And, as far as I know, no. All Nous models, my own models, and I think the OpenOrca models run these benches, and in the Nous discord I have a channel where I log all the benchmark results I run, but nothing official, structured, or centralized atm

edit: gpt4all.io has a leaderboard for gpt4all scores, but they've been kind of dead for months, and I've been the only one submitting models there since like 4 months ago xD

Also, is acc or acc_norm used for calculating the average?

Just did a Google search and came up with this: https://opencompass.org.cn/dataset-detail/AGIEval

Is this not legit?

> Also, is acc or acc_norm used for calculating the average?

acc_norm
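
To make that concrete, here's a minimal sketch of the averaging using the numbers from the tables above: take acc_norm where the task reports it, and fall back to plain acc for the tasks that only report acc (boolq and winogrande here). The fallback rule is my assumption about the usual convention, not a confirmed description of the poster's exact procedure.

```python
# Sketch: suite averages from the tables above, using acc_norm where reported
# and plain acc otherwise (boolq and winogrande only report acc). The fallback
# rule is an assumption, not a confirmed procedure.
agieval = {  # task -> acc_norm
    "agieval_aqua_rat": 0.1969, "agieval_logiqa_en": 0.3518,
    "agieval_lsat_ar": 0.1870, "agieval_lsat_lr": 0.3882,
    "agieval_lsat_rc": 0.4833, "agieval_sat_en": 0.6748,
    "agieval_sat_en_without_passage": 0.3835, "agieval_sat_math": 0.3545,
}
gpt4all = {  # task -> acc_norm, or acc where acc_norm is absent
    "arc_challenge": 0.5640, "arc_easy": 0.8051, "boolq": 0.8697,
    "hellaswag": 0.8123, "openbookqa": 0.4460, "piqa": 0.8275,
    "winogrande": 0.7569,
}

def suite_average(scores: dict[str, float]) -> float:
    """Unweighted mean over the tasks in a suite."""
    return sum(scores.values()) / len(scores)

print(f"AGIEval average: {suite_average(agieval):.4f}")  # 0.3775
print(f"GPT4All average: {suite_average(gpt4all):.4f}")  # 0.7259
```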

> Just did a Google search and came up with this: https://opencompass.org.cn/dataset-detail/AGIEval
>
> Is this not legit?

I wouldn't trust anything that could have been tested in a different eval framework than lm-eval-harness (which is what HF uses, and gpt4all). The Orca paper did their own implementation of the bench, and the Vicuna score it reports is vastly different from what eval harness produces.
