Was doing some benchmarks for comparison data. Here are your GPT4All and AGIEval scores:

by teknium

AGIEval:

|             Task             |Version| Metric |Value |   |Stderr|
|------------------------------|------:|--------|-----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |0.1969|±  |0.0250|
|                              |       |acc_norm|0.1969|±  |0.0250|
|agieval_logiqa_en             |      0|acc     |0.3134|±  |0.0182|
|                              |       |acc_norm|0.3518|±  |0.0187|
|agieval_lsat_ar               |      0|acc     |0.2043|±  |0.0266|
|                              |       |acc_norm|0.1870|±  |0.0258|
|agieval_lsat_lr               |      0|acc     |0.3941|±  |0.0217|
|                              |       |acc_norm|0.3882|±  |0.0216|
|agieval_lsat_rc               |      0|acc     |0.5093|±  |0.0305|
|                              |       |acc_norm|0.4833|±  |0.0305|
|agieval_sat_en                |      0|acc     |0.6942|±  |0.0322|
|                              |       |acc_norm|0.6748|±  |0.0327|
|agieval_sat_en_without_passage|      0|acc     |0.3835|±  |0.0340|
|                              |       |acc_norm|0.3835|±  |0.0340|
|agieval_sat_math              |      0|acc     |0.3955|±  |0.0330|
|                              |       |acc_norm|0.3545|±  |0.0323|

and GPT4All:

|    Task     |Version| Metric |Value |   |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge|      0|acc     |0.5367|±  |0.0146|
|             |       |acc_norm|0.5640|±  |0.0145|
|arc_easy     |      0|acc     |0.8245|±  |0.0078|
|             |       |acc_norm|0.8051|±  |0.0081|
|boolq        |      1|acc     |0.8697|±  |0.0059|
|hellaswag    |      0|acc     |0.6273|±  |0.0048|
|             |       |acc_norm|0.8123|±  |0.0039|
|openbookqa   |      0|acc     |0.3440|±  |0.0213|
|             |       |acc_norm|0.4460|±  |0.0223|
|piqa         |      0|acc     |0.8161|±  |0.0090|
|             |       |acc_norm|0.8275|±  |0.0088|
|winogrande   |      0|acc     |0.7569|±  |0.0121|
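
For anyone wanting to reproduce tables like these: the output format matches EleutherAI's lm-evaluation-harness, so a run along the following lines should produce comparable numbers. This is a sketch, not the exact command behind the results above — the model id is a placeholder, `num_fewshot=0` is my assumption from the usual convention for these suites, and the `agieval_*` task names may require a recent harness version or the fork the poster used.

```python
# Sketch of running both suites with EleutherAI's lm-evaluation-harness
# (whose output format the tables above match). Assumes a harness version
# where simple_evaluate and the agieval_* tasks are available; the model id
# is a placeholder, not the model actually benchmarked here.
import lm_eval

GPT4ALL_TASKS = [
    "arc_challenge", "arc_easy", "boolq", "hellaswag",
    "openbookqa", "piqa", "winogrande",
]
AGIEVAL_TASKS = [
    "agieval_aqua_rat", "agieval_logiqa_en", "agieval_lsat_ar",
    "agieval_lsat_lr", "agieval_lsat_rc", "agieval_sat_en",
    "agieval_sat_en_without_passage", "agieval_sat_math",
]

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model",  # placeholder model id
    tasks=GPT4ALL_TASKS + AGIEVAL_TASKS,
    num_fewshot=0,                                # assumed 0-shot, the usual convention
    batch_size=8,
)
print(results["results"])                         # per-task acc / acc_norm / stderr
```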

Awesome! Thanks @teknium ! Is there a leaderboard with these evals anywhere?

> Awesome! Thanks @teknium ! Is there a leaderboard with these evals anywhere?

You're welcome. And, as far as I know, no. All Nous models, my own models, and I think the OpenOrca models run these benches, and in the Nous discord I have a channel where I log all the benchmark results I run, but nothing official, structured, or centralized atm

edit: gpt4all.io has a leaderboard for gpt4all scores, but they've been kind of dead for months, and I've been the only one submitting models there since like 4 months ago xD

Also, is acc or acc_norm used for calculating the average?

Just did a Google search and came up with this: https://opencompass.org.cn/dataset-detail/AGIEval

Is this not legit?

> Also, is acc or acc_norm used for calculating the average?

acc_norm
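
To make that concrete, here's a minimal sketch of the averaging using the numbers from the tables above: take acc_norm where the task reports it, and fall back to plain acc for the tasks that only report acc (boolq and winogrande here). The fallback rule is my assumption about the usual convention, not a confirmed description of the poster's exact procedure.

```python
# Sketch: suite averages from the tables above, using acc_norm where reported
# and plain acc otherwise (boolq and winogrande only report acc). The fallback
# rule is an assumption, not a confirmed procedure.
agieval = {  # task -> acc_norm
    "agieval_aqua_rat": 0.1969, "agieval_logiqa_en": 0.3518,
    "agieval_lsat_ar": 0.1870, "agieval_lsat_lr": 0.3882,
    "agieval_lsat_rc": 0.4833, "agieval_sat_en": 0.6748,
    "agieval_sat_en_without_passage": 0.3835, "agieval_sat_math": 0.3545,
}
gpt4all = {  # task -> acc_norm, or acc where acc_norm is absent
    "arc_challenge": 0.5640, "arc_easy": 0.8051, "boolq": 0.8697,
    "hellaswag": 0.8123, "openbookqa": 0.4460, "piqa": 0.8275,
    "winogrande": 0.7569,
}

def suite_average(scores: dict[str, float]) -> float:
    """Unweighted mean over the tasks in a suite."""
    return sum(scores.values()) / len(scores)

print(f"AGIEval average: {suite_average(agieval):.4f}")  # 0.3775
print(f"GPT4All average: {suite_average(gpt4all):.4f}")  # 0.7259
```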

> Just did a Google search and came up with this: https://opencompass.org.cn/dataset-detail/AGIEval
>
> Is this not legit?

I wouldn't trust anything that could have been tested in a different eval framework than lm-eval-harness (which is what HF uses, and gpt4all). The Orca paper did their own implementation of the bench, and the Vicuna score it reports is vastly different from what eval harness produces.
