openaccess-ai-collective/hippogriff-30b-chat · Will you consider releasing a public dataset?

May 31, 2023

I've noticed that from Mega to Manticore, and now to Hippogriff, you all seem to have been using the Pygmalion dataset. The open-source community has probably also realized that achieving better, more open-ended role-play doesn't require aligning with datasets like Alpaca and Vicuna, which more closely resemble GPT. Instead, we should lean towards Pygmalion.

If you would consider releasing datasets like Pygmalion and hellaswag (updated with 30K+ rows), it should encourage the open-source community to use Falcon, Guanaco, RedPajama, BLOOM, and other models to train better models based on Pygmalion.

winglian

Open Access AI Collective org May 31, 2023

Unfortunately, I'm bound by oath not to release the Pygmalion dataset. The hellaswag dataset I'm using is here: https://huggingface.co/datasets/winglian/evals/blob/main/hellaswag/hellaswag.jsonl

HDiffusion

May 31, 2023 (edited May 31, 2023)

> releasing datasets like Pygmalion

From what I've heard from one of the people involved with the project, the reason they don't release it is that it contains a lot of data that might be upsetting to some people. If you actually intend to use it for training and have trained models in the past, you can probably reach out to one of the members for a copy.

Jackdiy

Jun 1, 2023

Thank you both for your patient explanations and for sharing. I will try to contact the Pygmalion team.

winglian changed discussion status to closed Jun 12, 2023