Base Model or Finetuned Version?

#2
by jphme - opened

It's not really clear from your description whether this is the extracted base model or whether you have already done finetuning on top of it.
If the latter, which data and prompt format did you use?

I'd be interested in just the extracted base model without any additional finetuning.

@jphme
You're good! It's a fine-tune. I will release the base model along with the v.2 LoRA, so anyone who would like to fine-tune it with LoRA, either from my checkpoint or from scratch, can. My Wi-Fi bandwidth can only go so far, and I haven't slept since Mixtral 22B dropped 😅 Also, the safetensors files are almost done uploading; I would say about 15 minutes.

Nice catch @jphme
I'd love to play around with the base weights too 😉
@Vezora are you doing the computations locally, so that you are limited by hotel WLAN?
If you need a dedicated vserver with higher bandwidth and GPU, we can talk about some sponsoring.

@flozi00 Yeah, all computation was done locally; my room's a bit toasty right now 😂. Here is my Twitter, I just followed you (from your HF profile) so that we can DM: https://twitter.com/mejia_petit . I would love to talk more about this! (Preferably tomorrow, I haven't slept since Mixtral 22B dropped.)

Please drop a quantized version.

@Winmodel I'd love to, but I'm currently training. You can easily quantize it with BNB by loading with load_in_4bit and then saving with save_pretrained to a directory.
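For anyone who wants to try that, here is a minimal sketch of the bitsandbytes route using transformers. The function name, the `nf4` quant type, and the bfloat16 compute dtype are assumptions chosen for illustration, not settings confirmed in this thread. Saving 4-bit weights also assumes reasonably recent transformers and bitsandbytes releases, plus a CUDA GPU.

```python
def quantize_and_save(repo_id: str, out_dir: str) -> None:
    """Load a causal LM in 4-bit with bitsandbytes and save it to out_dir.

    Heavy imports live inside the function so the module can be imported
    on a machine without torch/transformers installed.
    """
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    # Assumed settings: NF4 quantization with bfloat16 compute.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # Weights are quantized on the fly while loading.
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(repo_id)

    # Write the quantized weights and tokenizer files to the directory.
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
```

Call it with the repo id of this model and any local output directory, for example `quantize_and_save("<user>/<model>", "./model-bnb-4bit")`.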

@jphme @flozi00 I have untrained versions of each raw extracted expert as dense Mistral 22B models.

https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-0

The other experts are also on my profile.

There is also this one, which is a linear merge of all experts into one model: https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-lerp
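To illustrate what a linear merge means here, a minimal sketch follows. It assumes equal coefficients across experts and uses plain Python lists in place of real tensors; the function and parameter names are hypothetical, and the actual merge code for the linked model has not been shared in this thread.

```python
def lerp_merge(expert_weights, coefficients=None):
    """Linearly combine per-expert parameters into one set of weights.

    expert_weights: list of dicts mapping parameter name -> list of floats.
    coefficients: one weight per expert; defaults to a plain average.
    """
    n = len(expert_weights)
    if n == 0:
        raise ValueError("need at least one expert")
    if coefficients is None:
        coefficients = [1.0 / n] * n  # assumed: every expert counts equally
    if len(coefficients) != n:
        raise ValueError("one coefficient per expert is required")

    merged = {}
    for name, first in expert_weights[0].items():
        # Weighted sum of the same parameter position across all experts.
        merged[name] = [
            sum(c * w[name][i] for c, w in zip(coefficients, expert_weights))
            for i in range(len(first))
        ]
    return merged
```

With two toy experts, `lerp_merge([{"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}])` averages each position of `w`.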

They all output gibberish for the most part; expert 2 seems to be the most coherent in my limited tests. Expert 0 has the lowest perplexity on wikitext, but I wasn't able to generate coherent text with it.
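As a reminder of what the perplexity number measures, here is a minimal sketch: it is the exponential of the average negative log-likelihood per token. The input is assumed to be a list of natural-log token probabilities already obtained from a model; this is not the evaluation code used for the numbers above.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```

A model can score a low average loss on a corpus and still produce incoherent samples, since perplexity is computed with the reference text as context at every step while generation conditions on the model's own output.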

I'll be sharing code and evals in the next few hours.

Would love to know the prompt format as well.
Thank you.

@Winmodel Working on getting an AWQ quant of this; debugging a few errors.

Those were some of my findings as well. I also found expert 2 to be the only one to write consistent English words. Granted, it was completely unrelated to what I asked, but it was at least an attempt. Every other expert truly did become an expert in language, as shown here: https://huggingface.co/blog/moe#what-does-an-expert-learn . Some had finicky spacing and symbols, and some were just mangled nonsense.

The prompt format is Alpaca! V2 is almost done, and it's also Alpaca, but in multi-turn raw format. (So same thing for you, just more work prepping the dataset for me.)
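For reference, a minimal sketch of building a single-turn Alpaca prompt. It assumes the widely used Stanford Alpaca template wording; the exact template used for this fine-tune is not spelled out in the thread, and the multi-turn raw variant planned for V2 is not covered here. The helper name is hypothetical.

```python
# Assumed: the standard Stanford Alpaca templates.
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input_text}\n\n"
    "### Response:\n"
)


def build_alpaca_prompt(instruction: str, input_text: str = "") -> str:
    """Return an Alpaca-style prompt; the model continues after '### Response:'."""
    if input_text:
        return ALPACA_WITH_INPUT.format(
            instruction=instruction, input_text=input_text
        )
    return ALPACA_NO_INPUT.format(instruction=instruction)
```

Pass the returned string to the tokenizer as-is and let the model generate the text that follows `### Response:`.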

Great! I look forward to V2 :)

Thank you!! V2 is essentially a test to see whether using all experts equally is the best thing to do, or just using a single one. By increasing the data size by 8x, I will easily be able to verify the knowledge of the model. There are still other methods I have yet to try, so I'm not done. I'm going to keep going until I get a 22B that outperforms Mistral 7B, as expected of a 22B model.

@Winmodel I am uploading some GGUF quants (with importance matrix) here: https://huggingface.co/qwp4w3hyb/Mistral-22B-v0.1-iMat-GGUF
