Pretrain?
Hi, I've been using Mistral and only read the paper for the first time a few days ago; I'm sorry I came to such a great paper so late.
I have a question about Mistral-7B. I know there is very little difference between the Mistral architecture and the Llama-2 architecture, and as far as I can tell the paper describes only three techniques, all aimed at inference and memory optimization (1. sliding window attention, 2. rolling buffer cache, 3. pre-fill and chunking). There seems to be nothing about pre-training in the paper.
Therefore, I can only assume that these architectural changes for inference alone make this big difference.
(Or did Mistral AI start from the Llama-2 weights? This is only my speculation.)
I'm not asking about the pre-training dataset (I read in an earlier discussion that you can't share it); I only want to know whether there was a pre-training process and whether any additional techniques were used for it.
I apologize if this comes across as rude, and I'd appreciate it if anyone could point out anything I've gotten wrong or misunderstood.
Thank you for reading my question!
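For context on the techniques the question mentions, here is a minimal sketch of the rolling buffer cache idea. The class name, the toy window size, and the string "key/value" payloads are illustrative assumptions, not Mistral's actual implementation: the point is just that the cache holds at most a fixed window of past entries and overwrites the oldest slot once the window is full.

```python
# A minimal sketch of a rolling buffer KV cache, the second of the three
# techniques listed in the question above. The class name, the tiny window
# size, and the string "key/value" payloads are illustrative assumptions,
# not Mistral's actual implementation.

WINDOW = 4  # toy sliding-window size; Mistral-7B uses a 4096-token window


class RollingBufferCache:
    """Keep only the last `window` key/value entries, overwriting in place."""

    def __init__(self, window: int):
        self.window = window
        self.buffer = [None] * window  # fixed-size storage, never grows
        self.pos = 0                   # total tokens seen so far

    def append(self, kv) -> None:
        # The entry for sequence position i lives in slot i % window,
        # so the oldest entry is overwritten once the buffer is full.
        self.buffer[self.pos % self.window] = kv
        self.pos += 1

    def contents(self) -> list:
        # Return the cached entries in sequence order (oldest first).
        if self.pos < self.window:
            return self.buffer[: self.pos]
        start = self.pos % self.window
        return self.buffer[start:] + self.buffer[:start]


cache = RollingBufferCache(WINDOW)
for i in range(7):
    cache.append(f"kv_{i}")

print(cache.contents())  # ['kv_3', 'kv_4', 'kv_5', 'kv_6']
```

In the real model the entries are per-layer attention key/value tensors rather than strings, but the indexing trick (the entry for position i stored at slot i mod window) is the mechanism the paper describes for bounding cache memory.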
I'm pretty sure they trained the whole thing from scratch, although fine-tuning a juiced-up Llama-2 instead of training from the beginning is an interesting idea to explore, I'll give you that.
That's apparently how Miqu (a close relative of Mistral Medium) was created.
Why is Miqu proof of that...?