Pretrain?
Hi, I've been using Mistral and only read the paper for the first time a few days ago; I'm sorry I came to such a great paper so late.
I have a question about Mistral-7B. I know there is very little difference between the Mistral architecture and the Llama-2 architecture, and as far as I can tell the paper describes only three techniques, all aimed at inference and memory optimization (1. sliding window attention, 2. rolling buffer cache, 3. pre-fill and chunking). There seems to be nothing about pre-training in the paper.
Therefore, I can only assume that these architectural changes for inference alone make this big difference.
(Or did Mistral AI start from the Llama-2 weights? This is only my speculation.)
I'm not asking about the pre-training dataset (I read in an earlier discussion that you can't share it); I only want to know whether there was a pre-training process and whether any additional techniques were used for it.
I apologize if this comes across as rude, and I'd appreciate it if anyone could point out anything I've gotten wrong or misunderstood.
Thank you for reading my question!
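For context on the techniques the question mentions, here is a minimal sketch of the rolling buffer cache idea. The class name, the toy window size, and the string "key/value" payloads are illustrative assumptions, not Mistral's actual implementation: the point is just that the cache holds at most a fixed window of past entries and overwrites the oldest slot once the window is full.

```python
# A minimal sketch of a rolling buffer KV cache, the second of the three
# techniques listed in the question above. The class name, the tiny window
# size, and the string "key/value" payloads are illustrative assumptions,
# not Mistral's actual implementation.

WINDOW = 4  # toy sliding-window size; Mistral-7B uses a 4096-token window


class RollingBufferCache:
    """Keep only the last `window` key/value entries, overwriting in place."""

    def __init__(self, window: int):
        self.window = window
        self.buffer = [None] * window  # fixed-size storage, never grows
        self.pos = 0                   # total tokens seen so far

    def append(self, kv) -> None:
        # The entry for sequence position i lives in slot i % window,
        # so the oldest entry is overwritten once the buffer is full.
        self.buffer[self.pos % self.window] = kv
        self.pos += 1

    def contents(self) -> list:
        # Return the cached entries in sequence order (oldest first).
        if self.pos < self.window:
            return self.buffer[: self.pos]
        start = self.pos % self.window
        return self.buffer[start:] + self.buffer[:start]


cache = RollingBufferCache(WINDOW)
for i in range(7):
    cache.append(f"kv_{i}")

print(cache.contents())  # ['kv_3', 'kv_4', 'kv_5', 'kv_6']
```

In the real model the entries are per-layer attention key/value tensors rather than strings, but the indexing trick (the entry for position i stored at slot i mod window) is the mechanism the paper describes for bounding cache memory.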
I'm pretty sure they trained the whole thing from scratch, although fine-tuning a juiced-up Llama-2 instead of training from the beginning is an interesting idea to explore, I'll give you that.
That's apparently how Miqu (a close relative of Mistral Medium) was created.
Why is Miqu proof of that...?