Unable to finetune using axolotl

#1
by henriklied - opened

I tried fine-tuning the model using axolotl, but it seems like your `special_tokens_map.json` has some issues and looks quite different from the main Mistral one as well as NB's.

I ended up copying the base Mistral `special_tokens_map.json` and setting `tokenizer_type: PreTrainedTokenizerFast` per your `tokenizer_config.json`, but that just crashed with `TypeError: MistralForCausalLM.forward() got an unexpected keyword argument 'token_type_ids'`.
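
For context, the crash comes from the tokenizer emitting `token_type_ids`, which `MistralForCausalLM.forward()` does not accept. A minimal sketch of the same situation outside axolotl, with a placeholder model path and a manual workaround of dropping the key (illustrative only):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path/repo id for the model being fine-tuned.
model_id = "path/to/normistral-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

batch = tokenizer("Hei, verden!", return_tensors="pt")

# If the tokenizer config lists token_type_ids under model_input_names, the
# encoded batch will include it, but Mistral's forward() has no such argument,
# so drop the key before calling the model.
batch.pop("token_type_ids", None)

outputs = model(**batch)
```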

I ran into the same issues on the -warm model.

Any tips would be greatly appreciated.

Norwegian Large Language Models org

Hi, our models use a Norwegian tokenizer, which is different from the mostly English tokenizer of Mistral. But the `special_tokens_map.json` file is in the standard Hugging Face format.

It would be really helpful if you could send us a minimal reproducible example of this bug or at least a detailed error trace.

Hi again,

Sorry, I meant the `tokenizer_config.json` file, not the `special_tokens_map.json`. It seems like axolotl was unable to process it.
I copied Mistral's `tokenizer_config.json` and changed `tokenizer_class` to `PreTrainedTokenizerFast`, purged all my dataset caches, and now I'm able to fine-tune the model. ☺️
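
For anyone else running into this, a quick sanity check of the patched config before launching a run (a sketch, assuming a local copy of the model with the edited `tokenizer_config.json`; the path is a placeholder):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Placeholder path to a local copy of the model with the edited config.
tokenizer = AutoTokenizer.from_pretrained("./normistral-local")

# With tokenizer_class set to PreTrainedTokenizerFast, this should hold.
assert isinstance(tokenizer, PreTrainedTokenizerFast)

# If token_type_ids appears here, it gets passed through to the model and
# MistralForCausalLM.forward() raises again; ideally this lists only
# input_ids and attention_mask.
print(tokenizer.model_input_names)
print(tokenizer.special_tokens_map)
```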

I'm using the Alpaca format for fine-tuning right now. The default system prompt for Alpaca fine-tuning is "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request." Should I use this, or would you recommend swapping it out for something else when instruction fine-tuning? The rest of my instruction, user, and system messages are all in Norwegian.
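
For reference, the Alpaca format expands each example to roughly the layout below; the helper is only an illustration of the structure (not axolotl's own prompter code), and swapping the system prompt just means replacing the first string:

```python
# Default English system prompt quoted above; swap this string for a
# Norwegian one to change the system prompt without touching the layout.
SYSTEM_PROMPT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request."
)

# Standard Alpaca layout for examples that have an input field.
ALPACA_TEMPLATE = """{system}

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""


def build_example(instruction: str, input_text: str, output: str) -> str:
    """Render one training example in the Alpaca format (illustrative helper)."""
    return ALPACA_TEMPLATE.format(
        system=SYSTEM_PROMPT,
        instruction=instruction,
        input=input_text,
        output=output,
    )
```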

Norwegian Large Language Models org

Hi,

That's great news! Thanks for finding a way around this issue. However, note that the tokenizers are completely different (SentencePiece vs. HF tokenizers), so copying the full config from Mistral might introduce some unwanted artifacts. Do you by any chance remember which item(s) axolotl was missing from the config file? Adding only those items would be a safer workaround, and I would also fix it directly in this repository.
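
If it helps to narrow it down, one quick way to see which keys the Mistral-derived file added or changed relative to the original is a small diff script (a sketch; the file names are placeholders):

```python
import json

# Placeholder file names: the config shipped with this repo and the
# Mistral-derived one that made axolotl happy.
with open("tokenizer_config.original.json") as f:
    original = json.load(f)
with open("tokenizer_config.patched.json") as f:
    patched = json.load(f)

# Keys present only in the patched file are the likely candidates for what
# axolotl was missing; changed values may explain other artifacts.
added = sorted(set(patched) - set(original))
changed = {
    key: (original[key], patched[key])
    for key in original.keys() & patched.keys()
    if original[key] != patched[key]
}

print("added keys:", added)
print("changed keys:", changed)
```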

A Norwegian system prompt will probably work better for these models :)

The error message was quite cryptic, with `sentencepiece/__init__.py`, line 310, in `LoadFromFile` spitting out `TypeError: not a string`.
The fine-tuning ended up with an eval loss of 1.077 after 3 epochs, trained on 120k articles where the output was title generation. It feels like it performs better than my earlier tests with GPT-J, so that's fun. I'll do another run with a Norwegian system prompt as well. :-)
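
(For context, that trace typically appears when something tries to build the slow, SentencePiece-based tokenizer but no `tokenizer.model` file is available, so a `None` path reaches `LoadFromFile`. A tiny sketch of what likely happened under the hood, purely illustrative and not axolotl's actual code path:)

```python
import sentencepiece as spm

# When the vocab file path resolves to None (e.g. the repo ships only a fast
# tokenizer.json and no tokenizer.model), loading fails with the same error.
sp = spm.SentencePieceProcessor()
sp.Load(None)  # TypeError: not a string
```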

Want to share your `tokenizer_config.json` setup, @henriklied?
