---
base_model:
  - elinas/Llama-3-15B-Instruct-zeroed
library_name: transformers
tags:
  - mergekit
  - merge
datasets:
  - Chat-Error/Pure-dove-sharegpt
license: llama3
---

# Llama-3-15B-Instruct-zeroed-ft

This is a QLoRA finetune of a merge of pre-trained language models created using mergekit.

The model is based on a "zeroed" passthrough merge, elinas/Llama-3-15B-Instruct-zeroed.

This was primarily an experiment to see how a passthrough merge would respond to further finetuning, albeit on a small dataset.

The model was finetuned at a context length of 8192 and is likely reliable with RoPE scaling up to 32k.
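As a rough sketch of the arithmetic (the card does not specify which scaling method to use at inference time), a linear RoPE scaling factor for reaching 32k from the trained 8192 context would be:

```python
# Linear RoPE scaling factor to stretch the trained context window
# to 32k. Illustrative arithmetic only -- not a claim about which
# scaling method (linear, dynamic NTK, etc.) works best for this model.
trained_context = 8192
target_context = 32_768  # "32k"

scaling_factor = target_context / trained_context
print(scaling_factor)  # 4.0
```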

Further finetuning of this model, or finetuning the base model on more samples, is encouraged.

## Datasets

A small, high-quality dataset (Chat-Error/Pure-dove-sharegpt) was used as a proof of concept / validation for stabilizing the model after finetuning.

## Finetuning details

This is a QLoRA model, and the following modules were targeted:

```yaml
lora_target_modules:
  - down_proj
  - o_proj
```
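To illustrate what targeting these modules means, here is a minimal sketch of the name-suffix matching that PEFT-style LoRA configs use to decide which linear layers receive adapters. The module paths below are illustrative of a Llama-style decoder layer; this is not the training code.

```python
# Only down_proj and o_proj were targeted in this run.
lora_target_modules = ["down_proj", "o_proj"]

# Example module paths from a Llama-style decoder layer (illustrative).
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.mlp.up_proj",
]

def is_lora_target(name, targets):
    """Return True if the module name ends with one of the target suffixes."""
    return any(name.endswith(t) for t in targets)

targeted = [n for n in module_names if is_lora_target(n, lora_target_modules)]
print(targeted)
# ['model.layers.0.self_attn.o_proj', 'model.layers.0.mlp.down_proj']
```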

The model is coherent even with the "zeroed" layers included in training, and it can write well. In the next experiment, all layers will be finetuned, as this was the recommendation from Charles Goddard. Thank you for sharing the merging method, and thanks to Toasty Pigeon for bringing it to my attention!

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 3
- total_train_batch_size: 6
- total_eval_batch_size: 6
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 25
- num_epochs: 1
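The total batch size above follows directly from the per-device batch size and the GPU count (gradient accumulation is assumed to be 1, since no accumulation steps are listed):

```python
# Effective (total) batch size from the hyperparameters above.
train_batch_size = 2   # per-device micro batch
num_devices = 3        # 3x RTX 3090
grad_accum_steps = 1   # assumption: not listed, so taken as 1

total_train_batch_size = train_batch_size * num_devices * grad_accum_steps
print(total_train_batch_size)  # 6, matching the reported total
```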

The paged_adamw_8bit optimizer and DeepSpeed ZeRO 3 were used at a learning rate of 1e-5 with the cosine scheduler for 1 epoch on 3x RTX 3090s, taking 2h 30m total.
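For reference, this setup would map onto roughly the following Axolotl options. This is a hypothetical fragment reconstructed from the details above; the actual training config was not published, so key names and values should be checked against the Axolotl documentation.

```yaml
# Hypothetical Axolotl config fragment -- reconstructed, not the original
optimizer: paged_adamw_8bit
learning_rate: 1e-5
lr_scheduler: cosine
warmup_steps: 25
num_epochs: 1
micro_batch_size: 2
sample_packing: false
deepspeed: deepspeed_configs/zero3.json
```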

Sample packing and padding were disabled, which significantly reduces VRAM consumption at the cost of speed.

## W&B Run Summary

```
wandb: Run summary:
wandb:                eval/loss 0.94497
wandb:             eval/runtime 276.2864
wandb:  eval/samples_per_second 1.397
wandb:    eval/steps_per_second 0.235
wandb:               total_flos 12246605365248.0
wandb:              train/epoch 1.0
wandb:        train/global_step 579
wandb:          train/grad_norm 0.80411
wandb:      train/learning_rate 0.0
wandb:               train/loss 1.085
wandb:               train_loss 0.8834
wandb:            train_runtime 9893.1688
wandb: train_samples_per_second 0.351
wandb:   train_steps_per_second 0.059
```
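The summary numbers are internally consistent; as a quick sanity check (all values taken from the run summary above), the samples implied by the step count roughly match the samples implied by the reported throughput:

```python
# Sanity-check the W&B run summary: samples processed per the step
# count should roughly match samples implied by reported throughput.
global_step = 579
total_train_batch_size = 6
train_runtime_s = 9893.1688
train_samples_per_second = 0.351

samples_from_steps = global_step * total_train_batch_size       # 3474
samples_from_throughput = train_samples_per_second * train_runtime_s

print(samples_from_steps, round(samples_from_throughput))
```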

## Framework versions

- PEFT 0.10.0
- Transformers 4.40.0.dev0
- Pytorch 2.3.0+cu121
- Datasets 2.15.0
- Tokenizers 0.15.0

## Model Evaluation

TBD

If you have any questions or comments about the model, feel free to open a discussion in the Community tab.

Built with Axolotl