---
base_model: Lambent/cosmo-upscale
tags:
- generated_from_trainer
model-index:
- name: lisa-out
  results: []
---

Tried depth-upscaling Cosmo-1b by duplicating 6 layers, then LISA-training on a dataset reasonably similar to the original one in an attempt to 'self-repair'. Not sure it worked out exactly how I pictured, but the Nous eval is not much worse overall than the original, at least. (Took, I think, about 8 hours for roughly 80 million tokens on one RTX 3090.)

Thought about doing LoRA first, but I couldn't get `peft_layers_to_transform` working in axolotl and decided to go straight to LISA. It's probably good(?) for a random selection of layers to get experience trying to work around thick doubled layers, in some kind of brain-exercise sense anyway.

Capabilities assessment vs. the original and the upscaled base:

| Model                                                                   |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|-------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[cosmo-upscale-lisa](https://huggingface.co/Lambent/cosmo-upscale-lisa)  |  22.76|  49.69|     39.49|    28.7|  35.16|
|[cosmo-1b](https://huggingface.co/HuggingFaceTB/cosmo-1b)                |  22.97|  52.01|     38.02|   28.73|  35.43|
|[cosmo-upscale](https://huggingface.co/Lambent/cosmo-upscale)            |  22.23|  48.35|     42.01|   28.36|  35.24|

I'm not sure this helped.

[Built with Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
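For concreteness, the depth-upscaling step described above (duplicating 6 of cosmo-1b's decoder layers) can be sketched roughly as below. This is an illustration only: the card does not record which layers were duplicated or what tooling was used, so the indices and the manual-surgery approach are assumptions.

```python
import copy

import torch
from transformers import AutoModelForCausalLM

# Hypothetical sketch of depth upscaling by layer duplication.
# The duplicated indices (9-14) are an assumption, not the actual recipe.
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/cosmo-1b", torch_dtype=torch.bfloat16
)

layers = model.model.layers  # Llama-style decoder blocks
duplicated = [copy.deepcopy(layers[i]) for i in range(9, 15)]  # 6 layers
new_layers = list(layers[:15]) + duplicated + list(layers[15:])

model.model.layers = torch.nn.ModuleList(new_layers)
model.config.num_hidden_layers = len(new_layers)

# Recent transformers versions track a per-layer index for KV caching;
# keep it consistent after inserting the copies.
for i, layer in enumerate(model.model.layers):
    if hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = i
```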
<details><summary>See axolotl config</summary>

axolotl version: `0.4.0`
```yaml
base_model: Lambent/cosmo-upscale
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: HuggingFaceTB/cosmopedia-100k
    type: completion
  - path: Vezora/Tested-22k-Python-Alpaca
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./lisa-out

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

adapter:
lora_model_dir:
lora_r:
lora_alpha:
lora_dropout:
lora_target_linear:
lora_fan_in_fan_out:

lisa_n_layers: 4
lisa_step_interval: 10
lisa_layers_attribute: model.layers

wandb_project: cosmouplisa
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
```

</details><br>
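The `lisa_*` settings above drive the LISA behavior: all decoder blocks are frozen except a small random subset, and that subset is re-drawn periodically, so full layers get updated a few at a time without the optimizer state of full fine-tuning. A minimal sketch of the selection step, assuming the same `model.layers` attribute named in the config (an illustration, not axolotl's actual callback):

```python
import random


def lisa_reselect_layers(model, n_layers: int = 4) -> None:
    """Freeze every decoder block, then unfreeze a random subset of them.

    Simplified illustration of the lisa_n_layers / lisa_layers_attribute
    settings above; not axolotl's actual implementation.
    """
    layers = model.model.layers  # matches lisa_layers_attribute: model.layers
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad_(False)
    for idx in random.sample(range(len(layers)), n_layers):
        for p in layers[idx].parameters():
            p.requires_grad_(True)


# During training, a re-selection like this would run every
# lisa_step_interval (here 10) optimizer steps.
```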

# lisa-out

This model is a fine-tuned version of [Lambent/cosmo-upscale](https://huggingface.co/Lambent/cosmo-upscale) on the datasets listed in the config above.
It achieves the following results on the evaluation set:
- Loss: 1.0353

## Model description

A depth-upscaled cosmo-1b (6 duplicated layers) that was then LISA-trained in an attempt to repair the upscaling; see the notes at the top of this card.

## Intended uses & limitations

More information needed

## Training and evaluation data

Trained on HuggingFaceTB/cosmopedia-100k (completion format) and Vezora/Tested-22k-Python-Alpaca (alpaca format), with a 5% validation split, as configured above.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 1

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 1.4298        | 0.0   | 1    | 1.4591          |
| 1.1229        | 0.25  | 1480 | 1.0594          |
| 1.0711        | 0.5   | 2960 | 1.0418          |
| 1.0511        | 0.75  | 4440 | 1.0353          |

### Framework versions

- Transformers 4.40.0.dev0
- Pytorch 2.1.2+cu118
- Datasets 2.18.0
- Tokenizers 0.15.0
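A minimal loading and generation sketch, assuming the published checkpoint is the `Lambent/cosmo-upscale-lisa` repo listed in the eval table above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Lambent/cosmo-upscale-lisa"  # assumed from the eval table above
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The training data is mostly completion-style, so prompt with running text.
prompt = "Photosynthesis is the process by which"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```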