---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-14B-Instruct-1M
---

Fine-tuned/hyperfitted with the methodology from https://arxiv.org/abs/2412.04318, using the OrthoGrad optimizer from https://arxiv.org/abs/2501.04697.

Updated 23.02.2025: same dataset, now trained on 512-token sequences with a 64-token sliding window (loss still decreased). Significant HellaSwag drop (~22%).
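
For reference, a minimal sketch of the gradient orthogonalization OrthoGrad (arXiv:2501.04697) applies before the base optimizer's update, as I understand it: each parameter's gradient is replaced by its component orthogonal to the parameter itself, rescaled back to the original gradient norm. The function name and `eps` constant are my own; this is not the paper's reference implementation.

```python
import torch


def orthogonalize_gradients(model: torch.nn.Module, eps: float = 1e-30) -> None:
    """Replace each parameter's gradient with its component orthogonal
    to the parameter, rescaled to the raw gradient's norm."""
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            w, g = p.reshape(-1), p.grad.reshape(-1)
            # g_perp = g - (<w, g> / <w, w>) * w
            g_perp = g - (torch.dot(w, g) / (torch.dot(w, w) + eps)) * w
            # Keep the step magnitude comparable to the raw gradient.
            g_perp *= g.norm() / (g_perp.norm() + eps)
            p.grad.copy_(g_perp.reshape_as(p.grad))


# Usage inside an otherwise standard training loop:
#   loss.backward()
#   orthogonalize_gradients(model)
#   optimizer.step()   # e.g. AdamW as the base optimizer
```

And a sketch of the chunking described in the update, assuming "64-token sliding window" means each 512-token window advances by 64 tokens (so consecutive windows overlap by 448 tokens):

```python
def sliding_windows(token_ids: list[int], seq_len: int = 512, stride: int = 64):
    """Yield overlapping fixed-length training sequences from a token stream.
    Sequences shorter than seq_len at the tail are dropped."""
    for start in range(0, len(token_ids) - seq_len + 1, stride):
        yield token_ids[start:start + seq_len]
```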