---
license: apache-2.0
base_model:
  - Qwen/Qwen2.5-14B-Instruct-1M
---

Fine-tuned/hyperfitted with the methodology from https://arxiv.org/abs/2412.04318, using the OrthoGrad optimizer (https://arxiv.org/abs/2501.04697).
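The core idea behind OrthoGrad is to project the gradient onto the component orthogonal to the current weight vector before applying the update. A minimal sketch of that projection (a simplification; the rescaling and the wrapping of an inner optimizer may differ from the reference implementation):

```python
import math

def orthograd_step(w, g, lr=1e-3, eps=1e-30):
    """One SGD step with the gradient projected orthogonal to the weights.

    Sketch of the OrthoGrad idea (https://arxiv.org/abs/2501.04697):
    drop the component of g parallel to w, rescale back to the raw
    gradient's norm, then take a plain gradient step.
    """
    dot_wg = sum(wi * gi for wi, gi in zip(w, g))
    dot_ww = sum(wi * wi for wi in w)
    coef = dot_wg / (dot_ww + eps)
    # component of g orthogonal to w
    g_orth = [gi - coef * wi for wi, gi in zip(w, g)]
    # rescale so the step size matches the unprojected gradient
    scale = math.sqrt(sum(gi * gi for gi in g)) / (
        math.sqrt(sum(go * go for go in g_orth)) + eps)
    g_orth = [go * scale for go in g_orth]
    return [wi - lr * go for wi, go in zip(w, g_orth)]
```

In practice this projection wraps a standard optimizer per parameter tensor; the sketch shows only the per-step math.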

Updated 23.02.2025: same dataset, retrained with 512-token sequences and a 64-token sliding window (loss still decreased).
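A sketch of the chunking described above, assuming "64 tokens sliding window" means the 512-token window advances 64 tokens at a time (so consecutive training sequences overlap by 448 tokens):

```python
def sliding_window_chunks(token_ids, seq_len=512, stride=64):
    """Split a token stream into overlapping fixed-length sequences.

    Hypothetical helper illustrating the card's setup: 512-token
    sequences, window advancing `stride` tokens per chunk.
    """
    chunks = []
    for start in range(0, len(token_ids) - seq_len + 1, stride):
        chunks.append(token_ids[start:start + seq_len])
    return chunks

# e.g. a 1000-token stream yields chunks starting at 0, 64, 128, ...
chunks = sliding_window_chunks(list(range(1000)))
```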