i wanted to learn more about exposure bias mitigation in language models and came across ReMask. it's a neat idea, and i wanted to give it a go.

  • during training, the model processes each input sequence twice: once as the full sequence and once as a masked copy.
  • model outputs are computed for both passes.
  • the divergence loss is the average of the forward and backward KL divergences between the two output distributions.
  • the final loss is a weighted sum of the cross-entropy losses from both passes and the divergence loss.
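
a rough sketch of how the loss from the steps above could be put together, in plain python so the arithmetic is easy to follow. this is not the actual implementation: the function name `remask_style_loss`, the weight `alpha`, the equal weighting of the two cross-entropy terms, and the averaging over token positions are all assumptions made for illustration. in a real training loop the same thing would be done with tensor ops on the model's logits.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl(log_p, log_q):
    """KL(p || q) for one position, given log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def remask_style_loss(full_logits, masked_logits, targets, alpha=1.0):
    """Combine cross-entropy and symmetric KL for the two forward passes.

    full_logits, masked_logits: [seq_len][vocab] nested lists of floats
    targets: [seq_len] list of target token ids
    alpha: weight on the divergence term (hypothetical name and default)
    """
    n = len(targets)
    ce_full = ce_masked = div = 0.0
    for lf, lm, t in zip(full_logits, masked_logits, targets):
        log_p = log_softmax(lf)   # distribution from the full sequence
        log_q = log_softmax(lm)   # distribution from the masked sequence
        ce_full += -log_p[t]
        ce_masked += -log_q[t]
        # average of forward KL(p||q) and backward KL(q||p)
        div += 0.5 * (kl(log_p, log_q) + kl(log_q, log_p))
    ce_full, ce_masked, div = ce_full / n, ce_masked / n, div / n
    # assumption: both cross-entropy terms get weight 1, divergence gets alpha
    total = ce_full + ce_masked + alpha * div
    return {"total": total, "ce_full": ce_full,
            "ce_masked": ce_masked, "divergence": div}

if __name__ == "__main__":
    full = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    masked = [[0.0, 1.0, 0.0], [0.3, 0.2, 0.1]]
    print(remask_style_loss(full, masked, targets=[0, 2], alpha=0.5))
```

when both passes produce identical logits the divergence term is zero and the loss reduces to twice the ordinary cross entropy, which is a handy sanity check.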

impl on github

<|user|>
Could Moulin Rouge have been hypothetically used as Spain's Spanish American War triage center?
<|logic|>
The Moulin Rouge cabaret in France had a capacity of 850 people. Spain had 700-800 injured during Spanish American War.
<|answer|>
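
the example above looks like a three-tag prompt layout, with the model expected to continue after `<|answer|>`. a small helper for building prompts in that shape might look like this. the function name `build_prompt` and the newline placement between tags are assumptions based only on the example, not a documented template.

```python
def build_prompt(question: str, logic: str) -> str:
    """Format a question and supporting facts in the tag layout shown above.

    Assumption: each tag sits on its own line and the prompt ends right
    after <|answer|>, leaving the answer for the model to generate.
    """
    return (
        "<|user|>\n"
        f"{question.strip()}\n"
        "<|logic|>\n"
        f"{logic.strip()}\n"
        "<|answer|>\n"
    )

if __name__ == "__main__":
    print(build_prompt(
        "Could Moulin Rouge have been hypothetically used as Spain's "
        "Spanish American War triage center?",
        "The Moulin Rouge cabaret in France had a capacity of 850 people. "
        "Spain had 700-800 injured during Spanish American War.",
    ))
```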