The Impact of Positional Encoding on Length Generalization in Transformers
Paper • arXiv:2305.19466 • Published
Note: NoPE and T5’s Relative PE resemble each other and exhibit a bimodal attention distribution, reflecting both short-range and long-range attention. Notably, NoPE and T5’s RPE are the top-performing PEs in this setup, which suggests that the bimodal distribution may be the more optimal one.
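As a minimal sketch of what NoPE ("No Positional Encoding") means in practice (this is an illustrative example, not the paper's code): single-head causal self-attention where no positional information is ever added to the token embeddings, so only the causal mask distinguishes order.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_nope(x, Wq, Wk, Wv):
    """Causal self-attention with NoPE: no positional encodings are
    injected anywhere; the causal mask alone supplies implicit order."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    # Causal mask: position i may only attend to positions <= i.
    n = x.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.normal(size=(n, d))                      # token embeddings, no PE added
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_attention_nope(x, Wq, Wk, Wv)
```

Because the first position can only attend to itself, its output equals its own value projection; length generalization is then a question of how such mask-only position information extrapolates to longer sequences.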