Abstract
Attention-based transformers have become the standard architecture in many deep-learning fields, primarily because they model long-range dependencies and handle variable-length input sequences. However, the attention mechanism, with its quadratic complexity, is a significant bottleneck in the transformer architecture. In the decoder it is only uni-directional, and in over-parametrized decoder-only models it converges to a static pattern. I address this issue by developing a generative function that can replace attention or the activation function. It retains the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT this yields a smaller loss with a smaller model. The loss drops further when an average context vector is incorporated. This concept of attention replacement is distributed under the GNU AGPL v3 license at https://gitlab.com/Bachstelze/causal_generation.
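The abstract does not spell out the exact formulation, so the following is only a minimal PyTorch sketch of how such a causal, attention-free token mixer could look, assuming a learned combination of the current token, the previous token, and a prefix-average context vector; the class and parameter names are hypothetical and not taken from the linked repository.

```python
import torch
import torch.nn as nn


class CausalGenerativeActivation(nn.Module):
    """Illustrative sketch (not the paper's exact code): each token is
    combined with its predecessor and a causal average context vector."""

    def __init__(self, d_model: int):
        super().__init__()
        # hypothetical learned projections for the current token,
        # the previous token, and the running-average context
        self.current = nn.Linear(d_model, d_model)
        self.previous = nn.Linear(d_model, d_model)
        self.context = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # shift the sequence by one position so token t sees token t-1;
        # the first position sees a zero vector, preserving causality
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)

        # causal (prefix) average of all tokens up to and including t,
        # computed with a cumulative sum -- linear in sequence length,
        # unlike quadratic attention
        positions = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        avg_context = x.cumsum(dim=1) / positions

        # combine the three signals through a non-linearity
        return torch.tanh(self.current(x) + self.previous(prev) + self.context(avg_context))
```

In a nanoGPT-style block, a module like this could stand in for the self-attention layer, since every output position depends only on earlier positions.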
Community
A causal activation function is presented that could replace attention in decoders.
Thanks for sharing. FYI, I don't think you can license a concept; I'm pretty sure you need a patent for that. Copyright is for a specific expression of a concept, not the concept itself. Anyone can implement this with their own code and license it as they see fit.
I am considering applying for a patent. However, in my legal understanding, a reimplementation doesn't change the copyright. Only a new threshold of originality changes the license; then it isn't a reimplementation anymore. The mission statement of free and human development is going to be enforced by law.