Abstract
Attention-based transformers have become the standard architecture in many deep-learning fields, primarily because they model long-range dependencies and handle variable-length input sequences. However, the attention mechanism, with its quadratic complexity, is a significant bottleneck in the transformer architecture. In the decoder it is only uni-directional, and in over-parametrized decoder-only models it converges to a static pattern. I address this issue by developing a generative function that can replace attention or the activation function. It retains the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT this yields a smaller loss with a smaller model. The loss drops further when an average context vector is incorporated. This concept of attention replacement is distributed under the GNU AGPL v3 license at https://gitlab.com/Bachstelze/causal_generation.
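The abstract does not spell out the exact formulation, so the following is only a minimal PyTorch sketch of how such a causal, attention-free token mixer could look, assuming a learned combination of the current token, the previous token, and a prefix-average context vector; the class and parameter names are hypothetical and not taken from the linked repository.

```python
import torch
import torch.nn as nn


class CausalGenerativeActivation(nn.Module):
    """Illustrative sketch (not the paper's exact code): each token is
    combined with its predecessor and a causal average context vector."""

    def __init__(self, d_model: int):
        super().__init__()
        # hypothetical learned projections for the current token,
        # the previous token, and the running-average context
        self.current = nn.Linear(d_model, d_model)
        self.previous = nn.Linear(d_model, d_model)
        self.context = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # shift the sequence by one position so token t sees token t-1;
        # the first position sees a zero vector, preserving causality
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)

        # causal (prefix) average of all tokens up to and including t,
        # computed with a cumulative sum -- linear in sequence length,
        # unlike quadratic attention
        positions = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        avg_context = x.cumsum(dim=1) / positions

        # combine the three signals through a non-linearity
        return torch.tanh(self.current(x) + self.previous(prev) + self.context(avg_context))
```

In a nanoGPT-style block, a module like this could stand in for the self-attention layer, since every output position depends only on earlier positions.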
Community
A causal activation function is presented that could replace attention in decoders.
Thanks for sharing. FYI, I don't think you can license a concept; I'm pretty sure you need a patent for that. Copyright is for a specific expression of a concept, not the concept itself. Anyone can implement this with their own code and license it as they see fit.
I am considering applying for a patent. However, in my legal understanding, a reimplementation doesn't change the copyright. Only a new threshold of originality changes the license; then it isn't a reimplementation anymore. The mission statement of free and human development is going to be enforced by law.