arxiv:2405.12250

Your Transformer is Secretly Linear

Published on May 19
· Submitted by akhaliq on May 22
#1 Paper of the day

Abstract

This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, owing to the consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance. Moreover, in our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization improves performance metrics on benchmarks like Tiny Stories and SuperGLUE and also successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
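For readers who want to poke at this on their own machine, below is a minimal sketch of measuring how linear the map between consecutive decoder layers is, using an off-the-shelf checkpoint. The score is one minus the relative squared error of a least-squares linear fit between centered hidden states; this is in the spirit of the paper's Procrustes-style similarity but is not claimed to be the authors' exact formula, and facebook/opt-125m is just a convenient small model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"   # any GPT/LLaMA/OPT-style decoder works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Transformers are secretly more linear than we usually assume."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    # Tuple of hidden states: embeddings plus one entry per layer, each [1, seq, dim]
    hidden = model(**inputs, output_hidden_states=True).hidden_states

def linearity_score(x, y):
    """1 - relative squared error of the best least-squares linear map x -> y,
    after centering both matrices. Caveat: with fewer tokens than hidden
    dimensions the fit is trivially exact, so in practice you would pool
    hidden states over many sequences before calling this."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    a = torch.linalg.lstsq(x, y).solution              # [dim, dim] linear map
    rel_err = (x @ a - y).norm() ** 2 / y.norm() ** 2  # relative squared error
    return 1.0 - rel_err.item()

for i in range(len(hidden) - 1):
    x, y = hidden[i][0].float(), hidden[i + 1][0].float()   # [seq, dim]
    print(f"layers {i} -> {i + 1}: linearity ~ {linearity_score(x, y):.4f}")
```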

Community

Wonder what this would imply in terms of approximating Transformers / efficiency.

·
Paper author

The proposed regularization technique makes training more efficient thanks to the control it gives over embedding linearisation.

Absolute chads at SberAI for still releasing after the war started. Regardless of one's political stance, I respect them a lot for not just cancelling their research division or no longer putting anything on arXiv.

·
Paper author

However, the major work is done at AIRI 😉 We love science and there are no limits for the job you love. Thank you for the kind words!

I'm a simple man, I see "secretly linear," I upvote.

Well, from the newer paper by MIT it seems the features are not as linear as previously thought. https://huggingface.co/papers/2405.14860

In this case, if I understand both papers correctly, linearization can hurt the model by eliminating complex associations (days of the week, months, years, and many other implicit nonlinear features we cannot even know exist in the model) that are directly tied to the model's understanding of the cyclic/curved/jagged parts of the world.

·

These are different papers: this one studies the linearity between two consecutive transformer block transformations, whereas the MIT paper studies embedding linearity within one transformer layer.

MIT VS AIRI LMAO

Is that so? Or should I say: We will see about that!

Working on reproducing this and similar pruning criteria here:
https://github.com/melisa-writer/short-transformers
Linear approximation of the last token is there, along with angular distances, BI score, etc.

The goal of the library: choose your distance (layer importance metric), get a cropped model. 🚀
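(This is not the short-transformers API; see the repo above for that.) As a rough, hedged sketch of the same "pick a metric, crop the model" idea with plain Hugging Face transformers, the snippet below drops the blocks with the highest placeholder linearity scores from an OPT-style model. The attribute path model.model.decoder.layers is OPT-specific; LLaMA-style models keep their blocks under model.model.layers.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Placeholder scores: in practice, compute a linearity / importance score per
# block from hidden states on a calibration set, as the paper does.
layers = model.model.decoder.layers              # OPT-specific attribute path
linearity = torch.rand(len(layers))              # higher = assumed safer to remove
n_drop = 2
drop_idx = set(torch.topk(linearity, n_drop).indices.tolist())

# Keep only the blocks that were not selected for removal.
model.model.decoder.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(layers) if i not in drop_idx
)
model.config.num_hidden_layers = len(model.model.decoder.layers)
print(f"dropped blocks {sorted(drop_idx)}, {model.config.num_hidden_layers} remain")
# Note: a plain forward pass works as-is; depending on your transformers
# version, cached generation may require re-assigning the surviving blocks'
# layer indices.
```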

The implications of this work are significant. There is so much to explore.
One thing that I can't quite grasp is how Cosine Similarity regularization manages to control linearity.

·
Paper author

Actually, this is a challenging outcome: the hypothesis is that adding a cosine-similarity term to make embeddings more similar (CS -> 1) leads the training process to increase the non-linear part in the residual stream. We plan to investigate this effect further.
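To make the shape of that regularizer concrete, here is a minimal sketch of adding a cosine-similarity term between consecutive layer embeddings on top of the usual language-modeling loss. The weight, the sign, and the set of layers it touches are illustrative assumptions, not the paper's exact recipe; facebook/opt-125m and reg_weight are placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
reg_weight = 0.1  # assumed hyper-parameter, not taken from the paper

batch = tok("Once upon a time there was a tiny story.", return_tensors="pt")
labels = batch["input_ids"].clone()

out = model(**batch, labels=labels, output_hidden_states=True)
lm_loss = out.loss

# Encourage consecutive hidden states to be cosine-similar (CS -> 1).
cs_terms = []
for h_prev, h_next in zip(out.hidden_states[:-1], out.hidden_states[1:]):
    cs = F.cosine_similarity(h_prev, h_next, dim=-1)  # [batch, seq]
    cs_terms.append((1.0 - cs).mean())
reg_loss = torch.stack(cs_terms).mean()

loss = lm_loss + reg_weight * reg_loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"lm_loss={lm_loss.item():.3f}  cs_reg={reg_loss.item():.3f}")
```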

Your Transformer Might Be Linear! | Deep Dive

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

In the paper:

"Furthermore, our feature triggering regime hypothesis proposes that rare specific features on a few tokens with high non-linearity significantly influence model behavior; in Figure 9 one can see that some layers of OPT-1.3B have a long-tailed distribution of L2 errors, which means that there are still sparse spikes of non-linearity."

How is the L2 error in Figure 9 calculated?
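For what it's worth, one plausible reading is that the L2 error is the per-token residual of a least-squares linear fit between consecutive layers' hidden states. The sketch below computes exactly that; whether it matches the quantity behind Figure 9 is an assumption, not something the thread or the excerpt confirms.

```python
import torch

def per_token_l2_errors(x, y):
    """x, y: [num_tokens, dim] hidden states of consecutive layers, pooled
    over many sequences. Returns one L2 error per token."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    a = torch.linalg.lstsq(x, y).solution    # best linear map x -> y
    return (x @ a - y).norm(dim=-1)          # residual norm per token

# Toy usage with random tensors standing in for real hidden states:
errors = per_token_l2_errors(torch.randn(4096, 768), torch.randn(4096, 768))
print(errors.mean(), errors.max())  # a heavy right tail = sparse non-linear spikes
```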

o ty! :D
