arxiv:2307.14995

Scaling TransNormer to 175 Billion Parameters

Published on Jul 27, 2023

Submitted by akhaliq on Jul 27, 2023

#2 Paper of the day

Authors: Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Fei Yuan, Yiran Zhong

Abstract

We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer by making advanced modifications that include positional embedding, linear attention acceleration, gating mechanism, tensor normalization, inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a cutting-edge technique that accelerates linear attention by more than twice in runtime and reduces memory usage by a remarkable four times. To further enhance the performance of TransNormer, we leverage a gating mechanism to smooth training and a new tensor normalization scheme to accelerate the model, resulting in an impressive acceleration of over 20%. Furthermore, we have developed a robust inference algorithm that ensures numerical stability and consistent inference speed, regardless of the sequence length, showcasing superior efficiency during both training and inference stages. Scalability is at the heart of our model's design, enabling seamless deployment on large-scale clusters and facilitating expansion to even more extensive models, all while maintaining outstanding performance metrics. Rigorous validation of our model design is achieved through a series of comprehensive experiments on our self-collected corpus, boasting a size exceeding 6TB and containing over 2 trillion tokens. To ensure data quality and relevance, we implement a new self-cleaning strategy to filter our collected data. Our pre-trained models will be released to foster community advancements in efficient LLMs.
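The abstract's claim of inference speed that stays constant regardless of sequence length follows from the recurrent form of linear attention with exponential decay. The block below is a minimal sketch of that idea only, not the authors' implementation: it assumes a single head, a scalar decay rate `lam`, and unnormalized outputs, and it leaves out LRPE, gating, normalization, and the Lightning Attention kernel. The function names are hypothetical.

```python
def decayed_linear_attention_parallel(q, k, v, lam):
    """Quadratic-time reference: o_t = sum_{s<=t} lam^(t-s) * (q_t . k_s) * v_s."""
    n, d_v = len(q), len(v[0])
    out = []
    for t in range(n):
        o = [0.0] * d_v
        for s in range(t + 1):
            # Causal score between positions t and s, damped by the decay.
            w = (lam ** (t - s)) * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                o[j] += w * v[s][j]
        out.append(o)
    return out


def decayed_linear_attention_recurrent(q, k, v, lam):
    """Recurrent form: S_t = lam * S_{t-1} + k_t v_t^T, then o_t = q_t^T S_t.

    The state S has a fixed size (d_k x d_v), so the work per generated token
    does not grow with the number of tokens already processed.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = lam * state[i][j] + k_t[i] * v_t[j]
        out.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return out


# Toy inputs: 3 tokens, key/query dimension 2, value dimension 3.
q = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
k = [[0.2, 0.3], [1.0, -1.0], [0.0, 0.5]]
v = [[1.0, 2.0, 3.0], [0.0, 1.0, -1.0], [2.0, 0.0, 1.0]]
par = decayed_linear_attention_parallel(q, k, v, lam=0.9)
rec = decayed_linear_attention_recurrent(q, k, v, lam=0.9)
```

Both functions compute the same outputs; the recurrent one simply folds the decayed history into a fixed-size state, which is why per-token cost is independent of context length in this formulation.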

Community

PY007

Jul 28, 2023

Just skimmed through the paper. Am I missing something, or is there actually no evaluation of the 175B model on standard benchmarks (e.g., LM-harness)?

PlanetMoon

Jul 28, 2023

Just skimmed through the paper. Am I missing something, or is there actually no evaluation of the 175B model on standard benchmarks (e.g., LM-harness)?

Yes, you're right. The paper only describes the efficiency of the models. Their benchmark performance is not shown in the evaluation section.

IanZhong

Jul 29, 2023

Just skimmed through the paper. Am I missing something, or is there actually no evaluation of the 175B model on standard benchmarks (e.g., LM-harness)?

I think this paper is proposing an algorithm rather than a trained LLM. Benchmark performance depends more on the training corpus than on the algorithm itself. However, there is an apples-to-apples accuracy comparison on relatively small models. Training LLMs for ablation is costly.