arxiv:2503.04598

HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

Published on Mar 6

· Submitted by

Taoer on Mar 7

Upvote

Authors:

Zhijian Zhuo ,

Yutao Zeng ,

Xiaoqing Li ,

Jinwen Ma

Abstract

Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the location of layer normalization. While Pre-Norm structures facilitate easier training due to their more prominent identity path, they often yield suboptimal performance compared to Post-Norm. In this paper, we propose HybridNorm, a straightforward yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm approaches. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. This design not only stabilizes training but also enhances performance, particularly in the context of LLMs. Comprehensive experiments in both dense and sparse architectures show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches, achieving state-of-the-art results across various benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. %Code will be made publicly available. Code is available at https://github.com/BryceZhuo/HybridNorm.

View arXiv page View PDF Project page GitHub repository Add to collection

Community

Taoer

Paper author Paper submitter 2 days ago

•

edited 2 days ago

The paper introduces HybridNorm, a novel normalization strategy for transformer networks that combines the best features of Pre-Norm and Post-Norm approaches. By applying QKV normalization in the attention mechanism and Post-Norm in the feed-forward network, HybridNorm provides a more stable training process and improved performance for large language models. The method achieves a 1.4x pre-training convergence speedup when compared to the commonly used pre-norm architecture. Experimental results demonstrate that this approach consistently outperforms traditional normalization techniques across different architectures and benchmarks, offering a promising solution to existing challenges in transformer model training. Code is available at this URL.

pengxiang

2 days ago

I found your paper on HybridNorm quite interesting and the performance improvements are impressive!

I'm curious about the computational efficiency aspects of your approach. Does implementing HybridNorm introduce any significant additional computational overhead during training or inference compared to standard Pre-Norm or Post-Norm approaches?

Taoer

Paper author 2 days ago

•

edited 2 days ago

Thank you for your interest in our work. In our approach, we have primarily focused on adjusting the positioning of normalization layers to achieve improved gradient flow and training stability, without introducing additional computationally expensive operators. Meanwhile, as the normalization operation is element-wise, its contribution to the overall computational Flops of the model is relatively small (in most cases, less than 1%). Consequently, the computational efficiency of the different variants (including the baselines) is nearly consistent. :)

librarian-bot

1 day ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

WaitHZ

1 day ago

Impressive results!
But I feel a bit puzzled by the statement in the paper that FFN is combined with post-norm. It seems to me more like the attention is using both QKV normalization and post-normalization, while discarding the normalization for FFN. Or maybe I'm overlooking something important?

Taoer

Paper author about 24 hours ago

Yes, your understanding of the approach is correct. The design of the method can be referenced in Figure 2 and Algorithm 1.

In this paper, to facilitate a unified analysis of various normalization variants, we attempt to categorize Pre-Norm and Post-Norm based on their differing treatments of the residual connections (see the figure below, particularly the sections highlighted with dashed boxes). From this perspective, the FFN sublayer can be considered as adopting a connection scheme akin to Post-Norm.

Thank you very much for your attention and feedback. We will incorporate this content into the paper to enhance the explanation of our proposed approach. :)