HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Abstract
Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the location of layer normalization. While Pre-Norm structures facilitate easier training due to their more prominent identity path, they often yield suboptimal performance compared to Post-Norm. In this paper, we propose HybridNorm, a straightforward yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm approaches. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. This design not only stabilizes training but also enhances performance, particularly in the context of LLMs. Comprehensive experiments in both dense and sparse architectures show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches, achieving state-of-the-art results across various benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.
Community
The paper introduces HybridNorm, a novel normalization strategy for transformer networks that combines the best features of the Pre-Norm and Post-Norm approaches. By applying QKV normalization in the attention mechanism and Post-Norm in the feed-forward network, HybridNorm provides a more stable training process and improved performance for large language models. The method achieves a 1.4x pre-training convergence speedup compared with the commonly used Pre-Norm architecture. Experimental results demonstrate that this approach consistently outperforms traditional normalization techniques across different architectures and benchmarks, offering a promising solution to existing challenges in transformer model training. Code is available at https://github.com/BryceZhuo/HybridNorm.
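For a concrete picture of the block structure summarized above, here is a minimal PyTorch sketch of a HybridNorm-style block. It is an illustration under assumptions (LayerNorm standing in for whatever normalization the paper uses, full-width rather than per-head QKV norms, no special first-block handling); the exact placement of each norm should be checked against Figure 2 and Algorithm 1 of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridNormBlock(nn.Module):
    """Sketch of one HybridNorm-style block: QKV normalization inside attention,
    a Post-Norm-style connection on the FFN residual. Details (RMSNorm vs LayerNorm,
    per-head norms, first-block handling) may differ from the paper."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.n_heads = n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)
        # QKV-Norm: normalize the query/key/value projections instead of applying
        # a single pre-attention norm to the block input.
        self.q_norm = nn.LayerNorm(d_model)
        self.k_norm = nn.LayerNorm(d_model)
        self.v_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff, bias=False),
            nn.SiLU(),
            nn.Linear(d_ff, d_model, bias=False),
        )
        # Post-Norm-style: the norm acts on the FFN residual sum.
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        H, Dh = self.n_heads, D // self.n_heads
        q = self.q_norm(self.wq(x)).view(B, T, H, Dh).transpose(1, 2)
        k = self.k_norm(self.wk(x)).view(B, T, H, Dh).transpose(1, 2)
        v = self.v_norm(self.wv(x)).view(B, T, H, Dh).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(B, T, D))  # attention residual
        x = self.ffn_norm(x + self.ffn(x))                      # Post-Norm FFN residual
        return x

# Usage: y = HybridNormBlock(d_model=512, n_heads=8, d_ff=2048)(torch.randn(2, 16, 512))
```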
I found your paper on HybridNorm quite interesting and the performance improvements are impressive!
I'm curious about the computational efficiency aspects of your approach. Does implementing HybridNorm introduce any significant additional computational overhead during training or inference compared to standard Pre-Norm or Post-Norm approaches?
Thank you for your interest in our work. Our approach primarily adjusts the positioning of the normalization layers to improve gradient flow and training stability, without introducing any additional computationally expensive operators. Moreover, since normalization is an element-wise operation, its contribution to the model's overall FLOPs is small (in most cases, less than 1%). Consequently, the computational efficiency of the different variants (including the baselines) is nearly identical. :)
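As a rough back-of-envelope check of the reply above (with hypothetical sizes, not a configuration from the paper), the element-wise normalization FLOPs can be compared against the matrix-multiplication FLOPs of a single block:

```python
# Rough estimate of the normalization share of per-token FLOPs in one transformer block.
# All sizes below are assumed for illustration, not taken from the paper.
d_model, d_ff, n_norms = 4096, 4 * 4096, 4      # hidden width, FFN width, norms per block

attn_proj = 4 * 2 * d_model * d_model           # Q, K, V, O projections (2 * in * out each)
ffn_proj = 2 * 2 * d_model * d_ff               # FFN up and down projections
norm_ops = n_norms * 4 * d_model                # element-wise mean/RMS, divide, scale, shift

total = attn_proj + ffn_proj + norm_ops
print(f"normalization share of block FLOPs: {norm_ops / total:.4%}")
```

With these assumed sizes the share comes out on the order of 0.02%, consistent with the "less than 1%" figure above; including the attention score computation (which also scales with sequence length) would only shrink it further.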
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models (2025)
- MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections (2025)
- The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training (2025)
- AdaGC: Improving Training Stability for Large Language Model Pretraining (2025)
- The Curse of Depth in Large Language Models (2025)
- Efficient Language Modeling for Low-Resource Settings with Hybrid RNN-Transformer Architectures (2025)
- Tensor Product Attention Is All You Need (2025)
Impressive results!
But I feel a bit puzzled by the statement in the paper that FFN is combined with post-norm. It seems to me more like the attention is using both QKV normalization and post-normalization, while discarding the normalization for FFN. Or maybe I'm overlooking something important?
Yes, your understanding of the approach is correct. The design of the method is illustrated in Figure 2 and Algorithm 1.
In this paper, to facilitate a unified analysis of the various normalization variants, we categorize Pre-Norm and Post-Norm based on how they treat the residual connections (see the figure below, particularly the sections highlighted with dashed boxes). From this perspective, the FFN sublayer can be considered to adopt a connection scheme akin to Post-Norm.
Thank you very much for your attention and feedback. We will incorporate this content into the paper to enhance the explanation of our proposed approach. :)
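For readers following this exchange, here is a small illustrative sketch (not code from the paper) of the two residual-connection treatments being contrasted, where `sublayer` stands for attention or the FFN and `norm` for any normalization layer:

```python
def pre_norm_connection(x, sublayer, norm):
    # Pre-Norm: the residual path bypasses the norm; only the branch input is normalized.
    return x + sublayer(norm(x))

def post_norm_connection(x, sublayer, norm):
    # Post-Norm: the norm acts on the residual sum itself.
    return norm(x + sublayer(x))
```

Under the categorization described in the reply above, the connection associated with the FFN sublayer in HybridNorm is of the `post_norm_connection` kind, while the attention sublayer replaces the usual input norm with QKV normalization.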