arXiv:2411.02265

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Published on Nov 4, 2024 · Submitted by xxzcc on Nov 5, 2024
Abstract

In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture-of-experts model, with a total of 389 billion parameters and 52 billion activated parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks, including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms Llama 3.1-70B and exhibits comparable performance to the significantly larger Llama 3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedule of mixture-of-experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Code: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
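For readers unfamiliar with the mixed expert routing mentioned in the abstract, the sketch below illustrates the general idea in PyTorch: a shared expert processes every token, while a gating network dispatches each token to its top-scoring specialized expert. The module names, dimensions, and the top-1 routing choice here are illustrative assumptions for exposition only, not the released Hunyuan-Large implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """A SwiGLU-style feed-forward block (illustrative sizes)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class MixedRoutingMoE(nn.Module):
    """Mixed routing sketch: a shared expert sees every token, and a router
    sends each token to its top-1 specialized expert. This is a minimal
    illustration, not the released Hunyuan-Large code."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.shared_expert = FeedForward(d_model, d_ff)
        self.experts = nn.ModuleList(
            [FeedForward(d_model, d_ff) for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); flatten batch and sequence dims beforehand.
        scores = F.softmax(self.router(x), dim=-1)   # (num_tokens, num_experts)
        top_score, top_idx = scores.max(dim=-1)      # top-1 routing decision
        out = self.shared_expert(x)                  # shared path for every token
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                      # tokens routed to expert e
            if mask.any():
                out[mask] = out[mask] + top_score[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Tiny usage example with illustrative dimensions.
moe = MixedRoutingMoE(d_model=64, d_ff=256, num_experts=4)
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```

Because only the shared expert and one specialized expert run per token, the number of activated parameters stays far below the total parameter count, which is the property the abstract's 52B-activated / 389B-total figures reflect.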


Models citing this paper: 6

Datasets citing this paper: 0

Spaces citing this paper: 24

Collections including this paper: 3