MaskLLM: Learnable Semi-structured Sparsity for Large Language Models

This work introduces MaskLLM, a learnable pruning method that establishes semi-structured (or "N:M") sparsity in LLMs, aimed at reducing computational overhead during inference. The proposed method is scalable and stands to benefit from larger training datasets.
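As a concrete illustration of what a 2:4 (N:M) pattern means, the sketch below zeroes two out of every four consecutive weights. Note that this toy example picks the two smallest-magnitude weights per group, whereas MaskLLM learns the masks end-to-end; the helper `apply_24_mask` is hypothetical and only meant for intuition.

```python
import torch

def apply_24_mask(weight: torch.Tensor) -> torch.Tensor:
    """Toy 2:4 sparsity: zero the 2 smallest-magnitude weights in each group of 4."""
    w = weight.reshape(-1, 4)                            # group consecutive weights
    drop = w.abs().topk(2, dim=-1, largest=False).indices
    mask = torch.ones_like(w)
    mask.scatter_(-1, drop, 0.0)                         # drop 2 of every 4 entries
    return (w * mask).reshape(weight.shape)

w = torch.randn(2, 8)
print(apply_24_mask(w))  # each group of 4 consecutive entries keeps exactly 2 non-zeros
```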

Requirements

We provide pre-computed masks for Hugging Face models such as LLaMA-2 7B and LLaMA-3 8B with minimal requirements. Using them does not involve Docker, Megatron, or data preprocessing.

pip install transformers accelerate datasets SentencePiece 

Pre-computed Masks

The following masks were trained and provided by @VainF. We use huggingface_hub to download the masks automatically and apply them to the official LLMs for evaluation. The mask files were compressed with numpy.savez_compressed. More results for the baselines (SparseGPT, Wanda) can be found in the appendix.

| Model | Pattern | Training Data | Training/Eval SeqLen | PPL (Dense) | PPL (SparseGPT) | PPL (MaskLLM) | Link |
|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 2:4 | C4 (2B tokens) | 4096 | 5.12 | 10.42 | 6.78 | HuggingFace |
| LLaMA-3 8B | 2:4 | C4 (2B tokens) | 4096 | 5.75 | 17.64 | 8.49 | HuggingFace |
| LLaMA-3.1 8B | 2:4 | C4 (2B tokens) | 4096 | - | - | - | Coming Soon |

How to use it

Please see the NVlabs/MaskLLM repository for the full training and evaluation pipeline.
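As a quick start, the following is a hedged sketch of fetching a pre-computed mask archive from this repo and applying it to the dense checkpoint. The repo id and the use of numpy.savez_compressed come from this card; the file name "mask.npz" and the assumption that the archive keys match parameter names are illustrative only, so follow NVlabs/MaskLLM for the exact file layout and loading code.

```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM

# Load the dense base model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

# Download the compressed mask archive (file name is hypothetical).
mask_path = hf_hub_download(
    repo_id="Vinnnf/LLaMA-2-7B-MaskLLM-C4",
    filename="mask.npz",
)
masks = np.load(mask_path)  # savez_compressed archives load like a dict of arrays

# Assumed: archive keys correspond to parameter names; zero out pruned weights.
with torch.no_grad():
    for name, param in model.named_parameters():
        if name in masks.files:
            mask = torch.from_numpy(masks[name]).to(param.device, param.dtype)
            param.mul_(mask)  # apply the 2:4 mask in place
```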
