PEFT: Parameter-Efficient Fine-Tuning Methods for LLMs

Community Article Published January 24, 2025
In the constrained and costly landscape of LLMs, where large organizations deploy enormous computational resources to build general-purpose language models, PEFT emerges as a valuable alternative—not only to reduce costs but also to enable specialization and control.


Source: Image generated by the model FLUX.1 [dev]

Introduction

This article explores the universe of Parameter-Efficient Fine-Tuning (PEFT) techniques—a set of approaches that enable the adaptation of large language models (LLMs) more efficiently in terms of memory and computational performance. Drawing from the paper “[2303.15647] Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning” and the PEFT library, integrated with Hugging Face's Transformers, this study delves into the key concepts and methodologies that facilitate fine-tuning language models without the need to train all of their billions of parameters.

This article provides an introduction to the main PEFT techniques, describing how they work and highlighting their characteristics and potential applications for fine-tuning language models, with a focus on maximizing memory efficiency and training time performance.

The article also includes a notebook that fine-tunes a model for summarizing customer service conversations, using full fine-tuning, LoRA, QLoRA, and (IA)³.

Open in Colab

Categories of PEFT Methods

Parameter-Efficient Fine-Tuning (PEFT) methods can be classified according to two main aspects: their conceptual structure (e.g., introducing new parameters or adjusting existing ones) and their primary objective (minimizing memory footprint, improving storage efficiency, or reducing computational costs).

PEFT Methods

These methods are divided into three broad categories:

Additive Methods

Additive methods introduce new parameters to the base model, often through small adapter layers or by adjusting a part of the input embeddings (known as soft prompts). These methods are widely used and include:

  • Adapters: Small dense (fully connected) networks inserted after specific transformer sublayers, allowing adaptation to new tasks without the need to train all the model's parameters.
  • Soft Prompts: Fine-tuning applied directly to the model's input embeddings, enabling task-specific adaptation without modifying the model's internal parameters.

These methods are generally memory-efficient as they reduce the size of gradients and optimizer states.

Selective Methods

Selective methods adjust only a fraction of the existing model parameters. This can be done in several ways, such as:

  • Top Layer Fine-Tuning: Focusing on fine-tuning only the upper layers of the network while leaving the lower layers untouched.
  • Specific Parameter Fine-Tuning: Selectively training certain types of parameters, such as biases, while freezing other parameters.
  • Sparse Updates: Selecting a specific subset of parameters for training. While promising, this approach can be computationally expensive due to the need to identify the most relevant parameters.

Despite reducing the number of trained parameters, selective methods may incur high computational costs, especially in sparse configurations.

Reparameterization-Based Methods

Reparameterization-based methods reduce the number of trainable parameters by utilizing low-rank representations, leveraging the redundancy present in neural networks. Key methods include:

  • LoRA (Low-Rank Adaptation): Employs low-rank matrix decomposition to represent weight updates, providing an efficient way to fine-tune models.
  • Intrinsic SAID: Utilizes the Fastfood transform, a technique for efficiently representing low-rank updates.

These methods significantly reduce the number of parameters to be trained, making them ideal for scenarios where storage efficiency and training time are critical.

Additional Points

  • Additive Methods: While they introduce new parameters, they can be more memory-efficient overall by reducing the amount of gradients and optimizer states that need to be stored.
  • Selective Methods: Although promising for reducing the number of trained parameters, they can be computationally intensive, particularly in cases of sparse updates.
  • Hybrid Methods: Combinations of ideas from different categories are often explored to maximize performance, leveraging the strengths of each approach.
| Method | Type | Storage | Memory | Backprop | Inference overhead |
|---|---|---|---|---|---|
| Adapters (Houlsby et al., 2019) | A | yes | yes | no | Extra FFN |
| AdaMix (Wang et al., 2022) | A | yes | yes | no | Extra FFN |
| SparseAdapter (He et al., 2022b) | AS | yes | yes | no | Extra FFN |
| Cross-Attn tuning (Gheini et al., 2021) | S | yes | yes | no | No overhead |
| BitFit (Ben-Zaken et al., 2021) | S | yes | yes | no | No overhead |
| DiffPruning (Guo et al., 2020) | S | yes | no | no | No overhead |
| Fish-Mask (Sung et al., 2021) | S | yes | maybe | no | No overhead |
| LT-SFT (Ansell et al., 2022) | S | yes | maybe | no | No overhead |
| Prompt Tuning (Lester et al., 2021) | A | yes | yes | no | Extra input |
| Prefix-Tuning (Li and Liang, 2021) | A | yes | yes | no | Extra input |
| Spot (Vu et al., 2021) | A | yes | yes | no | Extra input |
| IPT (Qin et al., 2021) | A | yes | yes | no | Extra FFN and input |
| MAM Adapter (He et al., 2022a) | A | yes | yes | no | Extra FFN and input |
| Parallel Adapter (He et al., 2022a) | A | yes | yes | no | Extra FFN |
| Intrinsic SAID (Aghajanyan et al., 2020) | R | no | no | no | No overhead |
| LoRA (Hu et al., 2021) | R | yes | yes | no | No overhead |
| UniPELT (Mao et al., 2021) | AR | yes | yes | no | Extra FFN and input |
| Compacter (Karimi Mahabadi et al., 2021) | AR | yes | yes | no | Extra FFN |
| PHM Adapter (Karimi Mahabadi et al., 2021) | AR | yes | yes | no | Extra FFN |
| KronA (Edalati et al., 2022) | R | yes | yes | no | No overhead |
| KronA_Bres (Edalati et al., 2022) | AR | yes | yes | no | Extra linear layer |
| (IA)³ (Liu et al., 2022) | A | yes | yes | no | Extra gating |
| Attention Fusion (Cao et al., 2022) | A | yes | yes | yes | Extra decoder |
| LeTS (Fu et al., 2021) | A | yes | yes | yes | Extra FFN |
| Ladder Side-Tuning (Sung et al., 2022) | A | yes | yes | yes | Extra decoder |
| FAR (Vucetic et al., 2022) | S | yes | maybe | no | No overhead |
| S4-model (Chen et al., 2023) | ARS | yes | yes | no | Extra FFN and input |

The table above presents a detailed comparison of PEFT methods in terms of storage efficiency, memory efficiency, and computational efficiency. It examines the reduction of backpropagation costs during training and the inference overhead associated with each method. The different techniques are classified as follows:

  • A (Additive): Methods that introduce new parameters into the model.
  • S (Selective): Methods that fine-tune only a subset of existing parameters.
  • R (Reparameterization): Methods that utilize low-rank representations to reduce the number of trainable parameters.
| Method | % Trainable parameters | % Changed parameters | Evaluated <1B | Evaluated <20B | Evaluated >20B |
|---|---|---|---|---|---|
| Adapters (Houlsby et al., 2019) | 0.1 - 6 | 0.1 - 6 | yes | yes | yes |
| AdaMix (Wang et al., 2022) | 0.1 - 0.2 | 0.1 - 0.2 | yes | no | no |
| SparseAdapter (He et al., 2022b) | 2.0 | 2.0 | yes | no | no |
| BitFit (Ben-Zaken et al., 2021) | 0.05 - 0.1 | 0.05 - 0.1 | yes | yes | yes |
| DiffPruning (Guo et al., 2020) | 200 | 0.5 | yes | no | no |
| Fish-Mask (Sung et al., 2021) | 0.01 - 0.5 | 0.01 - 0.5 | yes | yes | no |
| Prompt Tuning (Lester et al., 2021) | 0.1 | 0.1 | yes | yes | yes |
| Prefix-Tuning (Li and Liang, 2021) | 0.1 - 4.0 | 0.1 - 4.0 | yes | yes | yes |
| IPT (Qin et al., 2021) | 56.0 | 56.0 | yes | no | no |
| MAM Adapter (He et al., 2022a) | 0.5 | 0.5 | yes | no | no |
| Parallel Adapter (He et al., 2022a) | 0.5 | 0.5 | yes | no | no |
| Intrinsic SAID (Aghajanyan et al., 2020) | 0.001 - 0.1 | ~0.1 or 100 | yes | yes | no |
| LoRA (Hu et al., 2021) | 0.01 - 0.5 | ~0.5 or ~30 | yes | yes | yes |
| UniPELT (Mao et al., 2021) | 1.0 | 1.0 | yes | no | no |
| Compacter (Karimi Mahabadi et al., 2021) | 0.05 - 0.07 | ~0.07 or ~0.1 | yes | yes | no |
| PHM Adapter (Karimi Mahabadi et al., 2021) | 0.2 | ~0.2 or ~1.0 | yes | no | no |
| KronA (Edalati et al., 2022) | 0.07 | ~0.07 or ~30.0 | yes | no | no |
| KronA_Bres (Edalati et al., 2022) | 0.07 | ~0.07 or ~1.0 | yes | no | no |
| (IA)³ (Liu et al., 2022) | 0.02 | 0.02 | no | yes | no |
| Ladder Side-Tuning (Sung et al., 2022) | 7.5 | 7.5 | yes | yes | no |
| FAR (Vucetic et al., 2022) | 6.6 - 26.4 | 6.6 - 26.4 | yes | no | no |
| S4-model (Chen et al., 2023) | 0.5 | more than 0.5 | yes | yes | no |

Table 2 provides an analysis of the model scales on which PEFT methods have been evaluated, highlighting the typical number of trainable parameters used by each approach. The "trainable parameters" count specifically refers to the parameters adjusted by a gradient optimization algorithm, distinguishing them from "modified parameters," which indicate changes relative to the original model. For reparameterization-based methods, the table reports the parameters both before and after reparameterization.


Additive Methods

Adapters

Adapter-based methods add extra trainable parameters after the attention and fully connected layers of a frozen pre-trained model to reduce memory usage and accelerate training. The specific implementation of the adapter may vary; it can be a simple extra layer or involve expressing weight updates ∆W as a low-rank decomposition of the weight matrix. In either case, adapters are typically small but demonstrate performance comparable to fully fine-tuned models, enabling the training of larger models with fewer resources.

The concept of adapters was initially developed for multi-domain image classification (Rebuffi et al., 2017, 2018) and involved adding domain-specific layers between neural network modules. Houlsby et al. (2019) adapted this idea for NLP. They proposed adding fully connected networks after the attention and FFN layers in the Transformer architecture.
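As a rough illustration (not the exact Houlsby et al. architecture, which also tunes layer norms), a bottleneck adapter can be sketched in PyTorch as follows; the hidden size and bottleneck dimension are placeholder values:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a nonlinearity, up-project, and add a residual connection."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()
        # Near-identity initialization so the adapter barely perturbs the frozen model at first
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Usage: inserted after a transformer sublayer's output; only adapter parameters are trained.
adapter = BottleneckAdapter()
hidden_states = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
out = adapter(hidden_states)
```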


Soft Prompts

Prompt methods have emerged as an efficient way to adapt pre-trained language models to specific tasks without the need for full fine-tuning (Brown et al., 2020). The concept involves providing instructions or examples to the model that guide its behavior for the desired task.

There are two main categories of prompt methods:

  • Hard Prompts: Consist of manually created natural text that instructs the model about the task. For example: "Translate the following text to French:" or "Classify the sentiment as positive or negative:". Although intuitive, they require significant expertise to create effective prompts.

  • Soft Prompts: Utilize continuous and trainable vectors that are concatenated to input embeddings. Unlike hard prompts, these "virtual tokens" are automatically optimized for the task but are not human-interpretable as they do not correspond to real words. (Li and Liang, 2021; Lester et al., 2021; Liu et al., 2021)

Prompt Tuning


Concept:

Prompt Tuning proposes adding a trainable tensor, known as a "soft prompt", to the model's input embeddings. This tensor is directly optimized through gradient descent, allowing the model to adjust its behavior without altering the underlying model parameters.

Implementation:

  • Prompt tokens are initialized randomly or from existing word embeddings
  • During training, only the prompt tokens are updated, keeping the base model frozen
  • The prompt size (number of tokens) is an adjustable hyperparameter
  • Prompts can be reused for different instances of the same task
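The steps above map onto the PEFT library's prompt tuning support; a minimal sketch, where the checkpoint name, number of virtual tokens, and initialization text are illustrative choices:

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

model_name = "bigscience/bloomz-560m"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,                      # prompt length is a hyperparameter
    prompt_tuning_init=PromptTuningInit.TEXT,   # initialize from existing word embeddings
    prompt_tuning_init_text="Summarize the customer service conversation:",
    tokenizer_name_or_path=model_name,
)
model = get_peft_model(model, config)           # base model stays frozen
model.print_trainable_parameters()              # only the soft prompt is trainable
```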

Efficiency:

  • Research shows that prompt tuning is more parameter-efficient as model size increases. For example, T5-11B achieves similar performance on the SuperGLUE benchmark with short (5 tokens) and long (150 tokens) soft prompts.
  • Model Scale: Prompt tuning becomes comparable to full fine-tuning only in models with over 10 billion parameters, demonstrating its efficiency primarily in large models.
  • Inference Overhead: While soft prompts are highly parameter-efficient, they can lead to increased computation due to additional tokens, particularly in transformer models with quadratic complexity.

Applications and Limitations:

  • Ideal for classification and text generation tasks
  • Allows maintaining a single copy of the base model for multiple tasks
  • Performance may be inferior to traditional fine-tuning in smaller models
  • Interpretability of prompt tokens is limited by their continuous nature

Prefix Tuning


Concept:

Prefix Tuning is a fine-tuning method that introduces trainable parameters ('prefixes') across all model layers, preserving the original parameters unchanged. Unlike other approaches that modify only input embeddings, this method optimizes prefixes at multiple levels of the architecture, allowing for more refined and efficient adjustment.

In the image above, it is demonstrated that only the prefixes (red prefix blocks) are optimized, so only the prefix needs to be stored for each task, making the method efficient and modular.

Implementation:

  • A sequence of task-specific vectors (prefixes) is inserted into the hidden states at each model layer.
  • To handle training instability, prefixes are generated through a feed-forward network (FFN), which is optimized during training. After training, only the prefixes are retained and the FFN is discarded.
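With the PEFT library, prefix tuning can be configured roughly as follows; the checkpoint and prefix length are placeholder choices:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")   # placeholder checkpoint
config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=30,        # length of the prefix added at every layer
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```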

Performance:

  • Shows performance close to full fine-tuning, requiring significantly fewer parameters — only about 0.1% of the total model parameters.
  • It performs particularly well on natural language generation (NLG) tasks and is especially effective in few-data settings.

Comparison with Prompt Tuning:

  • Both methods add additional parameters to the model, but prefix tuning inserts these parameters in each layer, while prompt tuning modifies only the input embeddings.
  • Prefix tuning, with its integration across layers, achieves performance equivalent to full fine-tuning, but with much greater efficiency, primarily in large models.

P-tuning


Concept:

P-Tuning is a method developed to optimize the performance of language models in natural language understanding (NLU) tasks, aiming to overcome the limitations of traditional discrete prompts. Based on the soft prompt concept, the method uses a trainable embedding tensor optimized through a specialized prompt encoder — typically a bidirectional LSTM network. This approach allows for more refined model adaptation to specific tasks while maintaining computational efficiency.

Implementation:

  • The method begins by inserting anchor tokens in the input sequence, which serve as reference points to guide the model in identifying important input components
  • Prompt tokens can be flexibly positioned at any position in the input sequence, not limited to the beginning
  • Model modification occurs only in the input layer, unlike methods like prefix tuning that affect multiple layers, resulting in a more efficient implementation
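In the PEFT library, P-Tuning corresponds to PromptEncoderConfig, which adds a small prompt encoder (an MLP or LSTM) over the virtual tokens; a minimal sketch with placeholder values:

```python
from transformers import AutoModelForSequenceClassification
from peft import PromptEncoderConfig, TaskType, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)
config = PromptEncoderConfig(
    task_type=TaskType.SEQ_CLS,
    num_virtual_tokens=20,
    encoder_hidden_size=128,   # hidden size of the prompt encoder
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```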

Efficiency/Performance:

  • P-Tuning is more efficient than manually creating prompts, allowing GPT-like models to achieve or surpass BERT-like model performance in NLU tasks.

  • In benchmarks like LAMA and SuperGLUE, P-Tuning enables GPT models to recover a significant amount of world knowledge and achieve performance comparable to or better than similarly sized BERT models.

  • It also improves BERT model performance, particularly in supervised and few-shot settings, reducing dependence on extensive prompt engineering.


Application:

  • Primarily applied to NLU tasks, P-Tuning enables models like GPT to effectively compete in areas traditionally dominated by BERT models.
  • It is particularly beneficial for tasks requiring knowledge probing and few-shot learning, where it outperforms state-of-the-art approaches.

Comparison and Additional Points:

  • Compared to other tuning methods like prefix tuning, P-Tuning is more flexible in prompt positioning and does not require modification of all model layers. Its use of a prompt encoder, particularly LSTM, provides more robust prompt optimization, leading to superior performance in specific benchmarks.

(IA)³


Concept:

(IA)³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) is a parameter-efficient method designed for fine-tuning transformer models. Unlike traditional fine-tuning, which adjusts a large portion of the model's parameters, (IA)³ modifies only specific learned vectors associated with key, value, and feedforward layers within transformer blocks.
The method introduces three learned vectors, l_k, l_v, and l_ff, which rescale activations in the attention and feedforward layers. This approach keeps most model weights frozen, drastically reducing the number of trainable parameters.

Implementation:

  • (IA)³ injects these learned vectors into attention (key and value layers) and the second feedforward layer within each transformer block. The vectors are the only trainable parameters during fine-tuning, making the process parameter-efficient.
  • The method maintains the original model architecture and incurs only minimal computational overhead, specifically from the element-wise rescaling operations performed by the learned vectors.
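These injections can be configured with the PEFT library's IA3Config; a minimal sketch where the target module names are placeholders that depend on the architecture (they must point at the key, value, and feedforward projections):

```python
from transformers import AutoModelForSeq2SeqLM
from peft import IA3Config, TaskType, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")   # placeholder checkpoint
config = IA3Config(
    task_type=TaskType.SEQ_2_SEQ_LM,
    target_modules=["k", "v", "wo"],   # modules that receive a learned rescaling vector
    feedforward_modules=["wo"],        # subset treated as feedforward (rescales the input)
)
model = get_peft_model(model, config)
model.print_trainable_parameters()     # typically ~0.01-0.02% of parameters
```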

Efficiency/Performance:

  • (IA)³ is highly parameter-efficient, updating only about 0.01-0.02% of the total model parameters. For example, in the T0-3B model, it updates just 0.02% of parameters, significantly less than methods like LoRA, which requires 16 times more trainable parameters.
  • Despite its minimal parameter updates, it achieves performance comparable to fully fine-tuned models, often outperforming other parameter-efficient methods like Compacter.

Application:

  • Applicable to any subset of weight matrices in a neural network, making it versatile for various downstream tasks. By adjusting only a small part of the model, it allows the creation of multiple lightweight and portable models adapted to specific tasks.
  • It is particularly beneficial in scenarios with limited computational resources or where rapid adaptation to new tasks is needed without the overhead of full model retraining.

Comparison:


Reparameterization-Based Methods

Parameter-efficient fine-tuning methods based on reparametrization leverage low-rank representations to minimize the number of trainable parameters. The notion that neural networks have low-dimensional representations has been extensively explored in empirical and theoretical deep learning analyses.

Intrinsic SAID

Introduced by Aghajanyan et al. in 2020, Intrinsic SAID is a fine-tuning method based on the discovery that large language models can be effectively adapted using far fewer parameters than their total size suggests. This method explores the concept of "intrinsic dimensionality" - the idea that there exists a lower-dimensional subspace where fine-tuning can be performed as effectively as in the full parameter space.

Concept:

  • Intrinsic dimensionality represents the minimum number of parameters necessary to effectively adjust a model for a specific task.
  • SAID uses the Fastfood transformation to project updates from a low-dimensional space to the full model space, enabling efficient fine-tuning.
  • A key finding is that larger models often have a lower intrinsic dimensionality relative to their total size, making the method especially relevant for LLMs.

Implementation:

  • The process occurs in three main steps:

    1. Subspace Definition: Identification of intrinsic dimensionality (d) for the specific task through empirical analysis of the model and data
    2. Reparametrization: Use of Fastfood transformation (F) to map parameters from the low-dimensional space (d) to the original space (D). This transformation is an efficient alternative to traditional dense matrices
    3. Update: θ = θ₀ + F(θ_d), where θ₀ are the original pre-trained parameters and θ_d are the parameters optimized in the low-dimensional space
  • The Fastfood transformation offers computational efficiency:

    • Temporal complexity: O(D log d)
    • Spatial complexity: O(D)
    • Does not require storage of large dense matrices
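To make the idea concrete, here is a toy sketch of the reparametrization θ = θ₀ + F(θ_d). For readability it uses a fixed dense Gaussian projection instead of the Fastfood transform (which achieves the same mapping with O(D log d) time and O(D) memory):

```python
import torch
import torch.nn as nn

D, d = 100_000, 200                       # full parameter count vs. intrinsic dimension
theta_0 = torch.randn(D)                  # frozen pre-trained parameters (flattened)
F = torch.randn(D, d) / d ** 0.5          # fixed random projection (stand-in for Fastfood)
theta_d = nn.Parameter(torch.zeros(d))    # the only trainable parameters

def current_parameters() -> torch.Tensor:
    # theta = theta_0 + F(theta_d): all updates live in a d-dimensional subspace
    return theta_0 + F @ theta_d

optimizer = torch.optim.Adam([theta_d], lr=1e-2)
# In a real setup, current_parameters() would be reshaped back into the model's
# weight tensors before each forward pass, and the task loss backpropagated into theta_d.
```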

Experimental Results:

  • Tests on the MRPC (Microsoft Research Paraphrase Corpus) dataset showed that:
    • Larger models require proportionally lower intrinsic dimensionality
    • With just 200 parameters projected into a space of millions or billions of dimensions, it's possible to achieve 90% of full fine-tuning performance
    • This discovery suggests that efficient fine-tuning is not only possible but also more effective in larger models


In the original paper, they used the MRPC dataset and calculated the intrinsic dimension for each pre-trained model using the SAID method.

Limitations:

  • Although it reduces the number of trainable parameters, it still requires updating all model weights
  • Calculating intrinsic dimensionality can be computationally expensive
  • The efficiency of the Fastfood transformation may be compromised on specific deep learning hardware
  • Practical implementation can be more complex than methods like LoRA

Intrinsic SAID, despite its practical limitations for very large models, established crucial theoretical foundations that directly influenced the development of more practical methods like LoRA. Its main contribution was mathematically demonstrating that efficient fine-tuning in low-dimensional subspaces is not only possible but an intrinsic property of large-scale models. This discovery paved the way for a new generation of efficient fine-tuning techniques, significantly influencing the direction of research in language model adaptation.


Low-Rank Adaptation (LoRA)

Introduced by Hu et al. in 2021, LoRA drastically reduces computational costs by decomposing weight updates into lower-rank matrices, minimizing the number of trainable parameters and memory consumption.


Concept:
Deep learning models, like LLMs (Large Language Models), depend on weight matrices that store parameters learned during pre-training. In traditional fine-tuning, these weight matrices (W) are directly updated. LoRA, on the other hand, represents these updates (ΔW) as the product of two low-rank matrices, W_A and W_B: ΔW = W_A × W_B

This decomposition drastically reduces the number of trainable parameters while keeping the original weight matrix frozen. This approach is possible due to the concept of intrinsic dimensionality, which suggests that large models can learn efficiently in much smaller subspaces. In 2020, researchers at Facebook demonstrated this in the intrinsic dimensionality paper (Aghajanyan et al., 2020).

Implementation:
LoRA implementation follows these steps:

  • Decomposition: The weight update matrix ΔW (of size m×n) is decomposed into two smaller matrices A (m×r) and B (r×n), where r (the rank) is an adjustable hyperparameter.

    A result from linear algebra known as rank factorization states that any matrix of size (m, n) and rank r >= 1 can be written as the product of two matrices of sizes (m, r) and (r, n).

    For example, a (3, 3) matrix of rank 1 can be decomposed into two matrices of sizes (3, 1) and (1, 3).

    For large matrices, such as LLM layers, this reduction has a significant effect. Consider an LLM layer whose single weight matrix holds 256 million parameters:

    W shape = (16000, 16000)
    

    Assuming the matrix has rank 300, W can be decomposed into two matrices, P and Q, of rank 300. The shapes of P and Q will be:

    P shape = (16000, 300)  
    Q shape = (300, 16000)
    

    The total number of parameters with decomposition will be:

    16000 * 300 + 300 * 16000 = 9600000 = 9.6 million
    
  • Training: Only matrices (A) and (B) are trained, while (W) remains frozen.

  • Merging: After fine-tuning, (ΔW) can be merged back into matrix (W), maintaining the model's performance.

In practice, LoRA is typically applied to Transformer attention blocks, such as the query and value projection matrices (W_q and W_v) in multi-head attention modules.
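In code, this is a few lines with the PEFT library; the checkpoint and target module names below are placeholders and depend on the architecture:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the decomposition
    lora_alpha=16,                         # scaling factor (alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)
model = get_peft_model(model, config)      # W stays frozen; only A and B are trained
model.print_trainable_parameters()
```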

Efficiency/Performance:

  • Parameter Reduction: A model like GPT-3, with 175 billion parameters, needs to train only about 37.7 million parameters with LoRA, a nearly 5,000-fold reduction in the number of trainable parameters.
  • Versatility: LoRA can be selectively applied to parts of the model, such as specific layers, to further optimize performance.
  • Performance: Studies show that models fine-tuned with LoRA present results comparable to full fine-tuning, but with lower computational and memory costs.

Applications:

  • Language Models: LoRA was initially designed for LLMs, enabling efficient adjustment of models like GPT, PaLM, and LLaMA.
  • Diffusion Models: Due to its efficiency, LoRA has become a popular choice for image generation models like DALL-E and Stable Diffusion.
  • Variant Creation: The method facilitates the creation of multiple lightweight variants of a base model, adapted to different tasks.

Comparison and Additional Benefits:

  • Overcoming Other PEFT Methods: LoRA frequently outperforms techniques like BitFit and Adapters, especially in very large models.
  • Combinability: The method is orthogonal to other efficient adjustments, like quantization, allowing combinations such as QLoRA.
  • Latency Elimination: The possibility of merging adjusted weights into the base model eliminates additional inference latencies, making LoRA ideal for real-time applications.

With these advantages, LoRA represents a milestone in efficient training of large-scale models, impressively balancing cost and performance.

Quantized Low-Rank Adaptation (QLoRA)

Introduced by Dettmers et al., 2023, QLoRA combines the efficiency of quantization with the low-rank adaptation approach of LoRA, further optimizing fine-tuning for large-scale language models. This technique allows models with up to 65 billion parameters to be fine-tuned on limited GPUs (such as a single 48GB GPU), while preserving the performance of traditional 16-bit fine-tuning methods.


Concept:

QLoRA employs quantization to reduce the numerical precision of model weights, minimizing memory usage while maintaining computational efficiency. Simultaneously, LoRA is applied to perform weight updates with low-rank matrices. This powerful combination significantly reduces computational and memory costs without compromising performance.

Key innovations include:

  • 4-bit Normal Float (NF4): A new data type specifically designed to efficiently represent weights that follow a normal distribution (common in LLMs) using only 4 bits per element. NF4 has demonstrated superiority over FP4 and Int4, significantly improving post-quantization accuracy, as evidenced by lower average perplexity (27.41 vs. 31.07) in tests with models like OPT, BLOOM, LLaMA, and Pythia.
  • Double Quantization: Implemented via the bitsandbytes library (integrated with Hugging Face Transformers), this reduces average memory usage by also quantizing the quantization constants.
  • Paged Optimizers: Manage memory spikes during training.

Implementation:

  • Quantization: Model weights are quantized to 4 bits using the NF4 technique, reducing memory requirements without losing relevant information.
  • Fine-tuning: Low-rank matrices are trained (similar to LoRA), while the original weights remain quantized and frozen.
  • Preserved Performance: The approach maintains model precision and effectiveness, even with significant resource reductions.
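A minimal sketch of this recipe, combining 4-bit NF4 quantization (via bitsandbytes) with a LoRA adapter through Transformers and PEFT; the checkpoint and hyperparameters are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)   # casting / gradient-checkpointing helpers

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)       # frozen 4-bit base + trainable LoRA matrices
model.print_trainable_parameters()
```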

Experimental Results:

Experiments with QLoRA demonstrated that it replicates the performance of traditional 16-bit fine-tuning methods, even using 4-bit quantization. This was shown in academic benchmarks such as GLUE, Super-NaturalInstructions, and MMLU.

Table 2: Pile Common Crawl mean perplexity for different data types for 125M to 13B OPT, BLOOM, LLaMA, and Pythia models.
| Data type | Mean PPL |
|---|---|
| Int4 | 34.34 |
| Float4 (E2M1) | 31.07 |
| Float4 (E3M0) | 29.48 |
| NFloat4 + DQ | 27.41 |

As shown in Table 2, the NF4 format with double quantization (NFloat4 + DQ) achieves the lowest average perplexity (27.41) among all tested 4-bit formats, highlighting its efficacy in preserving information. The paper's results also show that QLoRA maintains performance close to or equal to 16-bit training (BF16) across different model scales and tasks.

Efficiency:

  • Memory Reduction: 4-bit quantization reduces memory usage by up to 75% compared to FP16 weights.
  • Computational Cost: Combining LoRA with quantization enables fine-tuning of extremely large models using more accessible hardware.
  • Optimized Performance: Even with limited resources, QLoRA achieves results comparable to traditional methods, demonstrating practical applicability.

By combining quantization and low-rank adaptation, QLoRA democratizes the fine-tuning of LLMs, enabling researchers and developers with limited computational resources to work with large-scale models without compromising quality.


Initialization

By default, the PEFT library initializes the LoRA weights with Kaiming-uniform for matrix A and zeros for matrix B, so the update ΔW = BA is zero at the start of training and the adapted model initially behaves exactly like the base model, consistent with the reference implementation.
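A from-scratch sketch of this initialization scheme (simplified: no dropout and no bias handling):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = base(x) + (alpha / r) * x A^T B^T, with the base weights frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze pre-trained weights
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zeros -> ΔW = 0 at start
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))          # Kaiming-uniform init for A
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 16, 768))   # identical to the base layer before any training step
```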

PiSSA


PiSSA initializes the LoRA adapter with the principal singular values and singular vectors of the model's original weight matrix W. This initialization enables PiSSA to converge faster than the standard initialization and to achieve superior performance by focusing fine-tuning on the model's most relevant components while keeping the residual components frozen. Moreover, PiSSA reduces quantization error compared to QLoRA, leading to additional improvements.

Comparing PiSSA and LoRA on NLU tasks.
| Method | Parameters | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B |
|---|---|---|---|---|---|---|---|---|---|
| **RoBERTa-large (355M)** | | | | | | | | | |
| Full FT | 355M | 90.2 | 96.4 | 90.9 | 68.0 | 94.7 | 92.2 | 86.6 | 91.5 |
| LoRA | 1.84M | 90.6 | 96.2 | 90.9 | 68.2 | 94.9 | 91.6 | 87.4 | 92.6 |
| PiSSA | 1.84M | 90.7 | 96.7 | 91.9 | 69.0 | 95.1 | 91.6 | 91.0 | 92.9 |
| **DeBERTa-v3-base (184M)** | | | | | | | | | |
| Full FT | 184M | 89.90 | 95.63 | 89.46 | 69.19 | 94.03 | 92.40 | 83.75 | 91.60 |
| LoRA | 1.33M | 90.65 | 94.95 | 89.95 | 69.82 | 93.87 | 91.99 | 85.20 | 91.60 |
| PiSSA | 1.33M | 90.43 | 95.87 | 91.67 | 72.64 | 94.29 | 92.26 | 87.00 | 91.88 |


The figures above illustrate PiSSA's advantages in accelerating convergence and reducing quantization error, as evidenced by the experiments in the published paper.

Experiments show that PiSSA consistently outperforms LoRA across various models and tasks, including natural language generation and comprehension benchmarks. For example, in the GSM8K benchmark, the Mistral-7B model fine-tuned with PiSSA achieved 72.86% accuracy, surpassing the 67.7% achieved with LoRA, an improvement of 5.16 percentage points. Furthermore, PiSSA excels in reducing quantization error compared to QLoRA. The quantized version, QPiSSA, achieved 86.05% accuracy in the same benchmark, outperforming QLoRA's 81.73%.

Thanks to compatibility with quantization techniques and the use of fast SVD for initialization, PiSSA offers a memory-efficient and performance-enhancing solution without sacrificing training speed. Additional evaluations on the GLUE benchmark using RoBERTa-large and DeBERTa-v3-base confirm that PiSSA outperforms LoRA in 14 out of 16 tested tasks, demonstrating superior fine-tuning capability and reduced training loss.
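Recent versions of the PEFT library expose PiSSA through the init_lora_weights option of LoraConfig; a brief, hedged sketch (the target module names are placeholders and option availability depends on the installed version):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # placeholder module names
    init_lora_weights="pissa",             # initialize A and B from the principal singular components of W
    # "pissa_niter_16" would use fast (randomized) SVD with 16 iterations instead of a full SVD
)
```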

OLoRA


OLoRA (Orthonormal Low-Rank Adaptation) is an efficient parameter adaptation technique that uses QR decomposition to initialize LoRA adapters. Instead of directly applying adaptation to the model weights, OLoRA introduces an orthonormal transformation to the pre-trained weight matrix before any adjustment. It decomposes the weight matrix 𝑊 into an orthogonal matrix 𝑄 and an upper triangular matrix 𝑅. This approach provides greater training stability, accelerates convergence, and leads to superior performance.

OLoRA is applied independently to each model layer, using the adapted weight matrices during forward propagation, while gradients are calculated only concerning the adaptation matrices during backpropagation. This preserves the original model's knowledge, enabling efficient adjustments with low computational cost.

Although QR decomposition has an initial computational cost of 𝑂(𝑚𝑛𝑟), it is performed only once per layer during initialization, making the overhead negligible compared to the total cost of training large-scale models.


These figures show the fine-tuning loss curves for models such as TinyLlama-1.1B (at several ranks), Gemma-2B, and OPT-1.3B, where OLoRA converges more rapidly. These experiments are detailed in the original paper.
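As with PiSSA, recent PEFT releases accept this initialization through LoraConfig; a brief, hedged sketch (module names are placeholders and the option depends on the installed version):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj"],   # placeholder module names
    init_lora_weights="olora",             # QR-based orthonormal initialization of the adapter
)
```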

Rank-stabilized LoRA (rsLoRA)

The LoRA architecture scales each adapter during every forward pass by a fixed scalar set at initialization, which depends on the rank r. In the original implementation this scalar is `lora_alpha / r`, whereas rank-stabilized LoRA (rsLoRA) uses `lora_alpha / sqrt(r)`, stabilizing the adapters and enhancing performance potential when using higher ranks.
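The change is only in the scaling factor; with the PEFT library it can be toggled via a flag (assuming a version that exposes it):

```python
from peft import LoraConfig

r, alpha = 256, 16
config = LoraConfig(
    r=r,
    lora_alpha=alpha,
    use_rslora=True,   # scale adapters by alpha / sqrt(r) instead of alpha / r
)
print(alpha / r, alpha / r ** 0.5)   # 0.0625 vs. 1.0: the rsLoRA scale shrinks much more slowly
```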



In experiments, rsLoRA demonstrated superior performance to LoRA in high-rank configurations. While training the LLaMA 2 model on 20,000 examples from the OpenOrca instruction dataset, it maintained gradient stability even with high ranks, whereas standard LoRA suffered from gradient collapse and lower learning efficiency.


Weight-Decomposed Low-Rank Adaptation (DoRA)


DoRA decomposes weight updates into two components: magnitude and direction. Direction is handled by standard LoRA, while magnitude is managed by a separate learnable parameter. This approach enhances LoRA's performance, especially at lower ranks.

Experiments demonstrated that DoRA consistently outperforms LoRA and other fine-tuning methods across models such as LLaMA, LLaVA, and VL-BART in various downstream tasks, including commonsense reasoning, visual instruction tuning, and text-image/video comprehension. For example, on the LLaMA-7B model, DoRA improved average accuracy by 3.7% compared to LoRA on commonsense reasoning datasets, even surpassing ChatGPT's accuracy levels. For larger models like LLaMA-13B, DoRA achieved similar performance to parallel adapters while using only a quarter of the trainable parameters and without increasing inference costs.
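In the PEFT library, DoRA is enabled on top of a standard LoRA configuration (assuming a version that supports the flag; module names are placeholders):

```python
from peft import LoraConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # placeholder module names
    use_dora=True,   # learn a separate magnitude vector; LoRA handles the direction component
)
```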

Paper Code: GitHub - NVlabs/DoRA: [ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation


Merge LoRA Adapter

While LoRA is significantly smaller and faster to train, latency issues may arise during inference due to the separate loading of the base model and the LoRA adapter. To eliminate this latency, you can use the **merge_and_unload()** function to merge the adapter weights with the base model. This enables you to use the newly merged model as a standalone model. The merge_and_unload() function does not retain the adapter weights in memory.

Below is a diagram illustrating the intuition behind merging the LoRA adapter:

merge-lora

Source: Hugging Face

To keep a separate copy of the weights and allow decoupling of the adapter, you can use **merge_adapter()** and unmerge_adapter().
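A brief sketch of both workflows, assuming a recent PEFT version and a LoRA adapter already trained and saved (the checkpoint and adapter path are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")   # placeholder
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")            # placeholder path

merged = model.merge_and_unload()   # folds ΔW into W; returns a plain model, adapter discarded
merged.save_pretrained("merged-model")

# Alternatively, merge in place but keep the adapter weights around:
model.merge_adapter()               # inference now uses W + ΔW with no extra latency
model.unmerge_adapter()             # restore the original W and the separate adapter
```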


Variations

Low-Rank Hadamard Product (LoHa)


Concept:
The Low-Rank Hadamard Product (LoHa) utilizes low-rank matrices combined via the Hadamard product (element-wise multiplication) instead of traditional matrix multiplication.

  • ∆W Representation: In LoHa, the weight update matrix ∆W is decomposed into four smaller matrices, and each pair of these low-rank matrices is combined using the Hadamard product. This allows the model to retain high rank and expressiveness while keeping the number of trainable parameters consistent.

Implementation:

  • Hadamard Product: LoHa relies on the Hadamard product instead of matrix multiplication to combine low-rank matrices, affecting the model’s structure and training.
  • Extension of LoRA: LoHa can be seen as an extension of Low-Rank Adaptation (LoRA), enhancing model capacity without adding more parameters.
  • Embedding Layers: Although LoHa is applicable to various models, it has not yet been fully implemented in Parameter-Efficient Fine-Tuning (PEFT) frameworks for embedding layers.

Efficiency and Performance:

  • Performance Trade-offs: LoHa balances model expressiveness and the number of parameters. It allows for higher rank and capacity without increasing computational load.
  • Federated Learning (FL): In FL, LoHa has shown a significant reduction in communication costs (3–10x lower) while maintaining comparable model performance. This is achieved via the FedPara method, which uses low-rank weights followed by the Hadamard product, making it more efficient than traditional low-rank approaches.

Applications:
LoHa was originally developed for computer vision tasks, particularly in diffusion models where generating diverse images is crucial.
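For experimentation, the PEFT library ships a LoHaConfig for this variant; a minimal sketch with placeholder module names:

```python
from peft import LoHaConfig, get_peft_model

config = LoHaConfig(
    r=8,
    alpha=16,
    target_modules=["q_proj", "v_proj"],   # placeholder module names
)
# model = get_peft_model(base_model, config)   # base_model: any supported PyTorch model
```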

Low-Rank Kronecker Product (LoKr)

LoKr is a LoRA variant, closely related to LoRA and LoHa, primarily applied in diffusion models but also adaptable to other model types. The main difference between LoKr and LoRA is that LoKr replaces traditional matrix multiplication with the Kronecker product. This decomposition creates a block matrix that preserves the structure and rank of the original weight matrix, ensuring the model retains its generalization capabilities during fine-tuning.

Key Advantages:

  • Vectorization Capability: The Kronecker product can be vectorized, meaning the matrix columns can be stacked into a vector. This reduces the need to fully reconstruct the adjustment matrix (∆W), accelerating fine-tuning and enhancing efficiency.
  • Additional Matrix Flexibility: LoKr allows for an optional third low-rank matrix, providing more refined control during fine-tuning.

Although initially designed for diffusion models, LoKr’s flexibility allows integration into a wide range of models, making it an efficient low-rank adaptation technique without compromising the base model’s performance.
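PEFT exposes LoKr through a configuration that mirrors LoHaConfig; a minimal sketch with placeholder module names:

```python
from peft import LoKrConfig

config = LoKrConfig(
    r=8,
    alpha=16,
    target_modules=["q_proj", "v_proj"],   # placeholder module names
)
```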

Mixture of LoRA Experts (X-LoRA)


Concept:
X-LoRA is an advanced low-rank adaptation (LoRA) method leveraging the concept of mixture of experts (MoE). It dynamically activates different LoRA experts using control mechanisms (gates) that can be dense or sparse. Unlike traditional MoE methods, X-LoRA keeps both LoRA experts and the base model frozen during training, training only the control layers (gates), thereby reducing complexity and training costs.

Implementation:

  • Dual Pass: During inference, X-LoRA performs two steps. First, the model generates hidden states without applying LoRA adapters. Then, these states are used to compute adjustments from LoRA adapters, dynamically reorganizing and selecting the most suitable experts for the task.
  • Control Layers: Control layers decide which LoRA experts are activated and adjust their scales precisely, both at the layer and token levels. These are the only parts trained in X-LoRA.
  • Custom Adaptation: X-LoRA allows fine-grained adaptation, activating specific LoRA experts for particular layers and tokens, resulting in highly personalized tuning.

Efficiency and Performance:

  • Parameter Efficiency: Since only control layers are trained, while the base model and LoRA experts remain frozen, X-LoRA significantly reduces the number of parameters adjusted, enabling lightweight training without compromising performance.
  • Dynamic Knowledge Recovery: The dual-pass mechanism allows the model to "reflect" on its outputs, dynamically adjusting its predictions for improved accuracy and context sensitivity.

Applications:
X-LoRA excels in scientific and technical domains, such as materials analysis, protein mechanics, and molecular design. Its ability to dynamically combine knowledge from different experts makes it highly effective for complex, interdisciplinary problems like predicting nanomechanical properties or molecular behaviors.


Comparison and Highlights:

  • Traditional LoRA vs. X-LoRA: Unlike standard LoRA, which applies fixed adaptations, X-LoRA introduces dynamic flexibility, enabling more intelligent and specialized adjustments.
  • Biological Inspiration: The design of X-LoRA is inspired by biological principles, such as component reuse across hierarchies, enhancing its versatility and applicability across diverse fields.


These experimental results show knowledge recall evaluation.

  • (a) Results from a bioinspired recall exam, where X-LoRA outperforms other models despite being smaller (7B parameters vs. 13B).
  • (b) Benchmark results for mechanics/materials knowledge recall.
  • (c) Results in biology, materials, protein properties, logic, and reasoning domains, focusing on challenging questions.

KronA


Concept

KronA is a parameter-efficient fine-tuning method that extends the idea of matrix factorization used in LoRA by leveraging the Kronecker product. The Kronecker product enables improved order efficiency, i.e., it retains or enhances the rank of the original weight matrices being factorized (Edalati et al., 2022).

The Kronecker product used in KronA is represented as δW = W_A ⊗ W_B, where W_A and W_B are the matrices involved in the factorization. This approach allows KronA to achieve a better rank-to-parameter ratio compared to traditional matrix factorization methods.

The figure above illustrates the structure of the proposed Kronecker-based modules and their low-rank counterparts. In figure d), KronA^Bres is shown, with the residual connection represented by the dotted line.

Implementation


KronA implements a Kronecker product-vector operation x(A⊗B), avoiding the explicit representation of the update matrix δW. This results in significant computational speedups during both training and inference.
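The key computational trick is that a Kronecker-structured update can be applied to a vector without ever materializing δW = W_A ⊗ W_B. A small PyTorch check of the underlying identity (A ⊗ B)·vec(X) = vec(B X Aᵀ), with vec denoting column stacking:

```python
import torch

def vec(X: torch.Tensor) -> torch.Tensor:
    """Column-stacking vectorization."""
    return X.T.reshape(-1)

p, m, q, n = 3, 4, 5, 6
A = torch.randn(p, m)
B = torch.randn(q, n)
X = torch.randn(n, m)   # vec(X) has length m*n, matching the columns of A ⊗ B

# Naive: materialize the (p*q) x (m*n) Kronecker product explicitly
naive = torch.kron(A, B) @ vec(X)

# Efficient: (A ⊗ B) vec(X) = vec(B X Aᵀ), never forming A ⊗ B
efficient = vec(B @ X @ A.T)

print(torch.allclose(naive, efficient, atol=1e-5))   # True
```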

KronA also introduces KronA^Bres, a variant that includes a residual connection and a parallel adapter structure, further optimizing weight parameterization through the Kronecker product.

Efficiency/Performance

KronA is highly parameter-efficient, updating only about 0.07% of the total model parameters, similar to methods like LoRA and Compacter. However, it achieves better performance, particularly on tasks within the GLUE benchmark.

The method is not only parameter-efficient but also faster during inference compared to adapter-based methods like Compacter. This is largely due to the efficient computation of the Kronecker product, which avoids the overhead associated with explicit matrix representations.

Applications

KronA is particularly useful for models with smaller parameter counts (less than 1 billion). It is well-suited for tasks where parameter efficiency and inference speed are critical, such as real-time applications or resource-constrained environments.

The method has demonstrated strong performance in natural language processing tasks, specifically on the GLUE benchmark, where it matches or outperforms other fine-tuning methods while maintaining a similar or smaller parameter footprint.


Selective Tuning

Selective methods adjust a subset of the model's existing parameters. This can involve layer-depth selection, type-based selection, or even individual parameter selection.

BitFit

Concept

BitFit is an efficient fine-tuning technique proposed by Ben-Zaken et al. (2021), which focuses on adjusting only the bias terms of pre-trained models instead of updating all layer weights. In linear or convolutional layers, the weight matrix W remains unchanged, while only the bias vector b is optimized.

Implementation

The implementation of BitFit is straightforward. The bias terms are selected from the model parameters, and the optimizer works exclusively on these parameters.

This approach modifies only about 0.05% of the model's total parameters, making BitFit highly efficient in terms of storage and computation.
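A minimal sketch of this selection in PyTorch, assuming parameter names follow the usual `*.bias` convention of Transformers models (in practice the task-specific classification head is usually also left trainable; omitted here for brevity):

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

trainable = []
for name, param in model.named_parameters():
    if name.endswith(".bias"):
        param.requires_grad = True     # only bias terms are updated
        trainable.append(param)
    else:
        param.requires_grad = False    # all weight matrices stay frozen

optimizer = torch.optim.AdamW(trainable, lr=1e-3)
n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"training {n_trainable / n_total:.4%} of parameters")   # on the order of 0.1%
```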

Efficiency/Performance

BitFit demonstrates excellent memory and training time efficiency. By updating a minimal fraction of parameters, it drastically reduces resource requirements, such as memory and computational capacity, especially compared to full fine-tuning. It has shown effectiveness in smaller models, particularly in small to medium data scenarios, where its performance is comparable to or even surpasses full fine-tuning.

However, when applied to larger models, such as T0-3B or GPT-3, BitFit may become less competitive, trailing behind full fine-tuning or other efficient fine-tuning approaches like LoRA or Prefix Tuning. This is because modifying only the bias terms may be insufficient to capture the complexity of training data in larger models.

Applications

BitFit is especially useful in scenarios where computational resources are limited or where rapid model training on new data is required without the capacity for full fine-tuning. It is particularly effective in domains with small or medium training datasets, such as adapting pre-trained language models to specific tasks (e.g., text classification or question answering).

Comparison with Other Techniques

Compared to other efficient fine-tuning techniques like LoRA and Adapters, BitFit is the simplest method, as it only modifies biases. However, its simplicity can lead to inferior performance in scenarios involving large datasets or large models, where techniques like LoRA tend to be more effective.

| Method | % Param | QNLI | SST-2 | MNLIm | MNLImm | CoLA | MRPC | STS-B | RTE | QQP | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Train size | | 105k | 67k | 393k | 393k | 8.5k | 3.7k | 7k | 2.5k | 364k | |
| (V) Full-FT | 100% | 93.5 | 94.1 | 86.5 | 87.1 | 62.8 | 91.9 | 89.8 | 71.8 | 87.6 | 84.8 |
| (V) Full-FT | 100% | 91.7±0.1 | 93.4±0.2 | 85.5±0.4 | 85.7±0.4 | 62.2±1.2 | 90.7±0.3 | 90.0±0.4 | 71.9±1.3 | 87.5±0.4 | 84.1 |
| (V) Diff-Prune | 0.5% | 93.4 | 94.2 | 86.4 | 86.9 | 63.5 | 91.3 | 89.5 | 71.5 | 86.6 | 84.6 |
| (V) BitFit | 0.08% | 91.4±2.4 | 93.2±0.4 | 84.4±0.2 | 84.8±0.1 | 63.6±0.7 | 91.7±0.5 | 90.3±0.1 | 73.2±3.7 | 85.4±0.1 | 84.2 |
| (T) Full-FT | 100% | 91.1 | 94.9 | 86.7 | 85.9 | 60.5 | 89.3 | 87.6 | 70.1 | 72.1 | 81.8 |
| (T) Full-FT | 100% | 93.4 | 94.1 | 86.7 | 86.0 | 59.6 | 88.9 | 86.6 | 71.2 | 71.7 | 81.5 |
| (T) Adapters | 3.6% | 90.7 | 94.0 | 84.9 | 85.1 | 59.5 | 89.5 | 86.9 | 71.5 | 71.8 | 81.1 |
| (T) Diff-Prune | 0.5% | 93.3 | 94.1 | 86.4 | 86.0 | 61.1 | 89.7 | 86.0 | 70.6 | 71.1 | 81.5 |
| (T) BitFit | 0.08% | 92.0 | 94.2 | 84.5 | 84.8 | 59.7 | 88.9 | 85.5 | 72.0 | 70.5 | 80.9 |

Paper code: GitHub - benzakenelad/BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models

DiffPruning

Concept:

DiffPruning, proposed by Guo et al. (2020), is an efficient fine-tuning technique aimed at updating neural network weights sparsely. The method introduces a learnable binary mask, represented by δ = z ◦ ∆W, where "◦" is the Hadamard product (element-wise multiplication). This mask is trained alongside the model during fine-tuning as part of a regularization objective.

Implementation:
During fine-tuning, both model weights and the learnable binary mask that decides which parameters will be actually updated are adjusted. This results in a highly efficient and sparse update, reducing the number of modified parameters without affecting the model's performance.

Efficiency and Performance:

DiffPruning achieves exceptional parameter efficiency by modifying only about 0.5% of the model's parameters in smaller-scale configurations (<1B parameters). It is therefore most suitable for multi-task edge applications (such as mobile devices) where storage is limited. However, despite minimizing the number of modified parameters, the technique needs more memory during training, because all parameters are optimized along with the learnable binary mask.

Application:
It is especially useful in scenarios where multiple tasks need to be managed efficiently with restricted storage, such as mobile devices. It is also applicable in situations where new tasks arrive in continuous flow or from different providers, as only a small task-specific difference vector needs to be stored. The technique proved comparable to full fine-tuning in benchmarks like GLUE, while modifying only a fraction of the model's parameters.

Comparison and Additional Points:
It offers favorable scalability as the number of tasks increases, requiring only storage of a small difference vector per task. Although more efficient in terms of modified parameters, the memory cost during training can be higher than in traditional fine-tuning approaches, due to the need to optimize all parameters along with the learnable binary mask.

  • Sparsity Control: The differentiable approach to L0 norm allows DiffPruning to promote sparse updates in a controlled manner, making it useful for scenarios where saving space and computational resources is crucial.
  • Multiple Task Adaptation: By not requiring simultaneous access to all tasks during training, DiffPruning is a viable solution for devices that need to adapt to new tasks continuously, without the need to recalibrate the entire model.


In this figure, the left shows the average performance on the GLUE validation set at different target sparsity rates for the methods. The right shows results with BERTlarge on the SQuAD v1.1 validation set.

Freeze and Reconfigure (FAR)


Concept:

The Freeze and Reconfigure (FAR) technique is an efficient fine-tuning method aimed at reducing memory consumption and accelerating the training of large language models like BERT. FAR works by freezing part of the model's parameters and focusing only on adjusting the most important parameters. The goal is to reduce resource usage during training, especially in edge scenarios where storage and computational power are limited. The method also reconfigures the model architecture to group frozen and trainable parameters separately, optimizing memory operations.

Implementation:
It operates in two steps:

  • Parameter Identification: Using a learning metric based on L1 norm, it evaluates which parts of the model are important for adjustment. Then, the columns of the matrices that need to be adjusted are selected.
  • Dynamic Reconfiguration: The model's parameters and linear layers are divided into frozen and trainable components. During training, matrix multiplications are performed separately for the components, and the results are concatenated to generate the output.

Efficiency and Performance:
In experiments with DistilBERT on GLUE and SQuAD tasks, it managed to freeze up to 60% of the model's parameters, reducing training time by 30% and memory access time by 47%, without significant performance loss in metrics. This approach provides great flexibility when using modern hardware and frameworks like PyTorch, and after training, parameters can be reconfigured, eliminating any negative impact on inference.

Application:
FAR is particularly effective in edge scenarios, such as mobile devices, where resources are limited. Its main application has been in models like DistilBERT, used in NLP tasks like the GLUE benchmark and SQuAD 2.0, where it showed performance comparable to full fine-tuning, but with only a fraction of parameters being updated (approximately 6%).

Comparison and Additional Points:
Compared with BitFit, which freezes all weights of dense layers, FAR showed superior performance, especially in more complex tasks like SQuAD 2.0. BitFit had a sharp performance drop, particularly in more challenging tasks, where it performed nearly 20% worse in EM and F1 metrics. This demonstrates that FAR is more effective in handling the complexity of these tasks in compressed models like DistilBERT, offering a balance between resource efficiency and performance.



FishMask


Concept:

FishMask (Fisher-Induced Sparse uncHanging mask) is an efficient fine-tuning technique based on sparse parameter updates, where the parameters to be adjusted are selected based on Fisher information. By estimating the importance of each parameter with Fisher information, FishMask creates a sparse mask, allowing only a fixed subset of parameters to be updated during training while the rest remain frozen. The main objective is to optimize performance while reducing memory and communication costs, especially in distributed learning and transfer learning scenarios.

Implementation:

  • Fisher information calculation: The importance of each parameter is estimated through a diagonal approximation of Fisher information, with the formula:

    \hat{F}_\theta = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{y \sim p_\theta(y \mid x_i)} \left( \nabla_\theta \log p_\theta(y \mid x_i) \right)^2

    The calculation is performed after computing gradients for all parameters in batches.

  • Selection: After Fisher calculation, parameters with the highest Fisher value are selected and adjusted. The selection is based on a percentage threshold.

  • Mask: A mask is created to indicate parameter selection.
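A simplified sketch of the steps above, using the label-based (empirical) gradient as an approximation of the expectation in the Fisher formula; the dataloader and loss function are assumed to exist:

```python
import torch

def compute_fish_mask(model, dataloader, loss_fn, keep_ratio=0.05, n_batches=16):
    """Estimate a diagonal (empirical) Fisher from squared gradients and keep only
    the top `keep_ratio` fraction of parameters for later training."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for i, (x, y) in enumerate(dataloader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()          # empirical Fisher: uses the labels y
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    # Global threshold: keep the parameters with the largest Fisher scores
    all_scores = torch.cat([f.flatten() for f in fisher.values()])
    k = max(1, int(keep_ratio * all_scores.numel()))
    threshold = torch.topk(all_scores, k).values.min()
    return {n: (f >= threshold) for n, f in fisher.items()}   # boolean mask per tensor

# During fine-tuning, gradients of unselected parameters would be zeroed with this mask
# before each optimizer step, so only the selected subset is ever updated.
```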

Efficiency:
The FishMask technique is designed to reduce memory and communication costs in distributed environments, without significantly compromising model performance. By updating only a small fraction of parameters (typically 1% to 10%), FishMask can maintain performance comparable to methods like Adapters, while being more memory-efficient.

In terms of performance, FishMask presents results similar to techniques like Adapters, but falls short of more advanced methods like LoRA and (IA)³.

Applications:
FishMask is especially useful in scenarios where updating all model parameters would be unfeasible.

Comparison and Additional Points:
Compared to other sparse fine-tuning methods, such as BitFit and DiffPruning, FishMask stands out for pre-computing a fixed mask of important parameters, which avoids the need for dynamic adjustments during training. This provides a significant reduction in computational overhead, especially on modern hardware that may have limited support for dynamic sparse operations.

In experiments, FishMask demonstrated performance comparable to Adapters, but with lower memory cost. However, it does not achieve the performance level of techniques like LoRA and (IA)³, which can adjust parameters more precisely and efficiently across a variety of tasks.

