LoRA
Low-Rank Adaptation (LoRA) is a PEFT method that decomposes a large matrix into two smaller low-rank matrices in the attention layers. This drastically reduces the number of parameters that need to be fine-tuned.
The abstract from the paper is:
We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction (0.08%) of the pretrained parameters. These inserted matrices are optimized through a discriminative training objective along with a correlation-based regularization loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets with decreased training times by factors between 5.4 and 3.6..
LoraConfig
class peft.LoraConfig
< source >( peft_type: Union = None auto_mapping: Optional = None base_model_name_or_path: Optional = None revision: Optional = None task_type: Union = None inference_mode: bool = False r: int = 8 target_modules: Optional[Union[list[str], str]] = None exclude_modules: Optional[Union[list[str], str]] = None lora_alpha: int = 8 lora_dropout: float = 0.0 fan_in_fan_out: bool = False bias: Literal['none', 'all', 'lora_only'] = 'none' use_rslora: bool = False modules_to_save: Optional[list[str]] = None init_lora_weights: bool | Literal['gaussian', 'olora', 'pissa', 'pissa_niter_[number of iters]', 'loftq'] = True layers_to_transform: Optional[Union[list[int], int]] = None layers_pattern: Optional[Union[list[str], str]] = None rank_pattern: Optional[dict] = <factory> alpha_pattern: Optional[dict] = <factory> megatron_config: Optional[dict] = None megatron_core: Optional[str] = 'megatron.core' loftq_config: Union[LoftQConfig, dict] = <factory> use_dora: bool = False layer_replication: Optional[list[tuple[int, int]]] = None runtime_config: LoraRuntimeConfig = <factory> )
Parameters
- r (
int
) — Lora attention dimension (the “rank”). - target_modules (
Optional[Union[List[str], str]]
) — The names of the modules to apply the adapter to. If this is specified, only the modules with the specified names will be replaced. When passing a string, a regex match will be performed. When passing a list of strings, either an exact match will be performed or it is checked if the name of the module ends with any of the passed strings. If this is specified as ‘all-linear’, then all linear/Conv1D modules are chosen, excluding the output layer. If this is not specified, modules will be chosen according to the model architecture. If the architecture is not known, an error will be raised — in this case, you should specify the target modules manually. - exclude_modules (
Optional[Union[List[str], str]]
) — The names of the modules to not apply the adapter. When passing a string, a regex match will be performed. When passing a list of strings, either an exact match will be performed or it is checked if the name of the module ends with any of the passed strings. - lora_alpha (
int
) — The alpha parameter for Lora scaling. - lora_dropout (
float
) — The dropout probability for Lora layers. - fan_in_fan_out (
bool
) — Set this to True if the layer to replace stores weight like (fan_in, fan_out). For example, gpt-2 usesConv1D
which stores weights like (fan_in, fan_out) and hence this should be set toTrue
. - bias (
str
) — Bias type for LoRA. Can be ‘none’, ‘all’ or ‘lora_only’. If ‘all’ or ‘lora_only’, the corresponding biases will be updated during training. Be aware that this means that, even when disabling the adapters, the model will not produce the same output as the base model would have without adaptation. - use_rslora (
bool
) — When set to True, uses Rank-Stabilized LoRA which sets the adapter scaling factor tolora_alpha/math.sqrt(r)
, since it was proven to work better. Otherwise, it will use the original default value oflora_alpha/r
. - modules_to_save (
List[str]
) — List of modules apart from adapter layers to be set as trainable and saved in the final checkpoint. - init_lora_weights (
bool
|Literal["gaussian", "olora", "pissa", "pissa_niter_[number of iters]", "loftq"]
) — How to initialize the weights of the adapter layers. Passing True (default) results in the default initialization from the reference implementation from Microsoft. Passing ‘gaussian’ results in Gaussian initialization scaled by the LoRA rank for linear and layers. Setting the initialization to False leads to completely random initialization and is discouraged. Pass'loftq'
to use LoftQ initialization. Pass'olora'
to use OLoRA initialization. Passing'pissa'
results in the initialization of Principal Singular values and Singular vectors Adaptation (PiSSA), which converges more rapidly than LoRA and ultimately achieves superior performance. Moreover, PiSSA reduces the quantization error compared to QLoRA, leading to further enhancements. Passing'pissa_niter_[number of iters]'
initiates Fast-SVD-based PiSSA initialization, where[number of iters]
indicates the number of subspace iterations to perform FSVD, and must be a nonnegative integer. When[number of iters]
is set to 16, it can complete the initialization of a 7B model within seconds, and the training effect is approximately equivalent to using SVD. - layers_to_transform (
Union[List[int], int]
) — The layer indices to transform. If a list of ints is passed, it will apply the adapter to the layer indices that are specified in this list. If a single integer is passed, it will apply the transformations on the layer at this index. - layers_pattern (
Optional[Union[List[str], str]]
) — The layer pattern name, used only iflayers_to_transform
is different fromNone
. This should target thenn.ModuleList
of the model, which is often called'layers'
or'h'
. - rank_pattern (
dict
) — The mapping from layer names or regexp expression to ranks which are different from the default rank specified byr
. - alpha_pattern (
dict
) — The mapping from layer names or regexp expression to alphas which are different from the default alpha specified bylora_alpha
. - megatron_config (
Optional[dict]
) — The TransformerConfig arguments for Megatron. It is used to create LoRA’s parallel linear layer. You can get it like this,core_transformer_config_from_args(get_args())
, these two functions being from Megatron. The arguments will be used to initialize the TransformerConfig of Megatron. You need to specify this parameter when you want to apply LoRA to the ColumnParallelLinear and RowParallelLinear layers of megatron. - megatron_core (
Optional[str]
) — The core module from Megatron to use, defaults to"megatron.core"
. - loftq_config (
Optional[LoftQConfig]
) — The configuration of LoftQ. If this is not None, then LoftQ will be used to quantize the backbone weights and initialize Lora layers. Also passinit_lora_weights='loftq'
. Note that you should not pass a quantized model in this case, as LoftQ will quantize the model itself. - use_dora (
bool
) — Enable ‘Weight-Decomposed Low-Rank Adaptation’ (DoRA). This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA especially at low ranks. Right now, DoRA only supports linear and Conv2D layers. DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference. For more information, see https://arxiv.org/abs/2402.09353. - layer_replication (
List[Tuple[int, int]]
) — Build a new stack of layers by stacking the original model layers according to the ranges specified. This allows expanding (or shrinking) the model without duplicating the base model weights. The new layers will all have separate LoRA adapters attached to them. - runtime_config (
LoraRuntimeConfig
) — Runtime configurations (which are not saved or restored).
This is the configuration class to store the configuration of a LoraModel.
Returns the configuration for your adapter model as a dictionary. Removes runtime configurations.
LoraModel
class peft.LoraModel
< source >( model config adapter_name low_cpu_mem_usage: bool = False ) → torch.nn.Module
Parameters
- model (
torch.nn.Module
) — The model to be adapted. - config (LoraConfig) — The configuration of the Lora model.
- adapter_name (
str
) — The name of the adapter, defaults to"default"
. - low_cpu_mem_usage (
bool
,optional
, defaults toFalse
) — Create empty adapter weights on meta device. Useful to speed up the loading process.
Returns
torch.nn.Module
The Lora model.
Creates Low Rank Adapter (LoRA) model from a pretrained transformers model.
The method is described in detail in https://arxiv.org/abs/2106.09685.
Example:
>>> from transformers import AutoModelForSeq2SeqLM
>>> from peft import LoraModel, LoraConfig
>>> config = LoraConfig(
... task_type="SEQ_2_SEQ_LM",
... r=8,
... lora_alpha=32,
... target_modules=["q", "v"],
... lora_dropout=0.01,
... )
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> lora_model = LoraModel(model, config, "default")
>>> import torch
>>> import transformers
>>> from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
>>> rank = ...
>>> target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "fc_in", "fc_out", "wte"]
>>> config = LoraConfig(
... r=4, lora_alpha=16, target_modules=target_modules, lora_dropout=0.1, bias="none", task_type="CAUSAL_LM"
... )
>>> quantization_config = transformers.BitsAndBytesConfig(load_in_8bit=True)
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(
... "kakaobrain/kogpt",
... revision="KoGPT6B-ryan1.5b-float16", # or float32 version: revision=KoGPT6B-ryan1.5b
... bos_token="[BOS]",
... eos_token="[EOS]",
... unk_token="[UNK]",
... pad_token="[PAD]",
... mask_token="[MASK]",
... )
>>> model = transformers.GPTJForCausalLM.from_pretrained(
... "kakaobrain/kogpt",
... revision="KoGPT6B-ryan1.5b-float16", # or float32 version: revision=KoGPT6B-ryan1.5b
... pad_token_id=tokenizer.eos_token_id,
... use_cache=False,
... device_map={"": rank},
... torch_dtype=torch.float16,
... quantization_config=quantization_config,
... )
>>> model = prepare_model_for_kbit_training(model)
>>> lora_model = get_peft_model(model, config)
Attributes:
- model (PreTrainedModel) — The model to be adapted.
- peft_config (LoraConfig): The configuration of the Lora model.
add_weighted_adapter
< source >( adapters: list[str] weights: list[float] adapter_name: str combination_type: str = 'svd' svd_rank: int | None = None svd_clamp: int | None = None svd_full_matrices: bool = True svd_driver: str | None = None density: float | None = None majority_sign_method: Literal['total', 'frequency'] = 'total' )
Parameters
- adapters (
list
) — List of adapter names to be merged. - weights (
list
) — List of weights for each adapter. - adapter_name (
str
) — Name of the new adapter. - combination_type (
str
) — The merging type can be one of [svd
,linear
,cat
,ties
,ties_svd
,dare_ties
,dare_linear
,dare_ties_svd
,dare_linear_svd
,magnitude_prune
,magnitude_prune_svd
]. When using thecat
combination_type, the rank of the resulting adapter is equal to the sum of all adapters ranks (the mixed adapter may be too big and result in OOM errors). - svd_rank (
int
, optional) — Rank of output adapter for svd. If None provided, will use max rank of merging adapters. - svd_clamp (
float
, optional) — A quantile threshold for clamping SVD decomposition output. If None is provided, do not perform clamping. Defaults to None. - svd_full_matrices (
bool
, optional) — Controls whether to compute the full or reduced SVD, and consequently, the shape of the returned tensors U and Vh. Defaults to True. - svd_driver (
str
, optional) — Name of the cuSOLVER method to be used. This keyword argument only works when merging on CUDA. Can be one of [None,gesvd
,gesvdj
,gesvda
]. For more info please refer totorch.linalg.svd
documentation. Defaults to None. - density (
float
, optional) — Value between 0 and 1. 0 means all values are pruned and 1 means no values are pruned. Should be used with [ties
,ties_svd
,dare_ties
,dare_linear
,dare_ties_svd
,dare_linear_svd
,magnintude_prune
,magnitude_prune_svd
] - majority_sign_method (
str
) — The method, should be one of [“total”, “frequency”], to use to get the magnitude of the sign values. Should be used with [ties
,ties_svd
,dare_ties
,dare_ties_svd
]
This method adds a new adapter by merging the given adapters with the given weights.
When using the cat
combination_type you should be aware that rank of the resulting adapter will be equal to
the sum of all adapters ranks. So it’s possible that the mixed adapter may become too big and result in OOM
errors.
delete_adapter
< source >( adapter_name: str )
Deletes an existing adapter.
Disable all adapters.
When disabling all adapters, the model output corresponds to the output of the base model.
Enable all adapters.
Call this if you have previously disabled all adapters and want to re-enable them.
merge_and_unload
< source >( progressbar: bool = False safe_merge: bool = False adapter_names: Optional[list[str]] = None )
Parameters
- progressbar (
bool
) — whether to show a progressbar indicating the unload and merge process - safe_merge (
bool
) — whether to activate the safe merging check to check if there is any potential Nan in the adapter weights - adapter_names (
List[str]
, optional) — The list of adapter names that should be merged. If None, all active adapters will be merged. Defaults toNone
.
This method merges the LoRa layers into the base model. This is needed if someone wants to use the base model as a standalone model.
Example:
>>> from transformers import AutoModelForCausalLM
>>> from peft import PeftModel
>>> base_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b")
>>> peft_model_id = "smangrul/falcon-40B-int4-peft-lora-sfttrainer-sample"
>>> model = PeftModel.from_pretrained(base_model, peft_model_id)
>>> merged_model = model.merge_and_unload()
set_adapter
< source >( adapter_name: str | list[str] )
Set the active adapter(s).
Additionally, this function will set the specified adapters to trainable (i.e., requires_grad=True). If this is not desired, use the following code.
subtract_mutated_init
< source >( output_state_dict: dict[str, torch.Tensor] adapter_name: str kwargs = None )
This function can calculate the updates of the [PiSSA | OLoRA] by comparing the parameters of the [PiSSA |
OLoRA] adapter in output_state_dict
with the initial values of [PiSSA | OLoRA] in adapter_name
, thus
converting [PiSSA | OLoRA] to LoRA.
Gets back the base model by removing all the lora modules without merging. This gives back the original base model.
Utility
peft.replace_lora_weights_loftq
< source >( peft_model model_path: Optional[str] = None adapter_name: str = 'default' callback: Optional[Callable[[torch.nn.Module, str], bool]] = None )
Parameters
- peft_model (
PeftModel
) — The model to replace the weights of. Must be a quantized PEFT model with LoRA layers. - model_path (
Optional[str]
) — The path to the model safetensors file. If the model is a Hugging Face model, this will be inferred from the model’s config. Otherwise, it must be provided. - adapter_name (
str
) — The name of the adapter to replace the weights of. The default adapter name is “default”. - callback (
Optional[Callable[[PeftModel, str], bool]]
) — A callback function that will be called after each module is replaced. The callback function should take the model and the name of the current module as input and return a boolean indicating whether the replacement should be kept. If the callback returns False, the replacement will be rolled back. This can be very useful to confirm that the LoftQ initialization actually decreases the quantization error of the model. As an example, this callback could generate logits for given input and compare it with the logits from the original, non-quanitzed model with the same input, and only returnTrue
if there is an improvement. As this is a greedy optimization, it’s possible that calling this function multiple times yields incremental improvements.
Replace the LoRA weights of a model quantized with bitsandbytes, using the LoftQ technique.
The replacement is done on the fly by loading in the non-quantized weights from a locally stored safetensors model file and initializing the LoRA weights such that the quantization error between the original and quantized weights is minimized.
As lazy loading is not possible with pickle, normal PyTorch checkpoint files cannot be supported.
Depending on the model size, calling this function may take some time to finish.