VeRA: Vector-based Random Matrix Adaptation
VeRA is a parameter-efficient fine-tuning technique that is similar to LoRA but requires even fewer extra parameters while promising similar or even better performance. As such, it is particularly useful when the parameter budget is very limited, e.g. when scaling to very large models. The reduction of the count of trainable parameters is achieved by sharing the same low-rank matrices across all layers, and only training two additional vectors per layer.
When saving the adapter parameters, it’s possible to eschew storing the low rank matrices by setting save_projection=False
on the VeraConfig
. In that case, these matrices will be restored based on the fixed random seed from the projection_prng_key
argument. This cuts down on the size of the checkpoint, but we cannot guarantee reproducibility on all devices and for all future versions of PyTorch. If you want to ensure reproducibility, set save_projection=True
(which is the default).
To handle different shapes of adapted layers, VeRA initializes shared A and B matrices with the largest required size for each dimension. During the forward pass, submatrices A and B for a given layer are sliced out from these shared matrices and used as described in the paper. For example, adapting two linear layers of shapes (100, 20) and (80, 50) will create A and B matrices of shapes (rank, 50) and (100, rank) respectively. Then, to adapt a layer of shape (100, 20), submatrices A and B of shapes (rank, 20) and (100, rank) will be extracted.
VeRA currently has the following constraint:
- Only
nn.Linear
layers are supported.
The abstract from the paper is:
Low-rank adapation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, image classification tasks, and show its application in instruction-tuning of 7B and 13B language models.
VeRAConfig
class peft.VeraConfig
< source >( peft_type: Union = None auto_mapping: Optional = None base_model_name_or_path: Optional = None revision: Optional = None task_type: Union = None inference_mode: bool = False r: int = 256 target_modules: Optional[Union[list[str], str]] = None projection_prng_key: int = 0 save_projection: bool = True vera_dropout: float = 0.0 d_initial: float = 0.1 fan_in_fan_out: bool = False bias: str = 'none' modules_to_save: Optional[list[str]] = None init_weights: bool = True layers_to_transform: Optional[Union[list[int], int]] = None layers_pattern: Optional[Union[list[str], str]] = None )
Parameters
- r (
int
, optional, defaults to256
) — VeRA parameter dimension (“rank”). Choose higher values than LoRA ranks here, since VeRA uses far fewer parameters than LoRA (see Table 1). - target_modules (
Union[List[str], str]
) — The names of the modules to apply Vera to. Only linear layers are supported. - projection_prng_key (
int
) — Vera PRNG init key. Used for initialising vera_A and vera_B for new models or when loading a checkpoint that did not include these projections. Defaults to0
. - save_projection (
bool
) — Whether to save the vera_A / vera_B projections in the state dict alongside per layer lambda_b / lambda_d weights. This will increase the size of the checkpoint, but guarantee that we can reload the checkpoint on all system configurations. Defaults toTrue
. - vera_dropout (
float
) — The dropout probability for Vera layers. - d_initial (
float
, optional, defaults to0.1
) — Initial init value forvera_lambda_d
vector used when initializing the VeRA parameters. Small values (<=0.1) are recommended (see Table 6c in the paper). - fan_in_fan_out (
bool
) — Set this to True if the layer to replace stores weight like (fan_in, fan_out). For example, gpt-2 usesConv1D
which stores weights like (fan_in, fan_out) and hence this should be set toTrue
. - bias (
str
) — Bias type for Vera. Can be ‘none’, ‘all’ or ‘vera_only’. If ‘all’ or ‘vera_only’, the corresponding biases will be updated during training. Be aware that this means that, even when disabling the adapters, the model will not produce the same output as the base model would have without adaptation. - modules_to_save (
List[str]
) — List of modules apart from Vera layers to be set as trainable and saved in the final checkpoint. - init_weights (
bool
) — Whether to initialize the weights of the Vera layers with their default initialization. Don’t change this setting, except if you know exactly what you’re doing. - layers_to_transform (
Union[List[int],int]
) — The layer indexes to transform, if this argument is specified, it will apply the Vera transformations on the layer indexes that are specified in this list. If a single integer is passed, it will apply the Vera transformations on the layer at this index. - layers_pattern (
Optional[Union[List[str], str]]
) — The layer pattern name, used only iflayers_to_transform
is different fromNone
. This should target thenn.ModuleList
of the model, which is often called'layers'
or'h'
.
This is the configuration class to store the configuration of a VeraModel.
Paper: https://arxiv.org/abs/2310.11454.
VeRAModel
class peft.VeraModel
< source >( model config adapter_name low_cpu_mem_usage: bool = False ) → torch.nn.Module
Parameters
- model (PreTrainedModel) — The model to be adapted.
- config (VeraConfig) — The configuration of the Vera model.
- adapter_name (
str
) — The name of the adapter, defaults to"default"
. - low_cpu_mem_usage (
bool
,optional
, defaults toFalse
) — Create empty adapter weights on meta device. Useful to speed up the loading process.
Returns
torch.nn.Module
The Vera model.
Creates Vector-based Random Matrix Adaptation (Vera) model from a pretrained transformers model.
Example:
>>> from transformers import AutoModelForCausalLM
>>> from peft import VeraConfig, get_peft_model
>>> base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
>>> config = VeraConfig(r=128)
>>> model = get_peft_model(base_model, config)
Attributes:
- model (PreTrainedModel) — The model to be adapted.
- peft_config (VeraConfig): The configuration of the Vera model.
delete_adapter
< source >( adapter_name: str )
Deletes an existing adapter.
merge_and_unload
< source >( progressbar: bool = False safe_merge: bool = False adapter_names: Optional[list[str]] = None )
Parameters
- progressbar (
bool
) — whether to show a progressbar indicating the unload and merge process - safe_merge (
bool
) — whether to activate the safe merging check to check if there is any potential Nan in the adapter weights - adapter_names (
list[str]
, optional) — The list of adapter names that should be merged. If None, all active adapters will be merged. Defaults toNone
.
This method merges the Vera layers into the base model. This is needed if someone wants to use the base model as a standalone model.
Example:
>>> from transformers import AutoModelForCausalLM
>>> from peft import PeftModel
>>> base_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b")
>>> peft_model_id = "smangrul/falcon-40B-int4-peft-lora-sfttrainer-sample"
>>> model = PeftModel.from_pretrained(base_model, peft_model_id)
>>> merged_model = model.merge_and_unload()
Gets back the base model by removing all the Vera modules without merging. This gives back the original base model.