Optimization
This page contains the API reference documentation for learning rate optimizers included in timm.
Optimizers
Factory functions
timm.optim.create_optimizer: legacy optimizer factory, kept for backwards compatibility. NOTE: Use create_optimizer_v2 for new code.
timm.optim.create_optimizer_v2
( model_or_params, opt: str = 'sgd', lr: Optional = None, weight_decay: float = 0.0, momentum: float = 0.9, foreach: Optional = None, filter_bias_and_bn: bool = True, layer_decay: Optional = None, param_group_fn: Optional = None, **kwargs )
Parameters
- model_or_params (nn.Module) — model containing parameters to optimize
- opt — name of optimizer to create
- lr — initial learning rate
- weight_decay — weight decay to apply in optimizer
- momentum — momentum for momentum based optimizers (others may use betas via kwargs)
- foreach — enable / disable foreach (multi-tensor) operation if True / False; choose safe default if None
- filter_bias_and_bn — filter out bias, bn and other 1d params from weight decay
- **kwargs — extra optimizer specific kwargs to pass through
Create an optimizer.
TODO: currently the model is passed in and all parameters are selected for optimization. For more general use, an interface that allows selection of parameters to optimize and lr groups is needed, one of:
- a filter fn interface that further breaks params into groups in a weight_decay compatible fashion
- expose the parameters interface and leave it up to caller
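As a rough illustration of what the filter_bias_and_bn option described above implies, the sketch below splits parameters into a no-decay group (biases and other 1d params) and a decay group. It is a plain-Python sketch under stated assumptions, not timm's implementation: the helper `split_weight_decay_groups` and the `(name, shape)` tuples are hypothetical stand-ins for a model's named parameters.

```python
def split_weight_decay_groups(named_shapes, weight_decay=0.05):
    """Split parameters into no-decay / decay groups.

    named_shapes: list of (name, shape) tuples standing in for a model's
    named parameters (hypothetical; a real model would yield tensors).
    """
    decay, no_decay = [], []
    for name, shape in named_shapes:
        # Biases and other 1d params (e.g. norm scales) skip weight decay.
        if len(shape) <= 1 or name.endswith(".bias"):
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": no_decay, "weight_decay": 0.0},
        {"params": decay, "weight_decay": weight_decay},
    ]


# Hypothetical parameter names and shapes for a small conv net.
example = [
    ("conv1.weight", (64, 3, 7, 7)),
    ("bn1.weight", (64,)),
    ("bn1.bias", (64,)),
    ("fc.weight", (1000, 512)),
    ("fc.bias", (1000,)),
]
groups = split_weight_decay_groups(example)
for group in groups:
    print(group["weight_decay"], group["params"])
```

A list of dicts in this shape is the usual way parameter groups are handed to a PyTorch-style optimizer, which is why each optimizer below accepts "dicts defining parameter groups".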
Optimizer Classes
class timm.optim.AdaBelief
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-16, weight_decay = 0, amsgrad = False, decoupled_decay = True, fixed_decay = False, rectify = True, degenerated_to_sgd = True )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) — learning rate (default: 1e-3)
- betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-16)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
- amsgrad (boolean, optional) — whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
- decoupled_decay (boolean, optional) — (default: True) If set as True, then the optimizer uses decoupled weight decay as in AdamW
- fixed_decay (boolean, optional) — (default: False) This is used when decoupled_decay is set as True. When fixed_decay == True, the weight decay is performed as $W_{new} = W_{old} - W_{old} \times decay$. When fixed_decay == False, the weight decay is performed as $W_{new} = W_{old} - W_{old} \times decay \times lr$. Note that in this case, the weight decay ratio decreases with learning rate (lr).
- rectify (boolean, optional) — (default: True) If set as True, then perform the rectified update similar to RAdam
- degenerated_to_sgd (boolean, optional) — (default: True) If set as True, then perform SGD update when variance of gradient is high
Implements the AdaBelief algorithm. Modified from Adam in PyTorch.
Reference: AdaBelief Optimizer, adapting stepsizes by the belief in observed gradients, NeurIPS 2020
For a complete table of recommended hyperparameters, see https://github.com/juntang-zhuang/Adabelief-Optimizer
For example train/args for EfficientNet, see these gists:
- link to train_script: https://gist.github.com/juntang-zhuang/0a501dd51c02278d952cf159bc233037
- link to args.yaml: https://gist.github.com/juntang-zhuang/517ce3c27022b908bb93f78e4f786dc3
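The two weight decay rules described for fixed_decay can be compared on a single scalar weight. This is a minimal sketch of the decay term only (no gradient update), and the helper name `decoupled_decay_step` is hypothetical rather than part of timm.

```python
def decoupled_decay_step(w_old, decay, lr, fixed_decay):
    """Apply only the decoupled weight-decay part of an update to a scalar."""
    if fixed_decay:
        # W_new = W_old - W_old * decay
        return w_old - w_old * decay
    # W_new = W_old - W_old * decay * lr
    return w_old - w_old * decay * lr


w = 2.0
fixed = decoupled_decay_step(w, decay=0.01, lr=0.001, fixed_decay=True)
scaled = decoupled_decay_step(w, decay=0.01, lr=0.001, fixed_decay=False)
print(fixed, scaled)
```

With fixed_decay == False the shrinkage is multiplied by lr, so it becomes much smaller as the learning rate is annealed, which is the effect the parameter description points out.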
step
( closure = None )
Performs a single optimization step.
class timm.optim.Adafactor
( params, lr = None, eps = 1e-30, eps_scale = 0.001, clip_threshold = 1.0, decay_rate = -0.8, betas = None, weight_decay = 0.0, scale_parameter = True, warmup_init = False )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) — external learning rate (default: None)
- eps (tuple[float, float]) — regularization constants for square gradient and parameter scale respectively (default: (1e-30, 1e-3))
- clip_threshold (float) — threshold of root mean square of final gradient update (default: 1.0)
- decay_rate (float) — coefficient used to compute running averages of square gradient (default: -0.8)
- beta1 (float) — coefficient used for computing running averages of gradient (default: None)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
- scale_parameter (bool) — if True, learning rate is scaled by root mean square of parameter (default: True)
- warmup_init (bool) — time-dependent learning rate computation depends on whether warm-up initialization is being used (default: False)
Implements Adafactor algorithm.
This implementation is based on: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
(see https://arxiv.org/abs/1804.04235)
Note that this optimizer internally adjusts the learning rate depending on the scale_parameter, relative_step and warmup_init options.
To use a manual (external) learning rate schedule you should set scale_parameter=False and relative_step=False.
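The clip_threshold parameter is described above as a threshold on the root mean square of the final update. One way to read that, following the update clipping described in the Adafactor paper, is sketched below in plain Python; the helper names `rms` and `clip_update_by_rms` are hypothetical, and the actual tensor-based code in timm may differ in detail.

```python
import math


def rms(values):
    """Root mean square of a list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def clip_update_by_rms(update, clip_threshold=1.0):
    """Scale the update down only when its RMS exceeds clip_threshold."""
    scale = max(1.0, rms(update) / clip_threshold)
    return [u / scale for u in update]


large = clip_update_by_rms([3.0, 4.0])   # RMS above 1.0, gets rescaled
small = clip_update_by_rms([0.3, 0.4])   # RMS below 1.0, left unchanged
print(rms(large), small)
```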
step
( closure = None )
Performs a single optimization step.
class timm.optim.Adahessian
( params, lr = 0.1, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0.0, hessian_power = 1.0, update_each = 1, n_samples = 1, avg_conv_kernel = False )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) — learning rate (default: 0.1)
- betas ((float, float), optional) — coefficients used for computing running averages of gradient and the squared hessian trace (default: (0.9, 0.999))
- eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0.0)
- hessian_power (float, optional) — exponent of the hessian trace (default: 1.0)
- update_each (int, optional) — compute the hessian trace approximation only after this number of steps (to save time) (default: 1)
- n_samples (int, optional) — how many times to sample z for the approximation of the hessian trace (default: 1)
Implements the AdaHessian algorithm from “ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning”
Gets all parameters in all param_groups with gradients
Computes the Hutchinson approximation of the hessian trace and accumulates it for each trainable parameter.
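The Hutchinson approximation mentioned above can be sketched on a small explicit matrix: for a random vector z with entries of ±1, the average of z ⊙ (Hz) over samples estimates the diagonal of H, with n_samples controlling how many z are drawn. The code below is a plain-Python illustration under that assumption; the helpers `matvec` and `hutchinson_diag` are hypothetical, and a real optimizer would obtain Hz from automatic differentiation rather than from an explicit matrix.

```python
import random


def matvec(h, z):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(h_ij * z_j for h_ij, z_j in zip(row, z)) for row in h]


def hutchinson_diag(h, n_samples, rng):
    """Estimate diag(h) as the sample mean of z * (h @ z), z in {-1, +1}^n."""
    n = len(h)
    estimate = [0.0] * n
    for _ in range(n_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        hz = matvec(h, z)
        for i in range(n):
            estimate[i] += z[i] * hz[i] / n_samples
    return estimate


rng = random.Random(0)

# For a diagonal matrix a single sample is already exact.
exact = hutchinson_diag([[2.0, 0.0], [0.0, 5.0]], n_samples=1, rng=rng)

# With off-diagonal terms the estimate is noisy and improves with more samples.
approx = hutchinson_diag([[2.0, 1.0], [1.0, 5.0]], n_samples=2000, rng=rng)
print(exact, approx)
```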
step
( closure = None )
Performs a single optimization step.
Zeros out the accumulated hessian traces.
class timm.optim.AdamP
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, delta = 0.1, wd_ratio = 0.1, nesterov = False )
class timm.optim.AdamW
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0.01, amsgrad = False )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) — learning rate (default: 1e-3)
- betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional) — weight decay coefficient (default: 1e-2)
- amsgrad (boolean, optional) — whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (https://openreview.net/forum?id=ryQu7f-RZ) (default: False)
Implements AdamW algorithm.
The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization (https://arxiv.org/abs/1412.6980).
The AdamW variant was proposed in Decoupled Weight Decay Regularization (https://arxiv.org/abs/1711.05101).
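To make the "decoupled" part concrete, here is a minimal scalar sketch of one AdamW step as described in the Decoupled Weight Decay Regularization paper: the weight is shrunk directly by lr * weight_decay, and the Adam update is then computed from the raw gradient. The function `adamw_step` is a hypothetical helper for illustration (without the amsgrad option), not the timm class.

```python
import math


def adamw_step(p, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
               weight_decay=1e-2):
    """One AdamW update for a scalar parameter p with gradient g at step t >= 1."""
    beta1, beta2 = betas
    # Decoupled weight decay: shrink the weight, independent of the gradient.
    p = p * (1.0 - lr * weight_decay)
    # Running averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction for the zero-initialised averages.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


p, m, v = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(p, m, v)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly lr plus the small decay term.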
step
( closure = None )
Performs a single optimization step.
class timm.optim.Lamb
( params, lr = 0.001, bias_correction = True, betas = (0.9, 0.999), eps = 1e-06, weight_decay = 0.01, grad_averaging = True, max_grad_norm = 1.0, trust_clip = False, always_adapt = False )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups.
- lr (float, optional) — learning rate. (default: 1e-3)
- betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its norm. (default: (0.9, 0.999))
- eps (float, optional) — term added to the denominator to improve numerical stability. (default: 1e-6)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0.01)
- grad_averaging (bool, optional) — whether to apply (1-beta2) to grad when calculating running averages of gradient. (default: True)
- max_grad_norm (float, optional) — value used to clip global grad norm (default: 1.0)
- trust_clip (bool) — enable LAMBC trust ratio clipping (default: False)
- always_adapt (boolean, optional) — Apply adaptive learning rate to 0.0 weight decay parameter (default: False)
Implements a pure PyTorch variant of the FusedLAMB (NvLamb variant) optimizer from apex.optimizers.FusedLAMB.
Reference: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/Transformer-XL/pytorch/lamb.py
LAMB was proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (https://arxiv.org/abs/1904.00962).
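The max_grad_norm parameter clips the global gradient norm, meaning the norm taken over all parameters together rather than per tensor. A minimal plain-Python sketch of that idea follows; `clip_by_global_norm` is a hypothetical helper, and the way timm applies the clipping inside step may differ in detail.

```python
import math


def clip_by_global_norm(grads, max_grad_norm=1.0):
    """Rescale all gradients so their joint L2 norm is at most max_grad_norm.

    grads: list of per-parameter gradients, each a list of floats.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Only shrink; gradients already within the limit are left untouched.
    divisor = global_norm / max_grad_norm if global_norm > max_grad_norm else 1.0
    return [[g / divisor for g in grad] for grad in grads], global_norm


clipped, norm_before = clip_by_global_norm([[3.0], [0.0, 4.0]], max_grad_norm=1.0)
print(norm_before, clipped)
```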
step
( closure = None )
Performs a single optimization step.
class timm.optim.Lars
( params, lr = 1.0, momentum = 0, dampening = 0, weight_decay = 0, nesterov = False, trust_coeff = 0.001, eps = 1e-08, trust_clip = False, always_adapt = False )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups.
- lr (float, optional) — learning rate (default: 1.0).
- momentum (float, optional) — momentum factor (default: 0)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
- dampening (float, optional) — dampening for momentum (default: 0)
- nesterov (bool, optional) — enables Nesterov momentum (default: False)
- trust_coeff (float) — trust coefficient for computing adaptive lr / trust_ratio (default: 0.001)
- eps (float) — eps for division denominator (default: 1e-8)
- trust_clip (bool) — enable LARC trust ratio clipping (default: False)
- always_adapt (bool) — always apply LARS LR adapt, otherwise only when group weight_decay != 0 (default: False)
LARS for PyTorch
Paper: Large batch training of Convolutional Networks
- https://arxiv.org/pdf/1708.03888.pdf
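The trust_coeff, eps and trust_clip parameters all feed into a per-layer trust ratio. The sketch below follows the formula from the LARS paper, with LARC-style clipping as one plausible reading of trust_clip; `lars_trust_ratio` and `l2_norm` are hypothetical helpers, and the details are an assumption rather than a copy of the timm code.

```python
import math


def l2_norm(values):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_trust_ratio(weights, grads, lr=1.0, weight_decay=0.0,
                     trust_coeff=0.001, eps=1e-8, trust_clip=False):
    """Layer-wise scale factor applied to the gradient before the SGD step."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0  # nothing to adapt for all-zero weights or gradients
    ratio = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + eps)
    if trust_clip:
        # LARC-style clipping: the effective lr never exceeds the base lr.
        ratio = min(ratio / lr, 1.0)
    return ratio


ratio = lars_trust_ratio([3.0, 4.0], [0.6, 0.8])
clipped_ratio = lars_trust_ratio([3.0, 4.0], [0.6, 0.8], lr=0.001, trust_clip=True)
print(ratio, clipped_ratio)
```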
step
( closure = None )
Performs a single optimization step.
class timm.optim.MADGRAD
( params: Any, lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0, eps: float = 1e-06, decoupled_decay: bool = False )
Parameters
- params (iterable) — Iterable of parameters to optimize or dicts defining parameter groups.
- lr (float) — Learning rate (default: 1e-2).
- momentum (float) — Momentum value in the range [0,1) (default: 0.9).
- weight_decay (float) — Weight decay, i.e. a L2 penalty (default: 0).
- eps (float) — Term added to the denominator outside of the root operation to improve numerical stability. (default: 1e-6).
MADGRAD: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization (https://arxiv.org/abs/2101.11075).
MADGRAD is a general purpose optimizer that can be used in place of SGD or Adam, and it may converge faster and generalize better. Currently GPU-only. Typically, the same learning rate schedule that is used for SGD or Adam may be used. The overall learning rate is not comparable to either method and should be determined by a hyper-parameter sweep.
MADGRAD requires less weight decay than other methods, often as little as zero. Momentum values used for SGD or Adam’s beta1 should work here also.
On sparse problems both weight_decay and momentum should be set to 0.
step
( closure: Optional = None )
Performs a single optimization step.
class timm.optim.Nadam
( params, lr = 0.002, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, schedule_decay = 0.004 )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) — learning rate (default: 2e-3)
- betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its square
- eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
- schedule_decay (float, optional) — momentum schedule decay (default: 4e-3)
Implements Nadam algorithm (a variant of Adam based on Nesterov momentum).
It was proposed in Incorporating Nesterov Momentum into Adam (http://cs229.stanford.edu/proj2015/054_report.pdf). See also http://www.cs.toronto.edu/~fritz/absps/momentum.pdf
Originally taken from: https://github.com/pytorch/pytorch/pull/1408
NOTE: Has potential issues but does work well on some problems.
step
( closure = None )
Performs a single optimization step.
class timm.optim.NvNovoGrad
( params, lr = 0.001, betas = (0.95, 0.98), eps = 1e-08, weight_decay = 0, grad_averaging = False, amsgrad = False )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) — learning rate (default: 1e-3)
- betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its square (default: (0.95, 0.98))
- eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
- grad_averaging — gradient averaging
- amsgrad (boolean, optional) — whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
Implements Novograd algorithm.
step
( closure = None )
Performs a single optimization step.
class timm.optim.RAdam
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0 )
class timm.optim.RMSpropTF
( params, lr = 0.01, alpha = 0.9, eps = 1e-10, weight_decay = 0, momentum = 0.0, centered = False, decoupled_decay = False, lr_in_momentum = True )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) — learning rate (default: 1e-2)
- momentum (float, optional) — momentum factor (default: 0)
- alpha (float, optional) — smoothing (decay) constant (default: 0.9)
- eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-10)
- centered (bool, optional) — if True, compute the centered RMSProp, the gradient is normalized by an estimation of its variance
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
- decoupled_decay (bool, optional) — decoupled weight decay as per https://arxiv.org/abs/1711.05101
- lr_in_momentum (bool, optional) — learning rate scaling is included in the momentum buffer update as per defaults in Tensorflow
Implements RMSprop algorithm (TensorFlow style epsilon)
NOTE: This is a direct cut-and-paste of PyTorch RMSprop with eps applied before sqrt and a few other modifications to more closely match TensorFlow for matching hyper-params.
Noteworthy changes include:
- Epsilon applied inside square-root
- square_avg initialized to ones
- LR scaling of update accumulated in momentum buffer
Proposed by G. Hinton in his course.
The centered version first appears in Generating Sequences With Recurrent Neural Networks.
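Two of the changes listed above, epsilon inside the square root and square_avg initialized to ones, mostly affect the first few steps. The scalar sketch below contrasts the two conventions without momentum or centering; `rmsprop_scaled_grad` is a hypothetical helper written for this comparison, not timm's implementation.

```python
import math


def rmsprop_scaled_grad(grad, square_avg, alpha=0.9, eps=1e-10,
                        eps_inside_sqrt=True):
    """Return the gradient divided by its running RMS, plus the new average."""
    square_avg = alpha * square_avg + (1.0 - alpha) * grad * grad
    if eps_inside_sqrt:
        denom = math.sqrt(square_avg + eps)   # TensorFlow-style epsilon
    else:
        denom = math.sqrt(square_avg) + eps   # PyTorch-style epsilon
    return grad / denom, square_avg


# First step with a tiny gradient.
# TensorFlow-style: square_avg starts at one, epsilon inside the root.
tf_step, tf_avg = rmsprop_scaled_grad(1e-6, square_avg=1.0, eps_inside_sqrt=True)
# PyTorch-style: square_avg starts at zero, epsilon outside the root.
pt_step, pt_avg = rmsprop_scaled_grad(1e-6, square_avg=0.0, eps_inside_sqrt=False)
print(tf_step, pt_step)
```

Starting square_avg at one keeps the first scaled gradient close to the raw gradient, whereas starting at zero normalizes even a tiny gradient up to order one, which is one reason hyper-parameters do not transfer directly between the two conventions.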
step
( closure = None )
Performs a single optimization step.
class timm.optim.SGDP
( params, lr = <required parameter>, momentum = 0, dampening = 0, weight_decay = 0, nesterov = False, eps = 1e-08, delta = 0.1, wd_ratio = 0.1 )