Optimizers

Optimizers update model parameters based on accumulated gradients.

Typically, the training flow looks like:

  1. Forward pass
  2. Compute loss
  3. Backward pass (loss.backward())
  4. Optimizer step (opt.step())
  5. Clear gradients (opt.zero_grad())
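
The five steps above can be sketched without KeyDNN itself. The snippet below is a minimal pure-Python illustration, not KeyDNN code: `ToyParam` and `ToySGD` are hypothetical stand-ins, and the gradient is computed by hand where a real loop would call loss.backward().

```python
class ToyParam:
    """Hypothetical scalar parameter (not a KeyDNN class)."""

    def __init__(self, value):
        self.value = value
        self.grad = None

    def zero_grad(self):
        self.grad = None


class ToySGD:
    """Hypothetical minimal optimizer exposing step() and zero_grad()."""

    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            if p.grad is None:  # nothing to apply for this parameter
                continue
            p.value -= self.lr * p.grad

    def zero_grad(self):
        for p in self.params:
            p.zero_grad()


def train(steps=50):
    w = ToyParam(0.0)
    opt = ToySGD([w], lr=0.1)
    x, target = 2.0, 6.0  # fit w so that w * x == target
    loss = None
    for _ in range(steps):
        pred = w.value * x                 # 1. forward pass
        loss = (pred - target) ** 2        # 2. compute loss
        grad = 2.0 * (pred - target) * x   # 3. backward pass, done by hand here
        w.grad = grad if w.grad is None else w.grad + grad
        opt.step()                         # 4. optimizer step
        opt.zero_grad()                    # 5. clear gradients
    return w.value, loss


final_w, final_loss = train()
```

In this toy problem `final_w` converges toward 3.0, since the loss is minimized when `w * 2.0 == 6.0`.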

Exact method names and behavior are documented in each optimizer’s docstring.


keydnn.Adam dataclass

Bases: _Optimizer

Adam optimizer.

Adam maintains exponentially decaying averages of past gradients (first moment) and past squared gradients (second moment), and applies bias correction to both estimates.

Update rule

Let g_t be the gradient at step t:

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * (g_t ** 2)

m_hat = m_t / (1 - beta1 ** t)
v_hat = v_t / (1 - beta2 ** t)

p <- p - lr * m_hat / (sqrt(v_hat) + eps)

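If weight_decay > 0 (classical L2 regularization), the decay term is added to the gradient, so it takes effect through the moment estimates m_t and v_t:

g_t <- g_t + weight_decay * p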
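
As a rough illustration of the update rule, here is a scalar version in plain Python. It is a sketch, not the KeyDNN implementation: the function name `adam_step` and its signature are hypothetical, and it assumes the step counter t is incremented before bias correction, so the first update uses t = 1.

```python
import math


def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """Apply one Adam update to a scalar parameter; return (p, m, v, t)."""
    if weight_decay > 0:
        g = g + weight_decay * p          # coupled L2 term folded into the gradient
    t += 1                                # assumption: t counts from 1
    m = beta1 * m + (1 - beta1) * g       # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v, t


# First update from zero-initialized state.
p, m, v, t = adam_step(p=1.0, g=0.5, m=0.0, v=0.0, t=0)
```

On the first step the bias correction gives m_hat = g_t and v_hat = g_t ** 2, so with weight_decay = 0 the parameter moves by roughly lr in the direction opposite the gradient, whatever the gradient's scale.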

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| params | Sequence[Parameter] | Parameters to be optimized. | required |
| lr | float | Learning rate. Must be positive. | 0.001 |
| betas | tuple[float, float] | Exponential decay rates for the first and second moments. Each must be in (0, 1). | (0.9, 0.999) |
| eps | float | Numerical stability epsilon added to the denominator. Must be positive. | 1e-08 |
| weight_decay | float | Classical L2 regularization coefficient (coupled). Must be non-negative. | 0.0 |

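
The constraints on these parameters could be enforced at construction time. The sketch below shows one way to do that with a dataclass `__post_init__` hook. `AdamHyperparams` and `is_valid` are hypothetical names, and raising `ValueError` is an assumption; this page does not say which exception KeyDNN raises.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class AdamHyperparams:
    """Hypothetical container mirroring the documented hyperparameters."""

    lr: float = 1e-3
    betas: Tuple[float, float] = (0.9, 0.999)
    eps: float = 1e-8
    weight_decay: float = 0.0

    def __post_init__(self):
        # ValueError is an assumption; KeyDNN may use a different exception.
        if not self.lr > 0:
            raise ValueError(f"lr must be positive, got {self.lr}")
        if not all(0.0 < b < 1.0 for b in self.betas):
            raise ValueError(f"each beta must be in (0, 1), got {self.betas}")
        if not self.eps > 0:
            raise ValueError(f"eps must be positive, got {self.eps}")
        if self.weight_decay < 0:
            raise ValueError(
                f"weight_decay must be non-negative, got {self.weight_decay}"
            )


def is_valid(**kwargs):
    """Return True if the given hyperparameters pass validation."""
    try:
        AdamHyperparams(**kwargs)
    except ValueError:
        return False
    return True
```
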
Notes
  • Parameters with grad is None are skipped.
  • Weight decay here is classical L2 (not decoupled AdamW).
  • Optimizer state (m, v, t) is stored per-parameter and persists across steps.
  • This optimizer is CPU-only in the current KeyDNN implementation.
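
To make the coupled versus decoupled distinction concrete, the sketch below compares a single first step under both schemes for a scalar parameter with zero gradient. The function `first_step` is hypothetical. The decoupled branch follows the AdamW idea and is shown only for contrast; it is not something this optimizer does.

```python
import math


def first_step(p, g, lr, weight_decay, decoupled,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step from zero state (t = 1) with either decay scheme."""
    if not decoupled:
        g = g + weight_decay * p           # coupled: decay enters the gradient
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)                # bias correction at t = 1
    v_hat = v / (1 - beta2)
    new_p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_p -= lr * weight_decay * p     # decoupled: decay applied to p directly
    return new_p


coupled = first_step(p=2.0, g=0.0, lr=0.1, weight_decay=0.01, decoupled=False)
decoupled = first_step(p=2.0, g=0.0, lr=0.1, weight_decay=0.01, decoupled=True)
```

With coupled L2 the decay term is rescaled by the adaptive denominator, so here the parameter moves by about lr (to roughly 1.9). The decoupled variant shrinks it by lr * weight_decay * p (to 1.998).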

zero_grad

zero_grad() -> None

Clear gradients for all managed parameters.

Notes

This calls zero_grad() on each parameter, which clears the stored gradient tensor (if any). Training loops typically call zero_grad() once per iteration, either right after step() or before the next backward pass, so that stale gradients are not accumulated into the new ones.
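
The effect of skipping zero_grad() can be seen in a small pure-Python sketch. `ToyParam` and its `accumulate` method are hypothetical stand-ins for a parameter whose backward pass adds into the stored gradient. Representing a cleared gradient as `None` is an assumption, chosen to match the "grad is None" wording used elsewhere on this page.

```python
class ToyParam:
    """Hypothetical scalar parameter (not a KeyDNN class)."""

    def __init__(self, value):
        self.value = value
        self.grad = None

    def accumulate(self, g):
        """Mimic a backward pass that adds into the stored gradient."""
        self.grad = g if self.grad is None else self.grad + g

    def zero_grad(self):
        # Assumption: clearing drops the stored gradient entirely.
        self.grad = None


p = ToyParam(1.0)

# Two backward passes without clearing: gradients add up.
p.accumulate(0.5)
p.accumulate(0.5)
stale = p.grad

# Clearing first gives only the latest gradient.
p.zero_grad()
p.accumulate(0.5)
fresh = p.grad
```
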

step

step() -> None

Apply one Adam update step to all managed parameters.

Notes
  • Parameters with grad is None are skipped.
  • This method is CPU-only and raises a device-not-supported error if either a parameter or its gradient resides on a non-CPU device.
  • Weight decay is implemented as classical L2 regularization (coupled with the gradient), not as decoupled AdamW.
  • Optimizer state is created lazily on the first update for each parameter.
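
The skipping and lazy-state behavior described in these notes can be sketched as follows. `ToyParam` and `LazyAdam` are hypothetical. Keying the state by `id(param)` is an illustrative choice, and the device check is omitted. None of this is taken from the KeyDNN source.

```python
import math


class ToyParam:
    """Hypothetical scalar parameter (not a KeyDNN class)."""

    def __init__(self, value, grad=None):
        self.value = value
        self.grad = grad


class LazyAdam:
    """Hypothetical Adam sketch with lazily created per-parameter state."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.state = {}  # id(param) -> {"m", "v", "t"}; empty until first update

    def step(self):
        for p in self.params:
            if p.grad is None:
                continue  # skipped: no update and no state allocated
            st = self.state.setdefault(id(p), {"m": 0.0, "v": 0.0, "t": 0})
            st["t"] += 1
            st["m"] = self.beta1 * st["m"] + (1 - self.beta1) * p.grad
            st["v"] = self.beta2 * st["v"] + (1 - self.beta2) * p.grad ** 2
            m_hat = st["m"] / (1 - self.beta1 ** st["t"])
            v_hat = st["v"] / (1 - self.beta2 ** st["t"])
            p.value -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


a = ToyParam(1.0, grad=0.5)
b = ToyParam(2.0)            # grad is None, so it is never touched
opt = LazyAdam([a, b])
opt.step()
opt.step()
```

After the two steps, only `a` has an entry in `opt.state` (with t equal to 2), and `b.value` is unchanged.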

keydnn.SGD dataclass

Bases: _Optimizer

Stochastic Gradient Descent (SGD) optimizer.

This optimizer updates parameters in-place using their accumulated gradients and a fixed learning rate.

Update rule

For each parameter p with gradient g:

  • If weight_decay > 0 (classical L2 regularization): g <- g + weight_decay * p
  • Parameter update: p <- p - lr * g
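
A scalar sketch of this rule in plain Python is shown below. The function `sgd_step` is a hypothetical illustration rather than part of KeyDNN. It returns the new value instead of updating in place.

```python
def sgd_step(p, g, lr=1e-3, weight_decay=0.0):
    """Return the parameter value after one SGD update."""
    if weight_decay > 0:
        g = g + weight_decay * p   # coupled L2 term added to the gradient
    return p - lr * g


plain = sgd_step(1.0, 0.5, lr=0.1)                      # 1.0 - 0.1 * 0.5
decayed = sgd_step(1.0, 0.5, lr=0.1, weight_decay=0.1)  # 1.0 - 0.1 * 0.6
```

Expanding the two lines gives p <- (1 - lr * weight_decay) * p - lr * g, which is why coupled L2 in plain SGD is often described as shrinking the weights by a constant factor each step.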

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| params | Sequence[Parameter] | Parameters to be optimized. | required |
| lr | float | Learning rate. Must be positive. | 0.001 |
| weight_decay | float | Classical L2 weight decay coefficient (coupled). Must be non-negative. | 0.0 |

Notes
  • Parameters with grad is None are skipped.
  • Momentum, Nesterov, and other SGD variants are intentionally omitted in this minimal implementation.
  • This optimizer is CPU-only in the current KeyDNN implementation.

zero_grad

zero_grad() -> None

Clear gradients for all managed parameters.

Notes

This calls zero_grad() on each parameter, which clears the stored gradient tensor (if any). Training loops typically call zero_grad() once per iteration, either right after step() or before the next backward pass, so that stale gradients are not accumulated into the new ones.

step

step() -> None

Apply one SGD update step to all managed parameters.

Notes
  • Parameters with grad is None are skipped.
  • Weight decay is implemented as classical L2 regularization (coupled with the gradient), not as decoupled weight decay.
  • The update is performed in-place on the parameter.

Notes

  • Both optimizers take a sequence of parameters (typically collected from a model).
  • Both support coupled L2 weight decay through weight_decay. SGD has no momentum or Nesterov option, and no learning-rate schedules are documented on this page.
  • Gradients accumulate across backward passes, so call zero_grad() once per training iteration unless accumulation is intended.