Optimizers

Optimizers update model parameters based on accumulated gradients.

Typically, the training flow looks like:

Forward pass
Compute loss
Backward pass (loss.backward())
Optimizer step (opt.step())
Clear gradients (opt.zero_grad())
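
The five steps above can be sketched with toy stand-ins. `ToyParam` and `ToySGD` are hypothetical classes written for this illustration, not KeyDNN's own, and the backward pass is computed by hand for a one-weight model rather than by an autograd engine.

```python
# Toy stand-ins (hypothetical, not KeyDNN classes) that mirror the five-step flow.

class ToyParam:
    def __init__(self, value):
        self.value = value
        self.grad = None  # no gradient until a backward pass runs

    def zero_grad(self):
        self.grad = None


class ToySGD:
    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            if p.grad is None:  # parameters without a gradient are skipped
                continue
            p.value -= self.lr * p.grad

    def zero_grad(self):
        for p in self.params:
            p.zero_grad()


def train(steps=50):
    w = ToyParam(0.0)
    opt = ToySGD([w], lr=0.1)
    x, target = 2.0, 6.0  # fit w so that w * x == target, i.e. w == 3
    loss = None
    for _ in range(steps):
        pred = w.value * x                 # 1. forward pass
        loss = (pred - target) ** 2        # 2. compute loss
        grad = 2.0 * (pred - target) * x   # 3. backward pass (by hand here)
        w.grad = grad if w.grad is None else w.grad + grad
        opt.step()                         # 4. optimizer step
        opt.zero_grad()                    # 5. clear gradients
    return w.value, loss


final_w, final_loss = train()
```

With these settings the weight converges towards 3, the value at which `w * x` equals the target.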

Exact method names and behavior are documented in each optimizer’s docstring.

keydnn.Adam `dataclass`

Bases: _Optimizer

Adam optimizer.

Adam maintains exponentially decaying averages of past gradients (first moment) and past squared gradients (second moment), and applies bias correction to both estimates.

Update rule

Let g_t be the gradient at step t:

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * (g_t ** 2)

m_hat = m_t / (1 - beta1 ** t)
v_hat = v_t / (1 - beta2 ** t)

p <- p - lr * m_hat / (sqrt(v_hat) + eps)

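If weight_decay > 0 (classical L2 regularization), the decay term is added to the gradient before the moment updates, so it passes through m_t and v_t:

g_t <- g_t + weight_decay * p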
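
As a rough illustration, the update rule above can be written for a single scalar parameter in plain Python. The function name `adam_step` and its signature are assumptions made for this sketch, and the weight-decay term is assumed to be folded into the gradient before the moment updates, as is usual for coupled L2.

```python
import math


def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """Return (p, m, v, t) after one Adam update on a scalar parameter."""
    if weight_decay > 0:
        g = g + weight_decay * p  # coupled L2: decay enters via the gradient
    t += 1
    m = beta1 * m + (1 - beta1) * g       # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v, t


# First step from zero state: bias correction makes m_hat == g and
# v_hat == g**2, so the parameter moves by roughly lr.
p1, m1, v1, t1 = adam_step(p=1.0, g=0.5, m=0.0, v=0.0, t=0)
```

Because of the bias correction, the very first step has magnitude close to `lr` regardless of the gradient's scale.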

Parameters:

Name	Type	Description	Default
`params`	`Sequence[Parameter]`	Parameters to be optimized.	required
`lr`	`float`	Learning rate. Must be positive. Defaults to 1e-3.	`0.001`
`betas`	`tuple[float, float]`	Exponential decay rates for the first and second moments. Each must be in (0, 1). Defaults to (0.9, 0.999).	`(0.9, 0.999)`
`eps`	`float`	Numerical stability epsilon added to the denominator. Must be positive. Defaults to 1e-8.	`1e-08`
`weight_decay`	`float`	Classical L2 regularization coefficient (coupled). Must be non-negative. Defaults to 0.0.	`0.0`

Notes

Parameters with grad is None are skipped.
Weight decay here is classical L2 (not decoupled AdamW).
Optimizer state (m, v, t) is stored per-parameter and persists across steps.
This optimizer is CPU-only in the current KeyDNN implementation.
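
To make the coupled versus decoupled distinction concrete, the sketch below compares one first Adam step under each scheme when the loss gradient is zero. All function names are hypothetical, and the decoupled variant follows the usual AdamW-style formulation rather than anything KeyDNN provides.

```python
import math


def first_adam_direction(g, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction for the first step from zero state."""
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return m_hat / (math.sqrt(v_hat) + eps)


def coupled_step(p, g, lr, weight_decay):
    # Classical L2: the decay term is added to the gradient, so Adam's
    # adaptive scaling is applied to it as well.
    return p - lr * first_adam_direction(g + weight_decay * p)


def decoupled_step(p, g, lr, weight_decay):
    # AdamW-style: the decay shrinks the parameter directly and bypasses
    # the adaptive scaling.
    return p - lr * first_adam_direction(g) - lr * weight_decay * p


coupled = coupled_step(p=1.0, g=0.0, lr=0.1, weight_decay=0.01)
decoupled = decoupled_step(p=1.0, g=0.0, lr=0.1, weight_decay=0.01)
```

Here the coupled step moves the parameter by about `lr` (to roughly 0.9) because the small decay gradient is normalized by its own magnitude, while the decoupled step shrinks it only by `lr * weight_decay * p` (to 0.999).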

zero_grad

zero_grad() -> None

Clear gradients for all managed parameters.

Notes

This calls zero_grad() on each parameter, which clears the stored gradient tensor (if any). Training loops typically call zero_grad() before computing a new backward pass to avoid gradient accumulation.
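
The accumulation that zero_grad() guards against can be shown with a toy parameter. `ToyParam` and its `accumulate` method are hypothetical, and resetting the gradient to None is an assumption about what "clearing" means here.

```python
class ToyParam:
    """Hypothetical stand-in for a parameter that accumulates gradients."""

    def __init__(self, value):
        self.value = value
        self.grad = None

    def accumulate(self, g):
        # Mimics a backward pass that adds into an existing gradient.
        self.grad = g if self.grad is None else self.grad + g

    def zero_grad(self):
        self.grad = None


p = ToyParam(1.0)
p.accumulate(0.5)
p.accumulate(0.5)   # second backward pass without clearing
stale = p.grad      # sum of both passes

p.zero_grad()
p.accumulate(0.5)
fresh = p.grad      # only the latest pass
```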

step

step() -> None

Apply one Adam update step to all managed parameters.

Notes

Parameters with grad is None are skipped.
This method is CPU-only and raises a device-not-supported error if either a parameter or its gradient resides on a non-CPU device.
Weight decay is implemented as classical L2 regularization (coupled with the gradient), not as decoupled AdamW.
Optimizer state is created lazily on the first update for each parameter.
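
One way lazy, per-parameter state and the skipping of gradient-less parameters might fit together is sketched below. `ToyParam` and `ToyAdam` are hypothetical, and keying the state by `id(param)` is an assumption for illustration; KeyDNN may store its state differently.

```python
import math


class ToyParam:
    def __init__(self, value, grad=None):
        self.value = value
        self.grad = grad


class ToyAdam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr, self.betas, self.eps = lr, betas, eps
        self.state = {}  # id(param) -> {"m", "v", "t"}, filled on first update

    def step(self):
        b1, b2 = self.betas
        for p in self.params:
            if p.grad is None:
                continue  # skipped: no state is created and no update applied
            s = self.state.setdefault(id(p), {"m": 0.0, "v": 0.0, "t": 0})
            s["t"] += 1
            s["m"] = b1 * s["m"] + (1 - b1) * p.grad
            s["v"] = b2 * s["v"] + (1 - b2) * p.grad ** 2
            m_hat = s["m"] / (1 - b1 ** s["t"])
            v_hat = s["v"] / (1 - b2 ** s["t"])
            p.value -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


a = ToyParam(1.0, grad=0.5)
b = ToyParam(2.0)  # never receives a gradient
opt = ToyAdam([a, b])
opt.step()
opt.step()
```

After two steps only `a` has an entry in `opt.state`, its step counter is 2, and `b` is unchanged.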

keydnn.SGD `dataclass`

Bases: _Optimizer

Stochastic Gradient Descent (SGD) optimizer.

This optimizer updates parameters in-place using their accumulated gradients and a fixed learning rate.

Update rule

For each parameter p with gradient g:

If weight_decay > 0 (classical L2 regularization): g <- g + weight_decay * p
Parameter update: p <- p - lr * g
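
A minimal scalar sketch of this rule follows. The function name `sgd_step` is an assumption for illustration, and it returns the new value rather than updating in-place as the optimizer does.

```python
def sgd_step(p, g, lr=1e-3, weight_decay=0.0):
    """Return the parameter value after one SGD update."""
    if weight_decay > 0:
        g = g + weight_decay * p  # coupled L2 term added to the gradient
    return p - lr * g


plain = sgd_step(p=1.0, g=0.5, lr=0.1)                       # 1.0 - 0.1 * 0.5
decayed = sgd_step(p=1.0, g=0.5, lr=0.1, weight_decay=0.01)  # 1.0 - 0.1 * 0.51
```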

Parameters:

Name	Type	Description	Default
`params`	`Sequence[Parameter]`	Parameters to be optimized.	required
`lr`	`float`	Learning rate. Must be positive. Defaults to 1e-3.	`0.001`
`weight_decay`	`float`	Classical L2 weight decay coefficient (coupled). Must be non-negative. Defaults to 0.0.	`0.0`

Notes

Parameters with grad is None are skipped.
Momentum, Nesterov, and other SGD variants are intentionally omitted in this minimal implementation.
This optimizer is CPU-only in the current KeyDNN implementation.

zero_grad

zero_grad() -> None

Clear gradients for all managed parameters.

Notes

This calls zero_grad() on each parameter, which clears the stored gradient tensor (if any). Training loops typically call zero_grad() before computing a new backward pass to avoid gradient accumulation.

step

step() -> None

Apply one SGD update step to all managed parameters.

Notes

Parameters with grad is None are skipped.
Weight decay is implemented as classical L2 regularization (coupled with the gradient), not as decoupled weight decay.
The update is performed in-place on the parameter.

Notes

Both optimizers take a sequence of parameters (typically those of a model).
Both accept weight_decay as a coupled L2 coefficient. SGD omits momentum, and neither optimizer documents a learning-rate schedule.
Gradients accumulate across backward passes, so call zero_grad() once per iteration, after step() or before the next backward pass.
