Optimizers
Optimizers update model parameters based on accumulated gradients.
Typically, the training flow looks like:
- Forward pass
- Compute loss
- Backward pass (loss.backward())
- Optimizer step (opt.step())
- Clear gradients (opt.zero_grad())
Exact method names and behavior are documented in each optimizer’s docstring.
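The flow above can be sketched end to end with stand-in classes. ToyParam and ToySGD below are hypothetical and exist only to make the example runnable; they are not the KeyDNN classes, and the backward pass is replaced by a hand-derived gradient for a scalar least-squares problem. Only the method names step() and zero_grad() are taken from this page.

```python
# Stand-in classes for illustration only; these are NOT the KeyDNN classes.
class ToyParam:
    def __init__(self, value):
        self.value = value
        self.grad = None

    def zero_grad(self):
        self.grad = None


class ToySGD:
    def __init__(self, params, lr=1e-3):
        self.params = list(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            if p.grad is None:  # parameters without a gradient are skipped
                continue
            p.value -= self.lr * p.grad

    def zero_grad(self):
        for p in self.params:
            p.zero_grad()


w = ToyParam(0.0)
opt = ToySGD([w], lr=0.1)
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2 * x

for epoch in range(50):
    for x, y in data:
        pred = w.value * x             # forward pass
        loss = (pred - y) ** 2         # compute loss
        w.grad = 2.0 * (pred - y) * x  # backward pass (hand-derived here)
        opt.step()                     # optimizer step
        opt.zero_grad()                # clear gradients
```

After the loop, w.value has converged close to 2.0 and w.grad has been cleared by the final zero_grad() call.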
keydnn.Adam
dataclass
Bases: _Optimizer
Adam optimizer.
Adam maintains exponentially decaying averages of past gradients (first moment) and past squared gradients (second moment), and applies bias correction to both estimates.
Update rule
Let g_t be the gradient at step t:
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * (g_t ** 2)
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)
p <- p - lr * m_hat / (sqrt(v_hat) + eps)
If weight_decay > 0 (classical L2 regularization), the decay term is added to the gradient before the moment estimates are updated:
g_t <- g_t + weight_decay * p
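The update rule can be traced numerically with a small pure-Python sketch. The function adam_step below is a hypothetical helper, not part of KeyDNN; it works on flat lists of floats and assumes the weight-decay term is folded into the gradient before the moment updates, as is conventional for coupled L2.

```python
import math


def adam_step(p, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0):
    """One Adam update on flat lists of floats; returns (p, m, v, t)."""
    beta1, beta2 = betas
    t += 1  # step counter starts at 1 so the bias correction is well defined
    if weight_decay > 0:
        # Coupled L2: the decay term becomes part of the gradient.
        g = [gi + weight_decay * pi for gi, pi in zip(g, p)]
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    bc1 = 1 - beta1 ** t  # bias correction for the first moment
    bc2 = 1 - beta2 ** t  # bias correction for the second moment
    new_p = [
        pi - lr * (mi / bc1) / (math.sqrt(vi / bc2) + eps)
        for pi, mi, vi in zip(p, m, v)
    ]
    return new_p, m, v, t


p0 = [1.0, -2.0]
g0 = [0.5, -3.0]
p1, m1, v1, t1 = adam_step(p0, g0, m=[0.0, 0.0], v=[0.0, 0.0], t=0)
```

On the first step, bias correction gives m_hat = g and v_hat = g ** 2, so each parameter moves by roughly lr in the direction opposite its gradient, independent of the gradient's magnitude. Here p1 is approximately [0.999, -1.999].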
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| params | Sequence[Parameter] | Parameters to be optimized. | required |
| lr | float | Learning rate. Must be positive. Defaults to 1e-3. | 0.001 |
| betas | tuple[float, float] | Exponential decay rates for the first and second moments. Each must be in (0, 1). Defaults to (0.9, 0.999). | (0.9, 0.999) |
| eps | float | Numerical stability epsilon added to the denominator. Must be positive. Defaults to 1e-8. | 1e-08 |
| weight_decay | float | Classical L2 regularization coefficient (coupled). Must be non-negative. Defaults to 0.0. | 0.0 |
Notes
- Parameters with grad is None are skipped.
- Weight decay here is classical L2 (not decoupled AdamW).
- Optimizer state (m, v, t) is stored per-parameter and persists across steps.
- This optimizer is CPU-only in the current KeyDNN implementation.
zero_grad
zero_grad() -> None
Clear gradients for all managed parameters.
Notes
This calls zero_grad() on each parameter, which clears the stored
gradient tensor (if any). Training loops typically call zero_grad()
before computing a new backward pass to avoid gradient accumulation.
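The effect of skipping zero_grad() can be illustrated with a stand-in parameter. ToyParam and its accumulate_grad method are hypothetical; the sketch assumes that a backward pass adds into an existing gradient rather than overwriting it, which is what the note above guards against.

```python
class ToyParam:
    """Stand-in parameter whose backward pass adds into .grad (an assumption)."""

    def __init__(self, value):
        self.value = value
        self.grad = None

    def accumulate_grad(self, g):
        # Mimics a backward pass that sums into any existing gradient.
        self.grad = g if self.grad is None else self.grad + g

    def zero_grad(self):
        self.grad = None


p = ToyParam(1.0)
p.accumulate_grad(0.5)
p.accumulate_grad(0.5)  # second backward pass without clearing
stale = p.grad          # the two gradients have been summed

p.zero_grad()
p.accumulate_grad(0.5)
fresh = p.grad          # only the most recent gradient remains
```

In this sketch stale is 1.0 and fresh is 0.5, so an optimizer step taken on the stale gradient would be twice as large as intended.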
step
step() -> None
Apply one Adam update step to all managed parameters.
Notes
- Parameters with grad is None are skipped.
- This method is CPU-only and raises a device-not-supported error if either a parameter or its gradient resides on a non-CPU device.
- Weight decay is implemented as classical L2 regularization (coupled with the gradient), not as decoupled AdamW.
- Optimizer state is created lazily on the first update for each parameter.
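Lazy, per-parameter state combined with the skip rule can be sketched as follows. The dictionary keyed by id(param), the helper names, and the scalar state values are assumptions made for illustration; the page does not describe how KeyDNN stores its state internally.

```python
class ToyParam:
    """Stand-in parameter, not the KeyDNN Parameter class."""

    def __init__(self, value):
        self.value = value
        self.grad = None


state = {}  # id(param) -> {"m", "v", "t"}; the keying scheme is an assumption


def get_state(param):
    # Allocate zeroed state the first time a parameter is actually updated.
    key = id(param)
    if key not in state:
        state[key] = {"m": 0.0, "v": 0.0, "t": 0}
    return state[key]


def visit(params):
    for p in params:
        if p.grad is None:
            continue  # skipped parameters never get state allocated
        get_state(p)["t"] += 1  # state persists and advances across calls


a, b = ToyParam(1.0), ToyParam(2.0)
a.grad = 0.1  # only `a` has a gradient
visit([a, b])
visit([a, b])
```

After two calls, state holds a single entry for a with t equal to 2, and b has no state because it was skipped both times.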
keydnn.SGD
dataclass
Bases: _Optimizer
Stochastic Gradient Descent (SGD) optimizer.
This optimizer updates parameters in-place using their accumulated gradients and a fixed learning rate.
Update rule
For each parameter p with gradient g:
- If weight_decay > 0 (classical L2 regularization): g <- g + weight_decay * p
- Parameter update: p <- p - lr * g
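The two steps of the rule can be checked by hand with a short sketch. The function sgd_step is a hypothetical helper operating on plain lists of floats, not the KeyDNN implementation; it updates the parameter list in place, as described above.

```python
def sgd_step(p, g, lr=1e-3, weight_decay=0.0):
    """Apply one SGD update to the list `p` in place and return it."""
    if weight_decay > 0:
        # Coupled L2: add weight_decay * p to the gradient first.
        g = [gi + weight_decay * pi for gi, pi in zip(g, p)]
    for i, gi in enumerate(g):
        p[i] -= lr * gi
    return p


params = [1.0, -2.0]
grads = [0.5, 0.0]
sgd_step(params, grads, lr=0.1, weight_decay=0.1)
```

With these inputs the effective gradient is [0.6, -0.2], so params becomes approximately [0.94, -1.98]. The second parameter moves toward zero even though its raw gradient is zero, which is the effect of the L2 term.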
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| params | Sequence[Parameter] | Parameters to be optimized. | required |
| lr | float | Learning rate. Must be positive. Defaults to 1e-3. | 0.001 |
| weight_decay | float | Classical L2 weight decay coefficient (coupled). Must be non-negative. Defaults to 0.0. | 0.0 |
Notes
- Parameters with grad is None are skipped.
- Momentum, Nesterov, and other SGD variants are intentionally omitted in this minimal implementation.
- This optimizer is CPU-only in the current KeyDNN implementation.
zero_grad
zero_grad() -> None
Clear gradients for all managed parameters.
Notes
This calls zero_grad() on each parameter, which clears the stored
gradient tensor (if any). Training loops typically call zero_grad()
before computing a new backward pass to avoid gradient accumulation.
step
step() -> None
Apply one SGD update step to all managed parameters.
Notes
- Parameters with grad is None are skipped.
- Weight decay is implemented as classical L2 regularization (coupled with the gradient), not as decoupled weight decay.
- The update is performed in-place on the parameter.
Notes
- Optimizers usually take an iterable of parameters (often from a model).
- If KeyDNN supports weight decay, momentum, or learning-rate schedules, document those options in the optimizer docstrings.
- If gradients can be accumulated across steps, clarify whether zero_grad() is required.