Tensors

keydnn.Tensor

Bases: Tensor

Presentation-layer Tensor.

This subclass preserves the full infrastructure Tensor API (including classmethods and staticmethods such as zeros, ones, and rand) while applying presentation-friendly defaults for direct construction.

Notes
  • Direct construction defaults to CPU if device is omitted.
  • Direct construction defaults to zero-initialized storage because that is what most users expect from a high-level DL API.

dtype property

dtype: dtype

Return the element dtype of this tensor.

Returns:

Type Description
dtype

NumPy dtype representing the tensor element type.

Notes
  • For CPU tensors, this matches the underlying ndarray dtype.
  • For CUDA tensors, this is metadata used for kernel dispatch and tests.

shape property

shape: tuple[int, ...]

Return the tensor shape.

Returns:

Type Description
tuple[int, ...]

The tensor's shape.

device property

device: Device

Return the device on which this tensor resides.

Returns:

Type Description
Device

The tensor's device placement descriptor.

requires_grad property writable

requires_grad: bool

Indicate whether this tensor should accumulate gradients.

Returns:

Type Description
bool

True if gradients should be tracked/accumulated, False otherwise.

grad property

grad: Optional['Tensor']

Return the gradient tensor associated with this tensor (if any).

Returns:

Type Description
Optional[Tensor]

The stored gradient tensor, or None if not computed or cleared.

T property

T: ITensor

Convenience property for 2D transpose.

Equivalent to calling self.transpose().

data property

data: int | ndarray

Return the underlying data handle for this tensor.

Semantics
  • CPU tensors: Returns the underlying NumPy ndarray storing the tensor data.
  • CUDA tensors: Returns the raw device pointer (dev_ptr) as an integer.

    The pointer is resolved as follows:
      1) If the tensor is backed by a _CudaStorage object, return storage.dev_ptr.
      2) Otherwise, fall back to the legacy _data field, which may contain a raw device pointer set by older code paths or tests.

Notes
  • For CUDA tensors, the returned value is not a NumPy array and should be treated as an opaque device pointer handle.
  • A return value of 0 indicates that no device memory is currently allocated (e.g., uninitialized tensor, freed tensor, or zero-sized tensor).
  • New code should prefer storage-backed tensors; the _data fallback exists only for backward compatibility during migration.

Returns:

Type Description
int | ndarray
  • NumPy ndarray for CPU tensors.
  • Integer device pointer (uintptr_t) for CUDA tensors.

Raises:

Type Description
ValueError

If the tensor is on an unsupported or unknown device.

nbytes property

nbytes: int

Total bytes required to store this tensor's elements.

Computed as: numel() * itemsize(dtype)

Notes
  • This is metadata-only and does not require the underlying storage to be allocated.
  • For empty tensors (numel == 0), this returns 0.
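
As a quick check of the formula, the sketch below computes the same quantity from a shape and a NumPy dtype. The helper name nbytes_for is hypothetical and not part of keydnn; it only illustrates that the result is pure metadata and needs no allocation.

```python
import math

import numpy as np


def nbytes_for(shape, dtype):
    """Bytes needed for a dense tensor: numel * itemsize (hypothetical helper)."""
    numel = math.prod(shape)  # math.prod(()) == 1, so a scalar has one element
    return numel * np.dtype(dtype).itemsize


print(nbytes_for((2, 3), np.float32))  # 6 elements * 4 bytes = 24
print(nbytes_for((0, 5), np.float32))  # empty tensor: 0
print(nbytes_for((), np.float64))      # scalar: 8
```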

device_type property

device_type: str

Return the type of the device this tensor resides on, as a string: either cpu or cuda.

Returns:

Type Description
str

The device type. Either cpu or cuda.

zero_grad

zero_grad() -> None

Clear the stored gradient.

Notes

Training loops typically call zero_grad() before backprop to avoid unintentional accumulation across iterations.

to_numpy

to_numpy() -> np.ndarray

Convert the tensor to a NumPy ndarray on the host.

Returns:

Type Description
ndarray

A NumPy array containing the tensor data.

Raises:

Type Description
RuntimeError

If device-to-host transfer is unavailable or the tensor's dtype is unknown.

Notes
  • CPU tensors return a view/copy of the underlying CPU storage (preserving existing behavior of the concrete tensor type).
  • CUDA tensors are copied from device to host via a device-to-host memcpy.

copy_from_numpy

copy_from_numpy(arr: ndarray) -> None

Copy data from a NumPy array (or array-like / scalar) into this tensor.

Backward compatibility (CPU)

The original implementation accepted any array-like input (including NumPy scalars like np.float32) by calling np.asarray(arr, dtype=np.float32). This method preserves that behavior for the CPU path.

CUDA
  • Accepts any array-like / scalar.
  • Casts to self.dtype, makes it C-contiguous, then performs a host-to-device memcpy into the tensor's existing (or newly allocated) device buffer.

Parameters:

Name Type Description Default
arr Any

Array-like object accepted by np.asarray, including NumPy scalars.

required

Raises:

Type Description
ValueError

If the input shape does not match this tensor's shape.

RuntimeError

If the tensor's device is unsupported or a CUDA copy fails.
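
The host-side preparation described for the CUDA path (cast, make C-contiguous, check shape) can be sketched with NumPy alone. The helper prepare_host_buffer is hypothetical; the actual keydnn internals may order or name these steps differently, and the memcpy itself is omitted.

```python
import numpy as np


def prepare_host_buffer(arr, shape, dtype=np.float32):
    """Normalize array-like input before a raw host-to-device memcpy (sketch)."""
    # np.asarray accepts lists, ndarrays, and NumPy scalars; order="C"
    # guarantees a dense row-major layout, copying only when needed.
    host = np.asarray(arr, dtype=dtype, order="C")
    if host.shape != tuple(shape):
        raise ValueError(
            f"shape mismatch: expected {tuple(shape)}, got {host.shape}"
        )
    return host


# A transposed view is not C-contiguous, so it must be repacked first.
src = np.arange(6, dtype=np.float64).reshape(2, 3).T  # shape (3, 2)
buf = prepare_host_buffer(src, (3, 2))
print(src.flags.c_contiguous, buf.flags.c_contiguous, buf.dtype)
```

A raw memcpy copies bytes in memory order, so skipping the contiguity step would silently scramble element order for strided inputs.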

copy_from

copy_from(
    other: ITensor, *, allow_cross_device: bool = False
) -> None

Copy data from another tensor into this tensor (in-place).

Parameters:

Name Type Description Default
other Tensor

Source tensor.

required
allow_cross_device bool

If False (backward-compatible), require self.device and other.device to match exactly (string compare) and perform same-device copies only.

If True, allow:
  • CPU -> CPU (same as before)
  • CUDA -> CUDA (D2D memcpy; same device index is recommended)
  • CPU -> CUDA (HtoD memcpy)
  • CUDA -> CPU (DtoH memcpy)

False

Notes
  • Shape must match.
  • dtype must match (no implicit casting).
  • Cross-GPU copies (cuda:0 -> cuda:1) are not handled here; they may work only if the memcpy wrapper supports peer copies. By default this method raises for different CUDA device indices.

Raises:

Type Description
(TypeError, ValueError, RuntimeError)

fill

fill(value: float) -> None

Fill the tensor with a scalar value.

This method overwrites every element of the tensor with value. The concrete implementation may dispatch differently depending on device placement (e.g., NumPy fill on CPU vs. a CUDA fill kernel).

Parameters:

Name Type Description Default
value float

Scalar value to write into the tensor.

required

Raises:

Type Description
RuntimeError

If the tensor's device is unsupported by this implementation.

debug_storage_repr

debug_storage_repr() -> str

Return a stable, human-readable description of underlying storage.

Contract
  • CPU tensors: describe the NumPy ndarray storage.
  • CUDA tensors: return a stable placeholder string that includes:
    • device index
    • tensor shape
    • (if available) the device pointer value
Notes

This is a debugging aid and intentionally does not expose full contents.

numel

numel() -> int

Return the total number of elements in the tensor.

Returns:

Type Description
int

Product of all dimensions in the tensor shape.

reshape

reshape(new_shape: tuple[int, ...]) -> ITensor

Return this tensor reshaped to new_shape. Whether the result shares storage with the original depends on the device (see Behavior).

This operation changes the logical shape while preserving the total number of elements.

Behavior
  • CPU: reshapes via NumPy and then materializes via copy_from_numpy (preserving current copy-based semantics).
  • CUDA: returns a metadata-only alias that shares the same device pointer (no kernel launch, no device-to-device copy).
Backward

If autograd is enabled, backward reshapes grad_out back to the original shape of the parent tensor.

Notes
  • Reshape validity is checked using NumPy semantics, including support for -1 inference.
  • For CUDA tensors, an allocation must exist if numel != 0 (i.e., data cannot be 0 for non-empty tensors).
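
Because the CUDA path is metadata-only, shape validation can be done without touching any data. The following is a minimal sketch of -1 inference under NumPy's rules; infer_shape is a hypothetical helper, not a keydnn function.

```python
import math


def infer_shape(numel, new_shape):
    """Resolve at most one -1 in new_shape so that the product equals numel."""
    new_shape = tuple(new_shape)
    if any(d < -1 for d in new_shape):
        raise ValueError("negative dimensions other than -1 are not allowed")
    unknown = [i for i, d in enumerate(new_shape) if d == -1]
    if len(unknown) > 1:
        raise ValueError("only one dimension can be -1")
    known = math.prod(d for d in new_shape if d != -1)
    if unknown:
        # The inferred size must be a whole number; a zero-sized known part
        # makes the missing dimension ambiguous, which NumPy also rejects.
        if known == 0 or numel % known != 0:
            raise ValueError(f"cannot reshape {numel} elements into {new_shape}")
        resolved = list(new_shape)
        resolved[unknown[0]] = numel // known
        return tuple(resolved)
    if known != numel:
        raise ValueError(f"cannot reshape {numel} elements into {new_shape}")
    return new_shape


print(infer_shape(12, (3, -1)))  # (3, 4)
print(infer_shape(0, (-1, 2)))   # (0, 2)
```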

stack staticmethod

stack(
    tensors: Sequence["Tensor"], axis: int = 0
) -> "Tensor"

Stack a sequence of tensors along a new axis.

This is the differentiable counterpart of np.stack.

Requirements
  • tensors must be non-empty
  • all tensors must share the same shape
  • all tensors must share the same device
  • CPU: forward/backward uses NumPy (backward compatible)
  • CUDA: forward/backward uses stack_cuda_ext kernels (device-pointer based) and does NOT call to_numpy() on CUDA tensors.

Parameters:

Name Type Description Default
tensors Sequence[Tensor]

Input tensors to stack. Must be non-empty and same-shape.

required
axis int

Axis at which the new dimension is inserted. Supports negative axes. Defaults to 0.

0

Returns:

Type Description
Tensor

A new tensor whose shape is: out.shape = in.shape[:axis] + (len(tensors),) + in.shape[axis:]

Notes
  • Forward returns a copy (not a view).
  • Backward splits grad_out along axis and routes each slice back to the corresponding parent tensor.
  • CUDA backward overwrites dx buffers (no accumulation inside kernel).
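
The shape formula and the backward split can be checked against np.stack, which the text names as the reference. One caveat this sketch makes explicit: assuming np.stack semantics, a negative axis is normalized against the output rank before the slicing formula applies. The helper stacked_shape is hypothetical.

```python
import numpy as np


def stacked_shape(in_shape, n, axis):
    """Output shape of stacking n tensors of in_shape along a new axis."""
    out_ndim = len(in_shape) + 1
    if not -out_ndim <= axis < out_ndim:
        raise ValueError(f"axis {axis} out of range for {out_ndim}-D output")
    axis %= out_ndim  # normalize against the OUTPUT rank, as np.stack does
    return tuple(in_shape[:axis]) + (n,) + tuple(in_shape[axis:])


parts = [np.full((2, 3), float(i), dtype=np.float32) for i in range(4)]
out = np.stack(parts, axis=-1)
print(out.shape, stacked_shape((2, 3), 4, -1))  # (2, 3, 4) (2, 3, 4)

# Backward: slice grad_out along the stacked axis, one slice per parent.
grad_out = np.ones_like(out)
grads = [np.take(grad_out, i, axis=-1) for i in range(len(parts))]
print(grads[0].shape)  # (2, 3)
```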

concat staticmethod

concat(
    tensors: Sequence["Tensor"], axis: int = 0
) -> "Tensor"

Concatenate a sequence of tensors along an existing axis.

CPU behavior
  • Fully supports all axes via NumPy.
CUDA behavior (current)
  • Supports only axis == 0 for CUDA tensors using device-to-device memcpy.
  • General axis concatenation on CUDA requires a kernel (pending).
Requirements
  • tensors must be non-empty
  • all tensors must share the same device
  • shapes must match on all dimensions except axis
  • dtypes must match (CUDA path enforces this; CPU path preserves current float32-cast behavior)
Backward rule
  • Split grad_out along axis into slices matching each input's size along that axis, and route each slice back to the corresponding parent.
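
The backward rule can be illustrated with NumPy: the split points are the cumulative input sizes along the concatenation axis. This is a sketch of the rule only, not of the keydnn implementation.

```python
import numpy as np

a = np.ones((2, 3), dtype=np.float32)
b = np.ones((4, 3), dtype=np.float32)
out = np.concatenate([a, b], axis=0)  # shape (6, 3)

# Upstream gradient with distinct values so the routing is visible.
grad_out = np.arange(out.size, dtype=np.float32).reshape(out.shape)

# Each input owns a contiguous run along axis 0; split at the running totals.
sizes = [t.shape[0] for t in (a, b)]
split_points = np.cumsum(sizes)[:-1]
grad_a, grad_b = np.split(grad_out, split_points, axis=0)

print(grad_a.shape, grad_b.shape)  # (2, 3) (4, 3)
```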

broadcast_to

broadcast_to(shape: Tuple[int, ...]) -> ITensor

Broadcast this tensor to a target shape by explicit expansion.

Parameters:

Name Type Description Default
shape tuple[int, ...]

Target shape to broadcast to.

required

Returns:

Type Description
ITensor

Broadcasted tensor (materialized copy).

Notes
  • The operation is conceptually a "repeat/expand" that materializes a new tensor of the requested shape.
  • Backward typically reduces gradients by summing over the broadcasted dimensions (i.e., the inverse of expansion).

sum

sum(
    axis: Optional[int] = None, keepdims: bool = False
) -> ITensor

Compute the sum of tensor elements.

Parameters:

Name Type Description Default
axis Optional[int]

Axis along which to compute the sum. If None, all elements are reduced into a scalar.

None
keepdims bool

If True, retains reduced dimensions with size 1. Defaults to False.

False

Returns:

Type Description
ITensor

Tensor containing the summed values. The shape depends on the axis and keepdims arguments.

Notes

Backward rule: The upstream gradient is broadcast back to the input tensor's shape, i.e., each input element receives the gradient of the corresponding reduced output.

Backend-specific implementations may impose additional constraints (e.g., limited axis support on CUDA).
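
A minimal NumPy sketch of the backward rule follows. The function sum_backward is hypothetical; it restores the reduced axis (when keepdims was False) and then broadcasts the upstream gradient to the input shape.

```python
import numpy as np


def sum_backward(grad_out, in_shape, axis=None, keepdims=False):
    """Gradient of sum() w.r.t. its input, given the upstream gradient."""
    g = np.asarray(grad_out, dtype=np.float32)
    if axis is not None and not keepdims:
        g = np.expand_dims(g, axis)  # put the reduced axis back as size 1
    # Materialize the broadcast so the result is an ordinary writable array.
    return np.broadcast_to(g, in_shape).copy()


x = np.arange(6, dtype=np.float32).reshape(2, 3)
gx = sum_backward(np.array([10.0, 20.0]), x.shape, axis=1)
print(gx)  # each row is filled with the gradient of its own row sum

g_all = sum_backward(1.0, x.shape)  # axis=None: every element receives 1.0
```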

mean

mean() -> ITensor

Compute the arithmetic mean of all elements in the tensor.

This operation always reduces the tensor to a scalar value.

Returns:

Type Description
ITensor

A scalar tensor (shape=()) containing the mean value.

Notes

Backward rule: The gradient is distributed uniformly to all input elements:

    ``d(mean(x)) / dx = 1 / numel(x)``

No axis argument is currently supported; the reduction is always performed over all elements.

log

log() -> ITensor

Compute the elementwise natural logarithm of the tensor.

Returns:

Type Description
ITensor

A tensor of the same shape as self, where each element is replaced by its natural logarithm.

Notes
CPU behavior
  • Uses NumPy to compute the forward pass for CPU tensors.
CUDA behavior (workaround)
  • For CUDA tensors, this method currently performs a CPU round-trip: device → host (to_numpy) → NumPy log → device (copy_from_numpy).
  • This preserves correctness and autograd semantics but is not performance-optimal.
Autograd

If self.requires_grad is True, the backward rule is:

``d(log(x)) / dx = 1 / x``

The behavior for non-positive input values follows NumPy semantics (e.g., -inf or nan).

TODO

Implement a native CUDA kernel for log (and a fused backward) to avoid device↔host transfers.

exp

exp() -> ITensor

Compute the elementwise exponential of the tensor.

Returns:

Type Description
ITensor

A tensor of the same shape as self with exp applied elementwise.

Notes

Backward rule: ``d(exp(x)) / dx = exp(x)``

CUDA behavior
  • Uses the native CUDA unary exponential kernel via unary_cuda_ext.exp_forward.
  • Operates directly on device pointers without a NumPy round-trip.

max

max(axis: int = -1, keepdims: bool = False) -> ITensor

Compute the maximum values along a given axis.

Parameters:

Name Type Description Default
axis int

Axis along which to compute the maximum. Defaults to -1.

-1
keepdims bool

Whether to retain reduced dimensions with size 1. Defaults to False.

False

Returns:

Type Description
ITensor

Tensor containing the maximum values along the specified axis.

CUDA support
  • Only supports 2D input tensors.
  • axis must reduce exactly one dimension: axis in {0, 1, -1, -2}.
  • Backward propagation routes the gradient to a single argmax index per slice (ties are not split).
CPU notes

Backward rule: Gradients are routed to all positions equal to the maximum value using a mask, i.e.:

    ``dx = grad_out * 1[x == max(x)]``
Notes

The exact behavior (including tie handling and shape semantics) is backend-dependent but must conform to the rules documented here.
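
The difference between the two tie-handling rules is easiest to see on an input with repeated maxima. The NumPy sketch below reproduces both rules on the host. Which tied index a CUDA kernel selects is an assumption here: np.argmax picks the first occurrence, and the real kernel may differ.

```python
import numpy as np

x = np.array([[1.0, 3.0, 3.0],
              [2.0, 0.0, 2.0]], dtype=np.float32)
grad_out = np.array([1.0, 1.0], dtype=np.float32)  # upstream grad, axis=1

# Mask rule (documented for CPU): every position equal to the max
# receives the full upstream gradient.
m = x.max(axis=1, keepdims=True)
dx_mask = grad_out[:, None] * (x == m)

# Single-argmax rule (documented for CUDA): exactly one index per slice
# receives the gradient, so ties are not split or duplicated.
idx = x.argmax(axis=1)
dx_argmax = np.zeros_like(x)
dx_argmax[np.arange(x.shape[0]), idx] = grad_out

print(dx_mask)
print(dx_argmax)
```

With ties present, the mask rule delivers more total gradient than the argmax rule, so gradients from the two backends should not be expected to agree elementwise on such inputs.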

matmul

matmul(other: 'Tensor') -> 'Tensor'

Matrix multiplication (2D): out = self @ other.

Requirements
  • CPU: unchanged behavior (backward compatible)
  • CUDA: both operands must be CUDA, 2D, inner dims match
  • CUDA inputs MUST already have allocated device buffers (data != 0). (No implicit allocation for inputs.)
Backward

If out = A @ B, then:
  • dL/dA = dL/dout @ B^T
  • dL/dB = A^T @ dL/dout
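
The backward rule above can be verified numerically with plain NumPy. The sketch uses the scalar loss L = sum(G * (A @ B)), for which dL/dout is exactly G, and compares one analytic entry with a finite difference. Nothing here calls keydnn.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
G = rng.standard_normal((3, 2))  # stands in for dL/dout

# Analytic gradients from the backward rule.
dA = G @ B.T
dB = A.T @ G


def loss(A_, B_):
    """Scalar loss whose gradient w.r.t. (A_ @ B_) is exactly G."""
    return float(np.sum(G * (A_ @ B_)))


# Finite-difference check of a single entry of dA.
eps = 1e-6
A_pert = A.copy()
A_pert[1, 2] += eps
numeric = (loss(A_pert, B) - loss(A, B)) / eps

print(dA.shape, dB.shape)  # (3, 4) (4, 2)
print(abs(numeric - dA[1, 2]) < 1e-4)
```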

transpose

transpose() -> ITensor

Return the 2D transpose of this tensor.

For a 2D tensor A with shape (M, N), transpose returns Aᵀ with shape (N, M) such that:

out[i, j] = self[j, i]
Requirements
  • Input must be 2D.
  • CPU and CUDA are supported by the concrete implementation.
Backward

If out = Aᵀ, then dL/dA = (dL/dout)ᵀ.

backward

backward(
    grad_out: Optional["Tensor"] = None,
    *,
    profile: bool = False,
    profile_topk: int = 20,
) -> None

Backpropagate gradients from this tensor through the autograd graph.

Parameters:

Name Type Description Default
grad_out Optional[Tensor]

Gradient w.r.t. this tensor. If omitted, this tensor must be a scalar (shape == ()) and the gradient is assumed to be 1.0.

None
Notes
  • Gradients are accumulated into .grad of leaf tensors that have requires_grad=True.
  • This implementation performs a reverse topological traversal.
  • CPU behavior is unchanged (same logic; only extra timing when profile=True).
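
To make the reverse topological traversal concrete, here is a scalar-only toy autograd. Everything in it (Node, mul, add, the standalone backward function) is hypothetical and far simpler than keydnn's graph; in particular, it accumulates into every node rather than only into leaves with requires_grad=True.

```python
class Node:
    """Toy stand-in for a tensor in an autograd graph (scalars only)."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0


def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))


def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))


def backward(root):
    order, seen = [], set()

    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for p in n.parents:
            visit(p)
        order.append(n)  # post-order: parents are listed before children

    visit(root)
    root.grad = 1.0  # seed, as for a scalar output when grad_out is omitted
    for n in reversed(order):  # reverse topological order
        for p, g in zip(n.parents, n.local_grads):
            p.grad += n.grad * g  # accumulate: a node may be used many times


x = Node(3.0)
y = Node(4.0)
z = add(mul(x, y), x)  # z = x * y + x
backward(z)
print(x.grad, y.grad)  # 5.0 3.0
```

Processing nodes in reverse topological order guarantees that a node's gradient is complete before it is pushed to its parents, which is why x correctly receives contributions from both of its uses.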

item

item() -> float

Return the value of a scalar (or single-element) tensor as a Python float.

Notes
  • CPU: reads from _data.
  • CUDA: uses to_numpy() (D2H) to fetch the scalar.

sqrt

sqrt() -> ITensor

Compute the elementwise square root of the tensor.

Returns:

Type Description
ITensor

A tensor with the same shape as self, containing sqrt(self) applied elementwise.

Notes
CPU behavior
  • Uses NumPy to compute the forward pass for CPU tensors.
CUDA behavior (workaround)
  • For CUDA tensors, this method currently performs a CPU round-trip: device → host (to_numpy) → NumPy sqrt → device (copy_from_numpy).
  • This preserves correctness and autograd semantics but is not performance-optimal.
Autograd

If self.requires_grad is True, the returned tensor participates in autograd with parent self. The backward rule is:

``d(sqrt(x)) / dx = 1 / (2 * sqrt(x))``

TODO

Implement a native CUDA kernel for sqrt (and optionally a fused backward) to avoid device↔host transfers.

rand staticmethod

rand(
    shape, *, device, requires_grad: bool = False
) -> "Tensor"

Create a tensor filled with uniform random values in [0, 1) on the given device.

Notes
  • CPU: random values are generated using NumPy.
  • CUDA: random values are generated on CPU and transferred to device memory. No CUDA RNG kernel is used.
  • This mirrors the initialization strategy used by many frameworks and keeps behavior deterministic and easy to test.
  • The returned tensor has dtype float32 and ctx=None.

full classmethod

full(
    shape: tuple,
    fill_value: float,
    *,
    device: Device,
    requires_grad: bool = False,
) -> ITensor

Create a tensor filled with a constant value.

Backward compatibility
  • CPU path preserves the original logic exactly: uses NumPy to create a filled array and calls copy_from_numpy().
  • Default dtype remains float32, matching the original implementation.
  • The returned tensor has ctx=None.
CUDA
  • Allocates a CUDA tensor (no host staging) and fills via Tensor.fill(), which dispatches to the CUDA fill kernel after ensuring allocation.

Parameters:

Name Type Description Default
shape tuple

Desired tensor shape. May be any shape accepted by NumPy (including () for a scalar tensor).

required
fill_value float

Constant value to write into every element.

required
device Device

Target device placement (CPU or CUDA).

required
requires_grad bool

Whether the returned tensor should participate in autograd. Defaults to False.

False

Returns:

Type Description
ITensor

A newly allocated tensor with the given shape, filled with fill_value.

Raises:

Type Description
RuntimeError

If device is not a supported device type.

tanh

tanh() -> ITensor

Compute the elementwise hyperbolic tangent of the tensor.

Returns:

Type Description
ITensor

A tensor with the same shape as self, with tanh applied elementwise.

Notes
  • This method delegates to the autograd TanhFn Function.
  • NumPy is not used directly here; numerical kernels remain encapsulated inside Tensor operations or autograd Functions.

sigmoid

sigmoid() -> ITensor

Compute the elementwise logistic sigmoid of the tensor.

The sigmoid function is defined as:

``sigmoid(x) = 1 / (1 + exp(-x))``

Returns:

Type Description
ITensor

A tensor with the same shape as self, with sigmoid applied elementwise.

Notes

This method is a thin convenience wrapper around SigmoidFn defined in ._function and integrates with the autograd system.

zeros classmethod

zeros(
    *,
    shape: tuple[int, ...],
    device: Device,
    requires_grad: bool = False,
    dtype: Any = np.float32,
) -> ITensor

Create a tensor filled with zeros on the specified device.

This factory method constructs a tensor with the given shape and device. For CPU tensors, a NumPy array is allocated and zero-initialized. For CUDA tensors, device memory is allocated and zeroed via a CUDA fill routine.

Parameters:

Name Type Description Default
shape tuple[int, ...]

Shape of the output tensor.

required
device Device

Target device placement (CPU or CUDA).

required
requires_grad bool

Whether the tensor should track gradients for autograd.

False
dtype Any

The data type of the target tensor.

float32

Returns:

Type Description
ITensor

Newly created tensor filled with zeros.

Notes
  • The default dtype is float32.
  • Zero-sized tensors are valid and return immediately without invoking CUDA kernels.

ones classmethod

ones(
    *,
    shape: tuple[int, ...],
    device: Device,
    requires_grad: bool = False,
    dtype: Any = np.float32,
) -> ITensor

Create a tensor filled with ones on the specified device.

This factory method constructs a tensor with the given shape and device. For CPU tensors, a NumPy array is allocated and initialized with ones. For CUDA tensors, device memory is allocated and filled using a native CUDA fill routine.

Parameters:

Name Type Description Default
shape tuple[int, ...]

Shape of the output tensor.

required
device Device

Target device placement (CPU or CUDA).

required
requires_grad bool

Whether the tensor should track gradients for autograd.

False
dtype Any

The data type of the target tensor.

float32

Returns:

Type Description
ITensor

Newly created tensor filled with ones.

Notes
  • The default dtype is float32.
  • Zero-sized tensors are valid and return immediately without invoking CUDA kernels.
  • The CUDA path prioritizes correctness and may fall back to a slower initialization strategy if the native fill kernel fails.

to

to(device, *, copy: bool = False) -> ITensor

Move or copy this tensor to another device.

This method implements explicit device placement transitions and returns a tensor on device. If the target device matches the current device, it returns self by default (or a cloned copy if copy=True).

Supported transfers
  • CPU -> CUDA: allocates device memory and performs host-to-device memcpy.
  • CUDA -> CPU: allocates host buffer and performs device-to-host memcpy.
  • CUDA -> CUDA (different device indices): currently implemented via an intermediate CPU round-trip for simplicity.

Parameters:

Name Type Description Default
device Device | str

Target device. If a string is provided, it is parsed as a Device (e.g., "cpu", "cuda:0").

required
copy bool

If True, forces a copy even when the device is unchanged. Defaults to False.

False

Returns:

Type Description
ITensor

A tensor placed on the requested device.

Notes
  • For CPU -> CUDA copies, the host buffer is made C-contiguous before raw memcpy to ensure correct layout.
  • This method does not propagate autograd context; returned tensors are created with requires_grad=False and ctx=None in the transfer paths.

clone

clone() -> ITensor

Create a deep copy of the tensor's storage.

  • CPU: copies the underlying NumPy array into a new tensor.
  • CUDA: allocates a new device buffer and performs a device-to-device memcpy.

Returns:

Type Description
ITensor

A new tensor with identical contents.

Notes
  • clone() is intended to copy raw storage and typically returns a tensor with requires_grad=False and no autograd context (ctx=None).

sum_to_shape

sum_to_shape(target_shape: tuple[int, ...]) -> ITensor

Sum-reduce this tensor to target_shape (inverse of broadcasting).

This primitive is commonly used in autograd to reduce broadcasted gradients back to the original source shape.

Dispatches via tensor_control_path_manager.
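
A host-side sketch of the reduction, assuming standard NumPy broadcasting rules: leading axes that broadcasting added are summed away, then axes that were stretched from size 1 are summed with keepdims. The standalone function below is hypothetical and says nothing about how tensor_control_path_manager dispatches.

```python
import numpy as np


def sum_to_shape(grad, target_shape):
    """Sum-reduce grad so that its shape becomes target_shape."""
    grad = np.asarray(grad)
    target_shape = tuple(target_shape)

    # 1) Sum away leading axes that broadcasting prepended.
    extra = grad.ndim - len(target_shape)
    if extra < 0:
        raise ValueError("target_shape has more dimensions than the input")
    if extra:
        grad = grad.sum(axis=tuple(range(extra)))

    # 2) Sum over axes that were stretched from size 1, keeping the axis.
    axes = tuple(
        i for i, (g, t) in enumerate(zip(grad.shape, target_shape))
        if t == 1 and g != 1
    )
    if axes:
        grad = grad.sum(axis=axes, keepdims=True)

    if grad.shape != target_shape:
        raise ValueError(f"cannot reduce to shape {target_shape}")
    return grad


g = np.ones((4, 2, 3), dtype=np.float32)
r = sum_to_shape(g, (2, 1))
print(r.shape)  # (2, 1)
print(r)        # every entry is 4 * 3 = 12
```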

free_

free_() -> None

Explicitly release CUDA backing memory (if owned).

Semantics
  • If _storage is present: the storage refcount is decremented, and the device memory is freed when the count reaches 0.
  • If _storage is absent:
    • If this tensor is marked as borrowed, the device pointer is NOT freed.
    • If it is marked as "owning borrowed devptr" (transitional), the pointer is released with cuda_free.
  • Idempotent: safe to call multiple times.

clamp

clamp(
    *, min: float | None = None, max: float | None = None
) -> "Tensor"

Clamp tensor values elementwise between min and max.

Parameters:

Name Type Description Default
min float

Minimum value. If None, no lower bound is applied.

None
max float

Maximum value. If None, no upper bound is applied.

None

Returns:

Type Description
Tensor

A new Tensor with values clamped to the specified range.

Notes
  • Gradients pass through unchanged for values within the range.
  • Gradients are zeroed for values clipped by min or max.
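
The gradient rule amounts to multiplying the upstream gradient by an in-range mask. The sketch below uses NumPy and a hypothetical helper clamp_backward. How values exactly equal to min or max are treated is an assumption: here they count as in range.

```python
import numpy as np


def clamp_backward(x, grad_out, lo=None, hi=None):
    """Pass gradients where x is in range; zero them where x was clipped."""
    mask = np.ones_like(x, dtype=bool)
    if lo is not None:
        mask &= x >= lo  # boundary values treated as in range (assumption)
    if hi is not None:
        mask &= x <= hi
    return grad_out * mask


x = np.array([-2.0, 0.5, 3.0], dtype=np.float32)
y = np.clip(x, 0.0, 1.0)  # forward: [0.0, 0.5, 1.0]
dx = clamp_backward(x, np.ones_like(x), lo=0.0, hi=1.0)
print(y)
print(dx)  # only the middle element keeps its gradient
```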

to_

to_(device, *, copy: bool = False) -> ITensor

Move this tensor to another device in-place.

This method performs an in-place device placement transition. Unlike to(), which may return a newly allocated tensor on the target device, to_() preserves the identity of self (i.e., id(self) is unchanged) by migrating the underlying storage and updating device placement fields on the same object.

Semantics
  • If the target device matches the current device:
    • returns self (no-op), unless copy=True, in which case the tensor's storage is replaced with a cloned copy on the same device.
  • If the target device differs:
    • performs the transfer using to(device, copy=True) internally, then swaps this tensor's backing storage to the transferred result.
Supported transfers
  • CPU -> CUDA: allocates device memory and performs host-to-device memcpy.
  • CUDA -> CPU: allocates host buffer and performs device-to-host memcpy.
  • CUDA -> CUDA (different device indices): currently implemented via an intermediate CPU round-trip for simplicity.

Parameters:

Name Type Description Default
device Device | str

Target device. If a string is provided, it is parsed as a Device (e.g., "cpu", "cuda:0").

required
copy bool

If True, forces a copy even when the device is unchanged. This is the in-place analogue of to(..., copy=True). Defaults to False.

False

Returns:

Type Description
ITensor

This tensor (self) after in-place migration.

Notes
  • This method intentionally does not preserve autograd context across device transfers. If this tensor participates in autograd, treat to_() as a graph break:
    • ctx is cleared.
    • requires_grad is left unchanged (you can still accumulate grads going forward), but any prior graph history is discarded.
  • If self.grad exists and is a Tensor-like object with .to(...), this method will attempt to move it to the same device as well.

keydnn.Device

Concrete computation device descriptor.

This class encapsulates a normalized representation of a computation device, including its type (CPU or CUDA) and, for CUDA devices, an optional device index (e.g., cuda:0).

The class performs strict validation of device strings to ensure a small, well-defined set of supported device identifiers.

Parameters:

Name Type Description Default
device str

Device identifier string. Must be either:
  • "cpu"
  • "cuda:<index>", where <index> is a non-negative integer

required

Raises:

Type Description
ValueError

If the provided device string does not match the supported formats.

Notes
  • __slots__ is used to prevent dynamic attribute creation and reduce per-instance memory overhead.
  • This class is intentionally lightweight and does not allocate or manage any backend resources.
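
Strict validation of this kind is commonly done with a single anchored regular expression. The sketch below is not the keydnn implementation: parse_device is a hypothetical function, and its exact strictness (case-sensitive, no bare "cuda", no surrounding whitespace) is an assumption drawn from the two formats listed under Parameters.

```python
import re

# Accept exactly "cpu" or "cuda:<index>" with a non-negative decimal index.
_DEVICE_RE = re.compile(r"cpu|cuda:([0-9]+)")


def parse_device(spec):
    """Return (device_type, index) or raise ValueError for unsupported input."""
    m = _DEVICE_RE.fullmatch(spec)
    if m is None:
        raise ValueError(f"unsupported device string: {spec!r}")
    if spec == "cpu":
        return ("cpu", None)
    return ("cuda", int(m.group(1)))


print(parse_device("cpu"))     # ('cpu', None)
print(parse_device("cuda:0"))  # ('cuda', 0)

for bad in ("gpu", "cuda:-1", "cuda:x"):
    try:
        parse_device(bad)
    except ValueError as exc:
        print("rejected:", exc)
```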

is_cpu

is_cpu() -> bool

Check whether this device represents a CPU.

Returns:

Type Description
bool

True if the device type is CPU, False otherwise.

is_cuda

is_cuda() -> bool

Check whether this device represents a CUDA GPU.

Returns:

Type Description
bool

True if the device type is CUDA, False otherwise.