Tensors
keydnn.Tensor
Bases: Tensor
Presentation-layer Tensor.
This subclass preserves the full infrastructure Tensor API (including
classmethods/staticmethods like zeros, ones, rand, etc.) while
applying presentation-friendly defaults for direct construction.
Notes
- Direct construction defaults to CPU if `device` is omitted.
- Direct construction defaults to zero-initialized storage because that is what most users expect from a high-level DL API.
dtype
property
dtype: dtype
Return the element dtype of this tensor.
Returns:
| Type | Description |
|---|---|
| dtype | NumPy dtype representing the tensor element type. |
Notes
- For CPU tensors, this matches the underlying ndarray dtype.
- For CUDA tensors, this is metadata used for kernel dispatch and tests.
shape
property
shape: tuple[int, ...]
Return the tensor shape.
Returns:
| Type | Description |
|---|---|
| tuple[int, ...] | The tensor's shape. |
device
property
device: Device
Return the device on which this tensor resides.
Returns:
| Type | Description |
|---|---|
| Device | The tensor's device placement descriptor. |
requires_grad
property
writable
requires_grad: bool
Indicate whether this tensor should accumulate gradients.
Returns:
| Type | Description |
|---|---|
| bool | True if gradients should be tracked/accumulated, False otherwise. |
grad
property
grad: Optional['Tensor']
Return the gradient tensor associated with this tensor (if any).
Returns:
| Type | Description |
|---|---|
| Optional[Tensor] | The stored gradient tensor, or None if not computed or cleared. |
T
property
T: ITensor
Convenience property for 2D transpose.
Equivalent to calling self.transpose().
data
property
data: int | ndarray
Return the underlying data handle for this tensor.
Semantics
- CPU tensors: Returns the underlying NumPy ndarray storing the tensor data.
- CUDA tensors: Returns the raw device pointer (`dev_ptr`) as an integer. The pointer is resolved as follows:
  1) If the tensor is backed by a `_CudaStorage` object, return `storage.dev_ptr`.
  2) Otherwise, fall back to the legacy `_data` field, which may contain a raw device pointer set by older code paths or tests.
Notes
- For CUDA tensors, the returned value is not a NumPy array and should be treated as an opaque device pointer handle.
- A return value of `0` indicates that no device memory is currently allocated (e.g., uninitialized tensor, freed tensor, or zero-sized tensor).
- New code should prefer storage-backed tensors; the `_data` fallback exists only for backward compatibility during migration.
Returns:
| Type | Description |
|---|---|
| int \| ndarray | The NumPy ndarray for CPU tensors, or the integer device pointer for CUDA tensors. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the tensor is on an unsupported or unknown device. |
nbytes
property
nbytes: int
Total bytes required to store this tensor's elements.
Computed as: numel() * itemsize(dtype)
Notes
- This is metadata-only and does not require the underlying storage to be allocated.
- For empty tensors (numel == 0), this returns 0.
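The formula above can be sketched in plain Python using shape metadata only. The `ITEMSIZE` table and the `nbytes` helper below are hypothetical names for illustration and are not part of keydnn; the real property presumably derives the item size from the tensor's dtype.

```python
from math import prod

# Hypothetical itemsize table (bytes per element); not keydnn's actual lookup.
ITEMSIZE = {"float32": 4, "float64": 8, "int32": 4}


def nbytes(shape: tuple, dtype: str) -> int:
    """Return numel * itemsize without touching any storage."""
    numel = prod(shape)  # prod(()) == 1, so a scalar shape has one element
    return numel * ITEMSIZE[dtype]


print(nbytes((2, 3), "float32"))  # 6 elements * 4 bytes
print(nbytes((0, 5), "float32"))  # empty tensor: numel == 0
```

Because only the shape and dtype are consulted, the same computation applies whether or not a buffer has been allocated.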
device_type
property
device_type: str
Returns a string of the current device type. It may be either cpu or cuda.
Returns:
| Type | Description |
|---|---|
| str | The device type. Either `cpu` or `cuda`. |
zero_grad
zero_grad() -> None
Clear the stored gradient.
Notes
Training loops typically call zero_grad() before backprop to avoid
unintentional accumulation across iterations.
to_numpy
to_numpy() -> np.ndarray
Convert the tensor to a NumPy ndarray on the host.
Returns:
| Type | Description |
|---|---|
| ndarray | A NumPy array containing the tensor data. |
Raises:
| Type | Description |
|---|---|
| RuntimeError | If device-to-host transfer is unavailable or the tensor's dtype is unknown. |
Notes
- CPU tensors return a view/copy of the underlying CPU storage (preserving existing behavior of the concrete tensor type).
- CUDA tensors are copied from device to host via a device-to-host memcpy.
copy_from_numpy
copy_from_numpy(arr: ndarray) -> None
Copy data from a NumPy array (or array-like / scalar) into this tensor.
Backward compatibility (CPU)
The original implementation accepted any array-like input (including NumPy
scalars like np.float32) by calling np.asarray(arr, dtype=np.float32).
This method preserves that behavior for the CPU path.
CUDA
- Accepts any array-like / scalar.
- Casts to `self.dtype`, makes it C-contiguous, then performs a host-to-device memcpy into the tensor's existing (or newly allocated) device buffer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| arr | Any | Array-like object accepted by | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the input shape does not match this tensor's shape. |
| RuntimeError | If the tensor's device is unsupported or a CUDA copy fails. |
copy_from
copy_from(
other: ITensor, *, allow_cross_device: bool = False
) -> None
Copy data from another tensor into this tensor (in-place).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Tensor | Source tensor. | required |
| allow_cross_device | bool | If False (backward-compatible), cross-device copies are not allowed. If True, allow: CPU -> CPU (same as before); CUDA -> CUDA (D2D memcpy; same device index is recommended); CPU -> CUDA (HtoD memcpy); CUDA -> CPU (DtoH memcpy). | False |
Raises:
| Type | Description |
|---|---|
| (TypeError, ValueError, RuntimeError) | |
fill
fill(value: float) -> None
Fill the tensor with a scalar value.
This method overwrites every element of the tensor with value.
The concrete implementation may dispatch differently depending on
device placement (e.g., NumPy fill on CPU vs. a CUDA fill kernel).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | float | Scalar value to write into the tensor. | required |
Raises:
| Type | Description |
|---|---|
| RuntimeError | If the tensor's device is unsupported by this implementation. |
debug_storage_repr
debug_storage_repr() -> str
Return a stable, human-readable description of underlying storage.
Contract
- CPU tensors: describe the NumPy ndarray storage.
- CUDA tensors: return a stable placeholder string that includes:
- device index
- tensor shape
- (if available) the device pointer value
Notes
This is a debugging aid and intentionally does not expose full contents.
numel
numel() -> int
Return the total number of elements in the tensor.
Returns:
| Type | Description |
|---|---|
| int | Product of all dimensions in the tensor shape. |
reshape
reshape(new_shape: tuple[int, ...]) -> ITensor
Return a reshaped view of this tensor.
This operation changes the logical shape while preserving the total number of elements.
Behavior
- CPU: reshapes via NumPy and then materializes via `copy_from_numpy` (preserving current copy-based semantics).
- CUDA: returns a metadata-only alias that shares the same device pointer (no kernel launch, no device-to-device copy).
Backward
If autograd is enabled, backward reshapes grad_out back to the original
shape of the parent tensor.
Notes
- Reshape validity is checked using NumPy semantics, including support for `-1` inference.
- For CUDA tensors, an allocation must exist if `numel != 0` (i.e., `data` cannot be 0 for non-empty tensors).
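The `-1` inference mentioned in the notes can be illustrated with a small pure-Python sketch. The helper name `infer_shape` is hypothetical; it mirrors the NumPy rule that at most one dimension may be `-1` and that the element count must be preserved, and it is not keydnn's own validation code.

```python
from math import prod


def infer_shape(numel: int, new_shape: tuple) -> tuple:
    """Resolve a single -1 entry so that the element count is preserved."""
    unknown = [i for i, d in enumerate(new_shape) if d == -1]
    if len(unknown) > 1:
        raise ValueError("only one dimension may be -1")
    known = prod(d for d in new_shape if d != -1)
    if unknown:
        # The missing dimension must divide the element count exactly.
        if known == 0 or numel % known != 0:
            raise ValueError(f"cannot reshape {numel} elements into {new_shape}")
        resolved = list(new_shape)
        resolved[unknown[0]] = numel // known
        return tuple(resolved)
    if known != numel:
        raise ValueError(f"cannot reshape {numel} elements into {new_shape}")
    return tuple(new_shape)


print(infer_shape(12, (3, -1)))  # the -1 resolves to 4
```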
stack
staticmethod
stack(
tensors: Sequence["Tensor"], axis: int = 0
) -> "Tensor"
Stack a sequence of tensors along a new axis.
This is the differentiable counterpart of np.stack.
Requirements
- `tensors` must be non-empty
- all tensors must share the same shape
- all tensors must share the same device
- CPU: forward/backward uses NumPy (backward compatible)
- CUDA: forward/backward uses `stack_cuda_ext` kernels (device-pointer based) and does NOT call `to_numpy()` on CUDA tensors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tensors | Sequence[Tensor] | Input tensors to stack. Must be non-empty and same-shape. | required |
| axis | int | Axis at which the new dimension is inserted. Supports negative axes. Defaults to 0. | 0 |
Returns:
| Type | Description |
|---|---|
| Tensor | A new tensor whose shape is: out.shape = in.shape[:axis] + (len(tensors),) + in.shape[axis:] |
Notes
- Forward returns a copy (not a view).
- Backward splits `grad_out` along `axis` and routes each slice back to the corresponding parent tensor.
- CUDA backward overwrites dx buffers (no accumulation inside kernel).
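The output-shape rule in the Returns table can be checked with a short sketch. The helper `stack_shape` is a hypothetical name; it assumes negative axes are normalized against the output rank (input rank plus one), which is how `np.stack` treats them.

```python
def stack_shape(in_shape: tuple, count: int, axis: int = 0) -> tuple:
    """Shape of stacking `count` tensors of shape `in_shape` along a new axis."""
    ndim_out = len(in_shape) + 1
    if not -ndim_out <= axis < ndim_out:
        raise ValueError(f"axis {axis} is out of bounds for rank {ndim_out}")
    if axis < 0:
        axis += ndim_out  # negative axes count from the end of the output
    return in_shape[:axis] + (count,) + in_shape[axis:]


print(stack_shape((2, 3), 4, axis=0))
print(stack_shape((2, 3), 4, axis=-1))
```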
concat
staticmethod
concat(
tensors: Sequence["Tensor"], axis: int = 0
) -> "Tensor"
Concatenate a sequence of tensors along an existing axis.
CPU behavior
- Fully supports all axes via NumPy.
CUDA behavior (current)
- Supports only axis == 0 for CUDA tensors using device-to-device memcpy.
- General axis concatenation on CUDA requires a kernel (pending).
Requirements
- `tensors` must be non-empty
- all tensors must share the same device
- shapes must match on all dimensions except `axis`
- dtypes must match (CUDA path enforces this; CPU path preserves current float32-cast behavior)
Backward rule
- Split `grad_out` along `axis` into slices matching each input's size along that axis, and route each slice back to the corresponding parent.
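The backward rule amounts to computing slice boundaries from each input's size along the concatenation axis. A minimal sketch on a 1-D gradient follows; `split_bounds` is a hypothetical helper, and plain Python lists stand in for tensors.

```python
def split_bounds(sizes: list) -> list:
    """Return (start, stop) index pairs, one per input, along the concat axis."""
    bounds = []
    start = 0
    for size in sizes:
        bounds.append((start, start + size))
        start += size
    return bounds


# Three inputs with sizes 2, 3 and 1 along the axis were concatenated.
grad_out = [10, 11, 12, 13, 14, 15]
bounds = split_bounds([2, 3, 1])
# Each parent receives the slice of grad_out that its data occupied.
parent_grads = [grad_out[a:b] for a, b in bounds]
print(parent_grads)
```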
broadcast_to
broadcast_to(shape: Tuple[int, ...]) -> ITensor
Broadcast this tensor to a target shape by explicit expansion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| shape | tuple[int, ...] | Target shape to broadcast to. | required |
Returns:
| Type | Description |
|---|---|
| ITensor | Broadcasted tensor (materialized copy). |
Notes
- The operation is conceptually a "repeat/expand" that materializes a new tensor of the requested shape.
- Backward typically reduces gradients by summing over the broadcasted dimensions (i.e., the inverse of expansion).
sum
sum(
axis: Optional[int] = None, keepdims: bool = False
) -> ITensor
Compute the sum of tensor elements.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| axis | Optional[int] | Axis along which to compute the sum. If None, all elements are reduced into a scalar. | None |
| keepdims | bool | If True, retains reduced dimensions with size 1. Defaults to False. | False |
Returns:
| Type | Description |
|---|---|
| ITensor | Tensor containing the summed values. The shape depends on `axis` and `keepdims`. |
Notes
Backward rule: The upstream gradient is broadcast back to the input tensor's shape, i.e., each input element receives the gradient of the corresponding reduced output.
Backend-specific implementations may impose additional constraints (e.g., limited axis support on CUDA).
mean
mean() -> ITensor
Compute the arithmetic mean of all elements in the tensor.
This operation always reduces the tensor to a scalar value.
Returns:
| Type | Description |
|---|---|
| ITensor | A scalar tensor (shape=()) containing the mean value. |
Notes
Backward rule: The gradient is distributed uniformly to all input elements:
``d(mean(x)) / dx = 1 / numel(x)``
No axis argument is currently supported; the reduction is always performed over all elements.
log
log() -> ITensor
Compute the elementwise natural logarithm of the tensor.
Returns:
| Type | Description |
|---|---|
| ITensor | A tensor of the same shape as this tensor. |
Notes
CPU behavior
- Uses NumPy to compute the forward pass for CPU tensors.
CUDA behavior (workaround)
- For CUDA tensors, this method currently performs a CPU round-trip: device → host (`to_numpy`) → NumPy `log` → device (`copy_from_numpy`).
- This preserves correctness and autograd semantics but is not performance-optimal.
Autograd
If self.requires_grad is True, the backward rule is:
``d(log(x)) / dx = 1 / x``
The behavior for non-positive input values follows NumPy semantics
(e.g., -inf or nan).
TODO
Implement a native CUDA kernel for log (and a fused backward) to
avoid device↔host transfers.
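The backward rule `d(log(x)) / dx = 1 / x` can be sanity-checked against a central finite difference. This is a scalar sketch using only the standard library; `log_backward` and `numeric_grad` are hypothetical helper names, not keydnn functions.

```python
import math


def log_backward(x: float, grad_out: float) -> float:
    """Analytic rule: upstream gradient scaled by 1 / x."""
    return grad_out / x


def numeric_grad(f, x: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x = 2.0
print(log_backward(x, 1.0), numeric_grad(math.log, x))
```

Note that Python's `math.log` raises `ValueError` for non-positive input, whereas the NumPy semantics described above yield `-inf` or `nan` instead.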
exp
exp() -> ITensor
Compute the elementwise exponential of the tensor.
Returns:
| Type | Description |
|---|---|
| ITensor | A tensor of the same shape as this tensor. |
Notes
Backward rule:
d(exp(x)) / dx = exp(x)
CUDA behavior
- Uses the native CUDA unary exponential kernel via `unary_cuda_ext.exp_forward`.
- Operates directly on device pointers without a NumPy round-trip.
max
max(axis: int = -1, keepdims: bool = False) -> ITensor
Compute the maximum values along a given axis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| axis | int | Axis along which to compute the maximum. Defaults to -1. | -1 |
| keepdims | bool | Whether to retain reduced dimensions with size 1. Defaults to False. | False |
Returns:
| Type | Description |
|---|---|
| ITensor | Tensor containing the maximum values along the specified axis. |
CUDA support
- Only supports 2D input tensors.
- `axis` must reduce exactly one dimension: `axis in {0, 1, -1, -2}`.
- Backward propagation routes the gradient to a single argmax index per slice (ties are not split).
CPU notes
Backward rule: Gradients are routed to all positions equal to the maximum value using a mask, i.e.:
``dx = grad_out * 1[x == max(x)]``
Notes
The exact behavior (including tie handling and shape semantics) is backend-dependent but must conform to the rules documented here.
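The two tie-handling rules described above differ only when the maximum occurs more than once. The sketch below contrasts them on a single 1-D slice; both helper names are hypothetical, and choosing the first occurrence as the argmax is an assumption, since the text only says that a single index receives the gradient.

```python
def max_backward_mask(xs: list, grad_out: float) -> list:
    """CPU-style rule: every position equal to the maximum receives grad_out."""
    m = max(xs)
    return [grad_out if x == m else 0.0 for x in xs]


def max_backward_argmax(xs: list, grad_out: float) -> list:
    """CUDA-style rule: exactly one argmax position receives grad_out.

    Picking the first occurrence is an assumption made for this sketch.
    """
    idx = xs.index(max(xs))
    return [grad_out if i == idx else 0.0 for i in range(len(xs))]


xs = [1.0, 3.0, 3.0]  # the maximum is tied at positions 1 and 2
print(max_backward_mask(xs, 1.0))
print(max_backward_argmax(xs, 1.0))
```

With ties, the mask rule delivers the full upstream gradient to each tied position, so the total gradient mass differs between the two backends.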
matmul
matmul(other: 'Tensor') -> 'Tensor'
Matrix multiplication (2D): out = self @ other.
Requirements
- CPU: unchanged behavior (backward compatible)
- CUDA: both operands must be CUDA, 2D, inner dims match
- CUDA inputs MUST already have allocated device buffers (data != 0). (No implicit allocation for inputs.)
Backward
If out = A @ B, then:
- dL/dA = dL/dout @ B^T
- dL/dB = A^T @ dL/dout
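The backward formulas can be worked through on small matrices. The sketch below uses nested Python lists in place of tensors; `matmul`, `transpose`, and `matmul_backward` are hypothetical helpers written for illustration and do not reflect keydnn's kernels.

```python
def matmul(a: list, b: list) -> list:
    """2D matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: list) -> list:
    """2D transpose of a nested list."""
    return [list(col) for col in zip(*a)]


def matmul_backward(a: list, b: list, grad_out: list) -> tuple:
    """Apply dL/dA = dL/dout @ B^T and dL/dB = A^T @ dL/dout."""
    grad_a = matmul(grad_out, transpose(b))
    grad_b = matmul(transpose(a), grad_out)
    return grad_a, grad_b


A = [[1, 2, 3], [4, 5, 6]]    # shape (2, 3)
B = [[1, 0], [0, 1], [2, 2]]  # shape (3, 2)
G = [[1, 1], [1, 1]]          # upstream gradient, shape (2, 2)
grad_A, grad_B = matmul_backward(A, B, G)
print(grad_A)  # same shape as A
print(grad_B)  # same shape as B
```

Each gradient has the shape of the operand it belongs to, which is a quick way to check that the transposes are on the correct side.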
transpose
transpose() -> ITensor
Return the 2D transpose of this tensor.
For a 2D tensor A with shape (M, N), transpose returns Aᵀ with shape (N, M) such that:
out[i, j] = self[j, i]
Requirements
- Input must be 2D.
- CPU and CUDA are supported by the concrete implementation.
Backward
If out = Aᵀ, then dL/dA = (dL/dout)ᵀ.
backward
backward(
grad_out: Optional["Tensor"] = None,
*,
profile: bool = False,
profile_topk: int = 20,
) -> None
Backpropagate gradients from this tensor through the autograd graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| grad_out | Optional[Tensor] | Gradient w.r.t. this tensor. If omitted, this tensor must be a scalar (shape == ()) and the gradient is assumed to be 1.0. | None |
Notes
- Gradients are accumulated into `.grad` of leaf tensors that have `requires_grad=True`.
- This implementation performs a reverse topological traversal.
- CPU behavior is unchanged (same logic; only extra timing when profile=True).
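A reverse topological traversal can be sketched with a toy scalar graph. The `Node` class and `backward` function below are hypothetical and far simpler than keydnn's autograd (no devices, no `requires_grad` filtering, and every node keeps a gradient rather than only the leaves); they only illustrate the visiting order and the accumulation step.

```python
class Node:
    """Toy scalar autograd node holding a value and its local derivatives."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other), (other.value, self.value))


def backward(root, grad_out=1.0):
    """Accumulate gradients by visiting nodes in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)  # appended only after all of its parents

    visit(root)
    root.grad = grad_out
    for node in reversed(order):  # each node is processed before its parents
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local  # accumulate, do not overwrite


x, y = Node(3.0), Node(4.0)
z = x * y + x  # dz/dx = y + 1, dz/dy = x
backward(z)
print(x.grad, y.grad)
```

Because `x` feeds two different nodes, its gradient is the sum of two contributions, which is why the update accumulates.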
item
item() -> float
Return the value of a scalar (or single-element) tensor as a Python float.
Notes
- CPU: reads from `_data`.
- CUDA: uses `to_numpy()` (D2H) to fetch the scalar.
sqrt
sqrt() -> ITensor
Compute the elementwise square root of the tensor.
Returns:
| Type | Description |
|---|---|
| ITensor | A tensor with the same shape as this tensor. |
Notes
CPU behavior
- Uses NumPy to compute the forward pass for CPU tensors.
CUDA behavior (workaround)
- For CUDA tensors, this method currently performs a CPU round-trip: device → host (`to_numpy`) → NumPy `sqrt` → device (`copy_from_numpy`).
- This preserves correctness and autograd semantics but is not performance-optimal.
Autograd
If self.requires_grad is True, the returned tensor participates in
autograd with parent self.
TODO
Implement a native CUDA kernel for sqrt (and optionally a fused
backward) to avoid device↔host transfers.
rand
staticmethod
rand(
shape, *, device, requires_grad: bool = False
) -> "Tensor"
Create a tensor filled with uniform random values in [0, 1) on the given device.
Notes
- CPU: random values are generated using NumPy.
- CUDA: random values are generated on CPU and transferred to device memory. No CUDA RNG kernel is used.
- This mirrors the initialization strategy used by many frameworks and keeps behavior deterministic and easy to test.
- The returned tensor has dtype float32 and ctx=None.
full
classmethod
full(
shape: tuple,
fill_value: float,
*,
device: Device,
requires_grad: bool = False,
) -> ITensor
Create a tensor filled with a constant value.
Backward compatibility
- CPU path preserves the original logic exactly: uses NumPy to create a filled array and calls copy_from_numpy().
- Default dtype remains float32, matching the original implementation.
- The returned tensor has ctx=None.
CUDA
- Allocates a CUDA tensor (no host staging) and fills via Tensor.fill(), which dispatches to the CUDA fill kernel after ensuring allocation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| shape | tuple | Desired tensor shape. May be any shape accepted by NumPy. | required |
| fill_value | float | Constant value to write into every element. | required |
| device | Device | Target device placement (CPU or CUDA). | required |
| requires_grad | bool | Whether the returned tensor should participate in autograd. Defaults to False. | False |
Returns:
| Type | Description |
|---|---|
| ITensor | A newly allocated tensor with the given shape, filled with `fill_value`. |
Raises:
| Type | Description |
|---|---|
| RuntimeError | If |
tanh
tanh() -> ITensor
Compute the elementwise hyperbolic tangent of the tensor.
Returns:
| Type | Description |
|---|---|
| ITensor | A tensor with the same shape as this tensor. |
Notes
- This method delegates to the autograd `TanhFn` Function.
- NumPy is not used directly here; numerical kernels remain encapsulated inside Tensor operations or autograd Functions.
sigmoid
sigmoid() -> ITensor
Compute the elementwise logistic sigmoid of the tensor.
The sigmoid function is defined as:
``sigmoid(x) = 1 / (1 + exp(-x))``
Returns:
| Type | Description |
|---|---|
| ITensor | A tensor with the same shape as this tensor. |
Notes
This method is a thin convenience wrapper around SigmoidFn defined
in ._function and integrates with the autograd system.
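Evaluating the definition `1 / (1 + exp(-x))` literally can overflow for large negative inputs. The scalar sketch below shows a common numerically stable formulation; it is a generic technique, not a description of how `SigmoidFn` is implemented, and the function name `sigmoid` here is a standalone helper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid that never exponentiates a large positive number."""
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form exp(x) / (1 + exp(x)).
    z = math.exp(x)
    return z / (1.0 + z)


print(sigmoid(0.0))
print(sigmoid(-1000.0))  # the literal formula would raise OverflowError here
```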
zeros
classmethod
zeros(
*,
shape: tuple[int, ...],
device: Device,
requires_grad: bool = False,
dtype: Any = np.float32,
) -> ITensor
Create a tensor filled with zeros on the specified device.
This factory method constructs a tensor with the given shape and device. For CPU tensors, a NumPy array is allocated and zero-initialized. For CUDA tensors, device memory is allocated and zeroed via a CUDA fill routine.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| shape | tuple[int, ...] | Shape of the output tensor. | required |
| device | Device | Target device placement (CPU or CUDA). | required |
| requires_grad | bool | Whether the tensor should track gradients for autograd. | False |
| dtype | Any | The data type of the target tensor. | float32 |
Returns:
| Type | Description |
|---|---|
| ITensor | Newly created tensor filled with zeros. |
Notes
- The dtype is currently fixed to `float32`.
- Zero-sized tensors are valid and return immediately without invoking CUDA kernels.
ones
classmethod
ones(
*,
shape: tuple[int, ...],
device: Device,
requires_grad: bool = False,
dtype: Any = np.float32,
) -> ITensor
Create a tensor filled with ones on the specified device.
This factory method constructs a tensor with the given shape and device. For CPU tensors, a NumPy array is allocated and initialized with ones. For CUDA tensors, device memory is allocated and filled using a native CUDA fill routine.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| shape | tuple[int, ...] | Shape of the output tensor. | required |
| device | Device | Target device placement (CPU or CUDA). | required |
| requires_grad | bool | Whether the tensor should track gradients for autograd. | False |
| dtype | Any | The data type of the target tensor. | float32 |
Returns:
| Type | Description |
|---|---|
| ITensor | Newly created tensor filled with ones. |
Notes
- The dtype is currently fixed to `float32`.
- Zero-sized tensors are valid and return immediately without invoking CUDA kernels.
- The CUDA path prioritizes correctness and may fall back to a slower initialization strategy if the native fill kernel fails.
to
to(device, *, copy: bool = False) -> ITensor
Move or copy this tensor to another device.
This method implements explicit device placement transitions and returns
a tensor on device. If the target device matches the current device,
it returns self by default (or a cloned copy if copy=True).
Supported transfers
- CPU -> CUDA: allocates device memory and performs host-to-device memcpy.
- CUDA -> CPU: allocates host buffer and performs device-to-host memcpy.
- CUDA -> CUDA (different device indices): currently implemented via an intermediate CPU round-trip for simplicity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| device | Device \| str | Target device. If a string is provided, it is parsed as a | required |
| copy | bool | If True, forces a copy even when the device is unchanged. Defaults to False. | False |
Returns:
| Type | Description |
|---|---|
| ITensor | A tensor placed on the requested device. |
Notes
- For CPU -> CUDA copies, the host buffer is made C-contiguous before raw memcpy to ensure correct layout.
- This method does not propagate autograd context; returned tensors are created with `requires_grad=False` and `ctx=None` in the transfer paths.
clone
clone() -> ITensor
Create a deep copy of the tensor's storage.
- CPU: copies the underlying NumPy array into a new tensor.
- CUDA: allocates a new device buffer and performs a device-to-device memcpy.
Returns:
| Type | Description |
|---|---|
| ITensor | A new tensor with identical contents. |
Notes
`clone()` is intended to copy raw storage and typically returns a tensor with `requires_grad=False` and no autograd context (`ctx=None`).
sum_to_shape
sum_to_shape(target_shape: tuple[int, ...]) -> ITensor
Sum-reduce this tensor to target_shape (inverse of broadcasting).
This primitive is commonly used in autograd to reduce broadcasted gradients back to the original source shape.
Dispatches via tensor_control_path_manager.
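Reducing a broadcasted gradient back to its source shape comes down to deciding which axes to sum over. The sketch below computes those axes from the two shapes under standard NumPy-style broadcasting rules; `reduction_axes` is a hypothetical helper and says nothing about how `tensor_control_path_manager` dispatches the actual reduction.

```python
def reduction_axes(src_shape: tuple, target_shape: tuple) -> tuple:
    """Return (drop_axes, keep_axes) needed to sum src_shape down to target_shape.

    drop_axes: leading axes added by broadcasting; sum them and remove them.
    keep_axes: axes where the target has size 1; sum them with keepdims=True.
    """
    lead = len(src_shape) - len(target_shape)
    if lead < 0:
        raise ValueError("target has more dimensions than source")
    drop_axes = tuple(range(lead))
    keep_axes = []
    for offset, (s, t) in enumerate(zip(src_shape[lead:], target_shape)):
        if s == t:
            continue  # this axis was not broadcast
        if t != 1:
            raise ValueError(f"{target_shape} does not broadcast to {src_shape}")
        keep_axes.append(lead + offset)
    return drop_axes, tuple(keep_axes)


# A (3, 1) tensor broadcast to (4, 3, 5): sum away axis 0, sum axis 2 keeping size 1.
print(reduction_axes((4, 3, 5), (3, 1)))
```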
free_
free_() -> None
Explicitly release CUDA backing memory (if owned).
Semantics
- If `_storage` is present: decrement refcount; free happens when it hits 0.
- If `_storage` is absent:
  - If this tensor is marked as borrowed, we do NOT free the devptr.
  - If marked as "owning borrowed devptr" (transitional), we cuda_free it.
- Idempotent: safe to call multiple times.
clamp
clamp(
*, min: float | None = None, max: float | None = None
) -> "Tensor"
Clamp tensor values elementwise between min and max.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| min | float | Minimum value. If None, no lower bound is applied. | None |
| max | float | Maximum value. If None, no upper bound is applied. | None |
Returns:
| Type | Description |
|---|---|
| Tensor | A new Tensor with values clamped to the specified range. |
Notes
- Gradients pass through unchanged for values within the range.
- Gradients are zeroed for values clipped by `min` or `max`.
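The gradient rule in the notes can be sketched on plain Python lists. The helpers `clamp` and `clamp_backward` are hypothetical; treating values exactly equal to a bound as "within the range" (gradient passes through) is an assumption, since the text does not specify the boundary case.

```python
def clamp(xs: list, lo=None, hi=None) -> list:
    """Forward pass: clip each value into [lo, hi]; None disables a bound."""
    out = []
    for x in xs:
        if lo is not None and x < lo:
            x = lo
        if hi is not None and x > hi:
            x = hi
        out.append(x)
    return out


def clamp_backward(xs: list, grads: list, lo=None, hi=None) -> list:
    """Backward pass: pass gradients through unless the input was clipped.

    Inputs exactly equal to a bound keep their gradient (assumption).
    """
    out = []
    for x, g in zip(xs, grads):
        clipped = (lo is not None and x < lo) or (hi is not None and x > hi)
        out.append(0.0 if clipped else g)
    return out


xs = [-2.0, 0.5, 3.0]
print(clamp(xs, lo=0.0, hi=1.0))
print(clamp_backward(xs, [1.0, 1.0, 1.0], lo=0.0, hi=1.0))
```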
to_
to_(device, *, copy: bool = False) -> ITensor
Move this tensor to another device in-place.
This method performs an in-place device placement transition. Unlike
to(), which may return a newly allocated tensor on the target device,
to_() preserves the identity of self (i.e., id(self) is unchanged)
by migrating the underlying storage and updating device placement fields
on the same object.
Semantics
- If the target device matches the current device:
  - returns `self` (no-op), unless `copy=True`, in which case the tensor's storage is replaced with a cloned copy on the same device.
- If the target device differs:
  - performs the transfer using `to(device, copy=True)` internally, then swaps this tensor's backing storage to the transferred result.
Supported transfers
- CPU -> CUDA: allocates device memory and performs host-to-device memcpy.
- CUDA -> CPU: allocates host buffer and performs device-to-host memcpy.
- CUDA -> CUDA (different device indices): currently implemented via an intermediate CPU round-trip for simplicity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| device | Device \| str | Target device. If a string is provided, it is parsed as a | required |
| copy | bool | If True, forces a copy even when the device is unchanged. This is the in-place analogue of | False |
Returns:
| Type | Description |
|---|---|
| ITensor | This tensor (`self`). |
Notes
- This method intentionally does not preserve autograd context across device transfers. If this tensor participates in autograd, you should treat `to_()` as a graph break:
  - `ctx` is cleared.
  - `requires_grad` is left unchanged (you can still accumulate grads going forward), but any prior graph history is discarded.
- If `self.grad` exists and is a Tensor-like object with `.to(...)`, this method will attempt to move it to the same device as well.
keydnn.Device
Concrete computation device descriptor.
This class encapsulates a normalized representation of a computation device, including its type (CPU or CUDA) and, for CUDA devices, an optional device index (e.g., cuda:0).
The class performs strict validation of device strings to ensure a small, well-defined set of supported device identifiers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| device | str | Device identifier string. Must be either "cpu" or a "cuda:"-prefixed identifier (e.g., "cuda:0"). | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the provided device string does not match the supported formats. |
Notes
- `__slots__` is used to prevent dynamic attribute creation and reduce per-instance memory overhead.
- This class is intentionally lightweight and does not allocate or manage any backend resources.
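A descriptor with strict string validation and `__slots__` might look roughly like the sketch below. `SketchDevice`, its attribute names, and the exact accepted pattern (an explicit non-negative index after `cuda:`) are assumptions made for illustration; the real `keydnn.Device` may accept a different set of strings.

```python
import re

# Assumed pattern: "cuda:" followed by a decimal device index.
_CUDA_RE = re.compile(r"cuda:([0-9]+)")


class SketchDevice:
    """Hypothetical stand-in for a lightweight device descriptor."""

    __slots__ = ("type", "index")  # no per-instance __dict__

    def __init__(self, device: str) -> None:
        if device == "cpu":
            self.type, self.index = "cpu", None
            return
        match = _CUDA_RE.fullmatch(device)
        if match is None:
            raise ValueError(f"unsupported device string: {device!r}")
        self.type, self.index = "cuda", int(match.group(1))

    def is_cpu(self) -> bool:
        return self.type == "cpu"

    def is_cuda(self) -> bool:
        return self.type == "cuda"


dev = SketchDevice("cuda:0")
print(dev.is_cuda(), dev.index)
```

Because of `__slots__`, assigning an attribute that is not listed raises `AttributeError`, which is the "no dynamic attribute creation" behavior described in the notes.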
is_cpu
is_cpu() -> bool
Check whether this device represents a CPU.
Returns:
| Type | Description |
|---|---|
| bool | True if the device type is CPU, False otherwise. |
is_cuda
is_cuda() -> bool
Check whether this device represents a CUDA GPU.
Returns:
| Type | Description |
|---|---|
| bool | True if the device type is CUDA, False otherwise. |