Tensors

keydnn.Tensor

Bases: Tensor

Presentation-layer Tensor.

This subclass preserves the full infrastructure Tensor API (including classmethods/staticmethods like zeros, ones, rand, etc.) while applying presentation-friendly defaults for direct construction.

Notes

Direct construction defaults to CPU if device is omitted.
Direct construction defaults to zero-initialized storage because that is what most users expect from a high-level DL API.

dtype `property`

dtype: dtype

Return the element dtype of this tensor.

Returns:

Type	Description
`dtype`	NumPy dtype representing the tensor element type.

Notes

For CPU tensors, this matches the underlying ndarray dtype.
For CUDA tensors, this is metadata used for kernel dispatch and tests.

shape `property`

shape: tuple[int, ...]

Return the tensor shape.

Returns:

Type	Description
`tuple[int, ...]`	The tensor's shape.

device `property`

device: Device

Return the device on which this tensor resides.

Returns:

Type	Description
`Device`	The tensor's device placement descriptor.

requires_grad `property` `writable`

requires_grad: bool

Indicate whether this tensor should accumulate gradients.

Returns:

Type	Description
`bool`	True if gradients should be tracked/accumulated, False otherwise.

grad `property`

grad: Optional['Tensor']

Return the gradient tensor associated with this tensor (if any).

Returns:

Type	Description
`Optional[Tensor]`	The stored gradient tensor, or None if not computed or cleared.

T `property`

T: ITensor

Convenience property for 2D transpose.

Equivalent to calling self.transpose().

data `property`

data: int | ndarray

Return the underlying data handle for this tensor.

Semantics

CPU tensors: Returns the underlying NumPy ndarray storing the tensor data.
CUDA tensors: Returns the raw device pointer (dev_ptr) as an integer.

The pointer is resolved as follows:
1) If the tensor is backed by a _CudaStorage object, return storage.dev_ptr.
2) Otherwise, fall back to the legacy _data field, which may contain a raw device pointer set by older code paths or tests.

Notes

For CUDA tensors, the returned value is not a NumPy array and should be treated as an opaque device pointer handle.
A return value of 0 indicates that no device memory is currently allocated (e.g., uninitialized tensor, freed tensor, or zero-sized tensor).
New code should prefer storage-backed tensors; the _data fallback exists only for backward compatibility during migration.

Returns:

Type	Description
`int \| ndarray`	NumPy ndarray for CPU tensors. Integer device pointer (`uintptr_t`) for CUDA tensors.

Raises:

Type	Description
`ValueError`	If the tensor is on an unsupported or unknown device.

nbytes `property`

nbytes: int

Total bytes required to store this tensor's elements.

Computed as: numel() * itemsize(dtype)

Notes

This is metadata-only and does not require the underlying storage to be allocated.
For empty tensors (numel == 0), this returns 0.
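
The formula above can be sketched in plain Python. This is only an illustration of the arithmetic, not KeyDNN's implementation; the helper name `nbytes_for` and its `itemsize` argument are hypothetical.

```python
import math


def nbytes_for(shape: tuple, itemsize: int) -> int:
    """Bytes needed for a dense tensor: numel * itemsize (metadata only)."""
    # math.prod(()) == 1, so a scalar tensor (shape == ()) has one element.
    numel = math.prod(shape)
    return numel * itemsize


print(nbytes_for((2, 3), 4))  # float32 matrix: 24
print(nbytes_for((0, 5), 4))  # empty tensor: 0
```

No storage is touched here, which matches the note that the value is metadata-only.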

device_type `property`

device_type: str

Return the current device type as a string, either cpu or cuda.

Returns:

Type	Description
`str`	The device type. Either `cpu` or `cuda`.

zero_grad

zero_grad() -> None

Clear the stored gradient.

Notes

Training loops typically call zero_grad() before backprop to avoid unintentional accumulation across iterations.

to_numpy

to_numpy() -> np.ndarray

Convert the tensor to a NumPy ndarray on the host.

Returns:

Type	Description
`ndarray`	A NumPy array containing the tensor data.

Raises:

Type	Description
`RuntimeError`	If device-to-host transfer is unavailable or the tensor's dtype is unknown.

Notes

CPU tensors return a view/copy of the underlying CPU storage (preserving existing behavior of the concrete tensor type).
CUDA tensors are copied from device to host via a device-to-host memcpy.

copy_from_numpy

copy_from_numpy(arr: ndarray) -> None

Copy data from a NumPy array (or array-like / scalar) into this tensor.

Backward compatibility (CPU)

The original implementation accepted any array-like input (including NumPy scalars like np.float32) by calling np.asarray(arr, dtype=np.float32). This method preserves that behavior for the CPU path.

CUDA

Accepts any array-like / scalar.
Casts to self.dtype, makes it C-contiguous, then performs a host-to-device memcpy into the tensor's existing (or newly allocated) device buffer.

Parameters:

Name	Type	Description	Default
`arr`	`Any`	Array-like object accepted by `np.asarray`, including NumPy scalars.	required

Raises:

Type	Description
`ValueError`	If the input shape does not match this tensor's shape.
`RuntimeError`	If the tensor's device is unsupported or a CUDA copy fails.
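
The host-side preparation described for the CUDA path (cast, shape check, C-contiguous layout) can be sketched with NumPy alone. The helper `stage_for_h2d` is hypothetical and stops short of the actual host-to-device memcpy, which depends on KeyDNN's CUDA bindings.

```python
import numpy as np


def stage_for_h2d(arr, dtype, shape):
    """Prepare a host buffer for a raw host-to-device copy (hypothetical helper)."""
    host = np.asarray(arr, dtype=dtype)  # accepts array-likes and scalars
    if host.shape != tuple(shape):
        raise ValueError(f"shape mismatch: expected {tuple(shape)}, got {host.shape}")
    # A raw memcpy copies bytes in memory order, so the buffer must be C-contiguous.
    return np.ascontiguousarray(host)


# A transposed array is a strided view and is not C-contiguous.
src = np.arange(6, dtype=np.float64).reshape(2, 3).T
buf = stage_for_h2d(src, np.float32, (3, 2))
```

After staging, `buf` has the tensor's dtype and a layout that a byte-wise copy can consume directly.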

copy_from

copy_from(
    other: ITensor, *, allow_cross_device: bool = False
) -> None

Copy data from another tensor into this tensor (in-place).

Parameters:

Name	Type	Description	Default
`other`	`Tensor`	Source tensor.	required
`allow_cross_device`	`bool`	If False (backward-compatible), require `self.device` and `other.device` to match exactly (string compare) and perform same-device copies only. If True, the cross-device transfers listed below the table are also allowed.	`False`

With allow_cross_device=True, the following transfers are allowed:
- CPU -> CPU (same as before)
- CUDA -> CUDA (D2D memcpy; same device index is recommended)
- CPU -> CUDA (HtoD memcpy)
- CUDA -> CPU (DtoH memcpy)

Notes

Shape must match.
dtype must match (no implicit casting).
Cross-GPU copies (cuda:0 -> cuda:1) are not handled here; they may work only if the memcpy wrapper supports peer copies. By default this method raises for different CUDA device indices.

Raises:

Type	Description
`(TypeError, ValueError, RuntimeError)`

fill

fill(value: float) -> None

Fill the tensor with a scalar value.

This method overwrites every element of the tensor with value. The concrete implementation may dispatch differently depending on device placement (e.g., NumPy fill on CPU vs. a CUDA fill kernel).

Parameters:

Name	Type	Description	Default
`value`	`float`	Scalar value to write into the tensor.	required

Raises:

Type	Description
`RuntimeError`	If the tensor's device is unsupported by this implementation.

debug_storage_repr

debug_storage_repr() -> str

Return a stable, human-readable description of underlying storage.

Contract

CPU tensors: describe the NumPy ndarray storage.
CUDA tensors: return a stable placeholder string that includes:
- device index
- tensor shape
- (if available) the device pointer value

Notes

This is a debugging aid and intentionally does not expose full contents.

numel

numel() -> int

Return the total number of elements in the tensor.

Returns:

Type	Description
`int`	Product of all dimensions in the tensor shape.

reshape

reshape(new_shape: tuple[int, ...]) -> ITensor

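Return a tensor with this tensor's elements arranged in a new shape. Whether the result aliases the original storage depends on the device (see Behavior).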

This operation changes the logical shape while preserving the total number of elements.

Behavior

CPU: reshapes via NumPy and then materializes via copy_from_numpy (preserving current copy-based semantics).
CUDA: returns a metadata-only alias that shares the same device pointer (no kernel launch, no device-to-device copy).

Backward

If autograd is enabled, backward reshapes grad_out back to the original shape of the parent tensor.

Notes

Reshape validity is checked using NumPy semantics, including support for -1 inference.
For CUDA tensors, an allocation must exist if numel != 0 (i.e., data cannot be 0 for non-empty tensors).
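
The -1 inference mentioned above follows the usual NumPy rule: at most one dimension may be -1, and it is resolved from the element count. A minimal sketch of that rule, using a hypothetical helper `infer_shape` rather than KeyDNN's own validation code:

```python
import math


def infer_shape(numel: int, new_shape: tuple) -> tuple:
    """Resolve a single -1 in new_shape so the element count is preserved."""
    unknown = [i for i, d in enumerate(new_shape) if d == -1]
    if len(unknown) > 1:
        raise ValueError("only one dimension may be -1")
    if any(d < -1 for d in new_shape):
        raise ValueError("negative dimensions other than -1 are not allowed")
    known = math.prod(d for d in new_shape if d != -1)
    if unknown:
        if known == 0 or numel % known != 0:
            raise ValueError(f"cannot reshape {numel} elements into {new_shape}")
        resolved = list(new_shape)
        resolved[unknown[0]] = numel // known
        return tuple(resolved)
    if known != numel:
        raise ValueError(f"cannot reshape {numel} elements into {new_shape}")
    return tuple(new_shape)


print(infer_shape(12, (3, -1)))  # (3, 4)
```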

stack `staticmethod`

stack(
    tensors: Sequence["Tensor"], axis: int = 0
) -> "Tensor"

Stack a sequence of tensors along a new axis.

This is the differentiable counterpart of np.stack.

Requirements

tensors must be non-empty
all tensors must share the same shape
all tensors must share the same device
CPU: forward/backward uses NumPy (backward compatible)
CUDA: forward/backward uses stack_cuda_ext kernels (device-pointer based) and does NOT call to_numpy() on CUDA tensors.

Parameters:

Name	Type	Description	Default
`tensors`	`Sequence[Tensor]`	Input tensors to stack. Must be non-empty and same-shape.	required
`axis`	`int`	Axis at which the new dimension is inserted. Supports negative axes. Defaults to 0.	`0`

Returns:

Type	Description
`Tensor`	A new tensor whose shape is: out.shape = in.shape[:axis] + (len(tensors),) + in.shape[axis:]

Notes

Forward returns a copy (not a view).
Backward splits grad_out along axis and routes each slice back to the corresponding parent tensor.
CUDA backward overwrites dx buffers (no accumulation inside kernel).
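
The shape rule and the backward routing can be illustrated with NumPy, which the CPU path is documented to use. The function names below are hypothetical, and the sketch works on plain arrays rather than KeyDNN tensors.

```python
import numpy as np


def stack_forward(arrays, axis=0):
    """Forward: insert a new axis of length len(arrays); np.stack returns a copy."""
    return np.stack(arrays, axis=axis)


def stack_backward(grad_out, num_inputs, axis=0):
    """Backward: slice i along the stacked axis is the gradient of input i."""
    # np.take with a scalar index drops the stacked axis again.
    return [np.take(grad_out, i, axis=axis) for i in range(num_inputs)]


xs = [np.full((2, 3), float(i), dtype=np.float32) for i in range(4)]
out = stack_forward(xs, axis=1)  # shape (2, 4, 3)
grads = stack_backward(np.ones_like(out), len(xs), axis=1)
```

Each entry of `grads` has the shape of the corresponding input, `(2, 3)`.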

concat `staticmethod`

concat(
    tensors: Sequence["Tensor"], axis: int = 0
) -> "Tensor"

Concatenate a sequence of tensors along an existing axis.

CPU behavior

Fully supports all axes via NumPy.

CUDA behavior (current)

Supports only axis == 0 for CUDA tensors using device-to-device memcpy.
General axis concatenation on CUDA requires a kernel (pending).

Requirements

tensors must be non-empty
all tensors must share the same device
shapes must match on all dimensions except axis
dtypes must match (CUDA path enforces this; CPU path preserves current float32-cast behavior)

Backward rule

Split grad_out along axis into slices matching each input's size along that axis, and route each slice back to the corresponding parent.
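
The backward rule amounts to splitting the upstream gradient at the running totals of the input sizes. A NumPy sketch on plain arrays, with the hypothetical helper `concat_backward`:

```python
import numpy as np


def concat_backward(grad_out, sizes, axis=0):
    """Split grad_out into pieces whose extents along `axis` match `sizes`."""
    # Split points are the cumulative sizes, excluding the final total.
    split_points = np.cumsum(sizes)[:-1]
    return np.split(grad_out, split_points, axis=axis)


a = np.zeros((2, 3), dtype=np.float32)
b = np.ones((5, 3), dtype=np.float32)
out = np.concatenate([a, b], axis=0)  # shape (7, 3)

grad_out = np.arange(21, dtype=np.float32).reshape(7, 3)
grad_a, grad_b = concat_backward(grad_out, [a.shape[0], b.shape[0]], axis=0)
```

`grad_a` receives the first two rows of `grad_out` and `grad_b` the remaining five.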

broadcast_to

broadcast_to(shape: Tuple[int, ...]) -> ITensor

Broadcast this tensor to a target shape by explicit expansion.

Parameters:

Name	Type	Description	Default
`shape`	`tuple[int, ...]`	Target shape to broadcast to.	required

Returns:

Type	Description
`ITensor`	Broadcasted tensor (materialized copy).

Notes

The operation is conceptually a "repeat/expand" that materializes a new tensor of the requested shape.
Backward typically reduces gradients by summing over the broadcasted dimensions (i.e., the inverse of expansion).

sum

sum(
    axis: Optional[int] = None, keepdims: bool = False
) -> ITensor

Compute the sum of tensor elements.

Parameters:

Name	Type	Description	Default
`axis`	`Optional[int]`	Axis along which to compute the sum. If None, all elements are reduced into a scalar.	`None`
`keepdims`	`bool`	If True, retains reduced dimensions with size 1. Defaults to False.	`False`

Returns:

Type	Description
`ITensor`	Tensor containing the summed values. The shape depends on the `axis` and `keepdims` arguments.

Notes

Backward rule: The upstream gradient is broadcast back to the input tensor's shape, i.e., each input element receives the gradient of the corresponding reduced output.

Backend-specific implementations may impose additional constraints (e.g., limited axis support on CUDA).
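
The backward rule can be sketched in NumPy as "restore the reduced axis, then broadcast". The helper `sum_backward` is hypothetical and operates on plain arrays.

```python
import numpy as np


def sum_backward(grad_out, in_shape, axis=None, keepdims=False):
    """Broadcast the upstream gradient of a sum back to the input shape."""
    grad_out = np.asarray(grad_out)
    if axis is not None and not keepdims:
        # Re-insert the reduced axis so broadcasting lines up with the input.
        grad_out = np.expand_dims(grad_out, axis)
    return np.broadcast_to(grad_out, in_shape).copy()


x = np.arange(6, dtype=np.float32).reshape(2, 3)
upstream = np.array([10.0, 20.0], dtype=np.float32)  # gradient w.r.t. x.sum(axis=1)
grad_x = sum_backward(upstream, x.shape, axis=1)
```

Every element in row 0 of `grad_x` is 10 and every element in row 1 is 20, since each input element contributes to exactly one output.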

mean

mean() -> ITensor

Compute the arithmetic mean of all elements in the tensor.

This operation always reduces the tensor to a scalar value.

Returns:

Type	Description
`ITensor`	A scalar tensor (shape=()) containing the mean value.

Notes

Backward rule: The gradient is distributed uniformly to all input elements:

    ``d(mean(x)) / dx = 1 / numel(x)``

No axis argument is currently supported; the reduction is always performed over all elements.

log

log() -> ITensor

Compute the elementwise natural logarithm of the tensor.

Returns:

Type	Description
`ITensor`	A tensor of the same shape as `self`, where each element is replaced by its natural logarithm.

Notes

CPU behavior

Uses NumPy to compute the forward pass for CPU tensors.

CUDA behavior (workaround)

For CUDA tensors, this method currently performs a CPU round-trip: device → host (to_numpy) → NumPy log → device (copy_from_numpy).
This preserves correctness and autograd semantics but is not performance-optimal.

Autograd

If self.requires_grad is True, the backward rule is:

``d(log(x)) / dx = 1 / x``

The behavior for non-positive input values follows NumPy semantics (e.g., -inf or nan).

TODO

Implement a native CUDA kernel for log (and a fused backward) to avoid device↔host transfers.

exp

exp() -> ITensor

Compute the elementwise exponential of the tensor.

Returns:

Type	Description
`ITensor`	A tensor of the same shape as `self` with `exp` applied elementwise.

Notes

Backward rule: d(exp(x)) / dx = exp(x)

CUDA behavior

Uses the native CUDA unary exponential kernel via unary_cuda_ext.exp_forward.
Operates directly on device pointers without a NumPy round-trip.

max

max(axis: int = -1, keepdims: bool = False) -> ITensor

Compute the maximum values along a given axis.

Parameters:

Name	Type	Description	Default
`axis`	`int`	Axis along which to compute the maximum. Defaults to -1.	`-1`
`keepdims`	`bool`	Whether to retain reduced dimensions with size 1. Defaults to False.	`False`

Returns:

Type	Description
`ITensor`	Tensor containing the maximum values along the specified axis.

CUDA support

Only supports 2D input tensors.
axis must reduce exactly one dimension: axis in {0, 1, -1, -2}.
Backward propagation routes the gradient to a single argmax index per slice (ties are not split).

CPU notes

Backward rule: Gradients are routed to all positions equal to the maximum value using a mask, i.e.:

    ``dx = grad_out * 1[x == max(x)]``

Notes

The exact behavior (including tie handling and shape semantics) is backend-dependent but must conform to the rules documented here.
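
The two tie-handling rules give different gradients when a slice contains repeated maxima. The NumPy sketch below contrasts them on plain arrays; the function names are hypothetical, and picking the first tied index in the argmax variant is an assumption (the text only says a single index is chosen).

```python
import numpy as np


def max_backward_mask(x, grad_out, axis=-1):
    """CPU-style rule: every position equal to the max receives the gradient."""
    m = x.max(axis=axis, keepdims=True)
    return np.expand_dims(grad_out, axis) * (x == m)


def max_backward_argmax(x, grad_out, axis=-1):
    """CUDA-style rule: one argmax position per slice receives the gradient."""
    idx = np.expand_dims(x.argmax(axis=axis), axis)  # first maximum (assumption)
    dx = np.zeros_like(x)
    np.put_along_axis(dx, idx, np.expand_dims(grad_out, axis), axis=axis)
    return dx


x = np.array([[1.0, 3.0, 3.0], [2.0, 0.0, 1.0]], dtype=np.float32)
g = np.ones(2, dtype=np.float32)  # upstream gradient for x.max(axis=-1)

dx_mask = max_backward_mask(x, g)      # row 0 gets gradient at both tied 3s
dx_argmax = max_backward_argmax(x, g)  # row 0 gets gradient at one 3 only
```

Code that compares CPU and CUDA gradients should therefore avoid inputs with ties, or account for the difference.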

matmul

matmul(other: 'Tensor') -> 'Tensor'

Matrix multiplication (2D): out = self @ other.

Requirements

CPU: unchanged behavior (backward compatible)
CUDA: both operands must be CUDA, 2D, inner dims match
CUDA inputs MUST already have allocated device buffers (data != 0). (No implicit allocation for inputs.)

Backward

If out = A @ B, then:
- dL/dA = dL/dout @ B^T
- dL/dB = A^T @ dL/dout
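
The gradient formulas can be checked numerically with NumPy. This sketch uses plain arrays and a hypothetical helper `matmul_backward`; the loss L = sum(A @ B) is chosen only because its upstream gradient is a matrix of ones.

```python
import numpy as np


def matmul_backward(a, b, grad_out):
    """Gradients of out = a @ b with respect to a and b."""
    grad_a = grad_out @ b.T  # shape (M, K), same as a
    grad_b = a.T @ grad_out  # shape (K, N), same as b
    return grad_a, grad_b


rng = np.random.default_rng(0)
a = rng.standard_normal((2, 3))
b = rng.standard_normal((3, 4))

grad_out = np.ones((2, 4))  # dL/dout for L = sum(a @ b)
grad_a, grad_b = matmul_backward(a, b, grad_out)

# Finite-difference check on one entry of a.
eps = 1e-6
a_shift = a.copy()
a_shift[0, 1] += eps
numeric = ((a_shift @ b).sum() - (a @ b).sum()) / eps
```

`numeric` should agree with `grad_a[0, 1]` up to floating-point error.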

transpose

transpose() -> ITensor

Return the 2D transpose of this tensor.

For a 2D tensor A with shape (M, N), transpose returns Aᵀ with shape (N, M) such that:

out[i, j] = self[j, i]

Requirements

Input must be 2D.
CPU and CUDA are supported by the concrete implementation.

Backward

If out = Aᵀ, then dL/dA = (dL/dout)ᵀ.

backward

backward(
    grad_out: Optional["Tensor"] = None,
    *,
    profile: bool = False,
    profile_topk: int = 20,
) -> None

Backpropagate gradients from this tensor through the autograd graph.

Parameters:

Name	Type	Description	Default
`grad_out`	`Optional[Tensor]`	Gradient w.r.t. this tensor. If omitted, this tensor must be a scalar (shape == ()) and the gradient is assumed to be 1.0.	`None`

Notes

Gradients are accumulated into .grad of leaf tensors that have requires_grad=True.
This implementation performs a reverse topological traversal.
CPU behavior is unchanged (same logic; only extra timing when profile=True).
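
A reverse topological traversal can be illustrated with a tiny scalar autograd in plain Python. Everything here (`Node`, `mul`, `add`, `topo_order`) is hypothetical and much simpler than KeyDNN's Tensor; in particular, this sketch accumulates gradients on every node rather than only on leaves with requires_grad=True.

```python
class Node:
    """Hypothetical scalar autograd node (not KeyDNN's Tensor)."""

    def __init__(self, value, parents=(), backward_fn=None):
        self.value = value
        self.parents = parents
        self.backward_fn = backward_fn  # maps grad_out -> tuple of parent grads
        self.grad = 0.0


def mul(a, b):
    return Node(a.value * b.value, (a, b), lambda g: (g * b.value, g * a.value))


def add(a, b):
    return Node(a.value + b.value, (a, b), lambda g: (g, g))


def topo_order(root):
    """Depth-first post-order: parents always appear before their children."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(root)
    return order


def backward(root, grad_out=1.0):
    """Walk the graph from the output back to the leaves, accumulating grads."""
    root.grad = grad_out
    for node in reversed(topo_order(root)):
        if node.backward_fn is None:  # leaf: nothing to propagate
            continue
        for parent, g in zip(node.parents, node.backward_fn(node.grad)):
            parent.grad += g


x, y = Node(3.0), Node(4.0)
z = add(mul(x, y), x)  # z = x * y + x
backward(z)
print(x.grad, y.grad)  # 5.0 3.0
```

Processing nodes in reverse topological order guarantees that a node's gradient is complete before it is propagated to its parents, which matters for `x` here because it is used twice.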

item

item() -> float

Return the value of a scalar (or single-element) tensor as a Python float.

Notes

CPU: reads from _data.
CUDA: uses to_numpy() (D2H) to fetch the scalar.

sqrt

sqrt() -> ITensor

Compute the elementwise square root of the tensor.

Returns:

Type	Description
`ITensor`	A tensor with the same shape as `self`, containing `sqrt(self)` applied elementwise.

Notes

CPU behavior

Uses NumPy to compute the forward pass for CPU tensors.

CUDA behavior (workaround)

For CUDA tensors, this method currently performs a CPU round-trip: device → host (to_numpy) → NumPy sqrt → device (copy_from_numpy).
This preserves correctness and autograd semantics but is not performance-optimal.

Autograd

If self.requires_grad is True, the returned tensor participates in autograd with parent self.

TODO

Implement a native CUDA kernel for sqrt (and optionally a fused backward) to avoid device↔host transfers.

rand `staticmethod`

rand(
    shape, *, device, requires_grad: bool = False
) -> "Tensor"

Create a tensor filled with uniform random values in [0, 1) on the given device.

Notes

CPU: random values are generated using NumPy.
CUDA: random values are generated on CPU and transferred to device memory. No CUDA RNG kernel is used.
Host-side generation keeps behavior deterministic and easy to test, at the cost of a host-to-device copy for CUDA tensors.
The returned tensor has dtype float32 and ctx=None.

full `classmethod`

full(
    shape: tuple,
    fill_value: float,
    *,
    device: Device,
    requires_grad: bool = False,
) -> ITensor

Create a tensor filled with a constant value.

Backward compatibility

CPU path preserves the original logic exactly: uses NumPy to create a filled array and calls copy_from_numpy().
Default dtype remains float32, matching the original implementation.
The returned tensor has ctx=None.

CUDA

Allocates a CUDA tensor (no host staging) and fills via Tensor.fill(), which dispatches to the CUDA fill kernel after ensuring allocation.

Parameters:

Name	Type	Description	Default
`shape`	`tuple`	Desired tensor shape. May be any shape accepted by NumPy (including `()` for a scalar tensor).	required
`fill_value`	`float`	Constant value to write into every element.	required
`device`	`Device`	Target device placement (CPU or CUDA).	required
`requires_grad`	`bool`	Whether the returned tensor should participate in autograd. Defaults to False.	`False`

Returns:

Type	Description
`ITensor`	A newly allocated tensor with the given shape, filled with `fill_value`.

Raises:

Type	Description
`RuntimeError`	If `device` is not a supported device type.

tanh

tanh() -> ITensor

Compute the elementwise hyperbolic tangent of the tensor.

Returns:

Type	Description
`ITensor`	A tensor with the same shape as `self`, with `tanh` applied elementwise.

Notes

This method delegates to the autograd TanhFn Function.
NumPy is not used directly here; numerical kernels remain encapsulated inside Tensor operations or autograd Functions.

sigmoid

sigmoid() -> ITensor

Compute the elementwise logistic sigmoid of the tensor.

The sigmoid function is defined as:

``sigmoid(x) = 1 / (1 + exp(-x))``

Returns:

Type	Description
`ITensor`	A tensor with the same shape as `self`, with `sigmoid` applied elementwise.

Notes

This method is a thin convenience wrapper around SigmoidFn defined in ._function and integrates with the autograd system.

zeros `classmethod`

zeros(
    *,
    shape: tuple[int, ...],
    device: Device,
    requires_grad: bool = False,
    dtype: Any = np.float32,
) -> ITensor

Create a tensor filled with zeros on the specified device.

This factory method constructs a tensor with the given shape and device. For CPU tensors, a NumPy array is allocated and zero-initialized. For CUDA tensors, device memory is allocated and zeroed via a CUDA fill routine.

Parameters:

Name	Type	Description	Default
`shape`	`tuple[int, ...]`	Shape of the output tensor.	required
`device`	`Device`	Target device placement (CPU or CUDA).	required
`requires_grad`	`bool`	Whether the tensor should track gradients for autograd.	`False`
`dtype`	`Any`	The data type of the target tensor.	`float32`

Returns:

Type	Description
`ITensor`	Newly created tensor filled with zeros.

Notes

The dtype is currently fixed to float32.
Zero-sized tensors are valid and return immediately without invoking CUDA kernels.

ones `classmethod`

ones(
    *,
    shape: tuple[int, ...],
    device: Device,
    requires_grad: bool = False,
    dtype: Any = np.float32,
) -> ITensor

Create a tensor filled with ones on the specified device.

This factory method constructs a tensor with the given shape and device. For CPU tensors, a NumPy array is allocated and initialized with ones. For CUDA tensors, device memory is allocated and filled using a native CUDA fill routine.

Parameters:

Name	Type	Description	Default
`shape`	`tuple[int, ...]`	Shape of the output tensor.	required
`device`	`Device`	Target device placement (CPU or CUDA).	required
`requires_grad`	`bool`	Whether the tensor should track gradients for autograd.	`False`
`dtype`	`Any`	The data type of the target tensor.	`float32`

Returns:

Type	Description
`ITensor`	Newly created tensor filled with ones.

Notes

The dtype is currently fixed to float32.
Zero-sized tensors are valid and return immediately without invoking CUDA kernels.
The CUDA path prioritizes correctness and may fall back to a slower initialization strategy if the native fill kernel fails.

to

to(device, *, copy: bool = False) -> ITensor

Move or copy this tensor to another device.

This method implements explicit device placement transitions and returns a tensor on device. If the target device matches the current device, it returns self by default (or a cloned copy if copy=True).

Supported transfers

CPU -> CUDA: allocates device memory and performs host-to-device memcpy.
CUDA -> CPU: allocates host buffer and performs device-to-host memcpy.
CUDA -> CUDA (different device indices): currently implemented via an intermediate CPU round-trip for simplicity.

Parameters:

Name	Type	Description	Default
`device`	`Device \| str`	Target device. If a string is provided, it is parsed as a `Device` (e.g., "cpu", "cuda:0").	required
`copy`	`bool`	If True, forces a copy even when the device is unchanged. Defaults to False.	`False`

Returns:

Type	Description
`ITensor`	A tensor placed on the requested device.

Notes

For CPU -> CUDA copies, the host buffer is made C-contiguous before raw memcpy to ensure correct layout.
This method does not propagate autograd context; returned tensors are created with requires_grad=False and ctx=None in the transfer paths.

clone

clone() -> ITensor

Create a deep copy of the tensor's storage.

CPU: copies the underlying NumPy array into a new tensor.
CUDA: allocates a new device buffer and performs a device-to-device memcpy.

Returns:

Type	Description
`ITensor`	A new tensor with identical contents.

Notes

clone() is intended to copy raw storage and typically returns a tensor with requires_grad=False and no autograd context (ctx=None).

sum_to_shape

sum_to_shape(target_shape: tuple[int, ...]) -> ITensor

Sum-reduce this tensor to target_shape (inverse of broadcasting).

This primitive is commonly used in autograd to reduce broadcasted gradients back to the original source shape.

Dispatches via tensor_control_path_manager.
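
As a rough illustration of what "inverse of broadcasting" means, the reduction can be written in NumPy in two steps: sum away prepended axes, then sum over axes that were expanded from size 1. The function below is a sketch on plain arrays and is not the code that tensor_control_path_manager dispatches to.

```python
import numpy as np


def sum_to_shape(grad, target_shape):
    """Sum-reduce grad so that its shape becomes target_shape."""
    grad = np.asarray(grad)
    extra = grad.ndim - len(target_shape)
    if extra < 0:
        raise ValueError("target_shape has more dimensions than grad")
    # 1) Sum away leading axes that broadcasting prepended.
    if extra:
        grad = grad.sum(axis=tuple(range(extra)))
    # 2) Sum (keeping the axis) wherever a size-1 axis was expanded.
    for axis, size in enumerate(target_shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
        elif size != grad.shape[axis]:
            raise ValueError(f"cannot reduce to {target_shape}")
    return grad


upstream = np.ones((4, 2, 3), dtype=np.float32)
reduced = sum_to_shape(upstream, (2, 1))  # shape (2, 1), every entry 12
```

Each entry of `reduced` is 12 because it collects 4 x 3 broadcast copies of the original element.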

free_

free_() -> None

Explicitly release CUDA backing memory (if owned).

Semantics

If _storage is present: decrement refcount; free happens when it hits 0.
If _storage is absent:
- If this tensor is marked as borrowed, we do NOT free the devptr.
- If marked as "owning borrowed devptr" (transitional), we cuda_free it.
Idempotent: safe to call multiple times.
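
The storage-backed branch of these semantics (shared refcount, free at zero, idempotent release) can be sketched in plain Python. Both classes are hypothetical stand-ins; the real device free call is replaced by a callback so the example runs without CUDA.

```python
class RefCountedStorage:
    """Hypothetical stand-in for a refcounted device allocation."""

    def __init__(self, dev_ptr, free_fn):
        self.dev_ptr = dev_ptr
        self.refcount = 1
        self._free_fn = free_fn

    def retain(self):
        self.refcount += 1
        return self

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            self._free_fn(self.dev_ptr)  # the actual free happens only here
            self.dev_ptr = 0


class Handle:
    """Hypothetical tensor-like object holding one reference to a storage."""

    def __init__(self, storage):
        self._storage = storage

    def free_(self):
        if self._storage is None:  # already released: calling again is a no-op
            return
        self._storage.release()
        self._storage = None


freed = []
storage = RefCountedStorage(dev_ptr=0x1000, free_fn=freed.append)
a = Handle(storage)
b = Handle(storage.retain())  # alias sharing the same allocation

a.free_()
a.free_()  # second call does nothing
freed_after_a = list(freed)  # still empty: b keeps the allocation alive
b.free_()  # refcount reaches 0, memory is released
```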

clamp

clamp(
    *, min: float | None = None, max: float | None = None
) -> "Tensor"

Clamp tensor values elementwise between min and max.

Parameters:

Name	Type	Description	Default
`min`	`float`	Minimum value. If None, no lower bound is applied.	`None`
`max`	`float`	Maximum value. If None, no upper bound is applied.	`None`

Returns:

Type	Description
`Tensor`	A new Tensor with values clamped to the specified range.

Notes

Gradients pass through unchanged for values within the range.
Gradients are zeroed for values clipped by min or max.
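
The gradient rule is a pass-through mask. A NumPy sketch on plain arrays follows; the helper names are hypothetical, and treating values exactly equal to a bound as "within the range" is an assumption the text does not settle.

```python
import numpy as np


def clamp_forward(x, lo=None, hi=None):
    """Clamp elementwise; a bound of None leaves that side open."""
    out = np.array(x, copy=True)
    if lo is not None:
        out = np.maximum(out, lo)
    if hi is not None:
        out = np.minimum(out, hi)
    return out


def clamp_backward(x, grad_out, lo=None, hi=None):
    """Pass gradients where x was inside [lo, hi]; zero them where it was clipped."""
    mask = np.ones(x.shape, dtype=bool)
    if lo is not None:
        mask &= x >= lo  # boundary counted as inside (assumption)
    if hi is not None:
        mask &= x <= hi
    return grad_out * mask


x = np.array([-2.0, 0.5, 3.0], dtype=np.float32)
y = clamp_forward(x, lo=0.0, hi=1.0)
dx = clamp_backward(x, np.ones_like(x), lo=0.0, hi=1.0)
```

Here `y` is `[0.0, 0.5, 1.0]` and `dx` is `[0.0, 1.0, 0.0]`: only the unclipped middle element receives a gradient.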

to_

to_(device, *, copy: bool = False) -> ITensor

Move this tensor to another device in-place.

This method performs an in-place device placement transition. Unlike to(), which may return a newly allocated tensor on the target device, to_() preserves the identity of self (i.e., id(self) is unchanged) by migrating the underlying storage and updating device placement fields on the same object.

Semantics

If the target device matches the current device:
- returns self (no-op), unless copy=True, in which case the tensor's storage is replaced with a cloned copy on the same device.
If the target device differs:
- performs the transfer using to(device, copy=True) internally, then swaps this tensor's backing storage to the transferred result.

Supported transfers

CPU -> CUDA: allocates device memory and performs host-to-device memcpy.
CUDA -> CPU: allocates host buffer and performs device-to-host memcpy.
CUDA -> CUDA (different device indices): currently implemented via an intermediate CPU round-trip for simplicity.

Parameters:

Name	Type	Description	Default
`device`	`Device \| str`	Target device. If a string is provided, it is parsed as a `Device` (e.g., "cpu", "cuda:0").	required
`copy`	`bool`	If True, forces a copy even when the device is unchanged. This is the in-place analogue of `to(..., copy=True)`. Defaults to False.	`False`

Returns:

Type	Description
`ITensor`	This tensor (`self`) after in-place migration.

Notes

This method intentionally does not preserve autograd context across device transfers. If this tensor participates in autograd, callers should treat to_() as a graph break:
- ctx is cleared.
- requires_grad is left unchanged (you can still accumulate grads going forward), but any prior graph history is discarded.
If self.grad exists and is a Tensor-like object with .to(...), this method will attempt to move it to the same device as well.

keydnn.Device

Concrete computation device descriptor.

This class encapsulates a normalized representation of a computation device, including its type (CPU or CUDA) and, for CUDA devices, an optional device index (e.g., cuda:0).

The class performs strict validation of device strings to ensure a small, well-defined set of supported device identifiers.

Parameters:

Name	Type	Description	Default
`device`	`str`	Device identifier string. Must be either "cpu" or "cuda:<index>", where <index> is a non-negative integer.	required

Raises:

Type	Description
`ValueError`	If the provided device string does not match the supported formats.

Notes

__slots__ is used to prevent dynamic attribute creation and reduce per-instance memory overhead.
This class is intentionally lightweight and does not allocate or manage any backend resources.
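
Strict validation of this kind is commonly done with a full-match regular expression. The sketch below is not KeyDNN's parser: `parse_device` is a hypothetical function that returns a `(type, index)` tuple, and it assumes that only the two formats listed under Parameters are accepted.

```python
import re


def parse_device(spec: str):
    """Return (device_type, index) for a supported device string, else raise."""
    if spec == "cpu":
        return ("cpu", None)
    # fullmatch rejects trailing text, signs, and whitespace around the index.
    match = re.fullmatch(r"cuda:([0-9]+)", spec)
    if match:
        return ("cuda", int(match.group(1)))
    raise ValueError(f"unsupported device string: {spec!r}")


print(parse_device("cpu"))     # ('cpu', None)
print(parse_device("cuda:1"))  # ('cuda', 1)
```

Rejecting everything outside the two patterns keeps the set of valid identifiers small, at the cost of not accepting common variants such as different letter case.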

is_cpu

is_cpu() -> bool

Check whether this device represents a CPU.

Returns:

Type	Description
`bool`	True if the device type is CPU, False otherwise.

is_cuda

is_cuda() -> bool

Check whether this device represents a CUDA GPU.

Returns:

Type	Description
`bool`	True if the device type is CUDA, False otherwise.
