serverless_gpu package

Submodules

serverless_gpu.asynchronous module

Asynchronous execution and monitoring for distributed serverless GPU compute.

This module provides asynchronous execution capabilities for serverless GPU jobs, including:

  • Non-blocking job submission and execution

  • Real-time job monitoring and status updates

  • Environment synchronization across distributed nodes

  • Job lifecycle management and error handling

  • Integration with Databricks workspace for job tracking

serverless_gpu.asynchronous.get_calls(function_name=None, function_callable=None, start_time_from=None, end_time_to=None, status=None, active_only=True, limit=5)[source]

Returns a list of DistributedFunctionCall instances that (optionally) match the specified function, time range, and status. By default, only currently active runs are returned.

Parameters
  • function_name (Optional[str]) – Optional name of the function to filter by

  • function_callable (Optional[Callable]) – Optional callable whose name is used for filtering

  • start_time_from (Optional[float]) – Optional start time filter

  • end_time_to (Optional[float]) – Optional end time filter

  • status (Optional[RunStatus]) – Optional RunStatus to filter by

  • active_only (bool) – Whether to only return active runs

  • limit (int) – Maximum number of calls to return. Defaults to 5.
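
Example (illustrative sketch; the function name "train_model" is a hypothetical placeholder):

  from serverless_gpu.asynchronous import get_calls

  # Fetch up to 10 runs of a given function, including finished ones.
  calls = get_calls(function_name="train_model", active_only=False, limit=10)

  for call in calls:
      # Each item is a DistributedFunctionCall instance.
      print(call)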

serverless_gpu.asynchronous.print_runs(runs)[source]

Prints the details of all specified DistributedFunctionCall instances. If no runs are specified, prints all currently running DistributedFunctionCall instances.

Return type

None

serverless_gpu.asynchronous.stop(runs)[source]

Stops all specified DistributedFunctionCall instances. If no runs are specified, stops all currently running DistributedFunctionCall instances.

Return type

None

serverless_gpu.asynchronous.wait(runs)[source]

Waits for all specified DistributedFunctionCall instances to finish.

Return type

None
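
Example (illustrative sketch of the monitoring workflow; assumes runs were previously submitted with run_async=True):

  from serverless_gpu.asynchronous import get_calls, print_runs, stop, wait

  runs = get_calls(active_only=True)  # currently running DistributedFunctionCall instances
  print_runs(runs)                    # print details for each run
  wait(runs)                          # block until every run finishes
  # stop(runs)                        # or cancel the runs instead of waiting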

serverless_gpu.compute module

GPU compute type definitions and utilities.

This module defines the available GPU types for serverless compute and provides utilities for working with GPU configurations and resource allocation.

class serverless_gpu.compute.GPUType(value)[source]

Bases: Enum

Enumeration of available GPU types for serverless compute.

This enum defines the GPU types that can be used for distributed computing on the serverless GPU platform.

H100

NVIDIA H100 80GB GPU instances.

A10

NVIDIA A10 GPU instances.

A10 = 'a10'
H100 = 'h100_80gb'
classmethod values()[source]

Get all GPU type values.

Returns

List of all available GPU type values.

Return type

List[str]
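
Example (illustrative sketch; the ordering of the returned list is not guaranteed here):

  from serverless_gpu.compute import GPUType

  GPUType.values()    # e.g. ['a10', 'h100_80gb']
  gpu = GPUType.H100  # equivalent to the string 'h100_80gb'
  gpu.value           # 'h100_80gb'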

serverless_gpu.launcher module

Main launcher module for distributed serverless GPU compute.

This module provides the core functionality for launching and managing distributed functions on serverless GPU infrastructure. It includes:

  • The distributed decorator for executing functions on remote GPU resources

  • Job submission and monitoring capabilities

  • Integration with Databricks jobs API and MLflow for tracking

  • Environment synchronization and dependency management

  • Support for multi-GPU and multi-node distributed workloads

The main entry point is the distributed function which can be used as a decorator or called directly to execute functions on serverless GPU compute resources.

serverless_gpu.launcher.distributed(gpus, gpu_type=None, remote=False, run_async=False)[source]

Decorator to launch a function on remote GPUs or local GPUs.

Remote GPUs are GPUs that are not attached to your notebook but that you have access to; local GPUs are the GPUs attached to your notebook.

Parameters
  • gpus (int) – Number of GPUs to use. Must be 1, 2, 4, or 8, or a multiple of 8 for remote GPUs.

  • gpu_type (Optional[Union[GPUType, str]], optional) – The GPU type to use. Defaults to None. Required if remote is True.

  • remote (bool) – Whether to run the function on remote GPUs. Defaults to False.

  • run_async (bool) – Whether to run the function asynchronously. Defaults to False.
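
Example (illustrative sketch; per the module overview, the decorated function can be invoked to run on serverless GPU compute, but exact submission and return semantics may differ in your environment):

  from serverless_gpu.launcher import distributed
  from serverless_gpu.compute import GPUType

  @distributed(gpus=8, gpu_type=GPUType.H100, remote=True, run_async=False)
  def train_model():
      # Training logic; executed on the remote serverless GPU resources.
      ...

  train_model()  # with run_async=False, blocks until the job completes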

serverless_gpu.runtime module

Runtime utilities for distributed serverless GPU compute.

This module provides runtime utilities for managing distributed environments in serverless GPU compute. It includes functions for:

  • Getting distributed configuration parameters (rank, world size, etc.)

  • Environment variable management for distributed setup

  • Integration with PyTorch distributed backend

  • Node and process rank management

The functions in this module are adapted from MosaicML’s Composer library to work with the serverless GPU compute environment.

Note: This code is derived from https://github.com/mosaicml/composer.git@dc13fb0

serverless_gpu.runtime.get_global_rank(group=None)[source]

Returns the global rank of the current process in the given process group, which is in the range [0, group.WORLD_SIZE - 1].

Parameters

group (ProcessGroup, optional) – The process group. If None, the value of the RANK environment variable is returned.

Returns

The global rank in input process group.

Return type

int

serverless_gpu.runtime.get_local_rank()[source]

Returns the local rank for the current process, which is in the range [0, LOCAL_WORLD_SIZE - 1].

Returns

The local rank.

Return type

int

serverless_gpu.runtime.get_local_world_size()[source]

Returns the local world size, which is the number of processes for the current node.

Returns

The local world size.

Return type

int

serverless_gpu.runtime.get_node_rank()[source]

Returns the node rank.

For example, if there are 2 nodes, and 2 ranks per node, then global ranks 0-1 will have a node rank of 0, and global ranks 2-3 will have a node rank of 1.

Returns

The node rank, starting at 0.

Return type

int

serverless_gpu.runtime.get_replica_rank()[source]

Return type

int

serverless_gpu.runtime.get_world_size()[source]

Returns the world size, which is the number of processes participating in this training run.

Returns

The world size.

Return type

int
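
Example (illustrative sketch of rank handling inside a distributed workload; assumes the serverless GPU environment has already populated the usual distributed environment variables):

  from serverless_gpu import runtime

  def log_once(message):
      # Print from the global rank 0 process only, to avoid duplicated output.
      if runtime.get_global_rank() == 0:
          print(
              f"node={runtime.get_node_rank()} "
              f"world_size={runtime.get_world_size()} "
              f"local_world_size={runtime.get_local_world_size()}: {message}"
          )

  log_once("starting training")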