serverless_gpu package

Submodules

serverless_gpu.compute module

GPU compute type definitions and utilities.

This module defines the available GPU types for serverless compute and provides utilities for working with GPU configurations and resource allocation.

class serverless_gpu.compute.DisplayGpuType(value)[source]

Bases: Enum

Mapping of GPU type values to their display names.

a10 = 'A10'
h100_80gb = '8xH100'
class serverless_gpu.compute.GPUType(value)[source]

Bases: Enum

Enumeration of available GPU types for serverless compute.

This enum defines the GPU types that can be used for distributed computing on the serverless GPU platform.

H100

NVIDIA H100 80GB GPU instances.

A10

NVIDIA A10 GPU instances.

A10 = 'a10'
H100 = 'h100_80gb'
classmethod values()[source]

Get all GPU type values.

Returns:

List of all available GPU type values.

Return type:

List[str]
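
A minimal sketch of the enum in use, based only on the members and the values() method documented above (the exact ordering of the returned list is an assumption):

    from serverless_gpu.compute import GPUType

    # All available GPU type values, e.g. ['a10', 'h100_80gb'].
    print(GPUType.values())

    # Members can be referenced by name or constructed from their value.
    gpu = GPUType.H100
    assert gpu.value == 'h100_80gb'
    assert GPUType('a10') is GPUType.A10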

serverless_gpu.launcher module

Main launcher module for distributed serverless GPU compute.

This module provides the core functionality for launching and managing distributed functions on serverless GPU infrastructure. It includes:

  • The distributed decorator for executing functions on remote GPU resources

  • Workload submission and monitoring capabilities

  • Integration with Databricks jobs API and MLflow for tracking

  • Environment synchronization and dependency management

  • Support for multi-GPU and multi-node distributed workloads

The main entry point is the distributed function which can be used as a decorator or called directly to execute functions on serverless GPU compute resources.

serverless_gpu.launcher.distributed(gpus, gpu_type=None, remote=False, run_async=False)[source]

Decorator to launch a function on remote GPUs or local GPUs.

Remote GPUs are GPUs that are not attached to your notebook but that you have access to; local GPUs are the GPUs attached to your notebook.

Parameters:
  • gpus (int, optional) – Number of GPUs to use. Must be 1, 2, 4, 8 or a multiple of 8 for remote GPUs.

  • gpu_type (Optional[Union[GPUType, str]], optional) – The GPU type to use. Defaults to None. Required if remote is True.

  • remote (bool) – Whether to run the function on remote GPUs. Defaults to False.

  • run_async (bool) – Whether to run the function asynchronously. Defaults to False.
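
A hedged sketch of the decorator in use. The training function body is illustrative, and the assumption that calling the decorated function submits the workload follows from the module overview above rather than from a documented contract:

    from serverless_gpu.launcher import distributed
    from serverless_gpu.compute import GPUType

    # Run train() across 8 remote H100 GPUs; gpu_type is required
    # because remote=True.
    @distributed(gpus=8, gpu_type=GPUType.H100, remote=True)
    def train():
        # Runs once per participating GPU process.
        ...

    train()  # Assumption: invocation submits and monitors the workload.

    # For GPUs attached to the notebook, remote (and gpu_type) can be
    # left at their defaults, e.g. @distributed(gpus=1).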

serverless_gpu.runtime module

Runtime utilities for distributed serverless GPU compute.

This module provides runtime utilities for managing distributed environments in serverless GPU compute. It includes functions for:

  • Getting distributed configuration parameters (rank, world size, etc.)

  • Environment variable management for distributed setup

  • Integration with PyTorch distributed backend

  • Node and process rank management

The functions in this module are adapted from MosaicML’s Composer library to work with the serverless GPU compute environment.

Note: This code is derived from https://github.com/mosaicml/composer.git@dc13fb0

exception serverless_gpu.runtime.MissingEnvironmentError[source]

Bases: Exception

Raised when a required environment variable is missing.

serverless_gpu.runtime.get_global_rank(group=None)[source]

Returns the global rank of the current process in the given process group, which lies in the range [0, group.WORLD_SIZE - 1].

Parameters:

group (ProcessGroup, optional) – The process group to query. If None, the value of the RANK environment variable is returned.

Returns:

The global rank in the given process group.

Return type:

int

serverless_gpu.runtime.get_local_rank()[source]

Returns the local rank of the current process, which lies in the range [0, LOCAL_WORLD_SIZE - 1].

Returns:

The local rank.

Return type:

int

serverless_gpu.runtime.get_local_world_size()[source]

Returns the local world size, which is the number of processes on the current node.

Returns:

The local world size.

Return type:

int

serverless_gpu.runtime.get_node_ip()[source]
serverless_gpu.runtime.get_node_rank()[source]

Returns the node rank.

For example, if there are 2 nodes, and 2 ranks per node, then global ranks 0-1 will have a node rank of 0, and global ranks 2-3 will have a node rank of 1.

Returns:

The node rank, starting at 0.

Return type:

int

serverless_gpu.runtime.get_replica_rank()[source]
Return type:

int

serverless_gpu.runtime.get_world_size()[source]

Returns the world size, which is the number of processes participating in this training run.

Returns:

The world size.

Return type:

int
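
A sketch tying the getters above together inside a distributed workload. Pinning to a CUDA device and rank-0-only logging are illustrative patterns, not documented behavior:

    from serverless_gpu import runtime

    def report_topology():
        rank = runtime.get_global_rank()       # in [0, WORLD_SIZE - 1]
        local_rank = runtime.get_local_rank()  # in [0, LOCAL_WORLD_SIZE - 1]
        world_size = runtime.get_world_size()

        # Typical uses: select this process's GPU and log from one rank.
        # torch.cuda.set_device(local_rank)
        if rank == 0:
            nodes = world_size // runtime.get_local_world_size()
            print(f"{world_size} processes across {nodes} node(s); "
                  f"this process is on node {runtime.get_node_rank()}")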

serverless_gpu.ray module

Ray integration for distributed serverless GPU compute.

This module provides integration with Ray for distributed computing on serverless GPU infrastructure. It includes:

  • Ray cluster setup and management on distributed GPU nodes

  • Integration with serverless GPU launcher for Ray workloads

  • Utilities for Ray head node detection and connection management

  • Support for Ray distributed training and inference patterns

The module enables users to run Ray-based distributed workloads on serverless GPU compute resources seamlessly.

serverless_gpu.ray.ray_launch(gpus, gpu_type=None, remote=False, run_async=False)[source]

Experimental decorator to launch a function with a Ray cluster.

Parameters:
  • gpus (int) – Number of GPUs to launch Ray on.

  • gpu_type (Optional[Union[GPUType, str]]) – The GPU type to use. Defaults to None. Required if remote is True.

  • remote (bool) – Whether to run on remote GPUs. Defaults to False.

  • run_async (bool) – Whether to run the function asynchronously. Defaults to False.
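
A hedged sketch mirroring the distributed example above. The ray.init(address="auto") call and the task body are assumptions about how a workload would attach to the cluster that ray_launch manages:

    import ray
    from serverless_gpu.ray import ray_launch
    from serverless_gpu.compute import GPUType

    @ray_launch(gpus=8, gpu_type=GPUType.H100, remote=True)
    def run_ray_job():
        # Assumption: ray_launch has already started a Ray cluster on
        # the allocated nodes, so attach to the running instance.
        ray.init(address="auto")

        @ray.remote(num_gpus=1)
        def square(x):
            return x * x

        print(ray.get([square.remote(i) for i in range(8)]))

    run_ray_job()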