serverless_gpu package
Submodules
serverless_gpu.asynchronous module
Asynchronous execution and monitoring for distributed serverless GPU compute.
This module provides asynchronous execution capabilities for serverless GPU jobs, including:
- Non-blocking job submission and execution
- Real-time job monitoring and status updates
- Environment synchronization across distributed nodes
- Job lifecycle management and error handling
- Integration with Databricks workspace for job tracking
- serverless_gpu.asynchronous.get_calls(function_name=None, function_callable=None, start_time_from=None, end_time_to=None, status=None, active_only=True, limit=5)[source]
Returns a list of DistributedFunctionCall instances that match the specified filters; by default, only currently active runs are returned.
- Parameters
function_name (Optional[str]) – Optional name of the function to filter by
function_callable (Optional[Callable]) – Optional callable to get the function name from
start_time_from (Optional[float]) – Optional start time filter
end_time_to (Optional[float]) – Optional end time filter
status (Optional[RunStatus]) – Optional RunStatus to filter by
active_only (bool) – Whether to only return active runs
limit (int) – Optional maximum number of calls to return
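As a rough illustration of how these filters combine, the helper below applies them to plain in-memory records. It is a hypothetical sketch: `filter_calls` and the dict records are stand-ins, not the real `get_calls` implementation or the `DistributedFunctionCall` type.

```python
def filter_calls(calls, status=None, active_only=True,
                 start_time_from=None, end_time_to=None, limit=5):
    # Apply the documented filters in order: status match, active-only,
    # start/end time bounds, then cap the result at `limit` entries.
    out = []
    for call in calls:
        if status is not None and call["status"] != status:
            continue
        if active_only and not call["active"]:
            continue
        if start_time_from is not None and call["start_time"] < start_time_from:
            continue
        if end_time_to is not None and call["end_time"] > end_time_to:
            continue
        out.append(call)
    return out[:limit]
```

With `active_only=True` (the default), completed or failed runs drop out unless explicitly requested via `status` with `active_only=False`.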
- serverless_gpu.asynchronous.print_runs(runs)[source]
Prints the details of all specified DistributedFunctionCall instances. If no runs are specified, prints all currently running DistributedFunctionCall instances.
- Return type
None
serverless_gpu.compute module
GPU compute type definitions and utilities.
This module defines the available GPU types for serverless compute and provides utilities for working with GPU configurations and resource allocation.
- class serverless_gpu.compute.GPUType(value)[source]
Bases:
Enum
Enumeration of available GPU types for serverless compute.
This enum defines the GPU types that can be used for distributed computing on the serverless GPU platform.
- H100
NVIDIA H100 80GB GPU instances.
- A10
NVIDIA A10 GPU instances.
- A10 = 'a10'
- H100 = 'h100_80gb'
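Given the member values above, the enum can be mirrored in plain Python for illustration (the real class lives in `serverless_gpu.compute`; this copy only reproduces the documented values):

```python
from enum import Enum

class GPUType(Enum):
    # String values match the documented serverless_gpu.compute.GPUType members.
    A10 = 'a10'
    H100 = 'h100_80gb'
```

Because these are value-backed enum members, a GPU type arriving as a string can be resolved by value lookup, e.g. `GPUType('h100_80gb')` yields `GPUType.H100`.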
serverless_gpu.launcher module
Main launcher module for distributed serverless GPU compute.
This module provides the core functionality for launching and managing distributed functions on serverless GPU infrastructure. It includes:
- The distributed decorator for executing functions on remote GPU resources
- Job submission and monitoring capabilities
- Integration with Databricks jobs API and MLflow for tracking
- Environment synchronization and dependency management
- Support for multi-GPU and multi-node distributed workloads
The main entry point is the distributed function which can be used as a decorator or called directly to execute functions on serverless GPU compute resources.
- serverless_gpu.launcher.distributed(gpus, gpu_type=None, remote=False, run_async=False)[source]
Decorator to launch a function on remote GPUs or local GPUs.
Remote GPUs are GPUs that are not attached to your notebook but that you have access to; local GPUs are the GPUs attached to the notebook.
- Parameters
gpus (int, optional) – Number of GPUs to use. Must be 1, 2, 4, 8 or a multiple of 8 for remote GPUs.
gpu_type (Optional[Union[GPUType, str]], optional) – The GPU type to use. Defaults to None. Required if remote is True.
remote (bool) – Whether to run the function on remote GPUs. Defaults to False.
run_async (bool) – Whether to run the function asynchronously. Defaults to False.
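The GPU-count rule above can be sketched as a small validation helper, followed by a typical application of the decorator. `valid_remote_gpu_count` is an illustrative helper, not part of the package, and the commented usage assumes the signature documented above:

```python
def valid_remote_gpu_count(gpus: int) -> bool:
    # Remote runs accept 1, 2, 4, 8, or any positive multiple of 8 GPUs.
    return gpus in (1, 2, 4, 8) or (gpus > 0 and gpus % 8 == 0)

# Typical application of the decorator (requires serverless_gpu to be
# available in a Databricks environment):
#
#   from serverless_gpu.launcher import distributed
#   from serverless_gpu.compute import GPUType
#
#   @distributed(gpus=8, gpu_type=GPUType.H100, remote=True)
#   def train():
#       ...  # executed on the remote GPU pool
```

Note that `gpu_type` is only required when `remote=True`; a local run uses whatever GPUs are attached to the notebook.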
serverless_gpu.runtime module
Runtime utilities for distributed serverless GPU compute.
This module provides runtime utilities for managing distributed environments in serverless GPU compute. It includes functions for:
- Getting distributed configuration parameters (rank, world size, etc.)
- Environment variable management for distributed setup
- Integration with PyTorch distributed backend
- Node and process rank management
The functions in this module are adapted from MosaicML’s Composer library to work with the serverless GPU compute environment.
Note: This code is derived from https://github.com/mosaicml/composer.git@dc13fb0
- serverless_gpu.runtime.get_global_rank(group=None)[source]
Returns the global rank of the current process in the input process group, which is in [0; group.WORLD_SIZE - 1].
- Parameters
group (ProcessGroup, optional) – The process group. If None, the value of the RANK environment variable is returned.
- Returns
The global rank in input process group.
- Return type
int
- serverless_gpu.runtime.get_local_rank()[source]
Returns the local rank for the current process, which is in [0; LOCAL_WORLD_SIZE - 1].
- Returns
The local rank.
- Return type
int
- serverless_gpu.runtime.get_local_world_size()[source]
Returns the local world size, which is the number of processes for the current node.
- Returns
The local world size.
- Return type
int
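The three ranks are related by conventional distributed-launch arithmetic. The sketch below assumes the usual environment variables (RANK, LOCAL_RANK, LOCAL_WORLD_SIZE — only RANK and LOCAL_WORLD_SIZE are named in the docs above, the rest is the standard torchrun convention) and the identity global_rank = node_rank * local_world_size + local_rank; `describe_rank` is an illustrative helper, not part of the package:

```python
def describe_rank(env: dict) -> dict:
    # Read the standard distributed-launch variables, as the runtime
    # helpers do when no process group is supplied.
    global_rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    local_world_size = int(env.get("LOCAL_WORLD_SIZE", 1))
    # Node rank follows from:
    #   global_rank = node_rank * local_world_size + local_rank
    node_rank = (global_rank - local_rank) // local_world_size
    return {"global": global_rank, "local": local_rank, "node": node_rank}
```

For example, on a 2-node job with 8 processes per node, the process with RANK=11 has local rank 3 and sits on node 1.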