serverless_gpu.runtime
Runtime utilities for distributed serverless GPU compute.
This module provides runtime utilities for managing distributed environments in serverless GPU compute. It includes functions for:
- Getting distributed configuration parameters (rank, world size, etc.)
- Environment variable management for distributed setup
- Integration with PyTorch distributed backend
- Node and process rank management
The functions in this module are adapted from MosaicML’s Composer library to work with the serverless GPU compute environment.
Note: This code is derived from https://github.com/mosaicml/composer.git@dc13fb0
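These helpers are typically used to bootstrap the PyTorch distributed backend before training begins. A minimal sketch, assuming the module exposes get_world_size as listed in the summary below and that the launcher sets the usual rendezvous variables (MASTER_ADDR, MASTER_PORT); the initialization arguments are illustrative, not a prescribed recipe:

```python
import torch
import torch.distributed as dist

from serverless_gpu.runtime import get_global_rank, get_local_rank, get_world_size

# The default env:// rendezvous used by init_process_group also expects
# MASTER_ADDR and MASTER_PORT to be exported by the launcher.
if not dist.is_initialized():
    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo",
        rank=get_global_rank(),
        world_size=get_world_size(),
    )

# Bind this process to the GPU that matches its local rank on the node.
if torch.cuda.is_available():
    torch.cuda.set_device(get_local_rank())
```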
Functions
- get_global_rank(group=None) – Returns the global rank of the current process in the input PG, which is in [0; group.WORLD_SIZE - 1].
- get_local_rank() – Returns the local rank for the current process, which is in [0; LOCAL_WORLD_SIZE - 1].
- get_local_world_size() – Returns the local world size, which is the number of processes on the current node.
- get_node_rank() – Returns the node rank.
- get_world_size() – Returns the world size, which is the number of processes participating in this training run.
- exception serverless_gpu.runtime.MissingEnvironmentError[source]
Bases: Exception
Raised when a required environment variable is missing.
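A sketch of defensive handling, assuming the rank helpers raise this error when a variable such as RANK is unset (the fallback to single-process defaults is illustrative):

```python
import os

from serverless_gpu.runtime import MissingEnvironmentError, get_global_rank

try:
    rank = get_global_rank()
except MissingEnvironmentError:
    # Assumed failure mode: RANK was never exported by the launcher.
    # Fall back to a single-process default for local debugging.
    os.environ["RANK"] = "0"
    rank = get_global_rank()
```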
- serverless_gpu.runtime.get_global_rank(group=None)[source]
Returns the global rank of the current process in the input PG, which is in [0; group.WORLD_SIZE - 1].
- Parameters:
group (ProcessGroup, optional) – The process group. If None, the value of the RANK environment variable is returned.
- Returns:
The global rank in the input process group.
- Return type:
int
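A brief usage sketch; the two-rank subgroup is purely illustrative:

```python
import torch.distributed as dist

from serverless_gpu.runtime import get_global_rank

# Without a group, the rank comes from the RANK environment variable.
rank = get_global_rank()

# With an explicit process group. Note that dist.new_group must be
# called collectively by every process, even those not in the subgroup.
subgroup = dist.new_group(ranks=[0, 1])
rank_in_subgroup = get_global_rank(group=subgroup)
```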
- serverless_gpu.runtime.get_local_rank()[source]
Returns the local rank for the current process, which is in [0; LOCAL_WORLD_SIZE - 1].
- Returns:
The local rank.
- Return type:
int
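A common use is pinning each process to the GPU that corresponds to its position on the node, a standard pattern in multi-GPU training (a sketch):

```python
import torch

from serverless_gpu.runtime import get_local_rank

# Map each process on a node to its own CUDA device: local rank 0
# uses cuda:0, local rank 1 uses cuda:1, and so on.
local_rank = get_local_rank()
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
```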
- serverless_gpu.runtime.get_local_world_size()[source]
Returns the local world size, which is the number of processes on the current node.
- Returns:
The local world size.
- Return type:
int
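For instance, the node count can be derived from the global and local world sizes, assuming every node runs the same number of processes (a sketch; get_world_size is the function listed in the summary above):

```python
from serverless_gpu.runtime import get_local_world_size, get_world_size

# With homogeneous nodes, the number of nodes is the global process
# count divided by the per-node process count.
num_nodes = get_world_size() // get_local_world_size()
```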