serverless_gpu.runtime

Runtime utilities for distributed serverless GPU compute.

This module provides runtime utilities for managing distributed environments in serverless GPU compute. It includes functions for:

- Getting distributed configuration parameters (rank, world size, etc.)
- Managing environment variables for distributed setup
- Integrating with the PyTorch distributed backend
- Managing node and process ranks

The functions in this module are adapted from MosaicML’s Composer library to work with the serverless GPU compute environment.

Note: This code is derived from https://github.com/mosaicml/composer.git@dc13fb0

Functions

`get_global_rank`([group])	Returns the global rank of the current process in the given process group, which is in the range `[0; group.WORLD_SIZE - 1]`.
`get_local_rank`()	Returns the local rank of the current process, which is in the range `[0; LOCAL_WORLD_SIZE - 1]`.
`get_local_world_size`()	Returns the local world size, which is the number of processes on the current node.
`get_node_rank`()	Returns the node rank.
`get_replica_rank`()
`get_world_size`()	Returns the world size, which is the number of processes participating in this training run.

exception serverless_gpu.runtime.MissingEnvironmentError

Bases: Exception

Raised when a required environment variable is missing.
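
The pattern this exception supports can be sketched as follows. This is a minimal illustration, not the module's implementation: the `MissingEnvironmentError` class here is a local stand-in so the snippet runs without the package, and `read_int_env` and the `EXAMPLE_RANK` variable are hypothetical names.

```python
import os


class MissingEnvironmentError(Exception):
    """Local stand-in for serverless_gpu.runtime.MissingEnvironmentError."""


def read_int_env(name: str) -> int:
    """Hypothetical helper: read a required integer environment variable."""
    value = os.environ.get(name)
    if value is None:
        raise MissingEnvironmentError(
            f"Required environment variable {name} is not set"
        )
    return int(value)


# Outside a distributed launch the variable is typically absent.
os.environ.pop("EXAMPLE_RANK", None)
try:
    read_int_env("EXAMPLE_RANK")
    missing = False
except MissingEnvironmentError:
    missing = True

# Once the launcher (simulated here) sets the variable, the read succeeds.
os.environ["EXAMPLE_RANK"] = "3"
rank = read_int_env("EXAMPLE_RANK")
```

Catching the exception is one way to let the same script run both inside and outside a distributed launch.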

serverless_gpu.runtime.get_global_rank(group=None)

Returns the global rank of the current process in the given process group, which is in the range [0; group.WORLD_SIZE - 1].

Parameters: group (ProcessGroup, optional) – The process group. If None, the value of the RANK environment variable is returned.
Returns: The global rank in the given process group.
Return type: int
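
A common use of the global rank is to restrict logging or checkpoint writes to a single process. The sketch below assumes the `group=None` behaviour described above (reading `RANK`); `global_rank_from_env` and `is_main_process` are hypothetical stand-ins so that the snippet runs without the package, and `RANK` is set by hand only for illustration.

```python
import os


def global_rank_from_env() -> int:
    """Hypothetical stand-in for get_global_rank(group=None): reads RANK."""
    return int(os.environ["RANK"])


def is_main_process() -> bool:
    # Rank 0 is conventionally the process that logs and writes checkpoints.
    return global_rank_from_env() == 0


# In a real run the launcher sets RANK; it is set here only for illustration.
os.environ["RANK"] = "0"
main_on_rank0 = is_main_process()

os.environ["RANK"] = "2"
main_on_rank2 = is_main_process()
```

In a real job the stand-in would presumably be replaced by a call to `get_global_rank()` itself.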

serverless_gpu.runtime.get_local_rank()

Returns the local rank of the current process, which is in the range [0; LOCAL_WORLD_SIZE - 1].

Returns: The local rank.
Return type: int
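
The local rank is typically used to bind each process to one GPU on its node. The sketch below assumes one GPU per process and that the launcher sets `LOCAL_RANK` and `LOCAL_WORLD_SIZE`; the source does not confirm which variables the module reads, and `device_for_process` is a hypothetical helper.

```python
import os


def device_for_process() -> str:
    """Hypothetical helper: map this process to one GPU by local rank."""
    local_rank = int(os.environ["LOCAL_RANK"])
    local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
    # The local rank must fall in [0, LOCAL_WORLD_SIZE - 1].
    if not 0 <= local_rank < local_world_size:
        raise ValueError(
            f"local rank {local_rank} outside [0, {local_world_size - 1}]"
        )
    return f"cuda:{local_rank}"


# Set by the launcher in a real run; set here only for illustration.
os.environ["LOCAL_RANK"] = "1"
os.environ["LOCAL_WORLD_SIZE"] = "4"
device = device_for_process()
```

With PyTorch installed, the resulting string could be passed to `torch.device(...)`.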

serverless_gpu.runtime.get_local_world_size()

Returns the local world size, which is the number of processes on the current node.

Returns: The local world size.
Return type: int

serverless_gpu.runtime.get_node_ip()

serverless_gpu.runtime.get_node_rank()

Returns the node rank.

For example, with 2 nodes and 2 ranks per node, global ranks 0-1 have a node rank of 0, and global ranks 2-3 have a node rank of 1.

Returns: The node rank, starting at 0.
Return type: int
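
The 2-node, 2-ranks-per-node example can be reproduced with integer arithmetic. This sketch assumes ranks are assigned contiguously and every node runs the same number of processes; `node_rank_of` and `local_rank_of` are hypothetical helpers, and the actual function may obtain the value differently (for instance from an environment variable).

```python
def node_rank_of(global_rank: int, local_world_size: int) -> int:
    """Node index under contiguous rank assignment (an assumption)."""
    return global_rank // local_world_size


def local_rank_of(global_rank: int, local_world_size: int) -> int:
    """Position of the process within its node under the same assumption."""
    return global_rank % local_world_size


# 2 nodes x 2 ranks per node gives 4 global ranks.
# Each tuple is (global rank, node rank, local rank).
layout = [
    (g, node_rank_of(g, 2), local_rank_of(g, 2))
    for g in range(4)
]
```

Here `layout` pairs global ranks 0 and 1 with node 0, and global ranks 2 and 3 with node 1.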

serverless_gpu.runtime.get_replica_rank()

Return type: int

serverless_gpu.runtime.get_world_size()

Returns the world size, which is the number of processes participating in this training run.

Returns: The world size.
Return type: int
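
World size and global rank together are enough to split work evenly across processes. The sketch below shows one simple strided scheme; `shard_indices` is a hypothetical helper, not part of this module, and the rank and world size are passed in explicitly so the snippet runs without a distributed launch.

```python
def shard_indices(num_items: int, rank: int, world_size: int) -> list[int]:
    """Give each rank every world_size-th item, starting at its own rank."""
    return list(range(rank, num_items, world_size))


# Simulate 4 processes dividing 10 items among themselves.
shards = [shard_indices(10, rank, 4) for rank in range(4)]
```

In a real job each process would compute only its own shard, presumably passing `get_global_rank()` and `get_world_size()` as the last two arguments.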