serverless_gpu.runtime

Runtime utilities for distributed serverless GPU compute.

This module provides runtime utilities for managing distributed environments in serverless GPU compute. It includes functions for:

  • Getting distributed configuration parameters (rank, world size, etc.)

  • Environment variable management for distributed setup

  • Integration with the PyTorch distributed backend

  • Node and process rank management

The functions in this module are adapted from MosaicML’s Composer library to work with the serverless GPU compute environment.

Note: This code is derived from https://github.com/mosaicml/composer.git@dc13fb0

Functions

get_global_rank([group])

Returns the global rank of the current process in the given process group, which is in the range [0, group.WORLD_SIZE - 1].

get_local_rank()

Returns the local rank of the current process, which is in the range [0, LOCAL_WORLD_SIZE - 1].

get_local_world_size()

Returns the local world size, which is the number of processes on the current node.

get_node_rank()

Returns the node rank.

get_replica_rank()

Return type: int

get_world_size()

Returns the world size, which is the number of processes participating in this training run.

serverless_gpu.runtime.get_global_rank(group=None)

Returns the global rank of the current process in the given process group, which is in the range [0, group.WORLD_SIZE - 1].

Parameters

group (ProcessGroup, optional) – The process group. If None, the value of the RANK environment variable is returned.

Returns

The global rank in the input process group.

Return type

int
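
The group=None behavior described above can be illustrated without a running job. The sketch below is not the library's implementation: read_rank_from_env is a hypothetical stand-in, and the default of 0 when RANK is unset is an assumption made for illustration only.

```python
import os


def read_rank_from_env(env=None, default=0):
    """Hypothetical stand-in for the group=None path: parse RANK from the environment.

    `env` is any mapping of variable names to strings; it defaults to os.environ.
    Falling back to `default` when RANK is unset is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    return int(env.get("RANK", default))


# Simulated launcher environment for one process of a job.
print(read_rank_from_env({"RANK": "3"}))  # 3
# No RANK set, so the assumed default is used.
print(read_rank_from_env({}))  # 0
```

Passing the environment as a mapping keeps the sketch easy to test; inside a real job one would call get_global_rank() directly rather than parsing RANK by hand.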

serverless_gpu.runtime.get_local_rank()

Returns the local rank of the current process, which is in the range [0, LOCAL_WORLD_SIZE - 1].

Returns

The local rank.

Return type

int

serverless_gpu.runtime.get_local_world_size()

Returns the local world size, which is the number of processes on the current node.

Returns

The local world size.

Return type

int
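
A common use of the local rank and local world size is to pick one GPU per process on a node. The following is a minimal sketch under that assumption (one process per GPU, devices indexed by local rank); device_for_local_rank is a hypothetical helper, not part of this module, and its integer arguments stand in for the values that get_local_rank() and get_local_world_size() would return.

```python
def device_for_local_rank(local_rank, local_world_size):
    """Hypothetical helper: map a local rank to a CUDA device string.

    Assumes one process per GPU, so the local rank doubles as the device index.
    """
    # The documented range for the local rank is [0, LOCAL_WORLD_SIZE - 1].
    if not 0 <= local_rank < local_world_size:
        raise ValueError(
            f"local_rank {local_rank} is outside [0, {local_world_size - 1}]"
        )
    return f"cuda:{local_rank}"


# Second process on a node that runs four processes.
print(device_for_local_rank(1, 4))  # cuda:1
```

The range check mirrors the documented interval for the local rank; whether a given environment actually exposes one GPU per process is something to confirm for the job at hand.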

serverless_gpu.runtime.get_node_rank()

Returns the node rank.

For example, if there are 2 nodes, and 2 ranks per node, then global ranks 0-1 will have a node rank of 0, and global ranks 2-3 will have a node rank of 1.

Returns

The node rank, starting at 0.

Return type

int
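
The two-node example above follows from simple integer arithmetic when every node runs the same number of processes and global ranks are assigned contiguously per node. Both conditions are assumptions of this sketch, and node_rank_of is a hypothetical helper rather than the library's implementation.

```python
def node_rank_of(global_rank, local_world_size):
    """Hypothetical helper: derive the node rank from a global rank.

    Assumes the same number of processes on every node and contiguous
    assignment of global ranks to nodes.
    """
    return global_rank // local_world_size


# 2 nodes with 2 ranks per node, as in the example above.
for global_rank in range(4):
    print(global_rank, node_rank_of(global_rank, local_world_size=2))
# 0 0
# 1 0
# 2 1
# 3 1
```

Under the same assumptions, the local rank would be the remainder, global_rank % local_world_size.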

serverless_gpu.runtime.get_replica_rank()

Return type

int

serverless_gpu.runtime.get_world_size()

Returns the world size, which is the number of processes participating in this training run.

Returns

The world size.

Return type

int
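
The world size is typically combined with the global rank to split work evenly across processes. The sketch below shows one simple strided scheme; shard_indices is a hypothetical helper, and its rank and world_size arguments stand in for the values that get_global_rank() and get_world_size() would return inside a job.

```python
def shard_indices(num_items, rank, world_size):
    """Hypothetical helper: item indices handled by `rank` under strided sharding.

    Process `rank` takes every `world_size`-th item, starting at index `rank`.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size - 1}]")
    return list(range(rank, num_items, world_size))


# 10 items split across 4 processes; this is the view from rank 1.
print(shard_indices(10, rank=1, world_size=4))  # [1, 5, 9]
```

Strided sharding keeps shard sizes within one item of each other; a training framework's own distributed sampler is usually the better choice when one is available.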