Overview

What is Serverless GPU API?

Serverless GPU API is a lightweight, intuitive library for launching multi-GPU workloads from Databricks notebooks onto managed Serverless GPU compute. It’s designed to make distributed computing on Databricks simple and accessible.

Key Features

  • Easy Integration: Works seamlessly with Databricks notebooks

  • Multi-GPU Support: Efficiently utilize multiple GPUs for your workloads

  • Flexible Configuration: Customizable compute resources and runtime settings

  • Comprehensive Logging: Built-in logging and monitoring capabilities

Architecture

Distributed execution (@distributed)

  • The @distributed decorator captures GPU count, type, remote settings, and a reference to your callable. Those values are baked into a DistributedFunction wrapper.

  • When you call my_fn.distributed(...), the Serverless GPU API serializes the wrapped function and its validated arguments with cloudpickle into a per-run directory that also holds an auto-generated _air.py entrypoint.

  • The local notebook environment (site-packages and user site-packages) is snapshotted and staged to DBFS via the env manager API. This snapshot is re-hydrated on the workers at launch time so your pip environment is available remotely.

  • For remote=True, the API creates a DistributedFunctionCall object, streams run output and MLflow system metrics, and then returns the result or raises the remote exception. For local runs, _execute_local spawns multiple processes with torchrun-style arguments and aggregates outputs with collect_outputs_or_raise.

  • Set the following inside the decorator: the GPU count, the GPU type, and whether execution is remote or local.

  • Set the following outside the decorator, before calling .distributed: environment variables such as DATABRICKS_USE_RESERVED_GPU_POOL/DATABRICKS_REMOTE_GPU_POOL_ID, any %pip installs that affect the captured environment, and the input arguments passed when invoking .distributed (see the sketch after this list).
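
As a minimal sketch of this flow (the serverless_gpu import path, the gpus parameter name, and the "A10" type identifier are assumptions; gpu_type, remote, and the .distributed call come from this page):

    import os

    from serverless_gpu import distributed  # assumed import path

    # Optional: route the run to a reserved GPU pool. These are read from
    # the notebook environment, so set them before calling .distributed.
    os.environ["DATABRICKS_USE_RESERVED_GPU_POOL"] = "true"
    os.environ["DATABRICKS_REMOTE_GPU_POOL_ID"] = "<your-pool-id>"

    @distributed(gpus=4, gpu_type="A10", remote=True)  # parameter names are assumptions
    def train(lr: float):
        # Runs once per GPU; the snapshotted notebook environment is
        # restored on the workers, so %pip installs are available here.
        ...

    # Serializes train and its validated arguments with cloudpickle,
    # stages the env snapshot, launches the remote run, and returns the
    # result (or raises the remote exception).
    result = train.distributed(lr=1e-3)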

Ray execution (@ray_launch)

  • @ray_launch is layered on top of @distributed. Each task first bootstraps a torch-distributed rendezvous to elect the Ray head worker and gather node IPs.

  • The rank-zero worker runs ray start --head (with metrics export if enabled), sets RAY_ADDRESS, and executes your decorated function as the Ray driver. The other nodes join via ray start --address and wait until the driver writes a completion marker.

  • Ray system metrics are collected per node via RayMetricsMonitor when remote=True.

  • Configure Ray runtime details (actors, datasets, placement groups, scheduling options) inside your decorated function using normal Ray APIs, as sketched below. Cluster-wide controls (number/type of GPUs, remote vs. local, async behavior, and Databricks pool env vars) stay outside, in the decorator arguments or notebook env.
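
For illustration, a hedged sketch of a Ray run (the ray_launch import path, the decorator parameter names, and the .distributed invocation are assumptions based on the layering over @distributed described above; RAY_ADDRESS is set by the launcher, so ray.init() attaches to the running cluster):

    import ray

    from serverless_gpu import ray_launch  # assumed import path

    @ray_launch(gpus=8, gpu_type="A10", remote=True)  # parameter names are assumptions
    def ray_driver():
        # Executes on the rank-zero node after `ray start --head`;
        # RAY_ADDRESS is already set, so this joins the existing cluster.
        ray.init()

        @ray.remote(num_gpus=1)
        def square(x):
            return x * x

        # Normal Ray APIs (tasks, actors, datasets, placement groups)
        # work here as usual.
        return ray.get([square.remote(i) for i in range(8)])

    results = ray_driver.distributed()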

Use Cases

Serverless GPU API is ideal for:

  • Machine learning model training at scale

  • Distributed data processing

  • GPU-accelerated computations

  • Research and experimentation workflows

Distributed Execution Details

When running in distributed mode:

  • The function is serialized and distributed across the specified number of GPUs

  • Each GPU runs a copy of the function with the same parameters

  • The environment is synchronized across all nodes

  • Results are collected from all GPUs and returned together (see the sketch below)
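
As a sketch of what a run sees (RANK is the standard rank variable for a torchrun-style launch; the parameter names and the shape of the aggregated return value are assumptions):

    import os

    from serverless_gpu import distributed  # assumed import path

    @distributed(gpus=2, remote=False)  # local run; parameter names are assumptions
    def report_rank():
        # torchrun-style launches expose each process's rank via env vars.
        rank = int(os.environ["RANK"])
        return f"hello from rank {rank}"

    # Every GPU process runs report_rank with the same arguments; outputs
    # are gathered (via collect_outputs_or_raise) and returned together.
    outputs = report_rank.distributed()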

Best Practices

  • Always specify gpu_type when using remote=True
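
For example, reusing the assumed parameter names from the sketches above ("H100" is an assumed type identifier):

    from serverless_gpu import distributed  # assumed import path

    @distributed(gpus=4, gpu_type="H100", remote=True)  # pass gpu_type whenever remote=True
    def fine_tune():
        ...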

Limitations

  • Pip environment size is limited to 15 GB.

  • Ray Serve APIs are not supported.

Troubleshooting

See the Databricks Serverless GPU troubleshooting guide for fixes to the most common errors.