Overview
What is Serverless GPU API?
Serverless GPU API is a lightweight, intuitive library for launching multi-GPU workloads from Databricks notebooks onto managed Serverless GPU compute. It’s designed to make distributed computing on Databricks simple and accessible.
Key Features
Easy Integration: Works seamlessly with Databricks notebooks
Multi-GPU Support: Efficiently utilize multiple GPUs for your workloads
Flexible Configuration: Customizable compute resources and runtime settings
Comprehensive Logging: Built-in logging and monitoring capabilities
Architecture
Distributed execution (@distributed)
The @distributed decorator captures the GPU count, GPU type, remote setting, and a reference to your callable. Those values are baked into a DistributedFunction wrapper.
When you call my_fn.distributed(...), the Serverless GPU API serializes the wrapped function and its validated arguments with cloudpickle into a per-run directory that also holds an auto-generated _air.py entrypoint.
The local notebook environment (site-packages and user site-packages) is snapshotted and staged to DBFS via the env manager API. This snapshot is rehydrated on the workers at launch time so your pip environment is available remotely.
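For illustration, here is a minimal sketch of that flow. Only @distributed and .distributed are named above; the import path and the gpus/gpu_type parameter names are assumptions.

```python
# Hedged sketch of the @distributed flow described above.
from serverless_gpu import distributed  # assumed import path

# Decorator arguments: GPU count, GPU type, remote vs. local.
# The parameter names shown here are assumptions.
@distributed(gpus=8, gpu_type='H100', remote=True)
def train(num_epochs: int) -> float:
    # Ordinary training code; this callable is captured in a
    # DistributedFunction wrapper and cloudpickled at launch time.
    loss = 1.0
    for _ in range(num_epochs):
        loss *= 0.9  # stand-in for a real training step
    return loss

# Serializes the function and arguments, stages the environment
# snapshot, and launches on serverless GPU compute.
result = train.distributed(num_epochs=10)
```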
For remote=True, the API creates a DistributedFunctionCall object and streams results and MLflow system metrics before returning or raising the remote exception. For local runs, _execute_local spawns multiple processes with torchrun-style arguments and then aggregates their outputs with collect_outputs_or_raise.
Set the following inside the decorator: GPU count, GPU type, and whether execution is remote or local.
Set the following outside the decorator (before calling .distributed): environment variables such as DATABRICKS_USE_RESERVED_GPU_POOL / DATABRICKS_REMOTE_GPU_POOL_ID, any %pip installs that affect the captured environment, and the input arguments passed when invoking .distributed.
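A hedged sketch of that ordering, continuing the train example above; the pool variable values below are placeholders, since the accepted formats are not specified here.

```python
import os

# Pool selection must be configured in the notebook before launch;
# both values below are placeholders, not documented formats.
os.environ['DATABRICKS_USE_RESERVED_GPU_POOL'] = 'true'
os.environ['DATABRICKS_REMOTE_GPU_POOL_ID'] = '<your-pool-id>'

# %pip install <package>  # run before launch so it lands in the captured env snapshot

# Input arguments are passed at invocation time, not in the decorator.
result = train.distributed(num_epochs=10)
```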
Ray execution (@ray_launch)
@ray_launch is layered on top of @distributed. Each task first bootstraps a torch-distributed rendezvous to decide the Ray head worker and gather worker IPs.
Rank zero starts ray start --head (with metrics export if enabled), sets RAY_ADDRESS, and runs your decorated function as the Ray driver. Other nodes join via ray start --address and wait until the driver writes a completion marker. Ray system metrics are collected per node via RayMetricsMonitor when remote=True.
Configure Ray runtime details (actors, datasets, placement groups, scheduling options) inside your decorated function using normal Ray APIs. Cluster-wide controls (number/type of GPUs, remote vs. local, async behavior, and Databricks pool env vars) stay outside, in the decorator arguments or notebook environment.
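As a hedged sketch of that split: the import path and decorator parameters below are assumptions mirroring @distributed, and the .distributed launch entry point is assumed to carry over from the underlying decorator.

```python
import ray
from serverless_gpu import ray_launch  # assumed import path

@ray_launch(gpus=4, gpu_type='A10', remote=True)  # parameter names assumed
def driver() -> list:
    # By the time this runs on the head node, the launcher has started
    # the Ray cluster and set RAY_ADDRESS; ignore_reinit_error covers
    # the case where Ray is already initialized in this process.
    ray.init(ignore_reinit_error=True)

    @ray.remote(num_gpus=1)
    def square(x: int) -> int:
        return x * x

    # Runtime details (tasks, actors, placement groups) live inside
    # the decorated function, using normal Ray APIs.
    return ray.get([square.remote(i) for i in range(4)])

results = driver.distributed()  # launch entry point assumed from @distributed
```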
Use Cases
Serverless GPU API is ideal for:
Machine learning model training at scale
Distributed data processing
GPU-accelerated computations
Research and experimentation workflows
Distributed Execution Details
When running in distributed mode (see the sketch after this list):
The function is serialized and distributed across the specified number of GPUs
Each GPU runs a copy of the function with the same parameters
The environment is synchronized across all nodes
Results are collected and returned from all GPUs
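A hedged illustration of those four points, reusing the assumed decorator sketch from above; RANK and WORLD_SIZE follow the torchrun convention mentioned earlier, and the aggregated return shape is an assumption.

```python
import os

@distributed(gpus=2, gpu_type='A10', remote=False)  # parameter names assumed
def report(tag: str) -> str:
    # Every GPU process receives the same `tag`; torchrun-style launches
    # conventionally expose per-process RANK and WORLD_SIZE variables.
    return f"{tag}: rank {os.environ.get('RANK')} of {os.environ.get('WORLD_SIZE')}"

# Outputs from all processes are aggregated (collect_outputs_or_raise);
# the exact return shape is an assumption.
print(report.distributed(tag='smoke-test'))
```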
Best Practices
Always specify gpu_type when using remote=True
Limitations
Pip environment size is limited to 15 GB.
We do not support Ray Serve APIs.
Troubleshooting
See the Databricks Serverless GPU troubleshooting guide for fixes to the most common errors.