Overview
========
What is Serverless GPU API?
---------------------------
Serverless GPU API is a lightweight, intuitive library for launching multi-GPU workloads from `Databricks notebooks`_ onto managed `Serverless GPU compute`_.
It's designed to make distributed computing on Databricks simple and accessible.
Key Features
------------
* **Easy Integration**: Works seamlessly with Databricks notebooks
* **Multi-GPU Support**: Efficiently utilize multiple GPUs for your workloads
* **Flexible Configuration**: Customizable compute resources and runtime settings
* **Comprehensive Logging**: Built-in logging and monitoring capabilities
Architecture
------------
Distributed execution (``@distributed``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* The ``@distributed`` decorator captures GPU count, type, ``remote`` settings, and a reference to your callable. Those values are baked into a ``DistributedFunction`` wrapper.
* When you call ``my_fn.distributed(...)``, the serverless GPU API serializes the wrapped function and its validated arguments with ``cloudpickle`` into a per-run directory that also holds an auto-generated ``_air.py`` entrypoint (a minimal usage sketch follows this list).
* The local notebook environment (site-packages and user site-packages) is snapshotted and staged to DBFS via the env manager API. This snapshot is re-hydrated on the workers at launch time so your pip environment is available remotely.
* For ``remote=True``, the API creates a ``DistributedFunctionCall`` object and streams results and MLflow system metrics before returning or raising the remote exception. For local runs, ``_execute_local`` spawns multiple processes with ``torchrun``-style arguments and then aggregates outputs with ``collect_outputs_or_raise``.
* Set the following **inside** the decorator: GPU count, GPU type, and whether execution is remote or local.
* Set the following **outside** the decorator (before calling ``.distributed``): environment variables such as ``DATABRICKS_USE_RESERVED_GPU_POOL``/``DATABRICKS_REMOTE_GPU_POOL_ID``, any ``%pip`` installs that affect the captured environment, and the input arguments passed when invoking ``.distributed``.
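
A minimal usage sketch of this flow. The import path (``serverless_gpu``) and the parameter names (``gpus``, ``gpu_type``, ``remote``) are assumptions for illustration, not the documented signature:

.. code-block:: python

   from serverless_gpu import distributed  # hypothetical import path

   @distributed(gpus=4, gpu_type="A10", remote=True)  # cluster shape is fixed here
   def train(epochs: int, lr: float) -> float:
       import torch.distributed as dist

       # torchrun-style env vars are injected on each worker, so the
       # process group can initialize from the environment.
       dist.init_process_group(backend="nccl")
       rank = dist.get_rank()
       loss = 0.0
       for _ in range(epochs):
           loss = lr * (rank + 1)  # placeholder for a real training step
       return loss

   # Serializes the function and validated arguments with cloudpickle,
   # stages the notebook environment, and launches the run remotely.
   result = train.distributed(epochs=10, lr=3e-4)
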
Ray execution (``@ray_launch``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* ``@ray_launch`` is layered on top of ``@distributed``. Each task first bootstraps a torch-distributed rendezvous to elect the Ray head node and gather worker IP addresses.
* Rank-zero starts ``ray start --head`` (with metrics export if enabled), sets ``RAY_ADDRESS``, and runs your decorated function as the Ray driver. Other nodes join via ``ray start --address`` and wait until the driver writes a completion marker.
* Ray system metrics are collected per node via ``RayMetricsMonitor`` when ``remote=True``.
* Configure Ray runtime details (actors, datasets, placement groups, scheduling options) **inside** your decorated function using normal Ray APIs, as in the sketch below. Cluster-wide controls (number/type of GPUs, remote vs. local, async behavior, and Databricks pool env vars) stay **outside**, in the decorator arguments or notebook environment.
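
A sketch of this layering, under the same naming assumptions as the ``@distributed`` example above (the import path and the ``.distributed`` call convention are hypothetical):

.. code-block:: python

   from serverless_gpu import ray_launch  # hypothetical import path

   @ray_launch(gpus=8, gpu_type="A10", remote=True)
   def tune() -> list:
       import ray

       # Rank zero has already run ``ray start --head`` and exported
       # RAY_ADDRESS, so init() attaches this driver to the cluster.
       ray.init()

       @ray.remote(num_gpus=1)
       def score(x: int) -> int:
           return x * x

       # Ray runtime details (tasks, actors, placement groups) are
       # configured here, inside the function, with normal Ray APIs.
       return ray.get([score.remote(i) for i in range(8)])

   results = tune.distributed()
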
Use Cases
---------
Serverless GPU API is ideal for:
* Machine learning model training at scale
* Distributed data processing
* GPU-accelerated computations
* Research and experimentation workflows
Distributed Execution Details
-----------------------------
When running in distributed mode:
* The function is serialized and distributed across the specified number of GPUs
* Each GPU runs a copy of the function with the same parameters
* The environment is synchronized across all nodes
* Results are collected and returned from all GPUs, as illustrated in the sketch below
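
For instance (a hedged sketch; the exact output shape is an assumption based on the behavior described above):

.. code-block:: python

   from serverless_gpu import distributed  # hypothetical import path

   @distributed(gpus=2, gpu_type="A10", remote=False)
   def whoami(tag: str) -> str:
       import os

       # Every GPU runs the same body with the same arguments; only
       # per-process env vars such as RANK differ.
       return f"{tag}-rank{os.environ.get('RANK', '0')}"

   # Assuming one output is returned per spawned process, e.g.
   # ["demo-rank0", "demo-rank1"].
   outputs = whoami.distributed(tag="demo")
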
Best Practices
--------------
* Always specify ``gpu_type`` when using ``remote=True``, as in the sketch below
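
For example (parameter names assumed, as in the sketches above):

.. code-block:: python

   from serverless_gpu import distributed  # hypothetical import path

   # Pin the GPU type explicitly whenever remote=True.
   @distributed(gpus=4, gpu_type="A10", remote=True)
   def fit() -> None:
       ...
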
Limitations
-----------
* Pip environment size is limited to 15 GB.
* We do not support Ray Serve APIs.
Troubleshooting
---------------
See the Databricks `Serverless GPU troubleshooting guide`_ for fixes to the most common errors.