Databricks FeatureEngineeringClient

class databricks.feature_engineering.client.FeatureEngineeringClient(*, model_registry_uri: Optional[str] = None)

Bases: object

Client for interacting with the Databricks Feature Engineering in Unity Catalog.

Note

Use Databricks FeatureStoreClient for workspace feature tables in hive metastore

create_feature_serving_endpoint(*, name: Optional[str] = None, config: Optional[EndpointCoreConfig] = None, **kwargs) → FeatureServingEndpoint

Note

Experimental: This function may change or be removed in a future release without warning.

Experimental feature: Creates a Feature Serving Endpoint

Parameters

name – The name of the endpoint. Must only contain alphanumerics and dashes.
config – Configuration of the endpoint, including features, workload_size, etc.

create_feature_spec(*, name: str, features: List[Union[FeatureLookup, FeatureFunction]], exclude_columns: Optional[List[str]] = None) → FeatureSpecInfo

Creates a feature specification in Unity Catalog. The feature spec can be used for serving features & functions.

Parameters

name – The name of the feature spec.
features – List of FeatureLookups and FeatureFunctions to include in the feature spec.
exclude_columns – List of columns to drop from the final output.

update_feature_spec(*, name: str, owner: str) → None

Update the owner of a feature spec.

Parameters

name – The name of the feature spec.
owner – The new owner of the feature spec.

delete_feature_spec(*, name: str) → None

Delete a feature spec.

Parameters: name – The name of the feature spec.

create_table(*, name: str, primary_keys: Union[str, List[str]], df: Optional[DataFrame] = None, timeseries_column: Optional[str] = None, partition_columns: Optional[Union[str, List[str]]] = None, schema: Optional[StructType] = None, description: Optional[str] = None, tags: Optional[Dict[str, str]] = None, **kwargs) → FeatureTable

Create and return a feature table with the given name and primary keys.

The returned feature table has the given name and primary keys. Uses the provided schema or the inferred schema of the provided df. If df is provided, this data will be saved in a Delta table. Supported data types for features are: IntegerType, LongType, FloatType, DoubleType, StringType, BooleanType, DateType, TimestampType, ShortType, ArrayType, MapType, and BinaryType, DecimalType, and StructType.

Parameters

name – A feature table name. The format is <catalog_name>.<schema_name>.<table_name>, for example ml.dev.user_features.
primary_keys – The feature table’s primary keys. If multiple columns are required, specify a list of column names, for example ['customer_id', 'region'].
df – Data to insert into this feature table. The schema of df will be used as the feature table schema.
timeseries_column –
Column containing the event time associated with feature value. Timeseries column should be part of the primary keys. Combined, the timeseries column and other primary keys of the feature table uniquely identify the feature value for an entity at a point in time.

Note

Experimental: This argument may change or be removed in a future release without warning.
partition_columns –
Columns used to partition the feature table. If a list is provided, column ordering in the list will be used for partitioning.

Note

When choosing partition columns for your feature table, use columns that do not have a high cardinality. An ideal strategy would be such that you expect data in each partition to be at least 1 GB. The most commonly used partition column is a date.

Additional info: Choosing the right partition columns for Delta tables
schema – Feature table schema. Either schema or df must be provided.
description – Description of the feature table.
tags – Tags to associate with the feature table.

create_training_set(*, df: DataFrame, feature_lookups: Optional[List[Union[FeatureLookup, FeatureFunction]]] = None, feature_spec: Optional[str] = None, label: Optional[Union[str, List[str]]], exclude_columns: Optional[List[str]] = None, use_spark_native_join: bool = False, **kwargs) → TrainingSet

Create a TrainingSet using either feature_lookups or feature_spec.

Parameters

df – The DataFrame used to join features into.
feature_lookups – List of features to use in the TrainingSet. FeatureLookups are joined into the DataFrame, and FeatureFunctions are computed on-demand. feature_lookups cannot be specified if feature_spec is provided.
feature_spec – Full name of the FeatureSpec in Unity Catalog. feature_spec cannot be specified if feature_lookups is provided.
label – Names of column(s) in DataFrame that contain training set labels. To create a training set without a label field, i.e. for unsupervised training set, specify label = None.
exclude_columns – Names of the columns to drop from the TrainingSet DataFrame.
use_spark_native_join – Use spark to optimize table joining performance. The optimization is only applicable when Photon is enabled.

Returns

A TrainingSet object.

delete_feature_serving_endpoint(*, name=None, **kwargs) → None: Note

Experimental: This function may change or be removed in a future release without warning.

Experimental feature

delete_feature_table_tag(*, name: str, key: str) → None

Delete the tag associated with the feature table. Deleting a non-existent tag will emit a warning.

Parameters

name – the feature table name.
key – the tag key to delete.

drop_online_table(name: str, online_store: OnlineStoreSpec) → None

Drop a table in an online store.

This API first attempts to make a call to the online store provider to drop the table. If successful, it then deletes the online store from the feature catalog.

Parameters

name – Name of feature table associated with online store table to drop.
online_store – Specification of the online store.

Note

Deleting an online published table can lead to unexpected failures in downstream dependencies. Ensure that the online table being dropped is no longer used for Model Serving feature lookup or any other use cases.

drop_table(*, name: str) → None

Delete the specified feature table. This API also drops the underlying Delta table.

Parameters: name – A feature table name. The format is <catalog_name>.<schema_name>.<table_name>, for example ml.dev.user_features.

Note

Deleting a feature table can lead to unexpected failures in upstream producers and downstream consumers (models, endpoints, and scheduled jobs). You must delete any existing published online stores separately.

get_feature_serving_endpoint(*, name=None, **kwargs) → FeatureServingEndpoint: Note

Experimental: This function may change or be removed in a future release without warning.

Experimental feature

get_table(*, name: str) → FeatureTable

Get a feature table’s metadata.

Parameters: name – A feature table name. The format is <catalog_name>.<schema_name>.<table_name>, for example ml.dev.user_features.

log_model(*, model: Any, artifact_path: str, flavor: module, training_set: Optional[TrainingSet] = None, registered_model_name: Optional[str] = None, await_registration_for: int = 300, infer_input_example: bool = False, **kwargs)

Log an MLflow model packaged with feature lookup information.

Note

The DataFrame returned by TrainingSet.load_df() must be used to train the model. If it has been modified (for example data normalization, add a column, and similar), these modifications will not be applied at inference time, leading to training-serving skew.

Parameters

model – Model to be saved. This model must be capable of being saved by flavor.save_model. See the MLflow Model API.
artifact_path – Run-relative artifact path.
flavor – MLflow module to use to log the model. flavor should have type ModuleType. The module must have a method save_model, and must support the python_function flavor. For example, mlflow.sklearn, mlflow.xgboost, and similar.
training_set – The TrainingSet used to train this model.
registered_model_name –

Note

Experimental: This argument may change or be removed in a future release without warning.

If given, create a model version under registered_model_name, also creating a registered model if one with the given name does not exist.
await_registration_for – Number of seconds to wait for the model version to finish being created and is in READY status. By default, the function waits for five minutes. Specify 0 or None to skip waiting.
infer_input_example –

Note

Experimental: This argument may change or be removed in a future release without warning.

Automatically log an input example along with the model, using supplied training data. Defaults to False.

Kwargs

If other keyword arguments are provided, in most cases, they are passed to the underlying MLflow API flavor.save_model() when saving and registering the model.

Note

signature is not recommended and it’s preferred to use infer_input_example.

output_schema: When logging a model without labels in the training set, you must pass output_schema to log_model to suggest the output schema explicitly. For example:

from mlflow.types import ColSpec, DataType, Schema

output_schema = Schema([ColSpec(DataType.???)]) # Refer to mlflow signature types for the right choice of type here
...
fe.log_model(
    ...
    output_schema=output_schema
)

Returns

None

publish_table(*, online_store: Union[OnlineStoreSpec, DatabricksOnlineStore], source_table_name: Optional[str] = None, online_table_name: Optional[str] = None, filter_condition: Optional[str] = None, mode: str = 'merge', streaming: bool = False, checkpoint_location: Optional[str] = None, trigger: Dict[str, Any] = {'processingTime': '5 minutes'}, features: Optional[Union[str, List[str]]] = None, **kwargs) → Optional[Union[StreamingQuery, PublishedTable]]

Publish a feature table to an online store.

Parameters

source_table_name – Name of the feature table. This is a required parameter.
online_table_name – Name of the online table. This is a required parameter when publishing to Databricks Online Store.
online_store – Specification of the online store. This is a required parameter.
filter_condition – A SQL expression using feature table columns that filters feature rows prior to publishing to the online store. For example, "dt > '2020-09-10'". This is analogous to running df.filter or a WHERE condition in SQL on a feature table prior to publishing.
mode –
Specifies the behavior when data already exists in this feature table. The only supported mode is "merge", with which the new data will be merged in, under these conditions:
- If a key exists in the online table but not the offline table, the row in the online table is unmodified.
- If a key exists in the offline table but not the online table, the offline table row is inserted into the online table.
- If a key exists in both the offline and the online tables, the online table row will be updated.
streaming – If True, streams data to the online store.
checkpoint_location – Sets the Structured Streaming checkpointLocation option. By setting a checkpoint_location, Spark Structured Streaming will store progress information and intermediate state, enabling recovery after failures. This parameter is only supported when streaming=True.
trigger – If streaming=True, trigger defines the timing of stream data processing. The dictionary will be unpacked and passed to DataStreamWriter.trigger as arguments. For example, trigger={'once': True} will result in a call to DataStreamWriter.trigger(once=True).
features –
Specifies the feature column(s) to be published to the online store. The selected features must be a superset of existing online store features. Primary key columns and timestamp key columns will always be published.

Note

When features is not set, the whole feature table will be published.

Note

The following parameters are not supported for Databricks online store: - filter_condition - checkpoint_location - mode - trigger - features

Note

Change Data Feed (CDF) must be enabled when publishing to databricks online store.

Returns: If streaming=True, returns a PySpark StreamingQuery, None otherwise.

read_table(*, name: str, **kwargs) → DataFrame

Read the contents of a feature table.

Parameters: name – A feature table name. The format is <catalog_name>.<schema_name>.<table_name>, for example ml.dev.user_features.
Returns: The feature table contents, or an exception will be raised if this feature table does not exist.

score_batch(*, model_uri: str, df: DataFrame, result_type: str = 'double', env_manager: str = 'local', params: Optional[dict[str, Any]] = None, use_spark_native_join: bool = False, **kwargs) → DataFrame

Evaluate the model on the provided DataFrame.

Additional features required for model evaluation will be automatically retrieved from feature tables.

The model must have been logged with FeatureEngineeringClient.log_model(), which packages the model with feature metadata. Unless present in df, these features will be looked up from feature tables and joined with df prior to scoring the model.

If a feature is included in df, the provided feature values will be used rather than those stored in feature tables.

For example, if a model is trained on two features account_creation_date and num_lifetime_purchases, as in:

feature_lookups = [
    FeatureLookup(
        table_name = 'trust_and_safety.customer_features',
        feature_name = 'account_creation_date',
        lookup_key = 'customer_id',
    ),
    FeatureLookup(
        table_name = 'trust_and_safety.customer_features',
        feature_name = 'num_lifetime_purchases',
        lookup_key = 'customer_id'
    ),
]

with mlflow.start_run():
    training_set = fe.create_training_set(
        df,
        feature_lookups = feature_lookups,
        label = 'is_banned',
        exclude_columns = ['customer_id']
    )
    ...
      fe.log_model(
        model,
        "model",
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name="example_model"
      )

Then at inference time, the caller of FeatureEngineeringClient.score_batch() must pass a DataFrame that includes customer_id, the lookup_key specified in the FeatureLookups of the training_set. If the DataFrame contains a column account_creation_date, the values of this column will be used in lieu of those in feature tables. As in:

# batch_df has columns ['customer_id', 'account_creation_date']
predictions = fe.score_batch(
    'models:/example_model/1',
    batch_df
)

Parameters

model_uri –
The location, in URI format, of the MLflow model logged using FeatureEngineeringClient.log_model(). One of:
- runs:/<mlflow_run_id>/run-relative/path/to/model
- models:/<model_name>/<model_version>
- models:/<model_name>/<stage>
For more information about URI schemes, see Referencing Artifacts.
df –
The DataFrame to score the model on. Features from feature tables will be joined with df prior to scoring the model. df must:

1. Contain columns for lookup keys required to join feature data from feature tables, as specified in the feature_spec.yaml artifact.

2. Contain columns for all source keys required to score the model, as specified in the feature_spec.yaml artifact.

3. Not contain a column prediction, which is reserved for the model’s predictions. df may contain additional columns.

Streaming DataFrames are not supported.
result_type – The return type of the model. See mlflow.pyfunc.spark_udf() result_type.
env_manager – The environment manager to use in order to create the python environment for model inference. See mlflow.pyfunc.spark_udf() env_manager.
params – Additional parameters to pass to the model for inference.
use_spark_native_join – Use spark to optimize table joining performance. The optimization is only applicable when Photon is enabled.

Returns

A DataFrame containing:

All columns of df.

All feature values retrieved from feature tables.

A column prediction containing the output of the model.

set_feature_table_tag(*, name: str, key: str, value: str) → None

Create or update a tag associated with the feature table. If the tag with the corresponding key already exists, its value will be overwritten with the new value.

Parameters

name – the feature table name
key – tag key
value – tag value

write_table(*, name: str, df: DataFrame, mode: str = 'merge', checkpoint_location: Optional[str] = None, trigger: Dict[str, Any] = {'processingTime': '5 seconds'}) → Optional[StreamingQuery]

Writes to a feature table.

If the input DataFrame is streaming, will create a write stream.

Parameters

name – A feature table name. The format is <catalog_name>.<schema_name>.<table_name>, for example ml.dev.user_features.
df – Spark DataFrame with feature data. Raises an exception if the schema does not match that of the feature table.
mode –
There is only one supported write mode:
- "merge" upserts the rows in df into the feature table. If df contains columns not present in the feature table, these columns will be added as new features.
If you want to overwrite a table, run SQL DELETE FROM <table name>; to delete all rows, or drop and recreate the table before calling this method.
checkpoint_location – Sets the Structured Streaming checkpointLocation option. By setting a checkpoint_location, Spark Structured Streaming will store progress information and intermediate state, enabling recovery after failures. This parameter is only supported when the argument df is a streaming DataFrame.
trigger – If df.isStreaming, trigger defines the timing of stream data processing, the dictionary will be unpacked and passed to DataStreamWriter.trigger as arguments. For example, trigger={'once': True} will result in a call to DataStreamWriter.trigger(once=True).

Returns

If df.isStreaming, returns a PySpark StreamingQuery. None otherwise.

aggregate_features(*, features: FeatureAggregations) → DataFrame

Computes the specified aggregations and returns a DataFrame containing the results.

Parameters: features – The aggregation specification to compute.

create_materialized_view(*, features: FeatureAggregations, output_table: str, schedule: Optional[CronSchedule]) → MaterializedViewInfo

Creates and runs a pipeline that materializes the given feature aggregation specification into a materialized view.

Parameters

features – The aggregation specification to materialize.
output_table – The name of the output materialized view.
schedule – The schedule at which to run the materialization pipeline. If not provided, the pipeline can only be run manually.

build_model_env(model_uri: str, save_path: str) → str

Prebuild the model Python environment required by the given model and generate an archive file saved to the specified save_path. The resulting environment can then be used in FeatureEngineeringClient.score_batch() as the prebuilt_env_uri parameter.

Parameters

model_uri – URI of the model used to build the Python environment.
save_path – Directory path to save the prebuilt model environment archive file. This can be a local directory path, a mounted DBFS path (e.g., ‘/dbfs/…’), or a mounted UC volume path (e.g., ‘/Volumes/…’).

Returns

The path of the archive file containing the Python environment data.

create_online_store(*, name: str, capacity: str, read_replica_count: Optional[int] = None) → DatabricksOnlineStore

Create an Online Feature Store.

Parameters

name – The name of the online store.
capacity – The capacity of the online store. Valid values are “CU_1”, “CU_2”, “CU_4”, “CU_8”.
read_replica_count – The number of read replicas for the online store.

Returns

The created online store.

get_online_store(*, name: str) → Optional[DatabricksOnlineStore]

Get an Online Feature Store.

Parameters

name – The name of the online store.

Returns

The retrieved online store, or None if not found.

update_online_store(*, name: str, capacity: ~typing.Union[str, ~databricks.feature_engineering.client._UnsetType] = <UNSET>, read_replica_count: ~typing.Union[int, ~databricks.feature_engineering.client._UnsetType] = <UNSET>) → DatabricksOnlineStore

Update an Online Feature Store. Only the fields specified will be updated. Fields that are not specified will remain unchanged.

Parameters

name – The name of the online store.
capacity – The capacity of the online store. Valid values are “CU_1”, “CU_2”, “CU_4”, “CU_8”.
read_replica_count – The number of read replicas for the online store.

Returns

The updated online store.

delete_online_store(*, name: str) → None

Delete an Online Feature Store.

Parameters: name – The name of the online store.
Returns: None.