Feature
- class databricks.ml_features.entities.feature.Feature(*, source: DataSource, inputs: List[str], function: Function, time_window: TimeWindow, catalog_name: str, schema_name: str, name: Optional[str] = None, description: Optional[str] = None, filter_condition: Optional[str] = None)
Bases: _FeatureStoreObject
Represents a feature definition that combines a data source with aggregation logic.
- Parameters
catalog_name – The catalog name for the feature (required)
schema_name – The schema name for the feature (required)
name – The name of the feature. Leading and trailing whitespace will be stripped. If not provided or empty after stripping, a name will be auto-generated based on the input columns, function, and time window.
source – The data source for this feature
inputs – List of column names from the source to use as input
function – The aggregation function to apply to the input columns
time_window – The time window for the aggregation
description – Optional description of the feature
filter_condition – Optional SQL filter condition to apply on the source data before aggregation
- __init__(*, source: DataSource, inputs: List[str], function: Function, time_window: TimeWindow, catalog_name: str, schema_name: str, name: Optional[str] = None, description: Optional[str] = None, filter_condition: Optional[str] = None)
Initialize a Feature object. See class documentation. Should not be invoked directly; use FeatureEngineeringClient.create_feature instead. create_feature ensures the Feature is registered in Unity Catalog and properly validated.
- property name: str
The leaf name of the feature.
- property full_name: str
The fully qualified Unity Catalog name of the feature.
- property catalog_name: str
The catalog name of the feature.
- property schema_name: str
The schema name of the feature.
- property source: DataSource
The data source for this feature.
- property function: Function
The aggregation function to apply to the input columns.
- property time_window: TimeWindow
The time window for the aggregation.
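A hedged sketch of defining such a feature. The catalog, schema, table, and column names are made up, and the TimeWindow import path and constructor shown here are assumptions not confirmed by this reference; in practice the feature should be created through FeatureEngineeringClient.create_feature rather than by constructing Feature directly.

```python
from datetime import timedelta

# Assumes the databricks.ml_features package is installed; all names are illustrative.
from databricks.ml_features.entities.data_source import DeltaTableSource
from databricks.ml_features.entities.feature import Feature
from databricks.ml_features.entities.function import Sum

# Assumed import path and constructor for TimeWindow (not confirmed by this reference).
from databricks.ml_features.entities.time_window import TimeWindow

source = DeltaTableSource(
    catalog_name="main",
    schema_name="ml",
    table_name="transactions",
    entity_columns=["customer_id"],
    timeseries_column="event_ts",
)

feature = Feature(
    source=source,
    inputs=["amount"],
    function=Sum(),
    time_window=TimeWindow(duration=timedelta(days=7)),  # assumed signature
    catalog_name="main",
    schema_name="ml",
    description="7-day spend per customer",
    filter_condition="amount > 0",
)
```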
Feature Aggregations
- class databricks.ml_features.entities.feature_aggregations.FeatureAggregations(*, source_table: str, lookup_key: Union[str, List[str]], timestamp_key: str, granularity: timedelta, start_time: datetime, end_time: Optional[datetime] = None, aggregations: List[Aggregation])
Bases: _FeatureStoreObject
Note
Aliases: databricks.feature_engineering.entities.feature_lookup.FeatureLookup, databricks.feature_store.entities.feature_lookup.FeatureLookup
Defines an aggregation specification.
- Parameters
source_table – The source table to perform aggregation on. The source table can be any Delta table.
lookup_key – Key to use when computing aggregation. It can be a single key or a list of keys.
timestamp_key – Key for timestamp. Used for determining the temporal position of data points.
granularity – The temporal granularity at which to generate aggregated features. For example, a granularity of 1 day means the aggregation materialized view will contain one row per primary key per day, from start_time until now.
start_time – The earliest time to generate aggregated features from. For example, a start_time of 2020-01-01 means the aggregation materialized view will not contain any rows before this time. This will be the start of the first granularity interval.
end_time – The latest time to generate aggregated features to. If None, the end time is the time of each materialization pipeline run; if a datetime object, that value is used as the end time.
aggregations – A list of aggregations to perform. Each aggregation defines an output column.
- __init__(*, source_table: str, lookup_key: Union[str, List[str]], timestamp_key: str, granularity: timedelta, start_time: datetime, end_time: Optional[datetime] = None, aggregations: List[Aggregation])
Initialize a FeatureAggregations object. See class documentation.
- property source_table: str
Returns the source table used for aggregation.
- property timestamp_key: str
Returns the timestamp key used for aggregation.
- property granularity: timedelta
Returns the granularity at which features are aggregated.
- property start_time: datetime
Returns the start time from which to generate aggregated features.
- property end_time: Optional[datetime]
Returns the end time up to which to generate aggregated features.
- property aggregations: List[Aggregation]
Returns the list of aggregations to perform.
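The interaction of granularity, start_time, and end_time determines how many rows per key the materialized view contains. A small self-contained sketch of that interval count, using plain datetime arithmetic with no Databricks dependency:

```python
from datetime import datetime, timedelta

def interval_count(start_time: datetime, end_time: datetime, granularity: timedelta) -> int:
    """Number of granularity-sized intervals between start_time and end_time.

    Mirrors the documented behavior: one row per key per interval, with the
    first interval starting at start_time. A partial trailing interval still
    counts as a row.
    """
    elapsed = end_time - start_time
    # divmod on timedeltas gives (whole intervals, remainder).
    whole, remainder = divmod(elapsed, granularity)
    return whole + (1 if remainder else 0)

# With daily granularity, ten full days yield ten rows per key.
n = interval_count(datetime(2020, 1, 1), datetime(2020, 1, 11), timedelta(days=1))
```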
- class databricks.ml_features.entities.aggregation.Aggregation(*, function: Union[str, Function], time_window: Optional[TimeWindow] = None, column: Optional[str] = None, output_column: Optional[str] = None, filter_condition: Optional[str] = None, **kwargs)
Bases: _FeatureStoreObject
Defines a single aggregated feature.
- Parameters
column – The source column to aggregate. The column must exist in the parent FeatureAggregations source_table.
output_column – The output column name. If not provided, a default name will be generated.
function – The function to use. If a string is given, it will be interpreted as short-hand (e.g., "sum", "avg", "count").
time_window – The time window to aggregate data with.
filter_condition – Optional SQL WHERE clause to filter source data before aggregation.
- __init__(*, function: Union[str, Function], time_window: Optional[TimeWindow] = None, column: Optional[str] = None, output_column: Optional[str] = None, filter_condition: Optional[str] = None, **kwargs)
Initialize an Aggregation object. See class documentation.
- property output_column: str
The output column name.
- property function: Function
The aggregation function to use.
- property time_window: TimeWindow
The time window to aggregate data with.
- property window: TimeWindow
The time window to aggregate data with.
- databricks.ml_features.entities.aggregation.Window
alias of TimeWindow
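For illustration, an Aggregation using the short-hand string form of function. The column and output names are made up, and this assumes the databricks.ml_features package is installed:

```python
from databricks.ml_features.entities.aggregation import Aggregation

# Illustrative names; "sum" is short-hand resolved via Function.from_string.
agg = Aggregation(
    function="sum",
    column="amount",
    output_column="amount_sum_7d",
    filter_condition="status = 'completed'",
)
```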
Aggregation Functions
- class databricks.ml_features.entities.function.Function
Bases: _FeatureStoreObject
Abstract base class for all aggregation functions.
- abstract property name: str
Return the name of the aggregation function.
- extra_parameters() Dict[str, Any]
Return the extra parameters of the function. Only applicable to a few functions that require additional parameters.
- classmethod from_string(function_str: str) Function
Create a Function instance from a string representation.
- Parameters
function_str – String name of the aggregation function
- Returns
Function instance
- Raises
ValueError – If the function string is not recognized
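The from_string behavior can be pictured with a minimal stand-alone registry sketch. This is not the library's actual implementation; only the lookup-or-ValueError pattern is taken from the description above:

```python
from typing import Dict, Type

class Function:
    """Minimal stand-in for the abstract aggregation-function base class."""

    @property
    def name(self) -> str:
        raise NotImplementedError

    @classmethod
    def from_string(cls, function_str: str) -> "Function":
        # A registry mapping short-hand names to concrete classes.
        registry: Dict[str, Type["Function"]] = {"sum": Sum, "avg": Avg, "count": Count}
        try:
            return registry[function_str.lower()]()
        except KeyError:
            raise ValueError(f"Unrecognized aggregation function: {function_str!r}")

class Sum(Function):
    @property
    def name(self) -> str:
        return "sum"

class Avg(Function):
    @property
    def name(self) -> str:
        return "avg"

class Count(Function):
    @property
    def name(self) -> str:
        return "count"
```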
- class databricks.ml_features.entities.function.Avg
Bases: Function
Class representing the average (avg) aggregation function.
- property name: str
Return the name of the aggregation function.
- class databricks.ml_features.entities.function.Count
Bases: Function
Class representing the count aggregation function.
- property name: str
Return the name of the aggregation function.
- class databricks.ml_features.entities.function.ApproxCountDistinct(relativeSD: Optional[float] = None)
Bases: Function
Class representing the approximate count distinct (approx_count_distinct) aggregation function. See https://docs.databricks.com/en/sql/language-manual/functions/approx_count_distinct.html
- Parameters
relativeSD – The relative standard deviation allowed in the approximation.
- property name: str
Return the name of the aggregation function.
- class databricks.ml_features.entities.function.ApproxPercentile(percentile: float, accuracy: Optional[int] = None)
Bases: Function
Class representing the percentile approximation (approx_percentile) aggregation function. See https://docs.databricks.com/en/sql/language-manual/functions/approx_percentile.html
- Parameters
percentile – The percentile to approximate.
accuracy – The accuracy of the approximation.
- property name: str
Return the name of the aggregation function.
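ApproxPercentile trades exactness for bounded memory. For small data, the exact value it approximates can be computed with the standard library; here, the median (percentile 0.5):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# The 50th percentile (median) that ApproxPercentile(percentile=0.5)
# would approximate. quantiles(n=100) returns the 1st..99th cut points,
# so index 49 is the 50th percentile.
cut_points = statistics.quantiles(data, n=100, method="inclusive")
median = cut_points[49]  # 5.5 for this data
```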
- class databricks.ml_features.entities.function.First
Bases: Function
Class representing the first aggregation function.
- property name: str
Return the name of the aggregation function.
- class databricks.ml_features.entities.function.Last
Bases: Function
Class representing the last aggregation function.
- property name: str
Return the name of the aggregation function.
- class databricks.ml_features.entities.function.Max
Bases: Function
Class representing the maximum (max) aggregation function.
- property name: str
Return the name of the aggregation function.
- class databricks.ml_features.entities.function.Min
Bases: Function
Class representing the minimum (min) aggregation function.
- property name: str
Return the name of the aggregation function.
- class databricks.ml_features.entities.function.StddevPop
Bases: Function
Class representing the population standard deviation (stddev_pop) aggregation function.
- property name: str
Return the name of the aggregation function.
- class databricks.ml_features.entities.function.StddevSamp
Bases: Function
Class representing the sample standard deviation (stddev_samp) aggregation function.
- property name: str
Return the name of the aggregation function.
- class databricks.ml_features.entities.function.Sum
Bases: Function
Class representing the sum aggregation function.
- property name: str
Return the name of the aggregation function.
Data Sources
- class databricks.ml_features.entities.data_source.DataSourceTypes(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Enumeration of supported data source types.
- class databricks.ml_features.entities.data_source.DataSource(*, source_type: DataSourceTypes, entity_columns: List[str], timeseries_column: str)
Bases: _FeatureStoreObject, ABC
Abstract base class for data sources used in feature computation.
- Parameters
source_type – The type of data source
entity_columns – List of column names that serve as primary keys
timeseries_column – Column name that contains timestamp data
- __init__(*, source_type: DataSourceTypes, entity_columns: List[str], timeseries_column: str)
Initialize a DataSource object. See class documentation.
- property source_type: DataSourceTypes
The type of data source.
- property timeseries_column: str
Column name that contains timestamp data.
- property order_column: str
The name of the column used for ordering records.
- abstract full_name() str
Return the full name/identifier for this data source.
- class databricks.ml_features.entities.data_source.DeltaTableSource(*, catalog_name: str, schema_name: str, table_name: str, entity_columns: List[str], timeseries_column: str)
Bases: BackfillSource
Data source implementation for Delta Lake tables.
- Parameters
catalog_name – The name of the Unity Catalog catalog
schema_name – The name of the schema within the catalog
table_name – The name of the table within the schema
entity_columns – List of column names that serve as primary keys
timeseries_column – Column name that contains timestamp data
- __init__(*, catalog_name: str, schema_name: str, table_name: str, entity_columns: List[str], timeseries_column: str)
Initialize a DeltaTableSource object. See class documentation.
- property catalog_name: str
The name of the Unity Catalog catalog.
- property schema_name: str
The name of the schema within the catalog.
- property table_name: str
The name of the table within the schema.
- full_name() str
Return the full table name in catalog.schema.table format.
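full_name follows the Unity Catalog three-level namespace. A trivial stand-alone sketch of that formatting (the library's own method body is not shown in this reference):

```python
def delta_full_name(catalog_name: str, schema_name: str, table_name: str) -> str:
    # Unity Catalog three-level namespace: catalog.schema.table
    return f"{catalog_name}.{schema_name}.{table_name}"

full = delta_full_name("main", "ml", "transactions")  # "main.ml.transactions"
```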
- class databricks.ml_features.entities.data_source.VolumeSource(*, entity_columns: List[str], timeseries_column: str)
Bases: DataSource
Data source implementation for Unity Catalog Volumes.
TODO: Implementation to be defined based on volume requirements.
- __init__(*, entity_columns: List[str], timeseries_column: str)
Initialize a VolumeSource object. See class documentation.
- full_name() str
Return the full volume path identifier.
- class databricks.ml_features.entities.data_source.KafkaSource(*, name: str, entity_columns: List[str], timeseries_column: str)
Bases: DataSource
Data source implementation for Kafka streams.
KafkaSource references a KafkaConfig by name. The KafkaConfig contains connection details, authentication, and schemas for the Kafka topics. Column names must be prefixed with 'key:' or 'value:' to indicate which schema to use. Examples: 'key:customer_id' or 'value:trip_details.pickup_zip' for nested JSON fields.
- Parameters
name – Name of the KafkaConfig to use (uniquely identifies the KafkaConfig in the metastore)
entity_columns – List of column names with schema prefix (e.g., ['key:customer_id', 'value:trip_details.pickup_zip'])
timeseries_column – Column name with schema prefix (e.g., 'value:event_timestamp')
- __init__(*, name: str, entity_columns: List[str], timeseries_column: str)
Initialize a KafkaSource object. See class documentation.
- class SchemaType
Bases: object
Constants for Kafka schema types.
- property name: str
The name of the KafkaConfig this source references.
- full_name() str
Return the Kafka config name as the full identifier.
- load_df(spark_client)
Return an empty DataFrame for Kafka sources with schema from KafkaConfig.
Note: Features with Kafka sources are not supported in the training workflow. Calling load_df() or score_batch() on training sets or models with Kafka-source features will raise NotImplementedError.
This method returns an empty DataFrame to allow training set creation for model logging purposes only. The feature metadata will be captured and associated with the model, enabling feature lookups at inference time from the online store.
- Parameters
spark_client – The Spark client for DataFrame operations
- static get_columns_from_kafka_config(kafka_config) Dict[str, DataType]
Extract all columns and their Spark types from a KafkaConfig.
This method parses the key_schema and value_schema from the KafkaConfig and returns a dictionary mapping column names (with key:/value: prefixes) to their Spark DataTypes.
- Parameters
kafka_config – The KafkaConfig object containing schema information
- Returns
Dictionary mapping prefixed column names to Spark DataTypes
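The key:/value: column convention above can be pictured with a small stand-alone parser. This illustrates the naming convention only, not the library's internal parsing:

```python
from typing import List, Tuple

def split_kafka_column(column: str) -> Tuple[str, List[str]]:
    """Split a prefixed Kafka column name into (schema, field path).

    'value:trip_details.pickup_zip' -> ('value', ['trip_details', 'pickup_zip'])
    """
    prefix, sep, rest = column.partition(":")
    if prefix not in ("key", "value") or not sep or not rest:
        raise ValueError(f"Expected a 'key:' or 'value:' prefix, got {column!r}")
    # Dots address nested JSON fields within the chosen schema.
    return prefix, rest.split(".")
```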
- class databricks.ml_features.entities.data_source.DataFrameSource(*, dataframe, entity_columns: List[str], timeseries_column: str, source_name: Optional[str] = None)
Bases: DataSource
Data source implementation for Spark DataFrames.
This allows using an existing Spark DataFrame directly as a data source for feature computation, useful for in-memory data processing and testing.
- Parameters
dataframe – The Spark DataFrame to use as the data source
entity_columns – List of column names that serve as primary keys
timeseries_column – Column name that contains timestamp data
source_name – Optional name for the DataFrame source (for identification)
- __init__(*, dataframe, entity_columns: List[str], timeseries_column: str, source_name: Optional[str] = None)
Initialize a DataFrameSource object. See class documentation.
- property source_name: str
The name identifier for this DataFrame source.
- full_name() str
Return the source name identifier for this DataFrame.
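A hedged sketch of wrapping an in-memory DataFrame, which can be convenient in tests. It assumes a live SparkSession named spark and the databricks.ml_features package; the data and column names are illustrative:

```python
from databricks.ml_features.entities.data_source import DataFrameSource

# `spark` is an existing SparkSession; row data and names are made up.
df = spark.createDataFrame(
    [("c1", "2024-01-01 00:00:00", 12.5)],
    "customer_id string, event_ts string, amount double",
)

source = DataFrameSource(
    dataframe=df,
    entity_columns=["customer_id"],
    timeseries_column="event_ts",
    source_name="test_transactions",
)
```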