databricks.vector_search package

class databricks.vector_search.client.VectorSearchClient(workspace_url=None, personal_access_token=None, service_principal_client_id=None, service_principal_client_secret=None, azure_tenant_id=None, azure_login_id=None, disable_notice=False)

Bases: object

A client for interacting with the Vector Search service.

This client provides methods for managing endpoints and indexes in the Vector Search service.

create_delta_sync_index(endpoint_name, index_name, primary_key, source_table_name, pipeline_type, embedding_dimension=None, embedding_vector_column=None, embedding_source_column=None, embedding_model_endpoint_name=None, sync_computed_embeddings=False)

Create a delta sync index.

Parameters:

endpoint_name (str) – The name of the endpoint.
index_name (str) – The name of the index.
primary_key (str) – The primary key of the index.
source_table_name (str) – The name of the source table.
pipeline_type (str) – The type of the pipeline. Must be CONTINUOUS or TRIGGERED.
embedding_dimension (int) – The dimension of the embedding vector.
embedding_vector_column (str) – The name of the embedding vector column.
embedding_source_column (str) – The name of the embedding source column.
embedding_model_endpoint_name (str) – The name of the embedding model endpoint.
sync_computed_embeddings (bool) – Whether to automatically sync the vector index contents and computed embeddings to a new Unity Catalog (UC) table. The table is named ${index_name}_writeback_table.
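The embedding parameters appear to pair up in two ways: a precomputed vector column together with its dimension, or a source text column together with a model endpoint that computes the embeddings. The sketch below assumes that pairing (the parameter list above does not state it) and validates the arguments before they are handed to create_delta_sync_index. The helper name delta_sync_index_kwargs is hypothetical and not part of the package.

```python
from typing import Any, Dict, Optional

VALID_PIPELINE_TYPES = {"CONTINUOUS", "TRIGGERED"}


def delta_sync_index_kwargs(
    endpoint_name: str,
    index_name: str,
    primary_key: str,
    source_table_name: str,
    pipeline_type: str,
    embedding_dimension: Optional[int] = None,
    embedding_vector_column: Optional[str] = None,
    embedding_source_column: Optional[str] = None,
    embedding_model_endpoint_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate and assemble keyword arguments for create_delta_sync_index."""
    if pipeline_type not in VALID_PIPELINE_TYPES:
        raise ValueError("pipeline_type must be CONTINUOUS or TRIGGERED")

    precomputed = embedding_vector_column is not None
    computed = embedding_source_column is not None
    # Assumption: exactly one of the two embedding styles is used.
    if precomputed == computed:
        raise ValueError("pass either a vector column or a source column")
    if precomputed and embedding_dimension is None:
        raise ValueError("a vector column needs embedding_dimension")
    if computed and embedding_model_endpoint_name is None:
        raise ValueError("a source column needs a model endpoint name")

    kwargs: Dict[str, Any] = {
        "endpoint_name": endpoint_name,
        "index_name": index_name,
        "primary_key": primary_key,
        "source_table_name": source_table_name,
        "pipeline_type": pipeline_type,
    }
    optional = {
        "embedding_dimension": embedding_dimension,
        "embedding_vector_column": embedding_vector_column,
        "embedding_source_column": embedding_source_column,
        "embedding_model_endpoint_name": embedding_model_endpoint_name,
    }
    # Only forward the optional arguments that were actually set.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs
```

The resulting dictionary can be unpacked into the call, for example client.create_delta_sync_index(**kwargs) on a VectorSearchClient instance.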

create_delta_sync_index_and_wait(endpoint_name, index_name, primary_key, source_table_name, pipeline_type, embedding_dimension=None, embedding_vector_column=None, embedding_source_column=None, embedding_model_endpoint_name=None, sync_computed_embeddings=False, verbose=False, timeout=datetime.timedelta(days=1))

Create a delta sync index and wait for it to be ready.

Parameters:

endpoint_name (str) – The name of the endpoint.
index_name (str) – The name of the index.
primary_key (str) – The primary key of the index.
source_table_name (str) – The name of the source table.
pipeline_type (str) – The type of the pipeline. Must be CONTINUOUS or TRIGGERED.
embedding_dimension (int) – The dimension of the embedding vector.
embedding_vector_column (str) – The name of the embedding vector column.
embedding_source_column (str) – The name of the embedding source column.
embedding_model_endpoint_name (str) – The name of the embedding model endpoint.
sync_computed_embeddings (bool) – Whether to automatically sync the vector index contents and computed embeddings to a new Unity Catalog (UC) table. The table is named ${index_name}_writeback_table.
verbose (bool) – Whether to print status messages.
timeout (datetime.timedelta) – The maximum time to wait before raising an Exception.

create_direct_access_index(endpoint_name, index_name, primary_key, embedding_dimension, embedding_vector_column, schema, embedding_model_endpoint_name=None)

Create a direct access index.

Parameters:

endpoint_name (str) – The name of the endpoint.
index_name (str) – The name of the index.
primary_key (str) – The primary key of the index.
embedding_dimension (int) – The dimension of the embedding vector.
embedding_vector_column (str) – The name of the embedding vector column.
schema (dict) – The schema of the index.
embedding_model_endpoint_name (str) – The name of the optional embedding model endpoint to use when querying.
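The schema argument is described only as a dict. The sketch below assumes it maps column names to type strings and checks that the primary key and the embedding vector column are both declared in it. The helper name, the column names, and the type strings are illustrative assumptions, not values taken from this reference.

```python
from typing import Any, Dict


def direct_access_index_kwargs(
    endpoint_name: str,
    index_name: str,
    primary_key: str,
    embedding_dimension: int,
    embedding_vector_column: str,
    schema: Dict[str, str],
) -> Dict[str, Any]:
    """Check a schema dict and assemble arguments for create_direct_access_index."""
    for column in (primary_key, embedding_vector_column):
        if column not in schema:
            raise ValueError(f"schema is missing column {column!r}")
    if embedding_dimension <= 0:
        raise ValueError("embedding_dimension must be positive")
    return {
        "endpoint_name": endpoint_name,
        "index_name": index_name,
        "primary_key": primary_key,
        "embedding_dimension": embedding_dimension,
        "embedding_vector_column": embedding_vector_column,
        "schema": dict(schema),
    }


# Hypothetical schema: column name -> type string (format assumed).
example_schema = {"id": "int", "text": "string", "text_vector": "array<float>"}
example_kwargs = direct_access_index_kwargs(
    endpoint_name="demo_endpoint",
    index_name="main.demo.direct_index",
    primary_key="id",
    embedding_dimension=3,
    embedding_vector_column="text_vector",
    schema=example_schema,
)
```

Under these assumptions, example_kwargs could be unpacked into client.create_direct_access_index(**example_kwargs).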

create_endpoint(name, endpoint_type='STANDARD')

Create an endpoint.

Parameters:

name (str) – The name of the endpoint.
endpoint_type (str) – The type of the endpoint. Must be set to STANDARD.

create_endpoint_and_wait(name, endpoint_type='STANDARD', verbose=False, timeout=datetime.timedelta(seconds=3600))

Create an endpoint and wait for it to be online.

Parameters:

name (str) – The name of the endpoint.
endpoint_type (str) – The type of the endpoint. Must be set to STANDARD.
verbose (bool) – Whether to print status messages.
timeout (datetime.timedelta) – The maximum time to wait before raising an Exception.

delete_endpoint(name)

Delete an endpoint.

Parameters:

name (str) – The name of the endpoint.

delete_index(endpoint_name=None, index_name=None)

Delete an index.

Parameters:

endpoint_name (Optional[str]) – The optional name of the endpoint.
index_name (str) – The name of the index.

get_endpoint(name)

Get an endpoint.

Parameters:

name (str) – The name of the endpoint.

get_index(endpoint_name=None, index_name=None)

Get an index.

Parameters:

endpoint_name (Optional[str]) – The optional name of the endpoint.
index_name (str) – The name of the index.

list_endpoints(): List all endpoints.

list_indexes(name)

List all indexes for an endpoint.

Parameters:

name (str) – The name of the endpoint.

validate(disable_notice=False)

wait_for_endpoint(name, verbose=False, timeout=datetime.timedelta(seconds=3600))

Wait for an endpoint to be online.

Parameters:

name (str) – The name of the endpoint.
verbose (bool) – Whether to print status messages.
timeout (datetime.timedelta) – The maximum time to wait before raising an Exception.
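The waiting methods take a verbose flag and a datetime.timedelta timeout. The following is a generic poll-until-deadline sketch that shows how such a pair of parameters typically behaves; it is not the library's implementation. The is_ready callable and the poll_interval parameter are hypothetical.

```python
import datetime
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout: datetime.timedelta,
    poll_interval: datetime.timedelta = datetime.timedelta(seconds=5),
    verbose: bool = False,
) -> None:
    """Poll is_ready() until it returns True or the timeout elapses."""
    # A monotonic clock is unaffected by wall-clock adjustments.
    deadline = time.monotonic() + timeout.total_seconds()
    while True:
        if is_ready():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout}")
        if verbose:
            print("still waiting...")
        time.sleep(poll_interval.total_seconds())
```

In practice is_ready would wrap a status check on the endpoint; the shape of that status is not described in this reference, so it is left to the caller.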

class databricks.vector_search.index.VectorSearchIndex(workspace_url, index_url, name, endpoint_name, personal_access_token=None, service_principal_client_id=None, service_principal_client_secret=None, azure_tenant_id=None, azure_login_id=None, use_user_passed_credentials=False)

Bases: object

VectorSearchIndex is a helper class that represents a Vector Search Index.

Do not instantiate this class directly; obtain instances through the VectorSearchClient class instead.

delete(primary_keys)

Delete data from the index.

Parameters:

primary_keys – List of primary keys to delete from the index.

describe(): Describe the index. This returns metadata about the index.

scan(num_results=10, last_primary_key=None)

Given all the data in the index sorted by primary key, this returns the next num_results rows after the primary key specified by last_primary_key. If last_primary_key is None, it returns the first num_results rows.

Note that if the index is being updated during the scan, the results may not be consistent.

Parameters:

num_results – Number of results to return.
last_primary_key – The last primary key from the previous page. It is used as the exclusive starting primary key.
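Because last_primary_key is an exclusive starting point, a full pass over the index is a loop that feeds each page's last key into the next call. The sketch below shows that loop under one assumption: fetch_page is a hypothetical adapter that calls scan(num_results=..., last_primary_key=...) and returns the rows together with the last primary key of the page. The shape of the scan response is not described in this reference, so extracting those two values is left to the adapter.

```python
from typing import Any, Callable, Iterator, List, Optional, Tuple

# (rows on this page, last primary key on this page or None)
Page = Tuple[List[Any], Optional[Any]]


def iter_scan(
    fetch_page: Callable[[int, Optional[Any]], Page],
    page_size: int = 10,
) -> Iterator[Any]:
    """Yield every row by following last_primary_key from page to page."""
    last_primary_key: Optional[Any] = None  # None requests the first page
    while True:
        rows, next_key = fetch_page(page_size, last_primary_key)
        if not rows:
            return  # an empty page means the scan is complete
        yield from rows
        if next_key is None:
            return
        last_primary_key = next_key
```

As the note on consistency implies, rows written while the loop is running may or may not be seen.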

similarity_search(columns, query_text=None, query_vector=None, filters=None, num_results=5, debug_level=1, query_type=None)

Perform a similarity search on the index. This returns the top K results that are most similar to the query.

Parameters:

columns – List of column names to return in the results.
query_text – Query text to search for.
query_vector – Query vector to search for.
filters – Filters to apply to the query.
num_results – Number of results to return.
debug_level – Debug level to use for the query.
score_threshold – Score threshold to use for the query.
query_type – The type of this query. Supported values are "ANN" and "HYBRID".
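A small helper can catch malformed queries before they reach the service. The sketch below checks query_type against the two supported values listed above and assumes that at least one of query_text or query_vector must be provided, which this reference does not state explicitly. The helper name similarity_search_kwargs is hypothetical.

```python
from typing import Any, Dict, Optional, Sequence

SUPPORTED_QUERY_TYPES = {"ANN", "HYBRID"}


def similarity_search_kwargs(
    columns: Sequence[str],
    query_text: Optional[str] = None,
    query_vector: Optional[Sequence[float]] = None,
    filters: Optional[Any] = None,
    num_results: int = 5,
    query_type: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate and assemble keyword arguments for similarity_search."""
    # Assumption: a query needs text, a vector, or both.
    if query_text is None and query_vector is None:
        raise ValueError("provide query_text or query_vector")
    if query_type is not None and query_type not in SUPPORTED_QUERY_TYPES:
        raise ValueError('query_type must be "ANN" or "HYBRID"')
    if num_results <= 0:
        raise ValueError("num_results must be positive")

    kwargs: Dict[str, Any] = {"columns": list(columns), "num_results": num_results}
    optional = {
        "query_text": query_text,
        "query_vector": query_vector,
        "filters": filters,
        "query_type": query_type,
    }
    # Leave unset arguments out so the method's own defaults apply.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs
```

With a VectorSearchIndex instance named index, the result would be used as index.similarity_search(**kwargs).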

sync(): Sync the index. This is used to sync the index with the source delta table. This only works with managed delta sync index with pipeline type=”TRIGGERED”.

upsert(inputs)

Upsert data into the index.

Parameters:

inputs – List of dictionaries to upsert into the index.
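Since upsert takes a list of dictionaries, large loads can be split into smaller lists and sent one at a time. The sketch below does that against any object exposing upsert(inputs). The batch size of 100 is an arbitrary illustration (this reference states no size limit), and the helper name and primary-key check are hypothetical additions.

```python
from typing import Any, Dict, Iterable, List


def upsert_in_batches(
    index: Any,
    rows: Iterable[Dict[str, Any]],
    primary_key: str,
    batch_size: int = 100,
) -> int:
    """Send rows to index.upsert in fixed-size batches; return the row count."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    all_rows: List[Dict[str, Any]] = list(rows)
    # Validate everything first so a bad row does not cause a partial load.
    for row in all_rows:
        if primary_key not in row:
            raise ValueError(f"row is missing primary key {primary_key!r}")
    for start in range(0, len(all_rows), batch_size):
        index.upsert(all_rows[start:start + batch_size])
    return len(all_rows)
```

Here index would be a VectorSearchIndex obtained from the client; any object with a compatible upsert method works the same way.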

wait_until_ready(verbose=False, timeout=datetime.timedelta(days=1))

Wait for the index to be online.

Parameters:

verbose (bool) – Whether to print status messages.
timeout (datetime.timedelta) – The maximum time to wait before raising an Exception.