databricks.vector_search package

class databricks.vector_search.client.VectorSearchClient(workspace_url=None, personal_access_token=None, service_principal_client_id=None, service_principal_client_secret=None, azure_tenant_id=None, azure_login_id=None, disable_notice=False)

Bases: object

A client for interacting with the Vector Search service.

This client provides methods for managing endpoints and indexes in the Vector Search service.
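As a minimal sketch of constructing the client, the helper below assembles constructor keyword arguments from an environment mapping. The helper name client_kwargs_from_env and the variable names DATABRICKS_HOST and DATABRICKS_TOKEN are assumptions for illustration; only workspace_url and personal_access_token come from the signature above. Arguments that are not set are omitted, so the constructor defaults (None) apply.

```python
from typing import Dict, Mapping


def client_kwargs_from_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Build keyword arguments for VectorSearchClient from an environment mapping.

    DATABRICKS_HOST and DATABRICKS_TOKEN are assumed variable names; they are
    not part of the reference above.
    """
    kwargs: Dict[str, str] = {}
    url = env.get("DATABRICKS_HOST")
    token = env.get("DATABRICKS_TOKEN")
    if url:
        # Drop a trailing slash so the URL has one canonical form.
        kwargs["workspace_url"] = url.rstrip("/")
    if token:
        kwargs["personal_access_token"] = token
    return kwargs


# Usage (needs the vector search package installed and valid credentials):
#   import os
#   from databricks.vector_search.client import VectorSearchClient
#   client = VectorSearchClient(**client_kwargs_from_env(os.environ))
```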

create_delta_sync_index(endpoint_name, index_name, primary_key, source_table_name, pipeline_type, embedding_dimension=None, embedding_vector_column=None, embedding_source_column=None, embedding_model_endpoint_name=None, sync_computed_embeddings=False)

Create a delta sync index.

Parameters:
  • endpoint_name (str) – The name of the endpoint.

  • index_name (str) – The name of the index.

  • primary_key (str) – The primary key of the index.

  • source_table_name (str) – The name of the source table.

  • pipeline_type (str) – The type of the pipeline. Must be CONTINUOUS or TRIGGERED.

  • embedding_dimension (int) – The dimension of the embedding vector.

  • embedding_vector_column (str) – The name of the embedding vector column.

  • embedding_source_column (str) – The name of the embedding source column.

  • embedding_model_endpoint_name (str) – The name of the embedding model endpoint.

  • sync_computed_embeddings (bool) – Whether to automatically sync the vector index contents and computed embeddings to a new UC table. The table is named ${index_name}_writeback_table.
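The following sketch shows one way to assemble arguments for create_delta_sync_index before calling it. The helper delta_sync_index_kwargs is hypothetical. It also assumes that an index uses either an embedding source column together with a model endpoint (embeddings computed for you) or a precomputed vector column together with its dimension; the reference above does not state this pairing explicitly, so treat the check as an assumption.

```python
from typing import Any, Dict, Optional

VALID_PIPELINE_TYPES = {"CONTINUOUS", "TRIGGERED"}


def delta_sync_index_kwargs(
    endpoint_name: str,
    index_name: str,
    primary_key: str,
    source_table_name: str,
    pipeline_type: str = "TRIGGERED",
    *,
    embedding_source_column: Optional[str] = None,
    embedding_model_endpoint_name: Optional[str] = None,
    embedding_vector_column: Optional[str] = None,
    embedding_dimension: Optional[int] = None,
) -> Dict[str, Any]:
    """Validate and collect keyword arguments for create_delta_sync_index."""
    if pipeline_type not in VALID_PIPELINE_TYPES:
        raise ValueError(f"pipeline_type must be one of {sorted(VALID_PIPELINE_TYPES)}")
    computed = embedding_source_column is not None
    precomputed = embedding_vector_column is not None
    # Assumption: exactly one of the two embedding modes is configured.
    if computed == precomputed:
        raise ValueError("set exactly one of embedding_source_column or embedding_vector_column")
    kwargs: Dict[str, Any] = {
        "endpoint_name": endpoint_name,
        "index_name": index_name,
        "primary_key": primary_key,
        "source_table_name": source_table_name,
        "pipeline_type": pipeline_type,
    }
    if computed:
        kwargs["embedding_source_column"] = embedding_source_column
        kwargs["embedding_model_endpoint_name"] = embedding_model_endpoint_name
    else:
        kwargs["embedding_vector_column"] = embedding_vector_column
        kwargs["embedding_dimension"] = embedding_dimension
    return kwargs


# Usage, given a VectorSearchClient instance named client (all names are placeholders):
#   index = client.create_delta_sync_index(**delta_sync_index_kwargs(
#       "my_endpoint", "catalog.schema.docs_index", "id", "catalog.schema.docs",
#       embedding_source_column="text",
#       embedding_model_endpoint_name="my_embedding_endpoint"))
```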

create_delta_sync_index_and_wait(endpoint_name, index_name, primary_key, source_table_name, pipeline_type, embedding_dimension=None, embedding_vector_column=None, embedding_source_column=None, embedding_model_endpoint_name=None, sync_computed_embeddings=False, verbose=False, timeout=datetime.timedelta(days=1))

Create a delta sync index and wait for it to be ready.

Parameters:
  • endpoint_name (str) – The name of the endpoint.

  • index_name (str) – The name of the index.

  • primary_key (str) – The primary key of the index.

  • source_table_name (str) – The name of the source table.

  • pipeline_type (str) – The type of the pipeline. Must be CONTINUOUS or TRIGGERED.

  • embedding_dimension (int) – The dimension of the embedding vector.

  • embedding_vector_column (str) – The name of the embedding vector column.

  • embedding_source_column (str) – The name of the embedding source column.

  • embedding_model_endpoint_name (str) – The name of the embedding model endpoint.

  • verbose (bool) – Whether to print status messages.

  • timeout (datetime.timedelta) – The maximum time to wait before an Exception is raised.

  • sync_computed_embeddings (bool) – Whether to automatically sync the vector index contents and computed embeddings to a new UC table. The table is named ${index_name}_writeback_table.

create_direct_access_index(endpoint_name, index_name, primary_key, embedding_dimension, embedding_vector_column, schema, embedding_model_endpoint_name=None)

Create a direct access index.

Parameters:
  • endpoint_name (str) – The name of the endpoint.

  • index_name (str) – The name of the index.

  • primary_key (str) – The primary key of the index.

  • embedding_dimension (int) – The dimension of the embedding vector.

  • embedding_vector_column (str) – The name of the embedding vector column.

  • schema (dict) – The schema of the index.

  • embedding_model_endpoint_name (str) – The name of the optional embedding model endpoint to use when querying.
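A sketch of creating a direct access index with a small pre-flight check. The wrapper create_direct_index and the schema in EXAMPLE_SCHEMA are hypothetical; in particular, the type strings ("int", "string", "array<float>") are an assumption about what the service accepts and are not stated in the reference above.

```python
from typing import Any, Dict

# Assumed schema format: column name mapped to a type string.
EXAMPLE_SCHEMA: Dict[str, str] = {
    "id": "int",
    "text": "string",
    "text_vector": "array<float>",
}


def create_direct_index(
    client: Any,
    endpoint_name: str,
    index_name: str,
    schema: Dict[str, str],
    primary_key: str,
    embedding_vector_column: str,
    embedding_dimension: int,
) -> Any:
    """Check that the key and vector columns exist in the schema, then create the index."""
    missing = [c for c in (primary_key, embedding_vector_column) if c not in schema]
    if missing:
        raise ValueError(f"columns missing from schema: {missing}")
    return client.create_direct_access_index(
        endpoint_name=endpoint_name,
        index_name=index_name,
        primary_key=primary_key,
        embedding_dimension=embedding_dimension,
        embedding_vector_column=embedding_vector_column,
        schema=schema,
    )
```

The client is passed in rather than constructed inside the function, so the same helper works with a real VectorSearchClient or with a stub in tests.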

create_endpoint(name, endpoint_type='STANDARD')

Create an endpoint.

Parameters:
  • name (str) – The name of the endpoint.

  • endpoint_type (str) – The type of the endpoint. Must be set to STANDARD.

create_endpoint_and_wait(name, endpoint_type='STANDARD', verbose=False, timeout=datetime.timedelta(seconds=3600))

Create an endpoint and wait for it to be online.

Parameters:
  • name (str) – The name of the endpoint.

  • endpoint_type (str) – The type of the endpoint. Must be set to STANDARD.

  • verbose (bool) – Whether to print status messages.

  • timeout (datetime.timedelta) – The maximum time to wait before an Exception is raised.
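To illustrate the timeout parameter, the sketch below wraps create_endpoint_and_wait with a shorter deadline expressed in minutes. The wrapper name create_endpoint_with_deadline and the 30-minute default are assumptions chosen for the example.

```python
import datetime
from typing import Any


def create_endpoint_with_deadline(client: Any, name: str, minutes: int = 30) -> Any:
    """Create a STANDARD endpoint and wait up to `minutes` for it to come online."""
    return client.create_endpoint_and_wait(
        name=name,
        endpoint_type="STANDARD",
        verbose=True,  # print status messages while waiting
        timeout=datetime.timedelta(minutes=minutes),
    )
```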

delete_endpoint(name)

Delete an endpoint.

Parameters:

name (str) – The name of the endpoint.

delete_index(endpoint_name=None, index_name=None)

Delete an index.

Parameters:
  • endpoint_name (Optional[str]) – The optional name of the endpoint.

  • index_name (str) – The name of the index.

get_endpoint(name)

Get an endpoint.

Parameters:

name (str) – The name of the endpoint.

get_index(endpoint_name=None, index_name=None)

Get an index.

Parameters:
  • endpoint_name (Optional[str]) – The optional name of the endpoint.

  • index_name (str) – The name of the index.
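A short sketch of looking up an index and reading its metadata. It assumes that get_index returns an index object exposing describe(), such as the VectorSearchIndex class documented in this package; the helper name describe_index is hypothetical.

```python
from typing import Any, Optional


def describe_index(client: Any, index_name: str, endpoint_name: Optional[str] = None) -> Any:
    """Fetch an index handle from the client and return its metadata."""
    index = client.get_index(endpoint_name=endpoint_name, index_name=index_name)
    return index.describe()
```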

list_endpoints()

List all endpoints.

list_indexes(name)

List all indexes for an endpoint.

Parameters:

name (str) – The name of the endpoint.

validate(disable_notice=False)

wait_for_endpoint(name, verbose=False, timeout=datetime.timedelta(seconds=3600))

Wait for an endpoint to be online.

Parameters:
  • name (str) – The name of the endpoint.

  • verbose (bool) – Whether to print status messages.

  • timeout (datetime.timedelta) – The maximum time to wait before an Exception is raised.

class databricks.vector_search.index.VectorSearchIndex(workspace_url, index_url, name, endpoint_name, personal_access_token=None, service_principal_client_id=None, service_principal_client_secret=None, azure_tenant_id=None, azure_login_id=None, use_user_passed_credentials=False)

Bases: object

VectorSearchIndex is a helper class that represents a Vector Search Index.

Do not instantiate this class directly; obtain instances through the VectorSearchClient class instead.

delete(primary_keys)

Delete data from the index.

Parameters:

primary_keys – List of primary keys to delete from the index.

describe()

Describe the index. This returns metadata about the index.

scan(num_results=10, last_primary_key=None)

Treating the index contents as sorted by primary key, return the next num_results entries after the primary key given by last_primary_key. If last_primary_key is None, return the first num_results entries.

Note that if the index is being updated during the scan, the results may not be consistent.

Parameters:
  • num_results – Number of results to return.

  • last_primary_key – The last primary key returned by the previous page; it is used as the exclusive starting primary key.
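The pagination described for scan can be sketched as a generator that keeps requesting pages until one comes back short. This assumes each scan response is a dict with a "data" list and a "last_primary_key" entry; that response shape is not stated in the reference above, and the helper name scan_all is hypothetical.

```python
from typing import Any, Iterator


def scan_all(index: Any, page_size: int = 100) -> Iterator[Any]:
    """Yield every entry in the index by following last_primary_key page by page."""
    last_key = None
    while True:
        page = index.scan(num_results=page_size, last_primary_key=last_key)
        rows = page.get("data") or []  # assumed response field
        yield from rows
        next_key = page.get("last_primary_key")  # assumed response field
        # A short page, a missing key, or a key that did not advance ends the scan.
        if len(rows) < page_size or next_key is None or next_key == last_key:
            return
        last_key = next_key
```

As noted for scan itself, entries may be missed or repeated if the index is updated while the loop runs.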

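similarity_search(columns, query_text=None, query_vector=None, filters=None, num_results=5, debug_level=1, score_threshold=None, query_type=None)

Perform a similarity search on the index. This returns the top K results that are most similar to the query.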

Parameters:
  • columns – List of column names to return in the results.

  • query_text – Query text to search for.

  • query_vector – Query vector to search for.

  • filters – Filters to apply to the query.

  • num_results – Number of results to return.

  • debug_level – Debug level to use for the query.

  • score_threshold – Score threshold to use for the query.

  • query_type – The type of this query. Supported values are "ANN" and "HYBRID".
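A sketch of running the similarity search described above (the method is named similarity_search in the library) and turning the response into one dict per result. The response layout used here, a "manifest" with column descriptors and a "result" with a "data_array" of rows, is an assumption about the service response and is not documented in this reference; the helper name search_rows is hypothetical.

```python
from typing import Any, Dict, List, Optional


def search_rows(
    index: Any,
    query_text: str,
    columns: List[str],
    num_results: int = 5,
    filters: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Run a text query and return each hit as a {column name: value} dict."""
    response = index.similarity_search(
        columns=columns,
        query_text=query_text,
        filters=filters,
        num_results=num_results,
    )
    # Assumed response shape: column names in manifest, rows in result.data_array.
    names = [col["name"] for col in response.get("manifest", {}).get("columns", [])]
    rows = response.get("result", {}).get("data_array", [])
    return [dict(zip(names, row)) for row in rows]
```

To search by vector instead of text, the same pattern applies with query_vector passed in place of query_text.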

sync()

Sync the index with the source Delta table. This only works for managed delta sync indexes with pipeline_type="TRIGGERED".

upsert(inputs)

Upsert data into the index.

Parameters:

inputs – List of dictionaries to upsert into the index.
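As an illustration of the inputs format, the sketch below sends a list of dictionaries to upsert in fixed-size batches. Batching is a choice made for this example, not something the reference requires; the helper name upsert_in_batches, the batch size, and the column names in the usage comment are hypothetical.

```python
from typing import Any, Dict, List


def upsert_in_batches(index: Any, rows: List[Dict[str, Any]], batch_size: int = 100) -> int:
    """Upsert rows in chunks of batch_size and return how many rows were sent."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    sent = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        index.upsert(batch)
        sent += len(batch)
    return sent


# Usage, assuming an index whose schema has id, text, and text_vector columns:
#   upsert_in_batches(index, [
#       {"id": 1, "text": "first document", "text_vector": [0.1, 0.2, 0.3]},
#       {"id": 2, "text": "second document", "text_vector": [0.4, 0.5, 0.6]},
#   ])
```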

wait_until_ready(verbose=False, timeout=datetime.timedelta(days=1))

Wait for the index to be online.

Parameters:
  • verbose (bool) – Whether to print status messages.

  • timeout (datetime.timedelta) – The maximum time to wait before an Exception is raised.
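A sketch combining sync and wait_until_ready: it triggers a sync and then blocks until the index reports that it is online, bounded by a timeout in minutes. Whether wait_until_ready also waits for the triggered sync to finish is not stated in the reference above, so treat this as an assumption; the helper name sync_and_wait is hypothetical.

```python
import datetime
from typing import Any


def sync_and_wait(index: Any, minutes: int = 60) -> None:
    """Trigger a sync, then wait up to `minutes` for the index to be online."""
    index.sync()
    index.wait_until_ready(
        verbose=True,  # print status messages while waiting
        timeout=datetime.timedelta(minutes=minutes),
    )
```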