Agent Evaluation
Databricks Agent Evaluation Python SDK.
For more details, see Databricks Agent Evaluation.
- databricks.agents.evals.generate_evals_df(docs: DataFrame | pyspark.sql.DataFrame, *, num_evals: int, agent_description: str | None = None, question_guidelines: str | None = None, guidelines: str | None = None) DataFrame
Generate an evaluation dataset with synthetic requests and synthetic expected_facts, given a set of documents.
The generated evaluation set can be used with Databricks Agent Evaluation.
For more details, see the Synthesize evaluation set guide.
- Parameters:
docs – A pandas or Spark DataFrame with a text column content and a doc_uri column.
num_evals – The total number of evaluations to generate across all the documents. The function tries to distribute the generated evals over all of your documents, taking their size into account. If num_evals is less than the number of documents, not all documents will be covered in the evaluation set.
agent_description – Optional task description of the agent used to guide the generation.
question_guidelines – Optional guidelines to guide the question generation. The string can be formatted in markdown and may include sections like:
- User Personas: types of users the agent should support
- Example Questions: sample questions to guide generation
- Additional Guidelines: extra rules or requirements
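A minimal sketch of calling generate_evals_df on a small in-memory corpus; the document contents, URIs, and guideline text below are illustrative placeholders:

    import pandas as pd
    from databricks.agents.evals import generate_evals_df

    # Each row needs a content column (document text) and a doc_uri column.
    docs = pd.DataFrame([
        {"content": "Retrieval-augmented generation (RAG) grounds LLM answers in retrieved documents.",
         "doc_uri": "docs/rag.md"},
        {"content": "A vector search index stores embeddings for fast similarity lookup.",
         "doc_uri": "docs/vector-search.md"},
    ])

    evals = generate_evals_df(
        docs,
        num_evals=10,
        agent_description="A chatbot that answers questions about Databricks documentation.",
        question_guidelines="- User Personas: data engineers new to Databricks",
    )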
- databricks.agents.evals.estimate_synthetic_num_evals(docs: DataFrame | pyspark.sql.DataFrame, *, eval_per_x_tokens: int) int
Estimate the number of evals to synthetically generate for full coverage over the documents.
- Parameters:
docs – A pandas or Spark DataFrame with a text column content.
eval_per_x_tokens – Generate 1 eval for every x tokens to control the coverage level. 500 tokens is ~1 page of text.
- Returns:
The estimated number of evaluations to generate.
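The two functions are commonly chained: estimate a coverage-based budget first, then generate that many evals. A sketch, reusing the docs DataFrame from the previous example:

    from databricks.agents.evals import estimate_synthetic_num_evals, generate_evals_df

    # Roughly one eval per page of text (500 tokens).
    num_evals = estimate_synthetic_num_evals(docs, eval_per_x_tokens=500)
    evals = generate_evals_df(docs, num_evals=num_evals)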
- databricks.agents.evals.metric(eval_fn=None, *, name: str | None = None)
Note
Experimental: This function may change or be removed in a future release without warning.
Create a custom agent metric from a user-defined eval function.
Can be used as a decorator on the eval_fn.
The eval_fn should have the following signature:
    def eval_fn(
        *,
        request_id: str,
        request: Union[ChatCompletionRequest, str],
        response: Optional[Any],
        retrieved_context: Optional[List[Dict[str, str]]],
        expected_response: Optional[Any],
        expected_facts: Optional[List[str]],
        expected_retrieved_context: Optional[List[Dict[str, str]]],
        custom_expected: Optional[Dict[str, Any]],
        custom_inputs: Optional[Dict[str, Any]],
        custom_outputs: Optional[Dict[str, Any]],
        trace: Optional[mlflow.entities.Trace],
        tool_calls: Optional[List[ToolCallInvocation]],
        **kwargs,
    ) -> Optional[Union[int, float, bool]]:
        """
        Args:
            request_id: The ID of the request.
            request: The agent's input from your input eval dataset.
            response: The agent's raw output, passed through as-is.
            retrieved_context: Retrieved context, from your input eval dataset or extracted
                from the trace; if you have custom extraction logic, use the `trace` field.
            expected_response: The expected response from your input eval dataset.
            expected_facts: The expected facts from your input eval dataset.
            expected_retrieved_context: The expected retrieved context from your input eval dataset.
            custom_expected: Custom expected information from your input eval dataset.
            custom_inputs: Custom inputs from your input eval dataset.
            custom_outputs: Custom outputs from the agent's response.
            trace: The trace object. You can use this to extract additional information from the trace.
            tool_calls: List of tool call invocations, from your agent's response (ChatAgent only)
                or from the trace. The trace is preferred as it contains additional information
                such as the available tools and the span from which each tool was called.
        """
eval_fn will always be called with named arguments. You only need to declare the arguments you need. If kwargs is declared, all available arguments will be passed.
The return value of the function should be either a number or a boolean. It will be used as the metric value. Return None if the metric cannot be computed.
- Parameters:
eval_fn – The user-defined eval function.
name – The name of the metric. If not provided, the function name will be used.
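A minimal sketch of a custom metric defined with the decorator; the 100-word threshold is an arbitrary illustration:

    from databricks.agents.evals import metric

    @metric(name="response_is_concise")
    def response_is_concise(*, response, **kwargs):
        # Boolean metric: passes when the agent's answer is at most 100 words.
        if response is None:
            return None  # Metric cannot be computed without a response.
        text = response if isinstance(response, str) else str(response)
        return len(text.split()) <= 100

Custom metrics defined this way are typically passed to mlflow.evaluate through its extra_metrics argument when running Agent Evaluation (model_type="databricks-agent"); see the Agent Evaluation guide for the full flow.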
- class databricks.agents.evals.ToolCallInvocation(tool_name: str, tool_call_args: Dict[str, Any], tool_call_id: str | None = None, tool_call_result: Dict[str, Any] | None = None, raw_span: mlflow.entities.span.Span | None = None, available_tools: List[Dict[str, Any]] | None = None)
Bases:
object
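ToolCallInvocation objects show up in the tool_calls argument of custom metrics. A sketch that checks whether a hypothetical "vector_search" tool was invoked; the tool name is a placeholder:

    from databricks.agents.evals import metric

    @metric(name="called_retriever")
    def called_retriever(*, tool_calls, **kwargs):
        # Passes when the agent called the (hypothetical) "vector_search" tool at least once.
        if tool_calls is None:
            return None
        return any(tc.tool_name == "vector_search" for tc in tool_calls)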
Judges
- databricks.agents.evals.judges.chunk_relevance(request: str | Dict[str, Any], retrieved_context: List[Dict[str, Any]], assessment_name: str | None = None) List[Assessment]
The chunk-relevance-precision LLM judge determines whether the chunks returned by the retriever are relevant to the input request. Precision is calculated as the number of relevant chunks returned divided by the total number of chunks returned. For example, if the retriever returns four chunks, and the LLM judge determines that three of the four returned documents are relevant to the request, then llm_judged/chunk_relevance/precision is 0.75.
- Parameters:
request – Input to the application to evaluate, user’s question or query. For example, “What is RAG?”.
retrieved_context –
Retrieval results generated by the retriever in the application being evaluated. It should be a list of dictionaries with the following keys:
doc_uri (Optional): The doc_uri of the context.
content: The content of the context.
assessment_name – Optional override for the assessment name. If present, the output Assessment will use this as the name instead of “chunk_relevance”
- Required input arguments:
request, retrieved_context
- Returns:
Chunk relevance assessment result for each of the chunks in the given input.
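The judges can also be called directly for ad-hoc checks. A sketch for chunk_relevance; the chunk contents are placeholders:

    from databricks.agents.evals import judges

    assessments = judges.chunk_relevance(
        request="What is RAG?",
        retrieved_context=[
            {"doc_uri": "docs/rag.md", "content": "RAG stands for retrieval-augmented generation ..."},
            {"doc_uri": "docs/billing.md", "content": "Billing is based on the compute you consume ..."},
        ],
    )
    # One Assessment per chunk, in the same order as retrieved_context.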
- databricks.agents.evals.judges.context_sufficiency(request: str | Dict[str, Any], retrieved_context: List[Dict[str, Any]], expected_response: str | None = None, expected_facts: List[str] | None = None, assessment_name: str | None = None) Assessment
The context_sufficiency LLM judge determines whether the retriever has retrieved documents that are sufficient to produce the expected response or expected facts.
- Parameters:
request – Input to the application to evaluate, user’s question or query. For example, “What is RAG?”.
expected_response – Ground-truth (correct) answer for the input request.
retrieved_context –
Retrieval results generated by the retriever in the application being evaluated. It should be a list of dictionaries with the following keys:
doc_uri (Optional): The doc_uri of the context.
content: The content of the context.
expected_facts – Array of strings containing facts expected in the correct response for the input request.
assessment_name – Optional override for the assessment name. If present, the output Assessment will use this as the name instead of “context_sufficiency”
- Required input arguments:
request, retrieved_context, oneof(expected_response, expected_facts)
- Returns:
Context sufficiency assessment result for the given input.
- databricks.agents.evals.judges.correctness(request: str | Dict[str, Any], response: str | Dict[str, Any], expected_response: str | None = None, expected_facts: List[str] | None = None, assessment_name: str | None = None) Assessment
The correctness LLM judge gives a binary evaluation and written rationale on whether the response generated by the agent is factually accurate and semantically similar to the provided expected response or expected facts.
- Parameters:
request – Input to the application to evaluate, user’s question or query. For example, “What is RAG?”.
response – Response generated by the application being evaluated.
expected_response – Ground-truth (correct) answer for the input request.
expected_facts – Array of strings containing facts expected in the correct response for the input request.
assessment_name – Optional override for the assessment name. If present, the output Assessment will use this as the name instead of “correctness”
- Required input arguments:
request, response, oneof(expected_response, expected_facts)
- Returns:
Correctness assessment result for the given input.
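A sketch of calling the correctness judge with expected_facts (expected_response works the same way; only one of the two is required):

    from databricks.agents.evals import judges

    assessment = judges.correctness(
        request="What is RAG?",
        response="RAG stands for retrieval-augmented generation; it grounds answers in retrieved documents.",
        expected_facts=[
            "RAG stands for retrieval-augmented generation",
            "RAG grounds responses in retrieved documents",
        ],
    )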
- databricks.agents.evals.judges.groundedness(request: str | Dict[str, Any], response: str | Dict[str, Any], retrieved_context: List[Dict[str, Any]], assessment_name: str | None = None) Assessment
The groundedness LLM judge returns a binary evaluation and written rationale on whether the generated response is factually consistent with the retrieved context.
- Parameters:
request – Input to the application to evaluate, user’s question or query. For example, “What is RAG?”.
response – Response generated by the application being evaluated.
retrieved_context –
Retrieval results generated by the retriever in the application being evaluated. It should be a list of dictionaries with the following keys:
doc_uri (Optional): The doc_uri of the context.
content: The content of the context.
assessment_name – Optional override for the assessment name. If present, the output Assessment will use this as the name instead of “groundedness”
- Required input arguments:
request, response, retrieved_context
- Returns:
Groundedness assessment result for the given input.
- databricks.agents.evals.judges.guideline_adherence(request: str | Dict[str, Any], response: str | Dict[str, Any], guidelines: List[str] | Dict[str, List[str]], assessment_name: str | None = None) Assessment | List[Assessment]
The guideline_adherence LLM judge determines whether the response to the request adheres to the provided guidelines.
- Parameters:
request – Input to the application to evaluate, user’s question or query. For example, “What is RAG?”.
response – Response generated by the application being evaluated.
guidelines – One of the following:
- An array of strings containing the guidelines that the response should adhere to.
- A mapping from string (named guidelines) to an array of strings containing the guidelines the response should adhere to.
assessment_name – Optional override for the assessment name. If present, the output Assessment will use this as the name instead of “guideline_adherence”
- Required input arguments:
request, response, guidelines
- Returns:
Guideline adherence assessment(s) result for the given input. Returns a list when named guidelines are provided.
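A sketch using named guidelines, which return one Assessment per name; the guideline text is illustrative:

    from databricks.agents.evals import judges

    assessments = judges.guideline_adherence(
        request="What is RAG?",
        response="RAG stands for retrieval-augmented generation.",
        guidelines={
            "tone": ["The response must be professional and concise."],
            "language": ["The response must be in English."],
        },
    )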
- databricks.agents.evals.judges.relevance_to_query(request: str | Dict[str, Any], response: str | Dict[str, Any], assessment_name: str | None = None) Assessment
The relevance_to_query LLM judge determines whether the response is relevant to the input request.
- Parameters:
request – Input to the application to evaluate, user’s question or query. For example, “What is RAG?”.
response – Response generated by the application being evaluated.
assessment_name – Optional override for the assessment name. If present, the output Assessment will use this as the name instead of “relevance_to_query”
- Required input arguments:
request, response
- Returns:
Relevance to query assessment result for the given input.
- databricks.agents.evals.judges.safety(request: str | Dict[str, Any], response: str | Dict[str, Any], assessment_name: str | None = None) Assessment
The safety LLM judge returns a binary rating and a written rationale on whether the generated response has harmful or toxic content.
- Parameters:
request – Input to the application to evaluate, user’s question or query. For example, “What is RAG?”.
response – Response generated by the application being evaluated.
assessment_name – Optional override for the assessment name. If present, the output Assessment will use this as the name instead of “safety”
- Required input arguments:
request, response
- Returns:
Safety assessment result for the given input.
Datasets
Databricks Agent Datasets Python SDK.
For more details, see Databricks Agent Evaluation: https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html
- databricks.agents.datasets.create_dataset(uc_table_name: str, experiment_id: str | list[str] | None = None) Dataset
Create a dataset with the given name and associate it with the given experiment.
- Parameters:
uc_table_name – The UC table location of the dataset.
experiment_id – The ID of the experiment to associate the dataset with. If not provided, the current experiment is inferred from the environment.
- class databricks.agents.datasets.Dataset(dataset_id: str, digest: str | None = None, name: str | None = None, schema: str | None = None, profile: str | None = None, source: str | None = None, source_type: str | None = None, create_time: str | None = None, created_by: str | None = None, last_update_time: str | None = None, last_updated_by: str | None = None)
Bases:
Dataset
A dataset for storing evaluation records (inputs and expectations).
- digest: str | None = None
String digest (hash) of the dataset provided by the caller that uniquely identifies the dataset.
- classmethod schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) SchemaF[A]
The schema of the dataset. E.g., MLflow ColSpec JSON for a dataframe, MLflow TensorSpec JSON for an ndarray, or another schema format.
- source_type: str | None = None
The type of the dataset source, e.g. “databricks-uc-table”, “DBFS”, “S3”, …
- insert(records: list[Dict] | DataFrame | pyspark.sql.DataFrame) Dataset
Insert records into the dataset. Records that share the same inputs will be merged into a single record.
- Parameters:
records – A list of dicts, a pandas DataFrame, or a Spark DataFrame. For the input schema, see https://docs.databricks.com/en/generative-ai/agent-evaluation/evaluation-schema.html
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A
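A sketch of creating a managed dataset and inserting one record. The three-level table name is a placeholder, and the inputs/expectations keys below are an assumption based on the evaluation record schema linked above; verify against that page:

    from databricks.agents import datasets

    ds = datasets.create_dataset("main.agents.eval_dataset")

    # Records sharing the same inputs are merged into a single record.
    ds = ds.insert([
        {
            "inputs": {"messages": [{"role": "user", "content": "What is RAG?"}]},
            "expectations": {"expected_facts": ["RAG stands for retrieval-augmented generation"]},
        }
    ])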
Review App
Databricks Agent Review App Python SDK.
For more details, see Databricks Agent Evaluation: https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html
- class databricks.agents.review_app.Agent(agent_name: str, model_serving_endpoint: str)
Bases:
object
The agent configuration, used for generating responses in the review app.
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A
- classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A
- databricks.agents.review_app.get_review_app(experiment_id: str | None = None) ReviewApp
Gets or creates (if it doesn’t exist) the review app for the given experiment ID.
- Parameters:
experiment_id – Optional. The experiment ID for which to get the review app. If not provided, the experiment ID is inferred from the current active environment.
- class databricks.agents.review_app.LabelingSession(name: str, assigned_users: list[str], agent: str | None, label_schemas: list[str], labeling_session_id: str, mlflow_run_id: str, review_app_id: str, experiment_id: str, url: str)
Bases:
object
A session for labeling items in the review app.
- add_dataset(dataset_name: str, record_ids: list[str] | None = None) LabelingSession
Add a dataset to the labeling session.
- Parameters:
dataset_name – The name of the dataset.
record_ids – Optional. The individual record IDs to be added to the session. If not provided, all records in the dataset will be added.
- add_traces(traces: Iterable[Trace] | Iterable[str] | DataFrame) LabelingSession
Add traces to the labeling session.
- Parameters:
traces – Can be either: (a) a pandas DataFrame with a ‘trace’ column, where each value is an mlflow.entities.Trace object or its JSON string representation; (b) an iterable of mlflow.entities.Trace objects; or (c) an iterable of JSON string representations of mlflow.entities.Trace objects.
- sync_expectations(to_dataset: str) None
Sync the expectations from the labeling session to a dataset.
- set_assigned_users(assigned_users: list[str]) LabelingSession
Set the assigned users for the labeling session.
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A
- classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A
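Labeling sessions are created from the review app (see the ReviewApp class below). A sketch that adds recent traces for review and syncs the collected expectations back to a dataset; the user, dataset, and schema names are placeholders, and the mlflow.search_traces call assumes it returns a pandas DataFrame with a 'trace' column:

    import mlflow
    from databricks.agents import review_app

    app = review_app.get_review_app()
    session = app.create_labeling_session(
        "rag-review-round-1",
        assigned_users=["reviewer@example.com"],
        label_schemas=["expected_facts"],  # built-in schema name; see label_schemas.EXPECTED_FACTS below
    )

    traces = mlflow.search_traces(max_results=10)
    session.add_traces(traces)

    # After reviewers finish, persist their expectations into a dataset.
    session.sync_expectations(to_dataset="main.agents.eval_dataset")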
- class databricks.agents.review_app.ReviewApp(review_app_id: str, experiment_id: str, url: str, agents: list[~databricks.rag_eval.review_app.entities.Agent] = <factory>, label_schemas: list[~databricks.rag_eval.review_app.entities.LabelSchema] = <factory>)
Bases:
object
A review app is used to collect feedback from stakeholders for a given experiment.
- agents
The agents to be used to generate responses.
- label_schemas: list[LabelSchema]
The label schemas to be used in the review app.
- add_agent(*, agent_name: str, model_serving_endpoint: str, overwrite: bool = False) ReviewApp
Add an agent to the review app to be used to generate responses.
- create_label_schema(name: str, *, type: Literal['feedback', 'expectation'], title: str, input: InputCategorical | InputCategoricalList | InputText | InputTextList | InputNumeric, instruction: str | None = None, enable_comment: bool = False, overwrite: bool = False) LabelSchema
Create a new label schema for the review app.
A label schema defines the type of input that stakeholders will provide when labeling items in the review app.
- Parameters:
name – The name of the label schema. Must be unique across the review app.
type – The type of the label schema. Either “feedback” or “expectation”.
title – The title of the label schema shown to stakeholders.
input – The input type of the label schema.
instruction – Optional. The instruction shown to stakeholders.
enable_comment – Optional. Whether to enable comments for the label schema.
overwrite – Optional. Whether to overwrite the existing label schema with the same name.
- create_labeling_session(name: str, *, assigned_users: list[str] = [], agent: str | None = None, label_schemas: list[str] = []) LabelingSession
Create a new labeling session in the review app.
- Parameters:
name – The name of the labeling session.
assigned_users – The users that will be assigned to label items in the session.
agent – The agent to be used to generate responses for the items in the session.
label_schemas – The label schemas to be used in the session.
- get_labeling_sessions() list[LabelingSession]
Get all labeling sessions in the review app.
- delete_labeling_session(labeling_session: LabelingSession) ReviewApp
Delete a labeling session from the review app.
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A
- classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A
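A sketch of configuring a review app end to end: register an agent endpoint, define a feedback schema, and open a labeling session. The endpoint, schema, and user names are placeholders:

    from databricks.agents import review_app
    from databricks.agents.review_app import label_schemas

    app = review_app.get_review_app()

    # Register the serving endpoint that generates responses inside the review app.
    app = app.add_agent(
        agent_name="support-agent",
        model_serving_endpoint="agents_support-agent",
    )

    # A single-select feedback schema shown to reviewers.
    app.create_label_schema(
        "answer_quality",
        type="feedback",
        title="How good is the answer?",
        input=label_schemas.InputCategorical(options=["Great", "Okay", "Poor"]),
        enable_comment=True,
    )

    session = app.create_labeling_session(
        "round-1",
        assigned_users=["reviewer@example.com"],
        agent="support-agent",
        label_schemas=["answer_quality"],
    )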
Label Schemas
Label schemas for configuring the Review App.
- class databricks.agents.review_app.label_schemas.LabelSchemaType(value)
Bases:
StrEnum
Type of label schema.
- FEEDBACK = 'feedback'
- EXPECTATION = 'expectation'
- class databricks.agents.review_app.label_schemas.LabelSchema(name: str, type: Literal['feedback', 'expectation'], title: str, input: InputCategorical | InputCategoricalList | InputText | InputTextList | InputNumeric, instruction: str | None = None, enable_comment: bool = False)
Bases:
object
A label schema for collecting input from stakeholders.
- input: InputCategorical | InputCategoricalList | InputText | InputTextList | InputNumeric
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A
- classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A
- class databricks.agents.review_app.label_schemas.InputCategorical(options: list[str])
Bases:
InputType
A single-select dropdown for collecting assessments from stakeholders.
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A
- classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A
- class databricks.agents.review_app.label_schemas.InputCategoricalList(options: list[str])
Bases:
InputType
A multi-select dropdown for collecting assessments from stakeholders.
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A
- classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A
- class databricks.agents.review_app.label_schemas.InputNumeric(min_value: float | None = None, max_value: float | None = None)
Bases:
InputType
A numeric input for collecting assessments from stakeholders.
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A
- classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A
- class databricks.agents.review_app.label_schemas.InputText(max_length: int | None = None)
Bases:
InputType
A free-form text box for collecting assessments from stakeholders.
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A
- classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A
- class databricks.agents.review_app.label_schemas.InputTextList(max_length_each: int | None = None, max_count: int | None = None)
Bases:
InputType
Like InputText, but allows multiple entries.
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A
- classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A
- databricks.agents.review_app.label_schemas.EXPECTED_FACTS = "expected_facts"
- databricks.agents.review_app.label_schemas.GUIDELINES = "guidelines"
- databricks.agents.review_app.label_schemas.EXPECTED_RESPONSE = "expected_response"
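A short sketch of an expectation schema using InputTextList, plus the module-level constants above, which hold the names of built-in label schemas; it assumes those names can be passed to create_labeling_session alongside custom schema names:

    from databricks.agents import review_app
    from databricks.agents.review_app import label_schemas

    app = review_app.get_review_app()

    # An expectation schema that collects a list of short facts from reviewers.
    app.create_label_schema(
        "missing_facts",
        type="expectation",
        title="List any facts the answer should have included",
        input=label_schemas.InputTextList(max_length_each=200, max_count=5),
    )

    # Built-in schemas referenced by name next to the custom one.
    app.create_labeling_session(
        "ground-truth-collection",
        label_schemas=["missing_facts", label_schemas.EXPECTED_FACTS, label_schemas.GUIDELINES],
    )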