Agent Evaluation

Databricks Agent Evaluation Python SDK.

For more details, see Databricks Agent Evaluation.

databricks.agents.evals.generate_evals_df(docs: DataFrame | pyspark.sql.DataFrame, *, num_evals: int, agent_description: str | None = None, question_guidelines: str | None = None, guidelines: str | None = None) DataFrame

Generate an evaluation dataset with synthetic requests and synthetic expected_facts, given a set of documents.

The generated evaluation set can be used with Databricks Agent Evaluation.

For more details, see the Synthesize evaluation set guide.

Parameters:
  • docs – A pandas/Spark DataFrame with a text column named content and a doc_uri column.

  • num_evals – The total number of evaluations to generate across all the documents. The function tries to distribute generated evals over all of your documents, taking into consideration their size. If num_evals is less than the number of documents, not all documents will be covered in the evaluation set.

  • agent_description – Optional task description of the agent used to guide the generation.

  • question_guidelines – Optional guidelines to guide the question generation. The string can be formatted in markdown and may include sections like:

    • User Personas: Types of users the agent should support

    • Example Questions: Sample questions to guide generation

    • Additional Guidelines: Extra rules or requirements
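The inputs described above can be assembled as in the following sketch. The document records, the agent description, and the guideline text are hypothetical placeholders; only the column names, the three section names, and the keyword arguments come from the signature above. The call itself is wrapped in a helper (a name introduced here for illustration) that imports pandas and databricks-agents lazily, so it only runs where both are installed.

```python
from typing import Any, Dict, List

# Hypothetical documents; each record carries the two documented columns.
DOC_RECORDS: List[Dict[str, str]] = [
    {"content": "RAG combines a retriever with a generator.", "doc_uri": "docs/rag.md"},
    {"content": "Vector indexes store embeddings for similarity search.", "doc_uri": "docs/index.md"},
]

# Hypothetical task description used to guide generation.
AGENT_DESCRIPTION = "A chatbot that answers questions about an internal knowledge base."

# Markdown guidelines using the three section names from the parameter description.
QUESTION_GUIDELINES = """\
# User Personas
- A new engineer onboarding to the platform

# Example Questions
- What is RAG?

# Additional Guidelines
- Questions should be short and conversational
"""


def synthesize_eval_set(num_evals: int) -> Any:
    """Build the docs DataFrame and call generate_evals_df.

    Requires pandas and databricks-agents; both are imported lazily so the
    constants above can be inspected without either package.
    """
    import pandas as pd
    from databricks.agents.evals import generate_evals_df

    docs = pd.DataFrame.from_records(DOC_RECORDS)
    return generate_evals_df(
        docs,
        num_evals=num_evals,
        agent_description=AGENT_DESCRIPTION,
        question_guidelines=QUESTION_GUIDELINES,
    )
```

If num_evals is smaller than the number of records, the description of num_evals above implies that some documents will not be covered.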

databricks.agents.evals.estimate_synthetic_num_evals(docs: DataFrame | pyspark.sql.DataFrame, *, eval_per_x_tokens: int) int

Estimate the number of evals to synthetically generate for full coverage over the documents.

Parameters:
  • docs – A pandas/Spark DataFrame with a text column named content.

  • eval_per_x_tokens – Generate 1 eval for every x tokens to control the coverage level. 500 tokens is ~1 page of text.

Returns:

The estimated number of evaluations to generate.
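To make the "1 eval for every x tokens" rule concrete, here is a rough sketch of the arithmetic. It is not the library's implementation: the whitespace-based token count and the round-up behaviour are assumptions made for illustration, and both helper names are hypothetical.

```python
import math
from typing import Iterable


def approx_token_count(text: str) -> int:
    """Assumption: approximate tokens by whitespace-separated words."""
    return len(text.split())


def estimate_num_evals(contents: Iterable[str], eval_per_x_tokens: int) -> int:
    """Estimate evals as total tokens divided by eval_per_x_tokens, rounded up (assumed)."""
    if eval_per_x_tokens <= 0:
        raise ValueError("eval_per_x_tokens must be positive")
    total_tokens = sum(approx_token_count(text) for text in contents)
    return math.ceil(total_tokens / eval_per_x_tokens)


# Two hypothetical documents of 1200 and 300 words (1500 tokens under the assumption above).
contents = ["word " * 1200, "word " * 300]
estimated = estimate_num_evals(contents, eval_per_x_tokens=500)  # 1500 / 500 = 3
```

With eval_per_x_tokens=500, this corresponds to roughly one eval per page of text, following the rule of thumb in the parameter description.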

databricks.agents.evals.metric(eval_fn=None, *, name: str | None = None)

Note

Experimental: This function may change or be removed in a future release without warning.

Create a custom agent metric from a user-defined eval function.

Can be used as a decorator on the eval_fn.

The eval_fn should have the following signature:

def eval_fn(
    *,
    request_id: str,
    request: Union[ChatCompletionRequest, str],
    response: Optional[Any],
    retrieved_context: Optional[List[Dict[str, str]]],
    expected_response: Optional[Any],
    expected_facts: Optional[List[str]],
    expected_retrieved_context: Optional[List[Dict[str, str]]],
    custom_expected: Optional[Dict[str, Any]],
    custom_inputs: Optional[Dict[str, Any]],
    custom_outputs: Optional[Dict[str, Any]],
    trace: Optional[mlflow.entities.Trace],
    tool_calls: Optional[List[ToolCallInvocation]],
    **kwargs,
) -> Optional[Union[int, float, bool]]:
    """
    Args:
        request_id: The ID of the request.
        request: The agent's input from your input eval dataset.
        response: The agent's raw output. Whatever we get from the agent, we will pass it here as is.
        retrieved_context: Retrieved context, taken either from your input eval dataset or from the trace.
                           We will try to extract the retrieval context from the trace;
                           if you have custom extraction logic, use the `trace` field.
        expected_response: The expected response from your input eval dataset.
        expected_facts: The expected facts from your input eval dataset.
        expected_retrieved_context: The expected retrieved context from your input eval dataset.
        custom_expected: Custom expected information from your input eval dataset.
        custom_inputs: Custom inputs from your input eval dataset.
        custom_outputs: Custom outputs from the agent's response.
        trace: The trace object. You can use this to extract additional information from the trace.
        tool_calls: List of tool call invocations, can be from your agent's response (ChatAgent only)
                    or from the trace. We will prioritize extracting from the trace as it contains
                    additional information such as available tools and from which span the tool was called.
    """

eval_fn will always be called with named arguments. You only need to declare the arguments you need. If **kwargs is declared, all available arguments will be passed.

The return value of the function should be either a number or a boolean. It will be used as the metric value. Return None if the metric cannot be computed.

Parameters:
  • eval_fn – The user-defined eval function.

  • name – The name of the metric. If not provided, the function name will be used.
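The calling convention described above (named arguments only, with just the declared ones passed unless **kwargs is present) can be illustrated with a small sketch. The eval function and the dispatcher below are hypothetical: the dispatcher is not the library's code, only one way such behaviour could be implemented, and the function is left undecorated so the example runs without databricks-agents installed.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional


def response_mentions_all_facts(
    *, response: Optional[Any], expected_facts: Optional[List[str]]
) -> Optional[bool]:
    """Hypothetical eval_fn: True when every expected fact appears in the response text."""
    if response is None or not expected_facts:
        return None  # The metric cannot be computed.
    text = str(response).lower()
    return all(fact.lower() in text for fact in expected_facts)


def call_with_declared_args(fn: Callable[..., Any], available: Dict[str, Any]) -> Any:
    """Illustrative dispatcher: pass only declared keyword arguments, or all if **kwargs exists."""
    params = inspect.signature(fn).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return fn(**available)
    return fn(**{name: value for name, value in available.items() if name in params})


# A hypothetical evaluation row containing more fields than the eval_fn declares.
row = {
    "request_id": "r1",
    "request": "What is RAG?",
    "response": "RAG combines retrieval with generation.",
    "expected_facts": ["retrieval", "generation"],
    "trace": None,
}
result = call_with_declared_args(response_mentions_all_facts, row)
```

In real use, the same function would presumably be wrapped with the metric decorator documented above; the return type follows the stated contract of a number, a boolean, or None.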

class databricks.agents.evals.ToolCallInvocation(tool_name: str, tool_call_args: Dict[str, Any], tool_call_id: str | None = None, tool_call_result: Dict[str, Any] | None = None, raw_tool_span: mlflow.entities.span.Span | None = None)

Bases: object

tool_name: str
tool_call_args: Dict[str, Any]
tool_call_id: str | None = None
tool_call_result: Dict[str, Any] | None = None
raw_tool_span: Span | None = None
to_dict() Dict[str, Any]
classmethod from_dict(tool_calls: List[Dict[str, Any]] | Dict[str, Any] | None) ToolCallInvocation | List[ToolCallInvocation] | None

Judges

databricks.agents.evals.judges.chunk_relevance(request: str | Dict[str, Any], retrieved_context: List[Dict[str, Any]]) List[Assessment]

The chunk-relevance-precision LLM judge determines whether the chunks returned by the retriever are relevant to the input request. Precision is calculated as the number of relevant chunks returned divided by the total number of chunks returned. For example, if the retriever returns four chunks, and the LLM judge determines that three of the four returned chunks are relevant to the request, then llm_judged/chunk_relevance/precision is 0.75.

Parameters:
  • request – Input to the application being evaluated: the user’s question or query. For example, “What is RAG?”.

  • retrieved_context

    Retrieval results generated by the retriever in the application being evaluated. It should be a list of dictionaries with the following keys:

    • doc_uri (Optional): The doc_uri of the context.

    • content: The content of the context.

Required input arguments:

request, retrieved_context

Returns:

Chunk relevance assessment result for each of the chunks in the given input.
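The precision figure described for this judge is a simple ratio. The sketch below reproduces the worked example from the description, assuming the per-chunk judgements have already been reduced to booleans; the helper name and the choice to return None for an empty list are illustrative assumptions.

```python
from typing import List, Optional


def chunk_precision(relevance_flags: List[bool]) -> Optional[float]:
    """Relevant chunks divided by total chunks; None when no chunks were returned (assumed)."""
    if not relevance_flags:
        return None
    return sum(relevance_flags) / len(relevance_flags)


# Four returned chunks, three of them judged relevant.
precision = chunk_precision([True, True, True, False])  # 3 / 4 = 0.75
```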

databricks.agents.evals.judges.context_sufficiency(request: str | Dict[str, Any], retrieved_context: List[Dict[str, Any]], expected_response: str | None = None, expected_facts: List[str] | None = None) Assessment

The context_sufficiency LLM judge determines whether the retriever has retrieved documents that are sufficient to produce the expected response or expected facts.

Parameters:
  • request – Input to the application being evaluated: the user’s question or query. For example, “What is RAG?”.

  • expected_response – Ground-truth (correct) answer for the input request.

  • retrieved_context

    Retrieval results generated by the retriever in the application being evaluated. It should be a list of dictionaries with the following keys:

    • doc_uri (Optional): The doc_uri of the context.

    • content: The content of the context.

  • expected_facts – Array of strings containing facts expected in the correct response for the input request.

Required input arguments:

request, retrieved_context, oneof(expected_response, expected_facts)

Returns:

Context sufficiency assessment result for the given input.
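Before calling the judge, it can help to check that the inputs have the documented shape. The following sketch is a hypothetical pre-flight check, not part of the SDK. It reads oneof(expected_response, expected_facts) as "at least one of the two must be supplied", which is an assumption about the notation.

```python
from typing import Any, Dict, List, Optional


def check_context_sufficiency_inputs(
    request: Optional[str],
    retrieved_context: Optional[List[Dict[str, Any]]],
    expected_response: Optional[str] = None,
    expected_facts: Optional[List[str]] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems: List[str] = []
    if not request:
        problems.append("request is required")
    if not retrieved_context:
        problems.append("retrieved_context is required")
    for index, chunk in enumerate(retrieved_context or []):
        # doc_uri is optional, content is not.
        if "content" not in chunk:
            problems.append(f"retrieved_context[{index}] is missing 'content'")
    # Assumption: at least one of the two ground-truth fields must be present.
    if expected_response is None and not expected_facts:
        problems.append("provide expected_response or expected_facts")
    return problems


problems = check_context_sufficiency_inputs(
    request="What is RAG?",
    retrieved_context=[{"doc_uri": "docs/rag.md", "content": "RAG combines retrieval with generation."}],
    expected_facts=["RAG combines retrieval with generation"],
)
```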

databricks.agents.evals.judges.correctness(request: str | Dict[str, Any], response: str | Dict[str, Any], expected_response: str | None = None, expected_facts: List[str] | None = None) Assessment

The correctness LLM judge gives a binary evaluation and written rationale on whether the response generated by the agent is factually accurate and semantically similar to the provided expected response or expected facts.

Parameters:
  • request – Input to the application being evaluated: the user’s question or query. For example, “What is RAG?”.

  • response – Response generated by the application being evaluated.

  • expected_response – Ground-truth (correct) answer for the input request.

  • expected_facts – Array of strings containing facts expected in the correct response for the input request.

Required input arguments:

request, response, oneof(expected_response, expected_facts)

Returns:

Correctness assessment result for the given input.

databricks.agents.evals.judges.groundedness(request: str | Dict[str, Any], response: str | Dict[str, Any], retrieved_context: List[Dict[str, Any]]) Assessment

The groundedness LLM judge returns a binary evaluation and written rationale on whether the generated response is factually consistent with the retrieved context.

Parameters:
  • request – Input to the application being evaluated: the user’s question or query. For example, “What is RAG?”.

  • response – Response generated by the application being evaluated.

  • retrieved_context

    Retrieval results generated by the retriever in the application being evaluated. It should be a list of dictionaries with the following keys:

    • doc_uri (Optional): The doc_uri of the context.

    • content: The content of the context.

Required input arguments:

request, response, retrieved_context

Returns:

Groundedness assessment result for the given input.

databricks.agents.evals.judges.guideline_adherence(request: str | Dict[str, Any], response: str | Dict[str, Any], guidelines: List[str]) Assessment

The guideline_adherence LLM judge determines whether the response to the request adheres to the provided guidelines.

Parameters:
  • request – Input to the application being evaluated: the user’s question or query. For example, “What is RAG?”.

  • response – Response generated by the application being evaluated.

  • guidelines – Array of strings containing the guidelines that the response should adhere to.

Required input arguments:

request, response, guidelines

Returns:

Guideline adherence assessment result for the given input.
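A call to this judge might look like the sketch below. The guideline strings and the wrapper function are hypothetical; the keyword arguments follow the signature documented above. The import is deferred to call time, so the module loads without databricks-agents, and nothing is assumed about the fields of the returned Assessment.

```python
from typing import Any, List, Optional

# Hypothetical guidelines; each entry is one plain-language rule.
DEFAULT_GUIDELINES: List[str] = [
    "The response must be written in English.",
    "The response must not mention competitor products.",
]


def judge_guideline_adherence(
    request: str, response: str, guidelines: Optional[List[str]] = None
) -> Any:
    """Call the guideline_adherence judge with all three required arguments."""
    # Imported lazily: requires the databricks-agents package at call time.
    from databricks.agents.evals import judges

    return judges.guideline_adherence(
        request=request,
        response=response,
        guidelines=guidelines if guidelines is not None else DEFAULT_GUIDELINES,
    )
```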

databricks.agents.evals.judges.relevance_to_query(request: str | Dict[str, Any], response: str | Dict[str, Any]) Assessment

The relevance_to_query LLM judge determines whether the response is relevant to the input request.

Parameters:
  • request – Input to the application being evaluated: the user’s question or query. For example, “What is RAG?”.

  • response – Response generated by the application being evaluated.

Required input arguments:

request, response

Returns:

Relevance to query assessment result for the given input.

databricks.agents.evals.judges.safety(request: str | Dict[str, Any], response: str | Dict[str, Any]) Assessment

The safety LLM judge returns a binary rating and a written rationale on whether the generated response has harmful or toxic content.

Parameters:
  • request – Input to the application being evaluated: the user’s question or query. For example, “What is RAG?”.

  • response – Response generated by the application being evaluated.

Required input arguments:

request, response

Returns:

Safety assessment result for the given input.