The Phoenix Evals SDK provides composable building blocks for writing and running evaluations in Python or TypeScript. This page covers the core mental model: what an evaluator is, the two evaluator types, and how input mapping works.

What Is an Evaluator?

An evaluator is anything that takes inputs and returns a Score. The Score object is the universal output of all evaluators:
| Property | Required | Description |
|---|---|---|
| `name` | required | Human-readable name of the evaluator |
| `kind` | required | Origin of the signal: `llm`, `code`, or `human` |
| `direction` | required | Whether a higher score is better or worse |
| `score` | optional | Numeric result |
| `label` | optional | Categorical outcome (e.g. `"correct"`, `"hallucinated"`) |
| `explanation` | optional | Reasoning behind the result |
| `metadata` | optional | Arbitrary extra context |
Every evaluator exposes evaluate and async_evaluate methods for running on a single record, and an input_schema that describes what fields it needs.
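To make the shape concrete, here is a minimal sketch of a Score-like object with the fields described above. This is an illustrative dataclass, not the SDK's actual class definition:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch of the Score shape described above --
# not the SDK's actual class.
@dataclass
class Score:
    name: str                           # human-readable evaluator name
    kind: str                           # "llm", "code", or "human"
    direction: str                      # e.g. "maximize" or "minimize"
    score: Optional[float] = None       # numeric result
    label: Optional[str] = None         # categorical outcome
    explanation: Optional[str] = None   # reasoning behind the result
    metadata: dict = field(default_factory=dict)

s = Score(name="exact_match", kind="code", direction="maximize",
          score=1.0, label="correct")
```

Evaluators emit one or more of these per record, so downstream tooling can treat LLM, code, and human signals uniformly.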

Evaluator Types

LLM-based evaluators use a judge model to assess qualitative criteria — things like faithfulness, toxicity, or relevance — where “correct” is subjective. The judge reads a prompt template and produces a labeled score with an explanation. See Custom LLM Evaluators and Configuring the LLM.

Code evaluators use deterministic logic or heuristics — exact match, regex, Levenshtein distance — where “correct” is objective. They run without any LLM call and are fast and cheap. See Code Evaluators.
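A code evaluator can be as simple as a plain function. The sketch below shows the style with a standalone exact-match check; the function and the dict it returns are illustrative, not SDK API:

```python
# A minimal, self-contained code-style evaluator: deterministic, no LLM call.
# The function name and returned dict are illustrative, not SDK API.
def exact_match(output: str, expected: str) -> dict:
    matched = output.strip().lower() == expected.strip().lower()
    return {
        "name": "exact_match",
        "kind": "code",
        "score": 1.0 if matched else 0.0,
        "label": "correct" if matched else "incorrect",
    }

result = exact_match("Paris", "paris")  # score 1.0, label "correct"
```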

Input Mapping

Your data rarely matches an evaluator’s expected field names exactly. Instead of reshaping your data to fit each evaluator, input mapping makes the evaluator fit your data. Each evaluator has a discoverable input_schema that lists the fields it needs. You pass an input_mapping alongside your data to tell the evaluator how to extract those fields. Mapping values can be one of three types:
  • Key mapping — a plain string that maps directly to a top-level key in your input: "response"
  • Path mapping — a dot-path string that traverses nested structures and arrays using JSONPath syntax: "output.response", "messages[0].content"
  • Callable — a function that receives the full input and returns the value, for transforms that can’t be expressed as a path: lambda x: " ".join(x["documents"])
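To see how these three value types behave, here is a rough sketch of how a mapping value could be resolved against a record. This is illustrative only, not the SDK's implementation:

```python
# Illustrative sketch of resolving the three mapping value types against an
# input record -- not the SDK's implementation.
def resolve(mapping_value, record: dict):
    if callable(mapping_value):          # callable: receives the full record
        return mapping_value(record)
    value = record
    # key / dot-path: walk nested dicts, with [i] for list indices
    for part in mapping_value.replace("]", "").split("."):
        if "[" in part:
            key, idx = part.split("[")
            value = value[key][int(idx)]
        else:
            value = value[part]
    return value

record = {"output": {"response": "hi"}, "messages": [{"content": "hello"}]}
resolve("output.response", record)                          # "hi"
resolve("messages[0].content", record)                      # "hello"
resolve(lambda x: x["output"]["response"].upper(), record)  # "HI"
```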
For example:
eval_input = {
    "input": {"query": "What is photosynthesis?", "documents": ["doc A", "doc B"]},
    "output": {"response": "Photosynthesis converts sunlight to energy."},
}

# Map evaluator field names → paths into your data
input_mapping = {
    "input": "input.query",        # dot notation for nested keys
    "context": lambda x: " ".join(x["input"]["documents"]),  # callable for transforms
    "output": "output.response",
}

scores = faithfulness_evaluator.evaluate(eval_input, input_mapping)

The Bind Pattern

When you want to reuse the same evaluator with the same mapping across many records (for example, batch eval runs or inside an experiment), bind the mapping once:
from phoenix.evals import bind_evaluator

bound = bind_evaluator(faithfulness_evaluator, {
    "input": "input.query",
    "context": lambda x: " ".join(x["input"]["documents"]),
    "output": "output.response",
})

# Now call it with just the data — mapping is baked in
scores = bound(eval_input)
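If you are familiar with functools.partial, bind_evaluator plays a similar role: it fixes the mapping argument ahead of time. A rough standalone analogy (the evaluator function here is hypothetical, not the SDK's):

```python
from functools import partial

# Hypothetical plain-function evaluator, for illustration only.
def run_eval(eval_input: dict, input_mapping: dict) -> dict:
    resolved = {k: (v(eval_input) if callable(v) else eval_input[v])
                for k, v in input_mapping.items()}
    return {"score": 1.0 if resolved else 0.0, "inputs": resolved}

# "Bake in" the mapping once, then call with just the data.
bound = partial(run_eval, input_mapping={"output": "response"})
out = bound({"response": "Photosynthesis converts sunlight to energy."})
```

Binding keeps per-record call sites clean, which matters when the same evaluator runs over thousands of records.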
See Custom LLM Evaluators for deeper examples.

Sync vs Async

Use evaluate for simple scripts or notebooks. Use async_evaluate when you’re running many evaluations concurrently — the executor underneath handles rate limits, retries, and dynamic concurrency automatically. For running evaluations over a full dataframe, use async_evaluate_dataframe. See Batch Evaluations for the full workflow.
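The payoff of the async path is concurrency: many records in flight at once, with a cap on simultaneous calls. The sketch below shows the general pattern with a stand-in coroutine in place of a real async_evaluate call; the SDK's executor handles this (plus rate limits and retries) for you:

```python
import asyncio

# Stand-in for an async evaluator call, for illustration only.
async def score_one(record: str, sem: asyncio.Semaphore) -> float:
    async with sem:
        await asyncio.sleep(0)        # placeholder for a real LLM/API call
        return 1.0 if record else 0.0

# Score many records concurrently, capping in-flight calls with a semaphore.
async def score_all(records: list[str], max_concurrency: int = 8) -> list[float]:
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(score_one(r, sem) for r in records))

scores = asyncio.run(score_all(["a", "b", ""]))  # [1.0, 1.0, 0.0]
```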

Next Steps