The Phoenix Evals SDK provides composable building blocks for writing and running evaluations in Python or TypeScript. This page covers the core mental model: what an evaluator is, the two evaluator types, and how input mapping works.

What Is an Evaluator?

An evaluator is anything that takes inputs and returns a Score. The Score object is the universal output of all evaluators:
| Property | Required | Description |
|---|---|---|
| `name` | required | Human-readable name of the evaluator |
| `kind` | required | Origin of the signal: `llm`, `code`, or `human` |
| `direction` | required | Whether a higher score is better or worse |
| `score` | optional | Numeric result |
| `label` | optional | Categorical outcome (e.g. `"correct"`, `"hallucinated"`) |
| `explanation` | optional | Reasoning behind the result |
| `metadata` | optional | Arbitrary extra context |
Every evaluator exposes evaluate and async_evaluate methods for running on a single record, and an input_schema that describes what fields it needs.
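To make the shape concrete, here is a minimal sketch of a Score-like object with the fields described above. This is an illustrative dataclass, not the SDK's actual class definition:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch of the Score shape described above --
# not the SDK's actual class.
@dataclass
class Score:
    name: str                           # human-readable evaluator name
    kind: str                           # "llm", "code", or "human"
    direction: str                      # e.g. "maximize" or "minimize"
    score: Optional[float] = None       # numeric result
    label: Optional[str] = None         # categorical outcome
    explanation: Optional[str] = None   # reasoning behind the result
    metadata: dict = field(default_factory=dict)

s = Score(name="exact_match", kind="code", direction="maximize",
          score=1.0, label="correct")
```

Evaluators emit one or more of these per record, so downstream tooling can treat LLM, code, and human signals uniformly.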

Evaluator Types

LLM-based evaluators use a judge model to assess qualitative criteria — things like faithfulness, toxicity, or relevance — where “correct” is subjective. The judge reads a prompt template and produces a labeled score with an explanation. See Custom LLM Evaluators and Configuring the LLM.

Code evaluators use deterministic logic or heuristics — exact match, regex, Levenshtein distance — where “correct” is objective. They run without any LLM call and are fast and cheap. See Code Evaluators.
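A code evaluator can be as simple as a plain function. The sketch below shows the style with a standalone exact-match check; the function and the dict it returns are illustrative, not SDK API:

```python
# A minimal, self-contained code-style evaluator: deterministic, no LLM call.
# The function name and returned dict are illustrative, not SDK API.
def exact_match(output: str, expected: str) -> dict:
    matched = output.strip().lower() == expected.strip().lower()
    return {
        "name": "exact_match",
        "kind": "code",
        "score": 1.0 if matched else 0.0,
        "label": "correct" if matched else "incorrect",
    }

result = exact_match("Paris", "paris")  # score 1.0, label "correct"
```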

Input Mapping

Your data rarely matches an evaluator’s expected field names exactly. Instead of reshaping your data to fit each evaluator, input mapping makes the evaluator fit your data. Each evaluator has a discoverable input_schema that lists the fields it needs. You pass an input_mapping alongside your data to tell the evaluator how to extract those fields. Mapping values can be one of three types:
  • Key mapping — a plain string that maps directly to a top-level key in your input: "response"
  • Path mapping — a dot-path string that traverses nested structures and arrays using JSONPath syntax: "output.response", "messages[0].content"
  • Callable — a function that receives the full input and returns the value, for transforms that can’t be expressed as a path: lambda x: " ".join(x["documents"])
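To see how these three value types behave, here is a rough sketch of how a mapping value could be resolved against a record. This is illustrative only, not the SDK's implementation:

```python
# Illustrative sketch of resolving the three mapping value types against an
# input record -- not the SDK's implementation.
def resolve(mapping_value, record: dict):
    if callable(mapping_value):          # callable: receives the full record
        return mapping_value(record)
    value = record
    # key / dot-path: walk nested dicts, with [i] for list indices
    for part in mapping_value.replace("]", "").split("."):
        if "[" in part:
            key, idx = part.split("[")
            value = value[key][int(idx)]
        else:
            value = value[part]
    return value

record = {"output": {"response": "hi"}, "messages": [{"content": "hello"}]}
resolve("output.response", record)                          # "hi"
resolve("messages[0].content", record)                      # "hello"
resolve(lambda x: x["output"]["response"].upper(), record)  # "HI"
```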
For example:
eval_input = {
    "input": {"query": "What is photosynthesis?", "documents": ["doc A", "doc B"]},
    "output": {"response": "Photosynthesis converts sunlight to energy."},
}

# Map evaluator field names → paths into your data
input_mapping = {
    "input": "input.query",        # dot notation for nested keys
    "context": lambda x: " ".join(x["input"]["documents"]),  # callable for transforms
    "output": "output.response",
}

scores = faithfulness_evaluator.evaluate(eval_input, input_mapping)

The Bind Pattern

When you want to reuse the same evaluator with the same mapping across many records (for example, batch eval runs or inside an experiment), bind the mapping once:
from phoenix.evals import bind_evaluator

bound = bind_evaluator(faithfulness_evaluator, {
    "input": "input.query",
    "context": lambda x: " ".join(x["input"]["documents"]),
    "output": "output.response",
})

# Now call it with just the data — mapping is baked in
scores = bound(eval_input)
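If you are familiar with functools.partial, bind_evaluator plays a similar role: it fixes the mapping argument ahead of time. A rough standalone analogy (the evaluator function here is hypothetical, not the SDK's):

```python
from functools import partial

# Hypothetical plain-function evaluator, for illustration only.
def run_eval(eval_input: dict, input_mapping: dict) -> dict:
    resolved = {k: (v(eval_input) if callable(v) else eval_input[v])
                for k, v in input_mapping.items()}
    return {"score": 1.0 if resolved else 0.0, "inputs": resolved}

# "Bake in" the mapping once, then call with just the data.
bound = partial(run_eval, input_mapping={"output": "response"})
out = bound({"response": "Photosynthesis converts sunlight to energy."})
```

Binding keeps per-record call sites clean, which matters when the same evaluator runs over thousands of records.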
See Custom LLM Evaluators for deeper examples.

Sync vs Async

Use evaluate for simple scripts or notebooks. Use async_evaluate when you’re running many evaluations concurrently — the executor underneath handles rate limits, retries, and dynamic concurrency automatically. For running evaluations over a full dataframe, use async_evaluate_dataframe. See Batch Evaluations for the full workflow.
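The payoff of the async path is concurrency: many records in flight at once, with a cap on simultaneous calls. The sketch below shows the general pattern with a stand-in coroutine in place of a real async_evaluate call; the SDK's executor handles this (plus rate limits and retries) for you:

```python
import asyncio

# Stand-in for an async evaluator call, for illustration only.
async def score_one(record: str, sem: asyncio.Semaphore) -> float:
    async with sem:
        await asyncio.sleep(0)        # placeholder for a real LLM/API call
        return 1.0 if record else 0.0

# Score many records concurrently, capping in-flight calls with a semaphore.
async def score_all(records: list[str], max_concurrency: int = 8) -> list[float]:
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(score_one(r, sem) for r in records))

scores = asyncio.run(score_all(["a", "b", ""]))  # [1.0, 1.0, 0.0]
```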

Next Steps