What Is an Evaluator?
An evaluator is anything that takes inputs and returns a Score. The Score object is the universal output of all evaluators:
| Property | Required | Description |
|---|---|---|
| `name` | ✓ | Human-readable name of the evaluator |
| `kind` | ✓ | Origin of the signal: `llm`, `code`, or `human` |
| `direction` | ✓ | Whether a higher score is better or worse |
| `score` | optional | Numeric result |
| `label` | optional | Categorical outcome (e.g. `"correct"`, `"hallucinated"`) |
| `explanation` | optional | Reasoning behind the result |
| `metadata` | optional | Arbitrary extra context |
Every evaluator also exposes evaluate and async_evaluate methods for running on a single record, and an input_schema that describes which fields it needs.
Evaluator Types
LLM-based evaluators use a judge model to assess qualitative criteria — things like faithfulness, toxicity, or relevance — where “correct” is subjective. The judge reads a prompt template and produces a labeled score with an explanation. See Custom LLM Evaluators and Configuring the LLM.

Code evaluators use deterministic logic or heuristics — exact match, regex, Levenshtein distance — where “correct” is objective. They run without any LLM call and are fast and cheap. See Code Evaluators.

Input Mapping
Your data rarely matches an evaluator’s expected field names exactly. Instead of reshaping your data to fit each evaluator, input mapping makes the evaluator fit your data. Each evaluator has a discoverable input_schema that lists the fields it needs. You pass an input_mapping alongside your data to tell the evaluator how to extract those fields. Mapping values can be one of three types:
- Key mapping — a plain string that maps directly to a top-level key in your input: `"response"`
- Path mapping — a dot-path string that traverses nested structures and arrays using JSONPath syntax: `"output.response"`, `"messages[0].content"`
- Callable — a function that receives the full input and returns the value, for transforms that can’t be expressed as a path: `lambda x: " ".join(x["documents"])`
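To make the three mapping types concrete, here is a small sketch of how a single mapping value might be resolved against a record. The `resolve_value` helper is hypothetical — a real implementation would use a full JSONPath library — but it handles the same three cases:

```python
import re

def resolve_value(mapping, record):
    """Resolve one mapping value against a record (illustrative only).

    Supports the three types described above: a callable, a plain
    top-level key, and a dot-path with [index] array access.
    """
    if callable(mapping):
        return mapping(record)
    # Split "messages[0].content" into ["messages", 0, "content"]
    parts = []
    for piece in mapping.split("."):
        m = re.match(r"(\w+)(?:\[(\d+)\])?$", piece)
        parts.append(m.group(1))
        if m.group(2) is not None:
            parts.append(int(m.group(2)))
    value = record
    for part in parts:
        value = value[part]
    return value

record = {
    "response": "hello",
    "output": {"response": "hello again"},
    "messages": [{"content": "hi"}],
    "documents": ["a", "b"],
}
print(resolve_value("response", record))                          # hello
print(resolve_value("output.response", record))                   # hello again
print(resolve_value("messages[0].content", record))               # hi
print(resolve_value(lambda x: " ".join(x["documents"]), record))  # a b
```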
The Bind Pattern
When you want to reuse the same evaluator with the same mapping across many records (for example, batch eval runs or inside an experiment), bind the mapping once:
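Binding can be pictured as pre-applying the mapping so downstream code only passes raw records. The `BoundEvaluator` class and the mapping keys below are illustrative names, not necessarily the library’s own:

```python
class BoundEvaluator:
    """Wraps an evaluator plus a fixed input_mapping so callers
    only supply raw records (illustrative sketch)."""

    def __init__(self, evaluator, input_mapping):
        self.evaluator = evaluator
        self.input_mapping = input_mapping

    def evaluate(self, record):
        # Extract the fields the evaluator expects from the raw record,
        # supporting plain-key and callable mapping values
        resolved = {
            name: (m(record) if callable(m) else record[m])
            for name, m in self.input_mapping.items()
        }
        return self.evaluator.evaluate(resolved)

class ExactMatch:
    def evaluate(self, inputs):
        return {"score": 1.0 if inputs["output"] == inputs["expected"] else 0.0}

# Bind once, then reuse across every record in a batch
bound = BoundEvaluator(ExactMatch(), {
    "output": "model_answer",          # key mapping
    "expected": lambda r: r["gold"],   # callable mapping
})

results = [bound.evaluate(r) for r in
           [{"model_answer": "4", "gold": "4"},
            {"model_answer": "5", "gold": "4"}]]
print(results)  # [{'score': 1.0}, {'score': 0.0}]
```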
Sync vs Async
Use evaluate for simple scripts or notebooks. Use async_evaluate when you’re running many evaluations concurrently — the executor underneath handles rate limits, retries, and dynamic concurrency automatically.
For running evaluations over a full dataframe, use async_evaluate_dataframe. See Batch Evaluations for the full workflow.
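The concurrency pattern behind async_evaluate can be roughly sketched as below. The semaphore stands in for the executor’s rate limiting, and the stand-in evaluator replaces a real LLM call; none of these names come from the library itself:

```python
import asyncio

async def async_evaluate(record):
    # Stand-in for a real evaluator's async_evaluate (e.g. an LLM call)
    await asyncio.sleep(0)
    return {"score": 1.0 if record["output"] == record["expected"] else 0.0}

async def evaluate_all(records, max_concurrency=8):
    # Bound in-flight evaluations, as a rate-limiting executor would
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(record):
        async with sem:
            return await async_evaluate(record)

    # Run all evaluations concurrently, preserving input order
    return await asyncio.gather(*(run_one(r) for r in records))

records = [{"output": str(i % 2), "expected": "0"} for i in range(4)]
scores = asyncio.run(evaluate_all(records))
print([s["score"] for s in scores])  # [1.0, 0.0, 1.0, 0.0]
```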

