An evaluation framework for testing AI agents' ability to write Dart and Flutter code, built on Inspect AI.
> **Tip:** Full documentation is available at evals-docs.web.app/
The `evals` repository provides:
- Evaluation Runner — Python package for running LLM evaluations with configurable tasks, variants, and models
- Evaluation Configuration — Dart and Python packages that resolve dataset YAML into EvalSet JSON for the runner
- devals CLI — Dart CLI for creating and managing dataset samples, tasks, and jobs
- Evaluation Explorer — Dart/Flutter app for browsing and analyzing results
- Dataset — Curated samples for Dart/Flutter Q&A, code generation, and debugging tasks
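To illustrate the configuration flow described above (dataset YAML resolved into EvalSet JSON for the runner), here is a rough sketch of what a dataset sample might look like. The field names and values are invented for illustration and do not reflect the project's actual schema:

```yaml
# Hypothetical dataset sample definition.
# Field names are illustrative only, not the real schema.
task: dart_qa
samples:
  - id: widget_lifecycle_001
    input: "When is initState called on a StatefulWidget?"
    target: "Once, when the State object is first inserted into the tree."
```

Under this sketch, the configuration packages would combine sample definitions like this with task, variant, and model settings to produce the EvalSet JSON consumed by the runner.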
| Package | Description | Docs |
|---|---|---|
| dash_evals | Python evaluation runner using Inspect AI | dash_evals docs |
| dataset_config_dart | Dart library for resolving dataset YAML into EvalSet JSON (includes shared data models) | dataset_config_dart docs |
| dataset_config_python | Python configuration models | — |
| devals_cli | Dart CLI for managing evaluation tasks and jobs | CLI docs |
| eval_explorer | Dart/Flutter results viewer (Serverpod) | eval_explorer docs |
> **Note:** The uploader and report_app packages are deprecated and will be replaced by eval_explorer.
| Doc | Description |
|---|---|
| Quick Start | Get started authoring your own evals |
| Contributing Guide | Development setup and guidelines |
| CLI Reference | Full devals CLI command reference |
| Configuration Reference | YAML configuration file reference |
| Repository Structure | Project layout |
| Glossary | Terminology guide |
To get involved, see CONTRIBUTING.md or go directly to the Contributing Guide.
This project is licensed under the terms described in the LICENSE file.