NOTE: This repository is deprecated as of 2022/11/07, and will be removed soon. If you are using MLflow 2.0, please refer to MLflow Recipes Regression Template instead.
The MLflow Regression Pipeline is an MLflow Pipeline for developing
high-quality regression models.
It is designed for developing models using scikit-learn and frameworks that integrate with scikit-learn,
such as the XGBRegressor
API from XGBoost.
This repository is a template for developing production-ready regression models with the MLflow Regression Pipeline. It provides a pipeline structure for creating models as well as pointers to configurations and code files that should be filled in to produce a working pipeline.
Code developed with this template should be run with MLflow Pipelines. An example implementation of this template can be found in the MLP Regression Example repo, which targets the NYC taxi dataset for its training problem.
Note: MLflow Pipelines is an experimental feature in MLflow. If you observe any issues, please report them here. For suggestions on improvements, please file a discussion topic here. Your contribution to MLflow Pipelines is greatly appreciated by the community!
- Deterministic data splitting
- Reproducible data transformations
- Hyperparameter tuning support
- Model registration for use in production
- Starter code for ingest, split, transform and train steps
- Cards containing step results, including dataset profiles, model leaderboard, performance plots and more
Follow the MLflow Pipelines installation guide. You may need to install additional libraries for extra features:
- Hyperopt is required for hyperparameter tuning.
- PySpark is required for distributed training or to ingest Spark tables.
- Delta is required to ingest Delta tables. These libraries are available natively in the Databricks Runtime for Machine Learning.
After installing MLflow Pipelines, you can clone this repository to get started. Simply fill in the required values annotated by FIXME::REQUIRED
comments in the Pipeline configuration file
and in the appropriate profile configuration: local.yaml
(if running locally) or databricks.yaml
(if running on Databricks).
The Pipeline will then be in a runnable state, and when run completely, will produce a trained model ready for batch
scoring, along with cards containing detailed information about the results of each step.
The model will also be registered to the MLflow Model Registry if it meets registration thresholds.
To iterate and improve your model, follow the MLflow Pipelines usage guide.
Note that iteration will likely involve filling in the optional FIXME
s in the
step code files with your own code, in addition to the configuration keys.
This is a visual overview of the MLflow Regression Pipeline's information flow.
Model develompent consists of the following sequential steps:
ingest -> split -> transform -> train -> evaluate -> register
The batch scoring workflow consists of the following sequential steps:
ingest_scoring -> predict
A detailed reference for each step follows.
Each of the steps in the pipeline produces artifacts after completion. These artifacts consist of cards containing
detailed execution information, as well as other step-specific information.
The Pipeline.inspect()
API is used to view step cards. The get_artifact
API is used to load all other step artifacts by name.
Per-step artifacts are further detailed in the following step references.
The ingest step resolves the dataset specified by the data
section in pipeline.yaml
and converts it to parquet format, leveraging the custom loader code specified in the data
section if necessary.
Note: If you make changes to the dataset referenced by the ingest step (e.g. by adding new records or columns),
you must manually re-run the ingest step in order to use the updated dataset in the pipeline.
The ingest step does not automatically detect changes in the dataset.
The custom loader function allows use of datasets in other formats, such as csv
.
The function should be defined in steps/ingest.py
,
and should accept two parameters:
file_path
:str
. Path to the dataset file.file_format
:str
. The file format string, such as"csv"
.
It should return a Pandas DataFrame representing the content of the specified file. steps/ingest.py
contains an example placeholder function.
The input dataset is specified by the data
section in pipeline.yaml
as follows:
Full configuration reference
-
location
: string. Required, unlessformat
isspark_sql
.
Dataset locations on the local filesystem are supported, as well as HTTP(S) URLs and any other remote locations resolvable by MLflow. One may specify multiple data locations by a list of locations as long as they have the same data format (see example below) Examples:location: ./data/sample.parquet
location: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
location: ["./data/sample.parquet", "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"]
-
format
: string. Required.
One ofparquet
,spark_sql
anddelta
. -
custom_loader_method
: string. Optional.
Fully qualified name of the custom loader function.
Example:custom_loader_method: steps.ingest.load_file_as_dataframe
-
sql
: string. Required if format isspark_sql
.
Specifies a SparkSQL statement that identifies the dataset to use. -
version
: int. Optional.
If thedelta
format is specified, use this to specify the Delta table version to read from. -
timestamp
: timestamp. Optional.
If thedelta
format is specified, use this to specify the timestamp at which to read data.
Step artifacts
ingested_data
: The ingested data as a Pandas DataFrame.
The split step splits the ingested dataset produced by the ingest step into:
- a training dataset for model training
- a validation dataset for model performance evaluation & tuning, and
- a test dataset for model performance evaluation.
The fraction of records allocated to each dataset is defined by the split_ratios
attribute of the split
step
definition in pipeline.yaml
.
The split step also preprocesses the datasets using logic defined in steps/split.py
.
Subsequent steps use these datasets to develop a model and measure its performance.
The post-split method should be written in steps/split.py
and should accept three parameters:
train_df
: DataFrame. The unprocessed train dataset.validation_df
: DataFrame. The unprocessed validation dataset.test_df
: DataFrame. The unprocessed test dataset.
It should return a triple representing the processed train, validation and test datasets. steps/split.py
contains an example placeholder function.
The split step is configured by the steps.split
section in pipeline.yaml
as follows:
Full configuration reference
split_ratios
: list. Optional.
A YAML list specifying the ratios by which to split the dataset into training, validation and test sets.
Example:split_ratios: [0.75, 0.125, 0.125] # Defaults to this ratio if unspecified
post_split_filter_method
: string. Optional.
Fully qualified name of the method to use to "post-process" the split datasets. This procedure is meant for removing/filtering records, or other cleaning processes. Arbitrary data transformations should be done in the transform step.
Example:post_split_filter_method: steps.split.process_splits
Step artifacts:
training_data
: the training dataset as a Pandas DataFrame.validation_data
: the validation dataset as a Pandas DataFrame.test_data
: the test dataset as a Pandas DataFrame.
The transform step uses the training dataset created by the split step to fit a transformer that performs the user-defined transformations. The transformer is then applied to the training dataset and the validation dataset, creating transformed datasets that are used by subsequent steps for estimator training and model performance evaluation.
The user-defined transformation function is not required. If absent, an identity transformer will be used.
The user-defined function should be written in
steps/transform.py
,
and should return an unfitted estimator that is sklearn-compatible; that is, the returned object should define
fit()
and transform()
methods. steps/transform.py
contains an example placeholder function.
The transform step is configured by the steps.transform
section in pipeline.yaml
:
Full configuration reference
transformer_method
: string. Optional.
Fully qualified name of the method that returns ansklearn
-compatible transformer which applies feature transformation during model training and inference. If absent, an identity transformer with will be used. Example:transformer_method: steps.split.transformer_fn
Step artifacts:
transformed_training_data
: transformed training dataset as a Pandas DataFrame.transformed_validation_data
: transformed validation dataset as a Pandas DataFrame.transformer
: the sklearn transformer.
The train step uses the transformed training dataset output from the transform step to fit an user-defined estimator. The estimator is then joined with the fitted transformer output from the transform step to create a model pipeline. Finally, this model pipeline is evaluated against the transformed training and validation datasets to compute performance metrics.
Custom evaluation metrics are computed according to definitions in steps/custom_metrics.py
and the metrics
section of pipeline.yaml
; see Custom Metrics section for reference.
The model pipeline and its associated parameters, performance metrics, and lineage information are logged to MLflow Tracking, producing an MLflow Run.
The user-defined estimator function should be written in steps/train.py
,
and should return an unfitted estimator that is sklearn
-compatible; that is, the returned object should define
fit()
and transform()
methods. steps/train.py
contains an example placeholder function.
The train step is configured by the steps.train
section in pipeline.yaml
:
Full configuration reference
-
estimator_method
: string. Required.
Fully qualified name of the method that returns ansklearn
-compatible estimator used for model training.
Example:estimator_method: steps.train.estimator_fn
-
Tuning configuration reference
-
enabled
: boolean. Required.
Indicates whether or not tuning is enabled. -
max_trials
: int. Required.
Max tuning trials to run. -
algorithm
: string. Optional.
Indicates whether or not tuning is enabled. -
early_stop_fn
: string. Optional.
Early stopping function to be passed tohyperopt
. -
parallelism
: int. Optional.
Number of workers to runhyperopt
across. -
sample_fraction
: float. Optional.
Sampling fraction in the range(0, 1.0]
to indicate the amount of data used in tuning. -
parameters
: list. Required.
hyperopt
search space in yaml format.
Example:
tuning: enabled: True algorithm: "hyperopt.rand.suggest" max_trials: 5 early_stop_fn: "hyperopt.early_stop.no_progress_loss(10)" parallelism: 1 sample_fraction: 0.5 parameters: alpha: distribution: "uniform" low: 0.0 high: 0.01 penalty: values: ["l1", "l2", "elasticnet"]
-
Step artifacts:
model
: the MLflow Model pipeline created in the train step as a PyFuncModel instance.
The evaluate step evaluates the model pipeline created by the train step on the test dataset output from the split step, computing performance metrics and model explanations.
Performance metrics are compared against configured thresholds to produce a model_validation_status
, which indicates
whether or not a model is validated to be registered to the MLflow Model Registry
by the subsequent register step.
These model performance thresholds are defined in the
validation_criteria
section of the evaluate
step definition in pipeline.yaml
.
Custom evaluation metrics are computed according to definitions in steps/custom_metrics.py
and the metrics
section of pipeline.yaml
; see the custom metrics section for reference.
Model performance metrics and explanations are logged to the same MLflow Tracking Run used by the train step.
The evaluate step is configured by the steps.evaluate
section in pipeline.yaml
:
Full configuration reference
validation_criteria
: list. Optional.
A list of validation thresholds, each of which a trained model must meet in order to be eligible for registration in the register step. A definition for a validation threshold consists of a metric name (either a built-in metric or a custom metric), and a threshold value.
Example:validation_critera: - metric: root_mean_squared_error threshold: 10
Step artifacts:
run
: the MLflow Tracking Run containing the model pipeline, as well as performance and metrics created during the train and evaluate steps.
The register step checks the model_validation_status
output of the preceding evaluate step and,
if model validation was successful (if model_validation_status is 'VALIDATED'
), registers the model pipeline created
by the train step to the MLflow Model Registry. If the model_validation_status
does not indicate that the model
passed validation checks (if model_validation_status is 'REJECTED'
), the model pipeline is not registered to the
MLflow Model Registry.
If the model pipeline is registered to the MLflow Model Registry, a registered_model_version
is produced containing
the model name and the model version.
The register step is configured by the steps.register
section in pipeline.yaml
:
Full configuration reference
-
model_name
: string. Required.
Specifies the name to use when registering the trained model to the model registry. -
allow_non_validated_model
: boolean. Required.
Whether to allow registration of models that fail to meet performance thresholds.
Step artifacts:
registered_model_version
: the MLflow Model Registry ModelVersion registered in this step.
After model training, the regression pipeline provides the capability to score new data with the trained model.
The ingest scoring step, defined in the data_scoring
section in pipeline.yaml
,
specifies the dataset used for batch scoring and has the same API as the ingest step.
Step artifacts:
ingested_scoring_data
: the ingested scoring data as a Pandas DataFrame.
The predict step uses the model registered by the register step to score the
ingested dataset produced by the ingest scoring step and writes the resulting
dataset to the specified output format and location. To fix a specific model for use in the predict
step, provide its model URI as the model_uri
attribute of the pipeline.yaml
predict step definition.
The predict step is configured by the steps.predict
section in pipeline.yaml
:
Full configuration reference
-
output_format
: string. Required.
Specifies the output format of the scored data from the predict step. One ofparquet
,delta
, andtable
. Theparquet
format writes the scored data as parquet files under a specified path. Thedelta
format writes the scored data as a delta table under a specified path. Thetable
format writes the scored data as delta table and creates a metastore entry for this table with a specified name. -
output_location
: string. Required.
For theparquet
anddelta
output formats, this attribute specifies the output path for writing the scored data. In Databricks, this path will be written to be under DBFS, e.g. the pathmy/special/path
will be written under/dbfs/my/special/path
. For thetable
output format, this attribute specifies the table name that is used to create the metastore entry for the written delta table. Example:output_location: ./outputs/predictions
-
model_uri
: string. Optional.
Specifies the URI of the model to use for batch scoring. If empty, the latest model version produced by the register step is used. If the register step was cleared, the latest version of the registered model specified by themodel_name
attribute of thepipeline.yaml
register step will be used.
Example:model_uri: models/model.pkl
-
result_type
: string. Optional. Defaults todouble
.
Specifies the data type for predictions generated by the model. See the MLflow spark_udf API docs for more information. -
save_mode
: string. Optional. Defaults todefault
.
Specifies the save mode used by Spark for writing the scored data. See the PySpark save modes documentation for more information.
Step artifacts:
scored_data
: the scored dataset, with model predictions under theprediction
column, as a Pandas DataFrame.
The MLflow Tracking server can be configured to log MLflow runs to a specific server. Tracking information is specified
in the profile configuration files - profiles/local.yaml
if running locally and profiles/databricks.yaml
if running on Databricks.
Configuring a tracking server is optional. If this configuration is absent, the default experiment will be used.
Tracking information is configured with the experiment
section in the profile configuration:
Full configuration reference
-
name
: string. Required, if configuring tracking.
Name of the experiment to log MLflow runs to. -
tracking_uri
: string. Required, if configuring tracking.
URI of the MLflow tracking server to log runs to. Alternatively, theMLFLOW_TRACKING_URI
environment variable can be set to point to a valid tracking server. -
artifact_location
: string. Optional. URI of the location to log run artifacts to.
To register trained models to the MLflow Model Registry, further configuration may be required. If unspecified, models will be logged to the same server as specified in the tracking URI.
To register models to a different server, specify the desired server in the model_registry
section in the profile configuration:
Full configuration reference
uri
: string. Required, if this section is present.
URI of the model registry server to which to register trained models.
Evaluation metrics calculate model performance against different datasets. The metrics defined in the pipeline will be calculated as part of the training and evaluation steps, and calculated values will be recorded in each step’s information card.
This regression pipeline features a set of built-in metrics, and supports user-defined metrics as well.
The primary evaluation metric is the one that will be used to select the best performing model in the MLflow UI as
well as in the train and evaluation steps. This can be either a built-in metric or a custom metric (see below).
Models are ranked by this primary metric.
Metrics are configured under the metrics
section of pipeline.yaml
, according to the following specification:
Full configuration reference
-
primary
: string. Required.
The name of the primary evaluation metric. -
custom
: string. Optional.
A list of custom metric configurations.
Note that each metric specifies a boolean value greater_is_better
, which indicates whether a higher value for that
metric is associated with better model performance.
The following metrics are built-in. Note that greater_is_better = False
for all these metrics:
mean_absolute_error
mean_squared_error
root_mean_squared_error
max_error
mean_absolute_percentage_error
Custom evaluation metrics define how trained models should be evaluated against custom criteria not captured by
built-in sklearn
evaluation metrics.
Custom evaluation metric functions should be defined in steps/custom_metrics.py
.
Each should accept two parameters:
-
eval_df
: DataFrame.
A Pandas DataFrame containing two columns:prediction
: Predictions produced by submitting input data to the model.target
: Corresponding target truth values.
-
builtin_metrics
:Dict[str, int]
.
The built-in metrics calculated during model evaluation. Maps metric names to corresponding scalar values.
The custom metric function should return a Dict[str, int]
, mapping custom metric names to corresponding scalar metric values.
Custom metrics are specified as a list under the metrics.custom
key in pipeline.yaml
, specified as follows:
-
name
: string. Required.
Name of the custom metric. This will be the name by which you refer to this metric when including it in model evaluation or model training. -
function
: string. Required. Specifies the function this custom metric refers to. -
greater_is_better
: boolean. Required. Boolean indicating whether a higher metric value indicates better model performance.
An example custom metric configuration is as follows:
custom:
- name: weighted_mean_square_error
function: steps.custom_metrics.get_custom_metrics
greater_is_better: True