The Chicago Taxi example demonstrates an end-to-end workflow: the steps required to analyze, validate, and transform data, train a model, analyze its performance, and serve it. This example uses the following TFX components (a short wiring sketch follows the list):
- ExampleGen ingests and splits the input dataset.
- StatisticsGen calculates statistics for the dataset.
- SchemaGen examines the statistics and creates a data schema.
- ExampleValidator looks for anomalies and missing values in the dataset.
- Transform performs feature engineering on the dataset.
- Trainer trains the model using native Keras.
- Evaluator performs deep analysis of the training results.
- InfraValidator checks that the model is actually servable from the infrastructure, and prevents a bad model from being pushed.
- Pusher deploys the model to a serving infrastructure.
- BulkInferrer performs batch inference on the model with unlabelled examples.
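For orientation, here is a minimal sketch of how the first few of these components are wired together. This is not the full `taxi_pipeline_simple.py`, and constructor arguments differ between TFX releases, so treat the exact signatures as assumptions:

```python
# Minimal wiring sketch (not the full pipeline); exact constructor arguments
# vary across TFX versions, so treat these signatures as illustrative.
from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen, ExampleValidator

# Each component consumes the typed output artifacts of an upstream component.
example_gen = CsvExampleGen(input_base='/path/to/taxi/data/simple')
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
```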
Inference in the example is powered by TensorFlow Serving.
This example uses the Taxi Trips dataset released by the City of Chicago.
Note: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.
You can read more about the dataset in Google BigQuery. Explore the full dataset in the BigQuery UI.
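If you want to peek at the full dataset before running the pipeline, a query like the following works against the public BigQuery table. This assumes the `google-cloud-bigquery` client is installed and GCP credentials are configured; the pipeline itself only uses the small CSV downloaded below.

```python
# Optional: sample a few rows from the public Chicago taxi table in BigQuery.
# Assumes `pip install google-cloud-bigquery` and configured GCP credentials.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT trip_miles, fare, payment_type
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    LIMIT 5
"""
for row in client.query(query).result():
    print(dict(row))
```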
This example also uses:
- Apache Airflow for pipeline orchestration.
- Apache Beam for distributed processing.
- TensorFlow for model training, evaluation, and inference.
Development for this example will be isolated in a Python virtual environment. This allows us to experiment with different versions of dependencies.
There are many ways to install virtualenv; see the TensorFlow install guides for different platforms. Here are a couple:
- For Linux:

  ```bash
  sudo apt-get install python-pip python-virtualenv python-dev build-essential
  ```
- For Mac:

  ```bash
  sudo easy_install pip
  pip install --upgrade virtualenv
  ```
Create a Python 3.6 virtual environment for this example and activate the virtualenv:

```bash
virtualenv -p python3.6 taxi_pipeline
source ./taxi_pipeline/bin/activate
```
Configure common paths:
```bash
export AIRFLOW_HOME=~/airflow
export TAXI_DIR=~/taxi
export TFX_DIR=~/tfx
```
Next, install the dependencies required by the Chicago Taxi example:
```bash
pip install apache-airflow==1.10.9
pip install -U tfx[examples]
```
Next, initialize Airflow:

```bash
airflow initdb
```
The benefit of the local example is that you can edit any part of the pipeline and experiment very quickly with various components. First let's download the data for the example:
```bash
mkdir -p $TAXI_DIR/data/simple
wget -O $TAXI_DIR/data/simple/data.csv https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv?raw=true
```
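As a quick sanity check that the download worked, you can count rows with the Python standard library (the path assumes the `TAXI_DIR=~/taxi` setting from above):

```python
# Quick sanity check on the downloaded CSV; uses only the standard library.
import csv
import os

path = os.path.expanduser('~/taxi/data/simple/data.csv')  # $TAXI_DIR/data/simple
with open(path, newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    print(f'{len(header)} columns, starting with: {header[:5]}')
    print(sum(1 for _ in reader), 'data rows')
```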
Next, copy the TFX pipeline definition to Airflow's DAGs directory ($AIRFLOW_HOME/dags) so it can run the pipeline. To find the location of your TFX installation, use this command:

```bash
pip show tfx
```
Use the location shown when setting the TFX_EXAMPLES path below.
```bash
export TFX_EXAMPLES=~/taxi_pipeline/lib/python3.6/site-packages/tfx/examples/chicago_taxi_pipeline
```
Copy the Chicago Taxi example pipeline into the Airflow DAG folder.
```bash
mkdir -p $AIRFLOW_HOME/dags/
cp $TFX_EXAMPLES/taxi_pipeline_simple.py $AIRFLOW_HOME/dags/
```
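For reference, the copied file follows the usual TFX-on-Airflow pattern: it builds the pipeline and hands it to an Airflow DAG runner at module level so Airflow can discover it. The sketch below is a simplified outline, not the file's exact contents; import paths and config classes change between TFX releases.

```python
# Simplified outline of how a TFX pipeline is exposed as an Airflow DAG.
# Import paths and config classes vary by TFX release; treat this as a sketch.
import datetime
import os

from tfx.orchestration import metadata, pipeline
from tfx.orchestration.airflow.airflow_dag_runner import AirflowDagRunner
from tfx.orchestration.airflow.airflow_dag_runner import AirflowPipelineConfig

_tfx_root = os.path.expanduser('~/tfx')  # $TFX_DIR from above
_airflow_config = {
    'schedule_interval': None,                    # run only when triggered
    'start_date': datetime.datetime(2019, 1, 1),
}

def _create_pipeline():
    return pipeline.Pipeline(
        pipeline_name='chicago_taxi_simple',
        pipeline_root=os.path.join(_tfx_root, 'pipelines', 'chicago_taxi_simple'),
        components=[],  # the real file lists ExampleGen, StatisticsGen, etc.
        metadata_connection_config=metadata.sqlite_metadata_connection_config(
            os.path.join(_tfx_root, 'metadata', 'metadata.db')))

# Airflow picks up module-level DAG objects from files in $AIRFLOW_HOME/dags.
DAG = AirflowDagRunner(AirflowPipelineConfig(_airflow_config)).run(_create_pipeline())
```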
The module file taxi_utils.py used by the Trainer and Transform components must reside in $TAXI_DIR. Copy it there:

```bash
cp $TFX_EXAMPLES/taxi_utils.py $TAXI_DIR
```
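Transform loads this module file by path and calls its `preprocessing_fn`. As a rough illustration of its shape (the real `taxi_utils.py` transforms many more features and handles missing values):

```python
# Rough shape of the preprocessing_fn in taxi_utils.py (heavily simplified).
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Called by the Transform component with a dict of raw feature tensors."""
    outputs = {}
    # Scale a numeric column to z-score over the whole dataset.
    outputs['trip_miles_xf'] = tft.scale_to_z_score(inputs['trip_miles'])
    # Build a vocabulary for a string column and map values to integer ids.
    outputs['payment_type_xf'] = tft.compute_and_apply_vocabulary(
        inputs['payment_type'])
    return outputs
```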
Start the Airflow webserver (in the 'taxi_pipeline' virtualenv):

```bash
airflow webserver
```
Open a new terminal window, activate the virtualenv:

```bash
source ./taxi_pipeline/bin/activate
```

and start the Airflow scheduler:

```bash
airflow scheduler
```
Open a browser to 127.0.0.1:8080 and click on the chicago_taxi_simple example. It should look like the image below if you click the Graph View option.
If you were looking at the graph above, click on the DAGs button to get back to the DAGs view.

Enable the chicago_taxi_simple pipeline in Airflow by toggling the DAG to On. Now that it is schedulable, click the Trigger DAG button (triangle inside a circle) to start the run. You can view the status by clicking on the started job, found in the Last run column. This process will take several minutes.
Once the pipeline completes, the model will be copied by the Pusher to the directory configured in the example code:
```bash
ls $TAXI_DIR/serving_model/chicago_taxi_simple
```
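Pusher writes each pushed model into a version-numbered subdirectory. If you want to confirm that the export is a loadable SavedModel, something like this works (paths assume the defaults used above):

```python
# Inspect the most recently pushed SavedModel; path assumes TAXI_DIR=~/taxi.
import os
import tensorflow as tf

base = os.path.expanduser('~/taxi/serving_model/chicago_taxi_simple')
latest = max(os.listdir(base), key=int)   # Pusher uses numeric version dirs
model = tf.saved_model.load(os.path.join(base, latest))
print('Signatures:', list(model.signatures.keys()))
```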
Now serve the created model with TensorFlow Serving. For this example, run the server in a local Docker container. Instructions for installing Docker locally are found in the Docker install documentation.
In the terminal, run the following script to start a server:
```bash
bash $TFX_EXAMPLES/serving/start_model_server_local.sh \
  $TAXI_DIR/serving_model/chicago_taxi_simple
```
This script pulls a TensorFlow Serving image and listens for gRPC requests on localhost port 9000. The model server loads the latest model exported by the Pusher at the path above.
To send a request to the server for model inference, run:
```bash
bash $TFX_EXAMPLES/serving/classify_local.sh \
  $TAXI_DIR/data/simple/data.csv \
  $TFX_DIR/pipelines/chicago_taxi_simple/SchemaGen/output/CHANGE_TO_LATEST_DIR/schema.pbtxt
```
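Under the hood, `classify_local.sh` sends gRPC Classify requests built from rows of the CSV. A minimal hand-rolled client looks roughly like this; the model name `chicago_taxi` and the single feature below are assumptions, and a real request must supply every feature the serving signature expects:

```python
# Minimal gRPC classification request to the local model server (sketch).
# Assumes `pip install tensorflow-serving-api grpcio`; the model name and the
# single feature below are illustrative -- a real request must include every
# feature the model's serving signature expects.
import grpc
from tensorflow_serving.apis import classification_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:9000')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = classification_pb2.ClassificationRequest()
request.model_spec.name = 'chicago_taxi'  # assumed name; set by the start script
example = request.input.example_list.examples.add()
example.features.feature['trip_miles'].float_list.value.append(3.2)

print(stub.Classify(request, 10.0))  # 10-second timeout
```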
For a Google Cloud AI Platform serving example, use start_model_server_aiplatform.sh and classify_aiplatform.sh in the same way as the local example above, changing the local directory to gs://YOUR_BUCKET.
For more information, see TensorFlow Serving.
To use Kubeflow as the orchestrator, check here for details.
Instead of an Estimator, this example uses native Keras in the user module file taxi_utils_native_keras.py.
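With the generic Trainer, TFX looks up a `run_fn` in that module file, trains a Keras model, and saves it to `fn_args.serving_model_dir`. A toy sketch of that entry point follows; the real file builds its input pipelines from `fn_args.train_files` and the Transform output, and the synthetic data here only keeps the sketch self-contained:

```python
# Toy sketch of the run_fn entry point used by TFX's generic (Keras) Trainer.
# The real taxi_utils_native_keras.py builds tf.data pipelines from
# fn_args.train_files and the Transform graph; synthetic data keeps this short.
import tensorflow as tf
from tfx.components.trainer.fn_args_utils import FnArgs  # path varies by version

def run_fn(fn_args: FnArgs):
    # Stand-in training data; a real module file parses TFRecords here.
    x = tf.random.normal([32, 4])
    y = tf.cast(tf.reduce_sum(x, axis=1) > 0, tf.int32)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    model.fit(x, y, epochs=1)
    # TFX expects the SavedModel at this exact location.
    model.save(fn_args.serving_model_dir, save_format='tf')
```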
Please see the TFX User Guide to learn more.