insights in our observation from existing solutions are as follows. The emphasis of existing solutions has been on the training, experimentation, evaluation, and deployment phases of the life cycle, as shown in Table 1. However, little emphasis¹ has been placed on the needs of an integrated ML data system. Instead, during data collection, data are often injected into cloud blob storage systems. The heavy-lifting ETL (extract-transform-load) pipelines are often performed using data processing systems, such as MapReduce or Spark, with the results stored in distributed file systems or back to the blob storage. The crowd-sourced annotation tasks often require data to be exported and imported to/from publicly accessible data storage systems. The derived and transformed data between different stages are simply stored in storage systems based on the preferences of the data engineers. These data silos are used as simple, passive data stores, leaving the burden of integrating data access with ML tasks to the ML engineers. The ML engineers often face the challenges of coping with different data layouts and formats, and with different data access interfaces from different silos throughout different stages, instead of focusing on ML challenges.

As illustrated in Figure 1, developers, using different tools in every stage of the ML life cycle, need to develop an understanding of data internals, leading to a cascading physical data dependency. This intertwined physical dependence further hinders the ability to offer data provenance, data versioning, compliance, auditing, usage analytics, etc. Even worse, it hampers the rate of ML innovations, because every new project needs to re-learn these complex dependencies. Furthermore, in large organizations, we observed various teams solving similar ML problems, often without knowledge of each other. A central data management system will enable these teams to discover and share relevant data and to foster new collaborations between teams.

Figure 1: Conventional ML workflow and data silos.

Based on these observations, we see a strong need to build a dataset management system that integrates into the ML life cycle, guarantees compliance, enforces legal terms of use, offers a high-level data access abstraction to allow ML engineers to focus on ML tasks, and provides tooling for easy access to data insights in order to accelerate ML experimentation. The technical challenges to achieve these goals can be categorized into the following areas, to be discussed next: 1) supporting the engineering teams, 2) supporting the machine learning life cycle, and 3) supporting the variety of ML frameworks and ML data.

2.1 Machine Learning Teams
The teams that build machine learning features comprise ML engineers, data scientists, software engineers, and the product legal team. These distinct specialists need a data system to support them. ML engineers focus on finding the best ML model for the given problem, exploring different features² of the data, and experimenting with different combinations of data and ML models. Software engineers focus on building the toolset to support the end-to-end ML model training pipeline, starting from data cleaning, ETL, feature engineering, model training, and model evaluation, through to eventual model deployment. Data scientists work closely with both ML specialists and software engineers on feature engineering, annotation efforts, and data quality control. The product legal team oversees the legal terms of use of the data, privacy, compliance (such as GDPR³), and auditing. Existing ML data solutions do not provide an integrated data platform to support all of the above-mentioned collaboration. Instead, data silos are created by each discipline, and discrepancies arise due to the difficulties of maintaining consistency across data silos.

¹ Only in recent years has the need for an ML data platform attracted attention from companies and our community, e.g., DEEM (International Workshop on Data Management for End-to-End Machine Learning).
² Features are measurable properties or characteristics extracted or derived from raw data. Feature engineering is the process of obtaining features from raw data.
³ GDPR stands for the EU's General Data Protection Regulation.
Furthermore, acquiring and augmenting data is expensive and time consuming. Hence, it is imperative that teams are able to share their data within an organization under appropriate compliance and privacy constraints. In order to share the data, stakeholder identification on the source of the data is another key aspect for enterprises [8]. In conventional service models, data is encapsulated behind service interfaces and any change in data is not known to the consumers of the service. In machine learning, data itself is an interface which needs to be tracked and versioned. Hence, the ability to identify the ownership, the lineage, and the provenance of data is critical for such a system. Since data evolves through the life of the project, teams require data life cycle management features to understand how the data has changed. Again, silos make data sharing and discovery, data lineage and provenance tracking, version control, and access control difficult, if not impossible.

2.2 Machine Learning Life Cycle
The machine learning life cycle is highly iterative and experimental. While data is being injected during the data collection stage, the data curation process starts incrementally with ETL pipelines to homogenize syntactic and semantic discrepancies, and the cleansed data is pipelined to feature engineering and annotation efforts. After the initial batches of features and annotations are ready, ML engineers start exploring different features and small-scale training experiments. Depending on the experiment results, existing feature engineering and annotation efforts may be paused or continued, or a new data engineering effort may be started. Only after many, sometimes hundreds of, experiments does a promising mix of data, ML features, and a trained ML model emerge.

In order to assist tasks among these intertwined stages, an ML data platform needs to support
(1) a conceptual data model to describe both slowly changing data assets and volatile features,
(2) simple mechanisms for continuous data injection,
(3) a domain specific language for effective data and feature engineering,
(4) a hybrid data store suited for both large volumes of stable raw assets and highly concurrent updates on volatile data, with physical data layout designs that support data versioning and are optimized for incremental updates, delta tracking among versions, and ML training access patterns,
(5) a data access interface integrated with major ML frameworks, allowing ubiquitous data access both within data centers as well as on the edge,
(6) explicit version management to ensure reproducibility of ML experiments,
(7) tracking for data lineage and provenance, and
(8) a toolset and a programming interface for data exploration and discovery.

It is common to conduct ML experiments across different mixes and matches of data and features. In any highly experimental process, it is essential that one can reproduce the results as needed. Existing ML solutions, which rely on users to bring their own data and data storage system, lack the support to easily reproduce the training results. The burden falls onto the users to manually track the dependencies among data versions, training tasks, and ML models. An ML data platform should be capable of tracking the dependencies among data, model, and code, as well as data lineage between raw data assets and the derived annotations and features. For example, in case of errors found in the source dataset, we can identify all the dependent and derived data, and notify their owners to regenerate the labels or annotations, or re-train the ML models.

In addition, ML experiments often begin with interactive data exploration before launching large scale training in the data center. An ML data platform should support both server-side data processing, such as predicate push-down and expression evaluation, as well as client-side remote streaming data access. The transition from training on a laptop to training in the data center should require no code changes in the applications, so that the cycle from exploration to analysis to experimentation is as short as possible. Furthermore, similar to common large-scale data systems, automated sharding is essential to support distributed training by leveraging data parallelism, as is data caching in order to shorten the data distance.

2.3 Machine Learning Frameworks and Data
Many machine learning frameworks are available today, with new ones emerging at times. Most of them are free and open-source, and many of them are still under active development. Machine learning frameworks provide various learning algorithms and models for different problem domains in ML. Since each framework has strengths for different models, it is very common for an organization to utilize several frameworks within a given project (including different versions of the same framework). Most of the frameworks rely on the file system to access training data, with some frameworks, such as TensorFlow and MXNet, offering additional data reader interfaces to make I/O more efficient.
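As a rough illustration of this file-system-centric access pattern (generic TensorFlow code, not MLdp-specific; the directory glob and image size are placeholder assumptions), a typical file-based input pipeline looks like the following sketch:

import tensorflow as tf

# Hypothetical directory of training images visible on the local file system.
IMAGE_GLOB = '/data/images/*.jpeg'

def load_image(path):
    # Read raw bytes from the file system, then decode and resize the image.
    raw = tf.io.read_file(path)
    img = tf.image.decode_jpeg(raw, channels=3)
    return tf.image.resize(img, [224, 224])

# list_files + map + batch is the reader pattern that frameworks
# layer on top of a plain file system.
dataset = (tf.data.Dataset.list_files(IMAGE_GLOB, shuffle=True)
           .map(load_image)
           .batch(32)
           .prefetch(1))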
An integrated ML data platform should provide file I/O data access interfaces for common ML frameworks, with the extensibility to plug in custom record-level iterators or data readers for specific frameworks, such as RecordIO for MXNet and TFRecordReader for TensorFlow. These data connectors – a framework specific implementation of the data access interface – should provide bi-directional integration. For example, data stored in a proprietary format can be accessed directly from the ML data platform, while data stored in the platform's native format can be retrieved in a variety of formats, such as TFRecords.

In addition to variations in the data format, ML applications use wide varieties of rich data types, such as image, video, audio, document, text, numerical data, sensor data, etc. These rich data types should be first class citizens in the ML data platform so that the system can provide index-ability into these data in order to support efficient search and data exploration.

ML data can evolve in three dimensions – variety, volume and velocity. Typically, raw data assets are large in volume and change slowly. For example, individual images in a computer-vision ML (CV/ML) project will rarely change after they are acquired and stored, unless there is a problem in the image. On the other hand, features or annotations for a given image may change at a higher velocity but with lower volume. Different models may require different annotations. For example, bounding boxes on the objects in the images are required to train an object detection model, while textual labels of the scene categories are needed to train a scene recognition model. Additionally, if previous ML experiments did not yield satisfying results, existing annotations may need to be refined, or to be redone in an entirely different way. A new ML project might also trigger new annotation efforts. All of the above lead to increased volume, variety and velocity of ML data. The ML data system needs to support agility in all three dimensions.

Figure 2: Integrated ML and data workflow: MLdp acts as data interface with data model, data access and life cycle management.

To sum up, based on the above observations and the experiences from using existing solutions, we believe the effort to design and implement a purpose-built data management system for ML is well deserved. As shown in Figure 2, MLdp is more than a data store. The goals of MLdp are to assist ML tasks, such as annotation, exploration, and data and feature engineering. MLdp acts as a data interface providing abstractions including a data model, data access interface, and data life cycle management for ML. MLdp also integrates with major ML frameworks' data access layer, allowing ML pipelines to focus on experimentation and evaluation, leaving the onus of data access and management to MLdp.

2.4 Related Work
Thus far, we have mentioned many previous works which focus on different phases of the ML life cycle, such as ML algorithms, model training, and evaluation. ModelHub [17], ProvDB [16], and Model Governance [18] described systems that track the lineage of machine learning models. Compared to these works, MLdp aims to provide a dataset-centric system to manage the life cycle of ML datasets, with the capability to track dependencies among data, models and training tasks.

DataLab [23] and SciDB [19] showcased systems that are designed for scientific experiments to track the provenance and versions of underlying data, which is similar to MLdp's data provenance and lineage tracking feature. In addition, MLdp provides data access interfaces integrated with popular machine learning frameworks and data compliance features tailored to assist ML data management. Petastorm [7] offers a data access API and seamless integration with frameworks, such as TensorFlow. We share a similar vision and offer additional features such as versioning, compliance, and lineage tracking to ease the data management burden on the user.
restriction of main memory size. MLdp tables are interoperable with Apache Spark Resilient Distributed Datasets (RDDs), Spark DataFrames, Pandas, and R data frames.

ML datasets often contain a list of raw files. For example, to build a human posture and movement classification model, one entity in the dataset may consist of a set of video files of the same subject/movement from different angles, plus a JSON file containing the accelerometer signals. MLdp treats files as byte stream columns in the table. It is up to the system to decide whether to store the file contents in-row or off-row. Since most ML frameworks support file I/Os, MLdp provides streaming file access to those files, regardless of in-row or off-row storage layout. Moreover, MLdp allows user-defined access paths, such as primary indexes, secondary indexes, partial indexes (or, filtered indexes), etc., for efficient data exploration and filtering.

3.2 Data Interface
MLdp offers a high-level data language and a low-level data API. The high-level domain specific language (DSL) offers a declarative programming paradigm that is aimed at assisting server-side batch data and feature engineering tasks in distributed and parallel execution environments closer to the data. The low-level data API offers an imperative programming paradigm that is well integrated with major ML frameworks and provides streaming I/O for client-side data processing. In order to shorten the distance to data for client-side data access, MLdp provides data center caching for teams that train models in custom execution clusters. The MLdp cache service is a distributed cache service that can also be deployed to the same compute cluster where ML training occurs.

One of the key observations that applied ML teams realized, but which is not surprising to the database community, is that a significant amount of time is spent on data and feature engineering compared to the amount of time spent on training. While the DSL is still an experimental feature, we are confident that the DSL will help improve the productivity of ML teams. For the sake of clarity and the familiarity of the SQL language to the community, we will use the SQL-like variant of the DSL to illustrate the use cases. Another variant of the DSL, which is more Pythonic by using the builder pattern, is omitted due to the space limitation.

3.2.1 DSL. The declarative nature of the DSL allows users to describe the intent, and the programs will be compiled and optimized into execution graphs, which can be either executed locally or submitted to an elastic compute cluster for execution. The local execution mode provides testing and debugging before execution in the cluster. The query optimization techniques developed over past decades for relational databases, such as exploiting interesting properties, view matching, and index selection, are applicable to MLdp's use cases.

Currently, we have integrated the DSL with Python, a popular language in the ML community, in order to support user code and ML frameworks written in Python. Instead of listing a full specification of the MLdp DSL, we will demonstrate the DSL in action with the following two examples. More examples can be found in Appendix A.
(1) Create a dataset from a set of raw images and JSON files describing the images.
(2) Use a user-supplied ML model to create labels and publish them as a new version of an existing annotation.

Create a dataset. The sample code⁴ in Listing 1 shows how to create a dataset, named human_posture_movement, from the raw files under the path "/opt/data/hpm". The files are organized as follows. The prefix of each file is the unique identifier of a logical entity (asset) in the dataset. An asset in this example contains a set of JPEG files, and the accelerometer readings in one JSON file.

/opt/data/hpm
    af02d55.1.jpeg    <- files with the same prefix
    af02d55.2.jpeg       belong to the same asset
    af02d55.json
    b012366.1.jpeg
    b012366.json
    ...

The CREATE dataset...WITH PRIMARY_KEY clause defines the metadata of the dataset, while the SELECT clause describes the input data. The syntax <qualifier>/<name>@<version> denotes the uniform resource identifier (URI) for MLdp objects. In this example, the URI is dataset/human_posture_movement without the version, since the CREATE statement always creates a new object with a version starting from 1.0.0.⁵ The FROM sub-clause declares the variable binding, _FILE_, to each file in the given directory. The files are grouped by the path prefix, _FILE_.NAME.split('.')[0], which is declared as the primary key of the dataset. Within each group of files, all the JPEG files are put into the Images collection column, and the JSON file is put into the Accelerometer column. The function, DSL().Run(), will compile and execute the statement.

⁴ For the sake of simplicity, we omit settings such as permission, expiration, etc., and focus on the data definition and data consumption grammar.
⁵ The available qualifiers are: dataset, annotation, split, and package. For weak objects, i.e., annotations and splits, full qualifiers in the form of dataset/<name>@<version>/(annotation|split) are required.
# the following statement will create a
# DatasetTable 'hpm' with the
# columns (SessionId, Images, Accelerometer)
status = ml.data.DSL(
    """CREATE dataset/human_posture_movement
       WITH PRIMARY_KEY(SessionId) AS
       SELECT SessionId,
           CASE WHEN _FILE_.EXT='jpeg'
                THEN COLLECTION(_FILE_) AS Images
           CASE WHEN _FILE_.EXT='json'
                THEN _FILE_ AS Accelerometer
           END
       FROM FILES IN './data/hpm' AS _FILE_
       GROUP BY _FILE_.NAME.split('.')[0] AS SessionId""").Run()
Listing 1: Creating a dataset from files under a local folder

Use a pre-trained model to create new labels. The sample code in Listing 2 shows how to create a new version of an existing annotation on the human_posture_movement dataset. The reserved symbol, @, is used to specify a particular version of an object. ALTER...WITH REVISION⁶ will create a revision based on the specified version. In this example, the new version will be human_activity@1.3.0. The ON sub-clause specifies the version of the dataset which this annotation references. The SELECT...FROM clause defines the input data source. SessionId is the foreign key to the parent dataset. This example also demonstrates user code (Turi Create [1]) integration with the MLdp DSL. User code dependencies must be declared by the import statements.

# the following statement will create a new version
# of the annotation, 'human_activity', with
# the columns (SessionId, Activity)
status = ml.data.DSL(
    """import turicreate as tc;
       ALTER annotation/human_activity@1.2.0
       WITH REVISION, FOREIGN_KEY(SessionId)
       ON dataset/human_posture_movement@1.0.0 AS
       SELECT SessionId,
           tc.load_model('dfs://ml/activity_classifier.ml').
               predict(Accelerometer) AS Activity
       FROM human_posture_movement@1.0.0""").Run()
Listing 2: Creating a new version of an annotation that contains automatically generated labels by a pre-trained ML model

The pre-trained model, activity_classifier.ml, is used to predict the activity based on the accelerometer signals, and the prediction is stored as a label in the Activity column.

⁶ Similar syntax is available for dataset, split and package version updates. Available options for version updates are SCHEMA, REVISION, and PATCH. Moreover, for the sake of brevity, the examples will omit the full qualifier for annotations and splits when the full qualifier is obvious in the context.

3.2.2 Low-level data primitives. MLdp's data access primitives provide direct access to data via both streaming file I/O and the table API. The on-demand streaming enables effective data parallelism in distributed training.

Streaming File I/O. It is common that ML datasets contain collections of raw multimedia files that the ML models directly work on. The MLdp client SDK allows applications to mount MLdp objects, and the mount point exposes those raw files in a logical file system. The mount command implements data streaming on demand. That is, physical blocks containing the files, or the portion of a table being accessed, are transmitted to the client machine just in time. Currently, rudimentary prefetching and local caching are implemented in the mount-client. Since most, if not all, of the ML frameworks support file I/O in their data access abstraction, the mounted logical file system provides a basic integration with most of the ML frameworks from day one. To support ML applications running on the edge, MLdp also provides direct file access via a REST API.

Listing 3 shows a Python application that mounts the OpenImages dataset [12], and performs corner detection on each image by directly reading the image files.

import cv2
import numpy as np
import scandir

# mount the dataset
data = ml.data.mount('dataset/OpenImagesV4@1.0.0', './mnt')

# data.raw_file_path points to the mount-point folder
# that contains the raw files
for entry in scandir.scandir(data.raw_file_path):
    # Harris Corner Detector
    img = cv2.imread(entry.path, cv2.IMREAD_COLOR)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = np.float32(gray)
    dst = cv2.cornerHarris(gray, 2, 3, 0.04)
Listing 3: Mounting the OpenImages dataset

Table API. As described in Section 3.1.3, MLdp stores data as tables in the columnar format, with the support of user-defined access paths (i.e., the secondary indexes). The table API allows applications to directly address both user tables and secondary indexes.

Listing 4 shows a simple application which uses a secondary index to locate data of interest, and then performs a key/foreign-key join to retrieve the images from the primary dataset table for image thresholding processing.

import cv2

# mount the dataset
data = ml.data.mount('dataset/OpenImagesV4@1.0.0', './mnt')

# use the secondary index to select images of interest
img_class_idx = data.indexes['img_class']
person_class = img_class_idx.where(Category='Person')

# fetch the data by joining back to the primary index
person_data = data.primary_table.join(person_class,
                                      on='ImageId')

# now load all the person images for thresholding
for row in person_data:
    img = cv2.imread(row['filename'], cv2.IMREAD_GRAYSCALE)
    _, thresh1 = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
    _, thresh2 = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)
    _, thresh3 = cv2.threshold(img, 127, 255, cv2.THRESH_TRUNC)
Listing 4: Using the Table API with secondary indexes to access data
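As a small follow-on sketch (not from the paper), the rows returned by such a table API query can be wrapped in an ordinary Python generator and fed to whichever training loop or framework input pipeline the application uses; here, person_data is assumed to be the iterable from Listing 4 and the numeric label is a placeholder.

import cv2
import numpy as np

def person_examples(rows):
    # Yield (image, label) pairs from the joined table API result.
    for row in rows:
        img = cv2.imread(row['filename'], cv2.IMREAD_GRAYSCALE)
        if img is None:
            continue  # skip files that cannot be decoded
        yield np.float32(img) / 255.0, 1  # placeholder label for 'Person'

# usage: any framework that accepts a Python iterable can consume this, e.g.
# for image, label in person_examples(person_data): ...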
Distributed training. A common distributed training configuration involves multiple worker nodes which train the same model on different partitions of the data. Those workers will update their parameters on one or more parameter servers, which will then broadcast the aggregated parameters to the workers for subsequent training iterations. Many ML frameworks support distributed training natively by allowing a user-specified configuration describing how workers and parameter server(s) are allocated, and how data will be partitioned and shuffled to each worker. In other cases, without framework support, one has to implement the sharding, shuffling, and coordination oneself. In the following, we will illustrate how to integrate MLdp with distributed training using TensorFlow [20]. Another example that uses MXNet [15] is listed in Appendix B.

Listing 5 shows how each worker in TensorFlow accesses a slice of input data by its task_index. Since MLdp streams the data on demand, each worker will only incur I/O costs proportional to the actual data loaded.

# mount the dataset
data = ml.data.mount('dataset/OpenImagesV4@1.0.0', './mnt')
filenames = data.primary_table.select_columns(['id', 'path'])

# each worker takes every num_workers-th row, starting at its task_index
input_partition = tf.strided_slice(filenames,
                                   [task_index],
                                   [filenames.num_rows()],
                                   strides=[num_workers])

# training logic here ...
Listing 5: Distributed training integrated with TensorFlow

3.3 Storage and Data Layout Design
MLdp's design benefits from many key learnings found in database systems. In this section, we will discuss MLdp's storage layer design in order to meet the following requirements for supporting the ML life cycle.
• A hybrid data store that supports both high velocity updates at the data curation stage and high throughput reads at the training stage.
• A scalable physical data layout that can support ever-growing data volume, and efficiently record and track deltas between different versions of the same object.
• Partitioned indices that support dynamic range queries, point queries, and efficient on-demand streaming for distributed training.

3.3.1 Hybrid Data Store. At early stages of data collection and data curation, raw data assets and features are stored in an in-flight data store, as shown in Figure 3. The in-flight data store uses a distributed key-value store that supports efficient in-situ updates and appends concurrently at a high velocity. The in-flight store only keeps the current version of its data. Snapshots can be taken and published to MLdp's curated data store, which is a versioned data store based on a distributed cloud storage system. The curated data store is read-optimized, and yet supports efficient append-only updates and sub-optimal in-situ updates based on copy-on-write. Changes to a snapshot in the curated store will result in a new version of the snapshot. A published snapshot will be kept in the system to ensure reproducibility of ML experiments until it is archived or purged. How MLdp optimizes both workloads in different data stores lies in the physical data layout design, which will be discussed in detail in Section 3.3.2.

Data movement between the in-flight and curated data stores is managed by a subsystem, named "data-pipe". Each logical data block in both data stores maintains a unique identifier, a logical checksum, and a timestamp of last modification. Data-pipe uses this information to track deltas between different versions of the same dataset.

Mature datasets can be removed from the in-flight store after storing the latest snapshot in the curated store. On the other hand, if needed, a copy of a snapshot can be moved back to the in-flight store for further modification at a high velocity and volume. After the modification is complete, it can be published to the curated data store as a new version.

Despite the multiple data stores, MLdp offers a unified data access interface. The visibility of the two different data stores is purely for administrative reasons, to ease the management of the data life cycle by the data owners. It is also worth noting that using data from the in-flight store for ML experiments is discouraged, since the experiment results may not be reproducible due to the fact that data in the in-flight store may be overwritten.

3.3.2 Scalable Data Layout. MLdp stores its data in partitions, managed by the system. The partitioning scheme cannot be directly specified by the users. However, users may define a sort key on the data in MLdp. The sort key will be used as the prefix of the range partition key. Since there is no uniqueness requirement on the user-defined sort key, in order to provide a stable sorting order based on data injection time, the system will append a timestamp to the partition key. If no sort key is defined, the system will automatically use the hash of the primary key as the range partition key. The choices of the sort keys depend on the sequential access patterns to the data, similar to the problem of physical database design in relational databases.
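The following small sketch (illustrative only, not MLdp's actual implementation; the function and variable names are invented for illustration) restates the partition-key derivation just described: the user-defined sort key, when present, becomes the prefix of the range partition key with an injection timestamp appended for a stable order; otherwise a hash of the primary key is used.

import hashlib
from datetime import datetime, timezone

def range_partition_key(primary_key, sort_key=None, inject_time=None):
    if sort_key is not None:
        # Sort key as the prefix; the injection timestamp provides a stable
        # order (and extra entropy) since the sort key need not be unique.
        ts = (inject_time or datetime.now(timezone.utc)).strftime('%Y%m%d%H%M%S%f')
        return (str(sort_key), ts)
    # No sort key defined: fall back to a hash of the primary key.
    return (hashlib.sha1(str(primary_key).encode('utf-8')).hexdigest(),)

# e.g. range_partition_key('af02d55', sort_key='2019-03-01')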
In case of data skew in the user-defined sort key, the appended timestamp column helps alleviate the partition skew problem. The timestamp provides sufficient entropy to split a partition either based on heat or based on volume. In addition, range partitioning will allow the data volume to scale out efficiently without the issue of global data shuffling that
instead they always connect to the MLdp API endpoint. If MLdp finds a cache service that is collocated with the execution cluster where the application is running, it will notify the MLdp client to redirect all subsequent data API calls to the cache cluster. The MLdp client has a built-in fail-safe: in case the cache service becomes unavailable, the data API calls fall back to the MLdp service endpoint.

Currently, many different execution environments are used by different teams, and more are being added as ML projects and teams proliferate in various domains. The cache service is designed to be deployable to any virtual cluster environment. The goal of such a design choice is to be able to set up the cache service as soon as the execution environment is ready.

Another design goal of the MLdp cache service is to achieve read scale-out, in addition to the reduction of data latency. The system throughput increases by scaling out existing cache services, or by setting up new cache deployments. We made a conscious design choice that the MLdp cache service only caches read-only snapshots of the data, i.e., the published versions of data. The decision favors a simple design to guarantee strong consistency of the data; the anomalies caused by an eventual consistency model would impede the reproducibility guarantee. If mutable data were also cached, then in order to ensure transactional consistency of the cached data, data under a higher volume of updates would not only fail to benefit from caching, but the frequent cache invalidation would also put counterproductive overhead on the cache service.

4 FUTURE WORK
The inception of MLdp is driven by the lessons learned from using previously available ML solutions, which led to data silos. As the complexity and the limitations imposed by data silos start to become the bottleneck of ML projects, it also becomes evident that we need to build a data platform that integrates with the ML workflow, and provides a unified data abstraction through all steps of the life cycle. So far, we discussed many of the important design aspects of MLdp, the design goals, and the rationale behind them. However, the system is far from complete. In this section, we will call out future work, which is categorized into the following areas: data discoverability, system improvements, and eco-system integration. For each of the three areas, we will highlight the strategic features with the goal of maximizing the productivity of ML teams.

4.1 Data Discovery and Exploration
ML projects often have a time-consuming human-in-the-loop phase of finding the right data for training tasks. As more datasets have been loaded into MLdp, new projects may benefit from exploring the existing data, instead of starting a new data acquisition journey. MLdp provides basic catalog search capabilities, such as queries by names, descriptions, samples of the datasets, etc. Data discoverability for ML goes beyond the basic catalog search.

Discoverability is a key requirement for ML data platforms. Catalog searches allow users to locate "known" information, and data exploration allows users to uncover "unknown" information. For example, consider an ML project to train a model that can predict dog breeds. Catalog searches can quickly locate datasets that contain images, or even animal images if a textual label of "animal family" is available in the dataset. The next step may be to find all the images that might seem to have a dog-like shape in them, or to understand the distribution of certain properties of the images, such as lighting, the camera angles of the shots, colors of the coats of the animals in the images, etc. Some of these data exploration activities are ML tasks by themselves with the help of pre-trained models. Oftentimes, it also involves labor-intensive crowd-sourced annotation efforts to find the bounding boxes, object boundaries, etc., before we can find out the data distribution of the image features, which may in turn prompt refinements in the annotation efforts. Even in the midst of training experiments, ML specialists may need to understand if the training dataset has a data skew on certain breeds, or is missing others, leading to a learning skew.

Although much of the work involved to support data discovery is domain specific, there are many system-level features needed in order to provide an ease-of-use programming experience and a fast turn-around time for interactive data exploration. One example is to have tooling support, such as an interactive data studio, data visualization, etc. Another example is to start with a common ML domain, e.g., computer-vision ML (CV/ML), by building a background process to learn the ontology of the dataset. By materializing the pre-computed distribution of the data assets using the ontology, the system can provide real-time data exploration on common queries.

After the data of interest is identified, the user may want to publish it as a materialized dataset into MLdp. Since the data may come from different datasets with different formats, it becomes evident that a unified presentation of the heterogeneous data formats is required. The DSL should support user-defined functions and user-defined data type plug-ins to help with data transformation.

4.2 System Improvements
Two areas of system improvement to boost the productivity of ML teams are: 1) to reduce data latency and to increase data throughput, and 2) to reduce human-in-the-loop time.
As discussed in Section 3.4, MLdp's cache service offers a basic object-based read-cache to reduce data distance and to alleviate the cross data center network bandwidth bottleneck with collocated cache services. Object-based caching will cache data at the object level, i.e., the cached data maps directly to a persisted object in the data system. For example, if the dataset OpenImages@1.0.0 is cached, any access to the dataset results in a cache hit. We plan to enhance the MLdp cache service by allowing intent-based caching. That is, the system can dynamically decide to cache frequently used query results. Similar to view matching in databases, at data access time, MLdp's cache system will try to match the user query with the cached query definitions. If the result of the user query is subsumed by that of a cached query, then the user query will directly access the cached data (with optional residual predicates, if applicable).

Another improvement is to have better prefetching prediction and local buffering. Effective prefetching can further improve read latency by pipelining data I/O with training computations. However, in model training, effective prefetching is challenging, since inputs are usually required to be randomly shuffled before being fed into the training model. Reasoning about an effective prefetching policy and adapting it at run-time is work in progress with our ML partner teams.

The second area of system improvement is to have full integration of the DSL with commonly used parallel data processing systems, such as Spark or PySpark, to accelerate the heavy-lifting data and feature engineering tasks. Our experiences show that data and feature engineering tasks contribute a significant portion of time to the ML project. We believe a declarative programming paradigm of the DSL will allow 1) users to focus on their business logic without dealing with the nuisances of large scale data processing, and 2) systemic optimizations for parallel data processing.

4.3 Eco-System Integration
As discussed in Section 3.2, many frameworks define their own data formats, such as TFRecord in TensorFlow, RecordIO in MXNet, etc. It is an on-going effort to have MLdp integrated more deeply with major ML frameworks so that user applications remain compatible at the higher API level. The goal is to have minimum code changes to the user applications so that data migration to MLdp is smooth.

There are two design considerations. The first consideration is to keep the MLdp data stored in MLdp's own format, and use a format connector to convert into the target format, e.g., TFRecord, at run-time. The second consideration is to store MLdp data in the native target format, i.e., TFRecord. In this case, we only need a lightweight connector, which directly hydrates the on-disk data into in-memory TFRecords.

The potential benefit of the first approach is that MLdp's native format becomes the mediator format, and the number of connectors needed is a linear function of the number of ML frameworks that need to be supported. However, the drawback is that the conversion happens every time at run-time. On the other hand, the benefit of the second approach is that for ML applications that use TensorFlow, it incurs almost no run-time overhead. However, the drawback is that ML applications that use frameworks other than TensorFlow have to convert the data at run-time, and the number of connectors needed will be quadratic in the number of supported ML frameworks.

Another important on-going integration is the workflow integration between MLdp and annotation platforms. Large scale annotation efforts are usually crowd-sourced using external third-party storage as staging areas for data export and import. As mentioned earlier, annotation efforts are usually interleaved with data exploration iteratively. Having a seamless workflow between annotation and data exploration will promote the productivity of ML teams.

5 CONCLUDING REMARKS
Machine learning has reemerged and flourished within the last decade. It combines all four scientific paradigms – theoretical science, experimental science, computational science, and data-intensive science [10]. Machine learning is the study of mathematical models that leverage computational advances to learn the patterns within mass data, and repeat that experiment until we reach satisfactory results. Data, at the core of the fourth paradigm, plays a critical role in the rate of convergence. From past experiences, we observe that existing ML solutions do not offer data systems well integrated into the machine learning workflow. Simple storage systems or distributed file systems are used as passive data stores, resulting in data silos and leaving every step in the ML workflow exposed to physical data dependence. The rate of innovation is then hampered by the intertwined physical dependencies. Furthermore, the silos create a barrier to offering holistic solutions to data management.

In this paper, we propose a purpose-built ML data platform whose design centers around ML data life cycle management and ML workflow integration. Its data model enables collaboration through data sharing and versioning, innovation through independent data evolution, and flexibility through minimalist schema requirements. Its data interface design aims to provide ubiquitous data access, including ease of discovery, ease of use, and interoperability.

We believe a data system with these properties will enable faster ML innovation for organizations. There are still many challenges yet to be addressed including, but not limited to, a declarative ML data language, system internal optimization, ML framework integration, etc. Finally, we hope that a system like MLdp can provide a basis for further exploration of how to support machine learning needs with data systems.
# the following statement will create
# a package 'outdoor_activity'
status = ml.data.DSL(
    """CREATE package/outdoor_activity(train, test) AS
       SELECT SessionId, Images, Accelerometer, Activity
       FROM (dataset/human_posture_movement@1.2.0 AS d
             JOIN annotation/human_activity@1.3.0 AS a
             ON d.SessionId = a.SessionId)
       JOIN split/outdoor@1.0.0 AS s
       ON d.SessionId = s.SessionId""").Run()
Listing 7: Creating a package

Train an activity classifier. Listing 8 shows a simple model training example. It loads the package, outdoor_activity, into both the train_data and test_data tables. Next, it creates and trains the model using the training data. Finally, it evaluates the model performance using the testing data.

# the following statement will load the package
# and train the activity_classifier using the
# training data
train_data, test_data = ml.data.DSL(
    """SELECT SessionId, Images, Accelerometer, Activity
       FROM package/outdoor_activity@1.0.0""").Run()

From the above examples, we can see that the DSL leverages most of the SQL expressiveness to simplify the tasks of data operations. One of the open design issues for the MLdp DSL in the near future is to incorporate declarative machine learning into the language [3, 4].

Appendix B AN EXAMPLE WITH MXNET
The following example uses the MXNet [15] data loading API, ImageRecordIter. In MXNet, one can use ImageRecordIter to specify what partition of the input to read. After mounting the targeted MLdp dataset from the command line, the im2rec.py tool from MXNet is used to generate the list of input files. The file list is then used by ImageRecordIter, together with the parameters part_index and num_parts, to pipeline the partition of the input data that each worker trains on, as shown in Listing 9.

# MXNet provides tools/im2rec.py to generate training
# and validation lists, 'openimages_train.lst' and
# 'openimages_val.lst', by the following command:
# >>> python tools/im2rec.py openimages ./data --list True
#     --recursive True --train-ratio .85 --exts .png

store = kv.create('dist')
trainer = gluon.Trainer(..., kvstore = store)
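Listing 9 is truncated in this excerpt. As a rough, hedged sketch of the partitioned reading described above (this is not the paper's Listing 9; the record file name, image shape, batch size, and worker variables are placeholder assumptions, and the list generated by im2rec.py is assumed to have been packed into a RecordIO file), an ImageRecordIter can be told which slice of the input the current worker should read:

import mxnet as mx

num_workers, worker_rank = 4, 0   # placeholder cluster configuration

# num_parts / part_index implement the per-worker sharding described above:
# each worker reads only its 1/num_parts slice of the input records.
train_iter = mx.io.ImageRecordIter(
    path_imgrec='openimages_train.rec',  # RecordIO file packed from the list
    data_shape=(3, 224, 224),            # placeholder image shape
    batch_size=32,
    shuffle=True,
    num_parts=num_workers,
    part_index=worker_rank)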