insights in our observation from existing solutions are as follows. The emphasis of existing solutions has been on the training, experimentation, evaluation, and deployment phases of the life cycle, as shown in Table 1. However, little emphasis¹ has been placed on the needs of an integrated ML data system. Instead, during data collection, data are often injected into cloud blob storage systems. The heavy-lifting ETL (extract-transform-load) pipelines are often performed using data processing systems, such as MapReduce or Spark, with the results stored in distributed file systems or back to the blob storage. The crowd-sourced annotation tasks often require data to be exported and imported to/from publicly accessible data storage systems. The derived and transformed data between different stages are simply stored in storage systems based on the preferences of the data engineers. These data silos are used as simple, passive data stores, leaving the burden of integrating data access with ML tasks to the ML engineers. The ML engineers often face the challenges of coping with different data layouts and formats, and with different data access interfaces from different silos throughout different stages, instead of focusing on ML challenges.

As illustrated in Figure 1, developers, using different tools in every stage of the ML life cycle, need to develop an understanding of data internals, leading to a cascading physical data dependency. This intertwined physical dependence further hinders the ability to offer data provenance, data versioning, compliance, auditing, usage analytics, etc. Even worse, it hampers the rate of ML innovations, because every new project needs to re-learn these complex dependencies. Furthermore, in large organizations, we observed various teams solving similar ML problems, often without knowledge of each other. A central data management system will enable these teams to discover and share relevant data and to foster new collaborations between teams.

Figure 1: Conventional ML workflow and data silos.

Based on these observations, we see a strong need to build a dataset management system that integrates into the ML life cycle, guarantees compliance, enforces legal terms of use, offers a high-level data access abstraction to allow ML engineers to focus on ML tasks, and provides tooling for easy access to data insights in order to accelerate ML experimentation. The technical challenges to achieve these goals can be categorized into the following areas, to be discussed next: 1) supporting the engineering teams, 2) supporting the machine learning life cycle, and 3) supporting the variety of ML frameworks and ML data.

2.1 Machine Learning Teams
The teams that build machine learning features comprise ML engineers, data scientists, software engineers, and the product legal team. These distinct specialists need a data system to support them. ML engineers focus on finding the best ML model for the given problem, exploring different features² of the data, and experimenting with different combinations of data and ML models. Software engineers focus on building the toolset to support the end-to-end ML model training pipeline, starting from data cleaning, ETL, feature engineering, model training, and model evaluation, through to eventual model deployment. Data scientists work closely with both ML specialists and software engineers on feature engineering, annotation efforts, and data quality control. The product legal team oversees the legal terms of use of the data, privacy, compliance (such as GDPR³), and auditing. Existing ML data solutions do not provide an integrated data platform to support all of the above-mentioned collaboration. Instead, data silos are created by each discipline, and discrepancies arise due to the difficulties of maintaining consistency across data silos.

¹ Only in recent years has the need for an ML data platform attracted attention from companies and our community, e.g., DEEM (International Workshop on Data Management for End-to-End Machine Learning).
² Features are measurable properties or characteristics extracted or derived from raw data. Feature engineering is the process of obtaining features from raw data.
³ GDPR stands for the EU's General Data Protection Regulation.
Furthermore, acquiring and augmenting data is expensive and time consuming. Hence, it is imperative that teams are able to share their data within an organization under appropriate compliance and privacy constraints. In order to share the data, stakeholder identification on the source of the data is another key aspect for enterprises [8]. In conventional service models, data is encapsulated behind service interfaces and any change in data is not known to the consumers of the service. In machine learning, data itself is an interface which needs to be tracked and versioned. Hence, the ability to identify the ownership, the lineage, and the provenance of data is critical for such a system. Since data evolves through the life of the project, teams require data life cycle management features to understand how the data has changed. Again, silos make data sharing and discovery, data lineage and provenance tracking, version control, and access control difficult, if not impossible.

2.2 Machine Learning Life Cycle
The machine learning life cycle is highly iterative and experimental. While data is being injected during the data collection stage, the data curation process starts incrementally with ETL pipelines to homogenize syntactic and semantic discrepancies, and the cleansed data is pipelined to feature engineering and annotation efforts. After the initial batches of features and annotations are ready, ML engineers start exploring different features and small-scale training experiments. Depending on the experiment results, existing feature engineering and annotation efforts may be paused or continued, or a new data engineering effort may be started. Only after many, sometimes hundreds of, experiments does a promising mix of data, ML features, and a trained ML model emerge.

In order to assist tasks among these intertwined stages, an ML data platform needs to support
(1) a conceptual data model to describe both slowly changing data assets and volatile features,
(2) simple mechanisms for continuous data injection,
(3) a domain specific language for effective data and feature engineering,
(4) a hybrid data store suited for both large volumes of stable raw assets and highly concurrent updates on volatile data, with physical data layout designs that support data versioning and are optimized for incremental updates, delta tracking among versions, and ML training access patterns,
(5) a data access interface integrated with major ML frameworks, allowing ubiquitous data access both within data centers as well as on the edge,
(6) explicit version management to ensure reproducibility of ML experiments,
(7) tracking for data lineage and provenance, and
(8) a toolset and a programming interface for data exploration and discovery.

It is common to conduct ML experiments across different mixes and matches of data and features. In any highly experimental process, it is essential that one can reproduce the results as needed. Existing ML solutions, which rely on users to bring their own data and data storage system, lack the support to easily reproduce the training results. The burden falls onto the users to manually track the dependencies among data versions, training tasks, and ML models. An ML data platform should be capable of tracking the dependencies among data, model, and code, as well as data lineage between raw data assets and the derived annotations and features. For example, in case of errors found in the source dataset, we can identify all the dependent and derived data, and notify their owners to regenerate the labels or annotations, or re-train the ML models.

In addition, ML experiments often begin with interactive data exploration before launching large scale training in the data center. An ML data platform should support both server-side data processing, such as predicate push-down and expression evaluation, as well as client-side remote streaming data access. The transition from training on a laptop to training in the data center should require no code changes in the applications, so that the cycle from exploration to analysis to experimentation is as short as possible. Furthermore, similar to common large-scale data systems, automated sharding is essential to support distributed training by leveraging data parallelism, as is data caching in order to shorten the data distance.

2.3 Machine Learning Frameworks and Data
Many machine learning frameworks are available today, with new ones emerging at times. Most of them are free and open-source, and many of them are still under active development. Machine learning frameworks provide various learning algorithms and models for different problem domains in ML. Since each framework has strengths for different models, it is very common for an organization to utilize several frameworks within a given project (including different versions of the same framework). Most of the frameworks rely on the file system to access training data, with some frameworks, such as TensorFlow and MXNet, offering additional data reader interfaces to make I/O more efficient.
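As a rough illustration of this file-system-centric access pattern (generic TensorFlow code, not MLdp-specific; the directory glob and image size are placeholder assumptions), a typical file-based input pipeline looks like the following sketch:

import tensorflow as tf

# Hypothetical directory of training images visible on the local file system.
IMAGE_GLOB = '/data/images/*.jpeg'

def load_image(path):
    # Read raw bytes from the file system, then decode and resize the image.
    raw = tf.io.read_file(path)
    img = tf.image.decode_jpeg(raw, channels=3)
    return tf.image.resize(img, [224, 224])

# list_files + map + batch is the reader pattern that frameworks
# layer on top of a plain file system.
dataset = (tf.data.Dataset.list_files(IMAGE_GLOB, shuffle=True)
           .map(load_image)
           .batch(32)
           .prefetch(1))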
An integrated ML data platform should provide file I/O data access interfaces for common ML frameworks, with the extensibility to plug in custom record-level iterators or data readers for specific frameworks, such as RecordIO for MXNet and TFRecordReader for TensorFlow. These data connectors – a framework specific implementation of the data access interface – should provide bi-directional integration. For example, data stored in a proprietary format can be accessed directly from the ML data platform, while data stored in the platform's native format can be retrieved in a variety of formats, such as TFRecords.

In addition to variations in the data format, ML applications use wide varieties of rich data types, such as image, video, audio, document, text, numerical data, sensor data, etc. These rich data types should be first class citizens in the ML data platform so that the system can provide index-ability into these data in order to support efficient search and data exploration.

ML data can evolve in three dimensions – variety, volume and velocity. Typically, raw data assets are large in volume and change slowly. For example, individual images in a computer-vision ML (CV/ML) project will rarely change after they are acquired and stored, unless there is a problem in the image. On the other hand, features or annotations for a given image may change at a higher velocity but with lower volume. Different models may require different annotations. For example, bounding boxes on the objects in the images are required to train an object detection model, while textual labels of the scene categories are needed to train a scene recognition model. Additionally, if previous ML experiments did not yield satisfying results, existing annotations may need to be refined, or to be redone in an entirely different way. A new ML project might also trigger new annotation efforts. All of the above lead to increased volume, variety and velocity of ML data. The ML data system needs to support agility in all three dimensions.

Figure 2: Integrated ML and data workflow: MLdp acts as data interface with data model, data access and life cycle management.

To sum up, based on the above observations and the experiences from using existing solutions, we believe the effort to design and implement a purpose-built data management system for ML is well deserved. As shown in Figure 2, MLdp is more than a data store. The goals of MLdp are to assist ML tasks, such as annotation, exploration, and data and feature engineering. MLdp acts as a data interface providing abstractions including a data model, data access interface, and data life cycle management for ML. MLdp also integrates with major ML frameworks' data access layer, allowing ML pipelines to focus on experimentation and evaluation, leaving the onus of data access and management to MLdp.

2.4 Related Work
Thus far, we have mentioned many previous works which focus on different phases of the ML life cycle, such as ML algorithms, model training, and evaluation. ModelHub [17], ProvDB [16], and Model Governance [18] described systems that track the lineage of machine learning models. Compared to these works, MLdp aims to provide a dataset-centric system to manage the life cycle of ML datasets, with the capability to track dependencies among data, models and training tasks.

DataLab [23] and SciDB [19] showcased systems that are designed for scientific experiments to track the provenance and versions of underlying data, which is similar to MLdp's data provenance and lineage tracking feature. In addition, MLdp provides data access interfaces integrated with popular machine learning frameworks and data compliance features tailored to assist ML data management. Petastorm [7] offers a data access API and seamless integration with frameworks, such as TensorFlow. We share a similar vision and offer additional features such as versioning, compliance, and lineage tracking to ease the data management burden on the user.
restriction of main memory size. MLdp tables are interoperable with Apache Spark Resilient Distributed Datasets (RDDs), Spark DataFrames, Pandas, and R data frames.

ML datasets often contain a list of raw files. For example, to build a human posture and movement classification model, one entity in the dataset may consist of a set of video files of the same subject/movement from different angles, plus a JSON file containing the accelerometer signals. MLdp treats files as byte stream columns in the table. It is up to the system to decide whether to store the file contents in-row or off-row. Since most ML frameworks support file I/Os, MLdp provides streaming file access to those files, regardless of in-row or off-row storage layout. Moreover, MLdp allows user-defined access paths, such as primary indexes, secondary indexes, partial indexes (or, filtered indexes), etc., for efficient data exploration and filtering.

3.2 Data Interface
MLdp offers a high-level data language and a low-level data API. The high-level domain specific language (DSL) offers a declarative programming paradigm that is aimed at assisting server-side batch data and feature engineering tasks in distributed and parallel execution environments closer to the data. The low-level data API offers an imperative programming paradigm that is well integrated with major ML frameworks and provides streaming I/O for client-side data processing. In order to shorten the distance to data for client-side data access, MLdp provides data center caching for teams that train models in custom execution clusters. The MLdp cache service is a distributed cache service that can also be deployed to the same compute cluster where ML training occurs.

One of the key observations that applied ML teams realized, but which is not surprising to the database community, is that a significant amount of time is spent on data and feature engineering compared to the amount of time spent on training. While the DSL is still an experimental feature, we are confident that the DSL will help improve the productivity of ML teams. For the sake of clarity and the familiarity of the SQL language to the community, we will use the SQL-like variant of the DSL to illustrate the use cases. Another variant of the DSL, which is more Pythonic by using the builder pattern, is omitted due to the space limitation.

3.2.1 DSL. The declarative nature of the DSL allows users to describe the intent, and the programs will be compiled and optimized into execution graphs, which can be either executed locally or submitted to an elastic compute cluster for execution. The local execution mode provides testing and debugging before execution in the cluster. The query optimization techniques developed over past decades for relational databases, such as exploiting interesting properties, view matching, and index selection, are applicable to MLdp's use cases.

Currently, we have integrated the DSL with Python, a popular language in the ML community, in order to support user code and ML frameworks written in Python. Instead of listing a full specification of the MLdp DSL, we will demonstrate the DSL in action with the following two examples. More examples can be found in Appendix A.
(1) Create a dataset from a set of raw images and JSON files describing the images.
(2) Use a user-supplied ML model to create labels and publish them as a new version of an existing annotation.

Create a dataset. The sample code⁴ in Listing 1 shows how to create a dataset, named human_posture_movement, from the raw files under the path "/opt/data/hpm". The files are organized as follows. The prefix of each file is the unique identifier of a logical entity (asset) in the dataset. An asset in this example contains a set of JPEG files, and the accelerometer readings in one JSON file.

/opt/data/hpm
    af02d55.1.jpeg    <- files with the same prefix
    af02d55.2.jpeg       belong to the same asset
    af02d55.json
    b012366.1.jpeg
    b012366.json
    ...

The CREATE dataset...WITH PRIMARY_KEY clause defines the metadata of the dataset, while the SELECT clause describes the input data. The syntax <qualifier>/<name>@<version> denotes the uniform resource identifier (URI) for MLdp objects. In this example, the URI is dataset/human_posture_movement without the version, since the CREATE statement always creates a new object with a version starting from 1.0.0.⁵ The FROM sub-clause declares the variable binding, _FILE_, to each file in the given directory. The files are grouped by the path prefix, _FILE_.NAME.split('.')[0], which is declared as the primary key of the dataset. Within each group of files, all the JPEG files are put into the Images collection column, and the JSON file is put into the Accelerometer column. The function, DSL().Run(), will compile and execute the statement.

⁴ For the sake of simplicity, we omit settings such as permission, expiration, etc., and focus on the data definition and data consumption grammar.
⁵ The available qualifiers are: dataset, annotation, split, and package. For weak objects, i.e., annotations and splits, full qualifiers in the form of dataset/<name>@<version>/(annotation|split) are required.
# the following statement will create a
# DatasetTable 'hpm' with the
# columns (SessionId, Images, Accelerometer)
status = ml.data.DSL(
    """CREATE dataset/human_posture_movement
       WITH PRIMARY_KEY(SessionId) AS
       SELECT SessionId,
           CASE WHEN _FILE_.EXT='jpeg'
                THEN COLLECTION(_FILE_) AS Images
           CASE WHEN _FILE_.EXT='json'
                THEN _FILE_ AS Accelerometer
           END
       FROM FILES IN './data/hpm' AS _FILE_
       GROUP BY _FILE_.NAME.split('.')[0] AS SessionId""").Run()
Listing 1: Creating a dataset from files under a local folder

Use a pre-trained model to create new labels. The sample code in Listing 2 shows how to create a new version of an existing annotation on the human_posture_movement dataset. The reserved symbol, @, is used to specify a particular version of an object. ALTER...WITH REVISION⁶ will create a revision based on the specified version. In this example, the new version will be human_activity@1.3.0. The ON sub-clause specifies the version of the dataset which this annotation references. The SELECT...FROM clause defines the input data source. SessionId is the foreign key to the parent dataset. This example also demonstrates user code (Turi Create [1]) integration with the MLdp DSL. User code dependencies must be declared by the import statements.

# the following statement will create a new version
# of the annotation, 'human_activity', with
# the columns (SessionId, Activity)
status = ml.data.DSL(
    """import turicreate as tc;
       ALTER annotation/human_activity@1.2.0
       WITH REVISION, FOREIGN_KEY(SessionId)
       ON dataset/human_posture_movement@1.0.0 AS
       SELECT SessionId,
           tc.load_model('dfs://ml/activity_classifier.ml').
               predict(Accelerometer) AS Activity
       FROM human_posture_movement@1.0.0""").Run()
Listing 2: Creating a new version of an annotation that contains automatically generated labels by a pre-trained ML model

The pre-trained model, activity_classifier.ml, is used to predict the activity based on the accelerometer signals, and the prediction is stored as a label in the Activity column.

⁶ Similar syntax is available for dataset, split and package version updates. Available options for version updates are SCHEMA, REVISION, and PATCH. Moreover, for the sake of brevity, the examples will omit the full qualifier for annotations and splits when the full qualifier is obvious in the context.

3.2.2 Low-level data primitives. MLdp's data access primitives provide direct access to data via both streaming file I/O and the table API. The on-demand streaming enables effective data parallelism in distributed training.

Streaming File I/O. It is common that ML datasets contain collections of raw multimedia files that the ML models directly work on. The MLdp client SDK allows applications to mount MLdp objects, and the mount point exposes those raw files in a logical file system. The mount command implements data streaming on demand. That is, physical blocks containing the files, or the portion of a table being accessed, are transmitted to the client machine just in time. Currently, rudimentary prefetching and local caching are implemented in the mount-client. Since most, if not all, of the ML frameworks support file I/O in their data access abstraction, the mounted logical file system provides a basic integration with most of the ML frameworks from day one. To support ML applications running on the edge, MLdp also provides direct file access via a REST API.

Listing 3 shows a Python application that mounts the OpenImages dataset [12], and performs corner detection on each image by directly reading the image files.

import cv2
import numpy as np
import scandir

# mount the dataset
data = ml.data.mount('dataset/OpenImagesV4@1.0.0', './mnt')

# data.raw_file_path points to the mount-point folder
# that contains the raw files
for entry in scandir.scandir(data.raw_file_path):
    # Harris Corner Detector
    img = cv2.imread(entry.path, cv2.IMREAD_COLOR)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = np.float32(gray)
    dst = cv2.cornerHarris(gray, 2, 3, 0.04)
Listing 3: Mounting the OpenImages dataset

Table API. As described in Section 3.1.3, MLdp stores data as tables in the columnar format, with the support of user-defined access paths (i.e., the secondary indexes). The table API allows applications to directly address both user tables and secondary indexes.

Listing 4 shows a simple application which uses a secondary index to locate data of interest, and then performs a key/foreign-key join to retrieve the images from the primary dataset table for image thresholding processing.

import cv2

# mount the dataset
data = ml.data.mount('dataset/OpenImagesV4@1.0.0', './mnt')

# use the secondary index to select images of interest
img_class_idx = data.indexes['img_class']
person_class = img_class_idx.where(Category='Person')

# fetch the data by joining back to the primary index
person_data = data.primary_table.join(person_class,
                                      on='ImageId')

# now load all the person images for thresholding
for row in person_data:
    img = cv2.imread(row['filename'], cv2.IMREAD_GRAYSCALE)
    _, thresh1 = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
    _, thresh2 = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)
    _, thresh3 = cv2.threshold(img, 127, 255, cv2.THRESH_TRUNC)
Listing 4: Using the Table API with secondary indexes to access data
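As a small follow-on sketch (not from the paper), the rows returned by such a table API query can be wrapped in an ordinary Python generator and fed to whichever training loop or framework input pipeline the application uses; here, person_data is assumed to be the iterable from Listing 4 and the numeric label is a placeholder.

import cv2
import numpy as np

def person_examples(rows):
    # Yield (image, label) pairs from the joined table API result.
    for row in rows:
        img = cv2.imread(row['filename'], cv2.IMREAD_GRAYSCALE)
        if img is None:
            continue  # skip files that cannot be decoded
        yield np.float32(img) / 255.0, 1  # placeholder label for 'Person'

# usage: any framework that accepts a Python iterable can consume this, e.g.
# for image, label in person_examples(person_data): ...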
Distributed training. A common distributed training configuration involves multiple worker nodes which train the same model on different partitions of the data. Those workers will update their parameters on one or more parameter servers, which will then broadcast the aggregated parameters to the workers for subsequent training iterations. Many ML frameworks support distributed training natively by allowing a user-specified configuration describing how workers and parameter server(s) are allocated, and how data will be partitioned and shuffled to each worker. In other cases, without framework support, one has to implement the sharding, shuffling, and coordination oneself. In the following, we will illustrate how to integrate MLdp with distributed training using TensorFlow [20]. Another example that uses MXNet [15] is listed in Appendix B.

Listing 5 shows how each worker in TensorFlow accesses a slice of input data by its task_index. Since MLdp streams the data on demand, each worker will only incur I/O costs proportional to the actual data loaded.

# mount the dataset
data = ml.data.mount('dataset/OpenImagesV4@1.0.0', './mnt')
filenames = data.primary_table.select_columns(['id', 'path'])

# each worker takes every num_workers-th row, starting at its task_index
input_partition = tf.strided_slice(filenames,
                                   [task_index],
                                   [filenames.num_rows()],
                                   strides=[num_workers])

# training logic here ...
Listing 5: Distributed training integrated with TensorFlow

3.3 Storage and Data Layout Design
MLdp's design benefits from many key learnings found in database systems. In this section, we will discuss MLdp's storage layer design in order to meet the following requirements for supporting the ML life cycle.
• A hybrid data store that supports both high velocity updates at the data curation stage and high throughput reads at the training stage.
• A scalable physical data layout that can support ever-growing data volume, and efficiently record and track deltas between different versions of the same object.
• Partitioned indices that support dynamic range queries, point queries, and efficient on-demand streaming for distributed training.

3.3.1 Hybrid Data Store. At early stages of data collection and data curation, raw data assets and features are stored in an in-flight data store, as shown in Figure 3. The in-flight data store uses a distributed key-value store that supports efficient in-situ updates and appends concurrently at a high velocity. The in-flight store only keeps the current version of its data. Snapshots can be taken and published to MLdp's curated data store, which is a versioned data store based on a distributed cloud storage system. The curated data store is read-optimized, and yet supports efficient append-only updates and sub-optimal in-situ updates based on copy-on-write. Changes to a snapshot in the curated store will result in a new version of the snapshot. A published snapshot will be kept in the system to ensure reproducibility of ML experiments until it is archived or purged. How MLdp optimizes both workloads in different data stores lies in the physical data layout design, which will be discussed in detail in Section 3.3.2.

Data movement between the in-flight and curated data stores is managed by a subsystem, named "data-pipe". Each logical data block in both data stores maintains a unique identifier, a logical checksum, and a timestamp of last modification. Data-pipe uses this information to track deltas between different versions of the same dataset.

Mature datasets can be removed from the in-flight store after storing the latest snapshot in the curated store. On the other hand, if needed, a copy of a snapshot can be moved back to the in-flight store for further modification at a high velocity and volume. After the modification is complete, it can be published to the curated data store as a new version.

Despite the multiple data stores, MLdp offers a unified data access interface. The visibility of the two different data stores is purely for administrative reasons, to ease the management of the data life cycle by the data owners. It is also worth noting that using data from the in-flight store for ML experiments is discouraged, since the experiment results may not be reproducible due to the fact that data in the in-flight store may be overwritten.

3.3.2 Scalable Data Layout. MLdp stores its data in partitions, managed by the system. The partitioning scheme cannot be directly specified by the users. However, users may define a sort key on the data in MLdp. The sort key will be used as the prefix of the range partition key. Since there is no uniqueness requirement on the user-defined sort key, in order to provide a stable sorting order based on data injection time, the system will append a timestamp to the partition key. If no sort key is defined, the system will automatically use the hash of the primary key as the range partition key. The choices of the sort keys depend on the sequential access patterns to the data, similar to the problem of physical database design in relational databases.
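The following small sketch (illustrative only, not MLdp's actual implementation; the function and variable names are invented for illustration) restates the partition-key derivation just described: the user-defined sort key, when present, becomes the prefix of the range partition key with an injection timestamp appended for a stable order; otherwise a hash of the primary key is used.

import hashlib
from datetime import datetime, timezone

def range_partition_key(primary_key, sort_key=None, inject_time=None):
    if sort_key is not None:
        # Sort key as the prefix; the injection timestamp provides a stable
        # order (and extra entropy) since the sort key need not be unique.
        ts = (inject_time or datetime.now(timezone.utc)).strftime('%Y%m%d%H%M%S%f')
        return (str(sort_key), ts)
    # No sort key defined: fall back to a hash of the primary key.
    return (hashlib.sha1(str(primary_key).encode('utf-8')).hexdigest(),)

# e.g. range_partition_key('af02d55', sort_key='2019-03-01')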
In case of data skew in the user-defined sort key, the appended timestamp column helps alleviate the partition skew problem. The timestamp provides sufficient entropy to split a partition either based on heat or based on volume. In addition, range partitioning will allow the data volume to scale out efficiently without the issue of global data shuffling that
instead they always connect to the MLdp API endpoint. If MLdp finds a cache service that is collocated with the execution cluster where the application is running, it will notify the MLdp client to redirect all subsequent data API calls to the cache cluster. The MLdp client has a built-in fail-safe: in case the cache service becomes unavailable, the data API calls fall back to the MLdp service endpoint.

Currently, many different execution environments are used by different teams, and more are being added as ML projects and teams proliferate in various domains. The cache service is designed to be deployable to any virtual cluster environment. The goal of such a design choice is to be able to set up the cache service as soon as the execution environment is ready.

Another design goal of the MLdp cache service is to achieve read scale-out, in addition to the reduction of data latency. The system throughput increases by scaling out existing cache services, or by setting up new cache deployments. We made a conscious design choice that the MLdp cache service only caches read-only snapshots of the data, i.e., the published versions of data. The decision favors a simple design to guarantee strong consistency of the data; the anomalies caused by an eventual consistency model would impede the reproducibility guarantee. If mutable data were also cached, then in order to ensure transactional consistency of the cached data, data under a higher volume of updates would not only fail to benefit from caching, but the frequent cache invalidation would also put counterproductive overhead on the cache service.

4 FUTURE WORK
The inception of MLdp is driven by the lessons learned from using previously available ML solutions, which led to data silos. As the complexity and the limitations imposed by data silos start to become the bottleneck of ML projects, it also becomes evident that we need to build a data platform that integrates with the ML workflow, and provides a unified data abstraction through all steps of the life cycle. So far, we discussed many of the important design aspects of MLdp, the design goals, and the rationale behind them. However, the system is far from complete. In this section, we will call out future work, which is categorized into the following areas: data discoverability, system improvements, and eco-system integration. For each of the three areas, we will highlight the strategic features with the goal of maximizing the productivity of ML teams.

4.1 Data Discovery and Exploration
ML projects often have a time-consuming human-in-the-loop phase of finding the right data for training tasks. As more datasets have been loaded into MLdp, new projects may benefit from exploring the existing data, instead of starting a new data acquisition journey. MLdp provides basic catalog search capabilities, such as queries by names, descriptions, samples of the datasets, etc. Data discoverability for ML goes beyond the basic catalog search.

Discoverability is a key requirement for ML data platforms. Catalog searches allow users to locate "known" information, and data exploration allows users to uncover "unknown" information. For example, consider an ML project to train a model that can predict dog breeds. Catalog searches can quickly locate datasets that contain images, or even animal images if a textual label of "animal family" is available in the dataset. The next step may be to find all the images that might seem to have a dog-like shape in them, or to understand the distribution of certain properties of the images, such as lighting, the camera angles of the shots, colors of the coats of the animals in the images, etc. Some of these data exploration activities are ML tasks by themselves with the help of pre-trained models. Oftentimes, it also involves labor-intensive crowd-sourced annotation efforts to find the bounding boxes, object boundaries, etc., before we can find out the data distribution of the image features, which may in turn prompt refinements in the annotation efforts. Even in the midst of training experiments, ML specialists may need to understand if the training dataset has a data skew on certain breeds, or is missing others, leading to a learning skew.

Although much of the work involved to support data discovery is domain specific, there are many system-level features needed in order to provide an ease-of-use programming experience and a fast turn-around time for interactive data exploration. One example is to have tooling support, such as an interactive data studio, data visualization, etc. Another example is to start with a common ML domain, e.g., computer-vision ML (CV/ML), by building a background process to learn the ontology of the dataset. By materializing the pre-computed distribution of the data assets using the ontology, the system can provide real-time data exploration on common queries.

After the data of interest is identified, the user may want to publish it as a materialized dataset into MLdp. Since the data may come from different datasets with different formats, it becomes evident that a unified presentation of the heterogeneous data formats is required. The DSL should support user-defined functions and user-defined data type plug-ins to help with data transformation.

4.2 System Improvements
Two areas of system improvement to boost the productivity of ML teams are: 1) to reduce data latency and to increase data throughput, and 2) to reduce human-in-the-loop time.
As discussed in Section 3.4, MLdp's cache service offers a basic object-based read-cache to reduce data distance and to alleviate the cross data center network bandwidth bottleneck with collocated cache services. Object-based caching will cache data at the object level, i.e., the cached data maps directly to a persisted object in the data system. For example, if the dataset OpenImages@1.0.0 is cached, any access to the dataset results in a cache hit. We plan to enhance the MLdp cache service by allowing intent-based caching. That is, the system can dynamically decide to cache frequently used query results. Similar to view matching in databases, at data access time, MLdp's cache system will try to match the user query with the cached query definitions. If the result of the user query is subsumed by that of a cached query, then the user query will directly access the cached data (with optional residual predicates, if applicable).

Another improvement is to have better prefetching prediction and local buffering. Effective prefetching can further improve read latency by pipelining data I/O with training computations. However, in model training, effective prefetching is challenging, since inputs are usually required to be randomly shuffled before being fed into the training model. Reasoning about an effective prefetching policy and adapting it at run-time is work in progress with our ML partner teams.

The second area of system improvement is to have full integration of the DSL with commonly used parallel data processing systems, such as Spark or PySpark, to accelerate the heavy-lifting data and feature engineering tasks. Our experiences show that data and feature engineering tasks contribute a significant portion of time to the ML project. We believe a declarative programming paradigm of the DSL will allow 1) users to focus on their business logic without dealing with the nuisances of large scale data processing, and 2) systemic optimizations for parallel data processing.

4.3 Eco-System Integration
As discussed in Section 3.2, many frameworks define their own data formats, such as TFRecord in TensorFlow, RecordIO in MXNet, etc. It is an on-going effort to have MLdp integrated more deeply with major ML frameworks so that user applications remain compatible at the higher API level. The goal is to have minimum code changes to the user applications so that data migration to MLdp is smooth.

There are two design considerations. The first consideration is to keep the MLdp data stored in MLdp's own format, and use a format connector to convert into the target format, e.g., TFRecord, at run-time. The second consideration is to store MLdp data in the native target format, i.e., TFRecord. In this case, we only need a lightweight connector, which directly hydrates the on-disk data into in-memory TFRecords.

The potential benefit of the first approach is that MLdp's native format becomes the mediator format, and the number of connectors needed is a linear function of the number of ML frameworks that need to be supported. However, the drawback is that the conversion happens every time at run-time. On the other hand, the benefit of the second approach is that for ML applications that use TensorFlow, it incurs almost no run-time overhead. However, the drawback is that ML applications that use frameworks other than TensorFlow have to convert the data at run-time, and the number of connectors needed will be quadratic in the number of supported ML frameworks.

Another important on-going integration is the workflow integration between MLdp and annotation platforms. Large scale annotation efforts are usually crowd-sourced using external third-party storage as staging areas for data export and import. As mentioned earlier, annotation efforts are usually interleaved with data exploration iteratively. Having a seamless workflow between annotation and data exploration will promote the productivity of ML teams.

5 CONCLUDING REMARKS
Machine learning has reemerged and flourished within the last decade. It combines all four scientific paradigms – theoretical science, experimental science, computational science, and data-intensive science [10]. Machine learning is the study of mathematical models that leverage computational advances to learn the patterns within mass data, and repeat that experiment until we reach satisfactory results. Data, at the core of the fourth paradigm, plays a critical role in the rate of convergence. From past experiences, we observe that existing ML solutions do not offer data systems well integrated into the machine learning workflow. Simple storage systems or distributed file systems are used as passive data stores, resulting in data silos and leaving every step in the ML workflow exposed to physical data dependence. The rate of innovation is then hampered by the intertwined physical dependencies. Furthermore, the silos create a barrier to offering holistic solutions to data management.

In this paper, we propose a purpose-built ML data platform whose design centers around ML data life cycle management and ML workflow integration. Its data model enables collaboration through data sharing and versioning, innovation through independent data evolution, and flexibility through minimalist schema requirements. Its data interface design aims to provide ubiquitous data access, including ease of discovery, ease of use, and interoperability.

We believe a data system with these properties will enable faster ML innovation for organizations. There are still many challenges yet to be addressed including, but not limited to, a declarative ML data language, system internal optimization, ML framework integration, etc. Finally, we hope that a system like MLdp can provide a basis for further exploration of how to support machine learning needs with data systems.
# the following statement will create
# a package 'outdoor_activity'
status = ml.data.DSL(
    """CREATE package/outdoor_activity(train, test) AS
       SELECT SessionId, Images, Accelerometer, Activity
       FROM (dataset/human_posture_movement@1.2.0 AS d
             JOIN annotation/human_activity@1.3.0 AS a
             ON d.SessionId = a.SessionId)
       JOIN split/outdoor@1.0.0 AS s
       ON d.SessionId = s.SessionId""").Run()
Listing 7: Creating a package

Train an activity classifier. Listing 8 shows a simple model training example. It loads the package, outdoor_activity, into both the train_data and test_data tables. Next, it creates and trains the model using the training data. Finally, it evaluates the model performance using the testing data.

# the following statement will load the package
# and train the activity_classifier using the
# training data
train_data, test_data = ml.data.DSL(
    """SELECT SessionId, Images, Accelerometer, Activity
       FROM package/outdoor_activity@1.0.0""").Run()

From the above examples, we can see that the DSL leverages most of the SQL expressiveness to simplify the tasks of data operations. One of the open design issues for the MLdp DSL in the near future is to incorporate declarative machine learning into the language [3, 4].

Appendix B AN EXAMPLE WITH MXNET
The following example uses the MXNet [15] data loading API, ImageRecordIter. In MXNet, one can use ImageRecordIter to specify what partition of the input to read. After mounting the targeted MLdp dataset from the command line, the im2rec.py tool from MXNet is used to generate the list of input files. The file list is then used by ImageRecordIter, together with the parameters part_index and num_parts, to pipeline the partition of the input data that each worker trains on, as shown in Listing 9.

# MXNet provides tools/im2rec.py to generate training
# and validation lists, 'openimages_train.lst' and
# 'openimages_val.lst', by the following command:
# >>> python tools/im2rec.py openimages ./data --list True
#     --recursive True --train-ratio .85 --exts .png

store = kv.create('dist')
trainer = gluon.Trainer(..., kvstore = store)
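Listing 9 is truncated in this excerpt. As a rough, hedged sketch of the partitioned reading described above (this is not the paper's Listing 9; the record file name, image shape, batch size, and worker variables are placeholder assumptions, and the list generated by im2rec.py is assumed to have been packed into a RecordIO file), an ImageRecordIter can be told which slice of the input the current worker should read:

import mxnet as mx

num_workers, worker_rank = 4, 0   # placeholder cluster configuration

# num_parts / part_index implement the per-worker sharding described above:
# each worker reads only its 1/num_parts slice of the input records.
train_iter = mx.io.ImageRecordIter(
    path_imgrec='openimages_train.rec',  # RecordIO file packed from the list
    data_shape=(3, 224, 224),            # placeholder image shape
    batch_size=32,
    shuffle=True,
    num_parts=num_workers,
    part_index=worker_rank)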