Motivation
• Three related trends:
o Rapid growth of massive datasets
o Pervasiveness of distributed and cloud-based computing
infrastructure
o Use of statistical methods
▪ For a variety of common problems, e.g., classification,
regression, collaborative filtering, and clustering
• Make machines as intelligent as humans
o Many people believe the best way to do that is to mimic how humans learn
• Many applications
o E.g., Handwriting recognition, personalized recommendations,
speech recognition, face detection, spam filtering etc.
ML Challenges
• Contemporary issues in modern machine learning:
o Privacy
o Fairness
o Interpretability
o Big data
▪ Data is growing exponentially fast in size
• In many cases, faster than Moore's law
▪ Classical machine learning techniques are not always
suitable for modern datasets
ML with Large Datasets:
• Premise:
o There exists some pattern/behavior of interest
▪ The pattern/behavior is difficult to describe
o There is data (sometimes a lot of it!)
▪ More data usually helps
o Use data efficiently/intelligently to “learn” the pattern
• Definition:
o A computer program learns if its performance P, at some task T,
improves with experience E
o E.g.,
T:= The weather forecasting task
P:= The probability of correctly predicting a future date's
weather
E:= The process of the algorithm examining a large amount of
historical weather data
Example:
• Housing price prediction given housing data (regression)
• Spam detection given emails (classification)
• Recommendation systems (clustering)
• Feature extraction (dimensionality reduction)
Goals
• Use raw data
o To train statistical models in common machine learning pipelines
for classification, regression, and exploratory data analysis
• Work with Big data
• Learn a variety of distributed machine learning algorithms and data
processing techniques that are well-suited for large datasets
Terminology:
• Datasets (usually) consist of
o Observations
▪ Individual entries used in learning or evaluating a learned
model
o Features
▪ Attributes used to represent an observation during learning
• Raw data is typically in an arbitrary input format
• Incorporate domain knowledge when representing each
of these observations
o Represent each of our observations via numeric
features
• Unsupervised learning can be used as a preprocessing
step for feature extraction
Note: Success of a supervised learning pipeline crucially
depends on the choice of features
o Labels
▪ Values or categories associated with an observation
Two common learning settings
• Supervised learning:
o Learning from labeled observations
o Labels teach the algorithm to learn a mapping from observations
to labels
o Two kinds
▪ Classification (assign a category, e.g., spam detection),
▪ Regression (predict a real value, e.g., a stock price)
• Unsupervised learning
o Learn solely from unlabeled observations
o Find latent structure in the features alone
o Used
▪ To better understand our data,
• E.g., to discover hidden patterns, or to perform
exploratory data analysis
▪ It can be a means to an end
• It can be in some sense a preprocessing step before
we perform supervised learning task
▪ E.g.,
• Clustering (partition observations into homogeneous
regions)
• Dimensionality reduction (transform an initial
feature representation into a more concise one)
Examples:
Supervised learning
• Regression Problem Example: Predict housing prices
o Collect housing prices and size (in square feet) data
• Example problem: "Given this data, what is the price of a house of 750
square feet?"
• Approaches
o Straight line through data
▪ Maybe $150,000
o Second order polynomial
▪ Maybe $200,000
o Each of these approaches represents a way of doing supervised
learning
o We give the algorithm a dataset where a "right answer" was
provided
o Here, we know actual prices for houses
▪ The idea is we can learn what makes the price a certain
value from the training data
▪ The algorithm should then produce more right answers on new
data where we don't know the price already
▪ i.e., predict the price
• So, this is a regression problem
o Predict continuous valued output (price)
o No real discrete delineation
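The two fits above (straight line vs. second-order polynomial) can be sketched with NumPy's polynomial tools; the housing numbers below are invented toy data for illustration, not the lecture's actual plot:

```python
import numpy as np

# Toy housing data (size in square feet -> price in $1000s); these
# values are made up for illustration, not the lecture's dataset.
sizes = np.array([500, 750, 1000, 1250, 1500, 1750, 2000])
prices = np.array([100, 160, 210, 240, 280, 305, 330])

# Fit a straight line and a second-order polynomial to the same data.
line = np.polyfit(sizes, prices, deg=1)
quad = np.polyfit(sizes, prices, deg=2)

# Predict the price of a 750-square-foot house under each model.
pred_line = np.polyval(line, 750)
pred_quad = np.polyval(quad, 750)
print(f"linear fit:    ${pred_line:.0f}k")
print(f"quadratic fit: ${pred_quad:.0f}k")
```

Both models are trained on the same labeled data; they simply make different assumptions about the shape of the price/size relationship.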
Classification Problem Example: Cancer detection based on tumor size
• Can you estimate prognosis based on tumor size?
o This is an example of a classification problem
▪ Classify data into one of two discrete classes
• No in between, either malignant or not
▪ In classification problems, can have a discrete number of
possible values for the output
• e.g., maybe have four values
o 0 - benign
o 1 - type 1
o 2 - type 2
o 3 - type 3
• In classification problems we can plot data in different ways
• Use only one attribute (size)
o In other problems may have multiple attributes
o We may also, for example, know age and tumor size
• We try and define separate classes by
o Drawing a straight line between the two groups
o Using a more complex function to define the two groups
o Then, when we have an individual with a specific tumor size and
a specific age, we can hopefully use that information to place
them into one of the classes
• We might have many features to consider
o Clump thickness
o Uniformity of cell size
o Uniformity of cell shape
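A minimal one-feature classifier in this spirit can be sketched as logistic regression trained by gradient descent; the tumor sizes and labels below are invented for illustration:

```python
import numpy as np

# Toy data: tumor size (cm) and label (0 = benign, 1 = malignant).
# These values are made up for illustration.
sizes = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Logistic regression with two parameters (intercept, slope),
# trained by plain gradient descent on the log-loss.
X = np.column_stack([np.ones_like(sizes), sizes])
w = np.zeros(2)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))             # predicted P(malignant)
    w -= 0.1 * X.T @ (p - labels) / len(labels)  # gradient step

def predict(size):
    """Return 1 (malignant) if the predicted probability exceeds 0.5."""
    return int(1.0 / (1.0 + np.exp(-(w[0] + w[1] * size))) > 0.5)

print(predict(1.0), predict(3.5))
```

The learned decision boundary is the size where the predicted probability crosses 0.5; with multiple features (age, clump thickness, etc.) the same model draws a line or hyperplane between the classes.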
Unsupervised learning
• In unsupervised learning, we get unlabeled data
o Can you structure it?
• One way of doing this would be to cluster data into two groups
o This is a clustering algorithm
• Example of clustering algorithm
o Google news
▪ Groups news stories into cohesive groups
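A bare-bones k-means loop, the standard clustering algorithm for this kind of grouping, can be sketched in NumPy on toy 2-D data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled toy data: two well-separated blobs of 2-D points.
points = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
                    rng.normal(5.0, 0.5, (20, 2))])

# Plain k-means with k = 2: alternate between assigning each point to
# its nearest centroid and moving each centroid to its cluster's mean.
centroids = points[[0, 20]]   # init with one point from each blob
for _ in range(10):
    dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
    assign = dists.argmin(axis=1)
    centroids = np.array([points[assign == k].mean(axis=0)
                          for k in range(2)])

print(np.round(centroids))
```

No labels are used anywhere; the structure (two groups) is discovered from the features alone.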
ML Pipeline
• Raw training dataset
o Data preprocessing
▪ Get rid of unnecessary data
o Feature engineering
▪ Transform observations into a form appropriate for the
machine learning method
• Example: bag of words model
o Model training
▪ Suppose we want to compare multiple hyperparameter settings
Θ_1, …, Θ_K
▪ For k = 1, 2, …, K
• Train a model on D_train using Θ_k
• Evaluate each model on D_val and find the best
hyperparameter setting, Θ_k*
o Hyperparameter tuning
▪ Most machine learning/optimization methods will have
values/design choices that need to be specified/made in
order to run
• Architecture
• Batch size
• Learning rate/step size
• Termination criteria etc.
o Model evaluation
▪ How do you know if you’ve learned a good model?
▪ If a model is trained by minimizing the training error,
then the training error at termination is (typically)
overly optimistic about the model’s performance
▪ The model may overfit to training data
▪ Likewise, the validation error is also (typically)
optimistic about the model’s performance
• Usually less so than the training error
▪ Idea: use a held-out test dataset to assess our model’s
ability to generalize to unseen observations
▪ Suppose we want to compare multiple hyperparameter settings
Θ_1, …, Θ_K
▪ For k = 1, 2, …, K
• Train a model on D_train using Θ_k
• Evaluate each model on D_val and find the best
hyperparameter setting, Θ_k*
• Compute the error of a model trained with Θ_k* on D_test
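The train/validate/test procedure above can be sketched on a synthetic regression problem, treating the polynomial degree as the hyperparameter Θ (the data and the candidate degrees are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1-D regression data, split into train / validation / test.
x = rng.uniform(-1, 1, 120)
y = 2 * x ** 2 + rng.normal(0, 0.1, 120)
x_tr, y_tr = x[:60], y[:60]
x_val, y_val = x[60:90], y[60:90]
x_te, y_te = x[90:], y[90:]

def mse(deg, x_eval, y_eval):
    """Train a degree-`deg` polynomial on the training split,
    then return its mean squared error on (x_eval, y_eval)."""
    coeffs = np.polyfit(x_tr, y_tr, deg)
    return np.mean((np.polyval(coeffs, x_eval) - y_eval) ** 2)

# The hyperparameter settings Theta_1, ..., Theta_K are degrees here.
degrees = [1, 2, 3, 5, 9]
val_err = {d: mse(d, x_val, y_val) for d in degrees}
best = min(val_err, key=val_err.get)        # Theta_k*

# Finally, report the chosen model's error on the held-out test set.
test_err = mse(best, x_te, y_te)
print(best, round(test_err, 3))
```

Note that the test split is touched exactly once, after the hyperparameter has been chosen, so the reported error is not biased by the selection.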
Large Datasets:
• Dataset can be big in two ways:
o Large 𝑚 (# of observations)
o Large 𝑛 (# of features)
• Examples:
o Image processing
▪ Large 𝑚: potentially massive number of observations (e.g.,
pictures on the internet)
▪ Use-cases: object recognition, annotation generation
o Medical data
▪ Large 𝑛: potentially massive feature set (e.g., genome
sequence, electronic medical records, etc.)
▪ Use-cases: personalized medicine, diagnosis prediction
o Business analytics
▪ Large 𝑚 (e.g., all customers & all products) and 𝑛 (e.g.,
customer data, product specifications, transaction records,
etc.)
▪ Use-cases: product recommendations, customer segmentation
• Large 𝑚:
o Typically, we consider exponential time complexity (e.g., O(2^m))
bad and polynomial complexity (e.g., O(m^3)) good
o However, if 𝑚 is massive, then even 𝑂(𝑚) can be problematic!
o Strategies:
▪ Speed up processing e.g., stochastic gradient descent vs.
gradient descent
▪ Make approximations/subsample the dataset
▪ Exploit parallelism
• Scale up
o Scales to a high-end computer
o Simple
• Scale out
o Scales well on standard hardware
o Added complexity of network communication
• Large 𝑛:
o High-dimensional datasets present numerous issues:
▪ Curse of dimensionality
▪ Overfitting
▪ Computational issues
o Strategies:
▪ Learn low-dimensional representations
▪ Perform feature selection to eliminate “low-yield” features
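The "speed up processing" strategy for large m, stochastic gradient descent versus full gradient descent, can be sketched on a synthetic least-squares problem; each SGD update touches one observation (O(n) work) instead of all m:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic least-squares problem: m observations, n features.
m, n = 10_000, 5
X = rng.normal(size=(m, n))
w_true = rng.normal(size=n)
y = X @ w_true + rng.normal(0, 0.01, m)

# Full gradient descent: every step reads all m observations.
w_gd = np.zeros(n)
for _ in range(100):
    w_gd -= 0.1 * X.T @ (X @ w_gd - y) / m

# Stochastic gradient descent: each step reads one observation,
# so an update costs O(n) instead of O(m * n).
w_sgd = np.zeros(n)
for i in rng.choice(m, 5000):
    w_sgd -= 0.01 * (X[i] @ w_sgd - y[i]) * X[i]

print(np.round(w_gd - w_true, 2))
print(np.round(w_sgd - w_true, 1))
```

Both recover the true weights here; the point is that SGD's per-update cost is independent of m, which is what makes it attractive when m is massive.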
Units of Data
Unit           Value       Scale
Kilobyte (KB)  1000 bytes  A paragraph of text
Megabyte (MB)  1000 KB     A short novel
Gigabyte (GB)  1000 MB     Beethoven's 5th symphony
Terabyte (TB)  1000 GB     All the x-rays in a large hospital
Petabyte (PB)  1000 TB     ≈ 1/2 of all US research libraries
Exabyte (EB)   1000 PB     ≈ 1/5 of the words humans have ever spoken
Communication Hierarchy
Single Node:
CPU
• A single core of a CPU can perform roughly 2B cycles per second
o The processing speed, or clock speed, for a single core is not
changing by much
• However, multi-core CPUs have emerged and improved the overall
performance of CPUs
o The number of cores per CPU is growing rapidly
Main memory
• Main memory has storage capacity of 10 to 100 GB
o This capacity is growing quickly
• Communication between RAM and the CPU is fast, at roughly 50 GB/s
Disk
• Disk storage capacity is growing exponentially, but communication speed
is not
o Communication between the hard drive and the CPU is roughly 500
times slower than between main memory and the CPU
• This communication issue to some extent is mitigated by the fact that a
node usually has multiple disks
• Each disk can communicate in parallel with the CPU
• However, if we consider a node with 10 disks communicating in parallel
with the CPU, then reading from disk is still 50 times slower than
reading from RAM
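A back-of-the-envelope calculation using the bandwidth figures above (RAM at roughly 50 GB/s, a single disk roughly 500x slower, 10 disks reading in parallel) shows why disk reads dominate:

```python
# Time to read 1 TB of data under each bandwidth assumption.
data_gb = 1000                 # 1 TB
ram_bw = 50                    # GB/s, RAM -> CPU
disk_bw = ram_bw / 500         # ~0.1 GB/s for a single disk
ten_disks_bw = 10 * disk_bw    # 10 disks reading in parallel

print(f"RAM:      {data_gb / ram_bw:>8.0f} s")   # ~20 seconds
print(f"1 disk:   {data_gb / disk_bw:>8.0f} s")  # ~2.8 hours
print(f"10 disks: {data_gb / ten_disks_bw:>8.0f} s")  # ~17 minutes
```

Even with 10 disks in parallel, a full pass over the data takes 50 times longer from disk than from RAM, which is exactly the gap quoted above.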
Distributed network:
• Typical network speed is 1 GB per second
Top-of-rack architecture
• Most common setup for commodity clusters
o Nodes are physically stored in racks
• Nodes on the same rack can directly communicate with each other at a
rate of 1 GB per second
• However, nodes on different racks can't communicate directly with each
other, and thus communicate at a slower rate
• Cost of RAM has been steadily decreasing
• Network latency is quite large relative to memory latency
o It's generally better to send a few large messages rather than
many small messages
• Train multiple models simultaneously and batch their communication
o In particular, when tuning hyperparameters we must train several
models anyway
o Training them together lets us batch their messages and reduce
the impact of network latency
Strategies to reduce communication costs
• We need to design algorithms that take advantage of the fact that
parallelism makes our computation faster
o While the disk and network communication slow us down
Perform parallel and in-memory computation
• Persisting in memory reduces our communication burden
o Attractive option when working with iterative algorithms that
read the same data multiple times, as is the case in gradient
descent
o In fact, iterative computation is quite common for several
machine learning algorithms
• As our data grows large though, a standard computing node likely won't
be able to keep all the data in memory
o One way to deal with this situation is to scale up the computing
node and create a powerful multicore machine with several CPUs,
and a huge amount of RAM
o This strategy is advantageous in that we can sidestep any network
communication when working with a single multicore machine
o However, these machines can be quite expensive as they require
specialized hardware, and are thus not as widely accessible as
commodity computing nodes
o Multicore machines can indeed handle fairly large data sets, and
they're an attractive option in many settings
o However, this approach does have scalability limitations, as
we'll eventually hit a wall when the data grows large enough
• As an alternative approach, we can scale out and work in a distributed
environment
o Intuitively, we can work with a large number of commodity nodes
o We can coordinate the efforts across these nodes by connecting
them via a network
o This approach can scale to massive problems since we're working
with readily available commodity computing nodes, and can add
additional nodes as our data grows in size
o However, we must deal with network communication to coordinate
the efforts of the worker nodes
▪ As an example, for gradient descent we perform iterations
of gradient descent
▪ Our training data, stored for instance in an RDD called
train, can be quite large and is stored in a distributed
fashion across worker nodes
▪ In each iteration, we need to read the training data to
compute the gradient update
▪ A naive implementation would require us to store the data
on each worker, on disk, and to read from disk on each
iteration
o However, we can drastically speed up our implementation by
persisting this RDD in distributed memory across all of our
iterations
▪ This allows us to read from memory, instead of disk, when
computing the gradient update on each iteration
▪ Although, we may still need to communicate results from the
map step for each worker, to the driver node
▪ This communication can be expensive where generally we
would like to mitigate network communication as much as
possible, while leveraging distributed computing
Idea: add more RAM to each machine and hold more data in main memory
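The load-once, iterate-in-memory pattern can be sketched in plain Python as a single-machine analogue of persisting an RDD (in Spark this corresponds to calling cache() on the training RDD):

```python
import os
import tempfile
import numpy as np

# Write a synthetic dataset to disk once, load it into RAM once, then
# run every gradient descent iteration against the in-memory copy
# instead of re-reading the file each time.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])

path = os.path.join(tempfile.mkdtemp(), "train.npz")
np.savez(path, X=X, y=y)

data = np.load(path)               # one read from disk...
X_mem, y_mem = data["X"], data["y"]

w = np.zeros(4)
for _ in range(200):               # ...then many in-memory iterations
    w -= 0.1 * X_mem.T @ (X_mem @ w - y_mem) / len(y_mem)

print(np.round(w, 2))
```

The disk cost is paid once up front; the 200 iterations that follow only touch RAM, which is the whole benefit of persisting iteratively-read data in memory.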
Reduce communication
• Let's first consider what we might need to communicate
• In a machine learning setting, we operate on raw data, we extract
features, and we train models, which are represented via their
parameters
• We also create intermediate objects throughout the development of
learning pipelines
• We could potentially communicate any of these objects across the
network
• Simple strategy: keep large objects local
o In other words, we should design our distributed algorithms such
that, whenever possible, we never have to communicate the largest
objects
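This principle can be sketched by simulating distributed gradient descent in plain Python: the (large) data is partitioned across workers and never moves, while only the (small) n-dimensional model and per-worker gradients cross the simulated network:

```python
import numpy as np

rng = np.random.default_rng(4)

# Large object: the training data, partitioned across 3 "workers".
# Small objects: the model w and the per-worker partial gradients.
m, n, workers = 1200, 4, 3
X = rng.normal(size=(m, n))
y = X @ np.array([2.0, -1.0, 0.5, 1.5])
shards = [(X[i::workers], y[i::workers]) for i in range(workers)]

w = np.zeros(n)
for _ in range(200):
    # "Broadcast" w (n numbers) to each worker; each worker computes
    # one n-dimensional partial gradient on its local shard only.
    grads = [Xi.T @ (Xi @ w - yi) / m for Xi, yi in shards]
    # "Collect" and sum the partial gradients, then update the model.
    w -= 0.1 * np.sum(grads, axis=0)

print(np.round(w, 2))
```

Per iteration, only workers * n numbers are communicated, regardless of how large m grows; the m x n data matrix stays local to the shards for the entire run.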