5.1 Large Scale ML

The document discusses the rapid growth of massive datasets, the importance of distributed computing, and the application of statistical methods in machine learning (ML). It outlines challenges in ML such as privacy, fairness, and interpretability, and distinguishes between supervised and unsupervised learning with examples. Additionally, it emphasizes the need for efficient data processing techniques and strategies to handle large datasets in a distributed environment.
Motivation

• Three related trends:


o Rapid growth of massive datasets
o Pervasiveness of distributed and cloud-based computing
infrastructure
o Use of statistical methods
▪ For a variety of common problems, e.g., classification,
regression, collaborative filtering, and clustering
• Make machines as intelligent as humans
o Many people believe the best way to do that is to mimic how humans learn
• Many applications
o E.g., handwriting recognition, personalized recommendations,
speech recognition, face detection, spam filtering, etc.

ML Challenges
• Contemporary issues in modern machine learning:
o Privacy
o Fairness
o Interpretability
o Big data
▪ Data is growing exponentially fast in size
• In many cases, faster than Moore's law
▪ Classical machine learning techniques are not always
suitable for modern datasets

ML with Large Datasets:


• Premise:
o There exists some pattern/behavior of interest
▪ The pattern/behavior is difficult to describe
o There is data (sometimes a lot of it!)
▪ More data usually helps
o Use data efficiently/intelligently to “learn” the pattern
• Definition:
o A computer program learns if its performance P, at some task T,
improves with experience E
o E.g.,
T:= The weather forecasting task
P:= The probability of correctly predicting a future date's
weather
E:= The process of the algorithm examining a large amount of
historical weather data

Example:
• Housing price prediction given housing data (regression)
• Spam detection given emails (classification)
• Recommendation systems (clustering)
• Feature extraction (dimensionality reduction)
Goals

• Use raw data


o To train statistical models in common machine learning pipelines
for classification, regression, and exploratory data analysis
• Work with Big data
• Learn a variety of distributed machine learning algorithms and data
processing techniques that are well-suited for large datasets

Terminology:
• Datasets (usually) consist of
o Observations
▪ Individual entries used in learning or evaluating a learned
model
o Features
▪ Attributes used to represent an observation during learning
▪ Raw data is typically in an arbitrary input format
▪ Incorporate domain knowledge to represent each of our observations
via numeric features
▪ Unsupervised learning can be used as a preprocessing step for
feature extraction
▪ Note: The success of a supervised learning pipeline crucially
depends on the choice of features
o Labels
▪ Values or categories associated with an observation

Two common learning settings

• Supervised learning:
o Learning from labeled observations
o Labels teach the algorithm to learn a mapping from observations
to labels
o Two kinds
▪ Classification (assign a category, e.g., spam detection)
▪ Regression (predict a real value, e.g., a stock price)
• Unsupervised learning
o Learn solely from unlabeled observations
o Find latent structure in the features alone
o Used
▪ To better understand our data,
• E.g., to discover hidden patterns, or to perform
exploratory data analysis
▪ It can be a means to an end
• It can be, in some sense, a preprocessing step before
we perform a supervised learning task
▪ E.g.,
• Clustering (partition observations into homogeneous
regions)
• Dimensionality reduction (transform an initial
feature representation into a more concise one)
Examples:
Supervised learning

• Regression Problem Example: Predict housing prices


o Collect data on housing prices and sizes (in square feet)

• Example problem: "Given this data, what is the price of a 750 square foot
house?"

• Approaches
o Straight line through data
▪ Maybe $150,000
o Second order polynomial
▪ Maybe $200,000
o Each of these approaches represents a way of doing supervised
learning
o We give the algorithm a dataset where the "right answer" was
provided
o Here, we know the actual prices for houses
▪ The idea is that we can learn from the training data what
makes the price a certain value
▪ The algorithm should then produce right answers for new
data where we don't already know the price
▪ i.e., predict the price
• So, this is a regression problem (a short code sketch follows below)
o Predict a continuous valued output (price)
o No real discrete delineation
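As a minimal sketch of the two approaches above (a straight line and a second-order polynomial), the NumPy snippet below fits both to a small set of made-up (size, price) observations and predicts the price of a 750 square foot house; the data values and the resulting predictions are purely illustrative.

import numpy as np

# Hypothetical (size in square feet, price in dollars) observations
sizes = np.array([500, 650, 800, 950, 1100, 1300], dtype=float)
prices = np.array([110_000, 140_000, 170_000, 205_000, 240_000, 280_000], dtype=float)

# Fit a straight line (degree 1) and a second-order polynomial (degree 2)
line = np.polyfit(sizes, prices, deg=1)
quadratic = np.polyfit(sizes, prices, deg=2)

# Predict the price of a 750 square foot house with each model
query = 750.0
print("linear prediction:   ", np.polyval(line, query))
print("quadratic prediction:", np.polyval(quadratic, query))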
Classification Problem Example: Cancer detection based on tumor size

• Can you estimate prognosis based on tumor size?


o This is an example of a classification problem
▪ Classify data into one of two discrete classes
• No in between, either malignant or not
▪ In classification problems, the output can take a discrete
number of possible values
• e.g., maybe we have four values
o 0 - benign
o 1 - type 1
o 2 - type 2
o 3 - type 3
• In classification problems we can plot data in different ways

• Use only one attribute (size)


o In other problems may have multiple attributes
o We may also, for example, know age and tumor size

• We try to define separate classes by

o Drawing a straight line between the two groups
o Using a more complex function to define the two groups
o Then, when we have an individual with a specific tumor size and a
specific age, we can hopefully use that information to place them
into one of the classes (a code sketch follows this list)
• We might have many features to consider
o Clump thickness
o Uniformity of cell size
o Uniformity of cell shape
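Here is a minimal sketch of this kind of classifier, assuming scikit-learn and a made-up dataset of (tumor size, age) observations labeled benign (0) or malignant (1); the straight-line decision boundary comes from logistic regression, which is one common choice rather than a method prescribed by these notes.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical observations: [tumor size in cm, patient age]; label 1 = malignant, 0 = benign
X = np.array([[1.0, 35], [1.5, 50], [2.0, 45], [3.0, 60], [3.5, 55], [4.0, 70]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit a linear decision boundary separating the two classes
clf = LogisticRegression().fit(X, y)

# Classify a new individual with a 2.7 cm tumor who is 58 years old
print(clf.predict([[2.7, 58]]))        # 0 (benign) or 1 (malignant)
print(clf.predict_proba([[2.7, 58]]))  # estimated class probabilities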
Unsupervised learning
• In unsupervised learning, we get unlabeled data
o Can we find structure in it?
• One way of doing this would be to cluster the data into two groups
o This is done with a clustering algorithm (a code sketch follows
below)
• Example of a clustering algorithm in use
o Google News
▪ Groups news stories into cohesive groups
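As an illustrative sketch, the snippet below clusters a handful of made-up, unlabeled two-dimensional points into two groups with scikit-learn's KMeans; the data and the choice of two clusters are assumptions for the example.

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled observations: two loose groups of 2-D points (made-up data)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.1]])

# Partition the observations into two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each observation
print(kmeans.cluster_centers_)  # the two cluster centroids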

ML Pipeline

• Raw training dataset


o Data preprocessing
▪ Get rid of unnecessary data
o Feature engineering
▪ Transform observations into a form appropriate for the
machine learning method
• Example: bag of words model
o Model training
▪ Suppose we want to compare multiple hyperparameter settings
Θ_1, ..., Θ_K
▪ For k = 1, 2, ..., K
• Train a model on D_train using Θ_k
• Evaluate each model on D_val and find the best
hyperparameter setting, Θ_k*
o Hyperparameter tuning
▪ Most machine learning/optimization methods will have
values/design choices that need to be specified/made in
order to run
• Architecture
• Batch size
• Learning rate/step size
• Termination criteria etc.
o Model evaluation
▪ How do you know if you’ve learned a good model?
▪ If a model is trained by minimizing the training error,
then the training error at termination is (typically)
overly optimistic about the model’s performance
▪ The model may overfit to training data
▪ Likewise, the validation error is also (typically)
optimistic about the model’s performance
• Usually less so than the training error
▪ Idea: use a held-out test dataset to assess our model’s
ability to generalize to unseen observations
▪ Suppose we want to compare multiple hyperparameter settings
Θ_1, ..., Θ_K (see the sketch after this list)
▪ For k = 1, 2, ..., K
• Train a model on D_train using Θ_k
• Evaluate each model on D_val and find the best
hyperparameter setting, Θ_k*
• Compute the error of a model trained with Θ_k* on D_test
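Here is a minimal sketch of the train/validation/test procedure described above, assuming scikit-learn, a synthetic regression dataset, and ridge regression where the hyperparameter Θ_k is the regularization strength; all of these concrete choices are illustrative rather than prescribed by the notes.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, split into D_train, D_val, and D_test
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=600)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Candidate hyperparameter settings Θ_1, ..., Θ_K (here: ridge regularization strengths)
candidates = [0.01, 0.1, 1.0, 10.0]

# Train on D_train with each Θ_k and evaluate on D_val
val_errors = {}
for alpha in candidates:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_errors[alpha] = mean_squared_error(y_val, model.predict(X_val))

# Keep the best setting Θ_k* and report its error on the held-out D_test
best_alpha = min(val_errors, key=val_errors.get)
best_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("best alpha:", best_alpha)
print("test MSE:  ", mean_squared_error(y_test, best_model.predict(X_test)))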

Large Datasets:
• Dataset can be big in two ways:
o Large 𝑚 (# of observations)
o Large 𝑛 (# of features)
• Examples:
o Image processing
▪ Large 𝑚: potentially massive number of observations (e.g.,
pictures on the internet)
▪ Use-cases: object recognition, annotation generation
o Medical data
▪ Large 𝑛: potentially massive feature set (e.g., genome
sequence, electronic medical records, etc.)
▪ Use-cases: personalized medicine, diagnosis prediction
o Business analytics
▪ Large 𝑚 (e.g., all customers & all products) and 𝑛 (e.g.,
customer data, product specifications, transaction records,
etc.)
▪ Use-cases: product recommendations, customer segmentation
• Large m:
o Typically, we consider exponential time complexity (e.g., O(2^m))
bad and polynomial complexity (e.g., O(m^3)) good
o However, if m is massive, then even O(m) can be problematic!
o Strategies:
▪ Speed up processing, e.g., stochastic gradient descent vs.
gradient descent (a sketch comparing the two appears after
this list)
▪ Make approximations/subsample the dataset
▪ Exploit parallelism
• Scale up
o Scales to a high-end computer
o Simple
• Scale out
o Scales well on standard hardware
o Added complexity of network communication
• Large n:
o High-dimensional datasets present numerous issues:
▪ Curse of dimensionality
▪ Overfitting
▪ Computational issues
o Strategies (sketches follow after this list):
▪ Learn low-dimensional representations
▪ Perform feature selection to eliminate “low-yield” features
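To make the large-m strategy "stochastic gradient descent vs. gradient descent" concrete, here is a small NumPy sketch for least-squares linear regression on synthetic data: each full gradient descent update touches all m observations, while each SGD update uses a single observation. The step sizes, iteration counts, and data are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
m, n = 10_000, 5
X = rng.normal(size=(m, n))
w_true = rng.normal(size=n)
y = X @ w_true + 0.01 * rng.normal(size=m)

def gradient_descent(X, y, steps=100, lr=0.1):
    # Each update costs O(m * n): the gradient averages over all m observations
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def stochastic_gradient_descent(X, y, steps=10_000, lr=0.01):
    # Each update costs O(n): the gradient comes from one randomly chosen observation
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))
        w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

print("GD  error:", np.linalg.norm(gradient_descent(X, y) - w_true))
print("SGD error:", np.linalg.norm(stochastic_gradient_descent(X, y) - w_true))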
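And as a sketch of the two large-n strategies, the snippet below uses scikit-learn (an assumption, not something the notes specify) to learn a low-dimensional representation with PCA and to select the most informative features with a univariate test; the synthetic data and the chosen dimensions are arbitrary.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))        # 200 observations, 1000 features (large n)
y = X[:, :10] @ rng.normal(size=10)     # only the first 10 features carry signal

# Strategy 1: learn a 20-dimensional representation of the 1000 features
X_low = PCA(n_components=20).fit_transform(X)

# Strategy 2: keep the 10 features most associated with the target
X_selected = SelectKBest(score_func=f_regression, k=10).fit_transform(X, y)

print(X_low.shape)       # (200, 20)
print(X_selected.shape)  # (200, 10)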

Units of Data
Unit            Value        Scale
Kilobyte (KB)   1000 bytes   A paragraph of text
Megabyte (MB)   1000 KB      A short novel
Gigabyte (GB)   1000 MB      Beethoven’s 5th symphony
Terabyte (TB)   1000 GB      All the x-rays in a large hospital
Petabyte (PB)   1000 TB      ≈ ½ of all US research libraries
Exabyte (EB)    1000 PB      ≈ 1/5 of the words humans have ever spoken

Communication Hierarchy
Single Node:
CPU

• A single core of a CPU can perform roughly 2B cycles per second


o The processing speed, or clock speed, for a single core is not
changing by much
• However, multi-core CPUs have emerged and improved the overall
performance of CPUs
o The number of cores per CPU is growing rapidly

Main memory

• Main memory has a storage capacity of 10 to 100 GB


o This capacity is growing quickly
• Communication between RAM and the CPU is fast, at roughly 50 GB/s

Disk

• Disk storage capacity is growing exponentially, but communication speed
is not
o Communication between the hard drive and the CPU is roughly 500
times slower than between main memory and the CPU
• This communication issue is mitigated to some extent by the fact that a
node usually has multiple disks
• Each disk can communicate in parallel with the CPU
• However, if we consider a node with 10 disks communicating in parallel
with the CPU, then reading from disk is still 50 times slower than
reading from RAM

Distributed network:

• Typical network speed is 1 GB per second

Top-of-rack architecture

• Most common setup for commodity clusters


o Nodes are physically stored in racks
• Nodes on the same rack can directly communicate with each other at a
rate of 1 GB per second
• However, nodes on different racks can't communicate directly with each
other, and thus communicate at a slower rate

• Cost of RAM has been steadily decreasing


• Network latency is quite large relative to memory latency
o It's generally better to send a few large messages rather than
many small messages
• Train multiple models simultaneously and batch their communication
o In particular, when tuning hyperparameters we must train several
models anyway, so training them together lets us batch their
communication and reduce the impact of latency

Strategies to reduce communication costs

• We need to design algorithms that take advantage of the fact that
parallelism makes our computation faster
o While keeping in mind that disk and network communication slow us
down

Perform parallel and in-memory computation

• Persisting in memory reduces our communication burden


o Attractive option when working with iterative algorithms that
read the same data multiple times, as is the case in gradient
descent
o In fact, iterative computation is quite common for several
machine learning algorithms
• As our data grows large though, a standard computing node likely won't
be able to keep all the data in memory
o One way to deal with this situation is to scale up the computing
node and create a powerful multicore machine with several CPUs,
and a huge amount of RAM
o This strategy is advantageous in that we can sidestep any network
communication when working with a single multicore machine
o However, these machines can be quite expensive as they require
specialized hardware, and are thus not as widely accessible as
commodity computing nodes
o Multicore machines can indeed handle fairly large data sets, and
they're an attractive option in many settings
o However, this approach does have scalability limitations, as
we'll eventually hit a wall when the data grows large enough
• As an alternative approach, we can scale out and work in a distributed
environment
o Intuitively, we can work with a large number of commodity nodes
o We can coordinate the efforts across these nodes by connecting
them via a network
o This approach can scale to massive problems since we're working
with readily available commodity computing nodes, and can add
additional nodes as our data grows in size
o However, we must deal with network communication to coordinate
the efforts of the worker nodes
▪ As an example, consider gradient descent, which proceeds in
iterations
▪ Our training data, for instance stored in an RDD called
train, can be quite large and is stored in a distributed
fashion across worker nodes
▪ In each iteration, we need to read the training data in
order to compute the gradient update
▪ A naive implementation would require us to store the data
on each worker, on disk, and to read from disk on each
iteration
o However, we can drastically speed up our implementation by
persisting this RDD in distributed memory across all of our
iterations
▪ This allows us to read from memory instead of disk when
computing the gradient update on each iteration
▪ Although we may still need to communicate results from the
map step on each worker to the driver node
▪ This communication can be expensive, so generally we would
like to mitigate network communication as much as possible
while leveraging distributed computing (a code sketch
follows below)
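The notes above mention an RDD named train, which suggests Spark; here is a hedged PySpark sketch of the idea, with made-up data, feature dimension, step size, and iteration count. The training RDD is cached in distributed memory once and then re-read from memory on every gradient descent iteration.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="gd-persist-sketch")

# Hypothetical training data as an RDD of (features, label) pairs
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)

def make_point():
    x = rng.normal(size=5)
    return (x, float(x @ w_true))

train = sc.parallelize([make_point() for _ in range(10_000)]).cache()
m = train.count()   # persist in distributed memory across iterations

w = np.zeros(5)
lr = 0.1
for _ in range(50):
    # Each iteration reads the cached RDD from memory rather than from disk;
    # only the small aggregated gradient vector is sent back to the driver
    grad = train.map(lambda p: (p[0] @ w - p[1]) * p[0]) \
                .reduce(lambda a, b: a + b) / m
    w -= lr * grad

sc.stop()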

Idea: put more RAM in each machine and hold more data in main memory

Reduce communication

• Let's first consider what we might need to communicate


• In a machine learning setting, we operate on raw data, we extract
features, and we train models, which are represented via their
parameters
• We also create intermediate objects throughout the development of
learning pipelines
• We could potentially communicate any of these objects across the
network
• Simple strategy: keep large objects local
o In other words, we should design our distributed algorithms such
that, whenever possible, we never have to communicate the largest
objects (a short sketch follows below)
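As a short sketch of "keep large objects local" in the same assumed PySpark setting (again with made-up data): the large training RDD stays partitioned on the workers, only the small parameter vector is broadcast out, and only the small aggregated gradient comes back over the network.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="keep-large-objects-local")

# Large object: the training data, partitioned across workers and never collected
rng = np.random.default_rng(0)
train = sc.parallelize(
    [(rng.normal(size=5), float(rng.normal())) for _ in range(10_000)]).cache()
m = train.count()

# Small objects: the parameter vector (broadcast out) and the gradient (sent back)
w_bc = sc.broadcast(np.zeros(5))
grad = train.map(lambda p: (p[0] @ w_bc.value - p[1]) * p[0]) \
            .reduce(lambda a, b: a + b) / m
print(grad)   # a length-5 vector: tiny compared to the dataset it summarizes

sc.stop()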
