Motivation
• Three related trends:
o Rapid growth of massive datasets
o Pervasiveness of distributed and cloud-based computing
infrastructure
o Use of statistical methods
▪ For a variety of common problems, e.g., classification,
regression, collaborative filtering, and clustering
• Make machines as intelligent as humans
o Many people believe the best way to do that is to mimic how humans learn
• Many applications
o E.g., Handwriting recognition, personalized recommendations,
speech recognition, face detection, spam filtering etc.
ML Challenges
• Contemporary issues in modern machine learning:
o Privacy
o Fairness
o Interpretability
o Big data
▪ Data is growing exponentially fast in size
• In many cases, faster than Moore's law
▪ Classical machine learning techniques are not always
suitable for modern datasets
ML with Large Datasets:
• Premise:
o There exists some pattern/behavior of interest
▪ The pattern/behavior is difficult to describe
o There is data (sometimes a lot of it!)
▪ More data usually helps
o Use data efficiently/intelligently to “learn” the pattern
• Definition:
o A computer program learns if its performance P, at some task T,
improves with experience E
o E.g.,
T:= The weather forecasting task
P:= The probability of correctly predicting a future date's
weather
E:= The process of the algorithm examining a large amount of
historical weather data
Example:
• Housing price prediction given housing data (regression)
• Spam detection given emails (classification)
• Recommendation systems (clustering)
• Feature extraction (dimensionality reduction)
Goals
• Use raw data
o To train statistical models in common machine learning pipelines
for classification, regression, and exploratory data analysis
• Work with Big data
• Learn a variety of distributed machine learning algorithms and data
processing techniques that are well-suited for large datasets
Terminology:
• Datasets (usually) consist of
o Observations
▪ Individual entries used in learning or evaluating a learned
model
o Features
▪ Attributes used to represent an observation during learning
• Raw data is typically in an arbitrary input format
• Incorporate domain knowledge when representing each
of these observations
o Represent each of our observations via numeric
features
• Unsupervised learning can be used as a preprocessing
step for feature extraction
Note: Success of a supervised learning pipeline crucially
depends on the choice of features
o Labels
▪ Values or categories associated with an observation
Two common learning settings
• Supervised learning:
o Learning from labeled observations
o Labels teach the algorithm to learn a mapping from observations
to labels
o Two kinds
▪ Classification (assign a category, e.g., spam detection),
▪ Regression (predict a real value, e.g., a stock price)
• Unsupervised learning
o Learn solely from unlabeled observations
o Find latent structure in the features alone
o Used
▪ To better understand our data,
• E.g., to discover hidden patterns, or to perform
exploratory data analysis
▪ It can be a means to an end
• It can be in some sense a preprocessing step before
we perform supervised learning task
▪ E.g.,
• Clustering (partition observations into homogeneous
regions)
• Dimensionality reduction (transform an initial
feature representation into a more concise one)
Examples:
Supervised learning
• Regression Problem Example: Predict housing prices
o Collect housing prices and size (in square feet) data
• Example problem: "Given this data, what is the price of a house of 750
square feet?"
• Approaches
o Straight line through data
▪ Maybe $150,000
o Second order polynomial
▪ Maybe $200,000
o Each of these approaches represents a way of doing supervised
learning
o We give the algorithm a dataset where a "right answer" was
provided
o Here, we know actual prices for houses
▪ The idea is we can learn what makes the price a certain
value from the training data
▪ The algorithm should then produce more right answers on new
data where we don't know the price already
▪ i.e., predict the price
• So, this is a regression problem
o Predict continuous valued output (price)
o No real discrete delineation
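The two fits above (straight line vs. second-order polynomial) can be sketched with NumPy's polynomial tools; the housing numbers below are invented toy data for illustration, not the lecture's actual plot:

```python
import numpy as np

# Toy housing data (size in square feet -> price in $1000s); these
# values are made up for illustration, not the lecture's dataset.
sizes = np.array([500, 750, 1000, 1250, 1500, 1750, 2000])
prices = np.array([100, 160, 210, 240, 280, 305, 330])

# Fit a straight line and a second-order polynomial to the same data.
line = np.polyfit(sizes, prices, deg=1)
quad = np.polyfit(sizes, prices, deg=2)

# Predict the price of a 750-square-foot house under each model.
pred_line = np.polyval(line, 750)
pred_quad = np.polyval(quad, 750)
print(f"linear fit:    ${pred_line:.0f}k")
print(f"quadratic fit: ${pred_quad:.0f}k")
```

Both models are trained on the same labeled data; they simply make different assumptions about the shape of the price/size relationship.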
Classification Problem Example: Cancer detection based on tumor size
• Can you estimate prognosis based on tumor size?
o This is an example of a classification problem
▪ Classify data into one of two discrete classes
• No in between, either malignant or not
▪ In classification problems, can have a discrete number of
possible values for the output
• e.g., maybe have four values
o 0 - benign
o 1 - type 1
o 2 - type 2
o 3 - type 3
• In classification problems we can plot data in different ways
• Use only one attribute (size)
o In other problems may have multiple attributes
o We may also, for example, know age and tumor size
• We try and define separate classes by
o Drawing a straight line between the two groups
o Using a more complex function to define the two groups
o Then, when we have an individual with a specific tumor size and
a specific age, we can hopefully use that information to place
them into one of the classes
• We might have many features to consider
o Clump thickness
o Uniformity of cell size
o Uniformity of cell shape
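A minimal one-feature classifier in this spirit can be sketched as logistic regression trained by gradient descent; the tumor sizes and labels below are invented for illustration:

```python
import numpy as np

# Toy data: tumor size (cm) and label (0 = benign, 1 = malignant).
# These values are made up for illustration.
sizes = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Logistic regression with two parameters (intercept, slope),
# trained by plain gradient descent on the log-loss.
X = np.column_stack([np.ones_like(sizes), sizes])
w = np.zeros(2)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))             # predicted P(malignant)
    w -= 0.1 * X.T @ (p - labels) / len(labels)  # gradient step

def predict(size):
    """Return 1 (malignant) if the predicted probability exceeds 0.5."""
    return int(1.0 / (1.0 + np.exp(-(w[0] + w[1] * size))) > 0.5)

print(predict(1.0), predict(3.5))
```

The learned decision boundary is the size where the predicted probability crosses 0.5; with multiple features (age, clump thickness, etc.) the same model draws a line or hyperplane between the classes.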
Unsupervised learning
• In unsupervised learning, we get unlabeled data
o Can you structure it?
• One way of doing this would be to cluster data into two groups
o This is a clustering algorithm
• Example of clustering algorithm
o Google news
▪ Groups news stories into cohesive groups
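A bare-bones k-means loop, the standard clustering algorithm for this kind of grouping, can be sketched in NumPy on toy 2-D data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled toy data: two well-separated blobs of 2-D points.
points = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
                    rng.normal(5.0, 0.5, (20, 2))])

# Plain k-means with k = 2: alternate between assigning each point to
# its nearest centroid and moving each centroid to its cluster's mean.
centroids = points[[0, 20]]   # init with one point from each blob
for _ in range(10):
    dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
    assign = dists.argmin(axis=1)
    centroids = np.array([points[assign == k].mean(axis=0)
                          for k in range(2)])

print(np.round(centroids))
```

No labels are used anywhere; the structure (two groups) is discovered from the features alone.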
ML Pipeline
• Raw training dataset
o Data preprocessing
▪ Get rid of unnecessary data
o Feature engineering
▪ Transform observations into a form appropriate for the
machine learning method
• Example: bag of words model
o Model training
▪ Suppose we want to compare multiple hyperparameter settings
Θ_1, …, Θ_K
▪ For k = 1, 2, …, K
• Train a model on D_train using Θ_k
• Evaluate each model on D_val and find the best
hyperparameter setting, Θ_k*
o Hyperparameter tuning
▪ Most machine learning/optimization methods will have
values/design choices that need to be specified/made in
order to run
• Architecture
• Batch size
• Learning rate/step size
• Termination criteria etc.
o Model evaluation
▪ How do you know if you’ve learned a good model?
▪ If a model is trained by minimizing the training error,
then the training error at termination is (typically)
overly optimistic about the model’s performance
▪ The model may overfit to training data
▪ Likewise, the validation error is also (typically)
optimistic about the model’s performance
• Usually less so than the training error
▪ Idea: use a held-out test dataset to assess our model’s
ability to generalize to unseen observations
▪ Suppose we want to compare multiple hyperparameter settings
Θ_1, …, Θ_K
▪ For k = 1, 2, …, K
• Train a model on D_train using Θ_k
• Evaluate each model on D_val and find the best
hyperparameter setting, Θ_k*
• Compute the error of a model trained with Θ_k* on D_test
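The train/validate/test procedure above can be sketched on a synthetic regression problem, treating the polynomial degree as the hyperparameter Θ (the data and the candidate degrees are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1-D regression data, split into train / validation / test.
x = rng.uniform(-1, 1, 120)
y = 2 * x ** 2 + rng.normal(0, 0.1, 120)
x_tr, y_tr = x[:60], y[:60]
x_val, y_val = x[60:90], y[60:90]
x_te, y_te = x[90:], y[90:]

def mse(deg, x_eval, y_eval):
    """Train a degree-`deg` polynomial on the training split,
    then return its mean squared error on (x_eval, y_eval)."""
    coeffs = np.polyfit(x_tr, y_tr, deg)
    return np.mean((np.polyval(coeffs, x_eval) - y_eval) ** 2)

# The hyperparameter settings Theta_1, ..., Theta_K are degrees here.
degrees = [1, 2, 3, 5, 9]
val_err = {d: mse(d, x_val, y_val) for d in degrees}
best = min(val_err, key=val_err.get)        # Theta_k*

# Finally, report the chosen model's error on the held-out test set.
test_err = mse(best, x_te, y_te)
print(best, round(test_err, 3))
```

Note that the test split is touched exactly once, after the hyperparameter has been chosen, so the reported error is not biased by the selection.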
Large Datasets:
• Dataset can be big in two ways:
o Large 𝑚 (# of observations)
o Large 𝑛 (# of features)
• Examples:
o Image processing
▪ Large 𝑚: potentially massive number of observations (e.g.,
pictures on the internet)
▪ Use-cases: object recognition, annotation generation
o Medical data
▪ Large 𝑛: potentially massive feature set (e.g., genome
sequence, electronic medical records, etc.)
▪ Use-cases: personalized medicine, diagnosis prediction
o Business analytics
▪ Large 𝑚 (e.g., all customers & all products) and 𝑛 (e.g.,
customer data, product specifications, transaction records,
etc.)
▪ Use-cases: product recommendations, customer segmentation
• Large 𝑚:
o Typically, we consider exponential time complexity (e.g., O(2^m))
bad and polynomial complexity (e.g., O(m^3)) good
o However, if 𝑚 is massive, then even 𝑂(𝑚) can be problematic!
o Strategies:
▪ Speed up processing e.g., stochastic gradient descent vs.
gradient descent
▪ Make approximations/subsample the dataset
▪ Exploit parallelism
• Scale up
o Scales to a high-end computer
o Simple
• Scale out
o Scales well on standard hardware
o Added complexity of network communication
• Large 𝑛:
o High-dimensional datasets present numerous issues:
▪ Curse of dimensionality
▪ Overfitting
▪ Computational issues
o Strategies:
▪ Learn low-dimensional representations
▪ Perform feature selection to eliminate “low-yield” features
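The "speed up processing" strategy for large m, stochastic gradient descent versus full gradient descent, can be sketched on a synthetic least-squares problem; each SGD update touches one observation (O(n) work) instead of all m:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic least-squares problem: m observations, n features.
m, n = 10_000, 5
X = rng.normal(size=(m, n))
w_true = rng.normal(size=n)
y = X @ w_true + rng.normal(0, 0.01, m)

# Full gradient descent: every step reads all m observations.
w_gd = np.zeros(n)
for _ in range(100):
    w_gd -= 0.1 * X.T @ (X @ w_gd - y) / m

# Stochastic gradient descent: each step reads one observation,
# so an update costs O(n) instead of O(m * n).
w_sgd = np.zeros(n)
for i in rng.choice(m, 5000):
    w_sgd -= 0.01 * (X[i] @ w_sgd - y[i]) * X[i]

print(np.round(w_gd - w_true, 2))
print(np.round(w_sgd - w_true, 1))
```

Both recover the true weights here; the point is that SGD's per-update cost is independent of m, which is what makes it attractive when m is massive.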
Units of Data
Unit           Value       Scale
Kilobyte (KB)  1000 bytes  A paragraph of text
Megabyte (MB)  1000 KB     A short novel
Gigabyte (GB)  1000 MB     Beethoven's 5th symphony
Terabyte (TB)  1000 GB     All the x-rays in a large hospital
Petabyte (PB)  1000 TB     ≈ 1/2 of all US research libraries
Exabyte (EB)   1000 PB     ≈ 1/5 of the words humans have ever spoken
Communication Hierarchy
Single Node:
CPU
• A single core of a CPU can perform roughly 2B cycles per second
o The processing speed, or clock speed, for a single core is not
changing by much
• However, multi-core CPUs have emerged and improved the overall
performance of CPUs
o The number of cores per CPU is growing rapidly
Main memory
• Main memory has storage capacity of 10 to 100 GB
o This capacity is growing quickly
• Communication between RAM and the CPU is fast, at roughly 50 GB/s
Disk
• Disk storage capacity is growing exponentially, but communication speed
is not
o Communication between the hard drive and the CPU is roughly 500
times slower than between main memory and the CPU
• This communication issue to some extent is mitigated by the fact that a
node usually has multiple disks
• Each disk can communicate in parallel with the CPU
• However, if we consider a node with 10 disks communicating in parallel
with the CPU, then reading from disk is still 50 times slower than
reading from RAM
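A back-of-the-envelope calculation using the bandwidth figures above (RAM at roughly 50 GB/s, a single disk roughly 500x slower, 10 disks reading in parallel) shows why disk reads dominate:

```python
# Time to read 1 TB of data under each bandwidth assumption.
data_gb = 1000                 # 1 TB
ram_bw = 50                    # GB/s, RAM -> CPU
disk_bw = ram_bw / 500         # ~0.1 GB/s for a single disk
ten_disks_bw = 10 * disk_bw    # 10 disks reading in parallel

print(f"RAM:      {data_gb / ram_bw:>8.0f} s")   # ~20 seconds
print(f"1 disk:   {data_gb / disk_bw:>8.0f} s")  # ~2.8 hours
print(f"10 disks: {data_gb / ten_disks_bw:>8.0f} s")  # ~17 minutes
```

Even with 10 disks in parallel, a full pass over the data takes 50 times longer from disk than from RAM, which is exactly the gap quoted above.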
Distributed network:
• Typical network speed is 1 GB per second
Top-of-rack architecture
• Most common setup for commodity clusters
o Nodes are physically stored in racks
• Nodes on the same rack can directly communicate with each other at a
rate of 1 GB per second
• However, nodes on different racks can't communicate directly with each
other, and thus communicate at a slower rate
• Cost of RAM has been steadily decreasing
• Network latency is quite large relative to memory latency
o It's generally better to send a few large messages rather than
many small messages
• Train multiple models simultaneously and batch their communication
o In particular, when tuning hyperparameters we must train several
models anyway
o Training them together lets us batch their messages and reduce
the impact of network latency
Strategies to reduce communication costs
• We need to design algorithms that take advantage of the fact that
parallelism makes our computation faster
o While the disk and network communication slow us down
Perform parallel and in-memory computation
• Persisting in memory reduces our communication burden
o Attractive option when working with iterative algorithms that
read the same data multiple times, as is the case in gradient
descent
o In fact, iterative computation is quite common for several
machine learning algorithms
• As our data grows large though, a standard computing node likely won't
be able to keep all the data in memory
o One way to deal with this situation is to scale up the computing
node and create a powerful multicore machine with several CPUs,
and a huge amount of RAM
o This strategy is advantageous in that we can sidestep any network
communication when working with a single multicore machine
o However, these machines can be quite expensive as they require
specialized hardware, and are thus not as widely accessible as
commodity computing nodes
o Multicore machines can indeed handle fairly large data sets, and
they're an attractive option in many settings
o However, this approach does have scalability limitations, as
we'll eventually hit a wall when the data grows large enough
• As an alternative approach, we can scale out and work in a distributed
environment
o Intuitively, we can work with a large number of commodity nodes
o We can coordinate the efforts across these nodes by connecting
them via a network
o This approach can scale to massive problems since we're working
with readily available commodity computing nodes, and can add
additional nodes as our data grows in size
o However, we must deal with network communication to coordinate
the efforts of the worker nodes
▪ As an example, for gradient descent we perform iterations
of gradient descent
▪ Our training data, stored for instance in an RDD called
train, can be quite large and is stored in a distributed
fashion across worker nodes
▪ In each iteration, we need to read the training data to
compute the gradient update
▪ A naive implementation would require us to store the data
on each worker, on disk, and to read from disk on each
iteration
o However, we can drastically speed up our implementation by
persisting this RDD in distributed memory across all of our
iterations
▪ This allows us to read from memory, instead of disk, when
computing the gradient update on each iteration
▪ Although, we may still need to communicate results from the
map step for each worker, to the driver node
▪ This communication can be expensive where generally we
would like to mitigate network communication as much as
possible, while leveraging distributed computing
Idea: add more RAM to each machine and hold more data in main memory
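The load-once, iterate-in-memory pattern can be sketched in plain Python as a single-machine analogue of persisting an RDD (in Spark this corresponds to calling cache() on the training RDD):

```python
import os
import tempfile
import numpy as np

# Write a synthetic dataset to disk once, load it into RAM once, then
# run every gradient descent iteration against the in-memory copy
# instead of re-reading the file each time.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])

path = os.path.join(tempfile.mkdtemp(), "train.npz")
np.savez(path, X=X, y=y)

data = np.load(path)               # one read from disk...
X_mem, y_mem = data["X"], data["y"]

w = np.zeros(4)
for _ in range(200):               # ...then many in-memory iterations
    w -= 0.1 * X_mem.T @ (X_mem @ w - y_mem) / len(y_mem)

print(np.round(w, 2))
```

The disk cost is paid once up front; the 200 iterations that follow only touch RAM, which is the whole benefit of persisting iteratively-read data in memory.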
Reduce communication
• Let's first consider what we might need to communicate
• In a machine learning setting, we operate on raw data, we extract
features, and we train models, which are represented via their
parameters
• We also create intermediate objects throughout the development of
learning pipelines
• We could potentially communicate any of these objects across the
network
• Simple strategy: keep large objects local
o In other words, we should design our distributed algorithms such
that, whenever possible, we never have to communicate the largest
objects
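This principle can be sketched by simulating distributed gradient descent in plain Python: the (large) data is partitioned across workers and never moves, while only the (small) n-dimensional model and per-worker gradients cross the simulated network:

```python
import numpy as np

rng = np.random.default_rng(4)

# Large object: the training data, partitioned across 3 "workers".
# Small objects: the model w and the per-worker partial gradients.
m, n, workers = 1200, 4, 3
X = rng.normal(size=(m, n))
y = X @ np.array([2.0, -1.0, 0.5, 1.5])
shards = [(X[i::workers], y[i::workers]) for i in range(workers)]

w = np.zeros(n)
for _ in range(200):
    # "Broadcast" w (n numbers) to each worker; each worker computes
    # one n-dimensional partial gradient on its local shard only.
    grads = [Xi.T @ (Xi @ w - yi) / m for Xi, yi in shards]
    # "Collect" and sum the partial gradients, then update the model.
    w -= 0.1 * np.sum(grads, axis=0)

print(np.round(w, 2))
```

Per iteration, only workers * n numbers are communicated, regardless of how large m grows; the m x n data matrix stays local to the shards for the entire run.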