Introduction to Machine Learning
What is Machine Learning?
“Learning is any process by which a system
improves performance from experience.”
- Herbert Simon
Definition by Tom Mitchell (1998):
Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
A well-defined learning task is given by <P, T, E>.
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
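As a minimal sketch of this contrast (the spam-filter rule, feature, and toy data below are invented for illustration): in traditional programming a human writes the program; in machine learning the computer is given data together with the desired outputs and produces the program, here a learned threshold.

# Traditional programming: a human writes the rule by hand.
def is_spam_rule(num_exclamation_marks):
    return num_exclamation_marks > 3              # hand-chosen threshold

# Machine learning: the "program" (a threshold) is found from labeled examples.
def learn_threshold(examples):
    """examples: list of (num_exclamation_marks, is_spam) pairs (toy data)."""
    best_t, best_acc = 0, 0.0
    for t in range(0, 11):                        # search candidate thresholds
        acc = sum((x > t) == y for x, y in examples) / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

data = [(0, False), (1, False), (5, True), (7, True)]   # made-up training data
threshold = learn_threshold(data)                 # program produced from data + outputs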
When Do We Use Machine Learning?
ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)
Learning isn’t always useful:
• There is no need to “learn” to calculate payroll
A classic example of a task that requires machine learning: it is very hard to say what makes a 2.
Some more examples of tasks that are best solved by using a learning algorithm
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences
• Recognizing anomalies:
– Unusual credit card transactions
– Unusual patterns of sensor readings in a nuclear power
plant
• Prediction:
– Future stock prices or currency exchange rates
Sample Applications
• Web search
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Debugging software
• [Your favorite area]
Slide credit: Pedro Domingos
Samuel’s Checkers-Player
“Machine Learning: Field of study that gives
computers the ability to learn without being
explicitly programmed.” - Arthur Samuel (1959)
Defining the Learning Task
Improve on task T, with respect to
performance metric P, based on
experience E
T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself

T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words

T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while observing a human driver

T: Categorizing email messages as spam or legitimate
P: Percentage of email messages correctly classified
E: Database of emails, some with human-given labels
State of the Art Applications of Machine Learning
Autonomous Cars
• Nevada made it legal for
autonomous cars to drive
on roads in June 2011
• As of 2013, four states
(Nevada, Florida, California,
and Michigan) have legalized
autonomous cars
Penn’s Autonomous Car (Ben Franklin Racing Team)
Autonomous Car Sensors
Autonomous Car Technology
[Figure labels: Path Planning · Laser Terrain Mapping · Learning from Human Drivers · Adaptive Vision · Sebastian · Stanley]
Images and movies taken from Sebastian Thrun’s multimedia website.
Deep Learning in the Headlines
Deep Belief Net on Face Images
[Learned feature hierarchy: pixels → edges → object parts (combinations of edges) → object models]
Based on materials by Andrew Ng
Learning of Object Parts
Slide credit: Andrew Ng
Training on Multiple Objects
Trained on 4 classes (cars, faces, motorbikes, airplanes).
Second layer: shared features and object-specific features.
Third layer: more specific features.
Slide credit: Andrew Ng
Scene Labeling via Deep Learning
[Farabet et al. ICML 2012, PAMI 2013]
Inference from Deep Learned Models
Generating posterior samples from faces by “filling in” experiments (cf. Lee and Mumford, 2003). Combine bottom-up and top-down inference.
[Figure panels: Input images · Samples from feedforward inference (control) · Samples from full posterior inference]
Machine Learning in Automatic Speech Recognition
A Typical Speech Recognition System
ML is used to predict phone states from the sound spectrogram
Deep learning has state-of-the-art results
# Hidden Layers      1      2      4      8      10     12
Word Error Rate %    16.0   12.8   11.4   10.9   11.0   11.1
Baseline GMM performance = 15.4%
[Zeiler et al. “On rectified linear units for speech
recognition” ICASSP 2013]
Impact of Deep Learning in Speech Technology
Types of Learning
• Supervised (inductive) learning
– Given: training data + desired outputs (labels)
• Unsupervised learning
– Given: training data (without desired outputs)
• Semi-supervised learning
– Given: training data + a few desired outputs
• Reinforcement learning
– Rewards from sequence of actions
Supervised Learning: Regression
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f(x) to predict y given x
– y is real-valued == regression
[Plot: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020]
Data from G. Witt. Journal of Statistics Education, Volume 21, Number 1 (2013)
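A minimal sketch of fitting such a regression line with NumPy (the year/extent numbers below are placeholders, not the Witt data):

import numpy as np

# Toy (x, y) pairs: year vs. September sea-ice extent (placeholder values).
years  = np.array([1980, 1990, 2000, 2010], dtype=float)
extent = np.array([ 7.8,  6.2,  6.3,  4.9])

# Fit f(x) = slope * x + intercept by least squares.
slope, intercept = np.polyfit(years, extent, deg=1)

def f(x):
    """Predict the real-valued y (extent) for a given x (year) -- regression."""
    return slope * x + intercept

print(f(2020))   # extrapolated prediction for 2020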
Supervised Learning: Classification
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f(x) to predict y given x
– y is categorical == classification

Breast Cancer (Malignant / Benign)
[Plot: label (1 = Malignant, 0 = Benign) vs. Tumor Size, with the learned decision boundary: predict benign below the threshold, malignant above]
Based on example by Andrew Ng
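A minimal sketch of learning such a classifier with scikit-learn’s logistic regression (the tumor sizes and labels are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: tumor size (cm) -> 0 = benign, 1 = malignant (made-up values).
X = np.array([[1.0], [1.5], [2.0], [3.5], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)   # learn f(x) predicting a categorical y

print(clf.predict([[2.2], [4.5]]))     # predicted classes
print(clf.predict_proba([[2.2]]))      # class probabilities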
Supervised Learning
• x can be multi-dimensional
– Each dimension corresponds to an attribute
- Clump Thickness
- Uniformity of Cell Size
- Uniformity of Cell Shape
- …
[Plot: Age vs. Tumor Size]
Based on example by Andrew Ng
Unsupervised Learning
• Given x1, x2, ..., xn (without labels)
• Output hidden structure behind the x’s
– E.g., clustering
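A minimal sketch of uncovering hidden structure by clustering with scikit-learn’s k-means (the 2-D points are synthetic):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points x1, ..., xn (synthetic, two loose groups).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # hidden structure: a cluster id per point
print(kmeans.cluster_centers_)   # discovered group centers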
Unsupervised Learning
Genomics application: group individuals by genetic similarity
[Heat map: Genes × Individuals]
[Source: Daphne Koller]
Unsupervised Learning
Organize computing clusters · Social network analysis · Market segmentation · Astronomical data analysis
Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)
Slide credit: Andrew Ng
Unsupervised Learning
• Independent component analysis –
separate a combined signal into its
original sources
Image credit: statsoft.com Audio from
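A minimal sketch of this kind of blind source separation with scikit-learn’s FastICA (the two source signals and the mixing matrix are synthetic stand-ins for the audio example):

import numpy as np
from sklearn.decomposition import FastICA

# Two synthetic source signals.
t = np.linspace(0, 1, 500)
sources = np.c_[np.sin(8 * t), np.sign(np.sin(25 * t))]

# Mix them with an arbitrary mixing matrix to get the observed (combined) signals.
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.0]])
observed = sources @ mixing.T

# Recover estimates of the original sources from the mixture alone.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)   # shape (n_samples, 2)
print(recovered.shape)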
Reinforcement Learning
• Given a sequence of states and actions
with (delayed) rewards, output a policy
– Policy is a mapping from states to actions that tells you what to do in a given state
• Examples:
– Credit assignment problem
– Game playing
– Robot in a maze
– Balance a pole on your hand
The Agent-Environment Interface
Agent and environment interact at discrete time steps t = 0, 1, 2, ...
Agent observes state at step t:  s_t ∈ S
produces action at step t:  a_t ∈ A(s_t)
gets resulting reward:  r_{t+1}
and resulting next state:  s_{t+1}

... s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → ...
Slide credit: Sutton & Barto
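A minimal sketch of this interface as code (the one-dimensional corridor environment and the random action choice are invented for illustration; a learning agent would improve its policy from the observed rewards):

import random

class CorridorEnv:
    """Toy environment: states 0..4, reward +1 for reaching state 4."""
    def reset(self):
        self.state = 0
        return self.state                      # s_0

    def step(self, action):                    # action in {-1, +1}
        self.state = min(max(self.state + action, 0), 4)
        reward = 1.0 if self.state == 4 else 0.0   # r_{t+1}
        done = self.state == 4
        return self.state, reward, done        # s_{t+1}, r_{t+1}, episode over?

env = CorridorEnv()
s = env.reset()
for t in range(20):                            # t = 0, 1, 2, ...
    a = random.choice([-1, +1])                # agent produces a_t
    s, r, done = env.step(a)                   # environment returns s_{t+1}, r_{t+1}
    if done:
        break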
Reinforcement Learning
https://www.youtube.com/watch?v=4cgWya-wjgY
Inverse Reinforcement Learning
• Learn policy from user demonstrations
Stanford Autonomous Helicopter
http://heli.stanford.edu/
https://www.youtube.com/watch?v=VCdxqn0fcnE
Framing a Learning Problem
Designing a Learning System
• Choose the training experience
• Choose exactly what is to be learned
– i.e. the target function
• Choose how to represent the target function
• Choose a learning algorithm to infer the
target function from the experience
[Diagram: Environment/Experience → Training data → Learner → Knowledge → Performance Element, evaluated on Testing data]
Based on slide by Ray Mooney
Training vs. Test Distribution
• We generally assume that the training
and test examples are independently
drawn from the same overall distribution
of data
– We call this “i.i.d” which stands for
“independent and identically distributed”
• If examples are not independent, requires
collective classification
• If test distribution is different, requires
transfer learning
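A minimal sketch of the practice this assumption licenses, splitting one dataset into i.i.d. training and test portions (the feature and label arrays are synthetic):

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset assumed to be drawn i.i.d. from one distribution.
X = np.arange(20).reshape(-1, 1).astype(float)
y = (X.ravel() > 10).astype(int)

# Hold out 25% of the examples for testing; shuffling keeps the split i.i.d.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0)

print(len(X_train), len(X_test))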
ML in a Nutshell
• Tens of thousands of machine
learning algorithms
– Hundreds new every year
• Every ML algorithm has three components:
– Representation
– Optimization
– Evaluation
Slide credit: Pedro Domingos
Various Function Representations
• Numerical functions
– Linear regression
– Neural networks
– Support vector machines
• Symbolic functions
– Decision trees
– Rules in propositional logic
– Rules in first-order predicate logic
• Instance-based functions
– Nearest-neighbor
– Case-based
• Probabilistic Graphical Models
– Naïve Bayes
– Bayesian networks
– Hidden-Markov Models (HMMs)
– Probabilistic Context Free Grammars (PCFGs)
– Markov networks
Slide credit: Ray Mooney
Various Search/Optimization Algorithms
• Gradient descent
– Perceptron
– Backpropagation
• Dynamic Programming
– HMM Learning
– PCFG Learning
• Divide and Conquer
– Decision tree induction
– Rule learning
• Evolutionary Computation
– Genetic Algorithms (GAs)
– Genetic Programming (GP)
– Neuro-evolution
Slide credit: Ray Mooney
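Of the families above, here is a minimal sketch of the first, plain gradient descent, minimizing squared error for a one-parameter linear model (the data and learning rate are arbitrary illustrations):

import numpy as np

# Toy data generated from y ≈ 3x, so the learned w should approach 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + np.array([0.1, -0.1, 0.05, 0.0])

w, lr = 0.0, 0.01                        # initial weight and learning rate
for step in range(200):
    pred = w * x
    grad = 2 * np.mean((pred - y) * x)   # d/dw of the mean squared error
    w -= lr * grad                       # move against the gradient
print(w)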
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• etc.
Slide credit: Pedro Domingos
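A minimal sketch computing a few of these metrics with scikit-learn (the true and predicted label vectors are invented):

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented true labels and classifier predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))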
ML in Practice
• Understand domain, prior knowledge, and goals
• Data integration, selection, cleaning, pre-processing, etc.  [Loop]
• Learn models
• Interpret results
• Consolidate and deploy discovered knowledge
Based on a slide by Pedro Domingos
Lessons Learned about Learning
• Learning can be viewed as using direct or
indirect experience to approximate a chosen
target function.
• Function approximation can be viewed as a
search through a space of hypotheses
(representations of functions) for one that best
fits a set of training data.
• Different learning methods assume different
hypothesis spaces (representation languages)
and/or employ different search techniques.
Slide credit: Ray Mooney
A Brief History of Machine Learning
History of Machine Learning
• 1950s
– Samuel’s checker player
– Selfridge’s Pandemonium
• 1960s:
– Neural networks: Perceptron
– Pattern recognition
– Learning in the limit theory
– Minsky and Papert prove limitations of Perceptron
• 1970s:
– Symbolic concept induction
– Winston’s arch learner
– Expert systems and the knowledge acquisition bottleneck
– Quinlan’s ID3
– Michalski’s AQ and soybean diagnosis
– Scientific discovery with BACON
– Mathematical discovery with AM
Slide credit: Ray Mooney
History of Machine Learning (cont.)
• 1980s:
– Advanced decision tree and rule learning
– Explanation-based Learning (EBL)
– Learning and planning and problem solving
– Utility problem
– Analogy
– Cognitive architectures
– Resurgence of neural networks (connectionism, backpropagation)
– Valiant’s PAC Learning Theory
– Focus on experimental methodology
• 1990s
– Data mining
– Adaptive software agents and web applications
– Text learning
– Reinforcement learning (RL)
– Inductive Logic Programming (ILP)
– Ensembles: Bagging, Boosting, and Stacking
– Bayes Net learning
Slide credit: Ray Mooney
History of Machine Learning (cont.)
• 2000s
– Support vector machines & kernel methods
– Graphical models
– Statistical relational learning
– Transfer learning
– Sequence labeling
– Collective classification and structured outputs
– Computer Systems Applications (Compilers, Debugging, Graphics, Security)
– E-mail management
– Personalized assistants that learn
– Learning in robotics and vision
• 2010s
– Deep learning systems
– Learning for big data
– Bayesian methods
– Multi-task & lifelong learning
– Applications to vision, speech, social networks, learning to read, etc.
– ???
Based on slide by Ray Mooney
What We’ll Cover in this Course
• Supervised learning
– Decision tree induction
– Linear regression
– Logistic regression
– Support vector machines & kernel methods
– Model ensembles
– Bayesian learning
– Neural networks & deep learning
– Learning theory
• Unsupervised learning
– Clustering
– Dimensionality reduction
• Reinforcement learning
– Temporal difference learning
– Q learning
• Evaluation
• Applications

Our focus will be on applying machine learning to real applications