Welcome (back) to IST 380!
Today: putting the programming in
Data Science Programming…
Functions!
I have a sinking
feeling about all
of this fun?
Congratulations, Ravens!
Assignments…
Homework #1 is due tomorrow evening (2/5)
Getting started with R (tutorial + "quiz" + text)
There will be time this evening, if you'd like to use it…
started? finished? thoughts so far?
Homework #2 is due next Tuesday (2/12)
Pr #1: working through the text examples
Pr #2: writing some additional functions + a chance
to consider probability problems
Predictive?
Pr #3: writing a predictive model…
It was 101 years ago!
Following up…
http://www.propublica.org/article/everything-we-know-so-far-about-obamas-big-data-operation
Palm trees!
R in the NYT
http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
Our path
… R's toolsets, now building larger pieces…
[Diagram: axes "Programming Skills" and "Subject Expertise"; the path runs through descriptive statistics, Central Limit Theorem, Functions, predictions]
predictions: I predict we'll get here…
R Reference Card
A security blanket for some of us…
RStudio
An IDE that wraps R
IDE like to know what this means!
Editable files/scripts
Live data
Plots and help
Console interactions
You might want to start with Chapter 9's Integrated Development Environment…
Descriptive statistics
summary, hist, quantile, sd, mean
state populations
Chapter 6 reviews statistical descriptions using these data
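As a minimal sketch of the descriptive functions listed above: the course's own state-population file is not shown here, so this assumes R's built-in `state.x77` matrix (1977 estimates, in thousands) as a stand-in.

```r
# Assumption: state.x77 (built into R) stands in for the course's data file.
pops <- state.x77[, "Population"]   # named numeric vector, one entry per state

summary(pops)                        # min, quartiles, median, mean, max
mean(pops)                           # arithmetic mean
sd(pops)                             # sample standard deviation
quantile(pops, c(0.05, 0.50, 0.95))  # selected quantiles
hist(pops, main = "State populations (thousands)", xlab = "Population")

# A few very large states pull the mean above the median.
mean(pops) > median(pops)            # TRUE for this dataset

# Name of the state whose population is closest to the median.
names(pops)[which.min(abs(pops - median(pops)))]
```

With a different data source the numbers will change, but the same calls should apply to any numeric vector.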
Generative statistics
rgeom, runif, rnorm …
sample, replicate
distribution of samples of state populations
Chapter 7 reviews repeated sampling and the resulting distribution of means
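The generative functions named above can be tried in a few lines. This is only a sketch: the seed is arbitrary, and the built-in `state.x77` data is again an assumed stand-in for the course's state populations.

```r
set.seed(380)  # arbitrary seed, only so the run is repeatable

runif(3)                    # three draws from Uniform(0, 1)
rnorm(3, mean = 0, sd = 1)  # three draws from a standard normal
rgeom(3, prob = 0.5)        # number of failures before the first success

# Assumption: state.x77 stands in for the course's population data.
pops <- state.x77[, "Population"]

one_sample <- sample(pops, 16)  # 16 states, drawn without replacement
mean(one_sample)

# replicate() repeats an expression and collects the results.
sample_means <- replicate(100, mean(sample(pops, 16)))
hist(sample_means, main = "Means of 100 samples of 16 states")
```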
Try it!
(1) Load in the states' populations
Is the mean or median greater? Why?
(2) Which state is closest to the median?
Extra: Which is closest to the 42% quantile?
Create a sample of 16 states from the list.
Create a distribution of 100 such samples – histogram it!
(3) Increase the number of samples until you obtain
sample-distribution mean within ~1% of the real mean.
Find 5% and 95% quantiles of that distribution.
(p<.05)
Big Ideas:
Central Limit Theorem
Law of Large Numbers
Monte Carlo Methods
Central Limit Theorem
the mean of a large number of independent random
variables, each with finite mean and variance, will be
approximately normally distributed.
[Figures: state populations; means of 4000 samples (size 16) from the states]
Central Limit Theorem
Take N samples from this population and find the
mean of each one. For large N, those sample means
will form a bell curve around the true mean
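One way to see this for yourself is sketched below, using the 4000 samples of size 16 mentioned in the figure caption. The data source (`state.x77`), the seed, and the variable names are assumptions for illustration.

```r
set.seed(380)  # arbitrary seed for repeatability

# Assumption: built-in state.x77 populations (thousands) as the population.
pops <- state.x77[, "Population"]

# 4000 samples of 16 states each; keep only each sample's mean.
clt_means <- replicate(4000, mean(sample(pops, 16)))

# The raw populations are skewed; the sample means look far more bell-shaped.
hist(clt_means, breaks = 40, main = "Means of 4000 samples (size 16)")
abline(v = mean(pops), lty = 2)  # dashed line at the true mean

# How close is the centre of the sample means to the true mean?
rel_err <- abs(mean(clt_means) - mean(pops)) / mean(pops)
rel_err

# Middle 90% of the sample means.
quantile(clt_means, c(0.05, 0.95))
```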
This is one special case of …
Law of Large Numbers
in the limit, the average of the results obtained from a large
number of random trials of a process will converge to the
expected value for that process
chances of rolling doubles?
Law of Large Numbers
whose take-home message is: Try it lots of times and
just see what happens!
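For the doubles question, "try it lots of times" might look like the sketch below. The number of rolls and the seed are arbitrary choices; the exact answer for two fair dice is 6/36 = 1/6.

```r
set.seed(380)     # arbitrary seed for repeatability
n_rolls <- 10000  # arbitrary number of trials

die1 <- sample(1:6, n_rolls, replace = TRUE)
die2 <- sample(1:6, n_rolls, replace = TRUE)
doubles <- die1 == die2  # TRUE whenever both dice match

# Fraction of doubles seen after 1, 2, 3, ... rolls.
running_fraction <- cumsum(doubles) / seq_len(n_rolls)
running_fraction[c(10, 100, 1000, 10000)]

plot(running_fraction, type = "l", xlab = "rolls", ylab = "fraction of doubles")
abline(h = 1 / 6, lty = 2)  # the expected value the average settles toward
```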
Monte Carlo Methods
The two Monte Carlos
Making random numbers work for you!
Monte Carlo casino, Monaco
Bond. (James Bond)
Monte Carlo methods, Math/CS
Stanislaw Ulam (Los Alamos badge)
Monty Hall
Let's Make a Deal, '63-'86
Sept. 1990
inspiring the "Monty Hall paradox"
Hw #2: Monte Carlo Monty Hall
… and a second example:
Both envelopes hold some positive amount of money (in a check or IOU),
but one of these two envelopes holds twice as much money as the other.
Should you switch or stay?
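A Monte Carlo sketch of one way to model this follows. It assumes the two amounts are fixed before you choose (here 10 and 20, an arbitrary choice) and that your first pick is random. The function name `play_envelopes` and its parameters are hypothetical, and other ways of modelling the problem can give different answers, which is part of what makes it interesting.

```r
# Hypothetical helper: average winnings over N plays of the envelope game.
# Assumption: the envelopes hold `small` and 2 * small, fixed in advance.
play_envelopes <- function(sors = "switch", N = 10000, small = 10) {
  envelopes <- c(small, 2 * small)
  winnings <- replicate(N, {
    first <- sample(1:2, 1)  # pick one envelope at random
    final <- if (sors == "switch") 3 - first else first
    envelopes[final]
  })
  mean(winnings)
}

set.seed(380)  # arbitrary seed for repeatability
play_envelopes("switch")
play_envelopes("stay")
```

Under this particular assumption both strategies should average close to 15.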
Functions in R (Chapter 9)
Functions in R
A function to add two inputs…
thoughts? oddities? niceties?
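The slide's code is not reproduced in this text, so here is a minimal sketch of what such a function could look like. The name `add` and its argument names are illustrative.

```r
# A function is an ordinary value, assigned to a name with <-.
add <- function(x, y) {
  x + y  # the value of the last expression is returned automatically
}

add(3, 4)            # 7
add(c(1, 2, 3), 10)  # 11 12 13, because + works element by element
```

Two things worth noticing: no explicit `return()` is needed, and no types are declared, so the same function works on single numbers and on vectors.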
Functions in R
A "guessing-game" function… Let's fix it!
42 in 72-point font!
Slide credits: thanks to JHU's R. D. Peng
nested conditionals…
What's going on here?
How are these two conditionals different?
cat is better than print
seq_along creates a vector of indices
Thoughts?
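The original slide code is not included in this text. The sketch below illustrates the same three ideas (nested conditionals, `cat` versus `print`, and `seq_along`) with a made-up function, `describe_all`, and an arbitrary cutoff of 100.

```r
# Hypothetical example: label each value using a nested conditional.
describe_all <- function(values) {
  for (i in seq_along(values)) {  # i runs over 1, 2, ..., length(values)
    v <- values[i]
    if (v > 0) {
      if (v > 100) {              # inner test only runs when v > 0
        label <- "large"
      } else {
        label <- "small"
      }
    } else {
      label <- "non-positive"
    }
    cat("item", i, "is", v, "->", label, "\n")  # cat joins pieces, no [1] prefix
  }
  invisible(NULL)
}

describe_all(c(5, 500, -1))

print("hello")   # prints [1] "hello"
cat("hello\n")   # prints hello

seq_along(c("a", "b", "c"))  # 1 2 3
seq_along(NULL)              # integer(0), so a loop over it runs zero times
1:length(NULL)               # 1 0, a common source of bugs with empty input
```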
Could we write a Monty Hall function?
MH <- function()
… that runs one three-curtain trial?
MHall(chosen_curtain=1, sors="switch", verbose=TRUE)
… another to allow us to turn off printing?
Mhall_N(chosen_curtain=1, sors="switch", N=300)
… another to run it N times?
MystEnv_N(first_env=10, sors="switch", N=10)
… and another to try the envelope-switching?
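One possible shape for the Monty Hall functions is sketched below. It borrows the argument names `chosen_curtain`, `sors`, `verbose`, and `N` from the calls above, but the function names `monty_hall_trial` and `monty_hall_N` and the bodies are assumptions, not the course's solution.

```r
# Hypothetical: run one three-curtain trial; returns TRUE if the player wins.
monty_hall_trial <- function(chosen_curtain = 1, sors = "switch", verbose = FALSE) {
  curtains <- 1:3
  prize <- sample(curtains, 1)

  # The host opens a curtain that is neither the player's nor the prize.
  openable <- setdiff(curtains, c(chosen_curtain, prize))
  # Guard: sample(x, 1) with a single number x would sample from 1:x instead.
  opened <- if (length(openable) == 1) openable else sample(openable, 1)

  final <- if (sors == "switch") {
    setdiff(curtains, c(chosen_curtain, opened))  # the one remaining curtain
  } else {
    chosen_curtain
  }
  won <- final == prize
  if (verbose) {
    cat("prize:", prize, " opened:", opened, " final:", final, " won:", won, "\n")
  }
  won
}

# Hypothetical: fraction of wins over N quiet trials.
monty_hall_N <- function(chosen_curtain = 1, sors = "switch", N = 300) {
  mean(replicate(N, monty_hall_trial(chosen_curtain, sors, verbose = FALSE)))
}

set.seed(380)  # arbitrary seed for repeatability
monty_hall_trial(chosen_curtain = 1, sors = "switch", verbose = TRUE)
monty_hall_N(chosen_curtain = 1, sors = "switch", N = 300)
monty_hall_N(chosen_curtain = 1, sors = "stay", N = 300)
```

Splitting the single trial from the N-trial wrapper keeps the printing optional and lets `replicate` do the repetition.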
So, what is Machine Learning?
The goal of machine learning, or
predictive statistics/analytics,
is to find a function
that yields an output from a previously unseen input,
based on the data available about the process in question.
This week's final problem asks you to write such a
function – for the Titanic survivor dataset.
The Titanic
April 15, 1912
1502 out of the
2224 passengers and crew
died in the sinking
What characteristics did
the survivors share?
The Data
here are the
11 columns
There are 742 rows and 11 columns in the training data.
Our goal
… is to write a function that takes in a row of new data and
outputs whether that passenger would survive (1) or not (0).
A first predictor
A second predictor
Does the data match the
famous emergency cry?
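The predictor code from the slides is not reproduced in this text. As a hedged sketch of what two simple predictors might look like: the column names `Sex` and `Age`, the age cutoff of 13, and the three passengers below are all assumptions made up for illustration, not rows or names taken from the course's dataset.

```r
# Hypothetical first predictor: ignore the input and predict "did not survive".
predict_all_die <- function(row) {
  0
}

# Hypothetical second predictor, following "women and children first".
# Assumes columns named Sex and Age; the cutoff of 13 is arbitrary.
predict_women_children <- function(row) {
  if (row$Sex == "female") return(1)
  if (!is.na(row$Age) && row$Age < 13) return(1)  # missing ages count as adult
  0
}

# Made-up passengers, for illustration only.
passengers <- data.frame(
  Sex = c("female", "male", "male"),
  Age = c(29, 8, NA),
  stringsAsFactors = FALSE
)

# Apply the predictor to each row in turn.
predictions <- sapply(
  seq_len(nrow(passengers)),
  function(i) predict_women_children(passengers[i, ])
)
predictions  # 1 1 0
```

Given a vector of true outcomes from the training data, `mean(predictions == actual)` would give the fraction predicted correctly.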
Testing our functions…
Try it!
Help is available either with hw#1
(getting started with R)
or hw#2 (writing functions/Titanic)
this evening during lab time…
Good luck with everything this week!
Lab!
CS vs. IS and IT?
greater integration
system-wide issues
smaller details
machine specifics
www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
Where will IS go?
Where will IT go?
The bigger picture
Weeks 10-12: Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance
Weeks 13-15: Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam
Data?!
• Neighbor's name
• A place they consider home
• Are they working at a company now? Where?
• How many U.S. states have they visited?
• Their favorite unhealthy food… ?
• Do they have any "Data Science" (statistics, machine learning, CS) background?
state reminders…
Data!
• Neighbor's name: Zachary Dodds
• A place they consider home: Pittsburgh, PA
• Are they working at a company now? Where? Harvey Mudd
• How many U.S. states have they visited? 44
• Their favorite unhealthy food… ? M&Ms
• Do they have any "Data Science" (statistics, machine learning, CS) background? mostly CS for me…
This class is truly seminar-style: we're developing expertise in this field together.
be sure to set up your login + profile for the submission site…