
DATA SCIENCE

COURSE MATERIAL
Sneak a peek at the content of our data
science training

MARCH 20, 2018 | THE CRUNCH ACADEMY


WHAT'S INSIDE?
Core modules of the training

1. Introduction to Data Science


2. Python programming basics
3. Python essentials for Data Science
4. Descriptive Analytics and Data Visualization using Python
5. Hypothesis Testing & Predictive Analytics
6. Predictive Analytics using Python and Scikit-Learn 101
7. Predictive Analytics using Python and Scikit-Learn 102
1. INTRODUCTION TO DATA SCIENCE
Learn what the driving forces in the data science ecosystem are

Every self-respecting data scientist has to know the basic principles. Hence, before we jump into the nitty-gritty details, we give a solid overview of everything this space has to offer. This is accompanied by an introduction to the basic tools that will be used in the hands-on courses that follow.
How is analytics evolving from a business perspective?

Making decisions using facts and figures rather than gut feeling has always been one of the focal points of top-performing companies. Moving from traditional descriptive and diagnostic analytics to more advanced predictive and prescriptive techniques is a natural evolution.

Today's businesses usually have some type of descriptive (what has happened?) and diagnostic (why has this happened?) analytics in place. The move towards a data-scientific approach consists in large part of moving to techniques that not only tell you about the past, but also make predictions about the future.

At the extreme end of this spectrum, companies are starting to give decision-making power to algorithms (i.e. prescriptive analytics). While still a while off for the majority of companies, experimenting with these techniques is of paramount importance to stay ahead of the curve.

INTRODUCTION TO DATA SCIENCE

[Visual: the analytics maturity pyramid (descriptive, diagnostic, predictive, prescriptive analytics), based on Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques. Prentice Hall Press.]
What are key Data Science applications and techniques?

CLASSIFICATION
Assigning observations to the correct category is one of the most frequently encountered problems in predictive analytics. Be it customers with a high churn risk, recognizing handwritten digits or detecting riots in images of crowds. This course starts out with logistic regression as the most basic of these techniques, and moves on to incrementally complex techniques that allow you to solve highly complex problems.

REGRESSION
Competing for first place as the most frequently encountered problem is regression: trying to predict a value on a continuous spectrum. Problems range from predicting the lifetime value of customers to the likelihood that a patient will respond positively to a medical treatment. Again we start with the basics of linear and polynomial regression, moving on to support vector machines, tree-based methods and neural networks to tackle these problems.

CLUSTERING
Many companies have excellent tools to communicate with segments of customers. However, these tools have no real impact if you don't know what these segments should be. Clustering techniques help you to see structure where there used to be none. Again, starting with simple techniques like k-means clustering, we build up towards the most recent and advanced techniques.
What are key Data Science applications and techniques?

DIMENSIONALITY REDUCTION
Lots of data is not always a blessing, for example in situations where the number of inputs is so excessive that any model starts to overfit. Fortunately, dimensionality reduction techniques can help you to derive what is truly important from your dataset, giving you exactly the right number of variables. During these sessions we will cover both interpretable and non-interpretable techniques for doing this (virtual sensors).

MODEL SELECTION
Testing and validating models is a time-consuming task. However, with the right tools and techniques it does not have to be! These sessions will show you how to use cross-validation to set your parameters to their best possible values, and even to test multiple models in an automated way. This way you can be certain that the model you select is the best match for the problem you are trying to tackle.

PREPROCESSING
More than 80% of a data scientist's time is spent in the preprocessing phase. During the remaining 20%, most data scientists wish that they had spent more time on preprocessing. While we cannot solve this issue, we will teach you techniques that shave off as much of this time as possible, leaving you more time to spend on solving problems at a meaningful scale rather than at the level of data quality.
2. PYTHON PROGRAMMING BASICS
Catch up on your programming skills

Without any doubt Python is the best programming language for data scientists today.

The primary reason for this is that all important packages are available for Python. The secondary reason is that Python makes putting models into production and communicating with all kinds of systems easy. Not to mention that it is one of the fastest-growing programming languages in general.
Time to get down and dirty with the data

PYTHON PROGRAMMING BASICS

DATA SCIENCE CHEAT SHEET

INFO
We'll use the following shorthands in this cheat sheet:
df - a pandas DataFrame object
s - a pandas Series object

FIRST THINGS FIRST!
Import these to start:
import pandas as pd
import numpy as np

EXTRA
We give our Data Science graduates a more extensive cheat sheet! So join our training.
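As a minimal sketch of what these shorthands refer to (the column names and values below are made up for illustration, not part of the course material):

import pandas as pd
import numpy as np

# df - a pandas DataFrame object (example data)
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "spend": [120.5, 80.0, np.nan],   # np.nan marks a missing value
})

# s - a pandas Series object: here, a single column of the DataFrame
s = df["spend"]

print(df.head())   # first rows of the DataFrame
print(s.mean())    # pandas ignores NaN by default -> 100.25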
4. DESCRIPTIVE ANALYTICS AND DATA
VISUALIZATION USING PYTHON
First things first, know what you're working with!

The promise of artificially intelligent algorithms seems to suggest that smart algorithms and data are all you need to get results. The reality is that it still takes a substantial amount of work by expert data scientists to ensure that algorithms can perform.

The goal of this class is to give people more experience in making descriptive analyses using Python.
This includes “simple” hypothesis testing, as well as the visualization of these results in an attractive
manner. The visualizations for this part are created using Seaborn and the basic Matplotlib functionality
of Python.
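As a minimal sketch of the kind of descriptive visual covered here (the dataset and plot type are assumptions chosen for illustration, not the course's exact exercises), Seaborn and Matplotlib can be combined like this:

import matplotlib.pyplot as plt
import seaborn as sns

# Load one of Seaborn's built-in example datasets (an assumption; the course uses its own data).
tips = sns.load_dataset("tips")
print(tips.describe())                  # simple descriptive statistics

# A basic descriptive visual: distribution of the total bill per day.
sns.boxplot(x="day", y="total_bill", data=tips)
plt.title("Total bill per day")         # Matplotlib handles titles, labels and saving
plt.tight_layout()
plt.show()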
Basic statistics recap: Measures of Concordance

Even simple metrics like correlation can be highly deceiving. Obvious relationships between two variables, such as y = x^2 + x^3 + x^4, can result in correlation scores that seem to show no relationship whatsoever.

[Scatter plot: a clearly curved relationship between two variables with a correlation of only -0.04. Seemingly there is no relation between these variables...]

DON'T RELY SOLELY ON CORRELATION TO UNCOVER ALL RELATIONSHIPS BETWEEN YOUR VARIABLES
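A minimal sketch of this pitfall (using y = x^2 over a symmetric range as a simplified stand-in for the slide's polynomial; the exact correlation value always depends on the data sampled):

import numpy as np

# A perfectly deterministic, nonlinear relationship...
x = np.linspace(-1, 1, 201)
y = x ** 2                      # y is completely determined by x

# ...yet the Pearson correlation is essentially zero on this symmetric range.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.2f}")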
Tips & Tricks for Creating Effective Visuals

DESCRIPTIVE ANALYTICS AND DATA VISUALIZATION

WHAT NOT TO DO:
Do not use too many colors and/or categories.
If your visualization needs a manual to understand, it is probably a bad idea…

WHAT TO DO:
Less is more!
The objective of a chart is to make numbers intuitive; make sure that a five-year-old is able to understand it.
5. HYPOTHESIS TESTING
& PREDICTIVE ANALYTICS
Now it really begins!

This module is the first to move into the domain of Machine Learning and to build your first predictive models. A key learning in this session is the interrelationship between traditional statistics and machine learning, and how these two domains have been moving closer to each other in recent years.
FROM SAMPLE TO POPULATION

[Diagram: sample statistics (the 'Roman alphabet': mean, SD, correlation, ...) serve as estimators of the corresponding population parameters (the 'Greek alphabet': mean, SD, correlation, ...); any difference between the two could simply be bad luck.]

Data - Model
Reality - Theory
Observations - Predictions
MACHINE LEARNING PROBLEMS

UNDERFIT

[Plot: price vs. size data with an underfitting model.]

When a model underfits, it is unable to capture the full complexity of the situation. This means that the model will perform poorly when you rely on it to make predictions.

The solution in this case is to increase the complexity of the model, in the hope of capturing the full complexity of the situation.

Predicts known data badly (high bias)
MSE of the training set is high
Bad! -> increase complexity
MACHINE LEARNING PROBLEMS

OVERFIT

[Plot: price vs. size data with an overfitting model that chases every data point.]

Overfitting typically happens when you use a model that is too complex for the data at hand. Especially in situations where the data is quite complex, it can be nontrivial to detect that your model is overfitting. In a sense, overfitting is also more dangerous than underfitting, since you will expect your model to perform substantially better than it actually will in practice (whereas in the underfitting case at least you will not have unrealistic expectations when it comes to model performance).

Using a correct test design with cross-validation can help you select a model that is not too complex for the data at hand, and prevent overfitting.

Predicts known data 'too well' (low bias)
Predicts unknown data badly (high variance)
MACHINE LEARNING SOLUTION

BALANCE

[Plot: price vs. size data with a balanced model.]

A balanced model is what every data scientist should strive to create. This model does not overfit (too much) and captures the actual signal in the data.
BIAS - VARIANCE TRADE-OFF

[Diagram: 2x2 grid combining low/high bias with low/high variance.]

Low bias + low variance: the optimal model that hits close to home every time.

Low bias + high variance: your model is right on average, but can be quite far off target for specific observations.

High bias + low variance: your model is performing consistently, but it is consistently wrong.

High bias + high variance: your model is both off target and contains substantial amounts of noise.

An important note is that high bias is a sign of underfitting, whereas high variance is a sign of overfitting. As such, there is an ideal balance to be found where a model is both as accurate and as consistent as possible. Where this balance lies depends on the specifics of the data, but in large part also on the specifics of what you want to do with the model.
BIAS - VARIANCE TRADE-OFF

[Plot: training and test error as a function of model complexity.]

Working with a training and a test set is one of the simplest ways of preventing overfitting. As the complexity of a model increases, the error on both the training set (= the data that the model is shown) and the test set (= the data that is held back from the model to test its performance) will decrease.

However, at some point along this complexity axis it will become apparent that increasing model complexity only improves performance on the training data, while performance on the test data starts to deteriorate. At this point you are overfitting.
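A minimal sketch of this effect (the noisy toy dataset and the chosen polynomial degrees are assumptions made purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up noisy data relating size to price.
rng = np.random.RandomState(0)
size = rng.uniform(0, 3, 80).reshape(-1, 1)
price = np.sin(size).ravel() + rng.normal(scale=0.2, size=80)

# Hold back part of the data so performance can be checked on unseen observations.
X_train, X_test, y_train, y_test = train_test_split(size, price, test_size=0.3, random_state=0)

for degree in (1, 3, 12):  # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # Training error keeps dropping with complexity; test error typically rises again once the model overfits.
    print(degree, round(train_mse, 3), round(test_mse, 3))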
6 & 7. PREDICTIVE ANALYTICS USING
PYTHON AND SCIKIT-LEARN 101 & 102
Let's do this!

Are you ready to dive fully into the domain of Data Science? Sklearn contains tools to make your life as a
Data Scientist as easy as possible!
FOR DUMMIES: HOW TO BUILD A MODEL IN SKLEARN

STEP BY STEP
1. What do you wish to model? (Un)supervised (see next slide); which datatypes?
2. Now look at your data.
3. Determine your modeling technique.
4. If needed, transform variables.
5. Start building your datasets in the following order: train, test, validate.
6. Train your model.
7. Evaluate the model on your training set. Check for underfit.
8. Test your model.
9. Evaluate the model on your test set. Check for overfit.
10. Determine your final model.
11. If possible, validate on the validation set.
12. Interpret your model!

That's it! Try it out! (A minimal sketch of these steps follows below.)
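A minimal sketch of this recipe (the dataset, the model choice and the split sizes are assumptions chosen for illustration, not the course's exercise):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a supervised classification problem with numeric features (example dataset).
X, y = load_breast_cancer(return_X_y=True)

# Step 5: build a train and a test set (a separate validation set is skipped here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Steps 3-6: pick a technique (scaling + logistic regression) and train it.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Steps 7-9: evaluate on the training set (underfit check) and the test set (overfit check).
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))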


SKLEARN MODELS OVERVIEW

SUPERVISED
Linear regression
Polynomial regression
Ridge regression
Logistic regression
Random forest
Support Vector Machine (see next slide)

UNSUPERVISED
Clustering
PCA
What is SVM?

Abbreviation? S(upport) V(ector) M(achine)

What is it used for? Classification/clustering algorithm

The options, depending on your kernel:
'Simple' algorithm (linear)
Complex algorithm (non-linear)

How does it work?
Transform the (non-linear) data into higher dimensions
Find the best hyperplane borders
Transform back

http://efavdb.com/svm-classification/
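A minimal sketch of a non-linear SVM in scikit-learn (the toy dataset and kernel settings are assumptions made for illustration):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A toy dataset that no straight line can separate.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space,
# finds the best separating hyperplane there, and projects the decision
# boundary back into the original space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))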
SKLEARN VALIDATIONS OVERVIEW

METHODS

1. Simple procedure: random training and test set
2. Advanced procedure: stratified training and test set
3. Even more advanced procedure: k-fold cross-validation
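A minimal sketch contrasting methods 1 and 2 (the dataset is an assumption; stratify=y keeps the class proportions identical in training and test sets):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Method 1: simple random training/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Method 2: stratified split, preserving the class balance in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)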
What is K Fold Cross Validation?

What is it?
Cross Validation is a very useful technique for assessing the performance of your models. It helps in
knowing how the model would generalize to an independent data set.

When should you use this?

You want to use this technique to estimate how accurate your model's predictions will be in practice.

[Diagram: all data is divided into k folds (here k = 4); in each iteration a different fold serves as the test data while the remaining folds serve as the training data.]
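A minimal sketch of k-fold cross-validation in scikit-learn (the dataset, the model and k = 5 are assumptions chosen for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each of the k = 5 folds is used once as the test set; the other folds train the model.
scores = cross_val_score(model, X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())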
ADVANCED MODULE HIGHLIGHTS
Feed your curiosity

Get inspired by the more advanced features of Data Science and use this knowledge to keep on
challenging yourself to learn more. Your basics are on point, now it's up to you!
Test your Data Science model

Test how well (or badly) your model will perform in practice by means of our stress test.

Two objectives:

Primary objective: the model is able to handle issues automatically and keep functioning.
Secondary objective: the model warns you when it cannot handle things on its own anymore.
Optimizing your neural network

Loss/cost function has a slope? The gradient (derivative)

Step down the slope:
Change w1
Change w2

Change w1 again?
Only way is up
--> Optimized!
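A minimal sketch of this idea (plain gradient descent on a made-up loss with two weights; the loss function, learning rate and starting point are assumptions for illustration):

import numpy as np

# A made-up convex loss over two weights: L(w1, w2) = (w1 - 3)^2 + (w2 + 1)^2
def loss(w):
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2

def gradient(w):
    # The slope of the loss with respect to each weight.
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.array([0.0, 0.0])        # starting point
learning_rate = 0.1

for step in range(100):
    w = w - learning_rate * gradient(w)   # step down the slope

# Approaches (3, -1): at that point any further change to w1 or w2 only increases the loss.
print("optimized weights:", w)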
NLP: The Good & The Bad

Very Good
Spam
Recognizing parts of sentences
Verbs, nouns, …
Locations
Names

Moderate
Translating
Sentiment analysis
Extracting "unexpected" components (à la Siri)
Recognizing plagiarism (when paraphrasing)

Painstakingly Bad
Interpreting complex speech
Having a true conversation (Turing test)
CURIOUS ABOUT
THE TRAININGS?
Download our training overview here

1. The Machine Learning Academy for programmers


2. Data Science Course for Analysts
3. Data Science for Business
