
DATA SCIENCE

COURSE MATERIAL
Sneak a peek at the content of our data
science training

MARCH 20, 2018 | THE CRUNCH ACADEMY


WHAT'S INSIDE?
Core modules of the training

1. Introduction to Data Science


2. Python programming basics
3. Python essentials for Data Science
4. Descriptive Analytics and Data Visualization using Python
5. Hypothesis Testing & Predictive Analytics
6. Predictive Analytics using Python and Scikit-Learn 101
7. Predictive Analytics using Python and Scikit-Learn 102
1. INTRODUCTION TO DATA SCIENCE
Learn what the driving forces in the data science ecosystem are

Every self-respecting data scientist has to know the basic principles. Hence, before we jump into the nitty-gritty details, we give a solid overview of everything this space has to offer. This is accompanied by an introduction to the basic tools that will be used in the hands-on courses that follow.
How is analytics evolving from a business perspective?

Making decisions using facts and figures rather than gut feeling has always been one of the focal points of top-performing companies. Moving from traditional descriptive and diagnostic analytics to more advanced predictive and prescriptive techniques is a natural evolution.

Today's businesses usually have some type of descriptive (what has happened?) and diagnostic (why has this happened?) analytics in place. The move towards a data-scientific approach consists in large part of moving to techniques that not only tell you about the past, but also make predictions about the future.

At the extreme end of this spectrum, companies are starting to give decision-making power to algorithms (i.e. prescriptive analytics). While still a while off for the majority of companies, experimenting with these techniques is of paramount importance to stay ahead of the curve.

INTRODUCTION TO DATA SCIENCE

[Visual: the analytics maturity pyramid (descriptive, diagnostic, predictive, prescriptive analytics), based on Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques. Prentice Hall Press.]
What are key Data Science applications and techniques?

CLASSIFICATION
Assigning observations to the correct category is one of the most frequently encountered problems in predictive analytics. Be it customers with a high churn risk, recognizing handwritten digits or detecting riots in images of crowds. This course starts out with logistic regression as the most basic of these techniques, and moves on to incrementally complex techniques that allow you to solve highly complex problems.

REGRESSION
Competing for first place as the most frequently encountered problem is regression: trying to predict a value on a continuous spectrum. Problems range from predicting the lifetime value of customers to the likelihood that a patient will respond positively to a medical treatment. Again we start with the basics of linear and polynomial regression, moving on to support vector machines, tree-based methods and neural networks to tackle these problems.

CLUSTERING
Many companies have excellent tools to communicate with segments of customers. However, these tools have no real impact if you don't know what these segments should be. Clustering techniques help you to see structure where there used to be none. Again, starting with simple techniques like k-means clustering, we build up towards the most recent and advanced techniques.
What are key Data Science applications and techniques?

DIMENSIONALITY REDUCTION
Lots of data is not always a blessing, for example in situations where the number of inputs is so excessive that any model starts to overfit. Fortunately, dimensionality reduction techniques can help you to derive what is truly important from your dataset, giving you exactly the right number of variables. During these sessions we will cover both interpretable and non-interpretable techniques for doing this (virtual sensors).

MODEL SELECTION
Testing and validating models is a time-consuming task. However, with the right tools and techniques it does not have to be! These sessions will show you how to use cross-validation to set your parameters to their best possible values, and even to test multiple models in an automated way. This way you can be certain that the model you select is the best match for the problem you are trying to tackle.

PREPROCESSING
More than 80% of a data scientist's time is spent in the preprocessing phase. During the remaining 20%, most data scientists wish that they had spent more time on preprocessing. While we cannot solve this issue, we will teach you techniques that shave off as much of this time as possible, leaving you more time to spend on solving problems at a meaningful scale rather than at the level of data quality.
2. PYTHON PROGRAMMING BASICS
Catch up on your programming skills

Without any doubt Python is the best programming language for data scientists today.

The primary reason for this is that all important packages are available for Python. The secondary reason is that Python makes putting models into production and communicating with all kinds of systems easy. Not to mention that it is one of the fastest-growing programming languages in general.
Time to get down and dirty with the data

PYTHON PROGRAMMING BASICS

DATA SCIENCE CHEAT SHEET

INFO
We'll use the following shorthands in this cheat sheet:
df - a pandas DataFrame object
s - a pandas Series object

FIRST THINGS FIRST!
Import these to start:
import pandas as pd
import numpy as np

EXTRA
We give our Data Science graduates a more extensive cheat sheet! So join our training.
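As a minimal sketch of what these shorthands refer to (the column names and values below are made up for illustration, not part of the course material):

import pandas as pd
import numpy as np

# df - a pandas DataFrame object (example data)
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "spend": [120.5, 80.0, np.nan],   # np.nan marks a missing value
})

# s - a pandas Series object: here, a single column of the DataFrame
s = df["spend"]

print(df.head())   # first rows of the DataFrame
print(s.mean())    # pandas ignores NaN by default -> 100.25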
4. DESCRIPTIVE ANALYTICS AND DATA
VISUALIZATION USING PYTHON
First things first, know what you're working with!

The promise of artificially intelligent algorithms seems to suggest that smart algorithms and data are all you need to get results. The reality is that it still takes a substantial amount of work by expert data scientists to ensure that algorithms can perform.

The goal of this class is to give people more experience in making descriptive analyses using Python.
This includes “simple” hypothesis testing, as well as the visualization of these results in an attractive
manner. The visualizations for this part are created using Seaborn and the basic Matplotlib functionality
of Python.
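As a minimal sketch of the kind of descriptive visual covered here (the dataset and plot type are assumptions chosen for illustration, not the course's exact exercises), Seaborn and Matplotlib can be combined like this:

import matplotlib.pyplot as plt
import seaborn as sns

# Load one of Seaborn's built-in example datasets (an assumption; the course uses its own data).
tips = sns.load_dataset("tips")
print(tips.describe())                  # simple descriptive statistics

# A basic descriptive visual: distribution of the total bill per day.
sns.boxplot(x="day", y="total_bill", data=tips)
plt.title("Total bill per day")         # Matplotlib handles titles, labels and saving
plt.tight_layout()
plt.show()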
Basic statistics recap: Measures of Concordance

Even simple metrics like correlation can be highly deceiving. Obvious relationships between two variables, such as y = x^2 + x^3 + x^4, can result in correlation scores that seem to show no relationship whatsoever.

[Scatter plot: a clearly curved relationship between two variables with a correlation of only -0.04. Seemingly there is no relation between these variables...]

DON'T RELY SOLELY ON CORRELATION TO UNCOVER ALL RELATIONSHIPS BETWEEN YOUR VARIABLES
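A minimal sketch of this pitfall (using y = x^2 over a symmetric range as a simplified stand-in for the slide's polynomial; the exact correlation value always depends on the data sampled):

import numpy as np

# A perfectly deterministic, nonlinear relationship...
x = np.linspace(-1, 1, 201)
y = x ** 2                      # y is completely determined by x

# ...yet the Pearson correlation is essentially zero on this symmetric range.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.2f}")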
Tips & Tricks for Creating Effective Visuals

DESCRIPTIVE ANALYTICS AND DATA VISUALIZATION

WHAT NOT TO DO:
Do not use too many colors and/or categories.
If your visualization needs a manual to understand, it is probably a bad idea…

WHAT TO DO:
Less is more!
The objective of a chart is to make numbers intuitive; make sure that a five-year-old is able to understand it.
5. HYPOTHESIS TESTING
& PREDICTIVE ANALYTICS
Now it really begins!

This module is the first to move into the domain of Machine Learning and to build your first predictive models. A key learning in this session is the interrelationship between traditional statistics and machine learning, and how these two domains have been moving closer to each other in recent years.
FROM SAMPLE TO POPULATION

[Diagram: sample statistics (the 'Roman alphabet': mean, SD, correlation, ...) serve as estimators of the corresponding population parameters (the 'Greek alphabet': mean, SD, correlation, ...); any difference between the two could simply be bad luck.]

Data - Model
Reality - Theory
Observations - Predictions
MACHINE LEARNING PROBLEMS

UNDERFIT

[Plot: price vs. size data with an underfitting model.]

When a model underfits, it is unable to capture the full complexity of the situation. This means that the model will perform poorly when you rely on it to make predictions.

The solution in this case is to increase the complexity of the model, in the hope of capturing the full complexity of the situation.

Predicts known data badly (high bias)
MSE of the training set is high
Bad! -> increase complexity
MACHINE LEARNING PROBLEMS

OVERFIT

[Plot: price vs. size data with an overfitting model that chases every data point.]

Overfitting typically happens when you use a model that is too complex for the data at hand. Especially in situations where the data is quite complex, it can be nontrivial to detect that your model is overfitting. In a sense, overfitting is also more dangerous than underfitting, since you will expect your model to perform substantially better than it actually will in practice (whereas in the underfitting case at least you will not have unrealistic expectations when it comes to model performance).

Using a correct test design with cross-validation can help you select a model that is not too complex for the data at hand, and prevent overfitting.

Predicts known data 'too well' (low bias)
Predicts unknown data badly (high variance)
MACHINE LEARNING SOLUTION

BALANCE

[Plot: price vs. size data with a balanced model.]

A balanced model is what every data scientist should strive to create. This model does not overfit (too much) and captures the actual signal in the data.
BIAS - VARIANCE TRADE-OFF

[Diagram: 2x2 grid combining low/high bias with low/high variance.]

Low bias + low variance: the optimal model that hits close to home every time.

Low bias + high variance: your model is right on average, but can be quite far off target for specific observations.

High bias + low variance: your model is performing consistently, but it is consistently wrong.

High bias + high variance: your model is both off target and contains substantial amounts of noise.

An important note is that high bias is a sign of underfitting, whereas high variance is a sign of overfitting. As such, there is an ideal balance to be found where a model is both as accurate and as consistent as possible. Where this balance lies depends on the specifics of the data, but in large part also on the specifics of what you want to do with the model.
BIAS - VARIANCE TRADE-OFF

[Plot: training and test error as a function of model complexity.]

Working with a training and a test set is one of the simplest ways of preventing overfitting. As the complexity of a model increases, the error on both the training set (= the data that the model is shown) and the test set (= the data that is held back from the model to test its performance) will decrease.

However, at some point along this complexity axis it will become apparent that increasing model complexity only improves performance on the training data, while performance on the test data starts to deteriorate. At this point you are overfitting.
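A minimal sketch of this effect (the noisy toy dataset and the chosen polynomial degrees are assumptions made purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up noisy data relating size to price.
rng = np.random.RandomState(0)
size = rng.uniform(0, 3, 80).reshape(-1, 1)
price = np.sin(size).ravel() + rng.normal(scale=0.2, size=80)

# Hold back part of the data so performance can be checked on unseen observations.
X_train, X_test, y_train, y_test = train_test_split(size, price, test_size=0.3, random_state=0)

for degree in (1, 3, 12):  # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # Training error keeps dropping with complexity; test error typically rises again once the model overfits.
    print(degree, round(train_mse, 3), round(test_mse, 3))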
6 & 7. PREDICTIVE ANALYTICS USING
PYTHON AND SCIKIT-LEARN 101 & 102
Let's do this!

Are you ready to dive fully into the domain of Data Science? Sklearn contains tools to make your life as a
Data Scientist as easy as possible!
FOR DUMMIES: HOW TO BUILD A MODEL IN SKLEARN

STEP BY STEP
1. What do you wish to model? (Un)supervised (see next slide); which datatypes?
2. Now look at your data.
3. Determine your modeling technique.
4. If needed, transform variables.
5. Start building your datasets in the following order: train, test, validate.
6. Train your model.
7. Evaluate the model on your training set. Check for underfit.
8. Test your model.
9. Evaluate the model on your test set. Check for overfit.
10. Determine your final model.
11. If possible, validate on the validation set.
12. Interpret your model!

That's it! Try it out! (A minimal sketch of these steps follows below.)
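A minimal sketch of this recipe (the dataset, the model choice and the split sizes are assumptions chosen for illustration, not the course's exercise):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a supervised classification problem with numeric features (example dataset).
X, y = load_breast_cancer(return_X_y=True)

# Step 5: build a train and a test set (a separate validation set is skipped here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Steps 3-6: pick a technique (scaling + logistic regression) and train it.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Steps 7-9: evaluate on the training set (underfit check) and the test set (overfit check).
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))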


SKLEARN MODELS OVERVIEW

SUPERVISED
Linear regression
Polynomial regression
Ridge regression
Logistic regression
Random forest
Support Vector Machine (see next slide)

UNSUPERVISED
Clustering
PCA
What is SVM?

Abbreviation? S(upport) V(ector) M(achine)

What is it used for? Classification/clustering algorithm

The options, depending on your kernel:
'Simple' algorithm (linear)
Complex algorithm (non-linear)

How does it work?
Transform the (non-linear) data into higher dimensions
Find the best hyperplane borders
Transform back

http://efavdb.com/svm-classification/
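A minimal sketch of a non-linear SVM in scikit-learn (the toy dataset and kernel settings are assumptions made for illustration):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A toy dataset that no straight line can separate.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space,
# finds the best separating hyperplane there, and projects the decision
# boundary back into the original space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))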
SKLEARN VALIDATIONS OVERVIEW

METHODS

1. Simple procedure: random training and test set
2. Advanced procedure: stratified training and test set
3. Even more advanced procedure: k-fold cross-validation
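A minimal sketch contrasting methods 1 and 2 (the dataset is an assumption; stratify=y keeps the class proportions identical in training and test sets):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Method 1: simple random training/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Method 2: stratified split, preserving the class balance in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)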
What is K Fold Cross Validation?

What is it?
Cross Validation is a very useful technique for assessing the performance of your models. It helps in
knowing how the model would generalize to an independent data set.

When should you use this?

You want to use this technique to estimate how accurate your model's predictions will be in practice.

[Diagram: all data is divided into k folds (here k = 4); in each iteration a different fold serves as the test data while the remaining folds serve as the training data.]
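A minimal sketch of k-fold cross-validation in scikit-learn (the dataset, the model and k = 5 are assumptions chosen for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each of the k = 5 folds is used once as the test set; the other folds train the model.
scores = cross_val_score(model, X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())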
ADVANCED MODULE HIGHLIGHTS
Feed your curiosity

Get inspired by the more advanced features of Data Science and use this knowledge to keep on
challenging yourself to learn more. Your basics are on point, now it's up to you!
Test your Data Science model

Test how well (or badly) your model will perform in practice by means of our stress test.

Two objectives:

Primary objective: the model is able to handle issues automatically and keep functioning.
Secondary objective: the model warns you when it cannot handle things on its own anymore.
Optimizing your neural network

Loss/cost function has a slope? The gradient (derivative)

Step down the slope:
Change w1
Change w2

Change w1 again?
Only way is up
--> Optimized!
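A minimal sketch of this idea (plain gradient descent on a made-up loss with two weights; the loss function, learning rate and starting point are assumptions for illustration):

import numpy as np

# A made-up convex loss over two weights: L(w1, w2) = (w1 - 3)^2 + (w2 + 1)^2
def loss(w):
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2

def gradient(w):
    # The slope of the loss with respect to each weight.
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.array([0.0, 0.0])        # starting point
learning_rate = 0.1

for step in range(100):
    w = w - learning_rate * gradient(w)   # step down the slope

# Approaches (3, -1): at that point any further change to w1 or w2 only increases the loss.
print("optimized weights:", w)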
NLP: The Good & The Bad

Very Good
Spam
Recognizing parts of sentences
Verbs, nouns, …
Locations
Names

Moderate
Translating
Sentiment analysis
Extracting "unexpected" components (à la Siri)
Recognizing plagiarism (when paraphrasing)

Painstakingly Bad
Interpreting complex speech
Having a true conversation (Turing test)
CURIOUS ABOUT
THE TRAININGS?
Download our training overview here

1. The Machine Learning Academy for programmers


2. Data Science Course for Analysts
3. Data Science for Business
