Ethnotech - Data Science With Python

ETHNOTECH ACADEMY
CERTIFICATE FORMAT
ETHNOTECH ACADEMY
BENEFITS OF THE PROGRAM
ETHNOTECH ACADEMY 3
COURSE OUTLINE
L1 Introduction                       L5 Regression
L2 Foundation-Panda                   L6 Classification
L3 Foundation-Numpy                   L7 Clustering
L4 Foundation-Descriptive Analysis    L8 Text Analytics
ETHNOTECH ACADEMY
EXIT PROFILE
• Financial Analyst
• Data Scientist
• Data Engineer
• Data Base Admin
• Data Journalist
• Business Analyst
• Big Data Analyst
ETHNOTECH ACADEMY
SESSION 1
Introduction
• Introduction to data science
• Introduction to Python
• Introduction to AI-ML
ETHNOTECH ACADEMY
Introduction of data science
ETHNOTECH ACADEMY
What is Data Science?
• Data Science is a combination of multiple disciplines that
uses statistics, data analysis, and machine learning to
analyze data and to extract knowledge and insights from it.
• Data Science is about data gathering, analysis and
decision-making.
• Data Science is about finding patterns in data through
analysis and making future predictions.
ETHNOTECH ACADEMY
Introduction Contd…
Applications of Data Science:
• To find the best suited time to deliver goods.
• To analyze the health benefit of training.
• To predict who will win elections.
ETHNOTECH ACADEMY
What is Data?
• Data is a collection of information.
• One purpose of Data Science is to structure data, making it
interpretable and easy to work with.
Data can be categorized into two groups:

• Structured data
• Unstructured data
ETHNOTECH ACADEMY
Structured Data
• Structured data is organized and easier to work with.
• We can use an array or a database table to
structure or present data.
Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
ETHNOTECH ACADEMY
Unstructured Data
• Unstructured data is not organized. We must organize the
data for analysis purposes.
ETHNOTECH ACADEMY
Database Table
• A database table is a table with structured data.
• The following table shows
a database table with
health data extracted
from a sports watch:
ETHNOTECH ACADEMY
Database Table Structure
        Column 1   Column 2        Column 3    Column 4          Column 5     Column 6
        Duration   Average_Pulse   Max_Pulse   Calorie_Burnage   Hours_Work   Hours_Sleep
Row 1   30         80              120         240               10           7
Row 2   30         85              120         250               10           7
Row 3   45         90              130         260               8            7
Row 4   45         95              130         270               8            7
Row 5   45         100             140         280               0            7
Row 6   60         105             140         290               7            8
Row 7   60         110             145         300               7            8
Row 8   60         115             145         310               8            8
ETHNOTECH ACADEMY
Variables
• A variable is defined as something that can be measured or
counted.
• Examples can be characters, numbers or time.
• There are 4 columns, meaning that there are 4 variables
(Duration, Average_Pulse, Max_Pulse, Calorie_Burnage).
Duration   Average_Pulse   Max_Pulse   Calorie_Burnage
30         80              120         240
30         85              120         250
45         90              130         260
45         95              130         270
45         100             140         280
60         105             140         290
60         110             145
ETHNOTECH ACADEMY
Introduction to AI-ML
What is Artificial Intelligence?
• AI is one of the most fascinating and universal fields of computer
science, with great scope in the future. AI aims to make a
machine work as a human does.
• Artificial Intelligence is composed of two words, Artificial and
Intelligence, where Artificial means "man-made" and
Intelligence means "thinking power"; hence AI means "a man-made
thinking power."
ETHNOTECH ACADEMY
Introduction to AI-ML Contd…
• "It is a branch of computer science by which we can create
intelligent machines which can behave like a human, think
like humans, and are able to make decisions."
ETHNOTECH ACADEMY
Why Artificial Intelligence?
1. With the help of AI, you can create software or devices that
solve real-world problems easily and accurately, such as
health issues, marketing, traffic issues, etc.
2. With the help of AI, you can create your own personal virtual assistant,
such as Cortana, Google Assistant, Siri, etc.
3. With the help of AI, you can build robots that can work in an
environment where the survival of humans would be at risk.
4. AI opens a path for other new technologies, new devices,
and new opportunities.
ETHNOTECH ACADEMY
Goals of Artificial Intelligence
ETHNOTECH ACADEMY
Pros And Cons of Artificial Intelligence
ETHNOTECH ACADEMY
What is Machine Learning
• In the real world, we are surrounded by humans who can
learn everything from their experiences with their learning
capability, and we have computers or machines which work
on our instructions. But can a machine also learn from
experiences or past data like a human does?
ETHNOTECH ACADEMY
Machine Learning Contd…
ETHNOTECH ACADEMY
Machine Learning Contd…
• Machine Learning is a subset of artificial
intelligence that is mainly concerned with the development
of algorithms which allow a computer to learn from
data and past experiences on its own. The term machine
learning was first introduced by Arthur Samuel in 1959.
• Machine learning enables a machine to automatically learn
from data, improve performance from experience, and
predict things without being explicitly programmed.
ETHNOTECH ACADEMY
Features of Machine Learning:
1. Machine learning uses data to detect various patterns in a
given dataset.
2. It can learn from past data and improve automatically.
3. It is a data-driven technology.
4. Machine learning is very similar to data mining, as it also
deals with huge amounts of data.
ETHNOTECH ACADEMY
Classification of Machine Learning
ETHNOTECH ACADEMY
Supervised Learning
• Supervised learning is a type of machine learning method
in which we provide sample labelled data to the machine
learning system in order to train it, and on that basis, it
predicts the output.
Supervised learning can be grouped further into two categories
of algorithms:
1. Classification
2. Regression
ETHNOTECH ACADEMY
Supervised Learning Contd…
ETHNOTECH ACADEMY
Unsupervised Learning
• The machine is trained with a set of data
that has not been labeled, classified, or categorized, and the
algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input
data into new features or groups of objects with similar
patterns.
It can be further classified into two categories of algorithms:
1. Clustering
2. Association
ETHNOTECH ACADEMY
Unsupervised Learning Contd…
ETHNOTECH ACADEMY
Reinforcement Learning
• Reinforcement learning is a feedback-based learning

method, in which a learning agent gets a reward for each
right action and gets a penalty for each wrong action.
• The agent learns automatically with these feedbacks and

improves its performance. In reinforcement learning, the
agent interacts with the environment and explores it. The
goal of an agent is to get the most reward points, and hence,
it improves its performance.
ETHNOTECH ACADEMY
Difference Between Supervised and
Unsupervised Learning
ETHNOTECH ACADEMY
Difference b/w Supervised, Unsupervised
& Semi Supervised Learning
ETHNOTECH ACADEMY
Introduction to Python
Why Programming?
• There are more than 700 languages available in today's
programming world.
• Every language is designed to fulfil a particular
requirement.
• Programming lets us communicate with digital machines and
make them work accordingly.
ETHNOTECH ACADEMY
What is Python ?
• Python is a widely used, powerful, general-purpose,
high-level programming language.
• The Python ecosystem provides over 137,000 libraries. Libraries
are sets of useful functions that eliminate the need for
writing code from scratch.
ETHNOTECH ACADEMY
Applications
ETHNOTECH ACADEMY
Web application in Python
• Python can be used to create web applications.
• There are Python frameworks like Django, Flask and Pyramid
for this purpose.
ETHNOTECH ACADEMY
Data Analysis in Python
• Python is the leading language of choice for many data
scientists.
• It has grown in popularity due to excellent libraries like:
Numpy
Pandas
Matplotlib
• A data scientist is a professional responsible for collecting,
analyzing and interpreting extremely large amounts of data.
ETHNOTECH ACADEMY
Machine Learning in Python
• Machine learning is about making predictions with data.
• It is mainly used in
Face Recognition,
Music recommendation,
Medical Data etc
ETHNOTECH ACADEMY
Raspberry Pi in Python
• We can build home automation systems and even robots using
a Raspberry Pi.
• The coding on a Raspberry Pi can be performed using Python.
ETHNOTECH ACADEMY
Game Development in Python
• We can write whole games in Python using PyGame.
• Popular games built with or scripted in Python are:
Bridge Commander
Civilization IV
Battlefield 2
Eve Online
Freedom Force
ETHNOTECH ACADEMY
Who created Python?
• Developed by Guido van Rossum, a Dutch programmer,
and first released on February 20, 1991.
• The name Python is inspired by Guido's favourite
comedy TV show, "Monty Python's Flying Circus".
ETHNOTECH ACADEMY
Features of Python
• Easy-to-learn: Python has few keywords, a simple structure, and a
clearly defined syntax. Python code is typically 3 to 5 times
shorter than the equivalent C/C++/Java code.
In C:
#include <stdio.h>
int main()
{
    printf("Hello World!");
}
In Python:
print("Hello World!")
ETHNOTECH ACADEMY
Features of Python
Python uses both a compiler and an interpreter to convert and run
our source code: the source is first compiled to bytecode, which
the interpreter then executes.
• Interpret: To execute a program in a high-level language by
translating it one line at a time.
• Compile: To translate a program written in a high-level language
all at once, in preparation for later execution.
• Portability: Python can run on a wide variety of hardware
platforms and has the same interface on all platforms.
ETHNOTECH ACADEMY
Python Data Types
• Numbers
• String
• Bool
• List
• Tuple
• Dictionary
• Set
ETHNOTECH ACADEMY
Bool
• The Boolean data type stores one of two values: True or False.
• All comparison operators evaluate to True or False.
• The three Boolean operators (and, or, and not) combine
Boolean values.
• Like comparison operators, they evaluate these expressions down to a
Boolean value.
• After any math and comparison operators evaluate, Python evaluates the
not operators first, then the and operators, and then the or operators.
>>> 42 == 42
True
>>> 2 != 3
True
>>> 2 != 2
False
>>> 42 == 99
False
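The evaluation order described above can be checked with a small sketch. The expression below is an illustrative example, not taken from the slides; the fully parenthesised form shows how Python groups it.

```python
# Comparison operators always evaluate to a bool.
print(42 == 42)   # True
print(2 != 2)     # False

# Comparisons are evaluated first, then `not`, then `and`, then `or`.
result = not 2 > 3 and 4 == 4 or False

# The same expression with the grouping written out explicitly.
explicit = ((not (2 > 3)) and (4 == 4)) or False

print(result, explicit)
```

Writing the parentheses out is often the clearer choice in real code, even when the precedence rules make them optional.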
ETHNOTECH ACADEMY
Numbers
Int (signed integers): They are positive or negative whole numbers with no
decimal point.
Long (long integers): In Python 2, integers of unlimited size, written like
integers and followed by an uppercase or lowercase L. Python 3 has no separate
long type; int itself has unlimited size.
Float (floating point real values): They represent real numbers and are written
with a decimal point. Floats may also be in scientific notation, with E or e
indicating a power of 10.
Example: 2.5e2 = 2.5 x 10^2 = 2.5 x 100 = 250
Complex (complex numbers): They are written in the form a+bj. The real part of
the number is a and the imaginary part is b.
The letter j should appear only as a suffix, not as a prefix.
Example: 3+5j
ETHNOTECH ACADEMY
String
• Strings are a collection of characters. A string can group
any type of known characters, i.e. letters, numbers and
special characters. They are enclosed in single quotes,
double quotes or triple quotes; a raw string adds an r prefix.
Example: 'Hi' , "hello" , '1234'
Example:
S1 = 'Mango'
S2 = "Hello"
S3 = "Hey, 'Good' Morning"
S4 = 'Hey, "Good" Morning'
S5 = "Hey, \"Good\" Morning"
print(S3)
print(S4)
print(S5)
ETHNOTECH ACADEMY
List
• A list is a container that holds many objects under a single
name.
• A list can be written as comma-separated values (items)
between square brackets.
• Lists have indexes, the same as strings.
• Lists can be nested just like arrays, i.e., you can have a list of
lists.
• Lists are mutable.
Syntax:
List_name = [item1 , item2 , item3]
List_name = []
List_name[index]
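A minimal sketch of the list properties listed above (indexing, nesting, mutability). The variable names and values are illustrative assumptions.

```python
# A list holds several objects under one name.
fruits = ["apple", "banana", "cherry"]

first = fruits[0]     # indexing starts at 0, as with strings
last = fruits[-1]     # negative indexes count from the end

# Lists are mutable: items can be replaced and appended in place.
fruits[1] = "mango"
fruits.append("kiwi")

# Lists can be nested: a list of lists.
matrix = [[1, 2], [3, 4]]
cell = matrix[1][0]   # second row, first column

print(fruits, first, last, cell)
```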
ETHNOTECH ACADEMY
Tuple
• A tuple is another sequence data type that is similar to the
list.
• It is a collection of elements separated by commas and
enclosed in parentheses ().
• It is immutable.
• Duplicate values can be present.
ex: T1 = (1, 2, 3, 1, 2, 3)
• It can hold heterogeneous data types.
ex: T2 = (1, 3.0, 'm', "hello")
ETHNOTECH ACADEMY
Dictionary
• Dictionaries are enclosed by curly braces ( { } ), and values
can be assigned and accessed using square brackets ( [] ).
• Dictionaries are changeable and indexed by key. Since Python 3.7
they preserve insertion order; earlier versions were unordered.
• A dictionary is a collection of key-value pairs.
Dictionary_name = {key1 : val1 , key2 : val2}
• Keys are used as indexes and must be unique, but the values
stored under the keys can be duplicated.
ETHNOTECH ACADEMY
Dictionary Contd…
Example:
>>> camera = {'sony':200, 'nikon': 200}

>>> camera.update({'canon':500})
>>> print(camera)
Output: {'sony': 200, 'nikon': 200, 'canon': 500}
ETHNOTECH ACADEMY
Sets
• A set is an unordered collection with no duplicate elements.
• Basic uses include eliminating duplicate entries.
• Set object does not support indexing.
• Set objects also support mathematical operations like union,

intersection, difference, and symmetric difference.
• Curly braces or the built-in set() function can be used to create sets.
ETHNOTECH ACADEMY
Sets Contd…
• A set is mutable, but may not contain mutable items like a
list, set, or even a dictionary.
• A set may contain values of different types.
Examples:
x = {12,3,4,45}
y = {2,4,6,78}
x.union(y)
{2, 3, 4, 6, 12, 45, 78}
ETHNOTECH ACADEMY
Sets Contd…
x.intersection(y)
{4}
x.difference(y) # elements present in x but not in y
{3, 12, 45}
y.difference(x) # elements present in y but not in x
{2, 6, 78}
x.symmetric_difference(y) # elements present in exactly one of x and y
{2, 3, 6, 12, 45, 78}
ETHNOTECH ACADEMY
Q&A
1. __________ is a combination of multiple disciplines that uses

statistics, data analysis, and machine learning to extract knowledge
and insights from it.
a. Data Science
b. Machine Learning
c. Data Analytics
d. Data Mining
• Data Science
ETHNOTECH ACADEMY
Q&A
2. Which of the following is not a category of Data?

a. Structured
b. Semi – Structured
c. Unstructured
d. Meta – Structured
• Meta – Structured
ETHNOTECH ACADEMY
Q&A
• A professional who is responsible for collecting, analyzing and
interpreting extremely large amounts of data.
a. Data Scientist
b. Data Analyst
c. AI Developer
d. System Architect
• Data Scientist
ETHNOTECH ACADEMY
Q&A
• Python was invented by Guido van Rossum and it was released in

the year _______
a. 1990
b. 1991
c. 1995
d. 2000
• 1991
ETHNOTECH ACADEMY
Q&A
• Which of the following is not a Boolean operator?

a. And
b. X-OR
c. OR
d. NOT
• X-OR
ETHNOTECH ACADEMY
Q&A
• _________ is a container that holds many objects under a single

name.
a. List
b. Strings
c. Dictionary
d. Sets
• List
ETHNOTECH ACADEMY
Q&A
• State whether the following statement is true or false:
"Tuple is mutable in nature"
a. True
b. False
• False
ETHNOTECH ACADEMY
Q&A
• In dictionaries _______ can be used as indexes and are unique

a. Value
b. Elements
c. Key
d. Items
• Key
ETHNOTECH ACADEMY
Q&A
• Which of the following is not an example of sequence data type?

a. Float
b. Strings
c. List
d. Tuple
• Float
ETHNOTECH ACADEMY
• The process of executing a program in a high-level language by
translating it one line at a time is called _______
a. Interpretation
b. Compilation
c. Recursion
d. Member function
• Interpretation
ETHNOTECH ACADEMY
Question 1
1. __________ is a feedback-based learning method, in which a

learning agent gets a reward for each right action and gets
a penalty for each wrong action
a) Supervised
b) Unsupervised
c) Reinforcement
d) NLP
Answer: Reinforcement
ETHNOTECH ACADEMY
Question 2
2. State whether the following statement is true or false:
'Machine Learning is a data-driven technology'
Answer: True
ETHNOTECH ACADEMY
Question 3
3. ______ datatype is used to store 2 values only

A. Number
B. Tuple
C. Dictionary
D. Boolean
Answer: Boolean
ETHNOTECH ACADEMY
SUMMARY
• Fundamentals of Data science and AI-ML
• Basics of Python Programming
• Usage of Various datatypes in python
ETHNOTECH ACADEMY
Session 2
Foundation-Panda
• Header
• Panda read csv
• datatype and statistics
• Panda column operations
• Panda operations
• Merge and concat
• Graphs
ETHNOTECH ACADEMY
Foundation Panda
Introduction to Pandas
• Pandas is an open source Python library for highly specialized
data analysis.
• It is currently the reference library for professionals who use
Python for statistical analysis and decision making.
• The library was designed and developed primarily by Wes
McKinney starting in 2008. In 2012, Chang She, one of his
colleagues, joined the development.
• Together they set up one of the most used libraries in the Python
community.
ETHNOTECH ACADEMY
Foundation Panda Contd…
• Pandas arises from the need to have a specific library to
analyze data that provides, in the simplest possible way, all
the instruments for data processing, data extraction, and
data manipulation
ETHNOTECH ACADEMY
Feature Of Pandas
ETHNOTECH ACADEMY
Header
• A header stores the names or headings for each
of the columns.
• It helps the user to identify the role of the
respective column in the data frame.
• The top row containing column names is called the header
row of the data frame.
• There are two approaches to adding a header row in
Python when the original data frame doesn't have one.
ETHNOTECH ACADEMY
Creating a data frame from CSV file
and creating row header
• While reading the data and storing it in a data frame, or
creating a fresh data frame, column names can be
specified by using the names parameter of the pandas read_csv()
function.
• The names parameter takes an array of names, one for each of the
columns of the data frame, in order. The length of the array
must equal the number of columns in the data frame.
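A minimal sketch of the names approach described above. The CSV content is a made-up stand-in for the course dataset, and io.StringIO is used in place of a file on disk so the example is self-contained; in practice a file path would be passed to read_csv().

```python
import io

import pandas as pd

# Hypothetical CSV text with no header row (three columns of numbers).
csv_text = "30,80,120\n45,90,130\n60,105,140\n"

# Approach 1: supply the column names while reading.
df = pd.read_csv(
    io.StringIO(csv_text),
    names=["Duration", "Average_Pulse", "Max_Pulse"],
)

# Approach 2: read with header=None, then assign the names afterwards.
df_late = pd.read_csv(io.StringIO(csv_text), header=None)
df_late.columns = ["Duration", "Average_Pulse", "Max_Pulse"]

print(df)
```

Both approaches give the same data frame; the first is shorter, while the second is handy when the names are only known after inspecting the data.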
ETHNOTECH ACADEMY
Dataset
• https://shorturl.at/vyEJL
ETHNOTECH ACADEMY
Code Snippet
ETHNOTECH ACADEMY
Output
Note: We can also pass header=None as an argument
to read_csv() and later give names to the
columns explicitly when desired.
ETHNOTECH ACADEMY
Creating a data frame and creating
row header in Python itself
• We can create a data frame with a specific number of rows and
columns by first creating a multi-dimensional array and then
converting it into a data frame with
the pandas.DataFrame() constructor.
• The columns argument is used to specify the header row, i.e.
the column names. It takes an array of column names
with its length equal to the number of columns in the data
frame.
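As a sketch of this second approach, the array values and column names below are illustrative assumptions borrowed from the sports-watch table earlier in the deck.

```python
import numpy as np
import pandas as pd

# A 2 x 3 array: two rows, three columns.
values = np.array([[30, 80, 120],
                   [45, 90, 130]])

# One column name per array column.
df = pd.DataFrame(values, columns=["Duration", "Average_Pulse", "Max_Pulse"])

print(df)
```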
ETHNOTECH ACADEMY
Code Snippet
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Panda column operations
1.Sort by a Column with Pandas
• Sorting your Pandas dataframe df by one or more columns
is done with df.sort_values(), in ascending or descending order.
• The first argument is the column of your dataframe
you would like to sort by. By default, the
parameter ascending is set to True, so you only
need to specify it if you would like sorting to be in
descending order.
ETHNOTECH ACADEMY
Panda column operations Contd…
• If you need to sort by multiple columns, change the
parameter values to a list of columns and a list of
corresponding sort orders (ascending/descending).
• For example, passing ['col_1', 'col_2'] with ascending=[True, False]
orders the dataframe df by col_1 in
ascending order and col_2 in descending order.
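A small sketch of single-column and multi-column sorting with sort_values(). The data frame contents are made up for illustration; col_1 and col_2 follow the naming used on the slide.

```python
import pandas as pd

df = pd.DataFrame({"col_1": [2, 1, 2, 1], "col_2": [10, 20, 30, 40]})

# Single column, descending order.
by_one = df.sort_values("col_1", ascending=False)

# Two columns: col_1 ascending, then col_2 descending within each col_1 value.
by_two = df.sort_values(["col_1", "col_2"], ascending=[True, False])

print(by_two)
```

sort_values() returns a new data frame and leaves df unchanged unless the result is assigned back.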
ETHNOTECH ACADEMY
2.Rename Columns in Pandas
Syntax
• df.rename(columns={'original_col_name':'new_col_name'})
• To rename multiple columns, add these updates to
the columns parameter dictionary.
• df.rename(columns={'original_col1_name':
'new_col1_name', 'original_col2_name': 'new_col2_name'})
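A minimal sketch of renaming columns; rename() is a method of the data frame itself. The column names here are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Map each old column name to its new name.
renamed = df.rename(columns={"a": "alpha", "b": "beta"})

print(list(renamed.columns))
```

Like most pandas operations, rename() returns a new data frame, so the result needs to be assigned to keep it.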
ETHNOTECH ACADEMY
3.Delete a Column of a Pandas Dataframe
• We can delete one or more columns of a Pandas dataframe

at any one time. First, let’s start with one.
• df.drop('column_name', axis=1)
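The sketch below shows dropping one column and then several at once. The data frame is an illustrative assumption; the columns= keyword is an equivalent spelling of axis=1.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Drop a single column (axis=1 means "columns").
without_b = df.drop("b", axis=1)

# Drop several columns by passing a list.
only_a = df.drop(columns=["b", "c"])

print(list(without_b.columns), list(only_a.columns))
```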
ETHNOTECH ACADEMY
4.Group By & Aggregate Columns with Pandas
• Grouping data with Pandas is one way to summarize your
data. This can be used as a basis for plotting charts or just to
provide insights. Here is how to use the
Pandas groupby method on one column, col_1, and count the
number of rows per group.
• df.groupby('col_1').count()
ETHNOTECH ACADEMY
• The groupby method can be used on any number of
columns and can aggregate each group in different ways. The
code below groups by two columns and aggregates
the remaining columns (assuming they are of numeric data types) by
both summing them and calculating their mean.
• df.groupby(['col_1', 'col_2']).agg(['sum', 'mean'])
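A short sketch of both groupby calls on a made-up data frame. The val column is an assumption added so there is something numeric to aggregate.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": ["x", "x", "y"],
    "col_2": ["p", "p", "q"],
    "val": [1, 3, 5],
})

# Number of rows in each col_1 group.
counts = df.groupby("col_1").count()

# Group by two columns; every remaining column gets both a sum and a mean.
summary = df.groupby(["col_1", "col_2"]).agg(["sum", "mean"])

print(counts)
print(summary)
```

The result of agg() with a list has two-level column labels, such as ('val', 'sum') and ('val', 'mean').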
ETHNOTECH ACADEMY
5.Apply a Function to a Column using Pandas
• One way of applying a function to all rows in a Pandas

dataframe column is using the apply method.
• df['col'].apply(function)
ETHNOTECH ACADEMY
• Above, a specific column, col, has been selected and a
function (generic in this case) has been applied to it. The function
applied can be built in, for example the numpy square root
function np.sqrt, or a user-defined function you have
specified, using a lambda function or otherwise.
• df['col'].apply(lambda x: x**2 + 5)
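Both kinds of function can be sketched on a tiny column. The values are chosen for illustration so the square roots come out whole.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1, 4, 9]})

# Built-in style: NumPy's square root applied to every row of the column.
roots = df["col"].apply(np.sqrt)

# User-defined: a lambda computing x squared plus 5.
shifted = df["col"].apply(lambda x: x**2 + 5)

print(roots.tolist(), shifted.tolist())
```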
ETHNOTECH ACADEMY
Panda operations Contd…
Types Of Operation
• Creating a data frame with pandas
• Read the top element chart:
• Read the Bottom element chart:
• Understanding the statistical information of the data:
• Writing a CSV file:
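The operations listed above can be sketched in one pass. The data is taken from the sports-watch table earlier in the deck; writing to an in-memory io.StringIO buffer instead of a file path is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

# 1. Create a data frame.
df = pd.DataFrame({
    "Duration": [30, 30, 45, 45, 45, 60, 60, 60],
    "Average_Pulse": [80, 85, 90, 95, 100, 105, 110, 115],
})

# 2. Read the top rows.
top = df.head(3)

# 3. Read the bottom rows.
bottom = df.tail(2)

# 4. Statistical summary (count, mean, std, min, quartiles, max).
stats = df.describe()

# 5. Write as CSV; a file name such as "out.csv" could be passed instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(top)
print(bottom)
print(stats)
```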
ETHNOTECH ACADEMY
1. Creating a data frame with pandas:
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
2. Reading a CSV File:
ETHNOTECH ACADEMY
3. Read the top element chart:
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
4. Read the Bottom element chart:
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
5. Understanding the statistical information

of the data:
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Merge and Concat
Dataframe
A dataframe is a two-dimensional data structure having
multiple rows and columns. In a dataframe, the data is
aligned in a tabular form of rows and columns. A dataframe
supports arithmetic as well as conditional operations, and its
size is mutable.
ETHNOTECH ACADEMY
Merge and Concat Contd…
Example:
ETHNOTECH ACADEMY
Output:
ETHNOTECH ACADEMY
DataFrames Merge:
Pandas provides a single function, merge(), as the
entry point for all standard database join operations
between DataFrame objects.
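A minimal sketch of three join types with merge(). The two data frames and the id key column are illustrative assumptions.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Anuj", "Rama", "Kapil"]})
right = pd.DataFrame({"id": [2, 3, 4], "age": [25, 22, 30]})

# Inner join (the default): only ids present in both frames.
inner = pd.merge(left, right, on="id")

# Outer join: every id from either frame, with NaN where data is missing.
outer = pd.merge(left, right, on="id", how="outer")

# Left join: every row of `left`, matched with `right` where possible.
left_join = pd.merge(left, right, on="id", how="left")

print(inner)
print(outer)
print(left_join)
```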
ETHNOTECH ACADEMY
Join Operations
ETHNOTECH ACADEMY
Example:
ETHNOTECH ACADEMY
Output:
ETHNOTECH ACADEMY
DataFrames Concat:
The concat() function does all of the heavy lifting of performing
concatenation operations along an axis while performing
optional set logic (union or intersection) of the indexes (if any)
on the other axes.
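The axis and set-logic options can be sketched as follows. The frames a and b are made up so that they share only one column.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5], "z": [6]})

# Stack rows; columns are the union of both frames (join="outer" is the default).
rows = pd.concat([a, b], ignore_index=True)

# Stack rows; keep only the columns common to both frames (intersection).
common = pd.concat([a, b], join="inner", ignore_index=True)

# Place the frames side by side along the column axis.
side = pd.concat([a, b], axis=1)

print(rows)
print(common)
print(side)
```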
ETHNOTECH ACADEMY
Output:
ETHNOTECH ACADEMY
DataFrames
A Pandas DataFrame is a 2 dimensional data structure, like a
2 dimensional array, or a table with rows and columns.
Example
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Named Indexes
• With the index argument, we can name our own indexes.
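A small sketch of naming the indexes. The column names, values and day labels are illustrative assumptions.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# Replace the default 0, 1, 2 labels with our own.
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Rows can now be looked up by label with loc.
row = df.loc["day2"]

print(df)
print(row)
```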
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Load Files Into a DataFrame
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Graphs
• Pandas uses the plot() method to create diagrams.
• We can use Pyplot, a submodule of the Matplotlib library, to
visualize the diagram on the screen.
• The matplotlib library offers different types of plots, which
help us make a graph suited to our needs. The same data can be
shown in many kinds of graph, and with the help of pandas we
can put the data in a dataframe before plotting it.
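A minimal plotting sketch, assuming matplotlib is installed alongside pandas. The data values are made up, and the "Agg" backend is selected only so the example runs without a display; in an interactive session that line can be dropped and plt.show() called instead.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; draw off-screen

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Duration": [30, 45, 60],
    "Calorie_Burnage": [240, 260, 290],
})

# Line plot of one column against another; plot() returns a Matplotlib Axes.
ax = df.plot(x="Duration", y="Calorie_Burnage", kind="line")
ax.set_xlabel("Duration")
ax.set_ylabel("Calorie_Burnage")

# Scatter plot of the same two columns.
scatter_ax = df.plot(kind="scatter", x="Duration", y="Calorie_Burnage")

# plt.show() would open the figures in an interactive session.
plt.close("all")
```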
ETHNOTECH ACADEMY
Basic plotting
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Plot of different data
Using more than one list of data in a plot.
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Plot on given axis
We can explicitly define the name of axis and plot the data on
the basis of this axis.
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Bar plot using matplotlib:
Find different types of bar plot to clearly understand the
behaviour of given data.
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Scatter plot:
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Documentation – Overview of Data Science
• Data Science Definition

• Brief history of Data Science
• Applications of Data Science
• Artificial Intelligence – History, Importance and Applications
• Fundamentals of Machine Learning
• Classification of Machine Learning
• Applications of Machine Learning
• Data Science in AI and ML
ETHNOTECH ACADEMY
Q&A
• The most prominent data structures of pandas is/are __________

A. Series
B. Data frame
C. Both a & b
D. Heap
• Both a & b
ETHNOTECH ACADEMY
Q&A
• Which of the following libraries in Python is used for plotting graphs
and visualization?
a. Pandas
b. Numpy
c. Matplotlib
d. SK learn
• Matplotlib
ETHNOTECH ACADEMY
Q&A
• Which of the following commands is used to install pandas?

a. pip install pandas
b. install pandas
c. pip pandas
d. none of the above
• pip install pandas
ETHNOTECH ACADEMY
Q&A
• Which of the following is an example of one-dimensional array?

a. Series
b. Data frame
c. Matrix
d. Both a &b
• Series
ETHNOTECH ACADEMY
Q&A
• A series by default have numeric data labels starting from

• 3
• 2
• 1
• 0
• 0
ETHNOTECH ACADEMY
Q&A
• The data label associated with a particular value of series is called

its _____
a. Data value
b. Index
c. Value
d. Array
• Index
ETHNOTECH ACADEMY
Q&A
• Which of the following statement is correct for importing pandas in

python
a. import pandas
b. import pandas as pd
c. import pandas as pds
d. all of the above
• all of the above
ETHNOTECH ACADEMY
Q&A
• Which attribute is used to give user defined labels in series?

a. index
b. data
c. values
d. none of these
• index
ETHNOTECH ACADEMY
Q&A
• Fill in the blank to get the output as 2

import pandas as pd
s1 = pd.Series([1,2,3,4], index = ['a','b','c','d'])
print(s1[____])
a. 'b'
b. 1
c. Both a & b
d. B
• Both a & b
ETHNOTECH ACADEMY
Q&A
• State the following statement is true or false. “A series object is size

mutable”
a. True
b. False
• False
ETHNOTECH ACADEMY
Q&A
• What response do we get when the following code is
executed?
import pandas as pd
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
print(s['f'])
A. KeyError
B. IndexError
C. ValueError
D. Semantic error
• KeyError
ETHNOTECH ACADEMY
Q&A
• Which of the following is the correct syntax for a pandas DataFrame?

A. Pandas.DataFrame(data, index, dtype, copy)
B. pandas.DataFrame( data, index, columns, dtype, copy)
C. pandas.DataFrame(data, index, dtype, copy)
D. pandas.DataFrame( data, index, rows, dtype, copy)
• pandas.DataFrame( data, index, columns, dtype, copy)
ETHNOTECH ACADEMY
Q&A
• Which of the following is / are not correct for accessing an individual item
from dataframe 'df'?
a. df.iat[2,2]
b. df.loc[2,2]
c. df.at[2,2]
d. df[0,0]
• df[0,0]
ETHNOTECH ACADEMY
Q&A
• What will be output of following code

import pandas as pd
data = [['Anuj',21],['Rama',25],['Kapil',22]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)
ETHNOTECH ACADEMY
Q&A
• What will be output of following code?

import numpy as np
array1=np.array([100,200,300,400,500,600,700])
print(array1[1:5:2])
A. [200 300]
B. [200 700]
C. [200 400]
D.[200 400]
• [200 400]
ETHNOTECH ACADEMY
SUMMARY
• Fundamentals of Pandas and its usage
• Columnar operations of Pandas
• Usage of the Pandas library in Data Science
• Applications of Graphs in Pandas
ETHNOTECH ACADEMY
Question 1
• State whether the following statement is true or false: 'Sorting your Pandas
dataframe df by one or more columns can be done either ascending
or descending'
• Answer: True
ETHNOTECH ACADEMY
Question 2
2. _________basically helps the user to identify the role of the

respective column in the data frame
A. Header
B. Column
C. Data
D. Array
Answer : Header
ETHNOTECH ACADEMY
Session 3
Foundation-Numpy
• One Dimension
• Two Dimension
• Two Dimension stacking
ETHNOTECH ACADEMY
Session 4
Foundation-Descriptive Analysis
• Data Dictionary
• Single numeric descriptive analysis
• Double numeric descriptive analysis
• Categorical and all Numeric Descriptive Analysis

ETHNOTECH ACADEMY
Foundation-Numpy
Introduction Of Numpy
• NumPy is an open-source library for working efficiently with
arrays. It was developed in 2005 by Travis Oliphant, and the name
stands for Numerical Python.
• It is a critical data science library in Python, and
many other libraries depend on it.
• NumPy is extremely popular because it
dramatically improves the ease and
performance of working with
multidimensional arrays.
ETHNOTECH ACADEMY
Advantages of Numpy
• It offers an indexing syntax for easily accessing portions of
data within an array.
• It contains built-in functions that improve quality of life
when working with arrays and math, such as functions for
linear algebra, array transformations, and matrix math.
• It requires fewer lines of code for most mathematical
operations than native Python lists.
ETHNOTECH ACADEMY
One-dimensional NumPy array
• A one-dimensional array contains elements in only one
dimension. In other words, the shape tuple of the NumPy array
contains only one value. Let us see how
to create one-dimensional NumPy arrays.
Method 1:
• First make a list, then pass it to numpy.array()
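Method 1 can be sketched in a few lines; the list values are illustrative.

```python
import numpy as np

# Start from an ordinary Python list.
values = [1, 2, 3, 4]

# Convert it into a one-dimensional NumPy array.
arr = np.array(values)

print(arr)
print(arr.shape)  # the shape tuple holds a single value for a 1-D array
```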
ETHNOTECH ACADEMY
One-dimensional NumPy array
Output
ETHNOTECH ACADEMY
Method 2
fromiter()
• fromiter() builds a one-dimensional array from any iterable. It is
useful for non-numeric sequences, although it can create any
type of array.
• Here we will convert a string into a NumPy array of
characters.
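A sketch of fromiter() on a string and on a generator. The dtype "U1" (a one-character Unicode string) is an assumption chosen so each array element holds a single character; the sample text is made up.

```python
import numpy as np

text = "numpy"

# Each character of the string becomes one array element.
chars = np.fromiter(text, dtype="U1")

# fromiter also works with numeric iterables such as generators.
squares = np.fromiter((n * n for n in range(5)), dtype=int)

print(chars)
print(squares)
```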
ETHNOTECH ACADEMY
Method 2 Contd…
Output
ETHNOTECH ACADEMY
Method 3
arange()
• It returns evenly spaced values within a given interval.
Output
ETHNOTECH ACADEMY
Method 4
• linspace() creates evenly spaced numerical elements between
two given limits.
Output
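Methods 3 and 4 side by side, as a minimal sketch with illustrative limits: arange() takes a step size, while linspace() takes the number of points and includes the end limit by default.

```python
import numpy as np

# arange(start, stop, step): the stop value itself is excluded.
steps = np.arange(0, 10, 2)

# linspace(start, stop, num): num evenly spaced points, stop included.
points = np.linspace(0, 1, 5)

print(steps)
print(points)
```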
ETHNOTECH ACADEMY
Numpy — Stacking Arrays
Joining two numpy arrays
• stack — Joins arrays with given axis element by element
• hstack — Extends horizontally
• vstack — Extends vertically
ETHNOTECH ACADEMY
Stack
• Joins arrays along a new axis, element by element
• All input arrays must have the same dimension/shape
• The axis parameter in stack gives the position of the new dimension
rather than a horizontal/vertical direction.
• If axis is 0, the arrays are joined along a new first dimension
• If axis is 1, they are joined along a new second dimension
• For input arrays of dimension n, the result has n + 1 dimensions,
so axis can range from 0 to n.
• If axis is greater than n, an "out of bounds for array of
dimension" exception will be thrown
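A minimal sketch of stack() on two one-dimensional arrays (n = 1), including the out-of-bounds case. The arrays are illustrative; the error is caught as ValueError, which the axis error raised by NumPy derives from.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# New first dimension: each input becomes a row.
s0 = np.stack((a, b), axis=0)

# New second dimension: each input becomes a column.
s1 = np.stack((a, b), axis=1)

# The result has only 2 dimensions, so axis=2 is out of bounds.
error_raised = False
try:
    np.stack((a, b), axis=2)
except ValueError as exc:
    error_raised = True
    print("stack failed:", exc)

print(s0)
print(s1)
```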
ETHNOTECH ACADEMY
Stack Contd…
ETHNOTECH ACADEMY
Stack by First Dimension
ETHNOTECH ACADEMY
Stack by Second Dimension
ETHNOTECH ACADEMY
Stack by First Dimension
ETHNOTECH ACADEMY
Stack by Second Dimension
ETHNOTECH ACADEMY
HStack
Stacks horizontally
• This function takes no axis argument. It extends the first array
by the second array horizontally.
• As it extends horizontally, two-dimensional arrays must have the same
number of rows, otherwise a ValueError is raised.
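A short hstack() sketch for the 1D and 2D cases, with made-up arrays. One-dimensional inputs are simply joined end to end, so their lengths may differ.

```python
import numpy as np

# 1-D arrays: joined end to end.
a1 = np.array([1, 2, 3])
b1 = np.array([4, 5])
h1 = np.hstack((a1, b1))

# 2-D arrays: both have 2 rows, so columns are appended side by side.
a2 = np.array([[1, 2], [3, 4]])
b2 = np.array([[5], [6]])
h2 = np.hstack((a2, b2))

print(h1)
print(h2)
```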
ETHNOTECH ACADEMY
HStack for 1D Arrays
ETHNOTECH ACADEMY
HStack for 2D Arrays
ETHNOTECH ACADEMY
Vstack
VStack — Stacks vertically
• This function takes no axis argument. It extends the first array
by the second array vertically.
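A matching vstack() sketch with illustrative arrays. One-dimensional inputs are treated as rows, and two-dimensional inputs need the same number of columns.

```python
import numpy as np

# 1-D arrays of equal length: each becomes a row of the result.
v1 = np.vstack((np.array([1, 2, 3]), np.array([4, 5, 6])))

# 2-D arrays: both have 2 columns, so rows are appended underneath.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
v2 = np.vstack((a, b))

print(v1)
print(v2)
```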
ETHNOTECH ACADEMY
VStack by 2D Arrays
ETHNOTECH ACADEMY
Foundation-Descriptive Analysis
Introduction
• Descriptive statistics describe the basic and important
features of data.
• Descriptive statistics help simplify and summarize large
amounts of data in a sensible manner.
• Descriptive statistics involve evaluating measures of
center (centrality measures) and measures of
dispersion (spread).
ETHNOTECH ACADEMY
Foundation-Descriptive Analysis Contd…
ETHNOTECH ACADEMY
Centrality measures
• Centrality measures give us an estimate of the center of a
distribution.
• They give us a sense of a typical value we would expect to see.
The three major measures of center are the
1. Mean
2. Median
3. Mode
ETHNOTECH ACADEMY
Mean
• The mean is the average of the given values.
• To compute the mean, sum all the values and divide the sum by
the number of values.
• Consider a class of 7 students. Suppose they appear for
midterm exams and score the following marks out of 20:
ETHNOTECH ACADEMY
Mean Contd…
Name        Subject 1   Subject 2   Subject 3   Subject 4   Total Marks
Student 1   10          8           15          11.5        44.5
Student 2   14          9           7.5         11          41.5
Student 3   11          17          11.5        9           48.5
Student 4   7           14.5        10          12          43.5
Student 5   9.5         12          10.5        14          46
Student 6   15          18          7           12          52
Student 7   19          15.5        11          7.5         53
ETHNOTECH ACADEMY
Mean with python
Several Python libraries, such as pandas, numpy, and
statistics, support mean calculation.
The syntax is:
numpy.mean(arr, axis = None) : Compute the arithmetic
mean (average) of the given data (array elements) along the
specified axis.
ETHNOTECH ACADEMY
Parameters
• arr : [array_like] Input array.
• axis : [int or tuple of ints, optional] Axis along which to calculate
the arithmetic mean. If omitted (None), arr is treated as
flattened, so the mean is taken over all elements. For a 2D array,
axis = 0 gives one mean per column and axis = 1 gives one mean per row.
• out : [ndarray, optional] Alternative array in which to
place the result. It must have the same shape as the
expected output.
• dtype : [data-type, optional] Type to use while computing the
mean.
• Returns : Arithmetic mean of the array (a scalar value if axis is
None) or an array with the mean values along the specified axis.
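A minimal sketch of the three axis settings, using the marks table from the Mean slide (rows are students, columns are subjects). The variable names are chosen for this illustration only.

```python
import numpy as np

# Rows are Students 1-7, columns are Subjects 1-4.
marks = np.array([
    [10,  8,    15,   11.5],
    [14,  9,    7.5,  11],
    [11,  17,   11.5, 9],
    [7,   14.5, 10,   12],
    [9.5, 12,   10.5, 14],
    [15,  18,   7,    12],
    [19,  15.5, 11,   7.5],
])

overall_mean = np.mean(marks)            # axis=None: mean of all 28 marks
subject_means = np.mean(marks, axis=0)   # one mean per column (subject)
student_means = np.mean(marks, axis=1)   # one mean per row (student)

print(overall_mean)   # 11.75
print(subject_means)
print(student_means)
```

Multiplying each student mean by 4 gives back the Total Marks column, which is a quick way to check that axis = 1 really works along the rows.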
ETHNOTECH ACADEMY
Example 1
Output:
ETHNOTECH ACADEMY
Example 2
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Median
• The median is the value with the upper half of the data above it
and the lower half below it. In other words, it is the middle
value of a data set.
• To calculate the median, arrange the data points in
increasing order; the middle value is the median.
• It is easy to find the middle value if there is an odd
number of data points. Say we want to find the median of the
marks of all students for Subject 1.
• When the marks are arranged in increasing order, we get
{7, 9.5, 10, 11, 14, 15, 19}. Clearly, the middle value is 11;
therefore, the median is 11.
ETHNOTECH ACADEMY
Median Contd…
• If Student 7 did not write the exam, we will have marks as
{7,9.5,10,11,14,15}. This time there is no clear middle value.
• Then, take the mean of the third and fourth values, which is
(10+11)/2=10.5, so the median in this case is 10.5.
ETHNOTECH ACADEMY
Median in Python
• numpy.median(arr, axis = None) : Compute the median of
the given data (array elements) along the specified axis.
How to calculate the median?
• Take the given data points.
• Arrange them in ascending order.
• Median = the middle term if the total number of terms is odd.
• Median = the average of the two middle terms if the total number of
terms is even.
ETHNOTECH ACADEMY
Parameters
• arr : [array_like] Input array.
• axis : [int or tuple of ints, optional] Axis along which to calculate
the median. If omitted (None), arr is treated as flattened, so the
median is taken over all elements. For a 2D array, axis = 0 gives one
median per column and axis = 1 gives one median per row.
• out : [ndarray, optional] Alternative array in which to
place the result. It must have the same shape as the
expected output.
• overwrite_input : [bool, optional] If True, allows the input array
to be modified during the computation to save memory. Unlike
numpy.mean, numpy.median has no dtype parameter.
• Returns : Median of the array (a scalar value if axis is None) or an
array with the median values along the specified axis.
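The Subject 1 example from the Median slides can be reproduced with a short sketch. The small 2D array at the end is an illustrative addition to show the axis argument.

```python
import numpy as np

# Subject 1 marks for Students 1-7, in table order (np.median sorts internally).
subject1 = np.array([10, 14, 11, 7, 9.5, 15, 19])

median_odd = np.median(subject1)         # 7 values: the middle one
median_even = np.median(subject1[:-1])   # Student 7 left out: mean of the two middle values

print(median_odd)    # 11.0
print(median_even)   # 10.5

# Illustrative 2-D case: one median per column and one per row.
grid = np.array([[1, 3, 5],
                 [2, 4, 60]])
col_medians = np.median(grid, axis=0)    # [ 1.5  3.5 32.5]
row_medians = np.median(grid, axis=1)    # [3. 4.]
```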
ETHNOTECH ACADEMY
Example 1
Output:
ETHNOTECH ACADEMY
Example 2
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Mode
• The mode is the value that occurs the greatest number of times in our data
set.
• Suppose there are 15 students appearing for an exam and
following is the result:
Note : NumPy itself does not provide a mode() function. In Python, we
can create an array using the numpy package and then find the mode with
scipy.stats.mode(), with statistics.mode() from the standard library, or by
counting values with numpy.unique(). Let us see examples for better understanding.
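The slide's result table and examples are not reproduced in this text, so the sketch below uses hypothetical marks for 15 students. It shows two ways to get the mode without assuming any particular third-party function.

```python
import statistics

import numpy as np

# Hypothetical marks for 15 students (not the data from the original slide).
marks = np.array([12, 15, 11, 15, 18, 12, 15, 9, 14, 15, 12, 17, 15, 11, 14])

# Option 1: the standard-library statistics module.
mode_stdlib = statistics.mode(marks.tolist())

# Option 2: NumPy only, by counting how often each distinct value occurs.
values, counts = np.unique(marks, return_counts=True)
mode_numpy = values[np.argmax(counts)]

print(mode_stdlib, mode_numpy)   # 15 15
```

If several values tie for the highest count, np.argmax returns the smallest of them, while statistics.multimode() would return all of them.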
ETHNOTECH ACADEMY
Example1
Output:
ETHNOTECH ACADEMY
Example 2
Output:
ETHNOTECH ACADEMY
Q&A
• The NumPy package is capable of fast operations on arrays.
A. True
B. False
• True
ETHNOTECH ACADEMY
Q&A
• NumPy arrays can be ___

a. Indexed
b. Sliced
c. Iterated
d. All of the mentioned above
• All of the mentioned above
ETHNOTECH ACADEMY
Q&A
NumPy is often used along with packages like?
a. Node.js
b. Matplotlib
c. SciPy
d. Both B and C
• Both B and C
ETHNOTECH ACADEMY
Q&A
The most important object defined in NumPy is an N-dimensional array type
called?
a. ndarray
b. narray
c. nd_array
d. darray
• ndarray
ETHNOTECH ACADEMY
Q&A
How to convert numpy array to list?
a. array.list()
b. array.list
c. list.array()
d. list(array)
• list(array)
ETHNOTECH ACADEMY
Q&A
Which of the following commands is used to install numpy on a system with
python3?
a. pip numpy install python3
b. pip3 install numpy
c. pip install numpy
d. python3 pip3 numpy install
• pip3 install numpy
ETHNOTECH ACADEMY
Q&A
What does the size attribute of a NumPy array give?
a. shape
b. date & time
c. objects
d. number of items
• number of items
ETHNOTECH ACADEMY
Q&A
Is the following syntax correct for importing the numpy module?
fetch numpy as np
np.array(list)
A. Yes, it is correct
B. No, it is not correct
• No, it is not correct
ETHNOTECH ACADEMY
Q&A
What are the attributes of numpy array?
a. shape, dtype, ndim
b. objects, type, list
c. objects, non vectorization
d. Unicode and shape
• shape, dtype, ndim
ETHNOTECH ACADEMY
Q&A
What is the output of the following code?
import numpy as np
ary = np.array([1,2,3,5,8])
ary = ary + 1
print(ary[1])
• 3
ETHNOTECH ACADEMY
SUMMARY
• Fundamentals of Numpy library
• Usage of Numpy library in Data Science
• Various operations of Numpy
• Implementation of Merge and Concatenation operations in Numpy
ETHNOTECH ACADEMY
Session 4
Regression
• Introduction and Preprocessing
• Feature Selection Regularisation
• Residual Analysis
• Data Read
• Normality test and BoxCox transformation
• Linear Regression structure
ETHNOTECH ACADEMY
Session 4
• Linear Regression for Numeric features
• One-Hot Encoding and Scaling
• Linear Regression with One-Hot Encoded and Scaled Data
• Generic Treeflow in Prediction
• CatBoost
• CatBoost Hyperparameter Tuning
ETHNOTECH ACADEMY
Preprocessing
ETHNOTECH ACADEMY
Preprocessing Contd…
• Data preprocessing, a component of data preparation,
describes any type of processing performed on raw data to
prepare it for another data processing procedure.
• It has traditionally been an important preliminary step for
the data mining process.
• Data preprocessing techniques have also been adapted for
training machine learning models and AI models and for
running inferences against them.
ETHNOTECH ACADEMY
Preprocessing Contd…
• Data preprocessing transforms the data into a format that is
more easily and effectively processed in data mining,
machine learning and other data science tasks.
• The techniques are generally used at the earliest stages of
the machine learning and AI development pipeline to ensure
accurate results.
ETHNOTECH ACADEMY
Tools and methods for preprocessing
data
• Sampling, which selects a representative subset from a large
population of data.
• Transformation, which manipulates raw data to produce a
single input.
• Denoising, which removes noise from data.
ETHNOTECH ACADEMY
Tools and methods for
preprocessing data Contd…
• Imputation, which synthesizes statistically relevant data for
missing values.
• Normalization, which organizes data for more efficient
access.
• Feature extraction, which pulls out a relevant feature subset
that is significant in a particular context.
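A few of the methods listed above can be sketched with pandas. The data frame below is hypothetical, and normalization is shown here in the feature-scaling sense (min-max scaling) that is common in machine learning, which is an assumption about how the term would be applied in practice.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value in each column.
raw = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 35],
    "salary": [30000, 42000, 39000, np.nan, 36000, 50000],
})

# Imputation: fill each missing value with the median of its column.
imputed = raw.fillna(raw.median())

# Normalization (min-max scaling): rescale every column to the range [0, 1].
normalized = (imputed - imputed.min()) / (imputed.max() - imputed.min())

# Sampling: draw a reproducible random subset of the rows.
sample = normalized.sample(n=3, random_state=0)

print(imputed)
print(normalized)
print(sample)
```

The median is used for imputation because it is less sensitive to extreme values than the mean; other choices are equally valid depending on the data.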
ETHNOTECH ACADEMY
Why is data preprocessing important?
• Real-world data is messy and is often created, processed and
stored by a variety of humans, business processes and
applications.
• A data set may be missing individual fields, contain manual
input errors, or have duplicate data or different names to
describe the same thing.
• Humans can often identify and rectify these problems in the
data they use in the line of business, but data used to train
machine learning or deep learning algorithms needs to be
automatically preprocessed.
ETHNOTECH ACADEMY
Key steps in data preprocessing
ETHNOTECH ACADEMY
Regression
• Regression searches for relationships among variables.
• For example, you can observe several employees of some
company and try to understand how their salaries depend
on their features, such as experience, education level, role,
city of employment, and so on.
• The dependent features are called the dependent
variables, outputs, or responses.
• The independent features are called the independent
variables, inputs, regressors, or predictors.
ETHNOTECH ACADEMY
Regression Contd…
• Regression problems usually have one continuous and
unbounded dependent variable. The inputs, however, can
be continuous, discrete, or even categorical data such as
gender, nationality, or brand.
ETHNOTECH ACADEMY
Need of Regression
• Regression is typically needed to answer whether and how some phenomenon influences
another or how several variables are related.
• Example: you can use it to determine if and to what
extent experience or gender impacts salaries.
• Regression is also useful when you want to forecast a
response using a new set of predictors.
ETHNOTECH ACADEMY
Need of Regression Contd…
• Example: we could try to predict electricity consumption
of a household for the next hour given the outdoor
temperature, time of day, and number of residents in that
household.
• Regression is used in many different fields, including
economics, computer science, and the social sciences. Its
importance rises every day with the availability of large
amounts of data and increased awareness of the practical
value of data.
ETHNOTECH ACADEMY
Linear Regression
• Linear regression is probably one of the most important
and widely used regression techniques. It’s among the
simplest regression methods.
• One of its main advantages is the ease of interpreting
results.
• Simple or single-variate linear regression is the simplest
case of linear regression, as it has a single independent
variable, 𝐱 = 𝑥.
ETHNOTECH ACADEMY
Linear Regression Contd…
• The following figure illustrates simple linear regression:
ETHNOTECH ACADEMY
Steps involved in implementing linear
regression:
• Import the packages and classes that you need.
• Provide data to work with, and eventually do appropriate
transformations.
• Create a regression model and fit it with existing data.
• Check the results of model fitting to know whether the
model is satisfactory.
• Apply the model for predictions.
ETHNOTECH ACADEMY
Step 1: Import packages and classes
• The first step is to import the package numpy and the

class LinearRegression from sklearn.linear_model:
import numpy as np
from sklearn.linear_model import LinearRegression
ETHNOTECH ACADEMY
Step2: Provide data
• The second step is defining data to work with. The inputs

(regressors, 𝑥) and output (response, 𝑦) should be arrays or
similar objects. This is the simplest way of providing data for
regression:
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
ETHNOTECH ACADEMY
Step 3: Create a model and fit it
• Create an instance of the class LinearRegression, which

will represent the regression model:
model = LinearRegression()
• Your model as defined above uses the default values of

all parameters.
model.fit(x, y)
LinearRegression()
ETHNOTECH ACADEMY
Step 3 Contd…
With .fit(), you calculate the optimal values of the weights 𝑏₀

and 𝑏₁, using the existing input and output, x and y, as the
arguments. In other words, .fit() fits the model. It
returns self, which is the variable model itself.
That’s why you can replace the last two statements with this
one:
model = LinearRegression().fit(x, y)
ETHNOTECH ACADEMY
Step 4: Get results
• Once you have your model fitted, you can get the results to
check whether the model works satisfactorily and to
interpret it.
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
coefficient of determination: 0.7158756137479542

ETHNOTECH ACADEMY
Step 5: Predict response
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333

29.93333333 35.33333333]
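The numbers reported in Steps 4 and 5 can be checked by hand. The sketch below computes the weights 𝑏₀ and 𝑏₁ with the closed-form least-squares formulas in plain NumPy, using the same x and y as Step 2 (x is kept one-dimensional here for simplicity). On a fitted scikit-learn model the same weights are available as model.intercept_ and model.coef_.

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Ordinary least squares for a single input:
#   b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predicted response for the training inputs.
y_hat = b0 + b1 * x

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
r_sq = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"intercept b0: {b0}")
print(f"slope b1: {b1}")
print(f"coefficient of determination: {r_sq}")
print(f"predicted response:\n{y_hat}")
```

The slope comes out as 0.54 and the intercept as roughly 5.63, which reproduces the predicted responses and the coefficient of determination shown in the steps above.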
ETHNOTECH ACADEMY
Box-Cox Transformation
• The Box-Cox transformation transforms the data so that
its distribution is as close to a normal distribution as
possible, that is, the histogram looks like a bell. It requires strictly positive data.
• This technique has its place in feature engineering because
not all kinds of predictive models are robust to skewed
data, so it is worth using when experimenting. It probably
won’t provide a spectacular improvement, although at the
fine-tuning stage it can serve its purpose by improving our
evaluation metric.
ETHNOTECH ACADEMY
Box-Cox Equation in code
• The transformation itself has the following formula
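The formula itself is not reproduced in this text. The standard one-parameter form of the Box-Cox transformation, for a positive value y and a parameter λ, is:

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln y, & \lambda = 0
\end{cases}
\qquad (y > 0)
```

The parameter λ is usually chosen so that the transformed data looks as close to normal as possible.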
ETHNOTECH ACADEMY
Implementation
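The original implementation slide is not reproduced here. As a minimal sketch, assuming SciPy is available, scipy.stats.boxcox can both pick λ and apply the transformation. The skewed sample below is synthetic and chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed data (an assumption for this demo).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With lmbda left at its default (None), boxcox() also returns the lambda
# that maximizes the log-likelihood.
transformed, best_lambda = stats.boxcox(skewed)

# Skewness close to 0 indicates a more symmetric, bell-like histogram.
skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)

print(f"lambda: {best_lambda}")
print(f"skewness before: {skew_before}, after: {skew_after}")
```

Plotting a histogram of skewed next to one of transformed is the usual way to inspect the effect visually.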
ETHNOTECH ACADEMY
Output
ETHNOTECH ACADEMY
Catboost
• Catboost is a gradient-boosted decision tree machine learning
algorithm developed by Yandex.
• It works in the same way as other gradient boosted
algorithms such as XGBoost but provides support out of the
box for categorical variables, often reaches good accuracy
without parameter tuning, and also offers GPU support to
speed up training.
ETHNOTECH ACADEMY
Catboost Contd…
• Catboost is used for a range of regression and
classification tasks and has been shown to be a top
performer in various Kaggle competitions that involve
tabular data. Below are a couple of examples of where
Catboost has been successfully implemented:
  • Cloudflare uses Catboost to identify bots trying to target its
  users’ websites.
  • Ride-hailing service Careem, based in Dubai, uses Catboost to
  predict where its customers will travel to next.
ETHNOTECH ACADEMY
Implementation
• We are going to use the classic Titanic dataset to predict
whether a passenger on the ship survived or not. The
intention here is to keep this tutorial simple using a small
dataset but the principles will apply to more complex
datasets and problems you might be trying to solve.
• Step 1: import the libraries we will need and also the
Titanic dataset.
ETHNOTECH ACADEMY
Implementation Contd…
ETHNOTECH ACADEMY
Data Preparation
• Initially we’re simply going to drop any rows that contain
NaN for the “survived” column which is our target as this
doesn’t help our model.
df.dropna(subset=['survived'],inplace=True)
• We are only going to make use of 4 features: pclass, sex, age
and fare. Let’s split our data into X and y to get our feature
and target dataframes.
X = df[['pclass','sex', 'age', 'fare']].copy()  # copy, so later edits do not touch df
y = df['survived']
ETHNOTECH ACADEMY
Data Preparation Contd…
• Now we still need to treat some of the features. We need to
convert the “pclass” column to a string data type because,
although it appears numeric, the values are discrete so it’s
actually a categorical variable in this context. In addition, the
“fare” and “age” columns contain some NaNs so we’ll
replace these with zeros.
X['pclass'] = X['pclass'].astype('str')
X['fare'] = X['fare'].fillna(0)
X['age'] = X['age'].fillna(0)
ETHNOTECH ACADEMY
Preparing Categorical Features
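The code for this step is not reproduced in this text. CatBoost needs to know which columns are categorical, and one possible way to collect them is sketched below. The helper name get_categorical_indices and the demo rows are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def get_categorical_indices(frame):
    """Return the positional indices of all non-numeric columns."""
    return [
        position
        for position, column in enumerate(frame.columns)
        if not is_numeric_dtype(frame[column])
    ]


# Hypothetical rows shaped like the feature frame X from the previous slides.
X_demo = pd.DataFrame({
    "pclass": ["1", "3", "2"],          # stored as strings, so categorical
    "sex": ["female", "male", "male"],
    "age": [29.0, 0.0, 41.0],
    "fare": [211.3, 7.9, 13.0],
})

categorical_indices = get_categorical_indices(X_demo)
print(categorical_indices)   # [0, 1]
```

A list like this can then be handed to CatBoost through its cat_features argument when building the training data.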
ETHNOTECH ACADEMY
Preparing Categorical Features
Contd…
ETHNOTECH ACADEMY
Preparing Categorical Features Contd…
• Finally, before we begin training our model we need to split our
data into two datasets for training and testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=101, stratify=y)
• Now there is an additional complication: if we print out the
survival rate of our test set (which mirrors the training set because
the split is stratified), we can see that our training data
is imbalanced.
ETHNOTECH ACADEMY
print('Test Survival Rate:',y_test.sum()/y_test.count())
Test Survival Rate: 0.3816793893129771
• There are a few ways to handle this but in our example we

are simply going to undersample the training data
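The undersampling code is not reproduced in this text. One simple approach, sketched here on a small hypothetical training frame, is to randomly drop rows of the majority class until both classes have the same size.

```python
import pandas as pd

# Hypothetical training data: 6 non-survivors (0) and 3 survivors (1).
train_df = pd.DataFrame({
    "fare": [7.3, 8.1, 13.0, 26.0, 7.9, 10.5, 71.3, 53.1, 30.0],
    "survived": [0, 0, 0, 0, 0, 0, 1, 1, 1],
})

survived = train_df[train_df["survived"] == 1]
deceased = train_df[train_df["survived"] == 0]

# Randomly keep only as many majority-class rows as there are minority-class rows.
deceased_sampled = deceased.sample(n=len(survived), random_state=101)

balanced = pd.concat([survived, deceased_sampled], axis=0)
X_train_bal = balanced.drop(columns="survived")
y_train_bal = balanced["survived"]

print("Balanced survival rate:", y_train_bal.mean())   # 0.5
```

Only the training data is undersampled; the test set keeps its original class balance so that the evaluation reflects real conditions.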
ETHNOTECH ACADEMY
Training
ETHNOTECH ACADEMY
Training Contd…
• To train the model we are going to use Catboost’s built-in grid
search method. If you have used scikit-learn’s GridSearchCV
then this works in a similar way. First we declare a dictionary of
the hyperparameters that we want to tune and lists of values to
test. We have decided to tune just a few of the most influential
parameters: learning rate, tree depth, L2 leaf regularisation and
also the number of iterations we will train the model for.
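The dictionary itself is not reproduced in this text. A sketch of what such a grid could look like is given below. The keys are CatBoost's parameter names for the four settings mentioned above, while the candidate values are hypothetical.

```python
from itertools import product

# Hypothetical candidate values; adjust them to the problem at hand.
grid = {
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 10],
    "l2_leaf_reg": [1, 3, 5],
    "iterations": [50, 100, 150],
}

# Grid search tries every combination, so the cost grows multiplicatively.
combinations = list(product(*grid.values()))
n_combinations = len(combinations)

print(n_combinations)   # 54
```

Counting the combinations before starting is a cheap way to judge how long the search will take, since each one is trained and cross-validated.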
ETHNOTECH ACADEMY
Training Contd…
• Now we can fit the model using the grid search method by
passing the grid dictionary we declared above along with
the training data pool. By default grid search splits the
training data into an 80/20 split for training and testing with
a three fold cross validation strategy.
model.grid_search(grid,train_dataset)
ETHNOTECH ACADEMY
Training Contd…
• The model has now been trained and you can print out the
optimum parameters that have been found using grid
search if you’re interested.
model.get_params()
ETHNOTECH ACADEMY
Evaluation
• Now that we have trained our model we can evaluate how it
performs on our test data and then briefly see what
features are most influential.
• To start with we’ll use our model to make predictions for our
test set and then print out a classification report.
pred = model.predict(X_test)
print(classification_report(y_test, pred))
ETHNOTECH ACADEMY
Project
• Implementation of Linear Regression on Real World Scenario to

predict sales based on the money spent on TV for advertising.
ETHNOTECH ACADEMY
SUMMARY
• Fundamentals of Linear equation
• Need of Linear Regression
• Implementation of Box Cox Transformation
• Implementation of Cat boost over categorical data
ETHNOTECH ACADEMY
Q&A
• State whether the following is true or false: “Data Preprocessing is a
preliminary step for data mining”
A. True
B. False
• True
ETHNOTECH ACADEMY
Q&A
• Which of the following manipulates raw data to produce a single
input?
A. Sampling
B. Abstraction
C. Transformation
D. Denoising
• Transformation
ETHNOTECH ACADEMY
Q&A
• _________ synthesizes statistically relevant data for missing values.

a. Feature extraction
b. Imputation
c. Normalization
d. Sampling
• Imputation
ETHNOTECH ACADEMY
Q&A
• State whether the following is true or false: “Feature extraction pulls out
an irrelevant feature subset that is significant in a particular context.”
a. True
b. False
• False
ETHNOTECH ACADEMY
Q&A
• Which of the following searches for relationships among variables?

a. Classification
b. Association
c. Regression
d. Clustering
• Regression
ETHNOTECH ACADEMY
Q&A
• The following diagram illustrates which type of regression

a. Linear
b. Logistic
c. Exponential
d. Categorical
• Linear
ETHNOTECH ACADEMY
Q&A
• Which machine learning algorithm is used by Cloudflare to identify
bots trying to target its users’ websites?
a. Box-Cox
b. Cat boost
c. XGBoost
d. Random forest
• Cat boost
ETHNOTECH ACADEMY
• State whether the following statement is true or false.
• “Seaborn is a Python data visualization library based on matplotlib.”
a. True
b. False
• True
ETHNOTECH ACADEMY
Session 5
Classification
• Classification Introduction
• Classification: Code and data load
• Classification: Random Forest
• Classification: Random Forest code
ETHNOTECH ACADEMY
Session 5
• Classification: CatBoost code
• Classification: One class SVM code
• Classification: Logistic Regression
• Classification: Logistic Regression code
ETHNOTECH ACADEMY
Classification
Introduction
• Data Classification in data science refers to the process that
tags and categorizes any kind of data so that it can be better
understood and analyzed. The latter is what we'll be
focusing on.
• But also, a well-planned Data Classification
system makes essential data easy to find
and retrieve.
ETHNOTECH ACADEMY
Types of Data Classification
ETHNOTECH ACADEMY
Types of Data Classification Contd…
• Content-based classification: In this classification type, the
contents of each file are the basis for categorization.
• User-based classification: User-based classification relies on
the user’s knowledge of creation, editing, reviewing, or
dissemination to label sensitive documents. These
individuals can specify how sensitive each document is.
• Context-based classification: Context-based classification
focuses on the context of the data, such as the location,
application, and creator, as well as other variables that affect
the data.
ETHNOTECH ACADEMY
Benefits of Classification
• Urgency detection: A pre-trained model can classify
inbound texts and support tickets to determine whether
they should be labeled as urgent or not urgent.
• Sentiment detection: NLP, or Natural Language Processing,
can be used to detect the sentiment of any given content,
saving time by routing the right messages to the right people.
• Topic labeling: Topic labeling consists of tagging topics with
a couple of descriptive words or phrases. This is done by
using an NLP technique to identify themes and meanings,
e.g. to classify an incoming email attachment and forward it to
the right folder in your storage system.
ETHNOTECH ACADEMY
Data Classification applications
• Text Classification
• Document Classification
• Image Classification
Text Classification
• Text classification is a powerful tool that uses NLP to make use of the
unstructured data we all sit on top of. In the
words of our users, it feels like wizardry when you create
your first classifier and see hundreds of survey responses
categorized in seconds.
ETHNOTECH ACADEMY
Contd…
Document Classification
• Document Classification focuses on processes that mainly
apply content-specific classification - e.g. classifying
incoming email attachments by type. It differs from text
classification, as instead of specific phrases or paragraphs
being classified, the whole document is taken into
consideration.
ETHNOTECH ACADEMY
Contd…
ETHNOTECH ACADEMY
Contd…
Image Classification
• Image Classification categorizes any incoming image file by
predetermined labels. It is often combined with object
detection. These days you can create your own image
classifier and teach the model to make subjective decisions
based on your logic: whether an incoming ad creative is
good or not; whether the image fits into the product
portfolio; whether an image you snapped on your holidays
is appropriate to show to your grandparents.
ETHNOTECH ACADEMY
Contd…
ETHNOTECH ACADEMY
Classification: Random Forest
• Random forests is a supervised learning algorithm. It can be
used both for classification and regression. It is also
flexible and easy to use.
• A forest is comprised of trees. It is said that the more trees
it has, the more robust a forest is.
• Random forests creates decision trees on randomly
selected data samples, gets a prediction from each tree and
selects the best solution by means of voting.
ETHNOTECH ACADEMY
Random Forest Contd…
• It also provides a pretty good indicator of the feature

importance.
• Random forests has a variety of applications, such as
recommendation engines, image classification and feature
selection.
• It can be used to classify loyal loan applicants, identify
fraudulent activity and predict diseases. It lies at the base of
the Boruta algorithm, which selects important features in a
dataset.
ETHNOTECH ACADEMY
Random Forest Algorithm
1. Select random samples from a given dataset.
2. Construct a decision tree for each sample and get a
prediction result from each decision tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most
votes as the final prediction.
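The four steps above can be sketched by hand with scikit-learn's DecisionTreeClassifier. This is a simplified illustration on the Iris data, not the library's own RandomForestClassifier implementation; the number of trees and the random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Step 1: draw a random sample of the training rows, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: build one decision tree per sample.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: collect one vote per tree for every test row.
votes = np.array([tree.predict(X_test) for tree in trees])   # shape (25, n_test)

# Step 4: the class with the most votes is the final prediction.
final_pred = np.array([np.bincount(column).argmax() for column in votes.T])
accuracy = np.mean(final_pred == y_test)

print(f"accuracy of the hand-built forest: {accuracy:.3f}")
```

Each tree sees a slightly different sample and a random subset of features at every split, which is what makes the individual trees differ and the vote worthwhile.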
ETHNOTECH ACADEMY
Advantages
• Random forests is considered a highly accurate and
robust method because of the number of decision trees
participating in the process.
• It is much less prone to overfitting than a single decision tree. The main
reason is that it averages the predictions of many trees,
which reduces variance.
• The algorithm can be used in both classification and
regression problems.
ETHNOTECH ACADEMY
Advantages Contd…
• Random forests can also handle missing values. There are
two ways to handle these: using median values to replace
continuous variables, and computing the proximity-weighted
average of missing values.
• You can get the relative feature importance, which helps in
selecting the most contributing features for the classifier.
ETHNOTECH ACADEMY
Disadvantages:
• Random forests is slow in generating predictions because it
has multiple decision trees. Whenever it makes a prediction,
all the trees in the forest have to make a prediction for the
same given input and then perform voting on it. This whole
process is time-consuming.
• The model is difficult to interpret compared to a decision
tree, where you can easily make a decision by following the
path in the tree.
ETHNOTECH ACADEMY
Implementing Random Forest
Classification Using IRIS Dataset
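The original code slides are not reproduced in this text. A minimal sketch with scikit-learn's RandomForestClassifier on the built-in Iris data could look like the following; the split size, number of trees, and random seeds are assumptions made for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Load the data and split it into a training set and a test set.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Create the model and fit it to the training data.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict the test set and summarize the results.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

# Relative importance of each feature (the values sum to 1).
importances = dict(zip(iris.feature_names, clf.feature_importances_))

print(f"accuracy: {accuracy:.3f}")
print(cm)
print(importances)
```

The same load, split, fit, predict, and confusion-matrix sequence applies to other tabular data sets once the features and target are separated.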
ETHNOTECH ACADEMY
Implementing Random Forest
Classification on a Real-World Data Set
1. IMPORTING PYTHON LIBRARIES AND LOADING OUR DATA SET
INTO A DATA FRAME
ETHNOTECH ACADEMY
2. SPLITTING OUR DATA SET INTO
TRAINING SET AND TEST SET
ETHNOTECH ACADEMY
3. CREATING A RANDOM FOREST CLASSIFICATION
MODEL AND FITTING IT TO THE TRAINING DATA
ETHNOTECH ACADEMY
4. PREDICTING THE TEST SET RESULTS AND
MAKING THE CONFUSION MATRIX
ETHNOTECH ACADEMY
Catboost Classifier
• The CatBoost algorithm is a Supervised Machine

Learning algorithm developed by Yandex researchers and
engineers.
• It is used for search, recommendation systems, personal
assistants, self-driving cars, weather prediction, and many
other tasks.
ETHNOTECH ACADEMY
key features of cat boost algorithm:
• The algorithm automatically takes care of NULL values in the
dataset
• The algorithm automatically encodes categorical features,
so no manual label encoding is needed
• The algorithm uses symmetric (oblivious) binary decision trees
ETHNOTECH ACADEMY
Catboost Classifier
• Our sample dataset contains data about three different flowers
represented by their size and color:
• Most Machine Learning algorithms require only numerical
features for the training process, so if you try to use this
dataset “as is” without prior preprocessing, you will not be able
to do it. You have to convert the categorical features’ values (the
“Color” column in our example) into numeric values.
• Such conversion is called label encoding.
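The flower table is not reproduced in this text, so the sketch below uses hypothetical rows with a Size and a Color column to show what label encoding does. It uses pandas' factorize function, which is one of several ways to do this.

```python
import pandas as pd

# Hypothetical flower data (not the table from the original slide).
flowers = pd.DataFrame({
    "Size": [3.1, 5.4, 4.2, 2.8],
    "Color": ["red", "yellow", "red", "purple"],
    "Flower": ["rose", "sunflower", "rose", "violet"],
})

# Label encoding: map each distinct category to an integer code,
# numbered in order of first appearance.
codes, categories = pd.factorize(flowers["Color"])
flowers["Color_encoded"] = codes

print(list(categories))                     # ['red', 'yellow', 'purple']
print(flowers["Color_encoded"].tolist())    # [0, 1, 0, 2]
```

The integer codes carry no real order, which is one reason libraries such as CatBoost handle categorical columns with their own encoding instead.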
ETHNOTECH ACADEMY
Catboost Classifier Contd…
• Catboost is used for a range of regression and classification
tasks and has been shown to be a top performer in various
Kaggle competitions that involve tabular data.
• Cloudflare uses Catboost to identify bots trying to target its
users’ websites
• Ride-hailing service Careem, based in Dubai, uses Catboost
to predict where its customers will travel to next
ETHNOTECH ACADEMY
Steps involved in Catboost
implementation
• Installation and Imports
• Define Dataset
• Apply Model
• Predict
ETHNOTECH ACADEMY
• In this example we apply CatBoost’s regressor to a regression dataset. The
dataset contains the price information of houses in Dushanbe
city. The input variables are the number of rooms, floors, area,
and location.
ETHNOTECH ACADEMY
Step1: Installations and Imports
ETHNOTECH ACADEMY
Step2: Define Dataset
ETHNOTECH ACADEMY
Step3: Apply Model
ETHNOTECH ACADEMY
Step4:Predict
ETHNOTECH ACADEMY
Classification: One class SVM
• Classification problems are often solved using supervised learning
algorithms such as Random Forest Classifier, Support Vector Machine,
Logistic Regression (for binary classification) etc.
• A specific type of binary classification problem in which the training
examples come from a single class is called One-Class Classification (OCC).
• A One-class classification method is used to detect the outliers and
anomalies in a dataset. Built on the Support Vector Machine (SVM)
approach, the One-class SVM applies a One-class classification
method for novelty detection.
ETHNOTECH ACADEMY
Classification
• One-Class Classification is solved using an unsupervised or
semi-supervised learning algorithm such as One-Class
Support Vector Machines (1-SVM), Support Vector Data
Description (SVDD) etc. One of the popular examples of
One-Class Classification is Anomaly Detection (AD), i.e.,
outlier detection and novelty detection.
ETHNOTECH ACADEMY
Classification Contd…
The Scikit-learn API provides the OneClassSVM class for this
algorithm. We will use it in three steps:
1. Preparing the data
2. Defining the model and prediction
3. Anomaly detection with scores
• Preparing the data
• We'll create a random sample dataset for this tutorial by
using the make_blobs() function. We'll check the dataset by
visualizing it in a plot.
ETHNOTECH ACADEMY
Implementation of One-Class SVM
Detection of anomalies in a dataset
by using the One-class SVM
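The original code is not reproduced in this text. The three steps can be sketched as follows with scikit-learn; the sample size, the nu value, and the 3 % score threshold are assumptions made for this illustration, and the plotting step is left out.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM

# 1. Prepare the data: a single blob of 300 two-dimensional points.
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.3, random_state=13)

# 2. Define the model and predict: -1 marks an outlier, 1 an inlier.
#    nu is an upper bound on the fraction of training points treated as outliers.
svm = OneClassSVM(kernel="rbf", nu=0.03)
pred = svm.fit_predict(X)
outliers_by_label = X[pred == -1]

# 3. Anomaly detection with scores: flag the lowest-scoring 3 % of points.
scores = svm.score_samples(X)
threshold = np.quantile(scores, 0.03)
outliers_by_score = X[scores < threshold]

print(len(outliers_by_label), "outliers by predicted label")
print(len(outliers_by_score), "outliers by score threshold")
```

The score-based variant gives direct control over how many points are flagged, which is useful when the expected share of anomalies is known in advance.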
ETHNOTECH ACADEMY
Logistic Regression
• Logistic regression aims to solve classification problems.
• It does this by predicting categorical outcomes, unlike linear
regression that predicts a continuous outcome.
ETHNOTECH ACADEMY
Logistic Regression Contd…
• In the simplest case there are two outcomes, which is called
binomial, an example of which is predicting if a tumor is
malignant or benign.
• Other cases have more than two outcomes to classify; in this
case it is called multinomial.
• A common example for multinomial logistic regression
would be predicting the class of an iris flower between 3
different species.
ETHNOTECH ACADEMY
Implementation of Logistic Regression
Scenario
• User Database – This dataset contains information about
users from a company’s database. It contains information
about UserID, Gender, Age, EstimatedSalary, and
Purchased. We are using this dataset for predicting whether
a user will purchase the company’s newly launched product
or not.
• Refer to the table below, which shows sample rows from
the dataset.
ETHNOTECH ACADEMY
Implementation of Logistic Regression
Contd…
• Let us make the Logistic Regression model, predicting
whether a user will purchase the product or not
ETHNOTECH ACADEMY
Import Libraries
Read and Explore the data
ETHNOTECH ACADEMY
Import Libraries Contd…
• Now, to predict whether a user will purchase the product or
not, one needs to find out how Age and
Estimated Salary relate to the purchase decision. Here User ID and Gender are not
important factors for finding this out.
ETHNOTECH ACADEMY
Splitting The Dataset: Train and
Test dataset
• Splitting the dataset to train and test.

• 75% of data is used for training the model and 25% of it is
used to test the performance of our model.
ETHNOTECH ACADEMY
Splitting The Dataset: Train and
Test dataset Contd…
Now, it is very important to perform feature scaling here
because Age and Estimated Salary values lie in different
ranges. If we don’t scale the features, the Estimated Salary
feature, with its much larger values, will dominate the Age feature
when the model is fitted.
ETHNOTECH ACADEMY
Output
• Here one can see that the Age and Estimated Salary feature
values are scaled and now lie in a comparable range around zero
(roughly -1 to 1 for most values). Hence, each
feature will contribute equally to decision making.
ETHNOTECH ACADEMY
Training our Logistic Regression
model
• After training the model, it is time to use it to do predictions

on testing data.
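The code slides for splitting, scaling, training, and predicting are not reproduced in this text. The sketch below runs the same sequence with scikit-learn on a synthetic stand-in for the user database; the generated values and the rule that links Age and EstimatedSalary to Purchased are assumptions made only for this demo.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the user database (hypothetical values).
rng = np.random.default_rng(0)
n = 400
age = rng.integers(18, 61, size=n)
salary = rng.integers(15000, 150001, size=n)
# Assumed rule for the demo: older, better-paid users buy more often.
signal = 0.1 * (age - 38) + (salary - 70000) / 30000
purchased = (signal + rng.normal(0, 0.5, size=n) > 0).astype(int)
users = pd.DataFrame({"Age": age, "EstimatedSalary": salary, "Purchased": purchased})

X = users[["Age", "EstimatedSalary"]]
y = users["Purchased"]

# 75 % of the rows train the model, 25 % test it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model and predict on the test data.
clf = LogisticRegression(random_state=0)
clf.fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)

# Compare predicted and actual values.
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy: {accuracy:.2f}")
```

Fitting the scaler on the training rows only keeps information from the test set out of the model, so the test accuracy stays an honest estimate.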
ETHNOTECH ACADEMY
Evaluation Metrics
• Metrics are used to check the model performance on

predicted values and actual values
Output:
ETHNOTECH ACADEMY
Visualizing the performance of our
model
ETHNOTECH ACADEMY
model Contd…
ETHNOTECH ACADEMY
model Contd…
ETHNOTECH ACADEMY
Q&A
• ______ classification relies on the user’s knowledge of creation,

editing, reviewing, or dissemination to label sensitive documents
a. Content based
b. Data based
c. User based
d. Context based
• User based
ETHNOTECH ACADEMY
Q&A
• Classifying incoming email attachments by type is an example of

which type of classification?
a. Text classification
b. Document classification
c. Image classification
d. Voice classification
• Document classification
ETHNOTECH ACADEMY
Q&A
• Which of the following is not an application of Random forest

algorithm?
a. Recommendation engines
b. Image classification
c. Feature selection
d. Data preprocessing
• Data preprocessing
ETHNOTECH ACADEMY
Q&A
• Which of the following algorithms is not an example of an ensemble
learning algorithm?
a. Random Forest
b. Adaboost
c. Decision Trees
d. Gradient Boosting
• Decision Trees
ETHNOTECH ACADEMY
Q&A
• Which of the following is not an advantage of the Random forest
algorithm?
a. It is considered to be highly accurate and robust
b. It can be used for both classification and regression
c. It does not suffer from the overfitting problem
d. It is slow in generating predictions because it has multiple
decision trees
• It is slow in generating predictions because it has multiple decision
trees
ETHNOTECH ACADEMY
Session 7
Clustering
• Clustering: Introduction
• Clustering: KMeans
• Clustering: Agglomerative
• Classification: KNN
• Classification: KNN using Iris
ETHNOTECH ACADEMY
Clustering
• Machine learning is a subset of Artificial Intelligence that
allows a machine to automatically learn from past data
without being explicitly programmed.
• Classical machine learning is often categorized by how an
algorithm learns to become more accurate in its
predictions.
• Clustering is the task of dividing the population or data
points into several groups such that data points in the same
groups are similar to other data points in that group and
dissimilar to the data points in other groups.
ETHNOTECH ACADEMY
Clustering Contd…
• It is basically an assembly of objects based on similarity and
dissimilarity between them.
ETHNOTECH ACADEMY
THE IMPORTANCE OF CLUSTERING
• Clustering helps in understanding the natural grouping in a
dataset.
• The goal is to partition the data into a set of meaningful
groups.
• Clustering quality depends on the method used and on its
ability to identify hidden patterns.
• The biggest advantage of clustering over classification is
that it can adapt to changes in the data and helps single out
useful features that differentiate groups.
ETHNOTECH ACADEMY
Applications of Clustering
• It is widely used in many applications such as image
processing, data analysis, and pattern recognition.
• It can be used in the field of biology, by deriving animal and
plant taxonomies and identifying genes with the same
capabilities.
• It also helps in information discovery by classifying
documents on the web.
• It helps marketers to find the distinct groups in their
customer base, and they can characterize their customer
groups by using purchasing patterns.
ETHNOTECH ACADEMY
K Means clustering
• K-means as a clustering algorithm is deployed to discover
groups that haven’t been explicitly labeled within the data.
It’s being actively used today in a wide variety of business
applications including:
• Customer segmentation: Customers can be grouped in
order to better tailor products and offerings.
• Text, document, or search results clustering: grouping to
find topics in text.
ETHNOTECH ACADEMY
K Means clustering Contd…
• Image grouping or image compression: groups similar images
or colors.
• Anomaly detection: finds what isn’t similar, i.e. the outliers
from clusters.
• Semi-supervised learning: clusters are combined with a
smaller set of labeled data and supervised machine learning
in order to get more valuable results.
ETHNOTECH ACADEMY
Working of K-means
• The K-means algorithm identifies a certain number of
centroids within a data set, a centroid being the arithmetic
mean of all the data points belonging to a particular cluster.
• The algorithm then allocates every data point to the nearest
cluster as it attempts to keep the clusters as small as
possible (the ‘means’ in K-means refers to the task of
averaging the data or finding the centroid).
• At the same time, K-means attempts to keep the other
clusters as different as possible.
ETHNOTECH ACADEMY
Working of K-means Contd…
In practice it works as follows:
• The K-means algorithm begins by initializing the
coordinates of the K cluster centers. (The number K is an
input variable, and the initial locations can also be given as
input.)
ETHNOTECH ACADEMY
• With every pass of the algorithm, each point is assigned to
its nearest cluster center.
ETHNOTECH ACADEMY
• The cluster centers are then updated to be the “centers” of all
the points assigned to them in that pass. This is done by
recalculating each cluster center as the average of the points
in its cluster.
• The algorithm repeats until the cluster centers change very
little (or not at all) from the previous iteration.
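The passes described above can be sketched in a few lines of NumPy. This is an illustrative version only: the function name kmeans, the tolerance tol and the toy data are assumptions, not the implementation used later in the slides.

```python
import numpy as np

def kmeans(points, k, n_iter=100, tol=1e-6, seed=0):
    """Plain K-means: assign points to the nearest center, then re-average."""
    rng = np.random.default_rng(seed)
    # Initialise: pick k distinct data points as the starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # stop when the centers barely move
            break
    return centers, labels

# Two well-separated toy blobs (made-up data).
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans(points, k=2)
print(centers)
```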
ETHNOTECH ACADEMY
Limitations of Kmeans
• If the clusters have more complex geometric shapes, the
algorithm does a poor job of clustering the data.
• The algorithm does not allow data points distant from one
another to share the same cluster, regardless of whether
they belong in the cluster.
• K-means does not itself learn the number of clusters from
the data; that information must be pre-defined.
• When clusters overlap, K-means cannot determine how to
assign data points where the overlap occurs.
ETHNOTECH ACADEMY
Implementation of K means
algorithm
Importing Libraries
ETHNOTECH ACADEMY
Working with Dataset
ETHNOTECH ACADEMY
Visualize the data points
ETHNOTECH ACADEMY
Visualize the data points Contd…
ETHNOTECH ACADEMY
Find the K value using the Elbow
method
ETHNOTECH ACADEMY
Find the K value using the Elbow
method Contd…
• WCSS doesn’t decrease much after k = 5, so we can choose 5
as the number of clusters (K).
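A hedged sketch of the Elbow method with scikit-learn, using made-up blob data in place of the dataset from the slides. WCSS is read from the fitted model's inertia_ attribute:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data with five blobs, standing in for the real dataset.
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=42)

wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this k

for k, value in enumerate(wcss, start=1):
    print(k, round(value, 1))
```

Plotting wcss against k (for example with matplotlib) gives the elbow curve; the bend marks the point after which adding clusters helps little.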
ETHNOTECH ACADEMY
Training the K-means algorithm on
the training dataset
Centroid points
• array([[88.2 , 17.11428571],
[55.2962963 , 49.51851852],
[86.53846154, 82.12820513],
[25.72727273, 79.36363636],
[26.30434783, 20.91304348]])
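Fitting the model with K = 5 and reading the centroid points could look like this. The data is again made up, so the centroids will differ from the array shown above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up two-feature data standing in for the training dataset.
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # cluster index for every row
centroids = kmeans.cluster_centers_   # one (x, y) centroid per cluster

print(centroids)
```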
ETHNOTECH ACADEMY
Visualize the clusters formed
ETHNOTECH ACADEMY
Visualize the clusters formed Contd…
ETHNOTECH ACADEMY
Agglomerative Clustering
ETHNOTECH ACADEMY
Agglomerative Clustering Contd…
• In this clustering approach, we start with the cluster leaves
and then move upward until the cluster root is finally
obtained.
• Initially, this approach treats each data point in the dataset
as an independent single-element cluster (leaf).
ETHNOTECH ACADEMY
• Since the two most similar clusters are combined at each
step, we obtain fewer clusters at each iteration than at
the previous iteration.
• This process continues until we obtain one big cluster (root)
whose elements are clusters of comparable properties.
• Once all clustering is completed, we visualize data clusters
using a scatter plot.
ETHNOTECH ACADEMY
Working of Agglomerative Hierarchical
Clustering
• Step-1: Create each data point as a single cluster. Let's say
there are N data points, so the number of clusters will also be
N.
ETHNOTECH ACADEMY
Working of Agglomerative Hierarchical Clustering
• Step-2: Take two closest data points or clusters and merge

them to form one cluster. So, there will now be N-1 clusters.
ETHNOTECH ACADEMY
• Step-3: Again, take the two closest clusters and merge them
together to form one cluster. There will be N-2 clusters.
ETHNOTECH ACADEMY
• Step-4: Repeat Step 3 until only one cluster is left. So, we will
get the following clusters. Consider the below images:
ETHNOTECH ACADEMY
• Step-5: Once all the clusters are combined into
one big cluster, develop the dendrogram to divide
the clusters as per the problem.
ETHNOTECH ACADEMY
Working of Dendrogram in Hierarchical
clustering
• The dendrogram is a tree-like structure that records each
merge step performed by the hierarchical clustering (HC)
algorithm. In the dendrogram plot, the y-axis shows the
Euclidean distances between the data points, and the x-axis
shows all the data points of the given dataset.
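The merge record behind a dendrogram can be inspected with SciPy. This is a sketch on six made-up points; single linkage is an assumption here, chosen so that the merge heights are plain Euclidean distances:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Six made-up 2-D points.
points = np.array([
    [1.0, 1.0], [1.2, 1.1],
    [5.0, 5.0], [5.1, 5.2],
    [9.0, 1.0], [9.2, 1.3],
])

# Each row of Z records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster.
Z = linkage(points, method="single", metric="euclidean")

# no_plot=True returns the tree layout without drawing it.
tree = dendrogram(Z, no_plot=True)
print(Z)
print(tree["ivl"])  # data point indices in x-axis order
```

Dropping no_plot=True (with matplotlib installed) draws the tree itself.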
ETHNOTECH ACADEMY
In this flowchart, we assume a dataset with N elements, where N = 6.
The steps involved in the clustering above are:
• Step 1: Initially, assume each data point is an independent cluster,
i.e. 6 clusters.
• Step 2: Merge the two closest data points into a single cluster. This
leaves 5 clusters.
• Step 3: Again, merge the two closest clusters into a single cluster.
This leaves 4 clusters.
• Step 4: Repeat Step 3 until a single cluster containing all data
points is obtained.
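The steps can be traced with a deliberately naive loop. The function name agglomerate, the single-linkage distance and the six points are assumptions made for illustration:

```python
import numpy as np

def agglomerate(points):
    """Merge the two closest clusters until one remains (single linkage)."""
    clusters = [[i] for i in range(len(points))]  # Step 1: N singleton clusters
    history = [len(clusters)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
        history.append(len(clusters))
    return history

points = np.array([
    [1.0, 1.0], [1.2, 1.1],
    [5.0, 5.0], [5.1, 5.2],
    [9.0, 1.0], [9.2, 1.3],
])
print(agglomerate(points))  # [6, 5, 4, 3, 2, 1]
```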
ETHNOTECH ACADEMY
• If we visualize the dendrogram, we should obtain a tree-like
structure with the root at the top like the one shown below:
ETHNOTECH ACADEMY
Implementation of Agglomerative
clustering
• Agglomerative Clustering: Agglomerative Clustering is one
of the most common hierarchical clustering techniques.
Dataset – Credit Card Dataset.
• Assumption: The clustering technique assumes that the data
points are similar enough to one another that, starting from
one cluster per data point, they can all eventually be merged
into a single cluster.
ETHNOTECH ACADEMY
Step 1: Importing the required
libraries
ETHNOTECH ACADEMY
Step 2: Loading and Cleaning the
data
ETHNOTECH ACADEMY
Step 3: Preprocessing the data
ETHNOTECH ACADEMY
Step 4: Reducing the dimensionality
of the Data
• Dendrograms are used to divide a given cluster into many
different clusters.
ETHNOTECH ACADEMY
Step 5: Visualizing the working of
the Dendrograms
ETHNOTECH ACADEMY
Step 5 Contd…
ETHNOTECH ACADEMY
Step 6
Building and visualizing the different clustering models for
different values of k
a) k = 2
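Steps 3, 4 and 6 could be sketched as below. The credit card dataset is not bundled here, so a made-up numeric table stands in for it; the exact preprocessing choices (standardise, then normalise, then PCA to two components) are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize

# Made-up numeric table standing in for the cleaned credit card data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 6)), rng.normal(4, 1, (60, 6))])

X_scaled = StandardScaler().fit_transform(X)  # Step 3: standardise each column
X_normalized = normalize(X_scaled)            # Step 3: scale each row to unit length
X_principal = PCA(n_components=2).fit_transform(X_normalized)  # Step 4: reduce to 2-D

model = AgglomerativeClustering(n_clusters=2)  # Step 6 with k = 2 (Ward linkage by default)
labels = model.fit_predict(X_principal)
print(labels[:10])
```

Repeating the last two lines with n_clusters set to 3, 4 and so on gives the models for the other values of k.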
ETHNOTECH ACADEMY
Step 6 Contd…
ETHNOTECH ACADEMY
K-Nearest Neighbors (KNN)
• This algorithm is used to solve classification problems.
• The K-nearest neighbor (K-NN) algorithm basically creates an
imaginary boundary to classify the data.
• When new data points come in, the algorithm predicts their
class from the nearest neighbors, i.e. according to the side of
the boundary on which they fall.
• Note: It’s very important to have the right k-value when
analyzing the dataset to avoid overfitting and underfitting.
ETHNOTECH ACADEMY
K-Nearest Neighbors (KNN) Contd…
ETHNOTECH ACADEMY
Consider, we have a new data point and we need to
put it in the required category
ETHNOTECH ACADEMY
Firstly, we will choose the number of neighbors, so we will
choose the k=5.
Next, we will calculate the Euclidean distance between the
data points. The Euclidean distance is the distance between
two points, which we have already studied in geometry. For
two points (x1, y1) and (x2, y2) it can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
ETHNOTECH ACADEMY
By calculating the Euclidean distances we find the five nearest
neighbors: three in category A and two in category B.
Since the majority (3 of 5) of the nearest neighbors are from
category A, the new data point is assigned to category A.
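The vote described above can be reproduced with a small pure-Python sketch. The function name knn_predict and the training points are made up for illustration:

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, new_point, k=5):
    # Euclidean distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

train_points = [(1, 1), (2, 1), (1, 2), (6, 6), (7, 6), (6, 7), (7, 7)]
train_labels = ["A", "A", "A", "B", "B", "B", "B"]

# The five nearest neighbors of (3, 3) are three A points and two B points.
print(knn_predict(train_points, train_labels, new_point=(3, 3), k=5))  # A
```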
ETHNOTECH ACADEMY
Steps involved in KNN algorithm:
1. Import the k-nearest neighbor algorithm from the
scikit-learn package.
2. Create feature and target variables.
3. Split data into training and test data.
4. Generate a k-NN model using a neighbors value.
5. Train or fit the data into the model.
6. Predict the labels of new (test) data.
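The six steps map onto scikit-learn roughly as follows. The 80/20 split, k = 5 and the random seed are illustrative choices, not values taken from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier  # step 1: import the algorithm

X, y = load_iris(return_X_y=True)                   # step 2: features and target

X_train, X_test, y_train, y_test = train_test_split(  # step 3: train/test split
    X, y, test_size=0.2, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)           # step 4: model with a neighbors value
knn.fit(X_train, y_train)                           # step 5: fit on the training data
y_pred = knn.predict(X_test)                        # step 6: predict test labels

accuracy = knn.score(X_test, y_test)
print(accuracy)
```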
ETHNOTECH ACADEMY
Implementation of KNN algorithm using IRIS
Dataset
ETHNOTECH ACADEMY
Steps involved in KNN algorithm
Contd…
ETHNOTECH ACADEMY
Steps involved in KNN algorithm Contd…
ETHNOTECH ACADEMY
Implementation using Iris Dataset
in Python
• K nearest neighbor (KNN) is a simple and efficient method
for classification problems. Moreover, KNN is a classification
algorithm based on statistical learning that has been
studied in pattern recognition, data science, and machine
learning.[1], [2] The technique aims to assign an unseen
point to the dominant class among its k nearest neighbors
within the training set.[3]
ETHNOTECH ACADEMY
in Python Contd…
• The training data is 50% of the Iris dataset (75 rows), and
the testing data is the other 50% (75 rows). The dataset has
four measurements that are used for KNN training: sepal
length, sepal width, petal length, and petal width. The
species (class) attribute is used as the prediction target,
with each row classed as Iris-setosa, Iris-versicolor, or
Iris-virginica.
ETHNOTECH ACADEMY
in Python Contd…
Import libraries:
ETHNOTECH ACADEMY
in Python Contd…
• Record the start time to measure the computation time:
• Loading Dataset:
ETHNOTECH ACADEMY
in Python Contd…
• Make a KNN Class
• Build a Function inside the KNN Class:
• Function Initialization
Parameter Description:
k(int): The nearest k instances
ETHNOTECH ACADEMY
in Python Contd…
• Function for Load Training Data
TrainingPath(string): File path of the training dataset
ColoumnName(string): Column name of the given dataset
ETHNOTECH ACADEMY
in Python Contd…
• Function for Getting Testing Data
TestingPath(string): File path of the testing dataset
ColoumnName(string): Column name of the given dataset
ETHNOTECH ACADEMY
in Python Contd…
• Function for predicting the label of each testing point
TestPoint ( < numpy.ndarray > ): Features data frame of
testing data
ETHNOTECH ACADEMY
in Python Contd…
• Graphic of Training & Testing Accuracy with k = 1 to 7
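The accuracy values behind such a graph could be computed as in this sketch. It uses scikit-learn's KNeighborsClassifier rather than the custom KNN class described above, and the random 50/50 split is an assumption:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# 50% training, 50% testing: 75 rows each.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

results = []
for k in range(1, 8):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results.append((k, train_acc, test_acc))

for k, train_acc, test_acc in results:
    print(k, round(train_acc, 3), round(test_acc, 3))
```

Plotting the two accuracy columns against k (for example with matplotlib) gives a training and testing accuracy curve.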
ETHNOTECH ACADEMY
RESULT AND DISCUSSION
Explanation of Training and Testing Result
Figure 1. Graph of Training and Testing Accuracy using K
Nearest Neighbors (KNN)
ETHNOTECH ACADEMY
Project
• Implementation of clustering on a real-world scenario: segment
the clients of a wholesale distributor based on their
annual spending on diverse product categories, like milk,
grocery, region, etc.
ETHNOTECH ACADEMY
ETHNOTECH ACADEMY
SUMMARY
• Fundamentals of Clustering and Cluster Analysis
• Application of Clustering
• Implementation of K means Clustering
• Implementation of Agglomerative Clustering
• Implementation of KNN algorithm on IRIS Dataset
ETHNOTECH ACADEMY
Session 8
Text Analytics
• Text Analytics: Introduction
• Text Analytics: NLTK Installation
• Text Analytics: Tokenization TextBlob
• Named-entity recognition (NER)-Stemming-Lemmatization

• Word Cloud
ETHNOTECH ACADEMY
Text Analytics
ETHNOTECH ACADEMY
Text Analytics Contd…
• Text analysis is the automated process of extracting and
classifying text data using machine learning and natural
language processing.
• Analyzing these texts by hand is time-consuming, tedious,
and ineffective, especially if you deal with large amounts of
data every day.
• There are different text analysis techniques you can run on
your data, such as sentiment analysis, topic
classification, urgency detection, and intent categorization.
ETHNOTECH ACADEMY
• Text communication is one of the most popular forms of
day-to-day conversation. We chat, message, tweet, share
status, email, write blogs, and share opinions and feedback
in our daily routine.
• All these activities generate a large amount of text, which
is unstructured in nature. In the area of online marketplaces
and social media, it is extremely important to analyze large
quantities of data to understand people’s opinions.
• NLP enables the computer to interact with humans in a
natural manner.
ETHNOTECH ACADEMY
• It helps the computer to understand the human language
and derive meaning from it.
• NLP is applicable to several problems, from speech
recognition and language translation to classifying
documents and information extraction.
• Analyzing movie reviews is one of the classic examples used
to demonstrate a simple NLP Bag-of-words model.
ETHNOTECH ACADEMY
NLTK
• Natural Language Toolkit (NLTK) library contains various
utilities that allow you to effectively manipulate and analyze
linguistic data. Among its advanced features are text
classifiers that you can use for many kinds of classification,
including sentiment analysis.
• Sentiment analysis is the practice of using algorithms to
classify various samples of related text into overall positive
and negative categories. With NLTK, you can employ these
algorithms through powerful built-in machine learning
operations to obtain insights from linguistic data.
ETHNOTECH ACADEMY
Installing NLTK Data
• NLTK comes with many corpora, toy grammars, trained
models, etc. A complete list is posted
at: https://www.nltk.org/nltk_data/
Step 1: Browse to the official site of python by clicking this link.
ETHNOTECH ACADEMY
Installing NLTK Data Contd…
Step 2: Move the cursor to the Download button & then click
on the latest python version
ETHNOTECH ACADEMY
Step 3: Open the downloaded file. Click on the checkbox &
Click on Customize installation.
ETHNOTECH ACADEMY
Step 4: Click on Next.
ETHNOTECH ACADEMY
Step 5: Click on Install.
ETHNOTECH ACADEMY
Step 6: Wait until the installation finishes.
ETHNOTECH ACADEMY
Step 7: Click on Close.
ETHNOTECH ACADEMY
Step 8: Open Command Prompt & execute the following
commands:
• python --version
• pip --version
• pip install nltk
Hence, NLTK installation will start.

ETHNOTECH ACADEMY
Step 9: Then you can see the successfully installed message.
• Hence NLTK installation is successful
ETHNOTECH ACADEMY
Tokenization
• Tokenization is the process of breaking up a given text into
units called tokens. Tokens can be individual words, phrases
or even whole sentences.
• In the process of tokenization, some characters like
punctuation marks may be discarded. The tokens usually
become the input for the processes like parsing and text
mining.
• A tokenizer breaks unstructured data and natural language
text into chunks of information that can be considered as
discrete elements.
ETHNOTECH ACADEMY
Tokenization Contd…
• The token occurrences in a document can be used directly
as a vector representing that document.
• Tokenization can separate sentences, words, characters, or
subwords. When we split the text into sentences, we call it
sentence tokenization. For words, we call it word
tokenization.
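To make the two ideas concrete, here is a deliberately naive sketch using only Python's re module. The regular expressions are simplifications chosen for illustration; NLTK's tokenizers apply far more careful rules.

```python
import re

text = "Tokenization splits text into units. Tokens can be words or sentences!"

# Naive sentence tokenization: split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word tokenization: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", sentences[0])

print(sentences)
print(words)  # ['Tokenization', 'splits', 'text', 'into', 'units', '.']
```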
ETHNOTECH ACADEMY
Example of sentence tokenization
• Example of word tokenization
ETHNOTECH ACADEMY
NLTK Word Tokenize
• NLTK (Natural Language Toolkit) is an open-source Python

library for Natural Language Processing. It has easy-to-use
interfaces for over 50 corpora and lexical resources such as
WordNet, along with a set of text processing libraries for
classification, tokenization, stemming, and tagging.
• You can easily tokenize the sentences and words of the text
with the tokenize module of NLTK.
ETHNOTECH ACADEMY
NLTK Word Tokenize Contd…
• First, we’re going to import the relevant functions from the

NLTK library:
ETHNOTECH ACADEMY
Word and Sentence tokenizer
• N.B: The sent_tokenize uses the pre-trained model from
tokenizers/punkt/english.pickle.
ETHNOTECH ACADEMY
Punctuation-based tokenizer
• This tokenizer splits the sentences into words based on
whitespace and punctuation.
• Notice the difference: word_tokenize treats “Amal.M” as a
single word, while wordpunct_tokenize splits it.
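A short sketch of the punctuation-based tokenizer; the sample sentence is made up. wordpunct_tokenize is regular-expression based, so it runs without downloading any NLTK data:

```python
from nltk.tokenize import wordpunct_tokenize

text = "Hello Amal.M, how are you?"
tokens = wordpunct_tokenize(text)

print(tokens)
# ['Hello', 'Amal', '.', 'M', ',', 'how', 'are', 'you', '?']
```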
ETHNOTECH ACADEMY
Treebank Word tokenizer
• This tokenizer incorporates a variety of common rules for
English word tokenization. It separates phrase-terminating
punctuation like (?!.;,) from adjacent tokens and retains
decimal numbers as a single token. Besides, it contains
rules for English contractions.
• For example “don’t” is tokenized as [“do”, “n’t”]. You can find
all the rules for the Treebank Tokenizer at this link.
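The contraction and decimal-number rules can be checked with a made-up sentence. TreebankWordTokenizer is rule based and needs no extra NLTK data:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("I don't think the price is 3.50 dollars, do you?")

# "don't" is split into "do" and "n't"; "3.50" stays a single token.
print(tokens)
```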
ETHNOTECH ACADEMY
Tweet tokenizer
• When we want to apply tokenization to text data like tweets,
the tokenizers mentioned above can’t produce practical
tokens. To address this issue, NLTK has a rule-based tokenizer
specifically for tweets. It can keep emojis as separate tokens if
we need them for tasks like sentiment analysis.
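A brief sketch with NLTK's TweetTokenizer. The tweet text, hashtag and handle are invented for the example:

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
# Made-up tweet with an emoticon, a hashtag and a (hypothetical) handle.
tokens = tokenizer.tokenize("Loving the new course :) #datascience @ethnotech")

# The emoticon, hashtag and handle each stay together as one token.
print(tokens)
```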
ETHNOTECH ACADEMY
MWET tokenizer
• NLTK’s multi-word expression tokenizer (MWETokenizer)
provides a function add_mwe() that allows the user to enter
multiple word expressions before using the tokenizer on the
text. More simply, it can merge multi-word expressions into
single tokens.
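For instance, registering one expression with add_mwe() and tokenizing an already-split sentence (the sentence is made up):

```python
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()
tokenizer.add_mwe(("machine", "learning"))  # register a multi-word expression

# MWETokenizer works on a list of tokens, so split the text first.
tokens = tokenizer.tokenize("I study machine learning with Python".split())

print(tokens)  # ['I', 'study', 'machine_learning', 'with', 'Python']
```

The merged token uses an underscore as the default separator.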
ETHNOTECH ACADEMY
TextBlob Word Tokenize
• TextBlob is a Python library for processing textual data. It
provides a consistent API for diving into common natural
language processing (NLP) tasks such as part-of-speech
tagging, noun phrase extraction, sentiment analysis,
classification, translation, and more.
• Let’s start by installing TextBlob and the NLTK corpora:
$ pip install -U textblob
$ python3 -m textblob.download_corpora
ETHNOTECH ACADEMY
TextBlob Word Tokenize Contd…
• In the code below, we perform word tokenization using
TextBlob library:
• We could notice that the TextBlob tokenizer removes the

punctuations. In addition, it has rules for English
contractions.
ETHNOTECH ACADEMY
Named entity Recognition
• Named Entity Recognition is the task of recognising proper
names and words from a special class in a document, such
as product names, locations, people, or diseases.
• Named entity recognition (NER) is probably the first step
towards information extraction. It seeks to locate and
classify named entities in text into pre-defined categories
such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values,
percentages, etc.
ETHNOTECH ACADEMY
Named entity Recognition Contd…
• NER is used in many fields in Natural Language
Processing (NLP), and it can help answer many real-world
questions, such as:
• Which companies were mentioned in the news article?
• Were specified products mentioned in complaints or
reviews?
• Does the tweet contain the name of a person? Does the
tweet contain this person’s location?
ETHNOTECH ACADEMY
Stemming and Lemmatization
• While working with language data we need to acknowledge
the fact that words like ‘care’ and ‘caring’ share the same
core meaning but appear in different forms.
• Here we make use of Stemming and Lemmatization to
reduce the word to its base form.
• Stemming is used to preprocess text data. The English
language has many variations of a single word, so to reduce
the ambiguity for a machine-learning algorithm, it’s
essential to filter such words and reduce them to the base
form.
ETHNOTECH ACADEMY
Stemming and Lemmatization
Contd…
• NLTK provides classes to perform stemming on words.
• The most widely used stemming algorithms
are PorterStemmer, SnowballStemmer etc.
• The PorterStemmer class has a .stem method which takes a
word as an input argument and returns the word reduced
to its root form.
ETHNOTECH ACADEMY
Creating a Stemmer with
PorterStemmer
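A minimal sketch of the .stem method; the word list is an illustrative choice:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(word) for word in words]

print(stems)  # every form is reduced to 'connect'
```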
ETHNOTECH ACADEMY
Creating a Stemmer with
PorterStemmer Contd…
ETHNOTECH ACADEMY
Creating a Stemmer with Snowball
Stemmer
• It is also known as the Porter2 stemming algorithm as it tends to fix a few shortcomings in Porter
Stemmer. Let’s see how to use it.
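A sketch that runs both stemmers side by side; the three words are arbitrary picks for comparison:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # the language name is required

# Print each word with its Porter stem and its Snowball stem.
for word in ["running", "generously", "fairly"]:
    print(word, porter.stem(word), snowball.stem(word))
```

Comparing the two columns shows where the Snowball rules differ from the original Porter rules.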
ETHNOTECH ACADEMY
Stemmer Contd…
• The outputs from both stemmers look similar because
we have used a limited text corpus for the demonstration.
Feel free to experiment with different words and compare
the outputs of the two.
ETHNOTECH ACADEMY
Stemmer Contd…
• Lemmatization is the algorithmic process of finding the
lemma of a word. Unlike stemming, which may result in
incorrect word reduction, lemmatization reduces a word
depending on its meaning.
• At first, Stemming and Lemmatization may look the same,
but they are actually very different. In the next section we
will see the difference between them.
• Now let’s see how to perform Lemmatization on text data.
ETHNOTECH ACADEMY
Creating a Lemmatizer with Python
Spacy
Note: python -m spacy download en_core_web_sm
The above line must be run in order to download the required
file to perform lemmatization.
ETHNOTECH ACADEMY
Output
• The above code returns a spaCy Doc object, which can be
iterated over to get the lemmatized form of the input words.
We can access the lemmatized word using the .lemma_
attribute. It automatically tokenizes the sentence for us.
ETHNOTECH ACADEMY
NLTK
• It helps in returning the base or dictionary form of a word,
which is known as the lemma.
• NLTK uses WordNet. The NLTK Lemmatization method is
based on WordNet’s built-in morph function.
ETHNOTECH ACADEMY
NLTK Contd…
ETHNOTECH ACADEMY
Output
• ['Apples', 'and', 'oranges', 'are', 'similar', '.', 'Boots', 'and',
'hippos', 'are', "n't", '.']
Apples and orange are similar . Boots and hippo are n't .
ETHNOTECH ACADEMY
WordCloud
• Word Cloud is a data visualization technique used for
representing text data in which the size of each word indicates
its frequency or importance.
• Significant textual data points can be highlighted using a word
cloud.
• Word clouds are widely used for analyzing data from social
network websites
• For generating word cloud in Python, modules needed are –
matplotlib, pandas and wordcloud
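The quantity a word cloud visualizes is the word frequency. The sketch below only computes those frequencies with the standard library (it does not use the wordcloud module), and the sample text and font-size formula are made up:

```python
from collections import Counter
import re

text = "data science uses data to find patterns in data and science"

# Count how often each word occurs.
frequencies = Counter(re.findall(r"[a-z]+", text.lower()))

# A word cloud would draw more frequent words larger,
# for example with a font size that grows with the count.
font_sizes = {word: 10 + 10 * count for word, count in frequencies.items()}

print(frequencies.most_common(2))  # [('data', 3), ('science', 2)]
print(font_sizes["data"])          # 40
```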
ETHNOTECH ACADEMY
Preparatory exam link
ETHNOTECH ACADEMY
Program Feedback Link
ETHNOTECH ACADEMY
Q&A
• Which one is a classification algorithm?

a. Linear regression
b. Logistic regression
c. Agglomerative clustering
d. None of the above
• Logistic regression
ETHNOTECH ACADEMY
Q&A
• Which of the following is an assembly of objects based on similarity

and dissimilarity between them?
a. Regression
b. Association
c. Clustering
d. Classification
• Clustering
ETHNOTECH ACADEMY
Q&A
• Which pair of the algorithms are similar in operation?

a. SVM and KNN
b. SVM and Naïve bayes
c. Both A & B
d. Decision tree and Random forest
• Decision tree and Random forest
ETHNOTECH ACADEMY
Q&A
• Which of the following is not an application of clustering?

a. Data Analysis
b. Pattern recognition
c. Image processing
d. Data classification
• Data classification
ETHNOTECH ACADEMY
Q&A
• Which of the following statements is incorrect about regression?

a. It may be used for interpretation
b. It is used for prediction
c. It relates inputs to outputs
d. It discovers causal relationships
• It discovers causal relationships
ETHNOTECH ACADEMY
Q&A
• Designing a machine learning approach involves _____

a. Choosing the type of training experience
b. Choosing a representation for the target function
c. Choosing the target function to be learned
d. All of the above
• All of the above
ETHNOTECH ACADEMY
Q&A
• Fraud Detection, Image Classification, Diagnostic, and Customer

Retention are applications in which of the following?
a. Regression
b. Clustering
c. Classification
d. All of these
• Classification
ETHNOTECH ACADEMY
Q&A
• Which of the following is required by K-means clustering?

a. defined distance metric
b. number of clusters
c. initial guess as to cluster centroids
d. All of these
• All of these
ETHNOTECH ACADEMY
Q&A
• Which of the following clustering requires merging approach?

a. Partitional
b. Hierarchical
c. Naïve bayes
d. K-means
• Hierarchical
ETHNOTECH ACADEMY
Q&A
• ______is the task of dividing the population or data points into a

number of groups.
a. Classification
b. Association
c. Clustering
d. Regression
• Clustering
ETHNOTECH ACADEMY
Q&A
• Agglomerative clustering has _________ approach

a. Top down
b. Bottom up
c. Vertical
d. Linear
• Bottom up
ETHNOTECH ACADEMY
Q&A
• State whether the following statement is true or false.

• “Divisive clustering follows top-down approach”
A. True
B. False
• True
ETHNOTECH ACADEMY
Q&A
• ______algorithm basically creates an imaginary boundary to classify

the data
a. K-NN
b. K-means
c. SVM
d. Decision trees
• K-NN
ETHNOTECH ACADEMY
Q&A
• Which of the following is not an application of Text analytics?

a. Sentiment analysis
b. Topic classification
c. Intent categorization
d. Image segmentation
• Image segmentation
ETHNOTECH ACADEMY
Q&A
• State whether the following statement is true or false: “In the
process of tokenization, some characters like punctuation marks
may be discarded”
a. True
b. False
• True
ETHNOTECH ACADEMY
Q&A
• ________ is a data visualization technique used for representing

text data in which the size of each word indicates its frequency or
importance.
a. Word Cloud
b. Stemming
c. Lemmatization
d. Tokenization
• Word Cloud
ETHNOTECH ACADEMY
Q&A
• _____ is the automated process of extracting and classifying text

data using machine learning and natural language processing
a. Text analysis
b. Node analysis
c. Data analysis
d. Stream analysis
• Text analysis
ETHNOTECH ACADEMY
Q&A
• State whether the following statement is true or false: “Word
clouds are widely used for analyzing data from social network
websites”
a. True
b. False
• True
ETHNOTECH ACADEMY
SUMMARY
• Fundamentals of Text Analytics
• Usage of Tokenization, Stemming and Lemmatization
• Significance of Wordcloud and its applications in real world
ETHNOTECH ACADEMY

