Ethnotech - Data Science With Python

Uploaded by

prashanthkapu491
Copyright
© © All Rights Reserved

ETHNOTECH ACADEMY

CERTIFICATE FORMAT

ETHNOTECH ACADEMY
BENEFITS OF THE PROGRAM

ETHNOTECH ACADEMY 3
COURSE OUTLINE

L1 Introduction
L2 Foundation-Panda
L3 Foundation-Numpy
L4 Foundation-Descriptive Analysis
L5 Regression
L6 Classification
L7 Clustering
L8 Text Analytics

ETHNOTECH ACADEMY
EXIT PROFILE

• Financial Analyst
• Data Scientist
• Data Engineer
• Database Admin
• Data Journalist
• Business Analyst
• Big Data Analyst

ETHNOTECH ACADEMY
SESSION 1

Introduction

• Introduction to data science

• Introduction to Python

• Introduction to AI-ML

ETHNOTECH ACADEMY
Introduction of data science

What is Data Science?
• Data Science is a combination of multiple disciplines that
uses statistics, data analysis, and machine learning to
analyze data and to extract knowledge and insights from it.
• Data Science is about data gathering, analysis and decision-making.
• Data Science is about finding patterns in data through
analysis and making future predictions.

ETHNOTECH ACADEMY
Introduction Contd…

Applications of Data Science:
• To find the best suited time to deliver goods.
• To analyze health benefit of training.
• To predict who will win elections.

ETHNOTECH ACADEMY
Introduction Contd…

What is Data?
• Data is a collection of information.
• One purpose of Data Science is to structure data, making it
interpretable and easy to work with.

Data can be categorized into two groups:


• Structured data
• Unstructured data

ETHNOTECH ACADEMY
Introduction Contd…
Structured Data
• Structured data is organized and easier to work with.
• We can use an array or a database table to
structure or present data.

Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

ETHNOTECH ACADEMY
Introduction Contd…

Unstructured Data
• Unstructured data is not organized. We must organize the
data for analysis purposes.

ETHNOTECH ACADEMY
Introduction Contd…

Database Table
• A database table is a table with structured data.
• The following table shows a database table with health data
extracted from a sports watch:

ETHNOTECH ACADEMY
Introduction Contd…
Database Table Structure
        Column 1   Column 2        Column 3    Column 4          Column 5     Column 6
        Duration   Average_Pulse   Max_Pulse   Calorie_Burnage   Hours_Work   Hours_Sleep
Row 1   30         80              120         240               10           7
Row 2   30         85              120         250               10           7
Row 3   45         90              130         260               8            7
Row 4   45         95              130         270               8            7
Row 5   45         100             140         280               0            7
Row 6   60         105             140         290               7            8
Row 7   60         110             145         300               7            8
Row 8   60         115             145         310               8            8

ETHNOTECH ACADEMY
Introduction Contd…
Variables
• A variable is defined as something that can be measured or
counted.
• Examples can be characters, numbers or time.
• There are 4 columns, meaning that there are 4 variables
(Duration, Average_Pulse, Max_Pulse, Calorie_Burnage).

Duration   Average_Pulse   Max_Pulse   Calorie_Burnage
30         80              120         240
30         85              120         250
45         90              130         260
45         95              130         270
45         100             140         280
60         105             140         290
60         110             145
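The idea that each column is one variable can be sketched in pandas. This is a minimal illustration, assuming pandas is installed; the name health is hypothetical, and only the six complete rows of the table above are used.

```python
import pandas as pd

# The six complete rows of the table above; each column is one variable.
health = pd.DataFrame({
    "Duration":        [30, 30, 45, 45, 45, 60],
    "Average_Pulse":   [80, 85, 90, 95, 100, 105],
    "Max_Pulse":       [120, 120, 130, 130, 140, 140],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290],
})

variables = list(health.columns)    # one variable per column
n_rows, n_columns = health.shape    # (number of rows, number of columns)

print(variables)
print(n_rows, n_columns)
```

The shape attribute reports rows first and columns second, so the second value is the number of variables.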

ETHNOTECH ACADEMY
Introduction to AI-ML
What is Artificial Intelligence?
• AI is one of the most fascinating and wide-ranging fields of computer
science, with great scope for the future. AI aims to make a machine
work like a human.

• Artificial Intelligence is composed of two words, Artificial and
Intelligence, where Artificial means "man-made" and Intelligence
means "thinking power"; hence AI means "a man-made thinking power."

ETHNOTECH ACADEMY
Introduction to AI-ML Contd…

• "It is a branch of computer science by which we can create
intelligent machines which can behave like a human, think
like humans, and are able to make decisions."

ETHNOTECH ACADEMY
Why Artificial Intelligence?
1. With the help of AI, you can create software or devices which can
solve real-world problems very easily and with accuracy, such as
health issues, marketing, traffic issues, etc.

2. With the help of AI, you can create your personal virtual assistant, such
as Cortana, Google Assistant, Siri, etc.

3. With the help of AI, you can build robots which can work in an
environment where the survival of humans can be at risk.

4. AI opens a path for other new technologies, new devices,
and new opportunities.
ETHNOTECH ACADEMY
Goals of Artificial Intelligence

ETHNOTECH ACADEMY
Pros And Cons of Artificial Intelligence

ETHNOTECH ACADEMY
What is Machine Learning

• In the real world, we are surrounded by humans who can
learn everything from their experiences with their learning
capability, and we have computers or machines which work
on our instructions. But can a machine also learn from
experiences or past data like a human does?

ETHNOTECH ACADEMY
Machine Learning Contd…

• Machine learning is a subset of artificial
intelligence that is mainly concerned with the development
of algorithms which allow a computer to learn from
data and past experiences on its own. The term machine
learning was first introduced by Arthur Samuel in 1959.

• Machine learning enables a machine to automatically learn
from data, improve performance from experiences, and
predict things without being explicitly programmed.
ETHNOTECH ACADEMY
Features of Machine Learning:

1. Machine learning uses data to detect various patterns in a
given dataset.

2. It can learn from past data and improve automatically.

3. It is a data-driven technology.

4. Machine learning is similar to data mining, as it also
deals with huge amounts of data.

ETHNOTECH ACADEMY
Classification of Machine Learning

ETHNOTECH ACADEMY
Supervised Learning

• Supervised learning is a type of machine learning method
in which we provide sample labelled data to the machine
learning system in order to train it, and on that basis, it
predicts the output.

Supervised learning can be grouped further into two categories
of algorithms:

1. Classification
2. Regression
ETHNOTECH ACADEMY
Unsupervised Learning

• The machine is trained with a set of data
that has not been labeled, classified, or categorized, and the
algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input
data into new features or a group of objects with similar
patterns.

It can be further classified into two categories of algorithms:

1. Clustering
2. Association
ETHNOTECH ACADEMY
Reinforcement Learning

• Reinforcement learning is a feedback-based learning
method, in which a learning agent gets a reward for each
right action and gets a penalty for each wrong action.

• The agent learns automatically from this feedback and
improves its performance. In reinforcement learning, the
agent interacts with the environment and explores it. The
goal of an agent is to get the most reward points, and hence,
it improves its performance.

ETHNOTECH ACADEMY
Difference Between Supervised and
Unsupervised Learning

ETHNOTECH ACADEMY
Difference b/w Supervised, Unsupervised
& Semi Supervised Learning

ETHNOTECH ACADEMY
Introduction to Python

Why Programming?

• There are more than 700 languages available in today’s
programming world.

• Every language is designed to fulfil a particular
requirement.

• To communicate with digital machines and make them
work accordingly
ETHNOTECH ACADEMY
What is Python ?

• Python is a widely used, powerful, general-purpose,
high-level programming language.

• More than 137,000 libraries are available for Python. Libraries
are sets of useful functions that eliminate the need for
writing code from scratch.

ETHNOTECH ACADEMY
Introduction Contd…

Applications

ETHNOTECH ACADEMY
Web application in Python

• Python can be used to create web applications

• There are Python frameworks
like Django, Flask and Pyramid
for this purpose

ETHNOTECH ACADEMY
Data Analysis in Python
• Python is the language of choice for many data
scientists.
• It has grown in popularity due to excellent libraries like:
NumPy
Pandas
Matplotlib
• A data scientist is a professional responsible for collecting,
analyzing and interpreting extremely large amounts of data.

ETHNOTECH ACADEMY
Machine Learning in Python

• Machine learning is about making predictions with data.

• It is mainly used in:
Face recognition
Music recommendation
Medical data, etc.

ETHNOTECH ACADEMY
Raspberry Pi in Python

• We can build home automation systems and even robots using
Raspberry Pi

• The coding on a Raspberry Pi can be performed using Python.

ETHNOTECH ACADEMY
Game Development in Python
• We can write whole games in Python using PyGame
• Popular games developed in Python are:
Bridge Commander
Civilization IV
Battlefield 2
Eve Online
Freedom Force
ETHNOTECH ACADEMY
Introduction Contd…

ETHNOTECH ACADEMY
Who created Python?

• Developed by Guido van Rossum, a Dutch programmer,
and first released on February 20, 1991
• The name Python is inspired by Guido’s favourite
comedy TV show “Monty Python’s Flying Circus”

ETHNOTECH ACADEMY
Features of Python

• Easy-to-learn: Python has few keywords, a simple structure, and a
clearly defined syntax. Python code is often 3 to 5 times
shorter than the equivalent C/C++/Java code.
In C:
#include <stdio.h>
int main()
{
    printf("Hello World!");
    return 0;
}
In Python:
print("Hello World!")

ETHNOTECH ACADEMY
Features of Python
Python uses both a compiler and an interpreter to convert and run
our source code: the source is first compiled to bytecode, which the
interpreter then executes.

• Interpret: To execute a program in a high-level language by
translating it one line at a time.

• Compile: To translate a program written in a high-level language
all at once, in preparation for later execution.

• Portability: Python can run on a wide variety of hardware
platforms and has the same interface on all platforms.
ETHNOTECH ACADEMY
Python Data Types

• Numbers
• String
• Bool
• List
• Tuple
• Dictionary
• Set

ETHNOTECH ACADEMY
Bool
• The Boolean data type stores one of two values: True or False.
• Every comparison operator produces True or False.
• The three Boolean operators (and, or, and not) are used to combine
Boolean values.
• Like comparison operators, they evaluate these expressions down to a
Boolean value.
• After any math and comparison operators evaluate, Python evaluates the
not operators first, then the and operators, and then the or operators.
>>> 42 == 42
True
>>> 2 != 3
True
>>> 2 != 2
False
>>> 42 == 99
False
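The evaluation order described above (math and comparisons first, then not, then and, then or) can be checked with a short sketch. The variable names are illustrative only.

```python
# Comparison operators always produce a bool.
is_equal = (42 == 42)
is_different = (2 != 2)

# not binds tightest, then and, then or:
# not False or True and False -> (not False) or (True and False) -> True or False
result = not False or True and False

# Math and comparisons are evaluated before the Boolean operators:
# 2 + 2 == 4 and not 2 + 2 == 5 -> True and (not False)
combined = 2 + 2 == 4 and not 2 + 2 == 5

print(is_equal, is_different, result, combined)   # True False True True
```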
ETHNOTECH ACADEMY
Numbers
Int (signed integers): Positive or negative whole numbers with no
decimal point.
Long (long integers): In Python 2, integers of unlimited size, written like
integers and followed by an uppercase or lowercase L. Python 3 has no
separate long type; int itself has unlimited size.
Float (floating point real values): They represent real numbers and are written
with a decimal point. Floats may also be in scientific notation, with E or e
indicating the power of 10.
Example: 2.5e2 = 2.5 x 10^2 = 2.5 x 100 = 250
Complex (complex numbers): Written in the form a+bj, where a is the real
part and b is the imaginary part.
The letter j should appear only as a suffix, not as a prefix.
Example: 3+5j
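A minimal sketch of the numeric types in Python 3, where large whole numbers are ordinary int values. The variable names are illustrative.

```python
whole = -17        # int
big = 2 ** 100     # still an int in Python 3, no L suffix needed
real = 2.5e2       # float in scientific notation: 2.5 x 10^2
z = 3 + 5j         # complex: real part 3, imaginary part 5

print(type(whole).__name__, type(big).__name__)   # int int
print(real)                                        # 250.0
print(z.real, z.imag)                              # 3.0 5.0
```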
ETHNOTECH ACADEMY
String
• Strings are a collection of characters. A string can group
any type of known characters, i.e. letters, numbers and
special characters. They are enclosed in single quotes,
double quotes or triple quotes, and can also be written as raw strings.
Example: 'Hi' , "hello" , '1234'
Example:
S1 = 'Mango'
S2 = "Hello"
S3 = "Hey, 'Good' Morning"
S4 = 'Hey, "Good" Morning'
S5 = "Hey, \"Good\" Morning"
print(S3)
print(S4)
print(S5)
ETHNOTECH ACADEMY
List
• A list is a container that holds many objects under a single
name.
• A list can be written as comma-separated values (items)
between square brackets.
• List items are indexed in the same way as the characters of a string.
• Lists can be nested just like arrays, i.e., you can have a list of
lists.
• Lists are mutable.

Syntax:
List_name = [item1 , item2 , item3]
List_name = []
List_name[index]
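The list properties above (indexing, nesting, mutability) can be sketched as follows. The names fruits and matrix are illustrative.

```python
fruits = ["mango", "apple", "kiwi"]   # comma-separated items in square brackets
empty = []                            # an empty list

first = fruits[0]     # indexing starts at 0, as with strings
last = fruits[-1]     # negative indexes count from the end

fruits[1] = "banana"      # lists are mutable: an item can be replaced
fruits.append("grape")    # and the list can grow

matrix = [[1, 2], [3, 4]]   # a nested list (list of lists)
cell = matrix[1][0]         # second inner list, first item

print(first, last, fruits, cell)
```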

ETHNOTECH ACADEMY
Tuple

• A tuple is another sequence data type that is similar to the
list.
• It is a collection of elements separated by commas and
enclosed in parentheses ()
• It is immutable.
• Duplicate values can be present.
ex: T1 = (1, 2, 3, 1, 2, 3)
• It can hold heterogeneous data types.
ex: T2 = (1, 3.0, 'm', "hello")
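A short sketch of what immutability means in practice: reading from a tuple works like reading from a list, but assigning to an item raises TypeError. The flag name changed is illustrative.

```python
T1 = (1, 2, 3, 1, 2, 3)        # duplicates are allowed
T2 = (1, 3.0, 'm', "hello")    # mixed data types

count_of_ones = T1.count(1)    # how many times 1 occurs
third = T2[2]                  # indexing works as for lists

try:
    T1[0] = 99                 # tuples are immutable...
    changed = True
except TypeError:
    changed = False            # ...so item assignment raises TypeError

print(count_of_ones, third, changed)   # 2 m False
```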

ETHNOTECH ACADEMY
Dictionary
• Dictionaries are enclosed by curly braces ( { } ), and values
can be assigned and accessed using square brackets ( [] ).

• Dictionaries are changeable and can be indexed by key. Since
Python 3.7 they also preserve insertion order.

• A dictionary is a collection of key-value pairs.
Dictionary_name = {key1 : val1 , key2 : val2}

• Keys can be used as indexes and must be unique, but the values
stored under the keys can be duplicated.

ETHNOTECH ACADEMY
Dictionary Contd…

Example:

>>> camera = {'sony':200, 'nikon': 200}


>>> camera.update({'canon':500})
>>> print(camera)

Output: {'sony': 200, 'nikon': 200, 'canon': 500}

ETHNOTECH ACADEMY
Sets

• A set is an unordered collection with no duplicate elements.

• Basic uses include eliminating duplicate entries.

• Set object does not support indexing.

• Set objects also support mathematical operations like union,
intersection, difference, and symmetric difference.

• Curly braces or the built-in set() function can be used to create sets.

ETHNOTECH ACADEMY
Sets Contd…
• A set is mutable, but may not contain mutable items like a
list, set, or even a dictionary.
• A set may contain values of different types.

Examples:
x = {12,3,4,45}
y = {2,4,6,78}
x.union(y)
{2, 3, 4, 6, 12, 45, 78}

ETHNOTECH ACADEMY
Sets Contd…

x.intersection(y)
{4}
x.difference(y) # elements present in x but not in y
{3, 12, 45}
y.difference(x) # elements present in y but not in x
{2, 6, 78}
x.symmetric_difference(y) # elements present in exactly one of x and y
{2, 3, 6, 12, 45, 78}

ETHNOTECH ACADEMY
Q&A

1. __________ is a combination of multiple disciplines that uses


statistics, data analysis, and machine learning to extract knowledge
and insights from it.
a. Data Science
b. Machine Learning
c. Data Analytics
d. Data Mining

• Data Science

ETHNOTECH ACADEMY
Q&A

2. Which of the following is not a category of Data?


a. Structured
b. Semi – Structured
c. Unstructured
d. Meta – Structured
• Meta – Structured

ETHNOTECH ACADEMY
Q&A

• A professional who is responsible for collecting,analyzing and


interpreting extremely large amounts of data.
a. Data Scientist
b. Data Analyst
c. AI Developer
d. System Architect
• Data Scientist

ETHNOTECH ACADEMY
Q&A

• Python was invented by Guido van Rossum and it was released in


the year _______
a. 1990
b. 1991
c. 1995
d. 2000
• 1991

ETHNOTECH ACADEMY
Q&A

• Which of the following is not a Boolean operator?


a. And
b. X-OR
c. OR
d. NOT
• X-OR

ETHNOTECH ACADEMY
Q&A

• _________ is a container that holds many objects under a single


name.
a. List
b. Strings
c. Dictionary
d. Sets
• List

ETHNOTECH ACADEMY
Q&A

• State whether the following statement is true or false:
"Tuple is mutable in nature"
a. True
b. False

• False

ETHNOTECH ACADEMY
Q&A

• In dictionaries _______ can be used as indexes and are unique


a. Value
b. Elements
c. Key
d. Items
• Key

ETHNOTECH ACADEMY
Q&A

• Which of the following is not an example of sequence data type?


a. Float
b. Strings
c. List
d. Tuple
• Float

ETHNOTECH ACADEMY
• The process of executing a program in a high-level language by
translating it one line at a time is called _______
a. Interpretation
b. Compilation
c. Recursion
d. Member function
• Interpretation

ETHNOTECH ACADEMY
Question 1

1. __________ is a feedback-based learning method, in which a


learning agent gets a reward for each right action and gets
a penalty for each wrong action
a) Supervised
b) Unsupervised
c) Reinforcement
d) NLP
Answer: Reinforcement

ETHNOTECH ACADEMY
Question 2

2. State the following statement is true or false


‘Machine Learning is a data-driven technology’
Answer: True

ETHNOTECH ACADEMY
Question 3

3. ______ datatype is used to store 2 values only


A. Number
B. Tuple
C. Dictionary
D. Boolean
Answer: Boolean

ETHNOTECH ACADEMY
SUMMARY
• Fundamentals of Data science and AI-ML
• Basics of Python Programming
• Usage of Various datatypes in python

ETHNOTECH ACADEMY
Session 2
Foundation-Pandas
• Header
• Pandas read csv
• datatype and statistics
• Pandas column operations
• Pandas operations
• Merge and concat
• Graphs
ETHNOTECH ACADEMY
Foundation Panda
Introduction to Pandas
• Pandas is an open source Python library for highly specialized
data analysis.
• It is a standard tool for professionals who use the Python
language for statistical analysis and decision making.
• The library was designed and developed primarily by Wes
McKinney starting in 2008. In 2012, Chang She, one of his
colleagues, joined the development.
• Together they set up one of the most used libraries in the Python
community
ETHNOTECH ACADEMY
Foundation Panda Contd…
• Pandas arises from the need to have a specific library to
analyze data that provides, in the simplest possible way, all
the instruments for data processing, data extraction, and
data manipulation

ETHNOTECH ACADEMY
Feature Of Pandas

ETHNOTECH ACADEMY
Header

• A header stores the names or headings for each
of the columns.
• It helps the user to identify the role of the
respective column in the data frame.
• The top row containing column names is called the header
row of the data frame.
• There are two approaches to adding a header row in
Python in case the original data frame doesn’t have a header.

ETHNOTECH ACADEMY
Creating a data frame from CSV file
and creating row header
• While reading the data and storing it in a data frame, or
creating a fresh data frame, column names can be
specified by using the names parameter of the read_csv()
function in pandas.

• The names parameter takes an array of names, one for each of the
columns of the data frame, in order. The length of the array
must equal the number of columns in the data frame.
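A minimal sketch of naming columns while reading, assuming pandas is installed. The course dataset is not reproduced here, so a small in-memory CSV with made-up values stands in for the file; the column names are borrowed from the earlier health table.

```python
import io
import pandas as pd

# Stand-in for a CSV file without a header row (made-up values).
raw = io.StringIO("30,80,120\n45,90,130\n60,105,140\n")

# names supplies one label per column, in order.
df = pd.read_csv(raw, names=["Duration", "Average_Pulse", "Max_Pulse"])

print(df.columns.tolist())
print(df.shape)
```

With a real file, the first argument would be the file path instead of the StringIO object.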
ETHNOTECH ACADEMY
Dataset

• https://shorturl.at/vyEJL

ETHNOTECH ACADEMY
Code Snippet

ETHNOTECH ACADEMY
Output

Note: We can also pass header=None as an argument
of the read_csv() function and later on give names to the
columns explicitly when desired

ETHNOTECH ACADEMY
Creating a data frame and creating
row header in Python itself
• We can create a data frame with a specific number of rows and
columns by first creating a multi-dimensional array and then
converting it into a data frame with
the pandas.DataFrame() constructor.

• The columns argument is used to specify the row header, i.e.
the column names. It takes an array of column names
with its length equal to the number of columns in the data
frame.
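The approach above can be sketched as follows, assuming NumPy and pandas are installed. The array values and the column names col_1 and col_2 are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 3 x 2 array of made-up values.
values = np.array([[1, 2], [3, 4], [5, 6]])

# columns needs one name per column of the array.
df = pd.DataFrame(values, columns=["col_1", "col_2"])

print(df)
```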

ETHNOTECH ACADEMY
Code Snippet

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Panda column operations
1. Sort by a Column with Pandas
• A Pandas dataframe df can be sorted by one or more columns with
the sort_values method, in either ascending or descending order.

• The first argument is the column of your dataframe
you would like to sort by. By default, the
parameter ascending is set to True, so you only
need to specify it if you would like sorting to be in
descending order.

ETHNOTECH ACADEMY
Panda column operations Contd…

• If you need to sort by multiple columns, change the
parameter values to a list of columns and a list of
corresponding sort orders (ascending/descending).

• For example, the dataframe df can be ordered by col_1 in
ascending order and col_2 in descending order.
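A minimal sketch of both cases using sort_values, assuming pandas is installed. The data is made up; col_1 and col_2 follow the column names used in the text.

```python
import pandas as pd

df = pd.DataFrame({"col_1": [2, 1, 2, 1], "col_2": [10, 20, 30, 40]})

# Single column, descending order.
by_one = df.sort_values("col_1", ascending=False)

# col_1 ascending, then col_2 descending within each col_1 value.
by_two = df.sort_values(["col_1", "col_2"], ascending=[True, False])

print(by_two)
```

sort_values returns a new dataframe; df itself keeps its original row order.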

ETHNOTECH ACADEMY
Panda column operations Contd…
2.Rename Columns in Pandas

Syntax
• df.rename(columns={'original_col_name':'new_col_name'})

• To rename multiple columns, add these updates to
the columns parameter dictionary.

• df.rename(columns={'original_col1_name':
'new_col1_name', 'original_col2_name': 'new_col2_name'})
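rename is a DataFrame method, so it is called on the dataframe itself. A short sketch, assuming pandas is installed and using the placeholder column names from the syntax above:

```python
import pandas as pd

df = pd.DataFrame({"original_col1_name": [1, 2], "original_col2_name": [3, 4]})

# rename returns a new DataFrame; df itself is left unchanged.
renamed = df.rename(columns={
    "original_col1_name": "new_col1_name",
    "original_col2_name": "new_col2_name",
})

print(renamed.columns.tolist())
```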
ETHNOTECH ACADEMY
Panda column operations Contd…

3.Delete a Column of a Pandas Dataframe

• We can delete one or more columns of a Pandas dataframe
at any one time. First, let’s start with one.

• df.drop('column_name', axis=1)
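Dropping several columns at once works the same way, with a list of names in place of a single name. A minimal sketch with made-up columns a, b and c, assuming pandas is installed:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

one_dropped = df.drop("a", axis=1)          # drop a single column
two_dropped = df.drop(["a", "b"], axis=1)   # drop several columns at once

print(one_dropped.columns.tolist(), two_dropped.columns.tolist())
```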

ETHNOTECH ACADEMY
Panda column operations Contd…

4.Group By & Aggregate Columns with Pandas

• Grouping data with Pandas is one way to summarize your
data. This can be used as a basis for plotting charts or just to
provide insights. Here is how to use the
Pandas groupby method on one column, col_1, and count the
number of rows per group.

• df.groupby('col_1').count()

ETHNOTECH ACADEMY
Panda column operations Contd…

• The groupby method can be used on any number of
columns, and the groups can be aggregated in different ways. The
code below groups by two columns and aggregates the
remaining columns (assuming they are of numeric data types) by
both summing and calculating the mean.

• df.groupby(['col_1', 'col_2']).agg(['sum', 'mean'])
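Both groupby calls can be tried on a small made-up dataframe, assuming pandas is installed. The numeric column value is hypothetical and added only so there is something to aggregate.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": ["x", "x", "y", "y"],
    "col_2": ["p", "p", "p", "q"],
    "value": [1, 2, 3, 4],
})

# Number of rows per col_1 group, reported for each remaining column.
counts = df.groupby("col_1").count()

# Sum and mean of the remaining numeric column for each (col_1, col_2) pair.
summary = df.groupby(["col_1", "col_2"]).agg(["sum", "mean"])

print(counts)
print(summary)
```

The result of agg with a list has two-level column labels, such as ("value", "sum") and ("value", "mean").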

ETHNOTECH ACADEMY
Panda column operations Contd…

5.Apply a Function to a Column using Pandas

• One way of applying a function to all rows in a Pandas


dataframe column is using the apply method.

• df['col'].apply(function)

ETHNOTECH ACADEMY
Panda column operations Contd…

• Above, a specific column, col, has been selected and a
function (generic in this case) has been applied to it. The function
can be a built-in one, for example the NumPy square root
function np.sqrt, or a user-defined function, written
as a lambda function or otherwise.

• df['col'].apply(lambda x: x**2 + 5)
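A small sketch of both kinds of function, assuming NumPy and pandas are installed. The values in col are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1, 4, 9]})

roots = df["col"].apply(np.sqrt)                 # built-in NumPy function
shifted = df["col"].apply(lambda x: x**2 + 5)    # user-defined lambda

print(roots.tolist())     # [1.0, 2.0, 3.0]
print(shifted.tolist())   # [6, 21, 86]
```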

ETHNOTECH ACADEMY
Panda operations Contd…
Types Of Operation
• Creating a data frame with pandas

• Read the top element chart:

• Read the Bottom element chart:

• Understanding the statistical information of the data:

• Writing a CSV file:
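The operations listed above can be sketched in one place, assuming pandas is installed. The dataframe reuses two columns of the earlier health table; the CSV is written to an in-memory buffer here, whereas a file path would normally be passed to to_csv.

```python
import io
import pandas as pd

# Creating a data frame.
df = pd.DataFrame({
    "Duration":      [30, 30, 45, 45, 45, 60, 60, 60],
    "Average_Pulse": [80, 85, 90, 95, 100, 105, 110, 115],
})

top = df.head(3)        # first rows
bottom = df.tail(3)     # last rows
stats = df.describe()   # count, mean, std, min, quartiles, max per column

# Writing a CSV file (here into a text buffer instead of a file on disk).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(top)
print(bottom)
print(stats)
```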

ETHNOTECH ACADEMY
Panda operations Contd…
1. Creating a data frame with pandas:

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Panda operations Contd…

2. Reading a CSV File:

ETHNOTECH ACADEMY
Panda operations Contd…

3. Read the top element chart:

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Panda operations Contd…

4. Read the Bottom element chart:

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Panda operations Contd…

5. Understanding the statistical information


of the data:

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Merge and Concat

Dataframe
A dataframe is a two-dimensional data structure having
multiple rows and columns. In a dataframe, the data is
aligned in the form of rows and columns only. A dataframe
can perform arithmetic as well as conditional operations. It
has mutable size.

ETHNOTECH ACADEMY
Merge and Concat Contd…

Example:

ETHNOTECH ACADEMY
Merge and Concat Contd…

Output:

ETHNOTECH ACADEMY
Merge and Concat Contd…

DataFrames Merge:

Pandas provides a single function, merge(), as the
entry point for all standard database join operations
between DataFrame objects.
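A minimal sketch of three join types with merge(), assuming pandas is installed. The two dataframes and the key column id are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Anuj", "Rama", "Kapil"]})
right = pd.DataFrame({"id": [2, 3, 4], "age": [25, 22, 30]})

inner = pd.merge(left, right, on="id", how="inner")      # ids present in both
outer = pd.merge(left, right, on="id", how="outer")      # ids present in either
left_join = pd.merge(left, right, on="id", how="left")   # every id from left

print(inner)
print(outer)
print(left_join)
```

Rows without a match on the other side get NaN in the columns that come from that side.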

ETHNOTECH ACADEMY
Join Operations

ETHNOTECH ACADEMY
Merge and Concat Contd…
Example:

ETHNOTECH ACADEMY
Merge and Concat Contd…

Output:

ETHNOTECH ACADEMY
Merge and Concat Contd…
DataFrames Concat:
concat() function does all of the heavy lifting of performing
concatenation operations along an axis while performing
optional set logic (union or intersection) of the indexes (if any)
on the other axes.
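The axis and the set logic can be sketched with two small made-up dataframes, assuming pandas is installed. The join parameter selects union ("outer", the default) or intersection ("inner") of the labels on the other axis.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5, 6], "z": [7, 8]})

# Stack rows; the columns are the union of both (missing values become NaN).
rows_union = pd.concat([a, b], ignore_index=True)

# Stack rows; keep only the columns common to both.
rows_inner = pd.concat([a, b], join="inner", ignore_index=True)

# Place the dataframes side by side (along the column axis).
side_by_side = pd.concat([a, b], axis=1)

print(rows_union)
print(rows_inner)
print(side_by_side)
```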

ETHNOTECH ACADEMY
Merge and Concat Contd…

Output:

ETHNOTECH ACADEMY
DataFrames
A Pandas DataFrame is a 2 dimensional data structure, like a
2 dimensional array, or a table with rows and columns.

Example

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Named Indexes
• With the index argument, we can name our own indexes.
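A short sketch of named indexes, assuming pandas is installed. The data values and the labels day1 to day3 are made up.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# index replaces the default labels 0, 1, 2 with our own names.
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

row = df.loc["day2"]   # look up a row by its name

print(df)
print(row)
```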

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Load Files Into a DataFrame

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Graphs

• Pandas uses the plot() method to create diagrams.

• We can use Pyplot, a submodule of the Matplotlib library, to
visualize the diagram on the screen.

• The Matplotlib library offers different types of plots, so we
can choose a graph that suits our needs. Many kinds of graph can
be made from the same data, and with the
help of pandas we can put the data into a dataframe before
plotting it.
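A minimal plotting sketch, assuming pandas and Matplotlib are installed. The data is made up, and the non-interactive Agg backend is chosen only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Duration": [30, 45, 60],
    "Calorie_Burnage": [240, 260, 290],
})

# Line plot (the default kind) and scatter plot of the same two columns.
line_ax = df.plot(x="Duration", y="Calorie_Burnage")
scatter_ax = df.plot(kind="scatter", x="Duration", y="Calorie_Burnage")

# With an interactive backend, plt.show() would display both figures here.
```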
ETHNOTECH ACADEMY
Basic plotting

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Plot of different data
Using more than one list of data in a plot.

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Plot on given axis

We can explicitly define the name of axis and plot the data on
the basis of this axis.

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Bar plot using matplotlib:
Find different types of bar plot to clearly understand the
behaviour of given data.

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Scatter plot:

ETHNOTECH ACADEMY
Output

ETHNOTECH ACADEMY
Documentation – Overview of Data Science

• Data Science Definition


• Brief history of Data Science
• Applications of Data Science
• Artificial Intelligence – History, Importance and Applications
• Fundamentals of Machine Learning
• Classification of Machine Learning
• Applications of Machine Learning
• Data Science in AI and ML

ETHNOTECH ACADEMY
Q&A

• The most prominent data structures of pandas is/are __________


A. Series
B. Data frame
C. Both a & b
D. Heap
• Both a & b

ETHNOTECH ACADEMY
Q&A

• Which of the following library in python is used for plotting graphs


and visualization
a. Pandas
b. Numpy
c. Matplotlib
d. SK learn
• Matplotlib

ETHNOTECH ACADEMY
Q&A

• Which of the following command is used to install pandas


a. pip install pandas
b. install pandas
c. pip pandas
d. none of the above

• pip install pandas

ETHNOTECH ACADEMY
Q&A

• Which of the following is an example of one-dimensional array?


a. Series
b. Data frame
c. Matrix
d. Both a &b
• Series

ETHNOTECH ACADEMY
Q&A

• A series by default has numeric data labels starting from
a. 3
b. 2
c. 1
d. 0
• 0

ETHNOTECH ACADEMY
Q&A

• The data label associated with a particular value of series is called


its _____
a. Data value
b. Index
c. Value
d. Array
• Index

ETHNOTECH ACADEMY
Q&A

• Which of the following statement is correct for importing pandas in


python
a. import pandas
b. import pandas as pd
c. import pandas as pds
d. all of the above

• all of the above

ETHNOTECH ACADEMY
Q&A

• Which attribute is used to give user defined labels in series?


a. index
b. data
c. values
d. none of these
• index

ETHNOTECH ACADEMY
Q&A

• Fill in the blank to get the output as 2


import pandas as pd
s1 = pd.Series([1,2,3,4], index = ['a','b','c','d'])
print(s1[____])
a. ‘b’
b. 1
c. Both a &b
d. B
• Both a &b

ETHNOTECH ACADEMY
Q&A

• State the following statement is true or false. “A series object is size


mutable”
a. True
b. False
• False

ETHNOTECH ACADEMY
Q&A
• During the execution of the following code, what response will
we get?
import pandas as pd
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
print(s['f'])
A. KeyError
B. IndexError
C. ValueError
D. Semantic error
• KeyError

ETHNOTECH ACADEMY
Q&A

• Which of the following is the correct syntax for a pandas dataframe?

A. Pandas.DataFrame(data, index, dtype, copy)
B. pandas.DataFrame( data, index, columns, dtype, copy)
C. pandas.DataFrame(data, index, dtype, copy)
D. pandas.DataFrame( data, index, rows, dtype, copy)

• pandas.DataFrame( data, index, columns, dtype, copy)

ETHNOTECH ACADEMY
Q&A

• which of the following is / are not correct to access individual item


from dataframe 'df’
a. df.iat[2,2]
b. df.loc[2,2]
c. df.at[2,2]
d. df[0,0]
• df[0,0]

ETHNOTECH ACADEMY
Q&A

• What will be output of following code


import pandas as pd
data = [['Anuj',21],['Rama',25],['Kapil',22]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)

ETHNOTECH ACADEMY
Q&A

• What will be output of following code?


import numpy as np
array1=np.array([100,200,300,400,500,600,700])
print(array1[1:5:2])
A. [200 300]
B. [200 700]
C. [200 400]
D.[200 400]
• [200 400]

ETHNOTECH ACADEMY
SUMMARY
• Fundamentals of Pandas and its usage
• Columnar operations of Pandas
• Usage of Pandas Library in Data Science
• Applications of Graphs in Pandas

ETHNOTECH ACADEMY
Question 1

• State whether the following statement is true or false: ‘Sorting your Pandas
dataframe df by one or more columns can be done either ascending
or descending’
• Answer: True

ETHNOTECH ACADEMY
Question 2

2. _________basically helps the user to identify the role of the


respective column in the data frame
A. Header
B. Column
C. Data
D. Array
Answer : Header

ETHNOTECH ACADEMY
Session 3

Foundation-Numpy

• One Dimension

• Two Dimension

• Two Dimension stacking

ETHNOTECH ACADEMY
Session 3

Foundation-Descriptive Analysis

• Data Dictionary

• Single numeric descriptive analysis

• Double numeric descriptive analysis

• Categorical and all Numeric Descriptive Analysis


ETHNOTECH ACADEMY
Foundation-Numpy
Introduction Of Numpy
• NumPy is an open-source library for working efficiently with
arrays. Developed in 2005 by Travis Oliphant, the name
stands for Numerical Python.
• As a critical data science library in Python,
many other libraries depend on it.
• NumPy is extremely popular because it dramatically improves
the ease and performance of working with multidimensional arrays.
ETHNOTECH ACADEMY
Advantages of Numpy

• It offers an indexing syntax for easily accessing portions of
data within an array.

• It contains built-in functions that improve quality of life
when working with arrays and math, such as functions for
linear algebra, array transformations, and matrix math.

• It requires fewer lines of code for most mathematical
operations than native Python lists.
ETHNOTECH ACADEMY
One-dimensional NumPy array

• A one-dimensional array contains elements in only one
dimension. In other words, the shape tuple of the NumPy array
contains only one value. Let us see how
to create one-dimensional NumPy arrays.

Method 1:

• First make a list, then pass it to numpy.array()
ETHNOTECH ACADEMY
One-dimensional NumPy array

Output

ETHNOTECH ACADEMY
Method 2

fromiter()

• fromiter() builds a one-dimensional array from any iterable. It is
useful for creating arrays from non-numeric sequences; however, it
can create any type of array.

• Here we will convert a string into a NumPy array of
characters.
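A minimal sketch of fromiter(), assuming NumPy is installed. The dtype "U1" (a one-character Unicode string) is a choice made for this illustration; a string dtype passed to fromiter needs an explicit length like this.

```python
import numpy as np

text = "numpy"

# Each character of the string becomes one element of the array.
chars = np.fromiter(text, dtype="U1")

# fromiter also accepts numeric iterables such as generators.
squares = np.fromiter((n * n for n in range(5)), dtype=int)

print(chars)
print(squares)
```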

ETHNOTECH ACADEMY
Method 2 Contd…

Output

ETHNOTECH ACADEMY
Method 3

arange()
• It returns evenly spaced values within a given interval.
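A short sketch of arange(), assuming NumPy is installed. The stop value is not included in the result.

```python
import numpy as np

a = np.arange(5)          # start defaults to 0, step to 1
b = np.arange(2, 11, 2)   # start 2, stop before 11, step 2

print(a)   # [0 1 2 3 4]
print(b)   # [ 2  4  6  8 10]
```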

Output

ETHNOTECH ACADEMY
Method 4

• linspace() creates evenly spaced numerical elements between
two given limits.
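A short sketch of linspace(), assuming NumPy is installed. Unlike arange(), the third argument is the number of elements, and both limits are included by default.

```python
import numpy as np

points = np.linspace(0, 1, 5)                    # 5 values from 0 to 1 inclusive
open_end = np.linspace(0, 1, 4, endpoint=False)  # 4 values, upper limit excluded

print(points)     # [0.   0.25 0.5  0.75 1.  ]
print(open_end)   # [0.   0.25 0.5  0.75]
```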
Output

ETHNOTECH ACADEMY
Numpy — Stacking Arrays

Joining two numpy arrays

• stack — Joins arrays with given axis element by element

• hstack — Extends horizontally

• vstack — Extends vertically

ETHNOTECH ACADEMY
Stack
• Joins arrays along a given axis, element by element
• Both input arrays should have the same dimension/shape
• The axis parameter in stack refers to a dimension of the result
rather than to a horizontal/vertical direction.
• If axis is 0, then it will join by the first dimension
• If axis is 1, then it will join by the second dimension
• The result has one more dimension than the input arrays
(say n), i.e. n + 1, so axis can take values from 0 to n.
• If axis is greater than n, then an "out of bounds for array of
dimension" exception will be thrown
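The rules above can be sketched with two one-dimensional arrays (n = 1), assuming NumPy is installed. The out-of-bounds error raised by NumPy is an AxisError, which is a subclass of ValueError, so it is caught as ValueError here.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

by_first = np.stack([a, b], axis=0)    # shape (2, 3): each input becomes a row
by_second = np.stack([a, b], axis=1)   # shape (3, 2): each input becomes a column

try:
    np.stack([a, b], axis=2)           # the result has only 2 dimensions
    out_of_bounds = False
except ValueError:
    out_of_bounds = True

print(by_first)
print(by_second)
print(out_of_bounds)
```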

HStack

Stacks horizontally
• This function does not take an axis argument. It extends the first array with the second array horizontally.
• Because it extends horizontally, both 2-D arrays must have the same number of rows; otherwise a ValueError is raised.
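
A brief sketch of hstack() for 1-D and 2-D inputs, using illustrative values:

```python
import numpy as np

# 1-D arrays are simply joined end to end.
a = np.array([1, 2, 3])
b = np.array([4, 5])
print(np.hstack((a, b)).tolist())  # [1, 2, 3, 4, 5]

# 2-D arrays must have the same number of rows; columns are appended.
m = np.array([[1, 2], [3, 4]])     # shape (2, 2)
n = np.array([[5], [6]])           # shape (2, 1)
wide = np.hstack((m, n))
print(wide.tolist())               # [[1, 2, 5], [3, 4, 6]]

# A row-count mismatch raises ValueError.
try:
    np.hstack((m, np.array([[7, 8, 9]])))  # 2 rows vs 1 row
except ValueError as exc:
    print("ValueError:", exc)
```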

VStack

VStack: stacks vertically

• This function does not take an axis argument. It extends the first array with the second array vertically, so both 2-D arrays must have the same number of columns.
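
A brief sketch of vstack(), using illustrative values:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])  # shape (2, 2)
n = np.array([[5, 6]])          # shape (1, 2)

# 2-D arrays must have the same number of columns; rows are appended.
tall = np.vstack((m, n))
print(tall.tolist())  # [[1, 2], [3, 4], [5, 6]]
print(tall.shape)     # (3, 2)

# 1-D arrays of equal length are treated as rows.
rows = np.vstack((np.array([1, 2]), np.array([3, 4])))
print(rows.tolist())  # [[1, 2], [3, 4]]
```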

Foundation-Descriptive Analysis

Introduction

• Descriptive statistics describe the basic and important features of data.
• Descriptive statistics help simplify and summarize large amounts of data in a sensible manner.
• Descriptive statistics involve evaluating measures of center (centrality measures) and measures of dispersion (spread).

Centrality measures

• Centrality measures give us an estimate of the center of a distribution.

• They give us a sense of a typical value we would expect to see.

The three major measures of center are:

1. Mean
2. Median
3. Mode

Mean

• The mean is the average of the given values.

• To compute the mean, sum all the values and divide the sum by the number of values.

• Consider a class of 7 students. Suppose they appear for midterm exams and score the following marks out of 20:

Name        Subject 1   Subject 2   Subject 3   Subject 4   Total Marks
Student 1   10          8           15          11.5        44.5
Student 2   14          9           7.5         11          41.5
Student 3   11          17          11.5        9           48.5
Student 4   7           14.5        10          12          43.5
Student 5   9.5         12          10.5        14          46
Student 6   15          18          7           12          52
Student 7   19          15.5        11          7.5         53

Mean with python

There are various libraries in Python, such as pandas, numpy, and statistics, that support mean calculation.

The syntax is

numpy.mean(arr, axis=None): Compute the arithmetic mean (average) of the given data (array elements) along the specified axis.

Parameters
• arr : [array_like] Input array.
• axis : [int or tuple of ints] Axis along which we want to calculate the arithmetic mean. If it is not given, arr is treated as flattened (the mean is taken over all axes). axis = 0 means along the column and axis = 1 means working along the row.
• out : [ndarray, optional] Different array in which we want to place the result. The array must have the same dimensions as the expected output.
• dtype : [data-type, optional] Type we desire while computing the mean.
• Returns : Arithmetic mean of the array (a scalar value if axis is None) or an array with mean values along the specified axis.
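
As a sketch of how the axis parameter behaves, the marks from the student table earlier in this section can be placed in a 2-D array (one row per student, one column per subject). The variable names are illustrative:

```python
import numpy as np

# Marks from the student table: rows are students, columns are subjects.
marks = np.array([
    [10,  8,    15,   11.5],
    [14,  9,    7.5,  11],
    [11,  17,   11.5, 9],
    [7,   14.5, 10,   12],
    [9.5, 12,   10.5, 14],
    [15,  18,   7,    12],
    [19,  15.5, 11,   7.5],
])

overall = np.mean(marks)              # axis=None: mean of all 28 marks
per_subject = np.mean(marks, axis=0)  # one mean per column (subject)
per_student = np.mean(marks, axis=1)  # one mean per row (student)

print(float(overall))                   # 11.75
print(round(float(per_subject[0]), 2))  # 12.21 (Subject 1: 85.5 / 7)
print(per_student.tolist())             # each student's total divided by 4
```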
Median
• It is the value where the upper half of the data lies above it and the lower half lies below it. In other words, it is the middle value of a data set.
• To calculate the median, arrange the data points in increasing order; the middle value is the median.
• It is easy to find the middle value if there is an odd number of data points. Say we want to find the median of the marks of all students for Subject 1.
• When the marks are arranged in increasing order, we get {7, 9.5, 10, 11, 14, 15, 19}. Clearly, the middle value is 11; therefore, the median is 11.

Median Contd…
• If Student 7 did not write the exam, we will have marks as
{7,9.5,10,11,14,15}. This time there is no clear middle value.
• Then, take the mean of the third and fourth values, which is
(10+11)/2=10.5, so the median in this case is 10.5.

Median in Python
• numpy.median(arr, axis=None): Compute the median of the given data (array elements) along the specified axis.

How to calculate the median?

• Take the given data points.
• Arrange them in ascending order.
• Median = the middle term, if the total number of terms is odd.
• Median = the average of the two middle terms, if the total number of terms is even.
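
The odd and even cases can be checked with the Subject 1 marks used in the worked example. This is a minimal sketch; the variable names are illustrative:

```python
import numpy as np

# Subject 1 marks for the 7 students, in table order.
subject_1 = np.array([10, 14, 11, 7, 9.5, 15, 19])

# Odd count: after sorting, the 4th of 7 values is the median.
print(np.sort(subject_1).tolist())  # [7.0, 9.5, 10.0, 11.0, 14.0, 15.0, 19.0]
print(float(np.median(subject_1)))  # 11.0

# Even count (Student 7 absent): average of the two middle values.
without_student_7 = subject_1[:-1]
print(float(np.median(without_student_7)))  # 10.5
```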

Parameters
• arr : [array_like] Input array.
• axis : [int or tuple of ints] Axis along which we want to calculate the median. If it is not given, arr is treated as flattened (the median is taken over all axes). axis = 0 means along the column and axis = 1 means working along the row.
• out : [ndarray, optional] Different array in which we want to place the result. The array must have the same dimensions as the expected output.
• overwrite_input : [bool, optional] If True, the input array may be modified in place to save memory while computing the median. (Unlike numpy.mean, numpy.median has no dtype parameter.)
• Returns : Median of the array (a scalar value if axis is None) or an array with median values along the specified axis.
Mode

• It is the value that occurs most often in our data set.
• Suppose there are 15 students appearing for an exam and following is the result:

Note: To compute the mode in Python, first create an array using the numpy package and then apply a mode() function to it. NumPy itself does not provide mode(); scipy.stats.mode() and the standard library's statistics.mode() do. Let us see examples for better understanding.

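
A minimal sketch of two ways to get the mode. The 15 marks below are hypothetical, since the original result table is not available here; np.unique and statistics.mode are used instead of a NumPy mode() function, which does not exist:

```python
import numpy as np
from statistics import mode

# Hypothetical marks for 15 students (not taken from the original slides).
marks = np.array([12, 15, 12, 18, 15, 12, 9, 15, 12, 20, 18, 12, 9, 15, 12])

# Option 1: count each distinct value with NumPy and pick the most frequent.
values, counts = np.unique(marks, return_counts=True)
most_common = values[np.argmax(counts)]
print(int(most_common), int(counts.max()))  # 12 6

# Option 2: the standard library's statistics.mode() gives the same answer.
print(mode(marks.tolist()))  # 12
```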
Q&A
• The NumPy package is capable of fast operations on arrays.
A. True
B. False
• True

Q&A

• NumPy arrays can be ___


a. Indexed
b. Sliced
c. Iterated
d. All of the mentioned above
• All of the mentioned above

Q&A
NumPy is often used along with packages like?
a. Node.js
b. Matplotlib
c. SciPy
d. Both B and C
• Both B and C

Q&A
The most important object defined in NumPy is an N-dimensional array type
called?
a. ndarray
b. narray
c. nd_array
d. darray
• ndarray

Q&A
How to convert numpy array to list?
a. array.list()
b. array.list
c. list.array()
d. list(array)
• list(array)

Q&A
Which of the following commands is used to install numpy on a system containing python3?
a. pip numpy install python3
b. pip3 install numpy
c. pip install numpy
d. python3 pip3 numpy install
• pip3 install numpy

Q&A
What is the size attribute in numpy used to find?
a. shape
b. date & time
c. objects
d. number of items
• number of items

Q&A
Is the following syntax correct for importing the numpy module?
fetch numpy as np
np.array(list)
A. Yes, it is correct
B. No, it is not correct
• No, it is not correct (the keyword is import, not fetch)

Q&A
What are the attributes of numpy array?
a. shape, dtype, ndim
b. objects, type, list
c. objects, non vectorization
d. Unicode and shape
• shape, dtype, ndim

Q&A
What is the output of following code?
import numpy as np
ary = np.array([1,2,3,5,8])
ary = ary + 1
print (ary[1])
• 3

SUMMARY
• Fundamentals of Numpy library
• Usage of Numpy library in Data Science
• Various operations of Numpy
• Implementation of Merge and Concatenation operations in Numpy

Session 4

Regression
• Introduction and Preprocessing
• Feature Selection Regularisation
• Residual Analysis
• Data Read
• Normality test and BoxCox transformation
• Linear Regression structure

• Linear Regression for Numeric features

• HotEncoding and Scaling

• Linear Regression with HotEncoding and Scaling Data

• Generic Treeflow in Prediction

• CatBoost
• CatBoost Hyperparameter Tuning

Preprocessing


• Data preprocessing, a component of data preparation, describes any type of processing performed on raw data to prepare it for another data processing procedure.

• It has traditionally been an important preliminary step for the data mining process.

• Data preprocessing techniques have been adapted for training machine learning models and AI models and for running inferences against them.

• Data preprocessing transforms the data into a format that is more easily and effectively processed in data mining, machine learning and other data science tasks.

• The techniques are generally used at the earliest stages of the machine learning and AI development pipeline to ensure accurate results.

Tools and methods for preprocessing data

• Sampling, which selects a representative subset from a large population of data.

• Transformation, which manipulates raw data to produce a single input.

• Denoising, which removes noise from data.

• Imputation, which synthesizes statistically relevant data for missing values.

• Normalization, which organizes data for more efficient access.

• Feature extraction, which pulls out a relevant feature subset that is significant in a particular context.

Why is data preprocessing important?

• Real-world data is messy and is often created, processed and stored by a variety of humans, business processes and applications.

• As a result, a data set may be missing individual fields, contain manual input errors, or have duplicate data or different names to describe the same thing.

• Humans can often identify and rectify these problems in the data they use in the line of business, but data used to train machine learning or deep learning algorithms needs to be automatically preprocessed.

Regression
• Regression searches for relationships among variables.

• For example, you can observe several employees of some company and try to understand how their salaries depend on their features, such as experience, education level, role, city of employment, and so on.

• The dependent features are called the dependent variables, outputs, or responses.
• The independent features are called the independent variables, inputs, regressors, or predictors.

Regression Contd…

• Regression problems usually have one continuous and unbounded dependent variable. The inputs, however, can be continuous, discrete, or even categorical data such as gender, nationality, or brand.

Need of Regression

• Regression is typically needed to answer whether and how some phenomenon influences another, or how several variables are related.

• Example: you can use it to determine if and to what extent experience or gender impacts salaries.

• Regression is also useful when you want to forecast a response using a new set of predictors.

Need of Regression Contd…

• Example: we could try to predict the electricity consumption of a household for the next hour given the outdoor temperature, time of day, and number of residents in that household.

• Regression is used in many different fields, including economics, computer science, and the social sciences. Its importance rises every day with the availability of large amounts of data and increased awareness of the practical value of data.

Linear Regression

• Linear regression is probably one of the most important and widely used regression techniques. It's among the simplest regression methods.

• One of its main advantages is the ease of interpreting results.

• Simple or single-variate linear regression is the simplest case of linear regression, as it has a single independent variable, 𝐱 = 𝑥.

Linear Regression Contd…

• The following figure illustrates simple linear regression:

Steps involved in implementing linear
regression:
• Import the packages and classes that you need.
• Provide data to work with, and eventually do appropriate
transformations.
• Create a regression model and fit it with existing data.
• Check the results of model fitting to know whether the
model is satisfactory.
• Apply the model for predictions.

Step 1: Import packages and classes

• The first step is to import the package numpy and the class LinearRegression from sklearn.linear_model:

import numpy as np
from sklearn.linear_model import LinearRegression

Step 2: Provide data

• The second step is defining data to work with. The inputs (regressors, 𝑥) and output (response, 𝑦) should be arrays or similar objects. This is the simplest way of providing data for regression:

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

Step 3: Create a model and fit it

• Create an instance of the class LinearRegression, which will represent the regression model:

model = LinearRegression()

• Your model as defined above uses the default values of all parameters. The next step is to fit it to the data:

model.fit(x, y)

Step 3 Contd…

With .fit(), you calculate the optimal values of the weights 𝑏₀ and 𝑏₁, using the existing input and output, x and y, as the arguments. In other words, .fit() fits the model. It returns self, which is the variable model itself. That's why you can replace the last two statements with this one:

model = LinearRegression().fit(x, y)
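
To make concrete what .fit() estimates, here is a sketch that computes 𝑏₀ and 𝑏₁ for the same x and y with the closed-form least-squares formulas in plain NumPy. It is a cross-check under the single-predictor assumption, not how scikit-learn is implemented internally:

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Closed-form least-squares estimates for a single predictor.
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b0 = y_mean - b1 * x_mean                                            # intercept

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
y_hat = b0 + b1 * x
r_sq = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y_mean) ** 2)

print(round(float(b0), 4), round(float(b1), 4))  # 5.6333 0.54
print(round(float(r_sq), 4))                     # 0.7159
```

The resulting R² and fitted values agree with the coefficient of determination and predicted response reported in Steps 4 and 5.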
Step 4: Get results

• Once you have your model fitted, you can get the results to check whether the model works satisfactorily and to interpret it.

r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")

Output:
coefficient of determination: 0.7158756137479542

Step 5: Predict response

y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")

Output:
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]

Box-Cox Transformation

• The Box-Cox transformation transforms the data so that its distribution is as close to a normal distribution as possible, that is, so that the histogram looks like a bell.

• This technique has its place in feature engineering because not all types of predictive models are robust to skewed data, so it is worth using when experimenting. It probably won't provide a spectacular improvement, although at the fine-tuning stage it can serve its purpose by improving our evaluation metric.

Box-Cox Equation in code

• The transformation itself has the following formula
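
For reference, the standard Box-Cox formula for strictly positive data y is y(λ) = (y^λ − 1) / λ when λ ≠ 0, and y(λ) = ln(y) when λ = 0. The sketch below implements it in plain NumPy; the function name box_cox and the sample values are illustrative and not from the original slides:

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform of strictly positive data for a given lambda."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)           # limiting case as lambda -> 0
    return (y ** lam - 1.0) / lam

data = np.array([1.0, 2.0, 4.0, 8.0])        # illustrative right-skewed values
print(box_cox(data, 1).tolist())             # [0.0, 1.0, 3.0, 7.0] (a pure shift)
print(box_cox(data, 0.5).round(3).tolist())  # square-root-like compression
print(box_cox(data, 0).round(3).tolist())    # natural logarithm
```

In practice λ is usually chosen from the data. SciPy's scipy.stats.boxcox estimates it by maximum likelihood when no λ is supplied.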

Catboost

• Catboost is a boosted decision tree machine learning algorithm developed by Yandex.

• It works in the same way as other gradient boosted algorithms such as XGBoost but provides support out of the box for categorical variables, often reaches good accuracy without parameter tuning, and also offers GPU support to speed up training.

Catboost Contd…

• Catboost is used for a range of regression and classification tasks and has been shown to be a top performer on various Kaggle competitions that involve tabular data. Below are a couple of examples of where Catboost has been successfully implemented:
• Cloudflare uses Catboost to identify bots trying to target its users' websites.
• Ride-hailing service Careem, based in Dubai, uses Catboost to predict where its customers will travel to next.

Implementation

• We are going to use the classic Titanic dataset to predict whether a passenger on the ship survived or not. The intention here is to keep this tutorial simple by using a small dataset, but the principles will apply to more complex datasets and problems you might be trying to solve.

• Step 1: Import the libraries we will need and also the Titanic dataset.

Data Preparation
• Initially we're simply going to drop any rows that contain NaN in the “survived” column, which is our target, as these rows don't help our model.
df.dropna(subset=['survived'], inplace=True)
• We are only going to make use of 4 features: pclass, sex, age and fare. Let's split our data into X and y to get our feature and target dataframes.
X = df[['pclass', 'sex', 'age', 'fare']].copy()  # copy so later column edits do not touch df
y = df['survived']

Data Preparation Contd…
• Now we still need to treat some of the features. We need to convert the “pclass” column to a string data type because, although it appears numeric, the values are discrete, so it is actually a categorical variable in this context. In addition, the “fare” and “age” columns contain some NaNs, so we'll replace these with zeros.
X['pclass'] = X['pclass'].astype('str')
X['fare'] = X['fare'].fillna(0)  # assign back instead of chained inplace=True
X['age'] = X['age'].fillna(0)

Preparing Categorical Features

• Finally, before we begin training our model we need to split our data into two datasets for training and testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101, stratify=y)

• Now there is an additional complication. Because the split is stratified, the test set has the same class balance as the training set, and if we print out the survival rate of our test set we can see that our training data is imbalanced.

print('Test Survival Rate:', y_test.sum()/y_test.count())

Output:
Test Survival Rate: 0.3816793893129771

• There are a few ways to handle this, but in our example we are simply going to undersample the training data.
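
One possible way to undersample is sketched below with plain NumPy arrays. The helper name undersample, the seed, and the toy data are hypothetical and are not the original tutorial's code; with a pandas DataFrame the same row indices could be applied through .iloc:

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx_0 = np.flatnonzero(y == 0)
    idx_1 = np.flatnonzero(y == 1)
    n = min(len(idx_0), len(idx_1))      # size of the minority class
    keep = np.concatenate([
        rng.choice(idx_0, size=n, replace=False),
        rng.choice(idx_1, size=n, replace=False),
    ])
    rng.shuffle(keep)                     # avoid class-ordered rows
    return X[keep], y[keep]

# Hypothetical toy data: 10 rows, 7 of class 0 and 3 of class 1.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 1, 0, 0, 1, 0, 1, 0])

X_bal, y_bal = undersample(X_toy, y_toy)
print(y_bal.sum() / y_bal.size)  # 0.5 -> balanced
```

Only the training split should be undersampled; the test set keeps its natural class balance so that evaluation stays realistic.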

Training

• To train the model we are going to use Catboost's built-in grid search method. If you have used scikit-learn's GridSearchCV, this works in the same way. First we declare a dictionary of the hyperparameters that we want to tune and lists of values to test. We have decided to tune just a few of the most influential parameters: learning rate, tree depth, L2 leaf regularisation and also the number of iterations we will train the model for.
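
A sketch of what such a dictionary might look like. The keys are CatBoost parameter names for the four hyperparameters listed above; the candidate values are illustrative assumptions, not the values used in the original slides:

```python
from itertools import product

# Hypothetical search space: keys are CatBoost parameter names,
# candidate values are illustrative only.
grid = {
    "iterations": [100, 150, 200],
    "learning_rate": [0.03, 0.1],
    "depth": [2, 4, 6, 8],
    "l2_leaf_reg": [0.2, 0.5, 1, 3],
}

# A grid search fits one model per combination of values.
combinations = list(product(*grid.values()))
print(len(combinations))  # 3 * 2 * 4 * 4 = 96
```

Because the number of fits grows multiplicatively, it is usually worth keeping each list short.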

Training Contd…

• Now we can fit the model using the grid search method by passing the grid dictionary we declared above along with the training data pool. By default, grid search splits the training data 80/20 for training and testing and uses a three-fold cross-validation strategy.

model.grid_search(grid,train_dataset)

Training Contd…

• The model has now been trained and you can print out the
optimum parameters that have been found using grid
search if you’re interested.

model.get_params()

Evaluation

• Now that we have trained our model we can evaluate how it performs on our test data and then briefly see which features are most influential.
• To start with, we'll use our model to make predictions for our test set and then print out a classification report.
pred = model.predict(X_test)
print(classification_report(y_test, pred))

Project

• Implementation of Linear Regression on a real-world scenario to predict sales based on the money spent on TV advertising.

SUMMARY
• Fundamentals of Linear equation
• Need of Linear Regression
• Implementation of Box Cox Transformation
• Implementation of Cat boost over categorical data

Q&A

• State whether the following is true or false: “Data preprocessing is a preliminary step for data mining.”
A. True
B. False
• True

Q&A

• Which of the following manipulates raw data to produce a single input?
A. Sampling
B. Abstraction
C. Transformation
D. Denoising
• Transformation

Q&A

• _________ synthesizes statistically relevant data for missing values.
a. Feature extraction
b. Imputation
c. Normalization
d. Sampling
• Imputation

Q&A

• State whether the following is true or false: “Feature extraction pulls out an irrelevant feature subset that is significant in a particular context.”
a. True
b. False
• False

Q&A

• Which of the following searches for relationships among variables?


a. Classification
b. Association
c. Regression
d. Clustering
• Regression

Q&A

• The following diagram illustrates which type of regression


a. Linear
b. Logistic
c. Exponential
d. Categorical
• Linear

Q&A

• Which machine learning algorithm is used by Cloudflare to identify bots targeting its users' websites?
a. Box-Cox
b. Catboost
c. XGBoost
d. Random forest

• Catboost

Q&A

• State whether the following statement is true or false: “Seaborn is a Python data visualization library based on matplotlib.”
a. True
b. False
• True

Session 5

Classification
• Classification Introduction

• Classification: Code and data load

• Classification: Random Forest

• Classification: Random Forest code

• Classification: CatBoost code

• Classification: One class SVM code

• Classification: Logistic Regression
• Classification: Logistic Regression code

Classification
Introduction
• Data Classification in data science refers to the process of tagging and categorizing any kind of data so that it can be better understood and analyzed. This is the sense we'll be focusing on.
• In addition, a well-planned Data Classification system makes essential data easy to find and retrieve.

Types of Data Classification


• Content-based classification: In this classification type, the contents of each file are the basis for categorization.
• User-based classification: User-based classification relies on the user's knowledge of creation, editing, reviewing, or dissemination to label sensitive documents. These individuals can specify how sensitive each document is.
• Context-based classification: Context-based classification focuses on the context of the data, such as the location, application, and creator, as well as other variables that affect the data.

Benefits of Classification
• Urgency detection: A pre-trained model can classify inbound texts and support tickets to determine whether they should be labeled as urgent or not urgent.
• Sentiment detection: NLP, or Natural Language Processing, can be used to detect the sentiment of any given content, which saves time by routing the right messages to the right people.
• Topic labeling: Topic labeling consists of tagging topics with a couple of descriptive words or phrases. This is done by using an NLP technique to identify themes and meanings, e.g. to classify any incoming email attachment and forward it to the right folder in your storage system.

Data Classification applications

• Text Classification
• Document Classification
• Image Classification

Text Classification
• Text classification is a powerful tool that uses NLP to make use of the unstructured data we all sit on top of. In the words of users, it feels like wizardry when you create your first classifier and see hundreds of survey responses categorized in seconds.

Data Classification applications Contd…

Document Classification
• Document Classification focuses on processes that mainly
apply content-specific classification - e.g. classifying
incoming email attachments by type. It differs from text
classification, as instead of specific phrases or paragraphs
being classified, the whole document is taken into
consideration.

Image Classification
• Image Classification categorizes any incoming image file by
predetermined labels. It is often combined with object
detection. These days you can create your own image
classifier and teach the model to make subjective decisions
based on your logic: whether an incoming ad creative is
good or not; whether the image fits into the product
portfolio; whether an image you snapped on your holidays
is appropriate to show to your grandparents.
Classification: Random Forest

• Random forest is a supervised learning algorithm. It can be used for both classification and regression. It is also a flexible and easy-to-use algorithm.
• A forest is comprised of trees. It is said that the more trees it has, the more robust a forest is.
• Random forest creates decision trees on randomly selected data samples, gets a prediction from each tree and selects the best solution by means of voting.

Random Forest Contd…

• It also provides a pretty good indicator of feature importance.
• Random forest has a variety of applications, such as recommendation engines, image classification and feature selection.
• It can be used to classify loyal loan applicants, identify fraudulent activity and predict diseases. It lies at the base of the Boruta algorithm, which selects important features in a dataset.

Random Forest Algorithm

1. Select random samples from a given dataset.

2. Construct a decision tree for each sample and get a prediction result from each decision tree.

3. Perform a vote for each predicted result.

4. Select the prediction result with the most votes as the final prediction.

Advantages

• Random forest is considered a highly accurate and robust method because of the number of decision trees participating in the process.
• It is much less prone to overfitting than a single decision tree. The main reason is that it averages the predictions of many trees, which cancels out much of their individual error.
• The algorithm can be used in both classification and regression problems.

Advantages Contd…

• Random forest can also handle missing values. There are two ways to handle these: using median values to replace continuous variables, and computing the proximity-weighted average of missing values.
• You can get the relative feature importance, which helps in selecting the most contributing features for the classifier.

Disadvantages:

• Random forest is slow in generating predictions because it has multiple decision trees. Whenever it makes a prediction, all the trees in the forest have to make a prediction for the same given input and then perform voting on it. This whole process is time-consuming.
• The model is difficult to interpret compared to a decision tree, where you can easily make a decision by following the path in the tree.

Implementing Random Forest Classification Using IRIS Dataset
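
A minimal sketch of a random forest classifier on the Iris dataset bundled with scikit-learn. The 70/30 split, the seed and the number of trees are arbitrary choices for illustration, not values from the original slides:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target  # 150 flowers, 4 features, 3 species

# Hold out 30% of the rows for testing (ratio and seed are arbitrary).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 100 trees, each grown on a bootstrap sample; the forest combines their predictions.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Relative importance of each feature, as mentioned under "Advantages".
for name, score in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```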


Implementing Random Forest Classification on a Real-World Data Set

1. Importing Python libraries and loading our data set into a data frame
2. Splitting our data set into training set and test set
3. Creating a Random Forest classification model and fitting it to the training data
4. Predicting the test set results and making the confusion matrix

Catboost Classifier

• The CatBoost algorithm is a supervised machine learning algorithm developed by Yandex researchers and engineers.
• It is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks.

Key features of the Catboost algorithm:

• The algorithm automatically takes care of NULL values in the dataset.
• The algorithm automatically encodes categorical features, so no manual label encoding is needed.
• The algorithm uses binary symmetric decision trees.

Catboost Classifier

• Our sample dataset contains data about three different flowers represented by their size and color.

• Most machine learning algorithms require only numerical features for the training process, so if you try to use this dataset “as is” without prior preprocessing, you will not be able to do it. You have to convert the categorical features' values (the “Color” column in our example) into numeric values.
• Such conversion is called label encoding.

Catboost Classifier Contd…

• Catboost is used for a range of regression and classification tasks and has been shown to be a top performer on various Kaggle competitions that involve tabular data.
• Cloudflare uses Catboost to identify bots trying to target its users' websites.
• Ride-hailing service Careem, based in Dubai, uses Catboost to predict where its customers will travel to next.

Steps involved in Catboost implementation

• Installation and Imports

• Define Dataset

• Apply Model

• Predict

• Here we apply CatBoost's regressor to a regression dataset. The dataset contains the price information of houses in Dushanbe city. The input variables are the number of rooms, floors, area, and location.

Classification: One class SVM

• Classification problems are often solved using supervised learning algorithms such as Random Forest Classifier, Support Vector Machine, Logistic Regressor (for binary class classification) etc.
• A specific type of binary classification problem with single-class training examples is called One-Class Classification (OCC).
• A One-class classification method is used to detect the outliers and anomalies in a dataset. Based on Support Vector Machines (SVM) evaluation, the One-class SVM applies a One-class classification method for novelty detection.

Classification

• One-Class Classification is solved using an unsupervised or semi-supervised learning algorithm such as One-Class Support Vector Machines (1-SVM), Support Vector Data Description (SVDD) etc. One of the popular examples of One-Class Classification is Anomaly Detection (AD), i.e., outlier detection and novelty detection.

Classification Contd…

The Scikit-learn API provides the OneClassSVM class for this algorithm. We will cover the following steps:
1. Preparing the data
2. Defining the model and prediction
3. Anomaly detection with scores

Preparing the data
• We'll create a random sample dataset for this tutorial by using the make_blobs() function. We'll check the dataset by visualizing it in a plot.

Implementation of One-Class SVM

Detection of anomalies in a dataset by using the One-class SVM
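
The three steps listed earlier (preparing the data, defining the model and prediction, anomaly detection with scores) can be sketched as follows. The sample size, kernel settings, nu value and the 5% score threshold are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM

# 1. Prepare the data: a single random cluster of 2-D points.
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5, random_state=0)

# 2. Define the model and predict: +1 marks inliers, -1 marks outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
labels = model.fit_predict(X)
print("flagged by predict:", int((labels == -1).sum()))

# 3. Anomaly detection with scores: treat the lowest 5% of scores as anomalies.
scores = model.score_samples(X)
threshold = np.quantile(scores, 0.05)
anomalies = X[scores <= threshold]
print("flagged by score threshold:", len(anomalies))
```

The nu parameter roughly controls the share of training points allowed to fall outside the learned boundary, so it is worth trying a few values on real data.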

Logistic Regression

• Logistic regression aims to solve classification problems.

• It does this by predicting categorical outcomes, unlike linear regression, which predicts a continuous outcome.

• In the simplest case there are two outcomes, which is called binomial. An example is predicting whether a tumor is malignant or benign.

• Other cases have more than two outcomes to classify; in this case it is called multinomial.
• A common example of multinomial logistic regression would be predicting the class of an iris flower among 3 different species.

Implementation of Logistic Regression

Scenario
• User Database: This dataset contains information about users from a company's database. It contains information about UserID, Gender, Age, EstimatedSalary, and Purchased. We are using this dataset for predicting whether a user will purchase the company's newly launched product or not.
• Refer to the table below, which shows a sample of the data in the dataset.

Implementation of Logistic Regression Contd…
• Let us make the Logistic Regression model, predicting
whether a user will purchase the product or not

Import Libraries

Read and Explore the data

Import Libraries Contd…

• Now, to predict whether a user will purchase the product or
not, one needs to find out how Age and Estimated Salary
relate to the Purchased outcome. Here User ID and Gender are
not important factors for this prediction.

ETHNOTECH ACADEMY
Splitting The Dataset: Train and
Test dataset

• Split the dataset into training and test sets.
• 75% of the data is used for training the model and 25% of it is
used to test the performance of our model.
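
A 75/25 split like the one described above might look like this. The data here is randomly generated as a stand-in for the user database (an assumption, since the original file is not part of this text), and the rule that produces the Purchased label is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Age and EstimatedSalary columns (400 users).
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary])

# Made-up rule for the Purchased label, only so the example has a target.
y = (0.05 * (age - 38) + (salary - 70_000) / 40_000 > 0).astype(int)

# test_size=0.25 keeps 75% of the rows for training and 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)  # (300, 2) (100, 2)
```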

ETHNOTECH ACADEMY
Splitting The Dataset: Train and
Test dataset Contd…
Now, it is very important to perform feature scaling here
because Age and Estimated Salary values lie in different
ranges. If we don’t scale the features, the Estimated Salary
feature, with its much larger values, will dominate the Age
feature when the model is fitted.
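
One common way to do this is standardization. The sketch below uses scikit-learn's StandardScaler on a few hand-written rows; whether the original slides used this scaler is an assumption, and the numbers are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative rows: [Age, EstimatedSalary].
X_train = np.array(
    [[19, 19_000], [35, 20_000], [26, 43_000], [47, 150_000], [58, 95_000]],
    dtype=float,
)
X_test = np.array([[30, 87_000], [40, 60_000]], dtype=float)

# Fit the scaler on the training data only, then reuse it for the test data
# so that no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.round(2))
```

After scaling, each training column has mean 0 and standard deviation 1, so Age and Estimated Salary are on a comparable scale.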

ETHNOTECH ACADEMY
Output

• Here one can see that the Age and Estimated Salary feature
values are scaled and now lie in a similar range (roughly -1 to 1).
Hence, each feature will contribute equally to decision making.
ETHNOTECH ACADEMY
Training our Logistic Regression
model

• After training the model, it is time to use it to make predictions
on the test data.
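
Training and prediction can be sketched end to end as below. The data is synthetic and the labelling rule is invented for the example, so the accuracy printed here says nothing about the real user database.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic Age / EstimatedSalary data with a made-up Purchased rule.
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)
y = (0.05 * (age - 38) + (salary - 70_000) / 40_000 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)

# Train the model on the scaled training data.
classifier = LogisticRegression(random_state=0)
classifier.fit(scaler.transform(X_train), y_train)

# Use the trained model to predict the held-out test data.
y_pred = classifier.predict(scaler.transform(X_test))
test_accuracy = (y_pred == y_test).mean()
print("test accuracy:", test_accuracy)
```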

ETHNOTECH ACADEMY
Evaluation Metrics

• Metrics are used to check the model performance by comparing
the predicted values with the actual values.
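
As a small worked example, the confusion matrix and accuracy can be computed with scikit-learn. The two label lists below are hand-written for illustration and are not results from the slides.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels: 1 = purchased, 0 = not purchased.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

print(cm)        # [[3 1]
                 #  [1 3]]
print(accuracy)  # 0.75
```

Six of the eight predictions match the actual labels, which gives the accuracy of 0.75; the off-diagonal entries count one false positive and one false negative.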

Output:

ETHNOTECH ACADEMY
Visualizing the performance of our
model

ETHNOTECH ACADEMY
SUMMARY

ETHNOTECH ACADEMY
Q&A

• ______ classification relies on the user’s knowledge of creation,
editing, reviewing, or dissemination to label sensitive documents.
a. Content based
b. Data based
c. User based
d. Context based
• Answer: c. User based

ETHNOTECH ACADEMY
Q&A

• Classifying incoming email attachments by type is an example of
which type of classification?
a. Text classification
b. Document classification
c. Image classification
d. Voice classification
• Answer: b. Document classification

ETHNOTECH ACADEMY
Q&A

• Which of the following is not an application of the Random forest
algorithm?
a. Recommendation engines
b. Image classification
c. Feature selection
d. Data preprocessing
• Answer: d. Data preprocessing

ETHNOTECH ACADEMY
Q&A

• Which of the following algorithms is not an example of an ensemble
learning algorithm?
a. Random Forest
b. Adaboost
c. Decision Trees
d. Gradient Boosting
• Answer: c. Decision Trees

ETHNOTECH ACADEMY
Q&A

• Which of the following is not an advantage of the Random forest
algorithm?
a. It is considered to be highly accurate and robust
b. It can be used for both classification and regression
c. It does not suffer from the overfitting problem
d. It is slow in generating predictions because it has multiple
decision trees
• Answer: d. It is slow in generating predictions because it has
multiple decision trees
ETHNOTECH ACADEMY
Session 7

Clustering
• Clustering: Introduction

• Clustering: KMeans

• Clustering: Agglomerative

• Classification: KNN
• Classification: KNN using Iris

ETHNOTECH ACADEMY
Clustering
• Machine learning is a subset of Artificial Intelligence that
allows a machine to automatically learn from past data
without being explicitly programmed.
• Classical machine learning is often categorized by how an
algorithm learns to become more accurate in its
predictions.
• Clustering is the task of dividing the population or data
points into several groups such that data points in the same
groups are similar to other data points in that group and
dissimilar to the data points in other groups.
ETHNOTECH ACADEMY
Clustering Contd…

• It is basically an assembly of objects based on similarity and
dissimilarity between them.

ETHNOTECH ACADEMY
THE IMPORTANCE OF CLUSTERING

• Clustering helps in understanding the natural grouping in a
dataset.
• Its aim is to partition the data into a set of meaningful
groups.
• Clustering quality depends on the method used and on its
ability to identify hidden patterns.
• The biggest advantage of clustering over classification is that it
can adapt to changes in the data and helps single out useful
features that differentiate the groups.
ETHNOTECH ACADEMY
Applications of Clustering

• It is widely used in many applications such as image
processing, data analysis, and pattern recognition.
• It can be used in the field of biology to derive animal and
plant taxonomies and to identify genes with similar
functions.
• It also helps in information discovery by grouping
documents on the web.
• It helps marketers find the distinct groups in their
customer base and characterize those groups by their
purchasing patterns.
ETHNOTECH ACADEMY
K Means clustering
• K-means as a clustering algorithm is deployed to discover
groups that haven’t been explicitly labeled within the data.
It’s being actively used today in a wide variety of business
applications including:
 Customer segmentation: Customers can be grouped in
order to better tailor products and offerings.
 Text, document, or search results clustering: grouping to
find topics in text.

ETHNOTECH ACADEMY
K Means clustering Contd…

• Image grouping or image compression: groups similar
images or colors.
• Anomaly detection: finds what isn’t similar, i.e. the outliers
from clusters.
• Semi-supervised learning: clusters are combined with a
smaller set of labeled data and supervised machine learning
in order to get more valuable results.

ETHNOTECH ACADEMY
Working of K-means

• The K-means algorithm identifies a certain number of
centroids within a data set, a centroid being the arithmetic
mean of all the data points belonging to a particular cluster.
• The algorithm then allocates every data point to the nearest
cluster while attempting to keep each cluster as compact as
possible (the ‘means’ in K-means refers to the task of
averaging the data, i.e. finding the centroid).
• At the same time, K-means attempts to keep the clusters as
different from one another as possible.
ETHNOTECH ACADEMY
Working of K-means Contd…
In practice it works as follows:
• The K-means algorithm begins by initializing the
coordinates of the K cluster centers. (The number K is an input
variable, and the initial locations can also be given as input.)

ETHNOTECH ACADEMY
Working of K-means Contd…

• With every pass of the algorithm, each point is assigned to
its nearest cluster center.

ETHNOTECH ACADEMY
Working of K-means Contd…
• The cluster centers are then updated to be the “centers” of all
the points assigned to them in that pass. This is done by
recalculating each cluster center as the average of the points in
its cluster.
• The algorithm repeats until the cluster centers change only
minimally from the last iteration.
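
The assign-and-update loop described above can be written in a few lines of NumPy. This is a teaching sketch under stated assumptions (random initial centers drawn from the data, a fixed tolerance, synthetic two-blob data), not the scikit-learn implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain K-means: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize the K centers with K distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # stop once the centers barely move
            break
    return centers, labels

# Two well-separated synthetic blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers.round(2))
```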

ETHNOTECH ACADEMY
Limitations of K-means
• If the clusters have complex geometric shapes, the
algorithm does a poor job of clustering the data.
• The algorithm does not allow data points distant from one
another to share the same cluster, regardless of whether
they belong in the cluster.
• K-means does not itself learn the number of clusters from
the data; that information must be pre-defined.
• When clusters overlap, K-means cannot determine how to
assign the data points where the overlap occurs.
ETHNOTECH ACADEMY
Implementation of K-means
algorithm

Importing Libraries

ETHNOTECH ACADEMY
Working with Dataset

ETHNOTECH ACADEMY
Visualize the data points

ETHNOTECH ACADEMY
Visualize the data points Contd…

ETHNOTECH ACADEMY
Find the K value using the Elbow
method

ETHNOTECH ACADEMY
Find the K value using the Elbow
method Contd…

• WCSS (within-cluster sum of squares) doesn’t reduce much
after k=5. So, we can choose 5 as the K value, i.e. the number
of clusters.
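
The elbow method can be sketched as follows. The data is a synthetic five-blob set (an assumption; the dataset used on the slides is not part of this text), and scikit-learn exposes the WCSS of a fitted model as its inertia_ attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with five groups, standing in for the original dataset.
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.8, random_state=42)

# Fit K-means for k = 1..10 and record the WCSS for each k.
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)

for k, value in enumerate(wcss, start=1):
    print(k, round(value, 1))
```

Plotting wcss against k and looking for the point where the curve flattens gives the "elbow"; with five generated groups the drop should level off around k=5.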
ETHNOTECH ACADEMY
Training the K-means algorithm on
the training dataset

Centroid points

• array([[88.2 , 17.11428571],
[55.2962963 , 49.51851852],
[86.53846154, 82.12820513],
[25.72727273, 79.36363636],
[26.30434783, 20.91304348]])
ETHNOTECH ACADEMY
Visualize the clusters formed

ETHNOTECH ACADEMY
Agglomerative Clustering

ETHNOTECH ACADEMY
Agglomerative Clustering Contd…
• In this clustering approach, we start from the cluster leaves and
then move upward until the cluster root is finally obtained.

• Initially, this approach assumes each data point in the
dataset is an independent cluster.

• In the beginning, each data point is considered a single-element
cluster (leaf).

ETHNOTECH ACADEMY
Agglomerative Clustering Contd…

• Since the two most similar clusters are combined at each
step, we obtain fewer clusters at each iteration than
at the previous iteration.

• This process continues until we obtain one big cluster (root)
that contains all the smaller clusters of similar elements.

• Once all clustering is completed, we visualize the data clusters
using a scatter plot.
ETHNOTECH ACADEMY
Agglomerative Clustering Contd…

ETHNOTECH ACADEMY
Working of Agglomerative Hierarchical
Clustering
• Step-1: Treat each data point as a single cluster. Let's say
there are N data points, so the number of clusters will also be
N.

• Step-2: Take the two closest data points or clusters and merge
them to form one cluster. So, there will now be N-1 clusters.

• Step-3: Again, take the two closest clusters and merge them
together to form one cluster. There will be N-2 clusters.

• Step-4: Repeat Step 3 until only one cluster is left.

• Step-5: Once all the clusters are combined into
one big cluster, develop the dendrogram to divide
the clusters as per the problem.

ETHNOTECH ACADEMY
Working of Dendrogram in Hierarchical
clustering
• The dendrogram is a tree-like structure that records each
merge step that the hierarchical clustering (HC) algorithm
performs. In the dendrogram plot, the y-axis shows the Euclidean
distances between the data points, and the x-axis shows all
the data points of the given dataset.
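
A dendrogram can be built with SciPy as sketched below. The six points are made up for illustration, and single linkage is chosen here (an assumption) so that the merge heights are plain Euclidean distances between the closest points of two clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Six illustrative 2-D points forming three obvious pairs.
X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
              [5.5, 5.0], [9.0, 1.0], [9.0, 1.6]])

# Each row of Z records one merge: the two clusters joined,
# the distance at which they were joined, and the new cluster size.
Z = linkage(X, method="single", metric="euclidean")
print(Z.round(2))

# Build the tree layout without drawing it (no_plot=True);
# drop no_plot to draw the plot when matplotlib is available.
tree = dendrogram(Z, no_plot=True)

# "Cut" the dendrogram so that at most 3 clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

The linkage matrix has N-1 rows for N points, one per merge, which is exactly the step-by-step memory the dendrogram displays.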

ETHNOTECH ACADEMY
Agglomerative Clustering Contd…
In this flowchart, we assume a dataset with N elements where N = 6.
Below are the steps involved in the clustering above:
• Step 1: Initially, assume each data point is an independent cluster,
i.e. 6 clusters.
• Step 2: Merge the two closest data points into a single cluster.
This leaves 5 clusters.
• Step 3: Again, merge the two closest clusters into a single cluster.
This leaves 4 clusters.
• Step 4: Repeat step 3 until a single cluster of all data
points is obtained.
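
The steps above can be traced with a deliberately naive implementation. This sketch uses made-up points and single linkage (distance between the closest members of two clusters) as an assumption; it is meant to show the cluster count shrinking from 6 to 1, not to be efficient.

```python
import numpy as np

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the counts."""
    clusters = [[i] for i in range(len(points))]  # Step 1: one cluster per point
    history = [len(clusters)]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # Steps 2-4: merge the pair
        del clusters[b]
        history.append(len(clusters))
    return history

points = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
                   [5.5, 5.0], [9.0, 1.0], [9.0, 1.6]])
history = agglomerate(points)
print(history)  # [6, 5, 4, 3, 2, 1]
```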
ETHNOTECH ACADEMY
Agglomerative Clustering Contd…

• If we visualize the dendrogram, we should obtain a tree-like
structure with the root at the top like the one shown below:

ETHNOTECH ACADEMY
Implementation of Agglomerative
clustering
• Agglomerative Clustering: Agglomerative Clustering is one
of the most common hierarchical clustering techniques.
Dataset – Credit Card Dataset.

• Assumption: Each data point starts as its own cluster, and
the points are assumed to be similar enough that they can be
merged, step by step, into a single cluster.
ETHNOTECH ACADEMY
Step 1: Importing the required
libraries

ETHNOTECH ACADEMY
Step 2: Loading and Cleaning the
data

ETHNOTECH ACADEMY
Step 3: Preprocessing the data

ETHNOTECH ACADEMY
Step 4: Reducing the dimensionality
of the Data

• Dendrograms are used to divide a given cluster into many
different clusters.

ETHNOTECH ACADEMY
Step 5: Visualizing the working of
the Dendrograms

ETHNOTECH ACADEMY
Step 5 Contd…

ETHNOTECH ACADEMY
Step 6

Building and visualizing the different clustering models for
different values of k

a) k = 2
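
Steps 3 to 6 can be sketched together as below. The credit card dataset is not included in this text, so the example uses synthetic 8-feature data as a stand-in, and the choice of StandardScaler followed by a 2-component PCA is an assumption about the preprocessing.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset: 150 rows, 8 numeric features.
X, _ = make_blobs(n_samples=150, n_features=8, centers=3, random_state=0)

# Preprocess, then reduce to 2 dimensions so the clusters can be plotted.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Build one agglomerative model per value of k.
labels_by_k = {}
for k in (2, 3, 4):
    model = AgglomerativeClustering(n_clusters=k)
    labels_by_k[k] = model.fit_predict(X_2d)
    print(k, "clusters ->", sorted(set(labels_by_k[k].tolist())))
```

Each label array can then be passed as the color argument of a scatter plot of X_2d to compare the groupings for different k.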

ETHNOTECH ACADEMY
Step 6 Contd…

ETHNOTECH ACADEMY
K-Nearest Neighbors (KNN)

• Despite appearing in the clustering session, this algorithm is
used to solve classification problems.
• The K-nearest neighbor or K-NN algorithm basically creates an
imaginary boundary to classify the data.
• When new data points come in, the algorithm assigns each of
them to the class on its side of the boundary, i.e. the class
of its nearest neighbors.
• Note: It’s very important to have the right k-value when
analyzing the dataset to avoid overfitting and underfitting of
the dataset.
ETHNOTECH ACADEMY
K-Nearest Neighbors (KNN) Contd…

ETHNOTECH ACADEMY
Consider, we have a new data point and we need to
put it in the required category

ETHNOTECH ACADEMY
Firstly, we will choose the number of neighbors; here we
choose k = 5.

Next, we will calculate the Euclidean distance between the
data points. The Euclidean distance is the distance between
two points, which we have already studied in geometry. For two
points (x1, y1) and (x2, y2) it can be calculated as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

ETHNOTECH ACADEMY
By calculating the Euclidean distance we get the nearest neighbors:
three nearest neighbors in category A and two nearest neighbors
in category B.

As we can see, 3 of the 5 nearest neighbors are from category A; hence
this new data point is assigned to category A.
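
The worked example above (k = 5, three neighbors in A, two in B) can be reproduced with a few lines of standard-library Python. The coordinates are made up so that the neighbor counts match the description.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    """Return (predicted label, labels of the k nearest neighbors)."""
    # Euclidean distance from the new point to every training point.
    distances = sorted(
        (math.dist(p, new_point), label)
        for p, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbors.
    prediction = Counter(nearest_labels).most_common(1)[0][0]
    return prediction, nearest_labels

# Illustrative training data: four points per category.
train_points = [(1, 1), (2, 1), (1, 2), (-3, -3),   # category A
                (3, 3), (3, 2), (6, 6), (7, 7)]     # category B
train_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

prediction, nearest = knn_predict(train_points, train_labels, (2, 2), k=5)
print(Counter(nearest))  # Counter({'A': 3, 'B': 2})
print(prediction)        # A
```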

ETHNOTECH ACADEMY
Steps involved in KNN algorithm:

1. Import the k-nearest neighbor algorithm from the
scikit-learn package.
2. Create feature and target variables.
3. Split data into training and test data.
4. Generate a k-NN model using a neighbors value.
5. Train or fit the data into the model.
6. Predict the labels of new data.
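
The six steps map directly onto scikit-learn calls. The sketch below uses the Iris dataset bundled with scikit-learn; the 50/50 split, k = 5, and the random seed are illustrative choices.

```python
# 1. Import the k-nearest neighbor algorithm from scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 2. Create feature and target variables.
X, y = load_iris(return_X_y=True)

# 3. Split data into training and test data (75 rows each).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# 4. Generate a k-NN model using a neighbors value.
knn = KNeighborsClassifier(n_neighbors=5)

# 5. Train or fit the data into the model.
knn.fit(X_train, y_train)

# 6. Predict the labels of unseen data and measure the accuracy.
y_pred = knn.predict(X_test)
accuracy = knn.score(X_test, y_test)
print("test accuracy:", round(accuracy, 3))
```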

ETHNOTECH ACADEMY
Implementation of KNN algorithm using IRIS
Dataset

ETHNOTECH ACADEMY
Steps involved in KNN algorithm
Contd…

ETHNOTECH ACADEMY
Steps involved in KNN algorithm Contd…

ETHNOTECH ACADEMY
Implementation using Iris Dataset
in Python
• K nearest neighbor (KNN) is a simple and efficient method
for classification problems. KNN is a classification
algorithm based on statistical learning that has been
studied in pattern recognition, data science, and machine
learning [1], [2]. The technique assigns an unseen point to
the dominant class among its k nearest neighbors within
the training set [3].

ETHNOTECH ACADEMY
Implementation using Iris Dataset
in Python Contd…
• The training data is 50% of the Iris dataset (75 rows),
and the testing data is the other 50% (75 rows). The
dataset has four measurements that are used for KNN
training: sepal length, sepal width, petal length, and petal
width. The species (class) attribute is the prediction
target, with each row classed as Iris-setosa,
Iris-versicolor, or Iris-virginica.
ETHNOTECH ACADEMY
Implementation using Iris Dataset
in Python Contd…

Import libraries:

ETHNOTECH ACADEMY
Implementation using Iris Dataset
in Python Contd…
• Start a timer to measure the computation time:

• Loading Dataset:

ETHNOTECH ACADEMY
Implementation using Iris Dataset
in Python Contd…
• Make a KNN Class

• Build a Function inside the KNN Class:

• Function Initialization
Parameter Description:
k(int): The nearest k instances

ETHNOTECH ACADEMY
Implementation using Iris Dataset
in Python Contd…
• Function for Load Training Data
Parameter Description:
TrainingPath(string): File path of the training dataset
ColoumnName(string): Column name of the given dataset

ETHNOTECH ACADEMY
Implementation using Iris Dataset
in Python Contd…
• Function for Getting Testing Data
Parameter Description:
TestingPath(string): File path of the testing dataset
ColoumnName(string): Column name of the given dataset

ETHNOTECH ACADEMY
Implementation using Iris Dataset
in Python Contd…
• Function for predicting the label of each testing point
Parameter Description:
TestPoint (<numpy.ndarray>): Features data frame of
testing data

ETHNOTECH ACADEMY
Implementation using Iris Dataset
in Python Contd…
• Graphic of Training & Testing Accuracy with k = 1 to 7

ETHNOTECH ACADEMY
RESULT AND DISCUSSION
Explanation of Training and Testing Result

Figure 1. Graph of Training and Testing Accuracy using K
Nearest Neighbors (KNN)
ETHNOTECH ACADEMY
Project

• Implementation of clustering on a real-world scenario: segment
the clients of a wholesale distributor based on their
annual spending on diverse product categories, like milk,
grocery, etc.

ETHNOTECH ACADEMY
SUMMARY
• Fundamentals of Clustering and Cluster Analysis
• Application of Clustering
• Implementation of K-means Clustering
• Implementation of Agglomerative Clustering
• Implementation of KNN algorithm on IRIS Dataset

ETHNOTECH ACADEMY
Session 8

Text Analytics
• Text Analytics: Introduction

• Text Analytics: NLTK Installation

• Text Analytics: Tokenization TextBlob

• Named-entity recognition (NER)-Stemming-Lemmatization


• Word Cloud

ETHNOTECH ACADEMY
Text Analytics

ETHNOTECH ACADEMY
Text Analytics Contd…

• Text analysis is the automated process of extracting and
classifying text data using machine learning and natural
language processing.
• Analyzing texts by hand is time-consuming, tedious,
and ineffective, especially if you deal with large amounts of
data every day.
• There are different text analysis techniques you can run on
your data, such as sentiment analysis, topic
classification, urgency detection, and intent categorization.

ETHNOTECH ACADEMY
Text Analytics Contd…
• Text communication is one of the most popular forms of
day-to-day conversation. We chat, message, tweet, share
status, email, write blogs, and share opinions and feedback in
our daily routine.
• All these activities generate a large amount of text,
which is unstructured in nature. In the area of online
marketplaces and social media, it is extremely important to
analyze large quantities of data to understand people’s
opinions.
• NLP enables the computer to interact with humans in a
natural manner.
ETHNOTECH ACADEMY
Text Analytics Contd…

• It helps the computer to understand the human language
and derive meaning from it.
• NLP is applicable to several problems, from speech
recognition, language translation, and classifying documents to
information extraction.
• Analyzing movie reviews is one of the classic examples used to
demonstrate a simple NLP Bag-of-words model.

ETHNOTECH ACADEMY
NLTK
• Natural Language Toolkit (NLTK) library contains various
utilities that allow you to effectively manipulate and analyze
linguistic data. Among its advanced features are text
classifiers that you can use for many kinds of classification,
including sentiment analysis.
• Sentiment analysis is the practice of using algorithms to
classify various samples of related text into overall positive
and negative categories. With NLTK, you can employ these
algorithms through powerful built-in machine learning
operations to obtain insights from linguistic data.
ETHNOTECH ACADEMY
Installing NLTK Data
• NLTK comes with many corpora, toy grammars, trained
models, etc. A complete list is posted
at: https://www.nltk.org/nltk_data/

Step 1: Browse to the official Python website.

Step 2: Move the cursor to the Download button and then click
on the latest Python version.

Step 3: Open the downloaded file. Click on the checkbox and
then click on Customize installation.

Step 4: Click on Next.

Step 5: Click on Install.

Step 6: Wait until the installation finishes.

Step 7: Click on Close.

Step 8: Open Command Prompt and execute the following
commands:
• python --version
• pip --version
• pip install nltk

The NLTK installation will then start.

Step 9: Once the “Successfully installed” message appears,
the NLTK installation is complete.

ETHNOTECH ACADEMY
Tokenization

• Tokenization is the process of breaking up a given text into
units called tokens. Tokens can be individual words, phrases
or even whole sentences.
• In the process of tokenization, some characters like
punctuation marks may be discarded. The tokens usually
become the input for processes like parsing and text
mining.
• A tokenizer breaks unstructured data and natural language
text into chunks of information that can be considered as
discrete elements.
ETHNOTECH ACADEMY
Tokenization Contd…

• The token occurrences in a document can be used directly
as a vector representing that document.

• Tokenization can separate sentences, words, characters, or
subwords. When we split the text into sentences, we call it
sentence tokenization. For words, we call it word
tokenization.
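
To make these ideas concrete, here is a small standard-library sketch of sentence tokenization, word tokenization, and a token-count vector. The regular expressions are deliberately naive assumptions for illustration; the library tokenizers in the following slides handle many more cases.

```python
import re
from collections import Counter

text = "Tokenization splits text into tokens. Tokens can be words or sentences!"

# Sentence tokenization: split after '.', '!' or '?' followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: keep runs of word characters, discarding punctuation.
words = re.findall(r"\w+", text.lower())

# Token occurrences as a vector over the sorted vocabulary.
counts = Counter(words)
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(sentences)
print(vocabulary)
print(vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 2, 1]
```

The word "tokens" appears twice, so its entry in the vector is 2; every other vocabulary entry appears once.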

ETHNOTECH ACADEMY
Example of sentence tokenization

• Example of word tokenization

ETHNOTECH ACADEMY
NLTK Word Tokenize

• NLTK (Natural Language Toolkit) is an open-source Python
library for Natural Language Processing. It has easy-to-use
interfaces for over 50 corpora and lexical resources such as
WordNet, along with a set of text processing libraries for
classification, tokenization, stemming, and tagging.

• You can easily tokenize the sentences and words of the text
with the tokenize module of NLTK.
ETHNOTECH ACADEMY
NLTK Word Tokenize Contd…

• First, we’re going to import the relevant functions from the
NLTK library:

ETHNOTECH ACADEMY
Word and Sentence tokenizer

• N.B.: sent_tokenize uses the pre-trained model from
tokenizers/punkt/english.pickle.

ETHNOTECH ACADEMY
Punctuation-based tokenizer

• This tokenizer splits the sentences into words based on
whitespace and punctuation.

• Notice the difference: word_tokenize treats “Amal.M” as a
single word, whereas wordpunct_tokenize splits it.
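
A short sketch of the punctuation-based tokenizer follows. The sample sentence is made up around the “Amal.M” example; wordpunct_tokenize is rule based and needs no downloaded NLTK data.

```python
from nltk.tokenize import wordpunct_tokenize

# Every run of punctuation becomes its own token, so "Amal.M" is split.
tokens = wordpunct_tokenize("Amal.M wrote this.")
print(tokens)  # ['Amal', '.', 'M', 'wrote', 'this', '.']
```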
ETHNOTECH ACADEMY
Treebank Word tokenizer
• This tokenizer incorporates a variety of common rules for
English word tokenization. It separates phrase-terminating
punctuation like (?!.;,) from adjacent tokens and retains
decimal numbers as a single token. Besides, it contains
rules for English contractions.
• For example, “don’t” is tokenized as [“do”, “n’t”]. The full set
of rules is described in the Treebank tokenizer documentation.
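
The contraction and decimal-number rules can be checked with a short example. The sentence is made up for illustration; TreebankWordTokenizer is rule based and needs no downloaded NLTK data.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("They don't cost 3.50 dollars.")

# The contraction is split into "do" + "n't", while the
# decimal number 3.50 stays together as one token.
print(tokens)
```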

ETHNOTECH ACADEMY
Tweet tokenizer

• When we want to apply tokenization to text data like tweets,
the tokenizers mentioned above can’t produce practical
tokens. To address this issue, NLTK has a rule-based tokenizer
specifically for tweets. It can keep emojis as separate tokens if
we need them for tasks like sentiment analysis.
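
A brief sketch of the tweet tokenizer is shown below. The tweet text and the handle are invented, and the two options used (removing @handles and shortening long character runs) are optional settings chosen for illustration.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles drops @mentions; reduce_len shortens runs of 3+ repeated letters.
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokens = tokenizer.tokenize("@ethno Loving #NLP sooooo much :-)")

# Hashtags and emoticons are kept as single tokens.
print(tokens)
```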

ETHNOTECH ACADEMY
MWE tokenizer

• NLTK’s multi-word expression tokenizer (MWETokenizer)
provides a function add_mwe() that allows the user to register
multi-word expressions before using the tokenizer on the
text. More simply, it can merge multi-word expressions into
single tokens.
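
The merging behaviour can be sketched as follows. The two expressions and the sentence are illustrative; note that MWETokenizer works on text that has already been split into words.

```python
from nltk.tokenize import MWETokenizer

# Register one expression at construction time and one with add_mwe().
tokenizer = MWETokenizer([("New", "York")], separator="_")
tokenizer.add_mwe(("machine", "learning"))

# The input must already be a list of word tokens.
words = "I study machine learning in New York".split()
tokens = tokenizer.tokenize(words)
print(tokens)  # ['I', 'study', 'machine_learning', 'in', 'New_York']
```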

ETHNOTECH ACADEMY
TextBlob Word Tokenize

• TextBlob is a Python library for processing textual data. It
provides a consistent API for diving into common natural
language processing (NLP) tasks such as part-of-speech
tagging, noun phrase extraction, sentiment analysis,
classification, translation, and more.
• Let’s start by installing TextBlob and the NLTK corpora:

$ pip install -U textblob
$ python3 -m textblob.download_corpora

ETHNOTECH ACADEMY
TextBlob Word Tokenize Contd…
• In the code below, we perform word tokenization using
TextBlob library:

• Notice that the TextBlob tokenizer removes
punctuation. In addition, it has rules for English
contractions.
ETHNOTECH ACADEMY
Named entity Recognition

• Named Entity Recognition is the task of recognising proper
names and words from a special class in a document, such
as product names, locations, people, or diseases.
• Named entity recognition (NER) is often the first step
towards information extraction. It seeks to locate and
classify named entities in text into pre-defined categories
such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values,
percentages, etc.

ETHNOTECH ACADEMY
Named entity Recognition Contd…

• NER is used in many fields in Natural Language
Processing (NLP), and it can help answer many real-world
questions, such as:

• Which companies were mentioned in the news article?
• Were specified products mentioned in complaints or
reviews?
• Does the tweet contain the name of a person? Does the
tweet contain this person’s location?

ETHNOTECH ACADEMY
Stemming and Lemmatization

• While working with language data, we need to acknowledge
the fact that words like ‘care’ and ‘caring’ share the same
core meaning but are used in different grammatical forms.
• Here we make use of Stemming and Lemmatization to
reduce the word to its base form.
• Stemming is used to preprocess text data. The English
language has many variations of a single word, so to reduce
the ambiguity for a machine-learning algorithm it is
essential to filter such words and reduce them to the base
form.
ETHNOTECH ACADEMY
Stemming and Lemmatization
Contd…

• NLTK provides classes to perform stemming on words.

• The most widely used stemming algorithms
are PorterStemmer, SnowballStemmer, etc.

• The PorterStemmer class has a .stem method which takes a
word as an input argument and returns the word reduced
to its root form.
ETHNOTECH ACADEMY
Creating a Stemmer with
PorterStemmer

ETHNOTECH ACADEMY
Creating a Stemmer with
PorterStemmer Contd…

ETHNOTECH ACADEMY
Creating a Stemmer with Snowball
Stemmer

• It is also known as the Porter2 stemming algorithm as it tends to fix a few shortcomings in Porter
Stemmer. Let’s see how to use it.

ETHNOTECH ACADEMY
Creating a Stemmer with Snowball
Stemmer Contd…

• The outputs from both stemmers look similar because
we have used a limited text corpus for the demonstration.
Feel free to experiment with different words and compare
the outputs of the two.
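
A side-by-side comparison of the two stemmers can be sketched as follows. The word list is an illustrative choice, not the list used on the slides; both stemmers are rule based and need no downloaded NLTK data.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Illustrative words; try others to see where the two stemmers differ.
words = ["care", "caring", "cares", "running", "fairly"]
stems = {word: (porter.stem(word), snowball.stem(word)) for word in words}

for word, (porter_stem, snowball_stem) in stems.items():
    print(f"{word:10} porter={porter_stem:10} snowball={snowball_stem}")
```

Printing both stems per word makes it easy to spot the cases where Snowball's revised rules give a different result from the original Porter rules.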
ETHNOTECH ACADEMY
Lemmatization
• Lemmatization is the algorithmic process of finding the
lemma of a word. Unlike stemming, which may
result in incorrect word reduction, lemmatization
reduces a word according to its meaning.

• At first, Stemming and Lemmatization may look the same,
but they are actually very different. In the next section we will
see the difference between them.

• Now let’s see how to perform Lemmatization on text data.

ETHNOTECH ACADEMY
Creating a Lemmatizer with Python
spaCy
Note: python -m spacy download en_core_web_sm
The above line must be run in order to download the required
file to perform lemmatization.

ETHNOTECH ACADEMY
Output

• The above code returns a spaCy Doc object, an iterable of
tokens that carry the lemmatized form of the input words. We can
access the lemmatized word using the .lemma_ attribute. spaCy
automatically tokenizes the sentence for us.
ETHNOTECH ACADEMY
Creating a Lemmatizer with Python
NLTK

• It helps in returning the base or dictionary form of a word,
which is known as the lemma.

• NLTK uses WordNet. The NLTK lemmatization method is
based on WordNet’s built-in morph function.

ETHNOTECH ACADEMY
Creating a Lemmatizer with Python
NLTK Contd…

ETHNOTECH ACADEMY
Output

• ['Apples', 'and', 'oranges', 'are', 'similar', '.', 'Boots', 'and',
'hippos', 'are', "n't", '.']

Apples and orange are similar . Boots and hippo are n't .

ETHNOTECH ACADEMY
WordCloud

• Word Cloud is a data visualization technique used for
representing text data in which the size of each word indicates
its frequency or importance.
• Significant textual data points can be highlighted using a word
cloud.
• Word clouds are widely used for analyzing data from social
network websites.
• For generating a word cloud in Python, the modules needed are
matplotlib, pandas and wordcloud.
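
The quantity a word cloud draws is simply the frequency of each word. The standard-library sketch below computes those frequencies and a proportional font size; the sample text, the stop-word list, and the 10 to 40 point size range are illustrative assumptions, and the actual drawing would be left to the wordcloud package.

```python
import re
from collections import Counter

text = ("data science uses data to find patterns in data "
        "and text analytics finds patterns in text")

# Remove very common words so they do not dominate the cloud.
stopwords = {"to", "in", "and", "the", "a"}
words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]

# Word frequency determines the word's size in the cloud.
frequencies = Counter(words)
max_count = max(frequencies.values())
font_sizes = {w: 10 + 30 * c / max_count for w, c in frequencies.items()}

print(frequencies.most_common(3))  # [('data', 3), ('patterns', 2), ('text', 2)]
print(font_sizes["data"])          # 40.0
```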

ETHNOTECH ACADEMY
Q&A

• Which one is a classification algorithm?
a. Linear regression
b. Logistic regression
c. Agglomerative clustering
d. None of the above
• Answer: b. Logistic regression

ETHNOTECH ACADEMY
Q&A

• Which of the following is an assembly of objects based on similarity
and dissimilarity between them?
a. Regression
b. Association
c. Clustering
d. Classification
• Answer: c. Clustering

ETHNOTECH ACADEMY
Q&A

• Which pair of algorithms is similar in operation?
a. SVM and KNN
b. SVM and Naïve bayes
c. Both A & B
d. Decision tree and Random forest
• Answer: d. Decision tree and Random forest

ETHNOTECH ACADEMY
Q&A

• Which of the following is not an application of clustering?
a. Data Analysis
b. Pattern recognition
c. Image processing
d. Data classification
• Answer: d. Data classification

ETHNOTECH ACADEMY
Q&A

• Which of the following statements is incorrect about regression?
a. It may be used for interpretation
b. It is used for prediction
c. It relates inputs to outputs
d. It discovers causal relationships
• Answer: d. It discovers causal relationships

ETHNOTECH ACADEMY
Q&A

• Designing a machine learning approach involves _____
a. Choosing the type of training experience
b. Choosing a representation for the target function
c. Choosing the target function to be learned
d. All of the above
• Answer: d. All of the above

ETHNOTECH ACADEMY
Q&A

• Fraud Detection, Image Classification, Diagnostics, and Customer
Retention are applications of which of the following?
a. Regression
b. Clustering
c. Classification
d. All of these
• Answer: c. Classification

ETHNOTECH ACADEMY
Q&A

• Which of the following is required by K-means clustering?
a. defined distance metric
b. number of clusters
c. initial guess as to cluster centroids
d. All of these
• Answer: d. All of these

ETHNOTECH ACADEMY
Q&A

• Which of the following clustering methods uses a merging approach?
a. Partitional
b. Hierarchical
c. Naïve bayes
d. K-means
• Answer: b. Hierarchical

ETHNOTECH ACADEMY
Q&A

• ______ is the task of dividing the population or data points into a
number of groups.
a. Classification
b. Association
c. Clustering
d. Regression
• Answer: c. Clustering

ETHNOTECH ACADEMY
Q&A

• Agglomerative clustering has a _________ approach
a. Top down
b. Bottom up
c. Vertical
d. Linear
• Answer: b. Bottom up

ETHNOTECH ACADEMY
Q&A

• State whether the following statement is true or false:
“Divisive clustering follows a top-down approach”
a. True
b. False
• Answer: a. True

ETHNOTECH ACADEMY
Q&A

• The ______ algorithm basically creates an imaginary boundary to classify
the data
a. K-NN
b. K-means
c. SVM
d. Decision trees
• Answer: a. K-NN

ETHNOTECH ACADEMY
Q&A

• Which of the following is not an application of Text analytics?
a. Sentiment analysis
b. Topic classification
c. Intent categorization
d. Image segmentation
• Answer: d. Image segmentation

ETHNOTECH ACADEMY
Q&A

• State whether the following statement is true or false: “In the process of
tokenization, some characters like punctuation marks may be
discarded”
a. True
b. False
• Answer: a. True

ETHNOTECH ACADEMY
Q&A

• ________ is a data visualization technique used for representing
text data in which the size of each word indicates its frequency or
importance.
a. Word Cloud
b. Stemming
c. Lemmatization
d. Tokenization
• Answer: a. Word Cloud

ETHNOTECH ACADEMY
Q&A

• _____ is the automated process of extracting and classifying text
data using machine learning and natural language processing
a. Text analysis
b. Node analysis
c. Data analysis
d. Stream analysis
• Answer: a. Text analysis

ETHNOTECH ACADEMY
Q&A

• State whether the following statement is true or false: “Word clouds are
widely used for analyzing data from social network websites”
a. True
b. False
• Answer: a. True

ETHNOTECH ACADEMY
SUMMARY
• Fundamentals of Text Analytics
• Usage of Tokenization, Stemming and Lemmatization
• Significance of Wordcloud and its applications in real world

ETHNOTECH ACADEMY